
Exploratory Data Analysis
Turn an unknown scientific or tabular file into a structured EDA report with quality checks before modeling or pipeline work.
Overview
Exploratory Data Analysis is an agent skill that produces a structured EDA report, with quality and statistical sections, for a target file. It is most often used in the Validate phase and also fits Build (backend) and Grow (analytics).
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill exploratory-data-analysis
What is this skill?
- Full report skeleton: executive summary, basic file metadata, and format-specific library hints.
- Dedicated sections for data structure, dimensions, dtypes, and completeness coverage.
- Quality assessment blocks for range, format compliance, consistency, and corruption checks.
- Statistical summary slots for numerical, categorical, and distribution views.
- Temporal and domain-specific characteristics sections for time-series or scientific files.
- Report template with executive summary plus six major analysis sections (basic info through statistical summary)
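To make the structure and completeness sections concrete, here is a minimal sketch of how dimensions, column types, and missing-value counts could be computed for a CSV file using only the Python standard library. It is an illustration, not the skill's actual implementation: the `profile_csv` and `infer_type` helpers, the `MISSING` marker set, and the sample data are all assumptions made for this example.

```python
import csv
import io

# Assumed set of strings treated as missing values (not defined by the skill).
MISSING = {"", "na", "n/a", "nan", "null", "none"}

def infer_type(values):
    """Guess a column type from its non-missing string values."""
    if not values:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"

def profile_csv(handle):
    """Return dimensions plus per-column type and missing-value counts."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    present = {c: [] for c in columns}
    missing = {c: 0 for c in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for c in columns:
            # Short rows yield None for absent fields; treat them as empty.
            value = (row[c] or "").strip()
            if value.lower() in MISSING:
                missing[c] += 1
            else:
                present[c].append(value)
    return {
        "rows": n_rows,
        "columns": len(columns),
        "dtypes": {c: infer_type(present[c]) for c in columns},
        "missing": missing,
    }

# Hypothetical three-row lab export used only for demonstration.
sample = io.StringIO("id,temp_c,site\n1,20.5,A\n2,,B\n3,19.0,NA\n")
report = profile_csv(sample)
print(report)
```

The resulting dictionary maps loosely onto the Dimensions, Data Types, and Missing Values slots of the report; a real run would more likely rely on a format-specific library.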
Adoption & trust: 603 installs on skills.sh; 27.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You inherited a scientific or messy data file and need a repeatable quality and structure narrative before modeling or productizing metrics.
Who is it for?
Indie ML or data-product builders who want agent-driven dataset triage with consistent report headings.
Skip if: You need production monitoring dashboards or automated drift alerting rather than a human-readable analysis pass.
When should I use this skill?
A new dataset file needs structured exploration, quality assessment, and library guidance before modeling or pipeline implementation.
What do I get? / Deliverables
You receive a filled EDA report with metadata, quality assessment, and statistical placeholders ready for downstream modeling or pipeline specs.
- Markdown EDA report with filename, quality, and statistical sections populated
- Recommendations for downstream analysis based on format and integrity checks
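As a rough illustration of how a report section gets populated, the sketch below fills a few `{PLACEHOLDER}` tokens from file-system metadata. The template excerpt mirrors the Basic Information section of the skill's report template; the `basic_info`, `human_size`, and `fill` helpers, the binary size units, and the temporary `export.csv` file are assumptions for this example and do not come from the skill itself.

```python
import os
import re
import tempfile
from datetime import datetime, timezone

# Excerpt of the "Basic Information" section of the report template.
TEMPLATE = """## Basic Information
- **Filename:** `{FILENAME}`
- **File Size:** {FILE_SIZE_HUMAN} ({FILE_SIZE_BYTES} bytes)
- **Last Modified:** {MODIFIED_DATE}
- **Extension:** `.{EXTENSION}`
- **Format Category:** {CATEGORY}
"""

def human_size(n_bytes):
    """Format a byte count with binary units (an assumed convention)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024

def basic_info(path):
    """Collect placeholder values from the file system."""
    stat = os.stat(path)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "FILENAME": os.path.basename(path),
        "FILE_SIZE_HUMAN": human_size(stat.st_size),
        "FILE_SIZE_BYTES": str(stat.st_size),
        "MODIFIED_DATE": modified.isoformat(timespec="seconds"),
        "EXTENSION": os.path.splitext(path)[1].lstrip("."),
    }

def fill(template, values):
    """Replace known {PLACEHOLDER} tokens; leave unknown ones untouched."""
    return re.sub(
        r"\{([A-Z0-9_]+)\}",
        lambda m: values.get(m.group(1), m.group(0)),
        template,
    )

# Write a tiny hypothetical file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("id,value\n1,2\n")
    filled = fill(TEMPLATE, basic_info(path))

print(filled)
```

Tokens without a supplied value, such as `{CATEGORY}` here, stay visible in the output, which makes unfilled sections easy to spot during review.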
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
EDA is the canonical first shelf in Validate when you are prototyping with real datasets; the same report supports Build data work. The Prototype subphase is where you inspect formats, missingness, and distributions before committing to features or models.
Where it fits
Profile a new lab export CSV before choosing features for a proof-of-concept model.
Document dtypes and missingness to draft API schema and ingestion validations.
Summarize distributions for a metrics deck after shipping an analytics slice.
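For the distribution summaries mentioned in these scenarios, a numerical-variable slot might be filled along the following lines. This is a sketch under stated assumptions: the `summarize` helper, the sample readings, and the 1.5 × IQR outlier rule (Tukey's fences, one common convention) are illustrative choices, not something the skill is documented to use.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }

def iqr_outliers(values, summary, k=1.5):
    """Flag values outside [q1 - k*iqr, q3 + k*iqr] (assumed rule)."""
    low = summary["q1"] - k * summary["iqr"]
    high = summary["q3"] + k * summary["iqr"]
    return [v for v in values if v < low or v > high]

# Made-up sample; 40.0 is a deliberate outlier.
readings = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0]
summary = summarize(readings)
outliers = iqr_outliers(readings, summary)

print(summary)
print("outliers:", outliers)
```

Reporting the median and IQR alongside the mean and standard deviation helps when a single extreme value, like the one in this sample, skews the mean.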
How it compares
A structured EDA report template, not a one-click AutoML tool or a hosted notebook replacement.
Common Questions / FAQ
Who is exploratory-data-analysis for?
Solo builders and small teams doing scientific or tabular data work who need their agent to document dataset shape and quality before coding pipelines.
When should I use exploratory-data-analysis?
In Validate while prototyping on new files; in Build when defining backend schemas; in Grow when summarizing analytics datasets for stakeholders.
Is exploratory-data-analysis safe to install?
Check the Security Audits panel on this Prism page. The skill reads the local files you point it at, so avoid paths containing secrets or PII that you do not want in reports.
SKILL.md - Exploratory Data Analysis
# Exploratory Data Analysis Report: {FILENAME}

**Generated:** {TIMESTAMP}

---

## Executive Summary

This report provides a comprehensive exploratory data analysis of the file `{FILENAME}`. The analysis includes file type identification, format-specific metadata extraction, data quality assessment, and recommendations for downstream analysis.

---

## Basic Information

- **Filename:** `{FILENAME}`
- **Full Path:** `{FILEPATH}`
- **File Size:** {FILE_SIZE_HUMAN} ({FILE_SIZE_BYTES} bytes)
- **Last Modified:** {MODIFIED_DATE}
- **Extension:** `.{EXTENSION}`
- **Format Category:** {CATEGORY}

---

## File Type Details

### Format Description
{FORMAT_DESCRIPTION}

### Typical Data Content
{TYPICAL_DATA}

### Common Use Cases
{USE_CASES}

### Python Libraries for Reading
{PYTHON_LIBRARIES}

---

## Data Structure Analysis

### Overview
{DATA_STRUCTURE_OVERVIEW}

### Dimensions
{DIMENSIONS}

### Data Types
{DATA_TYPES}

---

## Quality Assessment

### Completeness
- **Missing Values:** {MISSING_VALUES}
- **Data Coverage:** {COVERAGE}

### Validity
- **Range Check:** {RANGE_CHECK}
- **Format Compliance:** {FORMAT_COMPLIANCE}
- **Consistency:** {CONSISTENCY}

### Integrity
- **Checksum/Validation:** {VALIDATION}
- **File Corruption Check:** {CORRUPTION_CHECK}

---

## Statistical Summary

### Numerical Variables
{NUMERICAL_STATS}

### Categorical Variables
{CATEGORICAL_STATS}

### Distributions
{DISTRIBUTIONS}

---

## Data Characteristics

### Temporal Properties (if applicable)
- **Time Range:** {TIME_RANGE}
- **Sampling Rate:** {SAMPLING_RATE}
- **Missing Time Points:** {MISSING_TIMEPOINTS}

### Spatial Properties (if applicable)
- **Dimensions:** {SPATIAL_DIMENSIONS}
- **Resolution:** {SPATIAL_RESOLUTION}
- **Coordinate System:** {COORDINATE_SYSTEM}

### Experimental Metadata (if applicable)
- **Instrument:** {INSTRUMENT}
- **Method:** {METHOD}
- **Sample Info:** {SAMPLE_INFO}

---

## Key Findings

1. **Data Volume:** {DATA_VOLUME_FINDING}
2. **Data Quality:** {DATA_QUALITY_FINDING}
3. **Notable Patterns:** {PATTERNS_FINDING}
4. **Potential Issues:** {ISSUES_FINDING}

---

## Visualizations

### Distribution Plots
{DISTRIBUTION_PLOTS}

### Correlation Analysis
{CORRELATION_PLOTS}

### Time Series (if applicable)
{TIMESERIES_PLOTS}

---

## Recommendations for Further Analysis

### Immediate Actions
1. {RECOMMENDATION_1}
2. {RECOMMENDATION_2}
3. {RECOMMENDATION_3}

### Preprocessing Steps
- {PREPROCESSING_1}
- {PREPROCESSING_2}
- {PREPROCESSING_3}

### Analytical Approaches
{ANALYTICAL_APPROACHES}

### Tools and Methods
- **Recommended Software:** {RECOMMENDED_SOFTWARE}
- **Statistical Methods:** {STATISTICAL_METHODS}
- **Visualization Tools:** {VIZ_TOOLS}

---

## Data Processing Workflow

```
{WORKFLOW_DIAGRAM}
```

---

## Potential Challenges

1. **Challenge:** {CHALLENGE_1}
   - **Mitigation:** {MITIGATION_1}
2. **Challenge:** {CHALLENGE_2}
   - **Mitigation:** {MITIGATION_2}

---

## References and Resources

### Format Specification
- {FORMAT_SPEC_LINK}

### Python Libraries Documentation
- {LIBRARY_DOCS}

### Related Analysis Examples
- {EXAMPLE_LINKS}

---

## Appendix

### Complete File Metadata
```json
{COMPLETE_METADATA}
```

### Analysis Parameters
```json
{ANALYSIS_PARAMETERS}
```

### Software Versions
- Python: {PYTHON_VERSION}
- Key Libraries: {LIBRARY_VERSIONS}

---

*This report was automatically generated by the exploratory-data-analysis skill.*
*For questions or issues, refer to the skill documentation.*

# Bioinformatics and Genomics File Formats Reference

This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications.

## Sequence Data Formats

### .fasta / .fa / .fna - FASTA Format
**Description:** Text-based format for nucleotide or protein sequences
**Typical Data:** DNA, RNA, or protein sequences with headers
**Use Cases:** Sequence storage, BLAST searches, alignments
**Python Libraries:**
- `Biopython`: `SeqIO.parse('file.fasta', 'fa