
Harvard Art Museums Etl Pipeline
Stand up an end-to-end Harvard Art Museums API ETL plus SQL storage and a Streamlit analytics dashboard as a solo data project.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museums-etl-pipelineWhat is this skill?
- Paginated extraction from the Harvard Art Museums REST API into Python pipelines
- Transforms nested artifact JSON into normalized relational schemas for MySQL/TiDB Cloud
- Batch load patterns and reusable SQL query sets for collection analytics
- Streamlit app layer with interactive Plotly charts over loaded museum metadata
- End-to-end data-engineering reference from extract through dashboard visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
Canonical shelf is Build because the skill walks through extracting API data, modeling relational schemas, loading batches, and shipping an interactive dashboard—not idea research or production ops. Backend fits the core work: pagination against the museums API, JSON normalization, MySQL/TiDB loads, and predefined SQL analytics feeding Plotly visualizations.
SKILL.md
READMESKILL.md - Harvard Art Museums Etl Pipeline
# Harvard Art Museums ETL Pipeline & Analytics > Skill by [ara.so](https://ara.so) — Data Skills collection. This project provides an end-to-end data engineering solution for extracting, transforming, and analyzing art museum data from the Harvard Art Museums API. It demonstrates ETL pipeline development, SQL database design, and interactive analytics dashboards using Streamlit. ## What This Project Does The Harvard Artifacts Collection Data Engineering & Analytics App: - **Extracts** artifact data from Harvard Art Museums API with pagination - **Transforms** nested JSON into normalized relational database schemas - **Loads** data into MySQL/TiDB Cloud with batch operations - **Analyzes** data using predefined SQL queries - **Visualizes** insights through interactive Plotly charts in Streamlit **Architecture Flow**: API → ETL → SQL → Analytics → Visualization ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` **Key Dependencies**: ``` streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file in the project root: ```env # Harvard Art Museums API HARVARD_API_KEY=your_api_key_here # Database Configuration DB_HOST=your_database_host DB_PORT=3306 DB_USER=your_db_user DB_PASSWORD=your_db_password DB_NAME=harvard_artifacts ``` ### Get Harvard API Key 1. Visit [Harvard Art Museums API](https://docs.harvardartmuseums.org/api/) 2. Request an API key (free) 3. Add to your `.env` file ### Database Setup ```sql -- Create database CREATE DATABASE harvard_artifacts; -- Create artifact metadata table CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), period VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), division VARCHAR(200), department VARCHAR(200), dated VARCHAR(200), medium VARCHAR(500), dimensions VARCHAR(500), creditline TEXT, accessionyear INT, technique VARCHAR(300), verificationlevel INT ); -- Create artifact media table CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, iiifbaseuri VARCHAR(500), baseimageurl VARCHAR(500), publiccaption TEXT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Create artifact colors table CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(100), spectrum VARCHAR(100), hue VARCHAR(100), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Running the Application ```bash streamlit run app.py ``` The Streamlit app will open in your browser at `http://localhost:8501`. ## Key Components ### 1. API Data Extraction **Fetching Artifacts with Pagination**: ```python import requests import os from dotenv import load_dotenv load_dotenv() API_KEY = os.getenv('HARVARD_API_KEY') def fetch_artifacts(num_pages=5, page_size=100): """Fetch artifacts from Harvard Art Museums API with pagination""" bas