
Harvard Art Museums Data Pipeline
Stand up an end-to-end Harvard Art Museums API ETL flow with SQL analytics and a Streamlit dashboard for a portfolio or internal research app.
Overview
Harvard Art Museums Data Pipeline is an agent skill for the Build phase that builds ETL, SQL analytics, and Streamlit visualization on Harvard Art Museums API data.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museums-data-pipelineWhat is this skill?
- End-to-end flow: Harvard Art Museums API fetch with pagination
- Transforms nested JSON into normalized relational tables
- Loads into MySQL or TiDB Cloud with analytical SQL on collections
- Streamlit plus Plotly dashboards for interactive museum insights
- Frontmatter lists eight concrete install and trigger phrases for discovery
- Eight frontmatter trigger phrases for Harvard Art Museums data engineering tasks
- Pipeline stages cover ETL, SQL analytics, and Streamlit visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; 0/1 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You want a real data-engineering portfolio piece but do not know how to wire museum API pagination, normalization, and dashboards together.
Who is it for?
Solo builders learning or showcasing data engineering with a well-documented public museum API and Python stack.
Skip if: Products that only need a one-off CSV download with no database or dashboard, or teams standardized on Spark-only batch infra with no Streamlit surface.
When should I use this skill?
When building ETL, SQL analytics, or Streamlit dashboards on Harvard Art Museums API artifact data per SKILL.md triggers.
What do I get? / Deliverables
You leave with a repeatable ETL into MySQL or TiDB, analytical queries on art metadata, and a Streamlit dashboard you can demo or extend.
- Normalized relational schema and loaded artifact tables
- Analytical SQL queries over the collection
- Streamlit dashboard with Plotly visualizations
Recommended Skills
Journey fit
Full pipeline construction—extract, transform, load, and visualize—is core Build work before the product is shippable. Backend subphase fits API pagination, relational modeling, and database loads rather than frontend polish alone.
How it compares
Skill-guided project workflow for a specific museum API—not a generic dbt or Airflow MCP integration.
Common Questions / FAQ
Who is harvard-art-museums-data-pipeline for?
Indie developers and data practitioners who want agent help building a Harvard Art Museums-powered ETL, SQL layer, and Streamlit app as a ship-ready demo or product core.
When should I use harvard-art-museums-data-pipeline?
Use it during Build when scoping a data app, wiring API extraction and relational models, or standing up analytics dashboards—especially when triggers mention Harvard museum ETL, Streamlit, or SQL on artifact collections.
Is harvard-art-museums-data-pipeline safe to install?
You will handle API keys and database credentials; review the Security Audits panel on this Prism page, clone from the referenced GitHub repo, and never commit secrets to the agent context.
SKILL.md
READMESKILL.md - Harvard Art Museums Data Pipeline
# Harvard Art Museums Data Pipeline Skill > Skill by [ara.so](https://ara.so) — Data Skills collection. This skill enables you to build complete data engineering pipelines using the Harvard Art Museums API. The project demonstrates ETL workflows, relational database design, SQL analytics, and interactive visualization with Streamlit. ## What It Does The Harvard Artifacts Collection Data Engineering Analytics App: - Fetches artifact data from Harvard Art Museums API with pagination - Transforms nested JSON into normalized relational tables - Loads data into MySQL/TiDB Cloud databases - Executes analytical SQL queries on museum collections - Visualizes insights using Plotly charts in Streamlit dashboards ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` **Required packages:** ```txt streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### API Key Setup Get your Harvard Art Museums API key from: https://www.harvardartmuseums.org/collections/api Store credentials in environment variables or `.env` file: ```bash # .env HARVARD_API_KEY=your_api_key_here DB_HOST=your_database_host DB_USER=your_db_username DB_PASSWORD=your_db_password DB_NAME=harvard_artifacts ``` ### Database Configuration Create the required database schema: ```sql CREATE DATABASE harvard_artifacts; USE harvard_artifacts; CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), period VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), dated VARCHAR(200), accession_number VARCHAR(100), rank INT, url VARCHAR(500) ); CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, base_image_url VARCHAR(500), image_count INT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), percentage FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Core ETL Pipeline ### Extract: Fetch Data from API ```python import requests import os from dotenv import load_dotenv load_dotenv() def fetch_artifacts(page=1, size=100): """Fetch artifacts from Harvard Art Museums API with pagination""" api_key = os.getenv('HARVARD_API_KEY') base_url = "https://api.harvardartmuseums.org/object" params = { 'apikey': api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(base_url, params=params) response.raise_for_status() data = response.json() return data['records'], data['info'] # Fetch multiple pages def fetch_all_artifacts(max_pages=10): all_artifacts = [] for page in range(1, max_pages + 1): records, info = fetch_artifacts(page=page) all_artifacts.extend(records) print(f"Fetched page {page}/{info['pages']}") if page >= info['pages']: break return all_artifacts ```