
Harvard Art Museum Data Pipeline
Scaffold Harvard Art Museums API ETL, relational storage, SQL analytics, and a Streamlit dashboard for collection insights.
Overview
harvard-art-museum-data-pipeline is an agent skill for the Build phase that builds ETL, SQL analytics, and Streamlit dashboards on Harvard Art Museums API data.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museum-data-pipelineWhat is this skill?
- Harvard Art Museums API integration with pagination and rate limiting
- ETL flow that flattens nested JSON into normalized relational tables
- 20+ predefined SQL analytical queries for artifact insights
- Streamlit dashboard with Plotly visualizations
- Normalized database schema for museum artifact metadata
- ETL pipeline with pagination and rate limiting on Harvard Art Museums API
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; 2/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You want a museum collection analytics app but do not have a proven ETL design, schema, or query library for the Harvard API’s nested payloads.
Who is it for?
Solo builders prototyping data apps, portfolio pipelines, or internal dashboards on open cultural heritage APIs.
Skip if: Production deployments that need unrelated museums, real-time streaming ingest, or zero Python/SQL setup.
When should I use this skill?
Triggers include building Harvard Art Museums ETL, Streamlit analytics apps, SQL artifact analysis, and museum API database projects.
What do I get? / Deliverables
You get a working pipeline into relational tables, ready-made analytical SQL, and a Streamlit Plotly dashboard you can extend or deploy.
- Normalized relational schema and loaded artifact tables
- ETL scripts with API pagination handling
- Streamlit analytics app with Plotly charts
Recommended Skills
Journey fit
End-to-end pipeline and app construction is core Build work once you have an API-backed product or internal analytics tool to ship. backend fits because the skill centers on ETL, schema design, SQL, and API integration rather than UI polish alone.
How it compares
Use as a full-stack data blueprint—not a single REST client snippet or static CSV import.
Common Questions / FAQ
Who is harvard-art-museum-data-pipeline for?
It is for solo and indie developers who want agent help building Python ETL, SQL analytics, and Streamlit visualization on Harvard Art Museums collection data.
When should I use harvard-art-museum-data-pipeline?
Use it in Build/backend when you need to query the Harvard Art Museums API, store artifacts in a database, run analytical SQL, or stand up a Streamlit dashboard—as listed in the skill triggers.
Is harvard-art-museum-data-pipeline safe to install?
Review the Security Audits panel on this Prism page; the skill uses external API access and local database/filesystem work, so confirm keys, rate limits, and dependencies in the upstream repo.
SKILL.md
READMESKILL.md - Harvard Art Museum Data Pipeline
# Harvard Art Museum Data Pipeline Skill > Skill by [ara.so](https://ara.so) — Data Skills collection. This skill enables AI coding agents to help developers build end-to-end data engineering and analytics applications using the Harvard Art Museums API. The project demonstrates ETL pipelines, SQL database design, analytics queries, and interactive Streamlit dashboards for museum artifact data. ## What This Project Does The Harvard Artifacts Collection Data Engineering Analytics App provides: - **API Integration**: Fetch artifact data from Harvard Art Museums API with pagination and rate limiting - **ETL Pipeline**: Extract, transform, and load nested JSON into relational database tables - **SQL Analytics**: 20+ predefined analytical queries for artifact insights - **Interactive Dashboard**: Streamlit-based visualization with Plotly charts - **Database Design**: Normalized schema with artifact metadata, media, and color tables ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt # Set up environment variables # Create .env file with: # HARVARD_API_KEY=your_api_key_here # DB_HOST=your_database_host # DB_USER=your_database_user # DB_PASSWORD=your_database_password # DB_NAME=harvard_artifacts ``` ## Configuration ### API Key Setup Get your API key from [Harvard Art Museums API](https://www.harvardartmuseums.org/collections/api): ```python import os from dotenv import load_dotenv load_dotenv() API_KEY = os.getenv('HARVARD_API_KEY') ``` ### Database Configuration The project supports MySQL and TiDB Cloud: ```python import mysql.connector import os db_config = { 'host': os.getenv('DB_HOST'), 'user': os.getenv('DB_USER'), 'password': os.getenv('DB_PASSWORD'), 'database': os.getenv('DB_NAME'), 'port': int(os.getenv('DB_PORT', 3306)) } conn = mysql.connector.connect(**db_config) ``` ## Database Schema ### Create Tables ```sql -- Artifact Metadata Table CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), century VARCHAR(100), dated VARCHAR(200), classification VARCHAR(200), department VARCHAR(200), division VARCHAR(200), technique VARCHAR(500), medium VARCHAR(500), period VARCHAR(200), provenance TEXT, description TEXT, url VARCHAR(500), totalpageviews INT, totaluniquepageviews INT ); -- Artifact Media Table CREATE TABLE artifactmedia ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, baseimageurl VARCHAR(500), primaryimageurl VARCHAR(500), imagecount INT, videocount INT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors Table CREATE TABLE artifactcolors ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), hue VARCHAR(50), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## ETL Pipeline Implementation ### Extract: Fetch Data from API ```python import requests import pandas as pd import time def fetch_artifacts(api_key, num_records=100, page_size=100): """ Fetch artifact data from Harvard Art Museums API """ base_url = "https: