
Harvard Art Museum Etl Analytics
Stand up an end-to-end Harvard Art Museums API pipeline with relational storage, SQL analytics, and a Streamlit dashboard as a portfolio-grade data project.
Overview
Harvard Art Museum ETL Analytics is an agent skill for the Build phase that builds an API-to-warehouse-to-dashboard pipeline for Harvard Art Museums collection data using Python, SQL, and Streamlit.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museum-etl-analyticsWhat is this skill?
- Harvard Art Museums API extract with nested JSON flattened for relational tables
- MySQL or TiDB Cloud load path with analytical SQL over collection artifacts
- Architecture path API → ETL → SQL → Analytics → Visualization
- Streamlit plus Plotly dashboard layer for collection insights
- Clone-and-pip install workflow from the reference repository
- Five-step architecture: API → ETL → SQL → Analytics → Visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
What problem does it solve?
You want museum collection analytics but only have raw API JSON and no relational schema, ETL job, or dashboard to show the work.
Who is it for?
Solo builders shipping a data-engineering portfolio piece or internal prototype that must ingest a real public arts API and surface SQL-backed insights.
Skip if: Teams that only need a one-off chart without ETL, or products that cannot use external museum APIs and managed SQL hosts.
When should I use this skill?
You need to extract Harvard Art Museums data, load relational tables, run collection SQL, or ship a Streamlit analytics UI.
What do I get? / Deliverables
You get a documented API → ETL → SQL → analytics → Streamlit path with loadable schemas and query patterns you can extend or deploy as a demo app.
- Relational schema and ETL load for museum artifacts
- Analytical SQL queries over the collection
- Streamlit dashboard with Plotly visualizations
Recommended Skills
Journey fit
Canonical shelf is Build because the skill delivers extraction, transformation, loading, and query layers—not just one-off charts. Backend fits ETL, SQL schema design, and database loads; visualization is the downstream consumer of that stack.
How it compares
Use this workflow skill for a full pipeline narrative; use a chart-only skill when visualization is the only deliverable.
Common Questions / FAQ
Who is harvard-art-museum-etl-analytics for?
Indie developers and data-curious solo builders who want Harvard Art Museums API experience with ETL, SQL, and Streamlit in one guided project.
When should I use harvard-art-museum-etl-analytics?
During Build when you are wiring integrations and backends—e.g. designing artifact schemas, loading TiDB or MySQL, or standing up a Streamlit collection dashboard—or when Validate needs a prototype that proves real API data can power analytics.
Is harvard-art-museum-etl-analytics safe to install?
Review the Security Audits panel on this Prism page and inspect the cloned repo and API credentials handling before running ETL against production databases.
SKILL.md
READMESKILL.md - Harvard Art Museum Etl Analytics
# Harvard Art Museum ETL Analytics > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project is an end-to-end data engineering and analytics application that demonstrates how to: - Extract artifact data from the Harvard Art Museums API - Transform nested JSON into relational database structures - Load data into MySQL/TiDB Cloud databases - Run analytical SQL queries on art collection data - Visualize insights using Streamlit and Plotly The architecture follows: **API → ETL → SQL → Analytics → Visualization** ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` ### Required Dependencies ```txt streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### API Key Setup 1. Get a free API key from [Harvard Art Museums API](https://harvardartmuseums.org/collections/api) 2. Store it securely using environment variables: ```python import os from dotenv import load_dotenv load_dotenv() API_KEY = os.getenv('HARVARD_API_KEY') ``` ### Database Configuration ```python import mysql.connector db_config = { 'host': os.getenv('DB_HOST'), 'user': os.getenv('DB_USER'), 'password': os.getenv('DB_PASSWORD'), 'database': os.getenv('DB_NAME'), 'port': int(os.getenv('DB_PORT', 3306)) } connection = mysql.connector.connect(**db_config) ``` ## Database Schema ### Create Tables ```sql -- Artifact Metadata Table CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), division VARCHAR(200), dated VARCHAR(200), period VARCHAR(200), technique VARCHAR(500), medium VARCHAR(500), dimensions VARCHAR(500), url VARCHAR(500) ); -- Artifact Media Table CREATE TABLE artifactmedia ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, iiifbaseuri VARCHAR(500), baseimageurl VARCHAR(500), primaryimageurl VARCHAR(500), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors Table CREATE TABLE artifactcolors ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), hue VARCHAR(50), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## ETL Pipeline ### Extract: Fetch Data from API ```python import requests import pandas as pd def fetch_artifacts(api_key, page=1, size=100): """Fetch artifacts from Harvard Art Museums API with pagination""" base_url = "https://api.harvardartmuseums.org/object" params = { 'apikey': api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(base_url, params=params) if response.status_code == 200: data = response.json() return data['records'], data['info'] else: raise Exception(f"API Error: {response.status_code}") # Usage with pagination all_artifacts = [] page = 1 max_pages = 5 while page <= max_pages: records, info = fetch_art