
Harvard Artifacts Collection Analytics Pipeline
Stand up a Harvard Art Museums API ETL job, SQL analytics layer, and Streamlit dashboard for collection exploration without designing the pipeline from scratch.
Overview
Harvard Artifacts Collection Analytics Pipeline is an agent skill for the Build phase that implements Harvard Art Museums API ETL, SQL analytics, and a Streamlit visualization app.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-collection-analytics-pipelineWhat is this skill?
- End-to-end flow: Harvard Art Museums API → ETL → SQL → analytics → Streamlit visualization
- ETL for artifact metadata, media, and color fields into MySQL/TiDB Cloud
- 20+ analytical SQL queries for collection insights
- Interactive Streamlit dashboard with Plotly charts
- Clone-and-install workflow with requirements.txt (streamlit, pandas, requests, etc.)
- Architecture chain: API → ETL → SQL → Analytics → Visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; 1/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
What problem does it solve?
You want museum collection analytics but lack a wired path from the Harvard API through a database to queries and a dashboard.
Who is it for?
Solo data engineers and builders prototyping analytics products on open museum APIs with Python, SQL, and Streamlit.
Skip if: Production Harvard API compliance reviews, non-Python stacks only, or fiction/creative-writing workflows.
When should I use this skill?
Build a data pipeline for Harvard Art Museums API; create ETL for artifacts; set up collection analytics with Streamlit; implement SQL analytics for art artifacts.
What do I get? / Deliverables
You replicate a documented ETL plus 20+ SQL analyses and an interactive Streamlit front end you can extend or deploy.
- Populated SQL database with artifact, media, and color datasets
- Runnable SQL analytics query set
- Streamlit interactive dashboard with Plotly visualizations
Recommended Skills
Journey fit
Build phase is where API ingestion, database modeling, queries, and an interactive app are assembled into one deliverable. Backend subphase covers ETL, SQL storage (MySQL/TiDB), and the analytics serving path before you ship dashboards to users.
How it compares
A full reference pipeline skill and repo pattern, not a single MCP tool or one-off SQL snippet.
Common Questions / FAQ
Who is harvard-artifacts-collection-analytics-pipeline for?
Indie builders and data practitioners who need a worked example of museum API ingestion, relational analytics, and Streamlit delivery in one project.
When should I use harvard-artifacts-collection-analytics-pipeline?
During Build backend work when you are asked to create ETL for Harvard artifact data, SQL collection analytics, or an interactive dashboard for art metadata.
Is harvard-artifacts-collection-analytics-pipeline safe to install?
The skill points at an external GitHub clone and pip installs; review the Security Audits panel on this Prism page and vet dependencies and API keys before running ETL against live services.
SKILL.md
READMESKILL.md - Harvard Artifacts Collection Analytics Pipeline
# Harvard Artifacts Collection Analytics Pipeline > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project provides a complete data engineering solution for the Harvard Art Museums API, featuring: - ETL pipeline for artifact metadata, media, and color data - SQL database storage (MySQL/TiDB Cloud) - 20+ analytical SQL queries - Interactive Streamlit dashboard with Plotly visualizations The architecture follows: **API → ETL → SQL → Analytics → Visualization** ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` ### Required Dependencies ```python # requirements.txt typically includes: streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file in the project root: ```bash # Harvard Art Museums API HARVARD_API_KEY=your_api_key_here # MySQL/TiDB Cloud Connection DB_HOST=your_database_host DB_PORT=3306 DB_USER=your_username DB_PASSWORD=your_password DB_NAME=harvard_artifacts ``` ### Database Setup ```python import mysql.connector from mysql.connector import Error def create_database_connection(): """Establish MySQL/TiDB connection""" try: connection = mysql.connector.connect( host=os.getenv('DB_HOST'), port=os.getenv('DB_PORT'), user=os.getenv('DB_USER'), password=os.getenv('DB_PASSWORD'), database=os.getenv('DB_NAME') ) return connection except Error as e: print(f"Database connection error: {e}") return None def create_tables(connection): """Create database schema""" cursor = connection.cursor() # Artifact Metadata Table cursor.execute(""" CREATE TABLE IF NOT EXISTS artifactmetadata ( artifact_id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), dated VARCHAR(200), period VARCHAR(200), technique VARCHAR(500), medium VARCHAR(500), dimensions VARCHAR(500), creditline TEXT, url VARCHAR(500) ) """) # Artifact Media Table cursor.execute(""" CREATE TABLE IF NOT EXISTS artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, image_url VARCHAR(1000), caption TEXT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(artifact_id) ) """) # Artifact Colors Table cursor.execute(""" CREATE TABLE IF NOT EXISTS artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color_hex VARCHAR(10), color_name VARCHAR(100), percentage FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(artifact_id) ) """) connection.commit() cursor.close() ``` ## ETL Pipeline ### Extract: Fetch Data from Harvard API ```python import requests import time def fetch_artifacts_