
Harvard Art Museums Data Engineering Pipeline
Install to stand up a Harvard Art Museums API → ETL → SQL → Streamlit analytics stack as a portfolio-grade data engineering reference.
Overview
Harvard Art Museums Data Engineering Pipeline is an agent skill for the Build phase that walks you through API extraction, Python ETL, SQL storage, and Streamlit analytics on museum artifact data.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museums-data-engineering-pipelineWhat is this skill?
- End-to-end flow: Harvard Art Museums API → ETL → SQL database → analytics → Streamlit visualization
- Demonstrates relational database design and SQL analytics on museum artifact collections
- Python batch extraction and transformation with environment-driven API and DB configuration
- Interactive Streamlit dashboard layer for exploring stored artifact data
- Clone-and-run project structure with requirements.txt and documented env vars (HARVARD_API_KEY, DB_HOST)
- Five-stage architecture flow: API → ETL → SQL Database → Analytics → Visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
What problem does it solve?
You want a credible data-engineering demo but lack a wired example from a real API through a database to a dashboard you can show recruiters or users.
Who is it for?
Solo builders learning or showcasing ETL plus SQL plus lightweight BI on open cultural data with a public REST API.
Skip if: Builders who only need a one-off CSV chart without a database, or teams that require enterprise orchestration (Airflow/K8s) out of the box.
When should I use this skill?
User asks to build an ETL pipeline with Harvard Art Museums data, create analytics dashboards, extract/transform API data, or design SQL for museum artifacts.
What do I get? / Deliverables
You get a runnable pipeline and visualization app with documented setup so artifact data flows from the Harvard API into SQL and Streamlit-backed analytics.
- Running ETL jobs loading artifact tables
- SQL analytics layer on museum data
- Streamlit dashboard for collection exploration
Recommended Skills
Journey fit
End-to-end pipeline construction and relational storage land squarely in Build when you are implementing backend data systems, not when you are only ideating on datasets. Backend subphase fits batch ETL, SQL modeling, and serving analytics—core deliverables of this skill rather than agent tooling or frontend-only UI polish.
How it compares
Full vertical slice data project skill—not a generic SQL tutor or a single-chart generator.
Common Questions / FAQ
Who is harvard-art-museums-data-engineering-pipeline for?
Solo developers and indie builders assembling a data portfolio piece with Python ETL, relational analytics, and Streamlit on museum collection APIs.
When should I use harvard-art-museums-data-engineering-pipeline?
Use in Build when implementing batch pipelines, designing artifact schemas, or standing up a Streamlit analytics dashboard fed by the Harvard Art Museums API.
Is harvard-art-museums-data-engineering-pipeline safe to install?
It drives external API calls and database credentials you configure locally; review the Security Audits panel on this page and rotate Harvard API keys and DB secrets you export in your environment.
SKILL.md
READMESKILL.md - Harvard Art Museums Data Engineering Pipeline
# Harvard Art Museums Data Engineering Pipeline > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project provides a complete data engineering solution for collecting, transforming, storing, and analyzing artifact data from the Harvard Art Museums API. It demonstrates real-world ETL patterns, relational database design, SQL analytics, and interactive visualization using Streamlit. **Architecture Flow:** API → ETL → SQL Database → Analytics → Visualization ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt # Set up environment variables export HARVARD_API_KEY="your_api_key_here" export DB_HOST="your_database_host" export DB_USER="your_database_user" export DB_PASSWORD="your_database_password" export DB_NAME="your_database_name" ``` ### Required Dependencies ```txt streamlit>=1.28.0 pandas>=2.0.0 requests>=2.31.0 mysql-connector-python>=8.1.0 plotly>=5.17.0 python-dotenv>=1.0.0 ``` ## Configuration ### API Setup Get your Harvard Art Museums API key from: https://www.harvardartmuseums.org/collections/api ```python import os from dotenv import load_dotenv load_dotenv() API_KEY = os.getenv('HARVARD_API_KEY') BASE_URL = "https://api.harvardartmuseums.org/object" ``` ### Database Configuration ```python import mysql.connector db_config = { 'host': os.getenv('DB_HOST'), 'user': os.getenv('DB_USER'), 'password': os.getenv('DB_PASSWORD'), 'database': os.getenv('DB_NAME') } def get_db_connection(): return mysql.connector.connect(**db_config) ``` ## Database Schema ### Table Definitions ```sql -- Artifact Metadata Table CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(255), century VARCHAR(100), classification VARCHAR(255), division VARCHAR(255), department VARCHAR(255), dated VARCHAR(255), accessionyear INT, totalpageviews INT, totaluniquepageviews INT, technique VARCHAR(500), creditline TEXT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); -- Artifact Media Table CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, baseimageurl VARCHAR(1000), iiifbaseuri VARCHAR(1000), imagecopyright TEXT, renditionnumber VARCHAR(50), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors Table CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), hue VARCHAR(50), percent DECIMAL(5,2), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## ETL Pipeline Implementation ### Extract: Fetch Data from API ```python import requests import time def fetch_artifacts(api_key, page=1, size=100): """ Fetch artifacts from Harvard Art Museums API with pagination """ params = { 'apikey': api_key, 'page': page, 'size': size } try: response = requests.get(BASE_URL, params=params) response.raise_for_status() data = response.json() return data.get('records', []), data.get('info', {}) except request