
Harvard Artifacts Data Engineering Pipeline
Spin up a solo learning or portfolio project that ingests Harvard Art Museums data, lands it in SQL, and ships a Streamlit analytics UI.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-data-engineering-pipelineWhat is this skill?
- End-to-end architecture: Harvard Art Museums API → ETL → SQL → analytics → Streamlit visualization
- API client patterns with pagination and rate limiting against a public museum REST API
- Normalizes nested JSON into relational tables with foreign keys for MySQL or TiDB Cloud
- Runs 20+ predefined SQL queries for collection insights
- Interactive Plotly charts inside a Streamlit dashboard for artifact exploration
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
The skill walks through implementing extraction, transformation, storage, and serving logic—the core backend/data layer of a product. ETL design, relational schema, SQL analytics, and API client code are backend engineering work, even though Streamlit provides the read layer.
SKILL.md
READMESKILL.md - Harvard Artifacts Data Engineering Pipeline
# Harvard Artifacts Data Engineering Pipeline > Skill by [ara.so](https://ara.so) — Data Skills collection This project demonstrates an end-to-end data engineering pipeline that extracts artifact data from the Harvard Art Museums API, transforms it into structured formats, loads it into a SQL database, and provides interactive analytics through a Streamlit dashboard. ## What It Does - **API Integration**: Fetches artifact data from Harvard Art Museums API with pagination and rate limiting - **ETL Pipeline**: Transforms nested JSON responses into normalized relational tables - **SQL Storage**: Stores data in MySQL/TiDB Cloud with proper schema design and foreign keys - **Analytics**: Executes 20+ predefined SQL queries for insights - **Visualization**: Interactive Plotly charts and Streamlit dashboard **Architecture**: `API → ETL → SQL → Analytics → Visualization` ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` ### Required Dependencies ```txt streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file or use Streamlit secrets: ```python # .env HARVARD_API_KEY=your_api_key_here DB_HOST=your_database_host DB_PORT=3306 DB_USER=your_db_user DB_PASSWORD=your_db_password DB_NAME=harvard_artifacts ``` ### API Key Setup Get your API key from [Harvard Art Museums API](https://www.harvardartmuseums.org/collections/api): ```python import os from dotenv import load_dotenv load_dotenv() API_KEY = os.getenv('HARVARD_API_KEY') ``` ## Database Schema The pipeline creates three main tables: ```sql -- Artifact Metadata CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(255), century VARCHAR(100), classification VARCHAR(255), department VARCHAR(255), technique VARCHAR(255), medium VARCHAR(500), dated VARCHAR(100), accessionyear INT ); -- Artifact Media CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, baseimageurl VARCHAR(500), imagecount INT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), percentage FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Core ETL Functions ### Extract: Fetch Data from API ```python import requests import pandas as pd def fetch_artifacts(api_key, num_pages=5): """Fetch artifacts from Harvard Art Museums API with pagination""" base_url = "https://api.harvardartmuseums.org/object" all_artifacts = [] for page in range(1, num_pages + 1): params = { 'apikey': api_key, 'size': 100, # Max per page 'page': page } response = requests.get(base_url, params=params) if response.status_code == 200: data = response.json() all_artifacts.extend(data.get('records', [])) else: print(f"Error fetching page {page}: {response.status_code}") break