
Harvard Artifacts Data Pipeline
Stand up a Harvard Art Museums API → ETL → SQL → Streamlit analytics stack for museum artifact exploration and dashboards.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-data-pipelineWhat is this skill?
- End-to-end flow: Harvard Art Museums API → ETL → SQL → analytics → Streamlit visualization
- Python stack with pandas, requests, mysql-connector, plotly, and python-dotenv
- Nested JSON to relational tables for queryable museum artifact data
- MySQL or TiDB load targets with interactive Streamlit dashboards
- Clone-and-requirements install path from the reference repository
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
Build is the right shelf because the skill orchestrates extraction, transformation, loading, and dashboard code—not idea validation or growth campaigns. Backend covers ETL jobs, SQL schemas, API ingestion, and the data layer feeding Streamlit rather than pure frontend styling.
SKILL.md
READMESKILL.md - Harvard Artifacts Data Pipeline
# Harvard Artifacts Data Pipeline Skill > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview The Harvard Artifacts Collection Data Engineering & Analytics App is an end-to-end data pipeline that demonstrates real-world ETL patterns. It extracts artifact data from the Harvard Art Museums API, transforms nested JSON into relational tables, loads data into SQL databases (MySQL/TiDB), and provides interactive analytics dashboards using Streamlit. **Architecture Flow:** API → ETL → SQL → Analytics → Visualization ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` ### Required Dependencies ```txt streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file in the project root: ```env HARVARD_API_KEY=your_api_key_here DB_HOST=your_database_host DB_USER=your_database_user DB_PASSWORD=your_database_password DB_NAME=harvard_artifacts ``` ### Get Harvard Art Museums API Key 1. Visit https://www.harvardartmuseums.org/collections/api 2. Register for a free API key 3. Add to `.env` file ### Database Setup The application uses three main tables: ```sql CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), dated VARCHAR(200), url TEXT ); CREATE TABLE artifactmedia ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, media_type VARCHAR(100), baseimageurl TEXT, renditionnumber VARCHAR(50), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); CREATE TABLE artifactcolors ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), percentage DECIMAL(5,2), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Running the Application ```bash # Start the Streamlit app streamlit run app.py ``` The app will be available at `http://localhost:8501` ## Code Examples ### API Data Extraction ```python import requests import os from dotenv import load_dotenv load_dotenv() def fetch_artifacts(page=1, size=100): """Fetch artifacts from Harvard Art Museums API""" api_key = os.getenv('HARVARD_API_KEY') base_url = "https://api.harvardartmuseums.org/object" params = { 'apikey': api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(base_url, params=params) response.raise_for_status() data = response.json() return data['records'], data['info'] # Fetch first 100 artifacts artifacts, info = fetch_artifacts(page=1, size=100) print(f"Total artifacts: {info['totalrecords']}") print(f"Pages: {info['pages']}") ``` ### ETL Pipeline Implementation ```python import pandas as pd import mysql.connector from mysql.connector import Error def transform_artifact_data(raw_artifacts): """Transform nested JSON to relational format""" metadata = [] media = [] colors = [] for artifact in raw_artifacts: # Extract metadata metadata.append({ 'id': artifact.get('id'), 'title': artifact.get('title'),