
Harvard Art Museum Data Engineering
Stand up an end-to-end Harvard Art Museums API ETL into SQL plus a Streamlit analytics dashboard for collection research or portfolio demos.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museum-data-engineeringWhat is this skill?
- Full pipeline: Harvard Art Museums API → transform → MySQL/TiDB → Streamlit dashboards
- Python stack with pandas, requests, mysql-connector-python, plotly, and python-dotenv
- Clone-and-run project layout with requirements.txt and .env-based API configuration
- Interactive visualization and artifact analytics on loaded relational tables
- Documented architecture flow from ingestion through analytics to visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
Canonical shelf is Build because the skill delivers extraction, relational modeling, load, and dashboard code—not launch SEO or production ops runbooks. Backend fits the SQL warehouse, Python ETL, and API integration work that feeds the analytics layer.
SKILL.md
READMESKILL.md - Harvard Art Museum Data Engineering
# Harvard Art Museum Data Engineering Skill > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project provides an end-to-end data engineering solution for the Harvard Art Museums API. It implements a complete ETL pipeline that extracts artifact data, transforms it into relational tables, loads it into SQL databases (MySQL/TiDB), and provides interactive analytics dashboards using Streamlit. **Architecture Flow**: API → ETL → SQL → Analytics → Visualization ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt # Required packages pip install streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file in the project root: ```bash # Harvard API Configuration HARVARD_API_KEY=your_api_key_here # Database Configuration DB_HOST=your_database_host DB_PORT=3306 DB_USER=your_username DB_PASSWORD=your_password DB_NAME=harvard_artifacts ``` ### Getting Harvard API Key 1. Visit: https://docs.api.harvardartmuseums.org/ 2. Register for a free API key 3. Store it in your `.env` file ## Database Schema The project uses three main tables with relational structure: ```sql -- Artifact Metadata Table CREATE TABLE artifactmetadata ( artifact_id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), division VARCHAR(200), technique VARCHAR(300), period VARCHAR(200), people VARCHAR(500), url TEXT, last_updated DATETIME ); -- Artifact Media Table CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, image_url TEXT, media_type VARCHAR(50), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(artifact_id) ); -- Artifact Colors Table CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color_hex VARCHAR(10), color_percent DECIMAL(5,2), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(artifact_id) ); ``` ## Core Usage Patterns ### 1. API Data Extraction ```python import requests import os from dotenv import load_dotenv load_dotenv() def fetch_artifacts(page=1, size=100): """Fetch artifacts from Harvard API with pagination""" api_key = os.getenv('HARVARD_API_KEY') base_url = "https://api.harvardartmuseums.org/object" params = { 'apikey': api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(base_url, params=params) if response.status_code == 200: data = response.json() return data['records'], data['info'] else: raise Exception(f"API Error: {response.status_code}") # Fetch multiple pages def collect_artifacts(total_records=500): """Collect artifacts with pagination handling""" all_artifacts = [] page = 1 page_size = 100 while len(all_artifacts) < total_records: records, info = fetch_artifacts(page=page, size=page_size) all_artifacts.extend(records) if page >= info['pages']: break page += 1 return all_artifacts[:total_records] ```