
Harvard Artifacts Data Engineering Analytics
Spin up a reference ETL plus Streamlit analytics app on Harvard Art Museums API data when learning or demoing data engineering.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-data-engineering-analyticsWhat is this skill?
- End-to-end flow documented as API → ETL → SQL → Analytics → Visualization
- Harvard Art Museums API extraction with pagination and rate limiting
- Nested JSON transformed into relational tables with batch load to MySQL/TiDB Cloud
- Analytical SQL over artifacts, media, and color metadata
- Interactive Plotly charts inside a Streamlit dashboard
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
Recommended Skills
Paper Context Resolverlllllllama/ai-paper-reproduction-skill
Repo Intake And Planlllllllama/ai-paper-reproduction-skill
Env And Assets Bootstraplllllllama/ai-paper-reproduction-skill
Minimal Run And Auditlllllllama/ai-paper-reproduction-skill
Analyze Projectlllllllama/rigorpilot-skills
Ai Research Reproductionlllllllama/rigorpilot-skills
Journey fit
Primary fit
The skill teaches building production-shaped pipelines and dashboards—core backend and data product work in the build phase. Extraction, relational modeling, SQL loads, and API pagination belong on the backend/data shelf rather than frontend polish alone.
SKILL.md
READMESKILL.md - Harvard Artifacts Data Engineering Analytics
# Harvard Artifacts Data Engineering & Analytics > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project provides an end-to-end data engineering and analytics application for working with the Harvard Art Museums API. It demonstrates production-grade ETL pipelines, SQL database design, and interactive visualization using Streamlit. The architecture follows: **API → ETL → SQL → Analytics → Visualization**. Key capabilities: - Extract artifact data from Harvard Art Museums API with pagination and rate limiting - Transform nested JSON into relational database tables - Load data into MySQL/TiDB Cloud with optimized batch inserts - Execute analytical SQL queries on artifact metadata, media, and color data - Visualize results with interactive Plotly charts in Streamlit ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt # Required packages include: # - streamlit # - pandas # - requests # - mysql-connector-python # - plotly ``` ## Configuration ### API Key Setup Get your Harvard Art Museums API key from: https://docs.api.harvardartmuseums.org/ Store credentials using environment variables: ```bash export HARVARD_API_KEY="your_api_key_here" ``` ### Database Configuration Set up MySQL/TiDB Cloud connection parameters: ```python import os DB_CONFIG = { 'host': os.getenv('DB_HOST', 'localhost'), 'port': int(os.getenv('DB_PORT', 3306)), 'user': os.getenv('DB_USER'), 'password': os.getenv('DB_PASSWORD'), 'database': os.getenv('DB_NAME', 'harvard_artifacts') } ``` Environment variables: ```bash export DB_HOST="your_database_host" export DB_PORT="3306" export DB_USER="your_db_user" export DB_PASSWORD="your_db_password" export DB_NAME="harvard_artifacts" ``` ## Running the Application ```bash # Start the Streamlit app streamlit run app.py # Access at http://localhost:8501 ``` ## Database Schema The ETL pipeline creates three relational tables: ```sql -- Artifact Metadata CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), period VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), technique VARCHAR(500), accessionyear INT, totalpageviews INT, totaluniquepageviews INT ); -- Artifact Media CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, baseimageurl VARCHAR(500), format VARCHAR(50), description TEXT, iiifbaseuri VARCHAR(500), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(100), hue VARCHAR(100), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## ETL Pipeline Code Examples ### Extract: Fetching Data from API ```python import requests import pandas as pd import os def fetch_artifacts(api_key, num_records=100): """ Extract artifact data from Harvard Art Museums API with pagination """ base_url = "https://api.ha