
Harvard Artifacts Collection Data Engineering Analytics
Spin up a Harvard Art Museums API ETL into SQL plus a Streamlit analytics dashboard as a reference data-engineering build.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-collection-data-engineering-analyticsWhat is this skill?
- End-to-end ETL from Harvard Art Museums API with pagination and rate limiting
- Nested JSON normalized into relational schemas with batch SQL loads
- 20+ analytical SQL queries for collection insights
- Interactive Streamlit dashboard with Plotly visualizations
- Targets MySQL / TiDB Cloud storage in documented installation path
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; 1/2 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
Recommended Skills
Journey fit
Triggers explicitly target building pipelines, warehouses, and visualization apps—canonical home is Build backend/data implementation. Backend subphase covers ETL extraction, relational modeling, batch loads, and serving analytics—not frontend marketing or launch distribution.
Common Questions / FAQ
Is Harvard Artifacts Collection Data Engineering Analytics safe to install?
skills.sh reports 1 of 2 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Harvard Artifacts Collection Data Engineering Analytics
# Harvard Artifacts Collection Data Engineering Analytics > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project provides a complete data engineering and analytics solution for the Harvard Art Museums API. It demonstrates production-grade ETL pipelines that extract artifact metadata, transform nested JSON into relational schemas, load into SQL databases (MySQL/TiDB Cloud), and visualize insights through an interactive Streamlit dashboard. The application handles: - API pagination and rate limiting - Nested JSON transformation into normalized tables - Batch SQL operations for performance - 20+ analytical queries for artifact insights - Real-time interactive visualizations with Plotly ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` **Required dependencies:** ```txt streamlit pandas requests mysql-connector-python plotly ``` ## Configuration ### API Key Setup Obtain a Harvard Art Museums API key from: https://www.harvardartmuseums.org/collections/api Configure the API key in your environment: ```bash export HARVARD_API_KEY="your_api_key_here" ``` Or store in Streamlit secrets (`.streamlit/secrets.toml`): ```toml HARVARD_API_KEY = "your_api_key_here" ``` ### Database Setup Configure MySQL/TiDB Cloud connection: ```python import os DB_CONFIG = { 'host': os.getenv('DB_HOST', 'localhost'), 'user': os.getenv('DB_USER', 'root'), 'password': os.getenv('DB_PASSWORD'), 'database': os.getenv('DB_NAME', 'harvard_artifacts'), 'port': int(os.getenv('DB_PORT', 3306)) } ``` ## Project Structure ``` ├── app.py # Main Streamlit application ├── etl_pipeline.py # ETL logic for data extraction/transformation ├── db_utils.py # Database connection and operations ├── sql_queries.py # Analytical SQL queries ├── requirements.txt # Python dependencies └── .streamlit/ └── secrets.toml # API keys and DB credentials ``` ## Database Schema ### Table: artifactmetadata ```sql CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(255), period VARCHAR(255), century VARCHAR(100), dated VARCHAR(255), classification VARCHAR(255), department VARCHAR(255), division VARCHAR(255), technique VARCHAR(500), medium VARCHAR(500), dimensions VARCHAR(500), creditline TEXT, provenance TEXT, copyright TEXT, url VARCHAR(500) ); ``` ### Table: artifactmedia ```sql CREATE TABLE artifactmedia ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, image_url VARCHAR(1000), thumbnail_url VARCHAR(1000), media_type VARCHAR(100), FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ### Table: artifactcolors ```sql CREATE TABLE artifactcolors ( id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), hue VARCHAR(50), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Core ETL Pipeline ### Extract: Fetch Data from API ```python import requests import time def fetch_artifacts(api_key, num_record