
Harvard Artifacts Collection Data Engineering
Stand up a Harvard Art Museums API ETL pipeline with SQL analytics and Streamlit dashboards for artifact exploration.
Overview
Harvard Artifacts Collection Data Engineering is an agent skill for the Build phase that implements Harvard Art Museums API ETL, SQL analytics, and Streamlit visualization end to end.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-artifacts-collection-data-engineeringWhat is this skill?
- Harvard Art Museums API integration with pagination and rate limiting
- ETL that flattens nested JSON into relational tables: artifactmetadata, artifactmedia, artifactcolors
- 20+ predefined SQL analytics queries over the loaded warehouse
- Streamlit interactive dashboards on top of the SQL layer
- Documented architecture: API → ETL → SQL → Analytics → Visualization
- Three relational tables: artifactmetadata, artifactmedia, artifactcolors
- Five-stage architecture: API → ETL → SQL → Analytics → Visualization
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; trending (+100% hot-view momentum).
What problem does it solve?
You need a real-world data engineering sample but lack a working pattern for museum API ingestion, relational modeling, and interactive analytics.
Who is it for?
Indie builders learning ETL plus lightweight BI who want a public API dataset with SQL and Streamlit already wired.
Skip if: Production teams that need unrelated CRM or payments pipelines, or projects without tolerance for external API keys and rate limits.
When should I use this skill?
Triggers include building museum ETL, connecting to Harvard Art Museums API, batch loading artifacts, SQL analysis, or Streamlit visualization of collection metadata.
What do I get? / Deliverables
You get a runnable pipeline from Harvard API extracts through SQL tables to Streamlit dashboards plus reusable query patterns for collection insights.
- Populated SQL tables for artifact metadata, media, and colors
- Batch ETL jobs with pagination and rate-limit handling
- Streamlit dashboards and 20+ reusable analytical SQL queries
Recommended Skills
Journey fit
End-to-end data engineering—from external API to warehouse tables to dashboards—is classic Build work for a demo app or internal analytics product. The skill centers on HTTP API integration, pagination, and rate limiting before SQL and visualization layers.
How it compares
Full sample data-app workflow, not a single-purpose SQL linter or generic REST MCP wrapper.
Common Questions / FAQ
Who is harvard-artifacts-collection-data-engineering for?
Solo developers and students building data portfolios who want Harvard Art Museums data loaded into SQL with Streamlit exploration out of the box.
When should I use harvard-artifacts-collection-data-engineering?
During Build integrations when you are implementing batch loads from the Harvard API, designing artifact warehouse tables, or standing up museum analytics dashboards.
Is harvard-artifacts-collection-data-engineering safe to install?
Review the Security Audits panel on this Prism page; the skill calls an external museum API and runs ETL locally—protect API keys and only use trusted environments.
SKILL.md
READMESKILL.md - Harvard Artifacts Collection Data Engineering
# Harvard Artifacts Collection Data Engineering > Skill by [ara.so](https://ara.so) — Data Skills collection. This project provides an end-to-end data engineering and analytics application built on the Harvard Art Museums API. It demonstrates real-world ETL pipelines, SQL database design, analytical queries, and interactive visualization using Streamlit. The architecture follows: API → ETL → SQL → Analytics → Visualization. ## What This Project Does - **API Integration**: Fetches artifact data from Harvard Art Museums API with pagination and rate limiting - **ETL Pipeline**: Extracts, transforms, and loads nested JSON into relational database tables - **SQL Database**: Stores structured data across `artifactmetadata`, `artifactmedia`, and `artifactcolors` tables - **Analytics**: Executes 20+ predefined SQL queries for insights - **Visualization**: Interactive dashboards using Plotly and Streamlit ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` **Required packages:** ```txt streamlit pandas requests mysql-connector-python plotly ``` ## Configuration ### Environment Variables Set up your configuration before running: ```bash # Harvard Art Museums API Key (get from https://www.harvardartmuseums.org/collections/api) export HARVARD_API_KEY="your_api_key_here" # Database credentials export DB_HOST="your_database_host" export DB_PORT="3306" export DB_USER="your_database_user" export DB_PASSWORD="your_database_password" export DB_NAME="harvard_artifacts" ``` ### Database Setup Create the required tables: ```sql CREATE DATABASE harvard_artifacts; USE harvard_artifacts; CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(255), century VARCHAR(100), classification VARCHAR(255), division VARCHAR(255), department VARCHAR(255), technique VARCHAR(500), period VARCHAR(255), dated VARCHAR(255), url TEXT, lastupdate DATETIME ); CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, baseimageurl TEXT, iiifbaseuri TEXT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), percentage FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Key Components and Usage ### 1. API Data Extraction ```python import requests import os class HarvardAPIClient: def __init__(self): self.api_key = os.getenv('HARVARD_API_KEY') self.base_url = "https://api.harvardartmuseums.org/object" def fetch_artifacts(self, page=1, size=100): """Fetch artifacts with pagination""" params = { 'apikey': self.api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(self.base_url, params=params) response.raise_for_status() return response.json() def fetch_multiple_pages(self, num_pages=10): """Fetch multiple pages of artifacts""" al