Harvard Art Museums Data Engineering App

Author: aradotso
Repository: aradotso/data-skills

1.5k installs
4 repo stars
Updated July 18, 2026

harvard-art-museums-data-engineering-app is an agent workflow from the data-skills collection that bootstraps an ETL pipeline from the Harvard Art Museums API into SQL, with a Streamlit analytics dashboard and interactive Plotly charts.

About

harvard-art-museums-data-engineering-app is a skill from aradotso/data-skills (the ara.so Data Skills collection) for developers building analytics pipelines over museum metadata. It defines eight trigger phrases covering ETL setup, SQL analytics, artifact collection, and Streamlit dashboards, and it delivers an end-to-end flow: ingestion from the Harvard Art Museums API, storage in SQL, and Plotly visualizations in Streamlit. Use it to bootstrap a teaching demo, portfolio data app, or internal collection explorer without designing the pipeline scaffolding from scratch. It suits data engineers and backend developers who want a concrete API-to-dashboard reference architecture for cultural-heritage metadata.

End-to-end flow: Harvard Art Museums API → ETL → SQL → Streamlit analytics → Plotly visualizations
Relational warehouse pattern with MySQL/TiDB Cloud for artifact collections
Streamlit UI for interactive exploration of museum metadata
Trigger phrases cover ETL setup, SQL querying, and collection pattern analysis
Clone-and-install path with requirements.txt from the reference GitHub project

Harvard Art Museums Data Engineering App by the numbers

1,549 all-time installs (skills.sh)
+2 installs in the week ending Jul 28, 2026 (Skillselion tracking)
Ranked #143 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museums-data-engineering-app


Installs	1.5k
repo stars	★ 4
Security audit	1 / 3 scanners passed
Last updated	July 18, 2026
Repository	aradotso/data-skills ↗

How do you build an ETL pipeline for museum API data?

Bootstrap an end-to-end Harvard Art Museums ETL into SQL with a Streamlit analytics dashboard and Plotly charts.

Who is it for?

Data engineers prototyping museum or cultural-metadata pipelines who want API-to-SQL-to-Streamlit scaffolding with interactive Plotly charts.

Skip if: you need a production museum CMS replacement or real-time streaming ingestion at scale, or your project does not use the Harvard Art Museums API.

When should I use this skill?

Use it when a user asks to build an ETL pipeline for museum artifact data, run SQL analytics on art collections, or create a Streamlit dashboard for Harvard Art Museums API data.

What you get

ETL ingestion scripts, SQL analytics schema, Streamlit dashboard app, and Plotly visualization charts for museum artifacts.

ETL ingestion scripts
SQL analytics schema
Streamlit dashboard with Plotly charts

By the numbers

Documents eight trigger phrases for ETL, SQL, and Streamlit tasks
Pipeline spans Harvard Art Museums API ingestion through Plotly charts

Files

SKILL.md (Markdown, on GitHub)

Harvard Art Museums Data Engineering App

Skill by ara.so — Data Skills collection.

This project provides an end-to-end data engineering and analytics application built on the Harvard Art Museums API. It demonstrates real-world ETL pipelines, SQL database design, analytical queries, and interactive visualization using Streamlit.

What This Project Does

The application follows a complete data pipeline: API → ETL → SQL → Analytics → Visualization

Collects artifact data from Harvard Art Museums API with pagination and rate limiting
Transforms nested JSON into normalized relational tables
Loads data into MySQL/TiDB Cloud databases
Analyzes with 20+ predefined SQL queries
Visualizes results through interactive Plotly dashboards in Streamlit

Installation

# Clone the repository
git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git
cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
export HARVARD_API_KEY="your_api_key_here"
export DB_HOST="your_database_host"
export DB_USER="your_database_user"
export DB_PASSWORD="your_database_password"
export DB_NAME="harvard_artifacts"

Required dependencies:

streamlit
pandas
requests
mysql-connector-python
plotly
python-dotenv

Configuration

API Key Setup

Request a free API key from the Harvard Art Museums API, then load it from the environment:

import os
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv('HARVARD_API_KEY')
BASE_URL = "https://api.harvardartmuseums.org/object"

Database Connection

import mysql.connector
import os

db_config = {
    'host': os.getenv('DB_HOST'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD'),
    'database': os.getenv('DB_NAME')
}

connection = mysql.connector.connect(**db_config)
cursor = connection.cursor()
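
Because `os.getenv` returns `None` for unset variables, a missing setting only surfaces later as a driver error. A minimal sketch of a pre-flight check, assuming the four variable names used above; `missing_settings` and `REQUIRED_VARS` are hypothetical names for illustration and are not part of the project:

```python
import os

# Assumption: these are the only settings the connection needs
REQUIRED_VARS = ("DB_HOST", "DB_USER", "DB_PASSWORD", "DB_NAME")

def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"DB_HOST": "localhost", "DB_USER": "etl", "DB_NAME": "harvard_artifacts"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no arguments checks the real environment, so the app can stop with a readable message before it attempts to connect.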

Database Schema

The application uses three normalized tables:

-- Artifact metadata table
CREATE TABLE artifactmetadata (
    id INT PRIMARY KEY,
    title VARCHAR(500),
    culture VARCHAR(200),
    century VARCHAR(100),
    classification VARCHAR(200),
    department VARCHAR(200),
    dated VARCHAR(200),
    medium VARCHAR(500),
    technique VARCHAR(500),
    period VARCHAR(200),
    primaryimageurl TEXT,
    verificationlevel INT,
    accesslevel INT
);

-- Artifact media table
CREATE TABLE artifactmedia (
    id INT AUTO_INCREMENT PRIMARY KEY,
    artifact_id INT,
    baseimageurl TEXT,
    format VARCHAR(50),
    height INT,
    width INT,
    FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id)
);

-- Artifact colors table
CREATE TABLE artifactcolors (
    id INT AUTO_INCREMENT PRIMARY KEY,
    artifact_id INT,
    color VARCHAR(50),
    spectrum VARCHAR(50),
    percentage FLOAT,
    FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id)
);
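
To see how the foreign keys tie the child tables to `artifactmetadata`, the relationship can be tried locally. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for MySQL/TiDB, trims the tables to a few columns, and loads made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE artifactmetadata (id INTEGER PRIMARY KEY, title TEXT, culture TEXT);
CREATE TABLE artifactcolors (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    artifact_id INTEGER REFERENCES artifactmetadata(id),
    color TEXT,
    percentage REAL
);
""")

# Made-up sample rows, for illustration only
conn.execute("INSERT INTO artifactmetadata VALUES (1, 'Sample bowl', 'Chinese')")
conn.execute("INSERT INTO artifactmetadata VALUES (2, 'Sample print', 'Japanese')")
conn.executemany(
    "INSERT INTO artifactcolors (artifact_id, color, percentage) VALUES (?, ?, ?)",
    [(1, "#c8c8c8", 0.6), (1, "#323232", 0.4), (2, "#fafafa", 1.0)],
)

# One row per artifact with the number of color entries attached to it
rows = conn.execute("""
    SELECT m.title, COUNT(c.id) AS color_count
    FROM artifactmetadata AS m
    LEFT JOIN artifactcolors AS c ON c.artifact_id = m.id
    GROUP BY m.id, m.title
    ORDER BY m.id
""").fetchall()

def orphan_insert_rejected(connection):
    """Return True if a color row pointing at a missing artifact is refused."""
    try:
        connection.execute(
            "INSERT INTO artifactcolors (artifact_id, color, percentage) VALUES (99, '#000000', 1.0)"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

The same join works against the MySQL schema above. The practical consequence of the constraint is that metadata rows must be loaded before their media and color rows.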

ETL Pipeline

Extract: Fetch Data from API

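import requests
import time

# BASE_URL is defined in the API Key Setup section above

def fetch_artifacts(api_key, num_records=100, page_size=100):
    """
    Fetch up to num_records artifacts from the Harvard Art Museums API, one page at a time
    """
    artifacts = []
    # Ceiling division: number of pages needed to cover num_records
    pages_needed = (num_records + page_size - 1) // page_size

    for page in range(1, pages_needed + 1):
        params = {'apikey': api_key, 'size': page_size, 'page': page}
        try:
            # Pass query parameters via params so requests handles URL encoding;
            # the timeout keeps a stalled request from hanging the pipeline
            response = requests.get(BASE_URL, params=params, timeout=30)
            response.raise_for_status()
            data = response.json()
        except requests.exceptions.RequestException as e:
            print(f"Error fetching page {page}: {e}")
            break

        records = data.get('records', [])
        if not records:
            # No more results available
            break
        artifacts.extend(records)

        if len(artifacts) >= num_records:
            break

        # Rate limiting: pause between requests
        time.sleep(0.5)

    return artifacts[:num_records]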
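
The page count in `fetch_artifacts` comes from a ceiling division. The sketch below isolates that arithmetic and builds the per-page URLs without making any network calls. `build_page_urls` is a hypothetical helper, and the `apikey`, `size`, and `page` parameter names are taken from the function above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.harvardartmuseums.org/object"

def build_page_urls(api_key, num_records, page_size=100):
    """Return one request URL per page needed to cover num_records."""
    if num_records <= 0:
        return []
    # Ceiling division without floats: 250 records at 100 per page -> 3 pages
    pages_needed = (num_records + page_size - 1) // page_size
    return [
        f"{BASE_URL}?{urlencode({'apikey': api_key, 'size': page_size, 'page': page})}"
        for page in range(1, pages_needed + 1)
    ]

for url in build_page_urls("YOUR_KEY", 250):
    print(url)
```

Building the query string with `urlencode` also escapes any characters in the key that are not URL-safe.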

Transform: Normalize JSON Data

import pandas as pd

def transform_artifact_data(artifacts):
    """
    Transform nested JSON into normalized dataframes
    """
    metadata_list = []
    media_list = []
    colors_list = []
    
    for artifact in artifacts:
        # Extract metadata
        metadata = {
            'id': artifact.get('id'),
            'title': artifact.get('title'),
            'culture': artifact.get('culture'),
            'century': artifact.get('century'),
            'classification': artifact.get('classification'),
            'department': artifact.get('department'),
            'dated': artifact.get('dated'),
            'medium': artifact.get('medium'),
            'technique': artifact.get('technique'),
            'period': artifact.get('period'),
            'primaryimageurl': artifact.get('primaryimageurl'),
            'verificationlevel': artifact.get('verificationlevel'),
            'accesslevel': artifact.get('accesslevel')
        }
        metadata_list.append(metadata)
        
        # Extract images/media
        images = artifact.get('images', [])
        for img in images:
            media = {
                'artifact_id': artifact.get('id'),
                'baseimageurl': img.get('baseimageurl'),
                'format': img.get('format'),
                'height': img.get('height'),
                'width': img.get('width')
            }
            media_list.append(media)
        
        # Extract colors
        colors = artifact.get('colors', [])
        for color in colors:
            color_entry = {
                'artifact_id': artifact.get('id'),
                'color': color.get('color'),
                'spectrum': color.get('spectrum'),
                'percentage': color.get('percent')
            }
            colors_list.append(color_entry)
    
    return (
        pd.DataFrame(metadata_list),
        pd.DataFrame(media_list),
        pd.DataFrame(colors_list)
    )

Load: Insert into Database

def load_to_database(metadata_df, media_df, colors_df, connection):
    """
    Load dataframes into SQL database with batch inserts
    """
    def to_rows(df):
        # Convert pandas missing values (NaN) to None so they are stored as SQL NULL
        return df.astype(object).where(df.notna(), None).values.tolist()

    cursor = connection.cursor()
    
    # Insert metadata; on a repeated id, only the title is refreshed
    metadata_query = """
    INSERT INTO artifactmetadata 
    (id, title, culture, century, classification, department, dated, 
     medium, technique, period, primaryimageurl, verificationlevel, accesslevel)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE title=VALUES(title)
    """
    
    metadata_values = to_rows(metadata_df)
    if metadata_values:
        cursor.executemany(metadata_query, metadata_values)
    
    # Insert media
    media_query = """
    INSERT INTO artifactmedia (artifact_id, baseimageurl, format, height, width)
    VALUES (%s, %s, %s, %s, %s)
    """
    
    media_values = to_rows(media_df)
    if media_values:
        cursor.executemany(media_query, media_values)
    
    # Insert colors
    colors_query = """
    INSERT INTO artifactcolors (artifact_id, color, spectrum, percentage)
    VALUES (%s, %s, %s, %s)
    """
    
    colors_values = to_rows(colors_df)
    if colors_values:
        cursor.executemany(colors_query, colors_values)
    
    # Commit all three batches together
    connection.commit()
    cursor.close()
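
The media and color inserts above have no unique key, so loading the same artifacts twice appends duplicate child rows. One possible way to make reloads repeatable is to delete an artifact's existing child rows before inserting the fresh ones. The sketch below shows the idea with `sqlite3` as a stand-in and made-up sample values; `replace_colors` is a hypothetical helper, and with `mysql-connector-python` the placeholders would be `%s` instead of `?`.

```python
import sqlite3

def replace_colors(connection, color_rows):
    """Delete existing color rows for the affected artifacts, then insert the new ones."""
    artifact_ids = sorted({row[0] for row in color_rows})
    with connection:  # commits on success, rolls back if any statement fails
        connection.executemany(
            "DELETE FROM artifactcolors WHERE artifact_id = ?",
            [(artifact_id,) for artifact_id in artifact_ids],
        )
        connection.executemany(
            "INSERT INTO artifactcolors (artifact_id, color, spectrum, percentage) "
            "VALUES (?, ?, ?, ?)",
            color_rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifactcolors (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "artifact_id INTEGER, color TEXT, spectrum TEXT, percentage REAL)"
)

# Made-up sample rows: (artifact_id, color, spectrum, percentage)
sample_rows = [(1, "#c8c8c8", "#8c5fa8", 0.6), (1, "#323232", "#2eb45d", 0.4)]
replace_colors(conn, sample_rows)
replace_colors(conn, sample_rows)  # running the load again does not duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM artifactcolors").fetchone()[0]
print(row_count)
```

The delete and insert run in one transaction, so a failed reload leaves the previous rows in place.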

Streamlit Application

Main App Structure

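import pandas as pd
import plotly.express as px
import streamlit as st

# Assumes API_KEY, connection, fetch_artifacts, transform_artifact_data,
# and load_to_database from the sections above are defined or imported here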

st.set_page_config(page_title="Harvard Art Analytics", layout="wide")

# Sidebar for data collection
with st.sidebar:
    st.header("Data Collection")
    num_records = st.slider("Number of records", 10, 500, 100)
    
    if st.button("Fetch & Load Data"):
        with st.spinner("Fetching artifacts..."):
            artifacts = fetch_artifacts(API_KEY, num_records)
            metadata_df, media_df, colors_df = transform_artifact_data(artifacts)
            load_to_database(metadata_df, media_df, colors_df, connection)
            st.success(f"Loaded {len(metadata_df)} artifacts!")

# Main dashboard
st.title("🎨 Harvard Art Museums Analytics Dashboard")

# Analytics section
st.header("SQL Analytics")

# Sample queries
queries = {
    "Artifacts by Culture": """
        SELECT culture, COUNT(*) as count 
        FROM artifactmetadata 
        WHERE culture IS NOT NULL 
        GROUP BY culture 
        ORDER BY count DESC 
        LIMIT 10
    """,
    "Artifacts by Century": """
        SELECT century, COUNT(*) as count 
        FROM artifactmetadata 
        WHERE century IS NOT NULL 
        GROUP BY century 
        ORDER BY count DESC
    """,
    "Media Availability": """
        SELECT 
            CASE WHEN primaryimageurl IS NOT NULL THEN 'Has Image' ELSE 'No Image' END as status,
            COUNT(*) as count
        FROM artifactmetadata
        GROUP BY status
    """,
    "Top Colors Used": """
        SELECT color, COUNT(*) as count, AVG(percentage) as avg_percentage
        FROM artifactcolors
        GROUP BY color
        ORDER BY count DESC
        LIMIT 10
    """,
    "Artifacts by Department": """
        SELECT department, COUNT(*) as count
        FROM artifactmetadata
        WHERE department IS NOT NULL
        GROUP BY department
        ORDER BY count DESC
    """
}

selected_query = st.selectbox("Select Analysis", list(queries.keys()))

if st.button("Run Analysis"):
    cursor = connection.cursor(dictionary=True)
    cursor.execute(queries[selected_query])
    results = cursor.fetchall()
    cursor.close()
    
    df_results = pd.DataFrame(results)
    
    col1, col2 = st.columns([1, 1])
    
    with col1:
        st.dataframe(df_results, use_container_width=True)
    
    with col2:
        if len(df_results.columns) >= 2:
            fig = px.bar(
                df_results, 
                x=df_results.columns[0], 
                y=df_results.columns[1],
                title=selected_query
            )
            st.plotly_chart(fig, use_container_width=True)

Common Patterns

Running the Complete ETL Pipeline

def run_etl_pipeline(api_key, db_config, num_records=100):
    """
    Execute complete ETL pipeline
    """
    # Extract
    print("Extracting data from API...")
    artifacts = fetch_artifacts(api_key, num_records)
    
    # Transform
    print("Transforming data...")
    metadata_df, media_df, colors_df = transform_artifact_data(artifacts)
    
    # Load
    print("Loading to database...")
    connection = mysql.connector.connect(**db_config)
    load_to_database(metadata_df, media_df, colors_df, connection)
    connection.close()
    
    print(f"ETL pipeline complete: {len(metadata_df)} artifacts processed")
    return metadata_df, media_df, colors_df

Incremental Data Loading

def get_latest_artifact_id(connection):
    """
    Get the highest artifact ID in database
    """
    cursor = connection.cursor()
    cursor.execute("SELECT MAX(id) FROM artifactmetadata")
    result = cursor.fetchone()
    cursor.close()
    return result[0] if result[0] else 0

def incremental_load(api_key, db_config):
    """
    Load only new artifacts
    """
    connection = mysql.connector.connect(**db_config)
    last_id = get_latest_artifact_id(connection)
    
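    # Fetch artifacts with ID filter
    # NOTE: the `after` parameter is illustrative; confirm a supported filter
    # in the API documentation before relying on it
    url = f"{BASE_URL}?apikey={api_key}&after={last_id}"
    # Continue with ETL, then call connection.close()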
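
If the API cannot filter by ID on the server side, the same effect can be approximated on the client by dropping records whose IDs are already stored. A minimal sketch, assuming the stored IDs fit in memory; `select_new_records` is a hypothetical helper:

```python
def select_new_records(records, existing_ids):
    """Keep records whose 'id' is present and not already stored or seen."""
    seen = set(existing_ids)
    new_records = []
    for record in records:
        record_id = record.get("id")
        if record_id is None or record_id in seen:
            continue  # skip records without an ID and anything already loaded
        seen.add(record_id)  # also guards against duplicates within one fetch
        new_records.append(record)
    return new_records

# Made-up sample data for illustration
fetched = [{"id": 10}, {"id": 11}, {"id": 11}, {"title": "no id"}, {"id": 12}]
print(select_new_records(fetched, existing_ids=[10]))
```

The stored IDs could come from `SELECT id FROM artifactmetadata`. Unlike a `MAX(id)` cutoff, this approach does not assume that newer artifacts always have larger IDs.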

Key Commands

Run Streamlit App

streamlit run app.py

Database Setup

# Create database
mysql -u $DB_USER -p -e "CREATE DATABASE harvard_artifacts;"

# Run schema creation
mysql -u $DB_USER -p harvard_artifacts < schema.sql

Troubleshooting

API Rate Limiting

If you hit rate limits, increase the delay:

time.sleep(1)  # Increase from 0.5 to 1 second
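
A fixed delay slows every request. An alternative is to retry only the calls that fail and to wait longer after each failure. This is a generic sketch of exponential backoff, not something the project ships; `call_with_backoff` is a hypothetical helper, and the exceptions to retry are left to the caller.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=0.5,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(), retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * (2 ** attempt))

# Demonstration with a function that fails twice before succeeding
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate-limit response")
    return "ok"

recorded_delays = []
result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the pipeline, `func` could wrap a single page request, with `retry_on` narrowed to `requests.exceptions.RequestException`.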

Database Connection Issues

# Test connection
try:
    connection = mysql.connector.connect(**db_config)
    print("Connection successful!")
    connection.close()
except mysql.connector.Error as e:
    print(f"Error: {e}")

Missing Data Handling

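# Handle None values before inserting
# (use a text default only for VARCHAR/TEXT columns; leave numeric fields as None so they load as NULL)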
def safe_value(val, default=''):
    return val if val is not None else default

metadata = {
    'title': safe_value(artifact.get('title')),
    'culture': safe_value(artifact.get('culture'))
}

Memory Issues with Large Datasets

# Process in chunks
chunk_size = 100
for i in range(0, len(artifacts), chunk_size):
    chunk = artifacts[i:i+chunk_size]
    process_chunk(chunk)  # placeholder for your transform + load step
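
Slicing requires the whole list in memory first. When records arrive from a generator, such as one page at a time, a chunking helper built on `itertools.islice` keeps only one chunk in memory. A minimal sketch; `iter_chunks` is a hypothetical helper:

```python
from itertools import islice

def iter_chunks(items, chunk_size=100):
    """Yield lists of up to chunk_size items from any iterable."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk

for chunk in iter_chunks(range(5), chunk_size=2):
    print(chunk)
```

Each chunk can then go through the transform and load steps before the next one is read.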


How it compares

Use this skill for a complete museum API demo pipeline; use generic ETL skills when the data source is not Harvard Art Museums.

FAQ

What stack does harvard-art-museums-data-engineering-app use?

harvard-art-museums-data-engineering-app builds on the Harvard Art Museums API with ETL ingestion into SQL, analytics queries, and a Streamlit front end using Plotly charts. The ara.so skill targets end-to-end artifact collection and visualization workflows.

When should developers use this museum data skill?

harvard-art-museums-data-engineering-app fits requests to build ETL pipelines, SQL analytics, or Streamlit dashboards for Harvard Art Museums collection data. The skill lists eight trigger phrases covering ingestion, queries, and interactive charts.

Is Harvard Art Museums Data Engineering App safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Tags: Data Science & ML, pipelines, analytics
