
Harvard Art Museums Data Engineering App
Bootstrap an end-to-end Harvard Art Museums ETL into SQL with a Streamlit analytics dashboard and Plotly charts.
Install
npx skills add https://github.com/aradotso/data-skills --skill harvard-art-museums-data-engineering-appWhat is this skill?
- End-to-end flow: Harvard Art Museums API → ETL → SQL → Streamlit analytics → Plotly visualizations
- Relational warehouse pattern with MySQL/TiDB Cloud for artifact collections
- Streamlit UI for interactive exploration of museum metadata
- Trigger phrases cover ETL setup, SQL querying, and collection pattern analysis
- Clone-and-install path with requirements.txt from the reference GitHub project
Adoption & trust: 1 installs on skills.sh; 1 GitHub stars; 1/3 security scanners passed (skills.sh audits); trending (+100% hot-view momentum).
Recommended Skills
Journey fit
Shelf is Build because the skill delivers a full data-engineering application—extract, transform, load, and visualize—not discovery or launch distribution. Backend fits API extraction, relational modeling, MySQL/TiDB storage, and pipeline logic that feeds the dashboard.
Common Questions / FAQ
Is Harvard Art Museums Data Engineering App safe to install?
skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
SKILL.md
READMESKILL.md - Harvard Art Museums Data Engineering App
# Harvard Art Museums Data Engineering App > Skill by [ara.so](https://ara.so) — Data Skills collection. ## Overview This project is an end-to-end data engineering and analytics application that demonstrates real-world ETL pipelines using the Harvard Art Museums API. It extracts artifact data, transforms it into relational database tables, stores it in MySQL/TiDB Cloud, and provides interactive analytics through a Streamlit dashboard with Plotly visualizations. **Architecture Flow:** API → ETL → SQL → Analytics → Visualization ## Installation ```bash # Clone the repository git clone https://github.com/Manali0711/Harvard-Artifacts-Collection-Data-Engineering-Analytics-App.git cd Harvard-Artifacts-Collection-Data-Engineering-Analytics-App # Install dependencies pip install -r requirements.txt ``` ### Required Dependencies ```txt streamlit pandas requests mysql-connector-python plotly python-dotenv ``` ## Configuration ### Environment Variables Create a `.env` file in the project root: ```bash # Harvard Art Museums API HARVARD_API_KEY=your_api_key_here # Database Configuration DB_HOST=your_database_host DB_PORT=3306 DB_USER=your_username DB_PASSWORD=your_password DB_NAME=harvard_artifacts ``` ### Database Schema The application uses three main tables with foreign key relationships: ```sql -- Artifact Metadata Table CREATE TABLE artifactmetadata ( id INT PRIMARY KEY, title VARCHAR(500), culture VARCHAR(200), period VARCHAR(200), century VARCHAR(100), classification VARCHAR(200), department VARCHAR(200), dated VARCHAR(200), technique VARCHAR(500), medium VARCHAR(500), dimensions VARCHAR(500), credit_line TEXT, accession_number VARCHAR(100), verificationlevel INT, totalpageviews INT, totaluniquepageviews INT ); -- Artifact Media Table CREATE TABLE artifactmedia ( media_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, image_url TEXT, base_image_url TEXT, width INT, height INT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); -- Artifact Colors Table CREATE TABLE artifactcolors ( color_id INT AUTO_INCREMENT PRIMARY KEY, artifact_id INT, color VARCHAR(50), spectrum VARCHAR(50), hue VARCHAR(50), percent FLOAT, FOREIGN KEY (artifact_id) REFERENCES artifactmetadata(id) ); ``` ## Running the Application ```bash # Start the Streamlit app streamlit run app.py ``` The application will be available at `http://localhost:8501` ## Core Components ### 1. API Integration ```python import requests import os from dotenv import load_dotenv load_dotenv() def fetch_artifacts(page=1, size=100): """Fetch artifacts from Harvard Art Museums API""" api_key = os.getenv('HARVARD_API_KEY') url = f"https://api.harvardartmuseums.org/object" params = { 'apikey': api_key, 'page': page, 'size': size, 'hasimage': 1 # Only artifacts with images } response = requests.get(url, params=params) if response.status_code == 200: data = response.json() return data['records'], data['info'] else: raise Exception(f"API Error: {response.status_code}") # Fetch multiple pages with rate limiting import time def fetch_all_artifacts(max_pages=10): """Fetch artifacts with pagination""" all_artifacts = [] for page