
Vaex
Load and explore billion-row tabular datasets with lazy, out-of-core Vaex DataFrames instead of RAM-bound pandas workflows.
Overview
Vaex is an agent skill for the Build phase that teaches lazy, out-of-core Vaex DataFrame loading and querying for large tabular datasets.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill vaex
What is this skill?
- Lazy evaluation and out-of-core processing so data need not fit in RAM
- vaex.open() for HDF5, Arrow, Parquet, CSV, FITS, and wildcard multi-file loads
- Virtual columns with no extra memory overhead for derived fields
- Throughput of up to a billion rows per second via an optimized C++ backend
- Format-specific loaders and guidance to convert CSV to HDF5 for repeat use
- Lazy CSV loading without full RAM load since Vaex 4.14
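The CSV-to-HDF5 guidance in the list above can be sketched as a small caching helper. This is a minimal sketch rather than the skill's own code: `hdf5_cache_path` and `open_csv_cached` are hypothetical names, and it assumes `vaex.from_csv(..., convert=<path>)` writes the HDF5 file and returns the converted DataFrame, as the skill's reference describes.

```python
from pathlib import Path


def hdf5_cache_path(csv_path):
    """Return the sibling .hdf5 path used as a conversion cache (hypothetical convention)."""
    return Path(csv_path).with_suffix(".hdf5")


def open_csv_cached(csv_path, chunk_size=5_000_000):
    """Open a CSV through an HDF5 cache so repeat loads can memory-map the file."""
    import vaex  # deferred so the path helper works without vaex installed

    cache = hdf5_cache_path(csv_path)
    if cache.exists():
        # Later runs: memory-map the already converted file.
        return vaex.open(str(cache))
    # First run: convert in chunks, then return the HDF5-backed DataFrame.
    return vaex.from_csv(str(csv_path), convert=str(cache), chunk_size=chunk_size)
```

The helper does not detect a CSV that changed after conversion; deleting the cached `.hdf5` file forces a fresh conversion on the next call.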
Adoption & trust: 527 installs on skills.sh; 27.6k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to explore or process very large CSV, Parquet, or HDF5 tables without loading everything into RAM or waiting on slow pandas loads.
Who is it for?
Solo builders and indie teams building scientific, analytics, or ML backends where tabular data is huge and HDF5/Arrow are acceptable interchange formats.
Skip if: your datasets are small enough that in-memory pandas suffices, or you need a hosted warehouse connector skill rather than local, file-oriented Vaex APIs.
When should I use this skill?
Implementing or refactoring Python code that loads, filters, or aggregates very large tabular files with Vaex instead of pandas.
What do I get? / Deliverables
Your agent opens and manipulates billion-scale tables with vaex.open(), virtual columns, and format-appropriate loaders ready for pipeline or analysis code.
- Vaex DataFrame load patterns and format-specific open calls
- Virtual column and lazy-query snippets suitable for pipeline or notebook code
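As an illustration of the virtual-column and lazy-query deliverable, here is a minimal sketch. The helper names (`euclidean_norm_expression`, `add_norm_column`, `mean_norm_where_positive`) and the column names `x` and `y` are hypothetical; the Vaex calls used are `vaex.open()`, `DataFrame.add_virtual_column()`, boolean filtering, and `DataFrame.mean()`.

```python
def euclidean_norm_expression(columns):
    """Build a Vaex expression string such as 'sqrt(x**2 + y**2)'."""
    terms = " + ".join(f"{name}**2" for name in columns)
    return f"sqrt({terms})"


def add_norm_column(df, columns, name="r"):
    """Attach a virtual column: Vaex stores the expression, not a new array."""
    df.add_virtual_column(name, euclidean_norm_expression(columns))
    return df


def mean_norm_where_positive(path, columns=("x", "y"), name="r"):
    """Mean of the virtual column over rows where the first column is positive."""
    import vaex  # deferred so the expression helper works without vaex installed

    df = vaex.open(path)                 # returns quickly, no full load
    df = add_norm_column(df, columns, name)
    selected = df[df[columns[0]] > 0]    # filter is recorded, not computed yet
    return selected.mean(name)           # the pass over the data happens here
```

Only the final `mean` call triggers a pass over the data; the virtual column and the filter are stored as expressions until then.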
Journey fit
Vaex is used while implementing data-heavy backends, research pipelines, and analytics features where large tables must be queried and transformed. Backend and data layers are where file formats (HDF5, Arrow, Parquet), virtual columns, and lazy evaluation are wired into product or research code.
How it compares
Use for out-of-core columnar analytics instead of defaulting every large table to pandas in memory.
Common Questions / FAQ
Who is vaex for?
Developers and data-focused solo builders who ship Python analytics, research tooling, or backends that must handle HDF5, Arrow, Parquet, or CSV files larger than available RAM.
When should I use vaex?
During Build, when wiring data loaders, ETL, or exploratory analysis on files that can be memory-mapped (HDF5/Arrow) or loaded lazily (CSV, Vaex 4.14+), especially before committing to a pandas-only stack.
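The distinction between memory-mapped and lazily read files can be made explicit in loader code. The sketch below is an assumption-laden illustration: `loading_strategy` and `open_with_hint` are hypothetical helpers, and the suffix grouping is taken only from the formats named in the answer above.

```python
from pathlib import Path

# Assumed grouping, based only on the formats named in the answer above.
MEMORY_MAPPED_SUFFIXES = {".hdf5", ".arrow"}
LAZY_CSV_SUFFIXES = {".csv"}


def loading_strategy(path):
    """Classify a file path by how Vaex is expected to open it."""
    suffix = Path(path).suffix.lower()
    if suffix in MEMORY_MAPPED_SUFFIXES:
        return "memory-map"
    if suffix in LAZY_CSV_SUFFIXES:
        return "lazy-csv"
    return "other"


def open_with_hint(path):
    """Open a file with vaex.open() and print a hint for CSV inputs."""
    import vaex  # deferred so the classifier works without vaex installed

    if loading_strategy(path) == "lazy-csv":
        print("CSV opens lazily (Vaex 4.14+); consider converting to HDF5 for repeated use.")
    return vaex.open(str(path))
```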
Is vaex safe to install?
Treat it as documentation and code patterns for the Vaex library; review the Security Audits panel on this skill page and your dependency supply chain before running generated code on production data.
SKILL.md
# Core DataFrames and Data Loading

This reference covers Vaex DataFrame basics, loading data from various sources, and understanding the DataFrame structure.

## DataFrame Fundamentals

A Vaex DataFrame is the central data structure for working with large tabular datasets. Unlike pandas, Vaex DataFrames:

- Use **lazy evaluation** - operations are not executed until needed
- Work **out-of-core** - data doesn't need to fit in RAM
- Support **virtual columns** - computed columns with no memory overhead
- Enable **billion-row-per-second** processing through an optimized C++ backend

## Opening Existing Files

### Primary Method: `vaex.open()`

The most common way to load data:

```python
import vaex

# Works with multiple formats
df = vaex.open('data.hdf5')     # HDF5 (recommended)
df = vaex.open('data.arrow')    # Apache Arrow (recommended)
df = vaex.open('data.parquet')  # Parquet
df = vaex.open('data.csv')      # CSV (lazy since 4.14; convert to HDF5 for repeated use)
df = vaex.open('data.fits')     # FITS (astronomy)

# Can open multiple files as one DataFrame
df = vaex.open('data_*.hdf5')   # Wildcards supported
```

**Key characteristics:**

- **Instant for HDF5/Arrow** - Memory-maps files, no loading time
- **Lazy CSV (4.14+)** - `vaex.open('file.csv')` reads CSV lazily without loading all data into RAM
- **Returns immediately** - Lazy evaluation means no computation until needed

### Format-Specific Loaders

```python
import vaex

# Lazy CSV (preferred for exploration since vaex 4.14)
df = vaex.open('large_file.csv')

# CSV with conversion to HDF5 (preferred for repeated use)
df = vaex.from_csv(
    'large_file.csv',
    convert='large_file.hdf5',  # or convert=True
    chunk_size=5_000_000,       # Process in chunks during conversion
    copy_index=False            # Don't copy pandas index if present
)

# To load entire CSV into memory instead of lazy open:
# df = vaex.from_csv('large_file.csv')

# Apache Arrow
df = vaex.open('data.arrow')  # Native support, very fast

# HDF5 (optimal format)
df = vaex.open('data.hdf5')   # Instant loading via memory mapping
```

## Creating DataFrames from Other Sources

### From Pandas

```python
import pandas as pd
import vaex

# Convert pandas DataFrame
pdf = pd.read_csv('data.csv')
df = vaex.from_pandas(pdf, copy_index=False)

# Warning: This loads entire pandas DataFrame into memory
# For large data, prefer vaex.from_csv() directly
```

### From NumPy Arrays

```python
import numpy as np
import vaex

# Single array
x = np.random.rand(1_000_000)
df = vaex.from_arrays(x=x)

# Multiple arrays
x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)
df = vaex.from_arrays(x=x, y=y)
```

### From Dictionaries

```python
import vaex

# Dictionary of lists/arrays
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}
df = vaex.from_dict(data)
```

### From Arrow Tables

```python
import pyarrow as pa
import vaex

# From Arrow Table
arrow_table = pa.table({
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})
df = vaex.from_arrow_table(arrow_table)
```

## Example Datasets

Vaex provides built-in example datasets for testing:

```python
import vaex

# Helmi & de Zeeuw simulation snapshot (~330,000 rows), downloaded on first use
df = vaex.example()

# Smaller datasets
df = vaex.datasets.titanic()
df = vaex.datasets.iris()
```

## Inspecting DataFrames

### Basic Information

```python
# Display first and last rows
print(df)

# Shape (rows, columns)
print(df.shape)  # Returns (row_count, column_count)
print(len(df))   # Row count

# Column names
print(df.column_names)  # List of column names
print(df.columns)       # Mapping of column name to underlying column data

# Data types
print(df.dtypes)

# Memory usage (for materialized columns)
df.byte_size()
```

### Statistical Summary

```python
# Quick statistics for all numeric columns
df.describe()

# Single column statistics
df.x.mean()
df.x.std()
df.x.min()
df.x.max()
df.x.sum()
df.x.count()

# Quantiles (approximate, computed on a binned grid)
df.median_approx('x')                    # Median
df.percentile_approx('x', [25, 50, 75])  # Multiple percentiles
```

### Viewing Data

```python