Ray Data

Name: Ray Data
Author: orchestra-research

orchestra-research/ai-research-skills

443 installs
11.2k repo stars
Updated June 16, 2026

ray-data is an orchestra-research agent skill that shows developers how to wire Ray Data parquet loads into Ray Train and PyTorch or TensorFlow loops for distributed ML without custom shard glue.

About

ray-data is an orchestra-research ai-research-skills guide for integrating Ray Data with Ray Train and ML frameworks. It demonstrates `ray.data.read_parquet` for train and validation paths, `TorchTrainer` with `ScalingConfig`, and `ray.train.get_dataset_shard` inside `train_func` to iterate `iter_batches` per epoch. Developers adopt ray-data when S3 or parquet-backed datasets must shard across workers without hand-written partition logic between Ray and PyTorch or TensorFlow. The skill focuses on backend data plumbing—launching trainers, fetching dataset shards, and batching—rather than model architecture design. Examples use Ray Train torch integrations with explicit batch_size configuration in training loops.

TorchTrainer example with named train/val dataset shards and multi-worker GPU ScalingConfig
to_torch, iter_torch_batches, and TensorFlow to_tf batch iteration paths
S3 parquet read patterns for train and validation splits
Ray Train get_dataset_shard usage inside train_func

Ray Data by the numbers

443 all-time installs (skills.sh)
+32 installs in the week ending Jul 26, 2026 (Skillselion tracking)
Ranked #460 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill ray-data


Installs	443
repo stars	★ 11.2k
Security audit	1 / 3 scanners passed
Last updated	June 16, 2026
Repository	orchestra-research/ai-research-skills ↗

How do you connect Ray Data to Ray Train?

Wire Ray Data parquet loads into Ray Train and PyTorch or TensorFlow loops for distributed ML without bespoke shard glue.

Who is it for?

ML engineers building distributed training on Ray who ingest parquet from S3 and want framework-agnostic shard iteration without custom glue code.

Skip if: Single-machine CSV prototyping or teams not using Ray Train, Ray Data, or parquet-backed pipelines.

When should I use this skill?

A developer asks to integrate Ray Data parquet loads with Ray Train, PyTorch, or TensorFlow distributed training loops.

What you get

Ray Train jobs with parquet-backed dataset shards, ScalingConfig launches, and per-worker iter_batches training loops.

Ray Train launch configs
dataset shard iterators
parquet-backed training loops

Files

SKILL.md (Markdown)

Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

When to use Ray Data

Use Ray Data when:

Processing large datasets (>100GB) for ML training
Distributing data preprocessing across a cluster
Building batch inference pipelines
Loading multi-modal data (images, audio, video)
Scaling data processing from laptop to cluster

Key features:

Streaming execution: Process data larger than memory
GPU support: Accelerate transforms with GPUs
Framework integration: PyTorch, TensorFlow, HuggingFace
Multi-modal: Images, Parquet, CSV, JSON, audio, video

Use alternatives instead:

Pandas: Small data (<1GB) on single machine
Dask: Tabular data, SQL-like operations
Spark: Enterprise ETL, SQL queries

Quick start

Installation

pip install -U 'ray[data]'

Load and transform data

import ray

# Read Parquet files
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

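# Transform data (lazy execution). batch_format="pandas" makes each batch a
# DataFrame, so the .str accessor is available on the "text" column.
ds = ds.map_batches(
    lambda df: df["text"].str.lower().to_frame("processed"),
    batch_format="pandas",
)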

# Consume data
for batch in ds.iter_batches(batch_size=100):
    print(batch)

Integration with Ray Train

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create dataset
train_ds = ray.data.read_parquet("s3://bucket/train/*.parquet")

def train_func(config):
    # Access dataset in training
    train_ds = ray.train.get_dataset_shard("train")

    for epoch in range(10):
        for batch in train_ds.iter_batches(batch_size=32):
            # Train on batch
            pass

# Train with Ray
trainer = TorchTrainer(
    train_func,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
trainer.fit()

Reading data

From cloud storage

import ray

# Parquet (recommended for ML)
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# CSV
ds = ray.data.read_csv("s3://bucket/data/*.csv")

# JSON
ds = ray.data.read_json("gs://bucket/data/*.json")

# Images
ds = ray.data.read_images("s3://bucket/images/")

From Python objects

# From list
ds = ray.data.from_items([{"id": i, "value": i * 2} for i in range(1000)])

# From range
ds = ray.data.range(1000000)  # Synthetic data

# From pandas
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
ds = ray.data.from_pandas(df)

Transformations

Map batches (vectorized)

# Batch transformation (fast)
def process_batch(batch):
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)
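
A batch function like this can be checked locally before it is handed to a cluster. The sketch below assumes the default batch format, a dict of NumPy arrays, and builds one such batch by hand. The `run_on_ray` wrapper is a hypothetical helper added for illustration; it is defined but not called, so the local check runs without Ray installed.

```python
import numpy as np


def process_batch(batch):
    """Add a 'doubled' column; `batch` is a dict of NumPy arrays."""
    batch["doubled"] = batch["value"] * 2
    return batch


# Local check with a hand-built batch, no Ray cluster required.
sample = {"value": np.array([1, 2, 3])}
out = process_batch(sample)
print(out["doubled"])


def run_on_ray():
    """Hypothetical wrapper: the same function applied through Ray Data.

    Not called in this sketch; requires `ray[data]` to be installed.
    """
    import ray

    ds = ray.data.from_items([{"value": i} for i in range(1000)])
    return ds.map_batches(process_batch, batch_size=1000).take(3)
```

Keeping the batch logic in a plain function makes it easy to unit-test with small arrays before scaling up.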

Row transformations

# Row-by-row (slower)
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)

Filter

# Filter rows
ds = ds.filter(lambda row: row["value"] > 100)

Group by and aggregate

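# Group by column and count rows per group
counts = ds.groupby("category").count()

# Custom aggregation: each group arrives as a dict of NumPy arrays, and the
# function returns one row per group as a dict of 1-element arrays
import numpy as np

sums = ds.groupby("category").map_groups(
    lambda group: {
        "category": group["category"][:1],
        "sum": np.array([group["value"].sum()]),
    }
)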

GPU-accelerated transforms

# Use GPU for preprocessing
def preprocess_images_gpu(batch):
    import torch
    images = torch.tensor(batch["image"]).cuda()
    # GPU preprocessing
    processed = images * 255
    return {"processed": processed.cpu().numpy()}

ds = ds.map_batches(
    preprocess_images_gpu,
    batch_size=64,
    num_gpus=1  # Request GPU
)

Writing data

# Write to Parquet
ds.write_parquet("s3://bucket/output/")

# Write to CSV
ds.write_csv("output/")

# Write to JSON
ds.write_json("output/")

Performance optimization

Repartition

# Control parallelism
ds = ds.repartition(100)  # 100 blocks for 100-core cluster

Batch size tuning

# Larger batches = faster vectorized ops
ds.map_batches(process_fn, batch_size=10000)  # vs batch_size=100

Streaming execution

# Process data larger than memory
ds = ray.data.read_parquet("s3://huge-dataset/")
for batch in ds.iter_batches(batch_size=1000):
    process(batch)  # Streamed, not loaded to memory
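
The idea behind streaming iteration can be illustrated without Ray: batches are pulled from a lazy source one at a time, so only the current batch is held in memory. This is a conceptual sketch only. `iter_fixed_batches` is a hypothetical local helper, not a Ray API, and it does not reflect how Ray Data schedules or prefetches blocks.

```python
from itertools import islice


def iter_fixed_batches(rows, batch_size):
    """Yield lists of up to `batch_size` rows from any iterable, lazily."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a dataset that is too large to materialize.
rows = ({"id": i} for i in range(10))
sizes = [len(batch) for batch in iter_fixed_batches(rows, batch_size=4)]
print(sizes)
```

The last batch is smaller than `batch_size` when the row count is not an exact multiple.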

Common patterns

Batch inference

import ray

# Load model
def load_model():
    # Load once per worker
    return MyModel()

# Inference function
class BatchInference:
    def __init__(self):
        self.model = load_model()

    def __call__(self, batch):
        predictions = self.model(batch["input"])
        return {"prediction": predictions}

# Run distributed inference
ds = ray.data.read_parquet("s3://data/")
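# Class-based functions run in an actor pool; recent Ray releases expect
# `concurrency` to be set for them (2 is an arbitrary illustrative value)
predictions = ds.map_batches(BatchInference, batch_size=32, num_gpus=1, concurrency=2)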
predictions.write_parquet("s3://output/")
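
The reason for using a class here is that `__init__` runs once per worker while `__call__` runs once per batch, so the model is loaded once and reused. The sketch below shows that behavior locally with a stand-in model. `DummyModel` and the `LOAD_COUNT` counter are hypothetical names introduced for illustration; a real `MyModel` would replace them.

```python
import numpy as np

LOAD_COUNT = 0  # tracks how many times the model is constructed


class DummyModel:
    """Hypothetical stand-in for a real model: predicts input * 2."""

    def __call__(self, x):
        return x * 2


def load_model():
    global LOAD_COUNT
    LOAD_COUNT += 1
    return DummyModel()


class BatchInference:
    def __init__(self):
        # Runs once when the worker is created.
        self.model = load_model()

    def __call__(self, batch):
        # Runs once per batch, reusing the already-loaded model.
        return {"prediction": self.model(batch["input"])}


worker = BatchInference()
outputs = [
    worker({"input": np.array([i, i + 1])})["prediction"].tolist()
    for i in range(3)
]
print(outputs, LOAD_COUNT)
```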

Data preprocessing pipeline

# Multi-step pipeline. write_parquet triggers execution and returns None,
# so the chain is not assigned to a variable.
(
    ray.data.read_parquet("s3://raw/")
    .map_batches(clean_data)
    .map_batches(tokenize)
    .map_batches(augment)
    .write_parquet("s3://processed/")
)
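
The pipeline above leaves `clean_data`, `tokenize`, and `augment` undefined. As a rough sketch of what such steps might look like, the functions below are hypothetical implementations that take and return a dict of NumPy arrays, which is the shape each `map_batches` stage passes along by default. They are chained locally on one small batch; `augment` is omitted.

```python
import numpy as np


def clean_data(batch):
    """Hypothetical step: drop rows whose text is empty."""
    keep = batch["text"] != ""
    return {"text": batch["text"][keep]}


def tokenize(batch):
    """Hypothetical step: whitespace tokenization, keeping only the count."""
    counts = np.array([len(text.split()) for text in batch["text"]])
    return {**batch, "num_tokens": counts}


batch = {"text": np.array(["hello world", "", "ray data is lazy"])}
for step in (clean_data, tokenize):
    batch = step(batch)

print(batch["text"], batch["num_tokens"])
```

Each step depends only on the columns produced by the step before it, which is what lets the stages be chained.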

Integration with ML frameworks

PyTorch

# Convert to a PyTorch IterableDataset
torch_ds = ds.to_torch(label_column="label", batch_size=32)

for features, labels in torch_ds:
    # to_torch yields (features, label) tensor pairs rather than a dict;
    # use ds.iter_torch_batches(batch_size=32) for dict-of-tensor batches
    pass

TensorFlow

# Convert to TensorFlow (the keyword is label_columns, plural)
tf_ds = ds.to_tf(feature_columns=["image"], label_columns="label", batch_size=32)

for features, labels in tf_ds:
    # Train model
    pass

Supported data formats

Format	Read	Write	Use Case
Parquet	✅	✅	ML data (recommended)
CSV	✅	✅	Tabular data
JSON	✅	✅	Semi-structured
Images	✅	✅	Computer vision (write via write_images)
NumPy	✅	✅	Arrays
Pandas	✅	❌	DataFrames

Performance benchmarks

Scaling (processing 100GB data):

1 node (16 cores): ~30 minutes
4 nodes (64 cores): ~8 minutes
16 nodes (256 cores): ~2 minutes

GPU acceleration (image preprocessing):

CPU only: 1,000 images/sec
1 GPU: 5,000 images/sec
4 GPUs: 18,000 images/sec

Use cases

Production deployments:

Pinterest: Last-mile data processing for model training
ByteDance: Scaling offline inference with multi-modal LLMs
Spotify: ML platform for batch inference

References

[Transformations Guide](references/transformations.md) - Map, filter, groupby operations
[Integration Guide](references/integration.md) - Ray Train, PyTorch, TensorFlow

Resources

Docs: https://docs.ray.io/en/latest/data/data.html
GitHub: https://github.com/ray-project/ray ⭐ 36,000+
Version: Ray 2.40.0+
Examples: https://docs.ray.io/en/latest/data/examples/overview.html

Ray Data Integration Guide

Integration with Ray Train and ML frameworks.

Ray Train integration

Basic training with datasets

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create datasets
train_ds = ray.data.read_parquet("s3://data/train/")
val_ds = ray.data.read_parquet("s3://data/val/")

def train_func(config):
    # Get dataset shards
    train_ds = ray.train.get_dataset_shard("train")
    val_ds = ray.train.get_dataset_shard("val")

    for epoch in range(config["epochs"]):
        # Iterate over batches
        for batch in train_ds.iter_batches(batch_size=32):
            # Train on batch
            pass

# Launch training
trainer = TorchTrainer(
    train_func,
    train_loop_config={"epochs": 10},
    datasets={"train": train_ds, "val": val_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)

result = trainer.fit()
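
Ray Train performs the splitting itself when `get_dataset_shard` is called. The toy function below only illustrates the property a training loop relies on: each worker sees a disjoint portion of the rows, and together the portions cover the dataset. `split_rows` is a hypothetical helper, not a Ray API, and Ray's own splitting is not implemented as a static slice like this.

```python
def split_rows(rows, num_workers):
    """Split a list into `num_workers` contiguous, near-equal shards."""
    base, extra = divmod(len(rows), num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        # The first `extra` workers receive one additional row.
        size = base + (1 if rank < extra else 0)
        shards.append(rows[start:start + size])
        start += size
    return shards


rows = list(range(10))
shards = split_rows(rows, num_workers=4)
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

With `num_workers=4` as in the `ScalingConfig` above, each worker iterates only over its own shard inside `train_func`.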

PyTorch integration

Convert to PyTorch Dataset

# Option 1: to_torch, which yields (features, label) tensor pairs
torch_ds = ds.to_torch(
    label_column="label",
    batch_size=32,
    drop_last=True
)

for features, labels in torch_ds:
    # Train model
    pass

# Option 2: iter_torch_batches, which yields a dict of tensors keyed by column
for batch in ds.iter_torch_batches(batch_size=32):
    inputs = batch["features"]
    labels = batch["label"]

TensorFlow integration

tf_ds = ds.to_tf(
    feature_columns=["image", "text"],
    label_columns="label",
    batch_size=32
)

for features, labels in tf_ds:
    # Train TensorFlow model
    pass

Best practices

1. Shard datasets in Ray Train - Automatic with get_dataset_shard()
2. Use streaming - Don't load entire dataset to memory
3. Preprocess in Ray Data - Distribute preprocessing across cluster
4. Cache preprocessed data - Write to Parquet, read in training

Ray Data Transformations

Complete guide to data transformations in Ray Data.

Core operations

Map batches (vectorized)

# Recommended for performance
def process_batch(batch):
    # batch is dict of numpy arrays or pandas Series
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)

Performance: 10-100× faster than row-by-row
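
The speedup comes from replacing one Python call per row with one NumPy operation per column. The sketch below makes no timing claim; it only shows that the two styles compute the same values, assuming a numeric "value" column. Both function names follow the examples in this guide.

```python
import numpy as np


def process_row(row):
    # Row-wise: invoked once per record, as Dataset.map would do.
    return {**row, "squared": row["value"] ** 2}


def process_batch(batch):
    # Batch-wise: a single NumPy operation over the whole column.
    return {**batch, "squared": batch["value"] ** 2}


values = np.arange(5)
row_result = [process_row({"value": int(v)})["squared"] for v in values]
batch_result = process_batch({"value": values})["squared"].tolist()
print(row_result, batch_result)
```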

Map (row-by-row)

# Use only when vectorization not possible
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)

Filter

# Remove rows
ds = ds.filter(lambda row: row["score"] > 0.5)

Flat map

# One row → multiple rows
def expand_row(row):
    return [{"value": row["value"] + i} for i in range(3)]

ds = ds.flat_map(expand_row)
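
To see what the expansion produces, the same `expand_row` can be run locally. The `flat_map` function below is a hypothetical stand-in that mimics the flattening on a plain Python list; it is not the Ray implementation.

```python
def expand_row(row):
    # One input row becomes three output rows.
    return [{"value": row["value"] + i} for i in range(3)]


def flat_map(rows, fn):
    """Local stand-in: apply `fn` to each row and flatten the results."""
    return [out for row in rows for out in fn(row)]


expanded = flat_map([{"value": 10}, {"value": 20}], expand_row)
print(expanded)
```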

GPU-accelerated transforms

def gpu_transform(batch):
    import torch
    data = torch.tensor(batch["data"]).cuda()
    # GPU processing
    result = data * 2
    return {"processed": result.cpu().numpy()}

ds = ds.map_batches(gpu_transform, num_gpus=1, batch_size=64)

Groupby operations

# Group by column
grouped = ds.groupby("category")

# Aggregate
result = grouped.count()

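# Custom aggregation: return one row per group as a dict of 1-element arrays
import numpy as np

result = grouped.map_groups(lambda group: {
    "category": group["category"][:1],
    "sum": np.array([group["value"].sum()]),
    "mean": np.array([group["value"].mean()]),
})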

Best practices

1. Use map_batches over map - 10-100× faster
2. Tune batch_size - Larger = faster (balance with memory)
3. Use GPUs for heavy compute - Image/audio preprocessing
4. Stream large datasets - Use iter_batches for >memory data


How it compares

Choose ray-data for Ray Train parquet sharding; choose huggingface-accelerate when the problem is Accelerate plugin configuration instead of Ray data loaders.

FAQ

How does ray-data connect datasets to Ray Train?

ray-data shows creating datasets with ray.data.read_parquet, launching TorchTrainer with ScalingConfig, and using ray.train.get_dataset_shard inside train_func to iter_batches during each epoch.

Which storage format does ray-data emphasize?

ray-data centers on parquet datasets—README examples load train and validation splits from parquet paths such as s3://data/train/ and s3://data/val/ before sharding to workers.

Is Ray Data safe to install?

skills.sh reports 1 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
