Evaluating Cosmos Policy

Name: Evaluating Cosmos Policy
Author: orchestra-research

orchestra-research/ai-research-skills

338 installs
11.2k repo stars
Updated June 16, 2026

evaluating-cosmos-policy is a Claude Code skill that runs NVIDIA Cosmos Policy LIBERO robot-policy evaluations on GPU machines or Slurm jobs for developers who need reproducible MuJoCo benchmark scores.

About

evaluating-cosmos-policy is a research evaluation skill from orchestra-research/ai-research-skills that documents command matrices for running NVIDIA Cosmos Policy LIBERO benchmarks through the official cosmos_policy.experiments.robot.libero.run_libero_eval module. The skill covers interactive GPU shells and Slurm batch jobs, including headless MuJoCo rendering via EGL environment variables such as MUJOCO_GL=egl and PYOPENGL_PLATFORM=egl. Example Slurm allocations request one GPU, 64G memory, and eight CPUs for hour-long evaluation windows. Developers reach for evaluating-cosmos-policy when they need smoke evals on a single suite or full LIBERO trials without hand-rolling CUDA, MuJoCo, and uv dependency wiring each run.

Command matrix for local GPU, interactive shells, and batch (Slurm) LIBERO runs
Official entrypoint: cosmos_policy.experiments.robot.libero.run_libero_eval
Headless EGL/MuJoCo environment variables for GPU rendering
Smoke eval recipe (e.g., limited trials per task suite)
Configurable checkpoints, wrist image, proprioception, and action chunking flags

Evaluating Cosmos Policy by the numbers

338 all-time installs (skills.sh)
+35 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #674 of 2,184 Testing & QA skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-cosmos-policy

How do you run Cosmos Policy LIBERO evals on Slurm?

Run NVIDIA Cosmos Policy LIBERO robot-policy evaluations on a GPU machine or Slurm job with headless MuJoCo rendering and the official run_libero_eval module.

Who is it for?

ML engineers with GPU or Slurm access who must benchmark NVIDIA Cosmos Policy robot manipulation policies on LIBERO suites.

Skip if: you have no CUDA GPU, MuJoCo, or cosmos-policy checkout and only need general pytest or web-app testing.

When should I use this skill?

Use it when a developer asks to evaluate, benchmark, or smoke-test a Cosmos Policy model on LIBERO with Slurm or local GPU commands.

What you get

Executed LIBERO evaluation runs, Slurm GPU job commands, and logged trial scores from run_libero_eval.

LIBERO evaluation run logs
Slurm GPU job command matrix

By the numbers

Documents Slurm jobs allocating 1 GPU, 64G memory, and 8 CPUs
Supports smoke eval with 1 trial on a single LIBERO suite

Files

SKILL.md (Markdown)

Cosmos Policy Evaluation

Evaluation workflows for NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments from the public cosmos-policy repository. Covers blank-machine setup, headless GPU evaluation, and inference profiling.

Quick start

Run a minimal LIBERO evaluation using the official public eval module:

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Core concepts

What Cosmos Policy is: NVIDIA Cosmos Policy is a robot policy built on the Cosmos-Predict2 2B video model, the backbone named in the eval configs and checkpoints used throughout this skill. It predicts chunks of robot actions conditioned on language instructions, camera images, and proprioception.

Key architecture choices, as reflected in the eval flags:

Component	Design
Backbone	Cosmos-Predict2 2B (config cosmos_predict2_2b_480p_libero__inference_only)
Language conditioning	Precomputed T5 text embeddings (--t5_text_embeddings_path)
Action prediction	Iterative denoising over action chunks (--num_denoising_steps_action, --chunk_size)

Public command surface: The supported evaluation entrypoints are cosmos_policy.experiments.robot.libero.run_libero_eval and cosmos_policy.experiments.robot.robocasa.run_robocasa_eval. Keep reproduction notes anchored to these public modules and their documented flags.

Compute requirements

Task	GPU	VRAM	Typical wall time
LIBERO smoke eval (1 trial)	1x A40/A100	~16 GB	5-10 min
LIBERO full eval (50 trials)	1x A40/A100	~16 GB	2-4 hours
RoboCasa single-task (2 trials)	1x A40/A100	~18 GB	10-15 min
RoboCasa all-tasks	1x A40/A100	~18 GB	4-8 hours

When to use vs alternatives

Use this skill when:

Evaluating NVIDIA Cosmos Policy on LIBERO or RoboCasa benchmarks
Profiling inference latency and throughput for Cosmos Policy
Setting up headless EGL rendering for robot simulation on GPU clusters

Use alternatives when:

Training or fine-tuning Cosmos Policy from scratch (use official Cosmos training docs)
Working with OpenVLA-based policies (use fine-tuning-openvla-oft)
Working with Physical Intelligence pi0 models (use fine-tuning-serving-openpi)
Running real-robot evaluation rather than simulation

---

Workflow 1: LIBERO evaluation

Copy this checklist and track progress:

LIBERO Eval Progress:
- [ ] Step 1: Install environment and dependencies
- [ ] Step 2: Configure headless EGL rendering
- [ ] Step 3: Run smoke evaluation
- [ ] Step 4: Validate outputs and parse results
- [ ] Step 5: Run full benchmark if smoke passes

Step 1: Install environment

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md to build and enter the supported Docker container.
# Then, inside the container:
uv sync --extra cu128 --group libero --python 3.10

Step 2: Configure headless rendering

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
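
When the evaluation is driven from a Python launcher instead of a shell, the same four variables can be set before MuJoCo or PyOpenGL is imported. The following is a minimal sketch for a single GPU index; the helper name configure_headless_egl is hypothetical and is not part of the cosmos-policy repository.

```python
import os


def configure_headless_egl(gpu_index: int = 0) -> dict:
    """Set the four rendering variables from Step 2 for one GPU.

    Hypothetical helper. Call it before importing mujoco or OpenGL,
    since rendering backends are typically selected at import time.
    """
    settings = {
        "CUDA_VISIBLE_DEVICES": str(gpu_index),
        "MUJOCO_EGL_DEVICE_ID": str(gpu_index),
        "MUJOCO_GL": "egl",
        "PYOPENGL_PLATFORM": "egl",
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    for name, value in configure_headless_egl(0).items():
        print(f"{name}={value}")
```

Child processes started afterwards (for example through subprocess) inherit these values, so the launcher and the eval module see the same device index.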

Step 3: Run smoke evaluation

uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 4: Validate and parse results

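import glob
import json
import os

# Find the most recently modified result file in the official log directory.
log_files = glob.glob("cosmos_policy/experiments/robot/libero/logs/**/*.json", recursive=True)
if not log_files:
    raise FileNotFoundError("No JSON results found under the LIBERO log directory")

# Sort by modification time; lexicographic order does not guarantee the latest run.
latest_log = max(log_files, key=os.path.getmtime)
with open(latest_log) as f:
    results = json.load(f)

print(latest_log)
print(results)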

Step 5: Scale up

Run across all four LIBERO task suites with 50 trials per task:

for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name "$suite" \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note "suite_${suite}" \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
done
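
The same loop can be generated from Python, which keeps the long flag block in one place and makes the per-suite differences explicit. This is a sketch, not an upstream utility: build_libero_command and BASE_FLAGS are hypothetical names, the flag values are copied from the command above, and it assumes the eval module does not care about flag order.

```python
import shlex

# Flags shared by every suite, copied from the Step 5 command.
BASE_FLAGS = {
    "config": "cosmos_predict2_2b_480p_libero__inference_only",
    "ckpt_path": "nvidia/Cosmos-Policy-LIBERO-Predict2-2B",
    "config_file": "cosmos_policy/config/config.py",
    "use_wrist_image": "True",
    "use_proprio": "True",
    "normalize_proprio": "True",
    "unnormalize_actions": "True",
    "dataset_stats_path": "nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json",
    "t5_text_embeddings_path": "nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl",
    "trained_with_image_aug": "True",
    "chunk_size": "16",
    "num_open_loop_steps": "16",
    "local_log_dir": "cosmos_policy/experiments/robot/libero/logs/",
    "seed": "195",
    "randomize_seed": "False",
    "deterministic": "True",
    "ar_future_prediction": "False",
    "ar_value_prediction": "False",
    "use_jpeg_compression": "True",
    "flip_images": "True",
    "num_denoising_steps_action": "5",
    "num_denoising_steps_future_state": "1",
    "num_denoising_steps_value": "1",
    "data_collection": "False",
}

UV_PREFIX = [
    "uv", "run", "--extra", "cu128", "--group", "libero", "--python", "3.10",
    "python", "-m", "cosmos_policy.experiments.robot.libero.run_libero_eval",
]

SUITES = ["libero_spatial", "libero_object", "libero_goal", "libero_10"]


def build_libero_command(suite: str, num_trials: int = 50) -> list:
    """Return the argv list for one suite (hypothetical helper)."""
    flags = dict(BASE_FLAGS)
    flags["task_suite_name"] = suite
    flags["num_trials_per_task"] = str(num_trials)
    flags["run_id_note"] = f"suite_{suite}"
    argv = list(UV_PREFIX)
    for name, value in flags.items():
        argv += [f"--{name}", value]
    return argv


if __name__ == "__main__":
    # Print the commands for review instead of launching them.
    for suite in SUITES:
        print(shlex.join(build_libero_command(suite)))
```

To execute a command rather than print it, pass the returned list to subprocess.run(argv, check=True) from inside the cosmos-policy checkout.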

---

Workflow 2: RoboCasa evaluation

Copy this checklist and track progress:

RoboCasa Eval Progress:
- [ ] Step 1: Install RoboCasa assets and verify macros
- [ ] Step 2: Run single-task smoke evaluation
- [ ] Step 3: Validate outputs
- [ ] Step 4: Expand to multi-task runs

Step 1: Install RoboCasa

git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
python -m robocasa.scripts.setup_macros
python -m robocasa.scripts.download_kitchen_assets

This fork installs the robocasa Python package expected by Cosmos Policy while preserving the patched environment changes used in the public RoboCasa eval path. Verify macros_private.py exists and paths are correct.

Step 2: Single-task smoke evaluation

uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 2 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Step 3: Validate outputs

Confirm the eval log prints the expected task name, object split, and checkpoint/config values.
Inspect the final Success rate: line in the log.
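
The final Success rate: line can also be extracted programmatically. The sketch below assumes the line has the form "Success rate: <number>" with an optional percent sign; the exact upstream log format is not confirmed here, so adjust the pattern to match what the eval log actually prints. The function name final_success_rate is hypothetical.

```python
import re
from typing import Optional

# Assumed format: the label followed by a number, optionally with a % sign.
SUCCESS_RATE_PATTERN = re.compile(r"Success rate:\s*([0-9]*\.?[0-9]+)\s*(%?)")


def final_success_rate(log_text: str) -> Optional[float]:
    """Return the last reported success rate as a fraction, or None if absent."""
    matches = SUCCESS_RATE_PATTERN.findall(log_text)
    if not matches:
        return None
    value, percent_sign = matches[-1]
    rate = float(value)
    # Treat an explicit % sign, or a value above 1, as a percentage.
    if percent_sign or rate > 1.0:
        rate /= 100.0
    return rate


if __name__ == "__main__":
    sample_log = "Success rate: 0.5\nSuccess rate: 1.0\n"  # made-up sample text
    print(final_success_rate(sample_log))
```

Returning None when no line is found makes a crashed or truncated run easy to tell apart from a run that scored zero.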

Step 4: Expand scope

Increase --num_trials_per_task or add more tasks. Keep --obj_instance_split fixed across repeated runs for comparability.

---

Workflow 3: Blank-machine cluster launch

Cluster Launch Progress:
- [ ] Step 1: Clone the public repo and enter the supported runtime
- [ ] Step 2: Sync the benchmark-specific dependency group
- [ ] Step 3: Export rendering and cache environment variables before eval

Step 1: Clone and enter the supported runtime

git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md, start the Docker container, and enter it before continuing.

Step 2: Sync dependencies

uv sync --extra cu128 --group libero --python 3.10
# or, for RoboCasa:
uv sync --extra cu128 --group robocasa --python 3.10
# then install the Cosmos-compatible RoboCasa fork:
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy

Step 3: Export runtime environment

export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
export HF_HOME="${HF_HOME:-$HOME/.cache/huggingface}"
# Hugging Face stores downloaded models under $HF_HOME/hub by default.
export TRANSFORMERS_CACHE="${TRANSFORMERS_CACHE:-$HF_HOME/hub}"

---

Expected performance benchmarks

Reference values from official evaluation (tied to specific setup and seeds):

Task Suite	Success Rate	Notes
LIBERO-Spatial	98.1%	Official LIBERO spatial result
LIBERO-Object	100.0%	Official LIBERO object result
LIBERO-Goal	98.2%	Official LIBERO goal result
LIBERO-Long	97.6%	Official LIBERO long-horizon result (suite name libero_10)
LIBERO-Average	98.5%	Official average across LIBERO suites
RoboCasa	67.1%	Official RoboCasa average result
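
As a quick arithmetic check, the LIBERO average in the table is the unweighted mean of the four suite results:

```python
# Suite success rates (percent) copied from the table above.
suite_success_rates = {
    "LIBERO-Spatial": 98.1,
    "LIBERO-Object": 100.0,
    "LIBERO-Goal": 98.2,
    "LIBERO-Long": 97.6,
}

average = sum(suite_success_rates.values()) / len(suite_success_rates)
print(f"{average:.3f}")  # 98.475, reported as 98.5% in the table
```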

Reproduction note: Published success rates still depend on checkpoint choice, task suite, seeds, and simulator setup. Record the exact command and environment alongside any reported number.
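
One way to record the exact command and environment is to write a small manifest next to each result. This is a sketch only: build_run_manifest and the choice of tracked variables are assumptions made for illustration, not part of the upstream tooling.

```python
import json
import os
import sys
from datetime import datetime, timezone

# Environment variables this skill sets before an eval.
TRACKED_ENV_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "MUJOCO_EGL_DEVICE_ID",
    "MUJOCO_GL",
    "PYOPENGL_PLATFORM",
    "HF_HOME",
)


def build_run_manifest(command, success_rate=None) -> dict:
    """Collect the command, environment, and score for one run (hypothetical helper)."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "command": list(command),
        "python_version": sys.version.split()[0],
        "environment": {name: os.environ.get(name) for name in TRACKED_ENV_VARS},
        "success_rate": success_rate,
    }


if __name__ == "__main__":
    example_command = [
        "python", "-m", "cosmos_policy.experiments.robot.libero.run_libero_eval",
        "--task_suite_name", "libero_10",
    ]
    print(json.dumps(build_run_manifest(example_command), indent=2))
```

Unset variables are stored as null, which makes a missing EGL setting visible when two manifests are compared later.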

---

Non-negotiable rules

EGL alignment: Always set CUDA_VISIBLE_DEVICES, MUJOCO_EGL_DEVICE_ID, MUJOCO_GL=egl, and PYOPENGL_PLATFORM=egl together on headless GPU nodes.
Official runtime first: If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from SETUP.md before debugging package internals.
Cache consistency: Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
Run comparability: Keep task name, object split, seed, and trial count fixed across repeated runs.

---

Common issues

Issue: binary compatibility or loader failures on host Python

Fix: rerun inside the official container/runtime from SETUP.md. Do not assume host-package rebuilds will match the public release environment.

Issue: LIBERO prompts for config path in a non-interactive shell

Fix: pre-create LIBERO_CONFIG_PATH/config.yaml:

import os, yaml

config_dir = os.path.expanduser("~/.libero")
os.makedirs(config_dir, exist_ok=True)
with open(os.path.join(config_dir, "config.yaml"), "w") as f:
    yaml.dump({"benchmark_root": "/path/to/libero/datasets"}, f)

Issue: EGL initialization or shutdown noise

Fix: align EGL environment variables first. Treat teardown-only EGL_NOT_INITIALIZED warnings as low-signal unless the job exits non-zero.
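
The triage rule above can be written down as a small classifier for job post-processing. This is a sketch under a stated assumption: a zero exit code is taken to mean the warning appeared during teardown, after the evaluation finished. The function name and the three labels are hypothetical.

```python
def classify_egl_outcome(return_code: int, stderr_text: str) -> str:
    """Label a finished eval job based on exit code and stderr content."""
    if return_code != 0:
        # A non-zero exit is a real failure regardless of which warnings appear.
        return "failure"
    if "EGL_NOT_INITIALIZED" in stderr_text:
        # Job succeeded; the warning is treated as teardown noise.
        return "teardown-noise"
    return "clean"


if __name__ == "__main__":
    print(classify_egl_outcome(0, "warning: EGL_NOT_INITIALIZED"))
```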

Issue: Kitchen object sampling NaNs or asset lookup failures in RoboCasa

Fix: rerun asset setup and confirm the patched robocasa install is intact:

python -m robocasa.scripts.download_kitchen_assets
python -c "import robocasa; print(robocasa.__file__)"

Issue: MuJoCo rendering mismatch

Fix: verify GPU device alignment:

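import os

cuda_dev = os.environ.get("CUDA_VISIBLE_DEVICES")
egl_dev = os.environ.get("MUJOCO_EGL_DEVICE_ID")
# Fail when either variable is unset; otherwise two missing values would compare equal.
assert cuda_dev is not None and egl_dev is not None, "Set CUDA_VISIBLE_DEVICES and MUJOCO_EGL_DEVICE_ID first"
assert cuda_dev == egl_dev, f"GPU mismatch: CUDA={cuda_dev}, EGL={egl_dev}"
print(f"Rendering on GPU {cuda_dev}")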

---

Advanced topics

LIBERO command matrix: See references/libero-commands.md
RoboCasa command matrix: See references/robocasa-commands.md

Resources

Cosmos Policy repository: https://github.com/NVlabs/cosmos-policy
LIBERO benchmark: https://github.com/Lifelong-Robot-Learning/LIBERO
Cosmos-compatible RoboCasa fork: https://github.com/moojink/robocasa-cosmos-policy
Upstream RoboCasa project: https://github.com/robocasa/robocasa
MuJoCo documentation: https://mujoco.readthedocs.io/

LIBERO Command Matrix

Command variations for running Cosmos Policy LIBERO evaluation on local machines, interactive GPU shells, or batch systems. All commands use the official public cosmos_policy.experiments.robot.libero.run_libero_eval module.

Preferred path: interactive GPU shell

Acquire one GPU, then run evaluations directly:

# Slurm example
srun --partition=gpu --gpus-per-node=1 \
  --time=01:00:00 --mem=64G --cpus-per-task=8 --pty bash

cd /path/to/cosmos-policy

# Set headless rendering environment
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl

# Smoke eval (1 trial, single suite)
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

# Full eval (50 trials, single suite)
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 50 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note full \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

# All four suites
for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name "$suite" \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note "suite_${suite}" \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
done

Local GPU workstation path

Skip srun and run the same uv run ... python -m commands directly. Set EGL env vars first. If host-Python binaries are unstable, prefer the official container/runtime from SETUP.md.

Blank-machine setup reminder

Before running any command in this matrix:

clone https://github.com/NVlabs/cosmos-policy.git
follow SETUP.md and enter the supported Docker container
run uv sync --extra cu128 --group libero --python 3.10

Batch fallback

Only use batch submission after the direct command path works interactively:

sbatch --partition=gpu --gpus-per-node=1 --mem=64G --cpus-per-task=8 --time=04:00:00 --wrap="
  export CUDA_VISIBLE_DEVICES=0 MUJOCO_EGL_DEVICE_ID=0 MUJOCO_GL=egl PYOPENGL_PLATFORM=egl
  cd /path/to/cosmos-policy
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name libero_10 \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note batch \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
"

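For longer jobs, a script file with #SBATCH directives is often easier to review than a long --wrap string. The sketch below renders such a script from Python; render_sbatch_script is a hypothetical helper, and the resource values mirror the interactive srun example in this matrix on the assumption that they also suit batch runs.

```python
import shlex


def render_sbatch_script(
    eval_command: str,
    repo_dir: str = "/path/to/cosmos-policy",
    partition: str = "gpu",
    time_limit: str = "04:00:00",
) -> str:
    """Return the text of a Slurm batch script for one eval command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        "#SBATCH --gpus-per-node=1",
        "#SBATCH --mem=64G",
        "#SBATCH --cpus-per-task=8",
        f"#SBATCH --time={time_limit}",
        "",
        "set -euo pipefail",
        "export CUDA_VISIBLE_DEVICES=0 MUJOCO_EGL_DEVICE_ID=0 MUJOCO_GL=egl PYOPENGL_PLATFORM=egl",
        f"cd {shlex.quote(repo_dir)}",
        eval_command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder command; substitute the full uv run invocation from this matrix.
    print(render_sbatch_script("echo 'eval command goes here'"))
```

Write the returned text to a file and submit it with sbatch once the same command has succeeded interactively.
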
High-signal gotchas

If host-Python binaries fail to import cleanly, return to the official container/runtime from SETUP.md before debugging Python package state.
Always align CUDA_VISIBLE_DEVICES and MUJOCO_EGL_DEVICE_ID to the same GPU index.
Keep the full config block with the command because upstream eval depends on many explicit flags, not only task suite and trial count.

RoboCasa Command Matrix

Command variations for running Cosmos Policy RoboCasa evaluation on local machines, interactive GPU shells, or batch systems. All commands use the official public cosmos_policy.experiments.robot.robocasa.run_robocasa_eval module.

Preferred path: interactive GPU shell

Acquire one GPU, then run evaluations directly:

# Slurm example
srun --partition=gpu --gpus-per-node=1 \
  --time=01:00:00 --mem=64G --cpus-per-task=8 --pty bash

cd /path/to/cosmos-policy

# Set headless rendering environment
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl

# Smoke eval on one task (2 trials)
uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 2 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

# Full eval on one task (50 trials)
uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 50 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note full \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

Local GPU workstation path

Skip srun and run the same uv run ... python -m commands directly. Set EGL env vars first. If host-Python binaries are unstable, prefer the official container/runtime from SETUP.md.

Blank-machine setup reminder

Before running any command in this matrix:

clone https://github.com/NVlabs/cosmos-policy.git
follow SETUP.md and enter the supported Docker container
run uv sync --extra cu128 --group robocasa --python 3.10
clone https://github.com/moojink/robocasa-cosmos-policy.git and install it with uv pip install -e robocasa-cosmos-policy
run python -m robocasa.scripts.setup_macros and python -m robocasa.scripts.download_kitchen_assets before the first eval

Batch fallback

Only use batch submission after the direct command path works interactively:

sbatch --partition=gpu --gpus-per-node=1 --mem=64G --cpus-per-task=8 --time=01:00:00 --wrap="
  export CUDA_VISIBLE_DEVICES=0 MUJOCO_EGL_DEVICE_ID=0 MUJOCO_GL=egl PYOPENGL_PLATFORM=egl
  cd /path/to/cosmos-policy
  uv run --extra cu128 --group robocasa --python 3.10 \
    python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
      --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
      --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --num_wrist_images 1 \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 32 \
      --num_open_loop_steps 16 \
      --task_name TurnOffMicrowave \
      --obj_instance_split A \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note batch \
      --use_variance_scale False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
"

High-signal gotchas

If host-Python binaries fail to import cleanly, return to the official container/runtime from SETUP.md before debugging Python package state.
Keep task name, object split, seed, and trial count fixed across repeated runs for comparability.
Always align CUDA_VISIBLE_DEVICES and MUJOCO_EGL_DEVICE_ID to the same GPU index.
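
The comparability rule can be checked mechanically before two results are placed side by side. This is a minimal sketch that assumes each run's settings are available as a dictionary keyed by flag name; comparability_mismatches is a hypothetical helper.

```python
# Settings that must match between runs, per the gotcha above.
COMPARABILITY_KEYS = ("task_name", "obj_instance_split", "seed", "num_trials_per_task")


def comparability_mismatches(run_a: dict, run_b: dict) -> dict:
    """Return {key: (value_a, value_b)} for every setting that differs."""
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in COMPARABILITY_KEYS
        if run_a.get(key) != run_b.get(key)
    }


if __name__ == "__main__":
    baseline = {
        "task_name": "TurnOffMicrowave",
        "obj_instance_split": "A",
        "seed": 195,
        "num_trials_per_task": 50,
    }
    rerun = dict(baseline, seed=7)
    print(comparability_mismatches(baseline, rerun))  # {'seed': (195, 7)}
```

An empty result means the two runs share all four settings and their success rates can be compared directly.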

How it compares

Pick evaluating-cosmos-policy over generic ML test skills when the target artifact is an NVIDIA Cosmos Policy checkpoint evaluated specifically on LIBERO robot manipulation suites.

FAQ

What module does evaluating-cosmos-policy use for LIBERO runs?

evaluating-cosmos-policy routes all commands through the official public module cosmos_policy.experiments.robot.libero.run_libero_eval. The skill documents uv run invocations with CUDA extras and headless EGL rendering so evaluations match upstream Cosmos Policy expectations.

Does evaluating-cosmos-policy support Slurm batch jobs?

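Yes. evaluating-cosmos-policy includes Slurm srun examples that request one GPU per node, 64G of memory, eight CPUs, and a one-hour wall clock for an interactive shell, plus an sbatch --wrap fallback for batch submission. The recommended order is to acquire an interactive GPU shell, export the MuJoCo EGL variables, run the LIBERO eval commands inside the cosmos-policy checkout, and move to batch submission only after the direct command works.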

Is Evaluating Cosmos Policy safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
