
Evaluating Cosmos Policy
Run NVIDIA Cosmos Policy LIBERO robot-policy evaluations on a GPU machine or Slurm job with headless MuJoCo rendering and the official run_libero_eval module.
Overview
Evaluating-cosmos-policy is an agent skill for the Ship phase that runs Cosmos Policy LIBERO evaluations via the official run_libero_eval module on GPU hosts.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill evaluating-cosmos-policy
What is this skill?
- Command matrix for local GPU, interactive shells, and batch (Slurm) LIBERO runs
- Official entrypoint: cosmos_policy.experiments.robot.libero.run_libero_eval
- Headless EGL/MuJoCo environment variables for GPU rendering
- Smoke eval recipe (e.g., limited trials per task suite)
- Configurable checkpoints, wrist image, proprioception, and action chunking flags
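Before launching a run, it can help to confirm that the headless rendering variables are actually set in the current shell. The following is a minimal sketch, not part of the skill: `check_headless_env` is a hypothetical helper name, and it only checks the four variables that SKILL.md exports (`CUDA_VISIBLE_DEVICES`, `MUJOCO_EGL_DEVICE_ID`, `MUJOCO_GL`, `PYOPENGL_PLATFORM`). It does not verify that a GPU or EGL driver is present.

```bash
# Hypothetical helper (not part of the skill): checks the headless
# rendering variables that SKILL.md exports before an eval run.
# The body runs in a subshell so no variables leak into the caller.
check_headless_env() (
  status=0

  # These two must be exactly "egl" for headless MuJoCo/PyOpenGL rendering.
  for pair in MUJOCO_GL=egl PYOPENGL_PLATFORM=egl; do
    name="${pair%%=*}"
    want="${pair#*=}"
    eval "have=\${$name:-}"   # names are fixed literals, so eval is safe here
    if [ "$have" != "$want" ]; then
      echo "expected $name=$want, found '$have'" >&2
      status=1
    fi
  done

  # These two only need to be set; the right GPU index depends on the host.
  for name in CUDA_VISIBLE_DEVICES MUJOCO_EGL_DEVICE_ID; do
    eval "have=\${$name:-}"
    if [ -z "$have" ]; then
      echo "$name is not set" >&2
      status=1
    fi
  done

  exit "$status"
)
```

A typical use would be `check_headless_env && echo "headless env OK"` right after the `export` lines, so that a missing variable is reported before the model is loaded.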
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a Cosmos Policy checkpoint but no standardized, headless GPU command sequence to benchmark it on LIBERO suites.
Who is it for?
Robotics/embodied-AI builders reproducing NVIDIA Cosmos-Policy-LIBERO-Predict2-2B results on Slurm or local CUDA machines.
Skip if: Training new policies from scratch, non-LIBERO benchmarks, or CPU-only environments without GPU simulation support.
When should I use this skill?
You need to run Cosmos Policy LIBERO evaluations locally, on an interactive GPU shell, or via batch schedulers using the official run_libero_eval commands.
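The smoke and full commands in SKILL.md share every flag except the task suite, the trial count, and the run note. As a sketch under that assumption, the shared flags can be kept in one place. `run_libero_eval` and the `DRY_RUN` variable below are hypothetical names introduced for illustration; every flag value is copied from the SKILL.md smoke command.

```bash
# Hypothetical wrapper (not part of the skill). Usage:
#   run_libero_eval <task_suite_name> <num_trials_per_task> <run_id_note>
# Set DRY_RUN=1 to print the command instead of executing it.
run_libero_eval() (
  [ "$#" -eq 3 ] || { echo "usage: run_libero_eval SUITE TRIALS NOTE" >&2; exit 2; }
  suite="$1"
  trials="$2"
  note="$3"
  ckpt="nvidia/Cosmos-Policy-LIBERO-Predict2-2B"

  # Rebuild the positional parameters as the full command line.
  set -- uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path "$ckpt" \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path "$ckpt/libero_dataset_statistics.json" \
    --t5_text_embeddings_path "$ckpt/libero_t5_embeddings.pkl" \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name "$suite" \
    --num_trials_per_task "$trials" \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note "$note" \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False

  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"   # print the command for inspection or logging
  else
    "$@"                 # run it from the cosmos-policy checkout
  fi
)
```

With this in place, `DRY_RUN=1 run_libero_eval libero_10 1 smoke` prints the smoke command for review, and dropping `DRY_RUN=1` runs it. Keeping the flags in a single function also makes it harder for the smoke and full runs to drift apart.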
What do I get? / Deliverables
You can execute smoke or full LIBERO eval runs with documented env vars, uv invocations, and checkpoint/config flags, writing logs under the project’s libero logs directory.
- Executed LIBERO eval run with chosen task suite and trial counts
- Local log output under cosmos_policy/experiments/robot/libero/logs
- Documented smoke or full benchmark command for reproducibility
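To locate the output of the most recent run, one option is to list the log directory by modification time. This is a small sketch that assumes each run adds at least one new file or directory under `cosmos_policy/experiments/robot/libero/logs`; the exact file names written by `run_libero_eval` are not documented here, and `latest_log` is a hypothetical helper name.

```bash
# Hypothetical helper (not part of the skill): print the most recently
# modified entry in the LIBERO log directory (or in the directory given
# as the first argument). Assumes entry names contain no newlines.
latest_log() (
  dir="${1:-cosmos_policy/experiments/robot/libero/logs}"
  [ -d "$dir" ] || { echo "no log directory at $dir" >&2; exit 1; }

  # ls -t sorts newest first; keep only the first entry.
  newest="$(ls -t "$dir" | head -n 1)"
  [ -n "$newest" ] || { echo "no entries in $dir" >&2; exit 1; }

  printf '%s/%s\n' "${dir%/}" "$newest"
)
```

Run from the repository root, `latest_log` prints the path of the newest entry, which can then be passed to `tail` or `less`.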
How it compares
A focused LIBERO eval command cookbook, not a general ML unit-test framework or a live robot hardware bring-up skill.
Common Questions / FAQ
Who is evaluating-cosmos-policy for?
Solo and small-team researchers shipping Cosmos Policy models who need repeatable LIBERO simulation evals on GPU infrastructure.
When should I use evaluating-cosmos-policy?
During Ship/testing when validating a checkpoint after training changes, before publishing results, or when smoke-testing inference configs on libero_10 and related suites.
Is evaluating-cosmos-policy safe to install?
Eval runs pull checkpoints and execute GPU-heavy simulation. Review the Security Audits panel on this Prism page, use only trusted checkpoints, and stay within your cluster quotas.
SKILL.md - Evaluating Cosmos Policy
# LIBERO Command Matrix

Command variations for running Cosmos Policy LIBERO evaluation on local machines, interactive GPU shells, or batch systems. All commands use the official public `cosmos_policy.experiments.robot.libero.run_libero_eval` module.

## Preferred path: interactive GPU shell

Acquire one GPU, then run evaluations directly:

```bash
# Slurm example
srun --partition=gpu --gpus-per-node=1 \
  --time=01:00:00 --mem=64G --cpus-per-task=8 --pty bash

cd /path/to/cosmos-policy

# Set headless rendering environment
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl

# Smoke eval (1 trial, single suite)
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
  --config cosmos_predict2_2b_480p_libero__inference_only \
  --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
  --config_file cosmos_policy/config/config.py \
  --use_wrist_image True \
  --use_proprio True \
  --normalize_proprio True \
  --unnormalize_actions True \
  --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
  --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
  --trained_with_image_aug True \
  --chunk_size 16 \
  --num_open_loop_steps 16 \
  --task_suite_name libero_10 \
  --num_trials_per_task 1 \
  --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
  --seed 195 \
  --randomize_seed False \
  --deterministic True \
  --run_id_note smoke \
  --ar_future_prediction False \
  --ar_value_prediction False \
  --use_jpeg_compression True \
  --flip_images True \
  --num_denoising_steps_action 5 \
  --num_denoising_steps_future_state 1 \
  --num_denoising_steps_value 1 \
  --data_collection False

# Full eval (50 trials, single suite)
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
  --config cosmos_predict2_2b_480p_libero__inference_only \
  --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
  --config_file cosmos_policy/config/config.py \
  --use_wrist_image True \
  --use_proprio True \
  --normalize_proprio True \
  --unnormalize_actions True \
  --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
  --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
  --trained_with_image_aug True \
  --chunk_size 16 \
  --num_open_loop_steps 16 \
  --task_suite_name libero_10 \
  --num_trials_per_task 50 \
  --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
  --seed 195 \
  --randomize_seed False \
  --deterministic True \
  --run_id_note full \
  --ar_future_prediction False \
  --ar_value_prediction False \
  --use_jpeg_compression True \
  --flip_images True \
  --num_denoising_steps_action 5 \
  --num_denoising_steps_future_state 1 \
  --num_denoising_steps_value 1 \
  --data_collection False

# All four suites
for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name "$suite" \
    --num_trials_per_task 50 \
    --local_