
ComfyUI Video Pipeline
Orchestrate ComfyUI video generation from your agent, choosing among Wan 2.2, FramePack, and AnimateDiff for image-to-video, text-to-video, and motion-controlled clips.
Overview
ComfyUI Video Pipeline is an agent skill for the Build phase that selects and runs ComfyUI video engines (Wan 2.2, FramePack, AnimateDiff) for image-to-video and text-to-video generation.
Install
npx skills add https://github.com/mckruz/comfyui-expert --skill comfyui-video-pipeline
What is this skill?
- Decision tree routes requests to Wan 2.2 MoE 14B, Wan 2.2 1.3B, FramePack, AnimateDiff Lightning, AnimateDiff V3, or def
- Supports image-to-video, text-to-video, talking heads, and motion-controlled animation
- VRAM-aware routing: 24GB+ for MoE quality, 8GB for 1.3B, FramePack for 60s on ~6GB, Lightning for 4–8 step iteration
- Wan 2.2 MoE called out for exclusive first+last frame control
- OpenClaw metadata expects COMFYUI_URL plus curl or wget on darwin, linux, or win32
- Engine tree includes Wan 2.2 MoE 14B (24GB+ VRAM), Wan 2.2 1.3B (8GB VRAM), FramePack (~60 seconds on 6GB), and AnimateD
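The routing described in the bullets above can be sketched as a small function. This is only an illustration, not the skill's implementation: `VideoRequest`, its field names, and `select_engine` are hypothetical, the top-to-bottom precedence of the checks is an assumption, and the 10-second threshold for "long video" is taken from the decision tree in SKILL.md further down.

```python
from dataclasses import dataclass


@dataclass
class VideoRequest:
    """Hypothetical request shape; field names are illustrative only."""
    vram_gb: float
    film_quality: bool = False
    duration_s: float = 5.0
    fast_iteration: bool = False
    motion_control: bool = False
    first_last_frame: bool = False


def select_engine(req: VideoRequest) -> str:
    """Walk the engine tree top to bottom (assumed precedence)."""
    if req.film_quality:
        if req.vram_gb >= 24:
            return "Wan 2.2 MoE 14B"
        if req.vram_gb >= 8:
            return "Wan 2.2 1.3B"
        # Below 8GB the tree names no film-quality option, so fall through.
    if req.duration_s > 10:
        return "FramePack"
    if req.fast_iteration:
        return "AnimateDiff Lightning"
    if req.motion_control:
        return "AnimateDiff V3 + Motion LoRAs"
    if req.first_last_frame:
        return "Wan 2.2 MoE"
    return "Wan 2.2"  # default branch of the tree


if __name__ == "__main__":
    print(select_engine(VideoRequest(vram_gb=6, duration_s=60)))
```

Because the tree does not say which need wins when several apply, a real router would probably need an explicit priority rule; the ordering here simply mirrors the order in which the branches are listed.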
Adoption & trust: 588 installs on skills.sh; 69 GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You want agent-generated video from images or prompts but do not know which ComfyUI engine, checkpoints, or VRAM budget fit your clip length and quality bar.
Who is it for?
Indie builders with ComfyUI already running who need repeatable routing between Wan, FramePack, and AnimateDiff for content and product demos.
Skip if: Teams without a ComfyUI instance or GPU budget, static image-only generation, or workflows that do not involve video diffusion at all.
When should I use this skill?
Creating any video content from character images or text descriptions with ComfyUI, including image-to-video, text-to-video, talking heads, and motion-controlled animation.
What do I get? / Deliverables
You get a concrete engine choice, model prerequisites, and pipeline steps aligned to your hardware instead of a single default workflow that fails on memory or duration.
- Selected engine and pipeline variant for the request
- Checkpoint and model path checklist
- Rendered video output via ComfyUI workflow execution
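As a rough sketch of what "workflow execution" can look like from an agent, the snippet below builds the HTTP request that queues a workflow on a ComfyUI server via its `POST /prompt` endpoint. The helper name `build_prompt_request` is hypothetical, the fallback URL assumes ComfyUI's default local port, and the workflow dict is expected to be a graph exported in ComfyUI's API format; none of this is taken from the skill itself.

```python
import json
import os
import urllib.request
from typing import Optional


def build_prompt_request(workflow: dict, base_url: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request that queues `workflow` on ComfyUI."""
    # COMFYUI_URL is the variable the skill metadata expects; the fallback is an assumption.
    base = (base_url or os.environ.get("COMFYUI_URL", "http://127.0.0.1:8188")).rstrip("/")
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder graph; a real one comes from exporting a workflow in API format.
    req = build_prompt_request({}, "http://127.0.0.1:8188")
    print(req.get_method(), req.full_url)
```

Passing the returned request to `urllib.request.urlopen` would actually submit it. Keeping construction separate from sending makes the request easy to inspect or log before anything reaches the GPU host.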
Journey fit
Video pipeline work happens in Build when you wire generative engines, model paths, and ComfyUI workflows into something shippable. Integrations is the right shelf because the skill selects engines and prerequisites against a remote or local COMFYUI_URL rather than teaching generic frontend UI.
How it compares
Use instead of one-off ComfyUI JSON dumps when you need VRAM-aware engine selection across Wan 2.2, FramePack, and AnimateDiff. It is not a hosted video API integration skill.
Common Questions / FAQ
Who is comfyui-video-pipeline for?
Solo builders and small teams using ComfyUI who want an agent to pick the right video stack for quality, length, speed, or motion-control requirements.
When should I use comfyui-video-pipeline?
Use it in Build when creating product videos, character animation from stills, or long clips; use it in Launch when producing distribution-ready motion content from the same ComfyUI setup.
Is comfyui-video-pipeline safe to install?
It needs network access to your ComfyUI URL and download tooling; review the Security Audits panel on this page and lock down COMFYUI_URL to trusted hosts.
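One way to "lock down COMFYUI_URL to trusted hosts" is to check the URL against an allowlist before the agent uses it. The sketch below is an assumption about how that might be done, not part of the skill; `TRUSTED_HOSTS` and `is_trusted_comfyui_url` are hypothetical names, and the allowlist contents are placeholders.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts you actually trust.
TRUSTED_HOSTS = frozenset({"127.0.0.1", "localhost"})


def is_trusted_comfyui_url(url: str, trusted_hosts=TRUSTED_HOSTS) -> bool:
    """Accept only http(s) URLs whose hostname exactly matches the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in trusted_hosts


if __name__ == "__main__":
    print(is_trusted_comfyui_url("http://127.0.0.1:8188"))
```

Matching on the exact hostname rather than a substring avoids accepting look-alike hosts such as `localhost.evil.example`.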
SKILL.md
# ComfyUI Video Pipeline

Orchestrates video generation across three engines, selecting the best one based on requirements and available resources.

## Engine Selection

```
VIDEO REQUEST
|
|-- Need film-level quality?
|   |-- Yes + 24GB+ VRAM → Wan 2.2 MoE 14B
|   |-- Yes + 8GB VRAM → Wan 2.2 1.3B
|
|-- Need long video (>10 seconds)?
|   |-- Yes → FramePack (60 seconds on 6GB)
|
|-- Need fast iteration?
|   |-- Yes → AnimateDiff Lightning (4-8 steps)
|
|-- Need camera/motion control?
|   |-- Yes → AnimateDiff V3 + Motion LoRAs
|
|-- Need first+last frame control?
|   |-- Yes → Wan 2.2 MoE (exclusive feature)
|
|-- Default → Wan 2.2 (best general quality)
```

## Pipeline 1: Wan 2.2 MoE (Highest Quality)

### Image-to-Video

**Prerequisites:**
- `wan2.1_i2v_720p_14b_bf16.safetensors` in `models/diffusion_models/`
- `umt5_xxl_fp8_e4m3fn_scaled.safetensors` in `models/clip/`
- `open_clip_vit_h_14.safetensors` in `models/clip_vision/`
- `wan_2.1_vae.safetensors` in `models/vae/`

**Settings:**

| Parameter | Value | Notes |
|-----------|-------|-------|
| Resolution | 1280x720 (landscape) or 720x1280 (portrait) | Native training resolution |
| Frames | 81 (~5 seconds at 16fps) | Multiples of 4 + 1 |
| Steps | 30-50 | Higher = better quality |
| CFG | 5-7 | |
| Sampler | uni_pc | Recommended for Wan |
| Scheduler | normal | |

**Frame count guide:**

| Duration | Frames (16fps) |
|----------|----------------|
| 1 second | 17 |
| 3 seconds | 49 |
| 5 seconds | 81 |
| 10 seconds | 161 |

**VRAM optimization:**
- FP8 quantization: halves VRAM with minimal quality loss
- SageAttention: faster attention computation
- Reduce frames if OOM

### Text-to-Video

Same as I2V but uses `wan2.1_t2v_14b_bf16.safetensors` and `EmptySD3LatentImage` instead of image conditioning.

### First+Last Frame Control (Wan 2.2 Exclusive)

Wan 2.2 MoE allows specifying both the first and last frame, enabling precise video planning:

1. Generate two hero images with consistent character
2. Use first as start frame, second as end frame
3. Wan interpolates the motion between them

## Pipeline 2: FramePack (Long Videos, Low VRAM)

### Key Innovation

VRAM usage is **invariant to video length** - generates 60-second videos at 30fps on just 6GB VRAM.

**How it works:**
- Dynamic context compression: 1536 markers for key frames, 192 for transitions
- Bidirectional memory with reverse generation prevents drift
- Frame-by-frame generation with context window

### Settings

| Parameter | Value | Notes |
|-----------|-------|-------|
| Resolution | 640x384 to 1280x720 | Depends on VRAM |
| Duration | Up to 60 seconds | VRAM-invariant |
| Quality | High (comparable to Wan) | Uses same base models |

### When to Use

- Videos longer than 10 seconds
- Limited VRAM systems (but RTX 5090 doesn't need this)
- When VRAM is needed for parallel operations
- Batch video generation

## Pipeline 3: AnimateDiff V3 (Fast, Controllable)

### Strengths

- Motion LoRAs for camera control (pan, zoom, tilt, roll)
- Effect LoRAs (shatter, smoke, explosion, liquid)
- Sliding context window for infinite length
- Very fast with Lightning model (4-8 steps)

### Settings

| Parameter | Value (Standard) | Value (Lightning) |
|-----------|-----------------|-------------------|
| Motion Module | `v3_sd15_mm.ckpt` | `animatediff_lightning_4step.safetensors` |
| Steps | 20-25 | 4-8 |
| CFG | 7-8 | 1.5-2.0 |
| Sampler | euler_ancestral | lcm |
| Resolution | 512x512 | 512x512 |
| Context Length