
orchestra-research/ai-research-skills

99 skills · 40.6k installs · 1M stars · GitHub

Install

npx skills add https://github.com/orchestra-research/ai-research-skills

Skills in this repo

1. Ml Paper Writing

ml-paper-writing is a skill from orchestra-research/ai-research-skills that documents mandatory checklist requirements for major ML and AI conferences. The reference covers NeurIPS, ICML, ICLR, and ACL, including NeurIPS's 16-item mandatory paper checklist that triggers automatic desk rejection when omitted. Researchers use ml-paper-writing during the final days before submission to verify checklist placement, supplemental material rules, and universal pre-submission items. The skill pairs with systems-paper-writing for OSDI, NSDI, ASPLOS, and SOSP venues. Developers and ML engineers preparing arXiv or conference uploads reach for this skill when compliance formatting is as critical as experimental results.

739 installs

2. Brainstorming Research Ideas

brainstorming-research-ideas is a version 1.0.0 Orchestra Research agent skill in the AI-Research-SKILLs library that guides structured research ideation through ten complementary lenses such as Problem-First vs Solution-First Thinking, the Abstraction Ladder, Tension Hunting, Cross-Pollination, and the Explain-It Test. A Framework Selection Guide maps starting situations to 2–3 recommended lenses, and an integrated diverge-converge workflow ranks raw candidates into testable hypotheses and proposal-ready directions. The skill is self-contained with zero declared dependencies and fits early exploration, pivots between projects, or fresh angles on half-formed ideas, not literature reviews or experimental execution.
Developers reach for brainstorming-research-ideas when choosing the next ML or AI research topic, preparing a collaborator brainstorm, or validating whether an idea has genuine demand before investing GPU time or writing code.

541 installs

3. Creative Thinking For Research

creative-thinking-for-research is a Claude Code skill from orchestra-research/ai-research-skills (version 1.0.0) that applies eight empirically grounded cognitive-science frameworks (combinatorial creativity, analogical reasoning, constraint manipulation, problem reformulation, and related strategies) to CS and AI research ideation. Unlike ad-hoc brainstorming, each framework gives structured prompts for producing genuinely novel directions rather than marginal benchmark increments. ML researchers and PhD engineers invoke it when literature surveys feel exhausted but a defensible new problem framing is still missing. The skill targets ideation before experiments, proofs, or manuscript outlines.

531 installs

4. Mlflow

mlflow is a production deployment guide skill in orchestra-research/ai-research-skills for ML engineers moving trained models out of the MLflow registry. The skill documents six deployment targets (local server, REST API, Docker, AWS SageMaker, Azure ML, and Kubernetes) plus batch inference and monitoring patterns, each with complexity ratings from low to high. Developers reach for mlflow when a registered model is validated and needs a repeatable path to serving infrastructure, whether containerized Docker deploys, managed cloud endpoints, or batch scoring jobs, with production checklists for REST serving and orchestration.

521 installs

5. Faiss

faiss is an orchestra-research skill that documents FAISS index tradeoffs across dataset sizes: Flat for under 10K vectors with 100% accuracy, IVF for 10K–1M at 95–99% accuracy, HNSW for 1M–10M near 99% accuracy, and IVF+PQ beyond 10M for memory-efficient 90–95% accuracy.
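The size thresholds quoted for faiss can be sketched as a small selection helper. This is only an illustration of the ranges listed in the entry: the function name `pick_index_family` and the returned labels are hypothetical, the handling of the exact boundary values is an assumption, and the code does not call FAISS itself.

```python
def pick_index_family(num_vectors: int) -> str:
    """Map a dataset size to the index family quoted in the faiss entry.

    Hypothetical helper: boundary handling (for example exactly 1M vectors)
    is an assumption, since the listing only gives approximate ranges.
    """
    if num_vectors < 0:
        raise ValueError("num_vectors must be non-negative")
    if num_vectors < 10_000:
        return "Flat"      # exhaustive search, 100% accuracy
    if num_vectors < 1_000_000:
        return "IVF"       # quoted at 95-99% accuracy
    if num_vectors <= 10_000_000:
        return "HNSW"      # quoted at near 99% accuracy
    return "IVF+PQ"        # memory-efficient, quoted at 90-95% accuracy


if __name__ == "__main__":
    for size in (5_000, 250_000, 5_000_000, 50_000_000):
        print(size, pick_index_family(size))
```

A helper like this is most useful as a first guess before benchmarking; recall and latency on the real data should decide the final index.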
It includes Python examples such as IndexFlatL2 with 128-dimensional vectors and k-neighbor search snippets developers can paste into RAG or recommendation services. Reach for faiss when you must pick an index family before wiring embedding storage rather than after latency problems appear in production.

516 installs

6. Serving Llms Vllm

serving-llms-vllm is an orchestra-research skill covering vLLM performance mechanics: PagedAttention block allocation that cuts KV-cache fragmentation, continuous batching for variable sequence lengths, prefix caching to reuse shared prompt blocks, and speculative decoding setup guidance. It contrasts traditional contiguous KV caches that waste roughly 50% GPU memory with paged block queues, citing examples like 160GB KV-cache demand for a 70B model under traditional layouts. Developers reach for it when standing up or tuning a private OpenAI-compatible inference endpoint on GPUs. The skill focuses on throughput and memory efficiency rather than model training.

516 installs

7. Tensorrt Llm

tensorrt-llm is a version 1.0.0 Orchestra Research agent skill that documents production LLM inference with NVIDIA TensorRT-LLM on A100, H100, and GB200 GPUs. It covers pip and Docker installation, the Python LLM API, trtllm-serve with --tp_size and --max_batch_size, and parallelism strategies including tensor parallelism for same-node sharding, pipeline parallelism for 405B-class models, and expert parallelism for MoE architectures like Mixtral. Reference guides span optimization, multi-GPU setup, and serving, with benchmarks citing up to 24,000 tokens/sec on Llama 3-8B and 100× faster inference versus PyTorch in documented H100 tests.
Developers reach for tensorrt-llm when a model exceeds single-GPU memory, when NVLink or InfiniBand multi-node serving is required, or when FP8 quantization can halve memory on H100; prefer vLLM or llama.cpp when hardware is non-NVIDIA or setup simplicity matters more.

516 installs

8. Academic Plotting

Academic Plotting is a pattern library skill from orchestra-research/ai-research-skills for producing polished ML paper figures with matplotlib and seaborn. The skill ships serif publication defaults (Times New Roman, DejaVu Serif, tuned axis and legend sizes) and reusable layout recipes for experiment plots researchers reuse across papers. Developers invoke it when notebook charts look generic or fail venue formatting expectations and they want consistent typography, tick formatters, and color palettes without re-deriving rcParams each project. The workflow centers on Python imports, rcParams updates, and seaborn integration to export camera-ready figures suitable for arXiv, conference submissions, or internal research memos.

493 installs

9. Ray Train

ray-train is an agent skill from orchestra-research/ai-research-skills covering Ray Train multi-node architecture with one head node, multiple worker nodes, and an Apache Arrow Plasma object store for shared memory. Developers follow ray start --head on port 6379 with dashboard at 8265, then connect workers via ray start --address. The skill targets distributed PyTorch training through TorchTrainer when single-machine jobs hit memory or throughput limits. Manual cluster setup steps and local multi-node patterns are documented for reproducible ML scaling without rewriting training loops from scratch.

463 installs

10. Prompt Guard

prompt-guard is a security skill for developers shipping LLM applications that must filter hostile inputs at the boundary.
It wraps Meta's 86M-parameter Prompt Guard model (version 1.0.0) to classify prompt injections and jailbreak attempts with 99%+ true-positive rate, under 1% false-positive rate, and sub-2ms GPU inference. The skill supports HuggingFace deployment and batch processing for RAG pipelines, with multilingual coverage across 8 languages. Dependencies include transformers and torch. Teams invoke prompt-guard when building chatbots, agents, or retrieval systems that ingest third-party documents and need input validation before model calls.

449 installs

11. Stable Diffusion Image Generation

stable-diffusion-image-generation is an advanced usage guide for building Stable Diffusion pipelines with Hugging Face diffusers. It shows how to load UNet2DConditionModel, AutoencoderKL, DDPMScheduler, CLIPTextModel, and CLIPTokenizer components individually from stable-diffusion-v1-5 and assemble them into custom StableDiffusionPipeline instances with PyTorch. Developers reach for this skill when embedding text-to-image generation in agents, research tooling, or backend services that need fine-grained control over pipeline components and denoising rather than a hosted image API.

449 installs

12. Huggingface Accelerate

huggingface-accelerate is an orchestra-research ai-research-skills module documenting how to extend Hugging Face Accelerate beyond built-in DDP, FSDP, and DeepSpeed strategies using custom plugins and dataclass-based configuration. The skill walks through Base Plugin structure with `DistributedDataParallelKwargs`, validation in `__post_init__`, and wiring plugins into `Accelerator` initialization for multi-GPU or mixed-precision runs. Developers reach for huggingface-accelerate when standard Accelerate presets fail to encode team-specific distributed behavior or kwargs that must stay consistent across training jobs.
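The dataclass-plus-`__post_init__` validation pattern named in the huggingface-accelerate entry can be sketched with the standard library alone. The class `TeamDDPSettings`, its fields, and its defaults are hypothetical stand-ins chosen for illustration; they are not Accelerate's own plugin classes, and passing the resulting dictionary on to Accelerate is left as an assumption.

```python
from dataclasses import asdict, dataclass

# Assumed set of precision labels for this sketch only.
ALLOWED_PRECISIONS = ("no", "fp16", "bf16")


@dataclass
class TeamDDPSettings:
    """Hypothetical team-wide distributed settings, validated on creation."""

    find_unused_parameters: bool = False
    bucket_cap_mb: int = 25
    mixed_precision: str = "no"

    def __post_init__(self) -> None:
        # Fail fast so every training job sees the same, valid settings.
        if self.bucket_cap_mb <= 0:
            raise ValueError("bucket_cap_mb must be positive")
        if self.mixed_precision not in ALLOWED_PRECISIONS:
            raise ValueError(
                f"mixed_precision must be one of {ALLOWED_PRECISIONS}"
            )

    def to_kwargs(self) -> dict:
        """Return the settings as a plain dict of keyword arguments."""
        return asdict(self)


if __name__ == "__main__":
    print(TeamDDPSettings(mixed_precision="bf16").to_kwargs())
```

Validating in `__post_init__` means a bad value is rejected when the configuration object is built, before any process group or GPU is touched.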
Examples use Python imports from `accelerate` and `accelerate.utils`, emphasizing predictable distributed training rather than one-off notebook hacks.

446 installs

13. Ray Data

ray-data is an orchestra-research ai-research-skills guide for integrating Ray Data with Ray Train and ML frameworks. It demonstrates `ray.data.read_parquet` for train and validation paths, `TorchTrainer` with `ScalingConfig`, and `ray.train.get_dataset_shard` inside `train_func` to iterate `iter_batches` per epoch. Developers adopt ray-data when S3 or parquet-backed datasets must shard across workers without hand-written partition logic between Ray and PyTorch or TensorFlow. The skill focuses on backend data plumbing (launching trainers, fetching dataset shards, and batching) rather than model architecture design. Examples use Ray Train torch integrations with explicit batch_size configuration in training loops.

443 installs

14. Quantizing Models Bitsandbytes

quantizing-models-bitsandbytes is an AI research skill from orchestra-research/ai-research-skills that teaches memory optimization for Hugging Face and PyTorch workflows. The skill walks through bitsandbytes BitsAndBytesConfig setup, CPU offloading of weights, gradient checkpointing, 8-bit and paged optimizers, and FP16/BF16 mixed-precision training. Its memory optimization guide cites 50-75% VRAM reduction from quantization alongside offload and checkpointing tactics. Developers reach for quantizing-models-bitsandbytes when a Transformers AutoModelForCausalLM load fails with CUDA OOM or when fine-tuning needs a smaller memory footprint without abandoning the target model size. The skill pairs concrete Python snippets with strategy tradeoffs between compute cost and resident GPU memory during training runs.

442 installs

15. Unsloth

unsloth is an orchestra-research/ai-research-skills index over Unsloth open-source LLM fine-tuning and reinforcement learning documentation spanning 136 pages in its llms-txt category.
It links getting-started paths, system and GPU VRAM requirements, beginner notebooks, and FAQ guidance on whether fine-tuning beats RAG for a given use case. Developers reach for unsloth when they need faster training on limited VRAM and want structured answers before installing Unsloth or opening training notebooks. The skill organizes Unsloth docs into navigable categories rather than executing training jobs itself.

441 installs

16. Knowledge Distillation

knowledge-distillation is a Claude skill from orchestra-research/ai-research-skills covering MiniLLM, the reverse-KL knowledge distillation method from arXiv paper 2306.08543. Standard forward KL minimization, KL(Teacher || Student), is mode-covering: the student spreads probability mass over every region the teacher covers and so overestimates the teacher's low-probability regions, which hurts generation quality. MiniLLM replaces forward KLD with reverse KLD, KL(Student || Teacher), which is mode-seeking and performs better on generative language models. The skill references the Microsoft LMOps MiniLLM GitHub implementation and explains when reverse KL is the better divergence choice. ML engineers reach for knowledge-distillation when compressing large LLMs into deployable student models for inference cost reduction without sacrificing generative quality. The guide connects paper theory to practical training decisions around divergence choice and student capacity planning.

438 installs

17. Evaluating Llms Harness

evaluating-llms-harness is an orchestra-research/ai-research-skills guide for running lm-evaluation-harness benchmarks against closed and compatible API models. The skill explains the unified TemplateAPI interface for evaluating OpenAI completions, Anthropic Claude models, local OpenAI-compatible endpoints, and custom API backends. Developers use it to compare API model quality against open models, validate performance before shipping agents, and track regressions when providers update models. The readme covers supported providers, request types, and logprobs behavior in a comparison table format.
Reach for this skill when model selection, not implementation, is the blocking decision for an agent or LLM-backed product.

437 installs

18. Langchain

LangChain is a build guide skill from orchestra-research/ai-research-skills for constructing LangChain agents that combine language models with callable tools. The skill documents the ReAct pattern (reason, act, observe, loop) using create_agent, ChatAnthropic, and Python tool definitions such as calculators or search helpers. Developers reach for it when prototyping agent features that must decide actions, execute functions, ingest observations, and stream tokens to clients. Coverage spans basic agent creation, tool binding, and streaming configuration so teams avoid ad-hoc prompt loops when shipping production agent endpoints in Python services.

437 installs

19. Weights And Biases

weights-and-biases is a research-oriented agent skill that teaches solo and indie ML builders how to use Weights & Biases Artifacts and the Model Registry as the system of record for datasets, checkpoints, preprocessing outputs, and evaluation bundles. Instead of losing track of which CSV or torch.save file belonged to which experiment, you log versioned artifacts with descriptions and metadata, attach files or cloud references, and rely on W&B’s deduplicated storage and lineage graph to see which runs consumed or produced each artifact. The guide walks through wandb.init, constructing dataset and model artifacts, logging them from training loops, and applying aliases for production promotion workflows common in small teams shipping fine-tuned models or custom eval harnesses. Prism places it on Build integrations because the first value is wiring the SDK into your codebase, but the same patterns support Ship reproducibility checks and Operate model rollbacks.
Agents load it when you ask for artifact versioning, registry aliases, or reproducible ML delivery without rebuilding tribal knowledge from scattered notebooks.

436 installs

20. Crewai Multi Agent

crewai-multi-agent is a Claude skill from orchestra-research/ai-research-skills that documents CrewAI Flows for event-driven multi-agent orchestration in Python. Flows provide conditional branching, complex state management, and event-driven workflows beyond what sequential or hierarchical Crews alone offer. The skill explains when to choose Flows versus Crews using a six-scenario comparison table, then walks through creating a Flow class with Pydantic BaseModel state and decorators including @start, @listen, @router, or_, and and_. Developers reach for crewai-multi-agent when building agents that need router-based path selection, durable typed state across steps, or hybrid patterns that embed Crew tasks inside Flow steps. Example code imports from crewai.flow.flow and pydantic BaseModel, making the skill directly applicable to production Python agent backends.

435 installs

21. Deepspeed

deepspeed is an agent skill from orchestra-research/ai-research-skills that indexes Microsoft DeepSpeed MoE documentation and blog posts for large language model training. The bundled notes cover MoE NLG architecture, training infrastructure, dataset choices, and published results such as 8x larger MoE model training and roughly 5x lower training cost claims from DeepSpeed-MoE releases. Developers reach for deepspeed when planning multi-GPU or multi-node fine-tuning and pretraining runs where expert parallelism, memory optimization, and cost tradeoffs must be justified before picking frameworks. The skill acts as a curated lookup layer over deepspeed.ai sources rather than a runtime installer.

435 installs

22. Pytorch Lightning

pytorch-lightning is an orchestra-research/ai-research-skills entry listed on skills.sh with 1 install for guiding PyTorch Lightning experiment code.
The skill helps developers structure LightningModule classes, configure Trainer loops, and refactor raw PyTorch training scripts into Lightning conventions during ML backend work. Developers reach for pytorch-lightning when experiments grow beyond ad-hoc scripts and need standardized training, logging, and module boundaries. The catalog entry confirms it ships from orchestra-research/ai-research-skills as a focused Lightning reference for coding agents assisting on training code.

435 installs

23. Llama Cpp

llama-cpp is an Orchestra Research agent skill for developers running local GGUF models through llama.cpp who need higher token throughput on CPU or hybrid GPU setups. The skill guides thread tuning with the -t flag using physical core counts, enabling OpenBLAS via LLAMA_OPENBLAS=1 for roughly 2–3× matrix speedups, and GPU layer offloading with -ngl including hybrid mode when VRAM is limited. Developers reach for llama-cpp when inference is CPU-bound, partial GPU offload causes OOM, or batch and context settings waste memory. The workflow systematically benchmarks -t, -ngl, and batch flags rather than guessing defaults on new hardware.

434 installs

24. Qdrant Vector Search

qdrant-vector-search is an orchestra-research/ai-research-skills guide for production Qdrant deployments beyond single-node prototypes. The skill documents distributed cluster setup with docker-compose across 3 nodes, Raft consensus coordination, and per-node HTTP, gRPC, and P2P port configuration with QDRANT__CLUSTER__ENABLED environment variables. Developers use it when RAG or semantic search backends need sharded collections, persistent volumes, and multi-node Qdrant rather than embedded local vector stores.
The advanced usage guide targets backend engineers wiring retrieval infrastructure that must scale horizontally and survive node failures during agent or API product builds.

434 installs

25. Sglang

sglang is an Orchestra Research agent skill (version 1.0.0, 435 lines plus 3 reference guides) for high-performance SGLang LLM and VLM serving with RadixAttention prefix caching. The skill walks through pip install, python -m sglang.launch_server launch flags, structured generation with JSON schemas and regex constraints, agent workflows with function calling, and an OpenAI-compatible /v1/chat/completions endpoint on port 30000. Production deployment coverage includes multi-GPU tensor parallelism (--tp 4 for Llama 3-70B), FP8/AWQ/GPTQ quantization, Docker images, Kubernetes manifests with liveness probes, Prometheus metrics, NGINX load balancing, and HPA autoscaling. Benchmarks in the skill cite 5× faster agent workloads and 10× faster few-shot prompting versus vLLM, with support for 100+ HuggingFace models. Reach for sglang when shipping GPU-backed LLM APIs that need constrained decoding, repeated system prompts, or production monitoring, not for simple one-off text generation without structure.

434 installs

26. Whisper

whisper is a language support reference skill from orchestra-research/ai-research-skills for OpenAI Whisper multilingual speech-to-text. It documents 99 supported languages grouped by expected word-error rate: top-tier WER under 10% for English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Japanese, Korean, and Chinese; good support at WER 10–20% for Arabic, Turkish, Vietnamese, Swedish, Finnish, and additional locales; plus a full alphabetical list from Afrikaans through major world languages.
Developers reach for whisper when picking `language` parameters, estimating transcription quality, or planning multilingual agent voice input before deploying STT in production pipelines.

434 installs

27. Dspy

DSPy is a skill-sized pattern library for turning language-model behavior into composable Python modules that solo builders can test, optimize, and ship. It starts with a minimal RAG flow: retrieve top-k passages, join them into context, and run a ChainOfThought signature to produce an answer, then shows how to wire a real vector store through ChromadbRM and global settings. The optimized RAG section introduces BootstrapFewShot with labeled Examples and a correctness metric, which is the bridge from demo prompts to measurable iteration. Additional sections in the source material walk through agent systems, classification, data processing, and multi-stage pipelines, which is useful when your agent product needs more than one LM call in sequence. Reach for this skill when you are past raw API prompts and want signatures, modules, and teleprompters that a coding agent can extend. It assumes comfort with Python and an existing corpus or labels for optimization; it is not a hosted vector DB or a deployment platform by itself.

433 installs

28. Gguf Quantization

gguf-quantization is a Claude Code skill from orchestra-research/ai-research-skills covering advanced llama.cpp workflows for GGUF models. It walks through speculative decoding with draft models via llama-speculative, self-speculative lookup caches, batched Python inference with llama_cpp.Llama, and custom GGUF conversion for q4_k_m and related quant formats.
Developers reach for gguf-quantization when self-hosting agent inference on CPU or GPU and need lower latency, higher batch throughput, or smaller on-disk model footprints without retraining.

433 installs

29. Grpo Rl Training

grpo-rl-training is a procedural library of Group Relative Policy Optimization (GRPO) reward functions for solo builders and small teams fine-tuning language models with verifiable or structured outputs. Instead of inventing reward logic from scratch, you adapt pre-defined correctness rewards (exact and fuzzy match), format penalties, length controls, and style signals that mirror battle-tested training setups. The skill fits when you already have a GRPO trainer and need consistent, weighted objectives for math, Q&A, summarization, or formatted agent responses. It matters because mis-specified rewards silently waste GPU time and produce models that look fluent but fail grading or schema checks. Treat it as reference code to wire into your training script, not a full training orchestration skill.

433 installs

30. Verl Rl Training

verl-rl-training is an orchestra-research/ai-research-skills API reference for VERL distributed reinforcement learning on large language models. The skill covers RayPPOTrainer as the central training loop controller with init_workers and fit calls, plus ResourcePoolManager allocating GPUs across worker groups via Ray PlacementGroups. Example resource_pool_spec maps actor_rollout_ref to 4 GPUs and critic to 2 GPUs. Developers reach for verl-rl-training when configuring PPO-based RL fine-tuning pipelines that need Ray-coordinated rollout, actor, and critic workers rather than single-GPU supervised training. The skill fits ML backend engineers wiring VERL configs during LLM alignment experiments.

433 installs

31. Audiocraft Audio Generation

audiocraft-audio-generation is an Orchestra Research agent skill in orchestra-research/ai-research-skills for Meta AudioCraft text-to-music and text-to-audio generation.
It documents MusicGen models from musicgen-small (300M) through musicgen-large (3.3B), AudioGen sound-effect generation, EnCodec compression, melody conditioning, stereo variants, and LoRA fine-tuning with custom datasets resampled to 32 kHz mono WAV plus metadata.json. Quick-start examples use audiocraft.models.MusicGen and HuggingFace transformers with torch>=2.0.0. Advanced references cover dora fine-tuning, FastAPI deployment, Gradio demos, and batch sound-design workflows. Developers reach for audiocraft-audio-generation when building music generation APIs, fine-tuning MusicGen on custom audio corpora, or generating sound effects with facebook/audiogen-medium at 16 kHz output.

432 installs

32. Chroma

chroma is an orchestra-research/ai-research-skills guide for integrating ChromaDB into Python RAG stacks. The skill walks through LangChain's Chroma vectorstore with OpenAI embeddings, persist_directory setup, similarity_search queries, and retriever configuration, plus LlamaIndex's ChromaVectorStore with chromadb PersistentClient collections. Developers use chroma when they need a local or persisted vector index inside LangChain or LlamaIndex rather than hand-rolling chromadb client code. The skill covers from_documents ingestion, k-neighbor retrieval, and retriever wiring so agents can stand up semantic search quickly during backend or agent-knowledge work.

432 installs

33. Instructor

instructor is an orchestra-research/ai-research-skills guide for structured LLM output with the Instructor library and Pydantic models against Claude. The skill demonstrates CompanyInfo extraction from unstructured text, sentiment classification with Enum fields, and response_model wiring on client.messages.create calls using claude-sonnet-4-5-20250929. Developers reach for instructor when API routes or batch jobs need validated JSON-like objects instead of free-form completions.
The examples cover data extraction, classification schemas, and multi-field parsing patterns that reduce post-processing glue code in Python backends serving agents or user-facing apps.

432 installs

34. Llamaindex

llamaindex is an orchestra-research/ai-research-skills guide for building LlamaIndex agents with tools and retrieval. The skill covers FunctionAgent.from_tools with custom Python functions, OpenAI LLM wiring on gpt-4o, and RAG agents that expose VectorStoreIndex query engines through QueryEngineTool. Developers use llamaindex when they need an agent that both calls functions and queries embedded document corpora in one chat loop. Examples progress from a basic multiply tool agent to a RAG agent with named query tools over indexed documents. The skill fits Python backends and research prototypes where LlamaIndex is the orchestration layer for tool-using LLM workflows.

432 installs

35. Modal Serverless Gpu

Modal Serverless GPU is a research-oriented skill package for solo builders and small teams who need serious training capacity without owning hardware. It documents how to define Modal apps, slim Debian images with torch, transformers, accelerate, and deepspeed, and attach explicit GPU shapes from single H100 pods to eight-way A100 fleets. The workflows cover Accelerate-based multi-GPU loops, DeepSpeed-backed Trainer runs with fp16 and gradient accumulation, and the subtle multi-GPU footguns when frameworks re-execute the Python entrypoint; subprocess and ddp_spawn guidance is included for that. Use it while building ML backends, fine-tuning agents, or research prototypes that must scale out temporarily then disappear from your bill. It complements generic DevOps skills by focusing on Modal’s serverless contract rather than raw Kubernetes.
Expect intermediate Python and PyTorch familiarity; outputs are runnable function stubs you adapt to your dataset and checkpoint strategy.

432 installs

36. Tensorboard

Tensorboard is an agent skill that teaches solo builders how to integrate TensorBoard with the ML frameworks they already use. It walks through creating a SummaryWriter, logging scalars at batch and epoch granularity, capturing weight histograms, and exporting computation graphs so training behavior is inspectable instead of opaque console prints. The readme structures integrations by ecosystem (PyTorch and torchvision first, then TensorFlow/Keras, Lightning, HuggingFace, Fast.ai, JAX, and scikit-learn) so you can copy patterns that match your repo rather than guessing APIs. For indie ML products and research spikes, that means faster debugging of loss curves, sane experiment folders, and a repeatable habit before you ship models or hand runs to teammates. Use it when you are implementing or refactoring training code and need observability without bolting on a separate experiment platform on day one.

432 installs

37. Training Llms Megatron

training-llms-megatron is an Orchestra Research skill for Megatron-Core distributed LLM training performance. The skill documents Model FLOP Utilization (MFU) up to 47% on H100 clusters, GPT-3 175B configs with TP=4 and PP=8 across 128–512 GPUs reaching 390 TFlops/GPU, and LLaMA configuration tables spanning tensor parallel (TP), pipeline parallel (PP), and context parallel (CP) settings. Developers reach for training-llms-megatron when sizing parallelism for billion-parameter runs and comparing hardware throughput instead of guessing GPU counts. Benchmark excerpts tie larger model sizes to higher MFU through increased GEMM arithmetic intensity on H100 hardware.

432 installs

38. Clip

clip is an Orchestra Research skill (version 1.0.0, MIT license) for integrating OpenAI’s CLIP vision-language model without fine-tuning.
CLIP was trained on 400M image-text pairs and the skill documents five model variants from RN50 (102M params) through ViT-L/14 (428M params), defaulting to ViT-B/32 for balanced speed and quality. Workflows cover zero-shot classification with clip.tokenize labels, cosine-similarity image-text matching, semantic search over image embedding indexes, batch 10×3 similarity matrices, and NSFW or violence moderation categories. Performance notes cite ~20ms GPU versus ~200ms CPU image encoding on a V100, plus ChromaDB integration for vector storage. Use clip for general-purpose image understanding and search, not for fine-grained detection, LLaVA chat, or SAM segmentation tasks called out as alternatives.

431 installs

39. Pyvene Interventions

pyvene-interventions is an Orchestra Research skill for the pyvene interpretability library on Hugging Face causal LMs. The skill documents IntervenableModel construction with IntervenableConfig and RepresentationConfig, targeting specific layers and components such as block_output with VanillaIntervention types. Developers reach for pyvene-interventions when agents or research scripts need controlled activation swaps between base and source inputs during forward passes. Examples load AutoModelForCausalLM checkpoints like gpt2 and return both original and intervened outputs from a single intervenable call, replacing manual hook registration.

431 installs

40. Autoresearch

autoresearch is an orchestra-research/ai-research-skills reference that mandates setting up a wall-clock loop before any autonomous research begins. The loop fires every 20 minutes with a fixed-interval prompt telling the agent to keep working and check for errors, separate from inner and outer research experiment loops that run at minutes-to-hours cadence. Without this loop, agents complete one cycle and halt instead of continuing overnight or multi-day runs.
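A fixed-interval wall-clock loop of the kind described for autoresearch can be sketched as follows. The `send_prompt` callback, the prompt wording, and the `max_ticks` cap are assumptions added for illustration; the skill's own platform-specific setup is not shown in this listing.

```python
import time
from typing import Callable, Optional

# Hypothetical wording; the skill's actual prompt text is not given here.
KEEP_GOING_PROMPT = "Keep working and check for errors."


def run_wall_clock_loop(
    send_prompt: Callable[[str], None],
    interval_seconds: float = 20 * 60,
    max_ticks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send the same nudge every `interval_seconds`, independent of task state.

    `max_ticks=None` runs until interrupted; a finite value makes the loop
    easy to test. `sleep` is injectable so tests do not wait for real time.
    """
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        sleep(interval_seconds)
        send_prompt(KEEP_GOING_PROMPT)
        ticks += 1
    return ticks


if __name__ == "__main__":
    # Demo with a short interval and three ticks instead of 20 minutes.
    run_wall_clock_loop(print, interval_seconds=0.1, max_ticks=3)
```

Keeping this timer separate from the experiment loops means a stalled or finished inner cycle still receives the next nudge on schedule.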
Developers reach for autoresearch when configuring platform-specific continuity for literature reviews, experiment sweeps, or other open-ended research tasks that must persist without manual nudging.

419 installs

41. Awq Quantization

awq-quantization is an Orchestra Research agent skill for ML engineers deploying open-weight LLMs who need smaller, faster models without unacceptable quality loss. AWQ (Activation-aware Weight Quantization) identifies roughly 1% of salient weights by examining activation distributions, applies mathematical scaling to protect critical channels, and quantizes remaining weights to 4-bit with minimal error using the core formula L(s) = ||Q(W * s)(s^-1 * X) - W * X||. The skill guides kernel selection between GEMM and GEMV based on batch size and latency targets, and contrasts AWQ tradeoffs against GPTQ for production inference. Reach for awq-quantization when GPU memory limits block a model, batch inference needs throughput tuning, or 4-bit weights must ship to llama.cpp or vLLM runtimes.

401 installs

42. Axolotl

axolotl is a Claude Code skill for ML developers fine-tuning large language models who need guided access to Axolotl's extensive Python API documented across 150 pages. The skill covers modules including cli.cloud.modal_ with ModalCloud and run_cmd for Modal Volume workflows, core.trainers.base with AxolotlTrainer classes, and related training configuration APIs from docs.axolotl.ai. Developers reach for axolotl when customizing fine-tuning pipelines, launching training on Modal Cloud, or debugging trainer configuration without manually paging through the full API reference. It suits backend ML engineers shipping custom LLM training jobs who need accurate API usage examples and module navigation during build.

401 installs

43. Fine Tuning With Trl

fine-tuning-with-trl is a research-backed Claude Code skill from orchestra-research/ai-research-skills for aligning open models with TRL DPOConfig instead of guessing loss types.
The guide documents 10+ DPO loss variants—including sigmoid, IPO, hinge, robust, and BCO—with formulas, when-to-use notes, and copy-paste Python configs covering beta, batch size, and learning rate. Developers reach for fine-tuning-with-trl when wiring preference datasets into Hugging Face TRL trainers and need a defensible loss choice before spending GPU hours on a failed alignment run.401installs 44Hqq Quantizationhqq-quantization is an Orchestra Research skill for Half-Quadratic Quantization (HQQ) on PyTorch causal language models. The skill walks through HQQLinear backend selection keyed to CUDA compute capability—marlin on Ampere (cap ≥80), aten on Volta/Turing (≥70), and pytorch_compile on older GPUs—and shows per-layer backend assignment for mixed-precision stacks. Developers reach for hqq-quantization when a Hugging Face or custom transformer exceeds GPU memory at fp16/bf16 and naive 4-bit paths fail or underperform. The guide covers custom backend configuration, hardware-aware defaults, and layer-level tuning so locally served LLMs stay fast without trial-and-error kernel swaps.401installs 45Sentence Transformerssentence-transformers is an orchestra-research/ai-research-skills model selection guide for RAG embedding backends. The skill compares all-MiniLM-L6-v2 at 384 dimensions and roughly 2000 sentences per second for prototyping, all-mpnet-base-v2 at 768 dimensions and roughly 600 sentences per second for production RAG, and all-roberta-large-v1 at 1024 dimensions and roughly 300 sentences per second for highest accuracy. It also covers paraphrase-multilingual-MiniLM-L12-v2 supporting 50+ languages at 384 dimensions. 
Developers reach for sentence-transformers when embedding choice, not vector database operations, is the bottleneck before ingestion into Chroma, Qdrant, or similar stores.
401 installs

46 Llamaguard
LlamaGuard is an agent skill package that teaches you how to run Meta’s specialized moderation model to classify chat for policy violations before and after your main LLM generates text. It targets builders shipping agents, copilots, or APIs who need a dedicated safety layer instead of hoping the base model self-censors. The skill documents six hazard categories, from violence and hate through criminal planning, and walks through HuggingFace transformers setup, optional vLLM serving, and enterprise paths such as SageMaker, plus integration notes with NeMo Guardrails. You get concrete Python for applying the chat template, generating a short safety verdict, and interpreting outputs like unsafe with a category code. Use it when compliance, brand risk, or platform rules require systematic filtering at scale, and you want a reproducible deployment recipe rather than ad-hoc keyword blocklists.
400 installs

47 Openrlhf Training
openrlhf-training is an Orchestra Research agent skill for ML engineers running RLHF alignment who must choose among OpenRLHF reinforcement learning algorithms and tune training flags. OpenRLHF supports 6 RL algorithms selectable via --advantage_estimator: gae for PPO with Generalized Advantage Estimation, reinforce for REINFORCE++, reinforce_baseline, group_norm for GRPO, dr_grpo for Dr. GRPO without std normalization, and rloo for REINFORCE Leave-One-Out (RLOO). The skill compares algorithm formulas, critic requirements, and batch assumptions so developers match PPO, GRPO, or REINFORCE++ variants to hardware and reward signal characteristics. 
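To make the difference between the critic-free estimators concrete, here is a minimal pure-Python sketch of how group-normalized (GRPO-style), mean-only (Dr. GRPO-style), and leave-one-out (RLOO-style) advantages can be computed for one prompt's group of sampled responses. The function names, the use of population standard deviation, and the example rewards are assumptions for illustration; this is not OpenRLHF's implementation.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)  # population std is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Same centering as above, but without dividing by the std."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out: each sample's baseline is the mean of the other samples."""
    k, total = len(rewards), sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
print(grpo_advantages(rewards))
print(dr_grpo_advantages(rewards))
print(rloo_advantages(rewards))
```

None of the three needs a learned critic, which is the main practical contrast with gae; they differ only in how the per-group baseline and scale are chosen.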
Reach for openrlhf-training when starting a new alignment run, switching estimators after instability, or benchmarking RL algorithms on the same base model.400installs 48Optimizing Attention Flashoptimizing-attention-flash is a benchmark-driven skill from orchestra-research/ai-research-skills comparing standard attention against Flash Attention 2 and Flash Attention 3 across NVIDIA A100 80GB and H100 80GB GPUs. Tables report forward-pass milliseconds at sequence lengths 512 through 8192 with batch=8, heads=32, dim=64—showing up to 3.3× speedup for FA2 at 8192 tokens on A100. Developers reach for optimizing-attention-flash when long-context training or inference is memory- or latency-bound and they need evidence-backed Flash Attention version and GPU selection.400installs 49Outlinesoutlines is an Orchestra Research skill for the Outlines structured-generation library across local and API model backends. The skill documents Transformers (Hugging Face) setup with CUDA device options, vLLM and llama.cpp local paths, and OpenAI API configuration, plus performance comparisons and production deployment patterns. Developers reach for outlines when agent tools must return valid JSON or Pydantic-bound objects instead of free-form text that breaks downstream parsers. Examples include outlines.models.transformers loading models like microsoft/Phi-3-mini-4k-instruct and outlines.generate.json binding outputs to typed schemas.400installs 50Sentencepiecesentencepiece is a tokenizer training skill from orchestra-research/ai-research-skills focused on the SentencePiece library's BPE and Unigram modes. It explains merge-based BPE training with worked corpus iterations—such as merging 'e'+'s' then 'es'+'t'—and contrasts Unigram probabilistic segmentation plus subword regularization tradeoffs. Developers reach for sentencepiece when building language-agnostic vocabularies, choosing BPE vs Unigram for a domain corpus, or generating `.model` files before Hugging Face or custom training. 
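The merge sequence mentioned above ('e'+'s', then 'es'+'t') can be reproduced with a toy BPE loop. This is a generic sketch of merge-based BPE in plain Python, not SentencePiece's own trainer; the four-word corpus and its frequencies are illustrative assumptions, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (assumed): word -> frequency, split into characters plus an end marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(2):
    pairs = count_pairs(vocab)
    best = max(pairs, key=pairs.get)  # first pair with the highest count
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges)  # [('e', 's'), ('es', 't')]
```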
The guide includes Python snippets using `import sentencepiece as spm` and `spm.SentencePieceTrainer` patterns for reproducible vocabulary creation.400installs 51Sparse Autoencoder Trainingsparse-autoencoder-training is an Orchestra Research skill for SAELens sparse autoencoder workflows on transformer activations. The skill documents SAE.from_pretrained loading from official releases such as gpt2-small-res-jb, HuggingFace repos, or local disk paths, plus core attributes including W_enc and W_dec weight matrices with documented shapes. Developers reach for sparse-autoencoder-training when they need feature dictionaries, sparsity metrics, and activation decomposition for mechanistic interpretability on hooks like blocks.8.hook_resid_pre. The guide covers pretrained SAE retrieval, CUDA device placement, and inspection patterns for encoder-decoder weights and sparsity statistics.400installs 52Huggingface Tokenizershuggingface-tokenizers is a deep-dive skill from orchestra-research/ai-research-skills on subword tokenization algorithms used in Hugging Face workflows. It walks through Byte-Pair Encoding merge steps with worked corpus examples, WordPiece likelihood scoring, and Unigram probabilistic segmentation so developers understand why tokens split the way they do. The guide is aimed at engineers fixing vocabulary mismatches, reproducing training tokenization, or choosing an algorithm before pretraining or fine-tuning. Agents use it when debugging OOV behavior, explaining merge tables, or aligning custom corpora with standard Hugging Face tokenizer implementations in Python.399installs 53Long Contextlong-context is a research synthesis skill from orchestra-research/ai-research-skills comparing three major context-extension techniques: YaRN (arXiv 2309.00071), ALiBi, and position interpolation. YaRN extends RoPE models to 128k+ tokens with roughly 10× less training data than prior methods via NTK-aware interpolation and attention temperature scaling. 
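As a rough illustration of the simplest of the three techniques, the sketch below applies linear position interpolation to RoPE angles: positions are scaled by the ratio of training length to target length so that long-context positions map back into the range seen during training. It does not implement YaRN's NTK-aware scaling or attention temperature; the function name and the 4096 to 16384 lengths are assumptions for illustration.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale < 1 compresses positions."""
    return [(position * scale) * base ** (-2 * i / dim) for i in range(dim // 2)]

# Hypothetical context extension from 4096 to 16384 tokens.
train_len, target_len = 4096, 16384
scale = train_len / target_len  # 0.25

# With interpolation, position 400 in the extended model gets the same
# angles that position 100 had in the original model.
original = rope_angles(100, dim=8)
interpolated = rope_angles(400, dim=8, scale=scale)
print(original)
print(interpolated)
```

The trade-off is resolution: neighbouring positions end up closer together in angle space, which is one reason interpolation is usually paired with some fine-tuning.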
The skill explains when each method fits agent products needing larger context windows and what trade-offs appear in fine-tuning cost, extrapolation quality, and implementation complexity.399installs 54Nanogptnanogpt is an Orchestra Research AI Research Skills package teaching Karpathy-style educational GPT implementation in approximately 300 lines of PyTorch (283 lines in model.py plus reference docs). The skill covers CausalSelfAttention with multi-head masked self-attention, token embeddings, feed-forward blocks, and a clean GPT-2 configuration suitable for training experiments. Developers reach for nanogpt when learning transformer architecture, prototyping small language models, or teaching LLM internals without navigating full frameworks like Hugging Face Transformers or distributed training stacks. It ships as part of the 86-skill ai-research-skills library installable via npx @orchestra-research/ai-research-skills and targets hands-on architecture comprehension over production-scale training.399installs 55Nemo Guardrailsnemo-guardrails is an Orchestra Research skill (version 1.0.0, MIT license) for NVIDIA NeMo Guardrails runtime safety on LLM applications. The skill documents Colang 2.0 DSL rails for jailbreak detection, input and output validation, fact-checking, hallucination detection, PII filtering, and toxicity detection, with production deployment guidance including T4 GPU operation. Developers reach for nemo-guardrails when an agent pipeline or chat API needs programmable safety layers beyond a single system prompt. The skill covers quick-start wiring, nemoguardrails dependency setup, and Colang flow patterns so guardrails enforce policy before responses reach end users.399installs 56Peft Fine Tuningpeft-fine-tuning is an AI research skill for advanced parameter-efficient fine-tuning of causal LMs with Hugging Face PEFT. 
The guide configures LoraConfig with r=16, lora_alpha=32, and target_modules q_proj, v_proj, k_proj, and o_proj, enabling DoRA via use_dora=True for weight-decomposed adaptation that often beats standard LoRA on instruction-following tasks at roughly 10% higher memory from magnitude vectors. It also covers AdaLoRA adaptive rank allocation and LoRA+ learning-rate splits for quality-critical fine-tunes. Developers reach for peft-fine-tuning when GPU memory blocks full fine-tuning and they need documented adapter variant selection instead of default LoRA settings that underperform on instruction data.399installs 57Langsmith Observabilitylangsmith-observability is an Orchestra Research agent skill for developers building LLM agents who need structured evaluation and tracing through LangSmith before shipping to production. The skill configures evaluate() runs against test datasets, implements custom Python evaluators that score accuracy and other metrics from run outputs, and sets up LLM-as-judge evaluators for subjective quality grading. Developers reach for langsmith-observability when agent responses need regression datasets, production traces require scoring dashboards, or custom evaluators must replace manual output review. The workflow connects offline dataset evaluation with ongoing production observability through LangSmith tracing APIs.398installs 58Moe Trainingmoe-training is an architecture research skill from orchestra-research/ai-research-skills covering major MoE model families with parameter counts, routing rules, and layer structures. Mixtral 8x7B is documented with 47B total parameters, 13B active per token across top-2 of 8 experts (~7B each), plus grouped-query attention in a sparse MoE layout. The guide also covers DeepSeek-V3, Google Switch Transformers, and GLaM with a comparison table for routing, expert counts, and activation patterns. 
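The top-2-of-8 routing described for Mixtral can be sketched in a few lines. This toy version assumes gate weights are a softmax over the two selected router logits and uses scalar "experts" so it stays dependency-free; the function names and example logits are hypothetical, and real implementations operate on batched tensors.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax their logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top2 = order[:2]
    m = max(router_logits[i] for i in top2)  # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top2]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top2, exps)]

def moe_layer(x, experts, router_logits):
    """Weighted sum of the two selected experts; the other six never run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Eight toy experts: expert k multiplies its input by k.
experts = [lambda x, k=k: k * x for k in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0]

print(top2_route(logits))
print(moe_layer(2.0, experts, logits))
```

Only two expert functions are evaluated per token, which is why active parameters per token are a fraction of the total parameter count.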
Developers reach for moe-training when choosing a sparse architecture, explaining expert routing to teammates, or grounding training plans in published MoE designs before implementation.398installs 59Nemo Curatornemo-curator is an Orchestra Research agent skill for ML engineers preparing training or RAG corpora who must remove duplicate and near-duplicate documents at scale. The skill configures NeMo Curator modules including ExactDuplicates with md5 or sha256 hashing, FuzzyDuplicates using MinHash plus LSH with configurable hash permutations, and semantic deduplication for paraphrased content. Exact deduplication runs roughly 16× faster on GPU versus CPU according to the guide. Developers reach for nemo-curator when fine-tuning datasets contain repeated crawled pages, RAG indexes return redundant chunks, or corpus size bloats storage and training cost. The workflow selects the right dedup tier by match type and corpus scale.398installs 60Segment Anything Modelsegment-anything-model is a Claude Code skill from orchestra-research/ai-research-skills for adding Meta segmentation to products. It covers SAM 2 video segmentation with build_sam2_video_predictor, init_state on video files, point prompts via add_new_points, and propagate_in_video mask streaming. Grounded SAM text-to-mask workflows connect natural-language prompts to masks for agent vision pipelines. Developers reach for segment-anything-model when shipping interactive segmentation, video object tracking, or text-conditioned masking without writing integration code from scratch.398installs 61Simpo Trainingsimpo-training is a dataset preparation skill from orchestra-research/ai-research-skills for SimPO (Simple Preference Optimization) alignment workflows. It defines required JSON fields—`prompt`, `chosen`, and `rejected`—with auto-detected aliases such as `question`, `instruction`, `response_chosen`, `winner`, and `response_rejected`. 
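A minimal sketch of the alias handling described above: each record is mapped onto the three required fields, trying the canonical name first and then the aliases. The grouping of aliases under each field (for example, treating `winner` as an alias of `chosen`) is an assumption made for illustration, and `normalize_record` is a hypothetical helper, not part of any SimPO trainer.

```python
# Assumed alias grouping; the canonical name is tried first.
ALIASES = {
    "prompt": ("prompt", "question", "instruction"),
    "chosen": ("chosen", "response_chosen", "winner"),
    "rejected": ("rejected", "response_rejected"),
}

def normalize_record(record):
    """Return a dict with exactly the keys prompt, chosen, rejected."""
    out = {}
    for field, names in ALIASES.items():
        for name in names:
            if name in record:
                out[field] = record[name]
                break
        else:
            raise KeyError(f"no value for {field!r}; tried {names}")
    return out

raw = {"question": "What is a qubit?",
       "winner": "A two-level quantum system.",
       "response_rejected": "A kind of bit."}
print(normalize_record(raw))
```

Raising on a missing field, rather than silently skipping the record, keeps malformed exports from reaching the trainer unnoticed.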
Worked examples show quantum-computing prompt pairs with preferred and rejected completions so agents can normalize messy exports into SimPO-ready files. Developers reach for simpo-training when converting human preference logs, ranking exports, or RLHF datasets into the schema SimPO trainers expect before launching fine-tuning jobs on instruction models.398installs 62Speculative Decodingspeculative-decoding is a Claude Code skill from orchestra-research/ai-research-skills based on the ICML 2024 lookahead decoding paper and LMSYS blog guidance. It reformulates autoregressive generation as Jacobi iteration, parallelizing token updates to achieve roughly 1.5–2.3× speedup without draft models or additional training. The skill contrasts sequential y_t = f(x, y_{1..t-1}) decoding with parallel Jacobi updates and points to the hao-ai-lab LookaheadDecoding repository. Developers reach for speculative-decoding when agent inference is token-latency bound but adding a draft model is impractical.398installs 63Transformer Lens Interpretabilitytransformer-lens-interpretability is a mechanistic interpretability skill from orchestra-research/ai-research-skills centered on TransformerLens `HookedTransformer`. It documents loading GPT-2 and LLaMA-family checkpoints with device and dtype controls, gated-model HF token setup, and hook access on every activation for circuit analysis. Developers use it when tracing attention heads, inspecting weight matrices, or running activation patching on models from `gpt2-small` through `meta-llama/Llama-2-7b-hf`. The API reference tables enumerate `from_pretrained()` parameters—device, dtype, and tokenizer options—so agents can stand up reproducible interpretability notebooks or scripts in PyTorch.398installs 64Constitutional Aiconstitutional-ai is an Orchestra Research skill (version 1.0.0, MIT license) documenting Anthropic’s two-phase Constitutional AI method from arXiv:2212.08073. 
Phase 1 runs self-critique and revision against a written constitution, then fine-tunes with trl SFTTrainer on revised responses. Phase 2 generates comparison pairs, uses AI preference evaluation (RLAIF) instead of human harm labels, trains a RewardTrainer model, and finishes with PPOTrainer RL optimization. Dependencies are transformers, torch, and trl; hardware guidance cites 1× A100 40GB for 7B SL and 2× A100 40GB for RL with policy plus reward model. Three reference files cover constitution design, RLAIF versus RLHF comparison, and chain-of-thought critique prompts. Use constitutional-ai when aligning open models for harmlessness without human red-team labels—not for runtime guardrails-only setups or pre-built moderation APIs.397installs 65Distributed Llm Pretraining Torchtitandistributed-llm-pretraining-torchtitan is an AI research skill for fault-tolerant LLM pretraining with Meta TorchTitan and PyTorch Distributed Checkpoint (DCP). The guide enables checkpoint saves with interval=500, sets folder paths, and supports last_save_model_only exports in bfloat16 to shrink checkpoint size by dropping optimizer state. Developers can exclude keys like data_loader and lr_scheduler from loading when resuming with modified training settings via TOML or CLI flags. Reach for distributed-llm-pretraining-torchtitan when long-running TorchTitan pretraining jobs need interoperable DCP checkpoints, partial reloads, or async save strategies instead of ad hoc torch.save calls that stall distributed training.397installs 66Gptqgptq is an AI research skill for post-training GPTQ quantization with deliberate calibration data selection. The guide explains that calibration computes Hessian weight importance to minimize quantization error, noting good calibration keeps perplexity increases under 1.5% while poor calibration can raise perplexity 5–10% and missing calibration may produce gibberish outputs. 
It recommends 128–256 samples of 512 tokens each (65K–131K total tokens) as the sweet spot, warning that fewer than 64 samples underfit. Developers reach for gptq when deploying INT4 or INT8 compressed LLMs and need a calibration recipe that preserves accuracy rather than default random slices that collapse model quality.397installs 67Guidanceguidance is a Microsoft Guidance backend configuration skill from orchestra-research/ai-research-skills that walks developers through API-based models (Anthropic Claude, OpenAI) and local runtimes (Transformers, llama.cpp). The guide covers basic setup with environment variables or explicit API keys, available model identifiers, backend comparison, performance tuning, and advanced configuration for template-driven generation. Developers reach for guidance when an agent must pick the right Guidance model wrapper, set dtype/device options, or align backend capabilities with structured output constraints in Python. The skill documents concrete import patterns such as `from guidance import models` and `models.Anthropic(...)`, making it a reference for integration rather than open-ended prompt writing.397installs 68Lambda Labs Gpu Cloudlambda-labs-gpu-cloud is an AI research skill for multi-node GPU training on Lambda Labs cloud instances. The guide sets up PyTorch DistributedDataParallel with dist.init_process_group using the NCCL backend, reading RANK, WORLD_SIZE, and LOCAL_RANK from the torchrun launcher environment. Training scripts call torch.cuda.set_device(local_rank) and wrap models in DDP for synchronized gradient updates across nodes. 
Developers reach for lambda-labs-gpu-cloud when scaling fine-tuning or pretraining from a single Lambda GPU to a multi-node cluster and need correct distributed initialization instead of broken single-process scripts on rented A100 or H100 hardware.397installs 69Miles Rl Trainingmiles-rl-training is an enterprise RL configuration skill from orchestra-research/ai-research-skills for the miles framework built on slime. It documents unified FP8 training and inference, INT4 quantization-aware training, Rollout Routing Replay (R3), and speculative RL training atop slime's configuration system and Sample dataclass with `rollout_routed_experts` for MoE routing replay. Developers use it when launching GRPO advantage-estimator jobs on models such as qwen3-30b-a3b with Hugging Face checkpoints. The quick-start CLI example shows `python train.py --advantage-estimator grpo --model-name qwen3-30b-a3b`, making the skill a reference for miles-specific flags beyond base slime arguments.397installs 70Torchforge Rl Trainingtorchforge-rl-training is a distributed RL architecture skill from orchestra-research/ai-research-skills for Meta's torchforge stack. It documents a fully asynchronous system layering application reward models and loss functions atop a Forge API with ForgeActor and Service abstractions, coordinated by Monarch, trained with TorchTitan FSDP, and served generation through vLLM. Developers reach for torchforge-rl-training when designing non-blocking RL loops that separate rollout inference from gradient updates across a PyTorch-native cluster. The architecture diagram in the reference maps application code through Forge API services to Monarch, TorchTitan, and vLLM components for production-scale LLM RL.397installs 71Llama Factoryllama-factory is an agent-assisted workflow skill from orchestra-research/ai-research-skills built on LLaMA-Factory documentation spanning installation, LoRA fine-tuning, weight merging, and chat inference. 
It references a 3-step GPT-OSS LoRA path requiring VRAM above 44 GB on a single GPU with multi-GPU support, plus advanced topics across 14 documented pages including quantization visuals and Web UI screenshots. Developers reach for llama-factory when standing up preference optimization (RLHF, DPO, KTO), merging adapters, or serving tuned models through vLLM or NPU backends. The skill translates LLaMA-Factory CLI and UI options into agent-ready setup sequences for open-weight model customization.396installs 72Llavallava is an AI research skill for training LLaVA vision-language models in two stages. Stage 1 feature alignment pretrains on 558K CC3M image-caption pairs using CLIP ViT-L/14 and Vicuna-7B or LLaMA-2-7B base models via scripts/v1_5/pretrain.sh, taking roughly 20 hours on 8× A100 GPUs. Stage 2 visual instruction tuning fine-tunes on 150K GPT-generated multimodal instruction samples through scripts/v1_5/finetune.sh with JSON conversation formatting. Developers reach for llava when building custom LLaVA checkpoints and need the correct data formats, base model choices, and bash training scripts instead of misconfigured single-stage fine-tunes that fail to align vision and language modules.396installs 73Pytorch Fsdp2pytorch-fsdp2 is an AI research skill based on the official PyTorch tutorial for asynchronous saving with Distributed Checkpoint (DCP), last verified November 2024 and updated September 2025. The guide uses torch.distributed.checkpoint.async_save to move checkpoint writes off the critical training path, noting async save copies model state into internal CPU buffers which adds memory overhead. It pairs FSDP2 sharding patterns with DCP recipes so large LLM training jobs avoid pipeline stalls from synchronous torch.save calls. 
Developers reach for pytorch-fsdp2 when multi-GPU FSDP training pauses noticeably at checkpoint intervals and they need the official async DCP pattern with realistic memory tradeoffs documented.396installs 74Skypilot Multi Cloud Orchestrationskypilot-multi-cloud-orchestration is an AI research skill for SkyPilot multi-cloud GPU job orchestration. The guide defines YAML resources with accelerators like A100:8 and any_of cloud preference lists spanning GCP us-central1, AWS us-west-2, and Azure westus2, plus wildcard regions such as aws us-* for spot capacity. Kubernetes entries can precede public cloud fallbacks, and instance_type constraints like p4d.24xlarge pin specific hardware SKUs. Developers reach for skypilot-multi-cloud-orchestration when GPU training jobs must survive quota limits or regional outages by automatically failing over across clouds instead of maintaining separate launch scripts per provider.396installs 75Slime Rl Trainingslime-rl-training is an AI research skill for Ray-orchestrated RL fine-tuning of agent policies using the slime framework. The architecture splits into three modules: a Data Buffer for prompt initialization, custom data generation, filtering, and rollout sample storage; a Megatron-LM Training module for actor model updates; and a SGLang Rollout module with router for response generation during rollouts. Ray coordinates data flow between buffer, trainer, and rollout workers so RL loops scale across GPUs. Developers reach for slime-rl-training when building agent RL pipelines that need Megatron-scale training integrated with SGLang inference rollouts instead of hand-wiring separate trainer and sampler scripts.396installs 76Autogpt Agentsautogpt-agents is an AI research skill for building custom AutoGPT blocks that plug into autonomous agent pipelines. 
The guide walks through defining Block subclasses with BlockType, input_schema, and output_schema using Pydantic BaseModel classes for typed inputs like query strings and max_results plus structured outputs such as result lists and counts. Developers subclass Block, assign a UUID id, and register the block so AutoGPT orchestration can invoke the logic as a standard pipeline step. Reach for autogpt-agents when extending AutoGPT with domain-specific tools—search, API wrappers, or data transforms—that must run inside agent workflows with validated schemas instead of ad hoc scripts.395installs 77Blip 2 Vision Languageblip-2-vision-language is an AI research skill for fine-tuning Salesforce BLIP-2 vision-language models without guessing freeze layers or PEFT hyperparameters. The guide loads Blip2ForConditionalGeneration from Salesforce/blip2-opt-2.7b in float16, configures LoraConfig with r=16, lora_alpha=32, and target_modules q_proj, v_proj, k_proj, and out_proj, then applies get_peft_model for parameter-efficient training. It also documents Q-Former-only training paths when full language-model adaptation is unnecessary. Developers reach for blip-2-vision-language when building captioning or VQA features on limited GPU memory and need LoRA recipes that preserve BLIP-2 accuracy instead of trial-and-error adapter setup.395installs 78Mamba Architecturemamba-architecture is a model-architecture skill from Orchestra-Research/AI-Research-SKILLs (253 lines plus 3 reference files) that teaches Mamba's Selective SSM (S6) layer mechanics. Unlike fixed-matrix SSMs, Mamba makes state-space parameters input-dependent via Linear_B, Linear_C, and Linear_Δ projections, enabling selective state updates with O(n) complexity reportedly 5× faster than Transformers on long sequences. The skill documents discretization, selective state updates, and configuration choices before implementing linear-time sequence models. 
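The idea of input-dependent state-space parameters can be shown with a scalar toy recurrence. In this sketch a single input channel drives a small diagonal state; B, C, and the step size Δ are computed from the current input by scalar weights that stand in for the Linear_B, Linear_C, and Linear_Δ projections. The discretization used here (exp(ΔA) for the state transition and a simplified ΔB for the input) and all parameter names are assumptions for illustration; it is a sequential loop, not a hardware-aware parallel scan.

```python
import math

def softplus(z):
    """Keeps the step size positive."""
    return math.log1p(math.exp(z))

def selective_scan(xs, A, w_B, w_C, w_delta, b_delta=0.0):
    """One pass over the sequence: O(n) in sequence length."""
    h = [0.0] * len(A)  # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size
        B = [w * x for w in w_B]                 # input-dependent input matrix
        C = [w * x for w in w_C]                 # input-dependent output matrix
        # h_t = exp(delta * A) * h_{t-1} + delta * B_t * x_t
        h = [math.exp(delta * a) * h_n + delta * b * x
             for a, h_n, b in zip(A, h, B)]
        ys.append(sum(c * h_n for c, h_n in zip(C, h)))
    return ys

# Hypothetical parameters: two state dimensions with negative (decaying) A.
ys = selective_scan([1.0, 0.5, -0.2], A=[-1.0, -0.5],
                    w_B=[1.0, 0.3], w_C=[0.7, 1.0], w_delta=0.5)
print(ys)
```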
It is one of 5 skills in the Model Architecture category within a library of 86 production-ready research skills across 22 categories. Developers reach for mamba-architecture when choosing between Mamba, RWKV, or transformer backbones or implementing S6 layers in PyTorch training code.395installs 79Nnsight Remote Interpretabilitynnsight-remote-interpretability is a Claude Code skill from orchestra-research/ai-research-skills that teaches agents to load models through nnsight's LanguageModel wrapper, enter trace() contexts, and inspect or intervene on hidden states during forward passes. The bundled API reference covers GPT-2 and Llama-3.1-8B loading patterns, tokenizer access, device_map placement, and torch_dtype settings so experiments stay close to standard HuggingFace workflows. Developers reach for nnsight-remote-interpretability when they need activation patching, causal tracing, or remote interpretability runs on production-scale transformers rather than one-off notebook scripts. The skill assumes PyTorch familiarity and nnsight installed in the environment.395installs 80Rwkv Architecturerwkv-architecture is a model-architecture skill from Orchestra-Research/AI-Research-SKILLs (253 lines plus 3 reference files) that teaches RWKV's Time-Mixing and Channel-Mixing blocks. The core WKV (Weighted Key-Value) mechanism computes attention-like outputs in O(n) time via recurrence instead of O(n²) softmax attention matrices. The skill contrasts traditional Q@K.T attention with WKV's exponential decay recurrence updating aa and ab accumulators per timestep. RWKV is described as an RNN-transformer hybrid supporting long-context inference, backed by the Linux Foundation project. It is one of 5 Model Architecture skills in the 86-skill AI-Research-SKILLs library. 
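The O(n) claim for WKV can be checked on a scalar toy example by comparing a recurrent form against the direct quadratic sum it replaces. In this sketch `w` is the decay and `u` the bonus applied to the current token; `num_state` and `den_state` play the role of the running accumulators mentioned above. It is numerically naive (production kernels also track a running maximum exponent to avoid overflow), and the variable names and example values are assumptions.

```python
import math

def wkv_recurrent(ks, vs, w, u):
    """O(n): carry two decayed running sums forward one step at a time."""
    num_state, den_state = 0.0, 0.0
    out = []
    for k, v in zip(ks, vs):
        cur = math.exp(u + k)  # current token gets the bonus u
        out.append((num_state + cur * v) / (den_state + cur))
        num_state = math.exp(-w) * num_state + math.exp(k) * v
        den_state = math.exp(-w) * den_state + math.exp(k)
    return out

def wkv_quadratic(ks, vs, w, u):
    """O(n^2): recompute the decayed weighted average from scratch per step."""
    out = []
    for t in range(len(ks)):
        num = math.exp(u + ks[t]) * vs[t]
        den = math.exp(u + ks[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + ks[i])
            num += weight * vs[i]
            den += weight
        out.append(num / den)
    return out

ks = [0.1, -0.3, 0.7, 0.2]
vs = [1.0, 2.0, -1.0, 0.5]
print(wkv_recurrent(ks, vs, w=0.5, u=0.3))
print(wkv_quadratic(ks, vs, w=0.5, u=0.3))
```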
Developers reach for rwkv-architecture when evaluating RWKV versus Mamba or transformers or implementing WKV layers in training code.395installs 81Phoenix Observabilityphoenix-observability is a Claude Code skill from orchestra-research/ai-research-skills that guides agents through Phoenix evals setup, including OpenAIModel-backed llm_classify flows and template-based evaluators with accuracy, completeness, and clarity rubrics scored 1-5. The skill shows how to wire input, output, and reference fields into reusable evaluator functions so teams can compare agent behavior across prompt or model changes. Developers reach for phoenix-observability when LLM responses need structured regression checks rather than ad-hoc manual review. It pairs naturally with Phoenix tracing datasets and production observability pipelines already collecting LLM spans.394installs 82Pineconepinecone is a version 1.0.0 Claude Code skill from orchestra-research/ai-research-skills that teaches agents to provision and query Pinecone's managed vector database for production AI workloads. The skill covers serverless auto-scaling indexes, hybrid search combining dense and sparse vectors, metadata filtering, namespaces, and sub-100ms p95 latency targets suited to RAG pipelines and recommendation systems. It declares a pinecone-client dependency and MIT licensing guidance so integrations stay aligned with Orchestra Research conventions. Developers reach for pinecone when semantic search or retrieval must run on managed infrastructure instead of local embeddings stores or DIY vector hosts.394installs 83Model Mergingmodel-merging is an agent skill in orchestra-research/ai-research-skills that benchmarks merged Hugging Face models using research-grade evaluation suites. 
The guide centers on the Open LLM Leaderboard with six standard tasks—ARC (25-shot science reasoning), HellaSwag (10-shot commonsense), MMLU (5-shot across 57 subjects), TruthfulQA (0-shot factual accuracy), Winogrande (5-shot commonsense), and GSM8K (5-shot math)—plus lm_eval harness usage and MT-Bench-style multi-turn conversation scoring. Developers reach for model-merging after producing merged checkpoints (SLERP, TIES, DARE, or similar) when leaderboard scores, task-level regressions, and conversational layout compatibility must be verified before serving. The workflow documents metrics, comparison frameworks, and quality-assurance checks aligned with published merge research practices. Use when comparing two merge recipes or validating a new merged artifact against base models. Skip for training merges from scratch, dataset curation, or production vLLM deployment tuning without an evaluation pass.393installs 84Evaluating Code Modelsevaluating-code-models is an agent skill that wraps the BigCode Evaluation Harness to benchmark code-generation models across 15+ standardized suites before teams adopt a codegen or agent stack. It documents HumanEval with 164 Python problems, MBPP with 500 entry-level tasks, HumanEval+ with stricter test expansion, and MultiPL-E spanning 18 languages, all scored with pass@k at k=1, 10, and 100. Workflows cover accelerate launch commands, multi-language evaluation, instruction-tuned model runs, and head-to-head model comparisons with configurable temperature, n_samples, and max_length_generation. 
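For readers unfamiliar with how pass@k is computed from n_samples, the commonly used unbiased estimator is 1 - C(n-c, k) / C(n, k), where n is the number of generated samples per problem and c the number that pass the tests. The sketch below is a plain-Python version of that formula for a single problem; the function name is illustrative, and benchmark harnesses average this value over all problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any draw of k contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 pass the unit tests.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

This is why n_samples must be at least as large as the largest k being reported: pass@100 needs at least 100 generations per problem.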
Developers reach for evaluating-code-models when they need reproducible functional-correctness numbers comparable to HuggingFace leaderboards instead of anecdotal code samples, including optional Docker-isolated code execution for untrusted model output.392installs 85Implementing Llms Litgptimplementing-llms-litgpt is a research engineering skill for developers building custom large language model architectures with the LitGPT library in Python. The skill documents how to extend the base GPT class or create entirely new models by modifying core classes in litgpt/model.py: GPT, Block, CausalSelfAttention, MLP, RMSNorm, and LayerNorm. Developers reach for implementing-llms-litgpt when implementing new research architectures, adapting models for specific domains, experimenting with attention mechanisms, or adding custom transformer layers during training or fine-tuning. LitGPT's single-file implementations make architecture changes approachable, and the skill walks through Config dataclass definitions that wire hyperparameters to model components. Use cases include domain-specific model design, attention variant prototyping, and custom layer insertion without navigating a sprawling framework codebase.392installs 86Model Pruningmodel-pruning is an agent skill that teaches developers to apply Wanda (Pruning by Weights and Activations), the ICLR 2024 approach from arXiv 2306.11695, to compress large language models without retraining. The pruning criterion scores each weight as absolute magnitude multiplied by the L2 norm of its input activations, then removes low-importance connections to reach about 50% sparsity with under 1% accuracy loss according to the published results. The skill references the official locuslab/wanda GitHub implementation and explains when magnitude-only pruning fails on rarely activated dimensions. 
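The scoring rule described above (absolute weight times the L2 norm of the corresponding input activation) can be sketched on a tiny matrix. This toy version uses nested lists instead of tensors and prunes the lowest-scoring half of each output row; the function name and example values are assumptions, and it is not the locuslab/wanda code.

```python
import math

def wanda_prune(W, X, sparsity=0.5):
    """W: rows are output neurons, columns are inputs. X: calibration samples x inputs."""
    n_in = len(W[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(sample[j] ** 2 for sample in X)) for j in range(n_in)]
    n_drop = int(n_in * sparsity)
    pruned = []
    for w_row in W:
        scores = [abs(w) * norms[j] for j, w in enumerate(w_row)]
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned

# Input 0 is strongly activated; inputs 1 and 3 are barely activated.
W = [[0.1, -2.0, 0.5, 1.0]]
X = [[10.0, 0.1, 1.0, 0.1],
     [10.0, 0.1, 1.0, 0.1]]
print(wanda_prune(W, X))  # [[0.1, 0.0, 0.5, 0.0]]
```

Magnitude-only pruning would have removed 0.1 and 0.5 here; weighting by activation norm instead keeps the small weight attached to the heavily used input and drops the large weights on inputs that rarely carry signal.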
Developers reach for model-pruning when inference cost or GPU memory limits block shipping an LLM feature and post-training compression is preferable to full fine-tuning or distillation cycles.
392 installs

87 Nemo Evaluator Sdk
nemo-evaluator-sdk is an agent skill for ML engineers integrating custom adapters into the NeMo Evaluator benchmark framework. NeMo Evaluator uses an adapter pipeline where requests pass through chained interceptors before reaching model endpoints and responses are processed on return. The skill guides developers through wiring custom request/response interceptors from the nemo-evaluator core library, configuring the adapter pipeline for endpoint-specific HTTP shaping, authentication headers, payload transforms, and response normalization. Developers reach for nemo-evaluator-sdk when benchmark runs fail because model endpoints need custom request formatting or response parsing that built-in interceptors do not cover. The skill covers the architecture from evaluation engine through interceptor chains to model endpoints, helping teams run reproducible LLM benchmarks against proprietary or self-hosted inference services.
391 installs

88 Systems Paper Writing
systems-paper-writing is a research-oriented agent skill that packages a comprehensive pre-submission checklist for systems conference papers. Indie researchers and small labs shipping serious systems work (not blog posts) use it when a draft nears submission to OSDI, SOSP, ASPLOS, NSDI, or EuroSys and they need a disciplined self-review instead of vague “read it once more” advice. The skill walks through structural completeness: a testable thesis repeated across abstract, introduction, and conclusion; three to five numbered contributions tied to sections and evaluation claims; mandatory section presence from background through related work; and explicit page budgets so design and evaluation stay balanced.
It treats evaluation as first-class (end-to-end results, ablations, and scalability) and related work as differentiated grouping rather than a bibliography dump. It is tagged multi-phase because ideation and writing start earlier, but the skill’s natural invocation point is the Ship-phase review immediately before upload. Pair it with your plotting, benchmarking, and citation skills; it does not replace human peer review or venue-specific formatting tools.
366 installs

89 Experiment Tracking Swanlab
experiment-tracking-swanlab is an Orchestra Research agent skill (version 1.0.0) for open-source ML experiment tracking with SwanLab. The skill documents swanlab.init, swanlab.log, run.finish, local mode with swanlab watch, and cloud or self-hosted deployment via swanlab login. It ships integration patterns for 4 frameworks (PyTorch, HuggingFace Transformers, PyTorch Lightning, and Fastai) plus media logging for images, audio, text, GIFs, point clouds, and molecules through swanlab.Image, Audio, Text, Video, Object3D, and Molecule APIs. Dependencies pin swanlab>=0.7.11 with pillow and soundfile for media examples, and two reference files cover framework callbacks and ECharts visualization. Reach for experiment-tracking-swanlab when instrumenting training scripts, comparing hyperparameter sweeps, or running offline-first experiments without a managed SaaS tracker.
358 installs

90 Fine Tuning Openvla Oft
fine-tuning-openvla-oft is an orchestra-research/ai-research-skills guide for OpenVLA-OFT+ training and real-robot evaluation on the ALOHA stack. The workflow uses server-client inference: a server machine hosts the VLA model behind a /act endpoint via uvicorn and FastAPI, while a client machine controls the robot environment and requests actions. Setup creates separate conda environments with Python 3.10, installing torch, torchvision, torchaudio, and project dependencies on both sides.
Developers reach for this skill when building embodied AI or manipulation policies that require fine-tuning on ALOHA demonstrations rather than generic LLM chat workflows. Outputs include configured training and inference environments, server endpoints, and client control paths for policy evaluation on physical or simulated robots.
340 installs

91 Fine Tuning Serving Openpi
fine-tuning-serving-openpi is a robotics ML serving skill for developers deploying Physical Intelligence OpenPI policies to simulators or hardware. The skill documents default environment-to-checkpoint mappings and explicit serve_policy.py commands run through uv against GCS paths such as gs://openpi-assets/checkpoints. It covers four environments (ALOHA, ALOHA_SIM, DROID, and LIBERO) with named configs like pi05_aloha and pi05_droid. Developers reach for this skill when a robotics stack is configured but policy inference fails because the wrong checkpoint directory or environment mode is selected.
339 installs

92 Evaluating Cosmos Policy
evaluating-cosmos-policy is a research evaluation skill from orchestra-research/ai-research-skills that documents command matrices for running NVIDIA Cosmos Policy LIBERO benchmarks through the official cosmos_policy.experiments.robot.libero.run_libero_eval module. The skill covers interactive GPU shells and Slurm batch jobs, including headless MuJoCo rendering via EGL environment variables such as MUJOCO_GL=egl and PYOPENGL_PLATFORM=egl. Example Slurm allocations request one GPU, 64G memory, and eight CPUs for hour-long evaluation windows.
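The headless-rendering variables named for evaluating-cosmos-policy can also be set from Python before any rendering library is imported. The following is a small sketch under that assumption; the helper name `configure_headless_rendering` is hypothetical and is not part of the skill or of Cosmos Policy.

```python
import os


def configure_headless_rendering(env=None):
    """Set EGL rendering variables unless the caller already chose values.

    env defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # setdefault keeps any value exported earlier (for example by a Slurm script).
    env.setdefault("MUJOCO_GL", "egl")
    env.setdefault("PYOPENGL_PLATFORM", "egl")
    return env


# Call this before importing MuJoCo or PyOpenGL so the settings take effect.
configure_headless_rendering()
print(os.environ["MUJOCO_GL"], os.environ["PYOPENGL_PLATFORM"])
```

Using setdefault rather than plain assignment means a value already exported in the shell or batch script is left in place.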
Developers reach for evaluating-cosmos-policy when they need smoke evals on a single suite or full LIBERO trials without hand-rolling CUDA, MuJoCo, and uv dependency wiring each run.
338 installs

93 Evolving Ai Agents
evolving-ai-agents is a reference skill for Orchestra’s A-Evolve stack: import `agent_evolve as ae`, construct an `Evolver` with an agent seed or custom workspace, attach a named or custom benchmark, and iterate evolution cycles until benchmarks improve. Solo builders shipping autonomous coding agents use it when ad-hoc prompt tweaks stop scaling and they need a structured loop (tasks, scoring, and evolvable layers) grounded in public harnesses like SWE-verified or MCP-Atlas. The doc spells out resolution rules for string agent names, working-directory copies, and manifest validation so you do not start evolution on a broken workspace. You can override `workspace_dir`, inject a custom engine, and thread `EvolveConfig` without re-reading the whole Python package. Complexity is advanced because you are orchestrating benchmarks, seeds, and evolution state, not invoking a single API call. Treat this as Build-phase agent infrastructure; pair it with your own eval harness and version control before Ship.
336 installs

94 Ml Training Recipes
ML Training Recipes is a reference skill that packages modern transformer implementation patterns for solo builders training or customizing language models. It walks through RMSNorm, rotary position embeddings, grouped-query attention, sliding-window flash attention, value embeddings, activation choices, residual scaling, logit soft capping, assembled transformer blocks, and configuration conventions, each with concise Python-oriented guidance meant to be copied into a real training repo. Use it when you are past the idea stage and actively coding a model stack rather than shopping hosted APIs. The content assumes comfort with PyTorch-style modules and transformer training loops.
It does not replace experiment design or dataset curation; it accelerates correct, contemporary architecture wiring so you spend fewer cycles debugging norm placement or attention variants.
326 installs

95 Presenting Conference Talks
Presenting Conference Talks is a template skill for building professional conference presentations in two formats: Beamer LaTeX for pixel-stable PDFs and python-pptx for slides you can edit in PowerPoint. Solo builders, indie hackers, and researchers use it when they need a credible oral-talk layout (title slide, optional table of contents, figure paths, and per-slide speaker notes) without designing deck architecture from scratch. The Beamer template targets 16:9 at 12pt with Metropolis styling, appendix numbering, and commented options to show or hide notes on a second screen. Color blocks are parameterized so you can align with venue or lab branding quickly. The skill fits anyone translating a paper or product narrative into a timed talk; it is not a substitute for rehearsal or content strategy, but it removes blank-deck friction. Pair it with your own figures directory and metadata (title, authors, institute) before generating frames for problem, approach, evaluation, and conclusion sections typical of research oral presentations.
325 installs

96 Ara Rigor Reviewer
ARA Rigor Reviewer is an agent skill for solo builders and small research teams who publish structured research artifacts with explicit claims and experiments. After Level 1 structural validation passes, it runs a Level 2 semantic review across six epistemic dimensions, starting with whether cited experiments substantively support each claim and whether falsifiability statements are meaningful. It is meant for ARA-style documents where references alone are not enough and you need type-appropriate evidence (for example ablations for causal claims or heterogeneous setups for generalization).
Use it when a draft looks complete but you still distrust the claim–evidence graph, before sharing externally or folding conclusions into product specs. The output is a set of scored dimensions plus major and suggestion-level findings you can fix in place, not a generic rewrite. It pairs well with agentic research workflows where another skill assembled the ARA and you need a disciplined second reader.
303 installs

97 Ara Research Manager
ARA Research Manager is a journey-wide agent skill for builders and researchers who treat an AI coding session as a lab notebook. Whenever conversation or code changes imply a new question, committed decision, benchmark run, abandoned approach, or evidence-driven pivot, the skill tells the agent how to classify the moment and which file to update: exploration_tree.yaml for the branching trace of trials, and logic/ markdown for claims, heuristics, concepts, and constraints. That split keeps ephemeral search paths separate from statements you want to reuse in specs or papers. Solo builders training custom models, probing architectures, or running systematic ablations benefit because dead ends and pivots do not vanish in chat scrollback. The skill is procedural routing metadata, not a hosted experiment platform; your repo must already use or adopt the Orchestra-style trace and logic layout. Use it from early Idea research through Operate iteration whenever you want session observability without manually curating notes after every agent turn.
299 installs

98 Ara Compiler
ARA Compiler is an agent skill that encodes the ARA directory schema, a layered layout for AI research repositories spanning manifest, logic, source stubs, trace DAGs, and indexed evidence. Solo and indie builders running agent-assisted research use it when they need a consistent, citable structure instead of ad-hoc folders and half-documented experiments.
The reference walks field-by-field through problem statements, claims, architecture, algorithms, configs, and raw result tables so both humans and coding agents know where to read and write. It fits early journey work when you are still proving ideas and documenting rigor, and it carries into Build when you formalize docs and execution modules. Pair it with your actual experiment code and review workflows; it does not run training or collect metrics by itself.
297 installs

99 Writing Systems Papers
writing-systems-papers is a Claude Code skill for AI & agent building. It helps solo builders move faster with AI-assisted coding.
3 installs
