
Optimizing Attention Flash
Pick Flash Attention 2 vs 3 and GPU settings using published forward-pass benchmarks before you ship long-context training or inference.
Overview
Optimizing Attention Flash is an agent skill most often used in Build (also Ship perf) that summarizes Flash Attention GPU benchmarks and speedups for long-sequence transformers.
Install
npx skills add https://github.com/orchestra-research/ai-research-skills --skill optimizing-attention-flash
What is this skill?
- Forward-pass tables for A100, H100, and A10G at sequence lengths from 512 to 8192 (the A10G table stops at 4096)
- Standard vs Flash Attention 2 vs Flash Attention 3 (FP16/FP8) speedup columns
- H100 FP8 note: ~1.2 PFLOPS (~75% of theoretical max)
- Sections for memory usage, training vs inference, and FA version comparison
- Scaling guidance tied to batch, heads, and dim configs in the tables
- Benchmark grid spans sequence lengths 512, 1024, 2048, 4096, and 8192
- Documented speedups vs standard attention: up to roughly 3.3x with FA2 on A100, and up to ~8.3x best case on H100 (reached with FA3 FP8; FA2 alone is about 3.2x there)
- H100 Flash Attention 3 FP8 forward path cited around ~1.2 PFLOPS (~75% theoretical max)
Adoption & trust: 1 install on skills.sh; 9.4k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need to serve or train longer contexts but do not know which Flash Attention version or GPU tier actually wins on forward-pass time and memory.
Who is it for?
Indie ML builders sizing inference on A100/H100/A10G or debating Flash Attention 2 vs 3 for context length beyond 2k tokens.
Skip if: Teams that only need generic “use flash-attn pip” steps without benchmark interpretation, or pure frontend work with no model kernels.
When should I use this skill?
You are choosing attention implementations or GPUs for long-context training or inference and need cited speed and memory comparisons.
What do I get? / Deliverables
You leave with comparable timing tables and speedup ratios so you can choose FA2/FA3, precision, and sequence limits aligned to your target GPU class.
- Kernel/version recommendation grounded in benchmark tables
- Expected forward-pass speedup range for your sequence length and GPU
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Attention kernels sit in the model/backend stack where solo builders first wire transformers—not in launch or growth tooling. Canonical shelf is backend because the skill is GPU kernel choice, sequence-length scaling, and memory tradeoffs for model code.
Where it fits
Compare FA2 vs standard attention timings at 4096 tokens before locking the training script.
Validate whether H100 + FA3 FP8 is worth the complexity for your launch SLO on 8k context.
Explain a GPU bill spike using published A10G vs A100 scaling rows when sequence length doubled.
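The "sequence length doubled" scenario above can be checked with simple arithmetic on the published rows. The sketch below is only an illustration: the timings are copied from the A100 and A10G forward-pass tables in SKILL.md, while the dictionary layout and the `doubling_ratio` helper are hypothetical names introduced here, not part of the skill.

```python
# Forward-pass times in milliseconds, copied from the SKILL.md tables.
# A100 rows use batch=8; A10G rows use batch=4 (heads=32, dim=64).
A100_MS = {
    2048: {"standard": 14.2, "fa2": 4.8},
    4096: {"standard": 55.1, "fa2": 17.3},
    8192: {"standard": 218.5, "fa2": 66.2},
}
A10G_MS = {
    1024: {"standard": 6.8, "fa2": 2.8},
    2048: {"standard": 25.9, "fa2": 9.4},
    4096: {"standard": 102.1, "fa2": 35.2},
}


def doubling_ratio(table, seq_len, kernel):
    """Forward time at 2 * seq_len divided by forward time at seq_len."""
    return round(table[2 * seq_len][kernel] / table[seq_len][kernel], 2)


if __name__ == "__main__":
    for name, table in (("A100", A100_MS), ("A10G", A10G_MS)):
        for seq_len in sorted(table):
            if 2 * seq_len not in table:
                continue  # no published row for the doubled length
            for kernel in ("standard", "fa2"):
                ratio = doubling_ratio(table, seq_len, kernel)
                print(f"{name} {kernel}: {seq_len} -> {2 * seq_len} costs {ratio}x")
```

In these rows, doubling the sequence length multiplies forward-pass time by roughly 3.6x to 4x for both kernels, which is what O(N²) time predicts. Under that reading, a bill that jumps about fourfold after a context-length doubling is expected behavior rather than a regression.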
How it compares
Reference benchmark tables for kernel choice—not a generic Python debugging or API integration skill.
Common Questions / FAQ
Who is optimizing-attention-flash for?
Solo and indie builders shipping or fine-tuning LLM backends who must justify attention kernel and GPU choices with numbers, not intuition.
When should I use optimizing-attention-flash?
During Build when wiring transformer backends, before Ship perf reviews for long-context inference, and when Operate cost spikes trace to attention on 4k–8k sequences.
Is optimizing-attention-flash safe to install?
Treat it as documentation-style procedural knowledge; review the Security Audits panel on this Prism page before pulling it into an agent workflow.
SKILL.md
SKILL.md - Optimizing Attention Flash
# Performance Benchmarks

## Contents

- Speed comparisons across GPUs
- Memory usage analysis
- Scaling with sequence length
- Training vs inference performance
- Flash Attention versions comparison

## Speed comparisons across GPUs

### A100 80GB (Ampere)

**Forward pass time** (milliseconds, batch=8, heads=32, dim=64):

| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 | Speedup (FA2) |
|------------|----------|--------------|--------------|---------------|
| 512 | 1.2 | 0.9 | N/A | 1.3x |
| 1024 | 3.8 | 1.4 | N/A | 2.7x |
| 2048 | 14.2 | 4.8 | N/A | 3.0x |
| 4096 | 55.1 | 17.3 | N/A | 3.2x |
| 8192 | 218.5 | 66.2 | N/A | 3.3x |

### H100 80GB (Hopper)

**Forward pass time** (milliseconds, same config):

| Seq Length | Standard | Flash Attn 2 | Flash Attn 3 (FP16) | Flash Attn 3 (FP8) | Best Speedup |
|------------|----------|--------------|---------------------|--------------------|--------------|
| 512 | 0.8 | 0.6 | 0.4 | 0.3 | 2.7x |
| 1024 | 2.6 | 1.0 | 0.6 | 0.4 | 6.5x |
| 2048 | 9.8 | 3.4 | 2.0 | 1.3 | 7.5x |
| 4096 | 38.2 | 12.5 | 7.2 | 4.8 | 8.0x |
| 8192 | 151.4 | 47.8 | 27.1 | 18.2 | 8.3x |

**Key insight**: Flash Attention 3 on H100 with FP8 achieves ~1.2 PFLOPS (75% of theoretical max).
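
The "Best Speedup" column in the H100 table is the standard-attention time divided by the fastest kernel's time in the same row. The following minimal sketch reproduces it from the published numbers. The timings come from the table above; the dictionary keys and the `speedups` and `best_speedup` helpers are hypothetical names chosen for illustration.

```python
# H100 forward-pass times in milliseconds, copied from the table above
# (batch=8, heads=32, dim=64).
H100_MS = {
    512: {"standard": 0.8, "fa2": 0.6, "fa3_fp16": 0.4, "fa3_fp8": 0.3},
    1024: {"standard": 2.6, "fa2": 1.0, "fa3_fp16": 0.6, "fa3_fp8": 0.4},
    2048: {"standard": 9.8, "fa2": 3.4, "fa3_fp16": 2.0, "fa3_fp8": 1.3},
    4096: {"standard": 38.2, "fa2": 12.5, "fa3_fp16": 7.2, "fa3_fp8": 4.8},
    8192: {"standard": 151.4, "fa2": 47.8, "fa3_fp16": 27.1, "fa3_fp8": 18.2},
}


def speedups(row):
    """Speedup of each kernel relative to standard attention, to 1 decimal."""
    base = row["standard"]
    return {k: round(base / ms, 1) for k, ms in row.items() if k != "standard"}


def best_speedup(row):
    """Largest speedup in the row (the table's 'Best Speedup' column)."""
    return max(speedups(row).values())


if __name__ == "__main__":
    for seq_len, row in H100_MS.items():
        print(seq_len, speedups(row), "best:", best_speedup(row))
```

Splitting the ratios out per kernel makes it clear that the headline 8.3x at 8192 tokens comes from the FP8 path; on the same row, FA2 is about 3.2x and FA3 FP16 about 5.6x over standard attention.
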
### A10G 24GB (Ampere)

**Forward pass time** (milliseconds, batch=4):

| Seq Length | Standard | Flash Attn 2 | Speedup |
|------------|----------|--------------|---------|
| 512 | 2.1 | 1.6 | 1.3x |
| 1024 | 6.8 | 2.8 | 2.4x |
| 2048 | 25.9 | 9.4 | 2.8x |
| 4096 | 102.1 | 35.2 | 2.9x |

## Memory usage analysis

### GPU memory consumption (batch=8, heads=32, dim=64)

**Standard attention memory**:

| Seq Length | Attention Matrix | KV Cache | Total | Notes |
|------------|------------------|----------|-------|-------|
| 512 | 8 MB | 32 MB | 40 MB | Manageable |
| 2048 | 128 MB | 128 MB | 256 MB | Growing |
| 8192 | 2048 MB (2 GB) | 512 MB | 2.5 GB | Large |
| 32768 | 32768 MB (32 GB) | 2048 MB | 34 GB | OOM on 24GB GPUs |

**Flash Attention 2 memory**:

| Seq Length | Attention (on-chip) | KV Cache | Total | Reduction |
|------------|---------------------|----------|-------|-----------|
| 512 | 0 MB (recomputed) | 32 MB | 32 MB | 20% |
| 2048 | 0 MB | 128 MB | 128 MB | 50% |
| 8192 | 0 MB | 512 MB | 512 MB | 80% |
| 32768 | 0 MB | 2048 MB | 2 GB | 94% |

**Key insight**: Flash Attention doesn't materialize the attention matrix, saving O(N²) memory.

### Memory scaling comparison

**Llama 2 7B model memory** (float16, batch=1):

| Context Length | Standard Attention | Flash Attention 2 | Can Fit 24GB GPU? |
|----------------|-------------------|-------------------|-------------------|
| 2K | 3.2 GB | 2.1 GB | Both: Yes |
| 4K | 5.8 GB | 2.8 GB | Both: Yes |
| 8K | 12.1 GB | 4.2 GB | Both: Yes |
| 16K | 26.3 GB (OOM) | 7.8 GB | Only Flash: Yes |
| 32K | OOM | 14.2 GB | Only Flash: Yes |

### Training memory (Llama 2 7B, batch=4)

| Context | Standard (GB) | Flash Attn (GB) | Reduction |
|---------|---------------|-----------------|-----------|
| 2K | 18.2 | 12.4 | 32% |
| 4K | 34.8 | 16.8 | 52% |
| 8K | OOM (>40GB) | 26.2 | Fits! |

## Scaling with sequence length

### Computational complexity

**Standard attention**:

- Time: O(N² × d)
- Memory: O(N² + N × d)

**Flash Attention**:

- Time: O(N² × d) (same, but with better constants)
- Memory: O(N × d) (linear!)

### Empirical scaling (A100, batch=1, heads=32, dim=64)

**Time per token (milliseconds)**:

| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
|----------|-----|-----|-----|-----|-----|------|
| Standard | 0.15 | 0.37 | 1.11 | 3.44 | 13.4 | 52.8 |
| Flash Attn 2 | 0.11 | 0.14 | 0.24 | 0.43 | 0.83 | 1.64 |
| Speedup | 1.4x | 2.6x | 4.6x | 8.0x | 16.1x | 32.2x |

**Observation**: Speedup grows roughly linearly with sequence length. From 2K onward it about doubles each time the sequence length doubles.

### Memory per token (MB)

| Sequence | 512 | 1K | 2K | 4K | 8K | 16K |
|----------|-----|-----|-----|-----|-----|------|
| Standard | 0.08 | 0.13 | 0.25 | 0.64 | 2.05 | 8.13 |
| Flash Attn 2 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |

**Observation