Optimizing Attention Flash

Name: Optimizing Attention Flash
Author: orchestra-research

orchestra-research/ai-research-skills

400 installs
11.2k repo stars
Updated June 16, 2026

optimizing-attention-flash is an agent skill that helps developers choose Flash Attention 2 versus 3 and GPU settings using published forward-pass benchmarks before shipping long-context workloads.

About

optimizing-attention-flash is a benchmark-driven skill from orchestra-research/ai-research-skills that compares standard attention against Flash Attention 2 and Flash Attention 3 on NVIDIA A100 80GB and H100 80GB GPUs. Its tables report forward-pass time in milliseconds at sequence lengths 512 through 8192 with batch=8, heads=32, dim=64, showing up to a 3.3× speedup for FA2 at 8192 tokens on A100. Developers reach for optimizing-attention-flash when long-context training or inference is memory- or latency-bound and they need an evidence-backed choice of Flash Attention version and GPU.

Forward-pass tables for A100 and H100 (512–8192 sequence length) and A10G (512–4096)
Standard vs Flash Attention 2 vs Flash Attention 3 (FP16/FP8) speedup columns
H100 note: FA3 reaches ~740 TFLOPS in FP16 (~75% of theoretical max) and ~1.2 PFLOPS with FP8
Sections for memory usage, training vs inference, and FA version comparison
Scaling guidance tied to batch, heads, and dim configs in the tables

Optimizing Attention Flash by the numbers

400 all-time installs (skills.sh)
+37 installs in the week ending Jul 18, 2026 (Skillselion tracking)
Ranked #492 of 2,066 Data Science & ML skills by installs in the Skillselion catalog
Security screen: MEDIUM risk (skills.sh audit)
Data as of Jul 28, 2026 (Skillselion catalog sync)

npx skills add https://github.com/orchestra-research/ai-research-skills --skill optimizing-attention-flash

Security audit	3 / 3 scanners passed

Flash Attention 2 or 3 for my GPU and sequence length?

Pick Flash Attention 2 vs 3 and GPU settings using published forward-pass benchmarks before you ship long-context training or inference.

Who is it for?

ML engineers profiling attention kernels on A100 or H100 before long-context training or inference.

Skip if: Application teams with no GPU training pipeline who only call remote inference APIs.

When should I use this skill?

A developer is choosing Flash Attention 2 vs 3 or sizing GPUs for 4k–8k+ sequence attention workloads.

What you get

A Flash Attention version and GPU recommendation backed by sequence-length benchmark tables.

Flash Attention version recommendation
GPU sizing rationale from benchmark tables

By the numbers

3.3× Flash Attention 2 speedup at 8192 seq length on A100 80GB
Benchmarks sequence lengths 512 through 8192 with batch=8, heads=32, dim=64

Files

SKILL.md (Markdown)

Flash Attention - Fast Memory-Efficient Attention

Quick start

Flash Attention provides 2-4x speedup and 10-20x memory reduction for transformer attention through IO-aware tiling and recomputation.

PyTorch native (easiest, PyTorch 2.2+):

import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)  # [batch, heads, seq, dim]
k = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 8, 512, 64, device='cuda', dtype=torch.float16)

# Automatically uses Flash Attention if available
out = F.scaled_dot_product_attention(q, k, v)

flash-attn library (more features):

pip install flash-attn --no-build-isolation

from flash_attn import flash_attn_func

# q, k, v: [batch, seqlen, nheads, headdim]
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)

Common workflows

Workflow 1: Enable in existing PyTorch model

Copy this checklist:

Flash Attention Integration:
- [ ] Step 1: Check PyTorch version (≥2.2)
- [ ] Step 2: Enable Flash Attention backend
- [ ] Step 3: Verify speedup with profiling
- [ ] Step 4: Test accuracy matches baseline

Step 1: Check PyTorch version

python -c "import torch; print(torch.__version__)"
# Should be ≥2.2.0

If <2.2, upgrade:

pip install --upgrade torch
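The version check above can also be scripted. The following is a minimal sketch; the helper name `meets_min_version` is hypothetical and not part of PyTorch. It compares the numeric major and minor parts, so build suffixes such as `+cu121` and two-digit minor versions are handled.

```python
import re

def meets_min_version(version: str, minimum: tuple = (2, 2)) -> bool:
    """Return True if a version string like '2.3.1+cu121' is at least `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison avoids the string-comparison trap where "2.10" < "2.2"
    return (major, minor) >= minimum

print(meets_min_version("2.1.2"))        # False
print(meets_min_version("2.3.1+cu121"))  # True
```

In a real environment you would pass `torch.__version__` as the argument.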

Step 2: Enable Flash Attention backend

Replace standard attention:

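# Before (standard attention)
import math
d_k = q.size(-1)  # head dimension
attn_weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
out = attn_weights @ v

# After (Flash Attention)
import torch.nn.functional as F
# The flash backend handles causal masking via is_causal=True; an arbitrary
# attn_mask typically makes SDPA fall back to another backend.
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)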

Force Flash Attention backend:

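# Note: torch.backends.cuda.sdp_kernel is deprecated in newer PyTorch releases
# in favor of torch.nn.attention.sdpa_kernel.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True,
    enable_math=False,
    enable_mem_efficient=False
):
    out = F.scaled_dot_product_attention(q, k, v)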

Step 3: Verify speedup with profiling

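import torch
import torch.nn.functional as F
import torch.utils.benchmark as benchmark

# Allocate inputs once so the timer measures attention only
q, k, v = [torch.randn(2, 8, 2048, 64, device='cuda', dtype=torch.float16) for _ in range(3)]

def test_attention(use_flash):
    if use_flash:
        # Disable the other backends so the flash kernel is actually used
        with torch.backends.cuda.sdp_kernel(
            enable_flash=True, enable_math=False, enable_mem_efficient=False
        ):
            return F.scaled_dot_product_attention(q, k, v)
    else:
        attn = (q @ k.transpose(-2, -1) / 8.0).softmax(dim=-1)  # 8.0 = sqrt(head_dim=64)
        return attn @ v

# Benchmark (Timer takes care of CUDA synchronization)
t_flash = benchmark.Timer(stmt='test_attention(True)', globals=globals())
t_standard = benchmark.Timer(stmt='test_attention(False)', globals=globals())

# .mean is seconds per call; report milliseconds
print(f"Flash: {t_flash.timeit(100).mean * 1e3:.3f} ms")
print(f"Standard: {t_standard.timeit(100).mean * 1e3:.3f} ms")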

Expected: 2-4x speedup for sequences >512 tokens.

Step 4: Test accuracy matches baseline

# Compare outputs
q, k, v = [torch.randn(1, 8, 512, 64, device='cuda', dtype=torch.float16) for _ in range(3)]

# Flash Attention
out_flash = F.scaled_dot_product_attention(q, k, v)

# Standard attention
attn_weights = torch.softmax(q @ k.transpose(-2, -1) / 8.0, dim=-1)
out_standard = attn_weights @ v

# Check difference
diff = (out_flash - out_standard).abs().max()
print(f"Max difference: {diff:.6f}")
# Should be <1e-3 for float16

Workflow 2: Use flash-attn library for advanced features

For multi-query attention, sliding window, or H100 FP8.

Copy this checklist:

flash-attn Library Setup:
- [ ] Step 1: Install flash-attn library
- [ ] Step 2: Modify attention code
- [ ] Step 3: Enable advanced features
- [ ] Step 4: Benchmark performance

Step 1: Install flash-attn library

# NVIDIA GPUs (CUDA 12.0+)
pip install flash-attn --no-build-isolation

# Verify installation
python -c "from flash_attn import flash_attn_func; print('Success')"

Step 2: Modify attention code

from flash_attn import flash_attn_func

# Input: [batch_size, seq_len, num_heads, head_dim]
# Transpose from [batch, heads, seq, dim] if needed
q = q.transpose(1, 2)  # [batch, seq, heads, dim]
k = k.transpose(1, 2)
v = v.transpose(1, 2)

out = flash_attn_func(
    q, k, v,
    dropout_p=0.1,  # Set to 0.0 at evaluation/inference time
    causal=True,  # For autoregressive models
    window_size=(-1, -1),  # No sliding window
    softmax_scale=None  # Auto-scale
)

out = out.transpose(1, 2)  # Back to [batch, heads, seq, dim]

Step 3: Enable advanced features

Multi-query attention (shared K/V across heads):

from flash_attn import flash_attn_func

# q: [batch, seq, num_q_heads, dim]
# k, v: [batch, seq, num_kv_heads, dim]  # Fewer KV heads
out = flash_attn_func(q, k, v)  # Automatically handles MQA
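To make the head sharing concrete, here is a small sketch of how query heads are grouped onto K/V heads when there are fewer K/V heads. The function name is hypothetical; the grouping (consecutive query heads share one K/V head, and the query head count must be divisible by the K/V head count) follows the behavior described in the flash-attn README.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the K/V head that a given query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per K/V head
    return q_head // group_size

# Grouped-query attention: 6 query heads sharing 2 K/V heads
print([kv_head_for_query_head(h, 6, 2) for h in range(6)])  # [0, 0, 0, 1, 1, 1]

# Multi-query attention: every query head shares the single K/V head
print([kv_head_for_query_head(h, 4, 1) for h in range(4)])  # [0, 0, 0, 0]
```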

Sliding window attention (local attention):

# Attend only to the 256 preceding tokens (with causal=True the right window is 0)
out = flash_attn_func(
    q, k, v,
    window_size=(256, 0),  # (left, right) window
    causal=True
)
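The interaction between window_size and causal can be checked without a GPU. The sketch below is an illustration only, assuming equal query and key lengths; `visible_keys` is a hypothetical helper, not a flash-attn function. It lists which key positions a query position may attend to, treating -1 as "unlimited" and forcing the right window to 0 when causal is set.

```python
def visible_keys(query_pos, seq_len, window_size=(-1, -1), causal=False):
    """Key positions visible to `query_pos` under a (left, right) window."""
    left, right = window_size
    if causal:
        right = 0  # a causal mask never looks ahead
    lo = 0 if left < 0 else max(0, query_pos - left)
    hi = seq_len - 1 if right < 0 else min(seq_len - 1, query_pos + right)
    return list(range(lo, hi + 1))

print(visible_keys(5, 10, window_size=(2, 2)))               # [3, 4, 5, 6, 7]
print(visible_keys(5, 10, window_size=(2, 2), causal=True))  # [3, 4, 5]
print(visible_keys(5, 10, causal=True))                      # [0, 1, 2, 3, 4, 5]
```

A symmetric window such as (256, 256) therefore only makes sense for bidirectional (non-causal) attention.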

Step 4: Benchmark performance

import torch
from flash_attn import flash_attn_func
import time

q, k, v = [torch.randn(4, 4096, 32, 64, device='cuda', dtype=torch.float16) for _ in range(3)]

# Warmup
for _ in range(10):
    _ = flash_attn_func(q, k, v)

# Benchmark
torch.cuda.reset_peak_memory_stats()  # Measure peak memory for this run only
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(100):
    out = flash_attn_func(q, k, v)
torch.cuda.synchronize()  # Wait once for all queued kernels to finish
end = time.perf_counter()

print(f"Time per iteration: {(end-start)/100*1000:.2f}ms")
print(f"Memory allocated: {torch.cuda.max_memory_allocated()/1e9:.2f}GB")

Workflow 3: H100 FP8 optimization (FlashAttention-3)

For maximum performance on H100 GPUs.

FP8 Setup:
- [ ] Step 1: Verify H100 GPU available
- [ ] Step 2: Install flash-attn with FP8 support
- [ ] Step 3: Convert inputs to FP8
- [ ] Step 4: Run with FP8 attention

Step 1: Verify H100 GPU

nvidia-smi --query-gpu=name --format=csv
# Should show "H100" or "H800"

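Step 2: Install flash-attn with FP8 support

pip install flash-attn --no-build-isolation
# This package provides the FlashAttention-2 kernels (FP16/BF16).
# FlashAttention-3, including FP8, is built separately from the hopper/
# directory of the Dao-AILab/flash-attention repository:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention/hopper
python setup.py install

Step 3: Convert inputs to FP8

import torch

q = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)
k = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)
v = torch.randn(2, 4096, 32, 64, device='cuda', dtype=torch.float16)

# Convert to float8_e4m3 (FP8)
# This is a plain cast with no scaling; real activations usually need
# per-tensor scaling to fit the narrow e4m3 range.
q_fp8 = q.to(torch.float8_e4m3fn)
k_fp8 = k.to(torch.float8_e4m3fn)
v_fp8 = v.to(torch.float8_e4m3fn)

Step 4: Run with FP8 attention

# Assumption: the hopper/ build exposes the flash_attn_interface module;
# check the repository README for the interface of your installed version.
from flash_attn_interface import flash_attn_func

# Depending on the version, the call may return a tuple (out, softmax_lse)
out = flash_attn_func(q_fp8, k_fp8, v_fp8)
# Reported: ~1.2 PFLOPS with FP8, 1.5-2x faster than FP16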

When to use vs alternatives

Use Flash Attention when:

Training transformers with sequences >512 tokens
Running inference with long context (>2K tokens)
GPU memory constrained (OOM with standard attention)
Need 2-4x speedup without accuracy loss
Using PyTorch 2.2+ or can install flash-attn

Use alternatives instead:

Standard attention: Sequences <256 tokens (overhead not worth it)
xFormers: Need more attention variants (not just speed)
Memory-efficient attention: CPU inference (Flash Attention needs GPU)

Common issues

Issue: ImportError: cannot import flash_attn

Install with no-build-isolation flag:

pip install flash-attn --no-build-isolation

Or install CUDA toolkit first:

conda install cuda -c nvidia
pip install flash-attn --no-build-isolation

Issue: Slower than expected (no speedup)

Flash Attention benefits increase with sequence length:

<512 tokens: Minimal speedup (10-20%)
512-2K tokens: 2-3x speedup
>2K tokens: 3-4x speedup

Check sequence length is sufficient.

Issue: RuntimeError: CUDA error

Verify GPU supports Flash Attention:

import torch
print(torch.cuda.get_device_capability())
# FlashAttention-2 needs ≥(8, 0) (Ampere or newer); (7, 5) is Turing

Flash Attention requires:

Ampere (A100, A10), Ada, Hopper (H100): ✅ Full support
Turing (T4): ⚠️ FlashAttention 1.x only
Volta (V100): ❌ Not supported
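As a quick pre-flight check, the compute capability can be compared against the Ampere threshold. This sketch assumes the Ampere-or-newer requirement listed under Hardware requirements applies to the FlashAttention-2 kernels; the helper name is hypothetical, and the capability values are NVIDIA's published compute capabilities for each GPU.

```python
def supports_flash_attention_2(capability) -> bool:
    """True if a (major, minor) compute capability is Ampere (8.0) or newer."""
    return tuple(capability) >= (8, 0)

# Published compute capabilities for GPUs named in this document
examples = {"V100": (7, 0), "T4": (7, 5), "A100": (8, 0), "A10G": (8, 6), "H100": (9, 0)}
for name, cap in examples.items():
    print(f"{name}: {supports_flash_attention_2(cap)}")
```

On a machine with a GPU, pass `torch.cuda.get_device_capability()` as the argument.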

Issue: Accuracy degradation

Check dtype is float16 or bfloat16 (not float32):

q = q.to(torch.float16)  # Or torch.bfloat16

Flash Attention uses float16/bfloat16 for speed. Float32 not supported.

Advanced topics

Integration with HuggingFace Transformers: See references/transformers-integration.md for enabling Flash Attention in BERT, GPT, Llama models.

Performance benchmarks: See references/benchmarks.md for detailed speed and memory comparisons across GPUs and sequence lengths.

Algorithm details: See references/algorithm.md for tiling strategy, recomputation, and IO complexity analysis.

Advanced features: See references/advanced-features.md for rotary embeddings, ALiBi, paged KV cache, and custom attention masks.

Hardware requirements

GPU: NVIDIA Ampere+ (A100, A10, A30) or AMD MI200+
VRAM: Same as standard attention (Flash Attention doesn't increase memory)
CUDA: 12.0+ recommended (11.8 minimum)
PyTorch: 2.2+ for native support

Not supported: V100 (Volta), CPU inference

Resources

Paper: "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022)
Paper: "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (ICLR 2024)
Blog: https://tridao.me/blog/2024/flash3/
GitHub: https://github.com/Dao-AILab/flash-attention
PyTorch docs: https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html

Performance Benchmarks

Speed comparisons across GPUs
Memory usage analysis
Scaling with sequence length
Training vs inference performance
Flash Attention versions comparison

Speed comparisons across GPUs

A100 80GB (Ampere)

Forward pass time (milliseconds, batch=8, heads=32, dim=64):

Seq Length	Standard	Flash Attn 2	Flash Attn 3	Speedup (FA2)
512	1.2	0.9	N/A	1.3x
1024	3.8	1.4	N/A	2.7x
2048	14.2	4.8	N/A	3.0x
4096	55.1	17.3	N/A	3.2x
8192	218.5	66.2	N/A	3.3x
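A forward-pass time can be converted into achieved throughput to compare rows across tables. The sketch below uses the FLOP count commonly used in the FlashAttention papers for non-causal forward attention, 4 × batch × heads × N² × dim (two matrix multiplies, two FLOPs per multiply-add); the function name is hypothetical, and other counting conventions give different figures.

```python
def attention_forward_tflops(batch, heads, seq_len, head_dim, time_ms, causal=False):
    """Achieved TFLOPS for one attention forward pass."""
    flops = 4 * batch * heads * seq_len ** 2 * head_dim  # QK^T plus PV
    if causal:
        flops //= 2  # roughly half the score matrix is masked out
    return flops / (time_ms / 1e3) / 1e12

# A100 row at sequence length 8192: Flash Attn 2 takes 66.2 ms
tflops = attention_forward_tflops(batch=8, heads=32, seq_len=8192, head_dim=64, time_ms=66.2)
print(f"{tflops:.1f} TFLOPS")  # 66.4 TFLOPS
```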

H100 80GB (Hopper)

Forward pass time (milliseconds, same config):

Seq Length	Standard	Flash Attn 2	Flash Attn 3 (FP16)	Flash Attn 3 (FP8)	Best Speedup
512	0.8	0.6	0.4	0.3	2.7x
1024	2.6	1.0	0.6	0.4	6.5x
2048	9.8	3.4	2.0	1.3	7.5x
4096	38.2	12.5	7.2	4.8	8.0x
8192	151.4	47.8	27.1	18.2	8.3x

Key insight: Flash Attention 3 on H100 reaches ~740 TFLOPS in FP16 (about 75% of theoretical max) and ~1.2 PFLOPS with FP8.

A10G 24GB (Ampere)

Forward pass time (milliseconds, batch=4):

Seq Length	Standard	Flash Attn 2	Speedup
512	2.1	1.6	1.3x
1024	6.8	2.8	2.4x
2048	25.9	9.4	2.8x
4096	102.1	35.2	2.9x

Memory usage analysis

GPU memory consumption (batch=8, heads=32, dim=64)

Standard attention memory:

Seq Length	Attention Matrix	KV Cache	Total	Notes
512	8 MB	32 MB	40 MB	Manageable
2048	128 MB	128 MB	256 MB	Growing
8192	2048 MB (2 GB)	512 MB	2.5 GB	Large
32768	32768 MB (32 GB)	2048 MB	34 GB	OOM on 24GB GPUs

Flash Attention 2 memory:

Seq Length	Attention (on-chip)	KV Cache	Total	Reduction
512	0 MB (recomputed)	32 MB	32 MB	20%
2048	0 MB	128 MB	128 MB	50%
8192	0 MB	512 MB	512 MB	80%
32768	0 MB	2048 MB	2 GB	94%

Key insight: Flash Attention doesn't materialize attention matrix, saving O(N²) memory.
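The O(N²) term can be seen with back-of-the-envelope arithmetic. This is a rough sketch with hypothetical helper names, assuming 2-byte (float16) elements. The K/V figure reproduces the KV Cache column for batch=8, heads=32, dim=64; for the score matrix the sketch shows only the growth factor, since absolute sizes depend on how many heads and layers are counted, which the tables do not state.

```python
MIB = 1024 ** 2

def kv_bytes(batch, heads, seq_len, head_dim, bytes_per_elem=2):
    """Bytes for the K and V tensors: linear in sequence length."""
    return 2 * batch * heads * seq_len * head_dim * bytes_per_elem

def score_matrix_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Bytes for a fully materialized N x N score matrix: quadratic in N."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

base_scores = score_matrix_bytes(8, 32, 512)
for n in (512, 2048, 8192, 32768):
    kv_mib = kv_bytes(8, 32, n, 64) // MIB
    growth = score_matrix_bytes(8, 32, n) // base_scores
    print(f"N={n}: K/V = {kv_mib} MiB, score matrix = {growth}x its size at N=512")
```

Each 4x increase in sequence length multiplies K/V memory by 4 but the score matrix by 16, which is why avoiding the materialized matrix matters more as context grows.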

Memory scaling comparison

Llama 2 7B model memory (float16, batch=1):

Context Length	Standard Attention	Flash Attention 2	Can Fit 24GB GPU?
2K	3.2 GB	2.1 GB	Both: Yes
4K	5.8 GB	2.8 GB	Both: Yes
8K	12.1 GB	4.2 GB	Both: Yes
16K	26.3 GB (OOM)	7.8 GB	Only Flash: Yes
32K	OOM	14.2 GB	Only Flash: Yes

Training memory (Llama 2 7B, batch=4)

Context	Standard (GB)	Flash Attn (GB)	Reduction
2K	18.2	12.4	32%
4K	34.8	16.8	52%
8K	OOM (>40GB)	26.2	Fits!

Scaling with sequence length

Computational complexity

Standard attention:

Time: O(N² × d)
Memory: O(N² + N × d)

Flash Attention:

Time: O(N² × d) (same, but with better constants)
Memory: O(N × d) (linear!)
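The linear-memory result rests on computing softmax block by block with a running maximum and running sum, so the full row of scores is never stored. The pure-Python sketch below illustrates that idea for a single query vector; it is a teaching illustration under simplifying assumptions (one query, no masking, no GPU tiling), not the library's kernel, and all names are hypothetical.

```python
import math
import random

def naive_attention(q, keys, values):
    """Attention for one query vector, materializing all N scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, block_size=4):
    """Same result, visiting K/V one block at a time with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized output accumulator
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescales earlier blocks
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(0)
N, d = 10, 4
q = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, block_size=4)
max_err = max(abs(a - b) for a, b in zip(ref, out))
print(f"max abs difference: {max_err:.2e}")
```

Only one block of scores exists at any time, so working memory depends on the block size rather than on N, while the total arithmetic stays O(N² × d).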

Empirical scaling (A100, batch=1, heads=32, dim=64)

Time per token (milliseconds):

Sequence	512	1K	2K	4K	8K	16K
Standard	0.15	0.37	1.11	3.44	13.4	52.8
Flash Attn 2	0.11	0.14	0.24	0.43	0.83	1.64
Speedup	1.4x	2.6x	4.6x	8.0x	16.1x	32.2x

Observation: Speedup roughly doubles each time the sequence length doubles, so in this table it grows about linearly with sequence length.
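The growth rate of the speedup can be estimated directly from the table. The sketch below fits a power law, speedup ∝ N^p, between the first and last columns, using only the timings listed above.

```python
import math

seq_lens = [512, 1024, 2048, 4096, 8192, 16384]
standard_ms = [0.15, 0.37, 1.11, 3.44, 13.4, 52.8]
flash_ms = [0.11, 0.14, 0.24, 0.43, 0.83, 1.64]

speedups = [s / f for s, f in zip(standard_ms, flash_ms)]

# Exponent p in speedup ~ N**p, from the endpoints of the table
exponent = math.log(speedups[-1] / speedups[0]) / math.log(seq_lens[-1] / seq_lens[0])

print([round(x, 1) for x in speedups])
print(f"fitted exponent: {exponent:.2f}")
```

An exponent near 1 corresponds to linear growth in speedup; quadratic growth would give an exponent near 2.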

Memory per token (MB)

Sequence	512	1K	2K	4K	8K	16K
Standard	0.08	0.13	0.25	0.64	2.05	8.13
Flash Attn 2	0.06	0.06	0.06	0.06	0.06	0.06

Observation: Flash Attention memory per token is constant!

Training vs inference performance

Training (forward + backward, Llama 2 7B, A100)

Batch × Seq	Standard (samples/sec)	Flash Attn (samples/sec)	Speedup
4 × 2K	1.2	3.1	2.6x
8 × 2K	2.1	5.8	2.8x
4 × 4K	0.4	1.3	3.3x
8 × 4K	OOM	2.4	Enabled
2 × 8K	0.1	0.4	4.0x

Inference (generation, Llama 2 7B, A100)

Context Length	Standard (tokens/sec)	Flash Attn (tokens/sec)	Speedup
512	48	52	1.1x
2K	42	62	1.5x
4K	31	58	1.9x
8K	18	51	2.8x
16K	OOM	42	Enabled

Note: Inference speedup less dramatic than training because generation is memory-bound (KV cache accesses).
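The size of the KV cache that generation must read at every step can be estimated from the model shape. This sketch uses the published Llama 2 7B architecture (32 layers, 32 attention heads, head dimension 128) and assumes float16 storage; the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by the K and V caches across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

per_token = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=1)
print(per_token / 1024 ** 2)  # 0.5 (MiB per token)

at_4k = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(at_4k / 1024 ** 3)      # 2.0 (GiB at 4K context)
```

Each generated token requires reading this cache once, so decoding throughput is bounded by memory bandwidth rather than by attention FLOPs.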

Flash Attention versions comparison

Flash Attention 1 vs 2 vs 3 (H100, seq=4096, batch=8)

Metric	FA1	FA2	FA3 (FP16)	FA3 (FP8)
Forward time (ms)	28.4	12.5	7.2	4.8
Memory (GB)	4.8	4.2	4.2	2.8
TFLOPS	180	420	740	1150
GPU util %	35%	55%	75%	82%

Key improvements:

FA2: 2.3x faster than FA1 (better parallelism)
FA3 (FP16): 1.7x faster than FA2 (H100 async optimizations)
FA3 (FP8): 2.6x faster than FA2 (low precision)

Features by version

Feature	FA1	FA2	FA3
Basic attention	✅	✅	✅
Causal masking	✅	✅	✅
Multi-query attention	❌	✅	✅
Sliding window	❌	✅	✅
Paged KV cache	❌	✅	✅
FP8 support	❌	❌	✅ (H100 only)
Work partitioning	Basic	Advanced	Optimal

Real-world model benchmarks

Llama 2 models (A100 80GB, batch=4, seq=2048)

Model	Params	Standard (samples/sec)	Flash Attn (samples/sec)	Speedup
Llama 2 7B	7B	1.2	3.1	2.6x
Llama 2 13B	13B	0.6	1.7	2.8x
Llama 2 70B	70B	0.12	0.34	2.8x

GPT-style models (seq=1024)

Model	Standard (tokens/sec)	Flash Attn (tokens/sec)	Speedup
GPT-2 (124M)	520	680	1.3x
GPT-J (6B)	42	98	2.3x
GPT-NeoX (20B)	8	22	2.75x

Recommendations by use case

Training large models (>7B parameters):

Use Flash Attention 2 on A100
Use Flash Attention 3 FP8 on H100 for maximum speed
Expected: 2.5-3x speedup

Long context inference (>4K tokens):

Flash Attention essential (enables contexts standard attention can't handle)
Expected: 2-4x speedup, 5-10x memory reduction

Short sequences (<512 tokens):

Flash Attention provides 1.2-1.5x speedup
Minimal memory benefit
Still worth enabling (no downside)

Multi-user serving:

Flash Attention reduces per-request memory
Allows higher concurrent batch sizes
Can serve 2-3x more users on same hardware

HuggingFace Transformers Integration

Enabling Flash Attention in Transformers
Supported model architectures
Configuration examples
Performance comparisons
Troubleshooting model-specific issues

Enabling Flash Attention in Transformers

HuggingFace Transformers (v4.36+) supports Flash Attention 2 natively.

Simple enable for any supported model:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

Install requirements:

pip install "transformers>=4.36"  # Quoted so the shell does not treat >= as a redirect
pip install flash-attn --no-build-isolation

Supported model architectures

As of Transformers 4.40:

Fully supported:

Llama / Llama 2 / Llama 3
Mistral / Mixtral
Falcon
GPT-NeoX
Phi / Phi-2 / Phi-3
Qwen / Qwen2
Gemma
Starcoder2
GPT-J
OPT
BLOOM

Partially supported (encoder-decoder):

BART
T5 / Flan-T5
Whisper

Check support:

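from transformers import AutoConfig

config = AutoConfig.from_pretrained("model-name")
# Private attribute: reports the attention implementation requested on this
# config (None until one is set), not whether the architecture supports it.
print(config._attn_implementation_internal)
# A definitive check is to load the model with
# attn_implementation="flash_attention_2"; unsupported architectures raise
# an error at load time.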

Configuration examples

Llama 2 with Flash Attention

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate
inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))

Mistral with Flash Attention for long context

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mistralai/Mistral-7B-v0.1"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for long context
    device_map="auto",
    max_position_embeddings=32768  # Extended context
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Process long document (placeholder text; substitute a real document)
long_text = "..." * 10000
inputs = tokenizer(long_text, return_tensors="pt", truncation=False).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)

Fine-tuning with Flash Attention

from transformers import Trainer, TrainingArguments
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    # bfloat16 weights: fp16=True on a model loaded in float16 fails in the
    # gradient scaler ("Attempting to unscale FP16 gradients")
    torch_dtype=torch.bfloat16
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    bf16=True,  # Must match model dtype
    optim="adamw_torch_fused"  # Fast optimizer
)

# train_dataset: a tokenized dataset prepared elsewhere (not shown here)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset
)

trainer.train()

Multi-GPU training

from transformers import AutoModelForCausalLM
import torch

# Model parallelism with Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",  # Automatic multi-GPU placement
    max_memory={0: "20GB", 1: "20GB"}  # Limit per GPU
)

Performance comparisons

Memory usage (Llama 2 7B, batch=1)

Sequence Length	Standard Attention	Flash Attention 2	Reduction
512	1.2 GB	0.9 GB	25%
2048	3.8 GB	1.4 GB	63%
8192	14.2 GB	3.2 GB	77%
32768	OOM (>24GB)	10.8 GB	Fits!

Speed (tokens/sec, A100 80GB)

Model	Standard	Flash Attn 2	Speedup
Llama 2 7B (seq=2048)	42	118	2.8x
Llama 2 13B (seq=4096)	18	52	2.9x
Llama 2 70B (seq=2048)	4	11	2.75x

Training throughput (samples/sec)

Model	Batch Size	Standard	Flash Attn 2	Speedup
Llama 2 7B	4	1.2	3.1	2.6x
Llama 2 7B	8	2.1	5.8	2.8x
Llama 2 13B	2	0.6	1.7	2.8x

Troubleshooting model-specific issues

Issue: Model doesn't support Flash Attention

Check support list above. If not supported, use PyTorch SDPA as fallback:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="sdpa",  # PyTorch native (still faster)
    torch_dtype=torch.float16
)

Issue: CUDA out of memory during loading

Reduce memory footprint:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "18GB"},  # Reserve memory for KV cache
    low_cpu_mem_usage=True
)

Issue: Slower inference than expected

Ensure dtype matches:

# Model and inputs must both be float16/bfloat16
model = model.to(torch.float16)
inputs = tokenizer(..., return_tensors="pt").to("cuda")
inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
          for k, v in inputs.items()}

Issue: Different outputs vs standard attention

Flash Attention is numerically equivalent but uses different computation order. Small differences (<1e-3) are normal:

# Compare outputs (both models on the GPU; Flash Attention requires CUDA)
model_standard = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Explicit baseline; newer versions default to SDPA
    torch_dtype=torch.float16
).to("cuda")
model_flash = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
).to("cuda")

inputs = tokenizer("Test", return_tensors="pt").to("cuda")

with torch.no_grad():
    out_standard = model_standard(**inputs).logits
    out_flash = model_flash(**inputs).logits

diff = (out_standard - out_flash).abs().max()
print(f"Max diff: {diff:.6f}")  # Should be ~1e-3 to 1e-4

Issue: ImportError during model loading

Install flash-attn:

pip install flash-attn --no-build-isolation

Or disable Flash Attention:

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="eager",  # Standard PyTorch
    torch_dtype=torch.float16
)
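The fallbacks above can be chosen automatically at startup. This is a minimal sketch with a hypothetical helper name: it only checks whether the flash_attn package is importable, not whether the GPU or the model architecture supports it, so a load-time error is still possible.

```python
import importlib.util

def pick_attn_implementation(prefer_flash: bool = True) -> str:
    """Return an attn_implementation value based on what is installed."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch-native fallback named in the troubleshooting section

choice = pick_attn_implementation()
print(f"attn_implementation={choice!r}")
```

The returned string can be passed straight to `from_pretrained(..., attn_implementation=choice)`.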

Best practices

1. Always use float16/bfloat16 with Flash Attention (not float32)
2. Set device_map="auto" for automatic memory management
3. Use bfloat16 for long context (better numerical stability)
4. Enable gradient checkpointing for training large models
5. Monitor memory with torch.cuda.max_memory_allocated()

Example with all best practices:

import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Better for training
    device_map="auto",
    low_cpu_mem_usage=True
)

# Enable gradient checkpointing for memory
model.gradient_checkpointing_enable()

# Training with optimizations
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    bf16=True,  # Match model dtype
    optim="adamw_torch_fused",
    gradient_checkpointing=True
)

How it compares

Use for GPU kernel selection with published numbers rather than generic 'enable flash attention' toggles.

FAQ

How much faster is Flash Attention 2 at 8k context?

optimizing-attention-flash reports on A100 80GB with batch=8, heads=32, dim=64: standard attention takes 218.5 ms at 8192 sequence length versus 66.2 ms for Flash Attention 2, a 3.3× speedup.

Does optimizing-attention-flash cover H100 GPUs?

optimizing-attention-flash includes H100 80GB Hopper forward-pass benchmark tables alongside A100 Ampere results, enabling Flash Attention 2 versus 3 comparisons at multiple sequence lengths.

Is Optimizing Attention Flash safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.
