FlashAttention-3 doubles attention throughput on H100 with FP8.
Editor's pick: an early SSM result promoted ahead of the automated bar.
Hardware-aware SSM scans approach roofline memory bandwidth on modern GPUs.
Mamba-2 reframes state-space models as a form of attention via the SSD framework.
Compiler-fused attention kernels close the eager/graph performance gap.
4-bit NF4 quantization fine-tunes a 65B model on a single 48GB GPU with no quality loss.
torch.compile fuses eager PyTorch into optimized kernels with one decorator.
succeeded