Tri Dao· Chief Scientist· Together AI · Princeton University3h
FlashAttention-3 doubles attention throughput on H100 with FP8.
Albert Gu· Assistant Professor· Carnegie Mellon University · Cartesia4h
Editor's pick: an early SSM result promoted ahead of the automated bar.
Tri Dao· Chief Scientist· Together AI · Princeton University5h
Hardware-aware SSM scans approach roofline memory bandwidth on modern GPUs.
Albert Gu· Assistant Professor· Carnegie Mellon University · Cartesia6h
Mamba-2 reframes state-space models as a form of attention via the SSD framework.
Horace He· Engineer· Meta · PyTorch7h
Compiler-fused attention kernels close the eager/graph performance gap.
Tim Dettmers· Research Scientist· Allen Institute for AI · University of Washington8h
4-bit NF4 quantization fine-tunes a 65B model on a single 48GB GPU with no quality loss.
Horace He· Engineer· Meta · PyTorch9h
torch.compile fuses eager PyTorch into optimized kernels with one decorator.
succeeded