Notes
Research, engineering write-ups, and the dead ends in between.
Speculative Decoding, Formally: The Algorithm, the Proof, and the Metrics That Matter
June 25, 2026
The Need for Speed: Why LLMs Are Slow and What Speculation Promises
June 25, 2026
A Field Guide to Speculative Decoding Methods
June 25, 2026
The EAGLE Family: Speculating in Feature Space
June 25, 2026
Parallel Drafting with Block Diffusion: DFlash and DDTree
June 25, 2026
Diffusion vs Autoregression: Why Language Models May Not Need to Think Left to Right
June 25, 2026
Putting It to Work: Serving Speculative Decoding with vLLM and SGLang
June 25, 2026
Broad Review of DLM Architectures
June 25, 2026
Why Diffusion LLM Quantization Is Harder Than It Looks
June 25, 2026
Apple Foundation Model 3, what even is it?
June 9, 2026
Distribution Matching Is Not Enough: Two Failure Modes in Latent Text Drifting
May 25, 2026
Probing Latent Directions in Video Diffusion Models
May 25, 2026
Hybrid Lexical–Semantic Retrieval for Tool Selection in Agent Systems
April 30, 2026
From Single-GPU to Distributed Training: A Framework for Making the Right Call
April 20, 2026
Distributed Data Parallel: How It Actually Works
April 20, 2026
Tensor Parallelism and Sequence Parallelism
April 20, 2026
Pipeline Parallelism: How It Actually Works
April 20, 2026
ZeRO and FSDP: Model Sharding
April 20, 2026
Kinetic-4B: A 4-Billion Parameter Model That Outperforms Claude Haiku at Tool Calling
April 1, 2026
LLM Inference at the Edge
March 30, 2026