Notes

Research, engineering write-ups, and the dead ends in between.

Speculative Decoding, Formally: The Algorithm, the Proof, and the Metrics That Matter

The Need for Speed: Why LLMs Are Slow and What Speculation Promises

A Field Guide to Speculative Decoding Methods

The EAGLE Family: Speculating in Feature Space

Parallel Drafting with Block Diffusion: DFlash and DDTree

Diffusion vs Autoregression: Why Language Models May Not Need to Think Left to Right

Putting It to Work: Serving Speculative Decoding with vLLM and SGLang

Broad Review of DLM Architectures

Why Diffusion LLM Quantization Is Harder Than It Looks

Apple Foundation Model 3: What Even Is It?

Distribution Matching Is Not Enough: Two Failure Modes in Latent Text Drifting

Probing Latent Directions in Video Diffusion Models

Hybrid Lexical–Semantic Retrieval for Tool Selection in Agent Systems

From Single-GPU to Distributed Training: A Framework for Making the Right Call

Distributed Data Parallel: How It Actually Works

Tensor Parallelism and Sequence Parallelism

Pipeline Parallelism: How It Actually Works

ZeRO and FSDP: Model Sharding

Kinetic-4B: A 4-Billion Parameter Model That Outperforms Claude Haiku at Tool Calling

LLM Inference at the Edge