Notes

Research, engineering write-ups, and the dead ends in between.

Speculative Decoding, Formally: The Algorithm, the Proof, and the Metrics That Matter

June 25, 2026

The Need for Speed: Why LLMs Are Slow and What Speculation Promises

June 25, 2026

A Field Guide to Speculative Decoding Methods

June 25, 2026

The EAGLE Family: Speculating in Feature Space

June 25, 2026

Parallel Drafting with Block Diffusion: DFlash and DDTree

June 25, 2026

Diffusion vs Autoregression: Why Language Models May Not Need to Think Left to Right

June 25, 2026

Putting It to Work: Serving Speculative Decoding with vLLM and SGLang

June 25, 2026

Broad Review of DLM Architectures

June 25, 2026

Why Diffusion LLM Quantization Is Harder Than It Looks

June 25, 2026

Apple Foundation Model 3, what even is it?

June 9, 2026

Distribution Matching Is Not Enough: Two Failure Modes in Latent Text Drifting

May 25, 2026

Probing Latent Directions in Video Diffusion Models

May 25, 2026

Hybrid Lexical–Semantic Retrieval for Tool Selection in Agent Systems

April 30, 2026

From Single-GPU to Distributed Training: A Framework for Making the Right Call

April 20, 2026

Distributed Data Parallel: How It Actually Works

April 20, 2026

Tensor Parallelism and Sequence Parallelism

April 20, 2026

Pipeline Parallelism: How It Actually Works

April 20, 2026

ZeRO and FSDP: Model Sharding

April 20, 2026

Kinetic-4B: A 4-Billion Parameter Model That Outperforms Claude Haiku at Tool Calling

April 1, 2026

LLM Inference at the Edge

March 30, 2026