DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Article summary
UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. It drafts whole token blocks in a single forward pass and conditions on target hidden features through KV injection. The paper reports up to 6.08x lossless speedup on Qwen3-8B, while NVIDIA reports up to 15x throughput on Blackwell at fixed interactivity. DFlash ships 20 checkpoints and supports SGLang, vLLM, and TensorRT-LLM.
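The draft-then-verify loop behind block-level speculative decoding can be sketched in a few lines. This is a toy illustration, not DFlash's implementation: `toy_target`, `toy_block_drafter`, and `speculative_decode` are hypothetical names, the "models" are plain deterministic functions, and neither block diffusion nor KV injection is modeled. The sketch also assumes greedy decoding, so "lossless" here means the output matches target-only greedy decoding exactly.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token from the target
DraftFn = Callable[[List[Token], int], List[Token]]  # proposes a whole block per call


def toy_target(prefix: List[Token]) -> Token:
    """Hypothetical stand-in for a target LLM's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_block_drafter(prefix: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter: one call returns block_size tokens.

    The inner loop only fabricates plausible draft data; a real block drafter
    would fill all positions of the block in a single forward pass.
    """
    block: List[Token] = []
    ctx = list(prefix)
    for _ in range(block_size):
        tok = toy_target(ctx)
        if len(ctx) % 7 == 6:  # deliberately wrong at some positions
            tok = (tok + 1) % 50
        block.append(tok)
        ctx.append(tok)
    return block


def greedy_decode(prompt: List[Token], target: TargetFn, max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[Token],
    target: TargetFn,
    drafter: DraftFn,
    block_size: int,
    max_new_tokens: int,
) -> Tuple[List[Token], int]:
    """Draft a block, keep the longest prefix the target agrees with, repeat.

    Returns the generated sequence and the number of verification steps.
    Each step stands for one target pass that would score the whole block
    in parallel in a real system; here it is simulated position by position.
    """
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    verify_steps = 0
    while len(out) < limit:
        draft = drafter(out, block_size)
        verify_steps += 1
        ctx = list(out)
        for tok in draft:
            expected = target(ctx)
            if tok != expected:
                ctx.append(expected)  # first mismatch: take the target's token, drop the rest
                break
            ctx.append(tok)           # draft token accepted
        else:
            ctx.append(target(ctx))   # whole block accepted: target adds one more token
        out = ctx[:limit]
    return out, verify_steps


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, toy_target, 20)
    spec, steps = speculative_decode(prompt, toy_target, toy_block_drafter, 4, 20)
    print("identical output:", spec == baseline)
    print("verification steps:", steps, "vs baseline target calls:", 20)
```

Every iteration emits at least one target-approved token, so the loop always terminates, and the number of target passes falls as more drafted tokens are accepted per block. That acceptance length is what drives the speedup in this scheme.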
Why it matters
MarkTechPost Video reports that UC San Diego's DFlash replaces autoregressive drafting with a lightweight block diffusion model for speculative decoding. The reported gains target LLM inference latency and serving throughput, and the SGLang, vLLM, and TensorRT-LLM support lets teams evaluate them in existing serving stacks.
Full story on MarkTechPost Video
Headlines aggregated via RSS for discovery on AIWedia. Original content © MarkTechPost Video. We link to the source and do not republish full articles.