List of Flash News about Autoregression
| Time | Details |
|---|---|
| 2025-10-20 18:58 | **Karpathy on Text Diffusion for LLMs (2025): Bidirectional Attention Raises Training Cost vs Autoregression.** According to @karpathy, text diffusion for language can be implemented with a vanilla transformer that uses bidirectional attention and iteratively re-masks and re-samples all tokens on a noise schedule (see the sketch below the table). He states that diffusion is the pervasive generative paradigm in image and video, while autoregression remains dominant in text, and audio shows a mix of both. He adds that stripping away the heavy formalism reveals simple baseline algorithms, with discrete diffusion being the closer analogue of flow matching in continuous settings. He explains that autoregression appends tokens while attending backward, whereas diffusion refreshes the entire token canvas while attending bidirectionally. He notes that bidirectional attention yields stronger language models but makes training more expensive, because training can no longer be parallelized across the sequence dimension (a causal model gets a next-token loss at every position from a single forward pass; a bidirectional one does not). He suggests it may be possible to interpolate between, or generalize over, diffusion and autoregression in the LLM stack. For traders, the actionable takeaway is the compute-cost trade-off of bidirectional text diffusion versus autoregression, which directly affects training-efficiency assumptions. Source: @karpathy. |
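Below is a minimal PyTorch sketch of the setup described above, under stated assumptions: the `BidirectionalLM` model, the `MASK_ID` token, the linear noise schedule, the uniform re-masking rule, and all sizes are illustrative choices made for this example, not details from @karpathy's post. The sampling loop starts from a fully masked canvas and refreshes every token in parallel at each step; the training step shows the cost asymmetry he points to, since each bidirectional pass supervises only the masked positions at one noise level.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, DIM = 1000, 0, 32, 128  # illustrative sizes

class BidirectionalLM(nn.Module):
    """Vanilla transformer encoder: no causal mask, so every position
    attends to every other position (the 'bidirectional' setup)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                      # (B, L) -> (B, L, VOCAB)
        h = self.embed(tokens) + self.pos
        return self.head(self.encoder(h))

@torch.no_grad()
def diffusion_sample(model, steps=8):
    """Re-sample ALL tokens each step, then re-mask a shrinking fraction
    per an assumed linear noise schedule: the whole canvas is refreshed
    in parallel instead of being appended to left-to-right."""
    x = torch.full((1, SEQ_LEN), MASK_ID)
    for step in range(steps):
        logits = model(x)
        x = torch.distributions.Categorical(logits=logits).sample()
        keep = (step + 1) / steps                   # fraction kept this step
        x[torch.rand(1, SEQ_LEN) > keep] = MASK_ID  # re-mask the rest
    return x                                        # final step keeps everything

def diffusion_train_step(model, tokens):
    """One bidirectional training pass: corrupt at a single sampled noise
    level and supervise only the masked positions. A causal AR model, by
    contrast, gets a next-token loss at EVERY position from one pass --
    the sequence-dimension parallelism that diffusion gives up."""
    rate = torch.empty(()).uniform_(0.1, 1.0)       # sampled noise level
    mask = torch.rand(tokens.shape) < rate
    logits = model(tokens.masked_fill(mask, MASK_ID))
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = BidirectionalLM()
print(diffusion_sample(model))                      # untrained: random tokens
print(diffusion_train_step(model, torch.randint(1, VOCAB, (4, SEQ_LEN))))
```

The uniform re-masking above is the simplest rule consistent with the description; published discrete-diffusion samplers typically unmask with more care, for example by keeping the highest-confidence predictions at each step.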