bidirectional attention Flash News List | Blockchain.News

List of Flash News about bidirectional attention

2025-10-20 22:13
Andrej Karpathy: DeepSeek-OCR Signals 4 Reasons Pixels May Beat Text Tokens for LLM Inputs — Efficiency, Shorter Context Windows, Bidirectional Attention, No Tokenizer

According to Andrej Karpathy (on X, Oct 20, 2025), the DeepSeek-OCR paper presents a strong OCR model, but more importantly it highlights why pixels might be superior to text tokens as inputs to large language models, both for model efficiency and for input fidelity. He states that rendering text to images and feeding in pixels can deliver greater information compression, enabling shorter context windows and higher efficiency. He adds that pixel inputs provide a more general information stream, one that preserves formatting such as bold and color and admits arbitrary images alongside text. He argues that image inputs enable bidirectional attention by default at the input stage, rather than autoregressive (causal) attention, which he characterizes as more powerful for processing. He advocates removing the tokenizer at input because of the complexity and risks of Unicode and byte encodings, including security or jailbreak issues such as continuation bytes and semantic mismatches for emojis. He frames OCR as just one of many vision-to-text tasks and suggests that many text-to-text tasks can be reframed as vision-to-text, while the reverse is not generally true. Finally, he proposes a practical setup in which user messages are images while the assistant response remains text, notes that outputting pixels is less obvious, and mentions an urge to build an image-input-only version of nanochat, referencing the vLLM project. Source: Andrej Karpathy on X, Oct 20, 2025.
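The causal-versus-bidirectional distinction Karpathy draws can be made concrete with a minimal NumPy sketch (ours, not from his post): the same scaled dot-product attention either masks out future positions, as for autoregressive text tokens, or attends over the full sequence, as vision encoders typically do over pixel patches.

```python
import numpy as np

def attention(q, k, v, causal):
    # Scaled dot-product attention. causal=True masks future positions
    # (autoregressive text input); causal=False lets every position attend
    # in both directions (how encoders typically treat pixel/patch inputs).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        keep = np.tril(np.ones(scores.shape, dtype=bool))
        scores = np.where(keep, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))  # toy sequence of 4 positions
_, w_causal = attention(q, k, v, causal=True)
_, w_bidir = attention(q, k, v, causal=False)
print(np.triu(w_causal, 1).sum())        # 0.0: no weight on future tokens
print(bool(np.triu(w_bidir, 1).sum() > 0))  # True: full bidirectional mixing
```

The upper triangle of the attention-weight matrix is exactly zero in the causal case, which is the constraint Karpathy argues pixel inputs escape by default.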

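Karpathy's tokenizer complaint can be seen in miniature with plain Python (our illustration, not his code): a single emoji expands into several UTF-8 bytes, most of them continuation bytes that carry no standalone meaning, whereas a pixel renderer would see just one glyph.

```python
# One visible symbol becomes four UTF-8 bytes; bytes in 0x80-0xBF are
# continuation bytes (0b10xxxxxx) that are meaningless in isolation.
emoji = "\N{THUMBS UP SIGN}"
data = emoji.encode("utf-8")
print([hex(b) for b in data])             # ['0xf0', '0x9f', '0x91', '0x8d']
print([0x80 <= b <= 0xBF for b in data])  # [False, True, True, True]
```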
Source
2025-10-20 18:58
Karpathy on Text Diffusion for LLMs (2025): Bidirectional Attention Raises Training Cost vs Autoregression

According to @karpathy, text diffusion for language can be implemented with a vanilla transformer using bidirectional attention that iteratively re-masks and re-samples all tokens on a noise schedule. He states that diffusion is the pervasive generative paradigm in image and video, while autoregression remains dominant in text, and audio shows a mix of both. He adds that stripping away heavy formalism reveals simple baseline algorithms, with discrete diffusion being the analogue of flow matching in continuous settings. He explains that autoregression appends tokens while attending backward, whereas diffusion refreshes the entire token canvas while attending bidirectionally. He notes that bidirectional attention yields stronger language models but makes training more expensive, because the loss can no longer be parallelized across the sequence dimension as it is in teacher-forced autoregressive training. He suggests it may be possible to interpolate between, or generalize over, diffusion and autoregression in the LLM stack. For traders, the actionable takeaway is the compute-cost trade-off of bidirectional text diffusion versus autoregression, which directly affects training-efficiency assumptions. Source: @karpathy.
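The re-mask-and-re-sample loop described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: `toy_denoiser` is a random stand-in for a bidirectional transformer, and the vocabulary, mask id, and linear noise schedule are all invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK, LENGTH, STEPS = 10, 10, 16, 8  # toy vocab; MASK is an extra id

def toy_denoiser(canvas):
    # Stand-in for a bidirectional transformer: returns logits for every
    # position given the full (partially masked) canvas. A real model would
    # attend in both directions over the whole sequence.
    return rng.normal(size=(len(canvas), VOCAB))

canvas = np.full(LENGTH, MASK)  # start fully masked (pure noise)
for step in range(STEPS):
    logits = toy_denoiser(canvas)
    # Re-sample every position in parallel: diffusion refreshes the whole
    # canvas at once rather than appending one token.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    canvas = np.array([rng.choice(VOCAB, p=p) for p in probs])
    # Noise schedule: re-mask a shrinking random fraction of positions.
    frac = 1.0 - (step + 1) / STEPS
    n_mask = int(frac * LENGTH)
    canvas[rng.choice(LENGTH, size=n_mask, replace=False)] = MASK
print(canvas)  # last step re-masks nothing, leaving a fully sampled sequence
```

The loop also makes the cost point visible: every denoising step is a full forward pass over the whole canvas, whereas autoregressive training gets a loss at every position from a single causal forward pass.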

Source