V3.2 AI News List | Blockchain.News

List of AI News about V3.2

2026-04-26 08:07
Sparse Attention Breakthrough Slashes 128K Context Costs by 60%: Techniques to Scale LLM Context Windows [2026 Analysis]

According to @_avichawla on X, moving to sparse attention at 128K tokens cuts prefilling cost from about $0.65 to $0.35 per million tokens and decoding cost from about $2.4 to $0.8 per million tokens, with equal or better long-context performance on V3.2. As reported by the post, sparse attention can preserve quality when engineered carefully, opening room for larger context windows without prohibitive inference costs. According to research cited broadly in industry literature, additional techniques to extend context include position-embedding scaling such as RoPE interpolation or YaRN to stabilize very long sequences, linear or subquadratic attention variants such as Performer or Hyena to reduce quadratic complexity, retrieval-augmented generation to offload context to external memory, chunking with cross-attention bridges for hierarchical conditioning, sliding-window or recurrent state compression to maintain continuity, and test-time attention sinks or key-value cache eviction policies to cap memory growth (see the sketch below). For businesses, these methods can lower serving costs and improve long-document QA, contract analysis, code comprehension, and multimodal transcript processing while maintaining accuracy at scale, according to common enterprise LLM deployment case studies.
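To make the last point concrete, one common way to cap memory growth is to keep a few attention-sink tokens from the start of the sequence plus a sliding window of recent tokens and evict everything in between. The Python sketch below illustrates that policy under stated assumptions: the function name evict_kv_cache and the parameters num_sink and window are illustrative choices, not details from the cited post or from V3.2.

import torch

def evict_kv_cache(keys: torch.Tensor,
                   values: torch.Tensor,
                   num_sink: int = 4,
                   window: int = 1024):
    # Keep the first num_sink tokens (attention sinks) plus the most recent
    # `window` tokens; evict everything in between so the cache is bounded
    # by num_sink + window entries regardless of sequence length.
    # keys, values: [batch, heads, seq_len, head_dim]
    seq_len = keys.shape[2]
    if seq_len <= num_sink + window:
        return keys, values  # nothing to evict yet
    keep = torch.cat([
        torch.arange(num_sink),                   # sink tokens
        torch.arange(seq_len - window, seq_len),  # recent window
    ])
    return keys[:, :, keep, :], values[:, :, keep, :]

# Usage: the cache stays capped at 4 + 1024 entries even as the context grows.
k = torch.randn(1, 8, 5000, 64)
v = torch.randn(1, 8, 5000, 64)
k_small, v_small = evict_kv_cache(k, v)
print(k_small.shape)  # torch.Size([1, 8, 1028, 64])

The bounded cache is what makes serving costs predictable: attention work per decoded token depends on the fixed cache size rather than on the full 128K history.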

Source
2026-04-26 08:07
DeepSeek V3.2 DSA Breakthrough: O(Lk) Sparse Attention Slashes 128K-Context Compute by Selecting Top‑k Tokens

According to @_avichawla on X, DeepSeek’s V3.2 introduces DeepSeek Sparse Attention (DSA), which reduces attention complexity from O(L²) to O(Lk) by selecting only the top‑k key‑value pairs for each query, with k capped at 2048 tokens even at the full 128K context. As reported by @_avichawla, a lightweight Lightning Indexer ranks salient tokens using a small number of FP8 heads, enabling a compute‑cheap preselection step before the expensive attention runs on the selected subset. According to the post, this design concentrates GPU FLOPs on useful tokens, offering lower latency and cost for long‑context inference and enabling scalable retrieval‑augmented generation and document intelligence workloads. As reported by the same source, the fixed k makes memory and compute predictable, which can translate into higher throughput per GPU and improved serving economics for enterprise long‑context applications.
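The top‑k selection described above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction of the idea as described in the post, not DeepSeek’s implementation: the function topk_sparse_attention, the random placeholder indexer scores, and all parameter names are assumptions, and a production kernel would fuse the gather with the attention computation and run the indexer in FP8.

import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, indexer_scores, top_k=2048):
    # q: [batch, heads, q_len, d]; k, v: [batch, heads, kv_len, d]
    # indexer_scores: [batch, q_len, kv_len] cheap relevance scores from a
    # lightweight indexer (a stand-in for the FP8 Lightning Indexer heads).
    # Each query attends only to its top_k best-scoring key/value pairs,
    # so attention cost grows as O(L * k) instead of O(L^2).
    b, h, q_len, d = q.shape
    top_k = min(top_k, k.shape[2])

    # 1) Compute-cheap preselection: top-k key indices per query.
    idx = indexer_scores.topk(top_k, dim=-1).indices      # [b, q_len, top_k]

    # 2) Gather only the selected keys/values for every (batch, head, query).
    b_idx = torch.arange(b).view(b, 1, 1, 1)
    h_idx = torch.arange(h).view(1, h, 1, 1)
    k_sel = k[b_idx, h_idx, idx.unsqueeze(1)]             # [b, h, q_len, top_k, d]
    v_sel = v[b_idx, h_idx, idx.unsqueeze(1)]

    # 3) Run full softmax attention only over the small selected subset.
    scores = torch.einsum("bhqd,bhqkd->bhqk", q, k_sel) / d ** 0.5
    return torch.einsum("bhqk,bhqkd->bhqd", F.softmax(scores, dim=-1), v_sel)

# Toy usage: a 1K-token context where each query sees only its 128 best tokens.
b, h, L, d = 1, 2, 1024, 64
q, k, v = (torch.randn(b, h, L, d) for _ in range(3))
scores = torch.randn(b, L, L)  # placeholder for the indexer's output
out = topk_sparse_attention(q, k, v, scores, top_k=128)
print(out.shape)  # torch.Size([1, 2, 1024, 64])

Because k is fixed (2048 in the post), the per-query attention cost stops growing with context length, which is the source of the predictable memory and throughput behavior described above.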

Source