List of AI News about MMLU
| Time | Details |
|---|---|
| 2026-04-12 16:53 | **DeepSeek V4 Latest Analysis: 1T MoE, 1M Token Context, Ascend 950PR Support, and 35x Inference Speed — 2026 Launch Insights.** According to God of Prompt on X, citing @xiangxiang103, DeepSeek V4 is reportedly slated for late April 2026 with a trillion-parameter MoE architecture that activates around 37B parameters at inference, claiming a 35x speedup and 40% lower energy use than prior baselines; it also touts a 1,000,000-token lossless context window and native multimodal support across text, image, video, and audio (source: X post by God of Prompt referencing @xiangxiang103). According to the same source, the model is said to be trained and served end-to-end on Huawei Ascend 950PR with roughly 85% compute utilization and one-third the deployment cost of Nvidia-based stacks, while reporting inference cost at about 1/70 that of GPT-4, implying substantial TCO reduction for high-throughput workloads (source: X post by God of Prompt). As reported by God of Prompt, benchmark claims include AIME 2026 at 99.4%, MMLU at 92.8%, SWE-Bench at 83.7%, and HumanEval at 90%, with support for 338 programming languages, alongside a self-developed mHC architecture and an Engram memory module that purportedly lower inference cost (source: X post by God of Prompt). According to the same X thread, the rollout plan includes a web client with Fast and Expert modes, OpenAI-compatible APIs with 5M free tokens for new users, and an intention to open-source model weights with local deployment support, which—if verified—could create new business opportunities in multilingual coding assistants, enterprise RAG at million-token scale, and low-cost multimodal agents for video and audio analytics (source: X post by God of Prompt referencing @xiangxiang103). |
| 2026-03-26 18:54 | **Gemini 3.1 Flash and Live: Latest Benchmark Analysis and Business Impact for 2026.** According to DemisHassabis, Google detailed Gemini 3.1 Flash and Live benchmark results, with the official Google blog reporting state-of-the-art or competitive scores across multimodal reasoning, long-context retrieval, and speech-to-speech interaction. According to Google, Gemini 3.1 Flash targets low-latency, high-throughput use cases while retaining strong performance on MMLU-style knowledge tests and image understanding, enabling cost-efficient deployments for customer support, analytics copilots, and creative tools. As reported by Google, Gemini 3.1 Live advances real-time voice agents with low-latency streaming ASR and TTS aligned to conversational grounding, showing gains on speech benchmarks that translate to smoother turn-taking and task completion for contact centers and voice commerce. According to Google, long-context benchmarks demonstrate robust retrieval over extended documents, suggesting opportunities for enterprise RAG pipelines, compliance review, and meeting assistants that require accurate citation over thousands of tokens. As reported by the Google blog, improved multimodal scores indicate stronger visual reasoning and chart interpretation, opening use cases in retail catalog QA, technical support with screenshots, and healthcare documentation review under proper governance. |
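The DeepSeek V4 item above mentions "OpenAI-compatible APIs", which in practice means existing OpenAI client tooling could target the model simply by swapping the base URL. A minimal stdlib sketch of that wire format (POST to `/v1/chat/completions` with a JSON body carrying `model` and `messages`), assuming a hypothetical endpoint and model id, since the source publishes neither:

```python
import json
import urllib.request

# Placeholder values: the X thread announces OpenAI-compatible APIs
# but does not publish a real endpoint URL or model identifier.
BASE_URL = "https://api.example.com/v1"

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Construct (but do not send) a chat-completions request in the
    OpenAI-compatible wire format."""
    payload = {
        "model": "deepseek-v4",  # hypothetical model id
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("Summarize this document.", "sk-placeholder")
print(req.full_url)  # -> https://api.example.com/v1/chat/completions
```

Because the request shape matches OpenAI's, the same payload could be sent with any HTTP client or the official `openai` SDK by setting its `base_url`; only the endpoint and model id here are invented for illustration.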