List of AI News about GPQA
| Time | Details |
|---|---|
| 2026-04-19 05:01 | **Benchmark Accuracy Controversy: Latest Analysis on GPQA Scores for 2024–2026 Frontier Models**<br>According to Ethan Mollick on X, many viral model "leaks" can be identified as fake because they do not use real benchmark numbers: GPQA accuracy already exceeds 90 percent for recent models, a pattern he attributes to fabricated scorecards generated by image tools without data validation (source: Ethan Mollick, X, Apr 19, 2026). According to academic benchmark reports and model cards cited by Anthropic and OpenAI, top-tier reasoning models such as Claude 3.5 and GPT-4-class systems report GPQA or GPQA Diamond performance near or above the 90 percent range under official evaluation settings, though exact figures vary by subset and prompting (sources: Anthropic model card, OpenAI research notes). As reported by community eval repositories such as the LMSYS leaderboards and EleutherAI discussions, discrepancies often stem from inconsistent prompts, contamination controls, and subset selection, creating opportunities for misleading charts in marketing posts (sources: LMSYS Chatbot Arena docs, EleutherAI forum). For AI builders and investors, the business takeaway is to demand reproducible evals with declared prompts, random seeds, and contamination checks; enterprises should favor vendors that publish run scripts and raw logs, since reliable GPQA performance correlates with higher pass rates on enterprise knowledge retrieval and research-assistant use cases (sources: Anthropic eval docs, OpenAI eval guidance; a minimal eval-manifest sketch appears after the table). |
| 2026-04-18 00:56 | **GDPval-AA Benchmark Criticized: Ethan Mollick Challenges Gemini 3.1 Judging Method in Artificial Analysis Index**<br>According to @emollick, GDPval-AA is not a meaningful benchmark because it uses Gemini 3.1 to judge model outputs on public GDPval questions, which he argues adds little signal about true capability. As reported by Artificial Analysis, Claude Opus 4.7 leads GDPval-AA with 1,753 Elo and tops the Artificial Analysis Intelligence Index at 57.3, narrowly ahead of Gemini 3.1 Pro at 57.2 and GPT-5.4 at 56.8; the firm states GDPval-AA spans 44 occupations and 9 industries using an agentic loop with shell and browsing via the Stirrup harness (an Elo expected-score sketch appears after the table). According to Artificial Analysis, Opus 4.7 improves on IFBench (+5.5 p.p.), TerminalBench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.), and GPQA Diamond (+1.8 p.p.), while reducing hallucinations to 36% and using ~35% fewer output tokens than Opus 4.6 to run the suite. For businesses, the dispute over GDPval-AA's evaluator design highlights the need to diversify benchmarks (e.g., HLE, GPQA Diamond, TerminalBench, AA-Omniscience) and to audit judge-model dependence to avoid evaluator bias and overfitting, as indicated by both Ethan Mollick's critique and Artificial Analysis' published methodology. |
| 2026-04-14 23:44 | **Claude 3.7 Benchmark Analysis: GPQA Gain per Version Step Shows Mislabeling Trend in AI Model Names**<br>According to Ethan Mollick on X, a chart estimating GPQA gains per 0.1 version step across major AI model naming schemes shows that Claude 3.7 delivers performance more consistent with a 4.4-class release, highlighting inconsistent, marketing-driven version labels across the industry (source: Ethan Mollick tweet, Apr 14, 2026). As reported by Mollick, the analysis normalizes GPQA improvements despite skipped version numbers, indicating outsized step changes for certain Anthropic releases and complicating vendor-to-vendor comparisons (source: Ethan Mollick; a worked normalization example appears after the table). For AI buyers, this implies procurement should rely on standardized benchmarks like GPQA rather than nominal versioning, and should institute model evaluation pipelines that track longitudinal benchmark deltas and task-specific win rates before upgrades (source: Ethan Mollick). |
| 2026-03-14 04:36 | **GPQA Diamond Benchmark Analysis: OpenAI Lead, Meta Volatility, xAI Stagnation, and China's Open-Weight LLMs**<br>According to Ethan Mollick on Twitter, a long-run chart of the GPQA Diamond benchmark visualizes key shifts in the AI model race: OpenAI's extended lead, Meta's rapid rise and decline, xAI's quick catch-up followed by stagnation, and the emergence of Chinese open-weight LLMs; as reported by Mollick's post, this highlights competitive dynamics and research focus across general problem-solving under the GPQA Diamond evaluation. According to the GPQA benchmark documentation cited by the community, GPQA Diamond is a high-difficulty subset of the graduate-level question-answering benchmark, designed to test advanced reasoning, making it a credible proxy for progress in complex reasoning capabilities. As reported by Mollick's visualization, business implications include model-selection strategies for enterprises prioritizing reasoning accuracy, vendor diversification amid performance volatility, and opportunities for open-weight adoption where compliance and on-prem control are required. |
| 2026-02-04 09:36 | **Stanford 2025 AI Index Report: Latest Benchmark Analysis Reveals Rapid Model Progress**<br>According to God of Prompt, the Stanford 2025 AI Index Report highlights that AI models are surpassing benchmarks at an unprecedented rate. The report notes significant year-over-year improvements, with MMMU scores rising by 18.8 percentage points, GPQA by 48.9 points, and SWE-bench by 67.3 points. These results indicate rapid advances in model capability, though the report raises the question of whether such gains reflect genuine progress or potential data leakage (source: God of Prompt, citing the Stanford 2025 AI Index Report). |
| 2025-06-05 16:00 | **Gemini 2.5 Pro Update: Enhanced AI Coding, Reasoning, and Benchmark Performance Announced**<br>According to Sundar Pichai on Twitter, the Gemini 2.5 Pro update is now in preview and delivers significant improvements in coding, reasoning, scientific, and mathematical capabilities. The update demonstrates higher performance across key industry benchmarks such as Aider Polyglot, GPQA, and HLE. Notably, Gemini 2.5 Pro leads the @lmarena_ai leaderboard with a 24-point Elo increase over the previous version (source: Sundar Pichai, Twitter, June 5, 2025). These advances signal new business opportunities for enterprises looking to integrate state-of-the-art AI for software development, scientific research, and data analysis. |
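
Eval-manifest sketch (re the 2026-04-19 item): the reproducibility takeaway of declared prompts, random seeds, and contamination checks can be captured in a small run manifest published alongside raw logs. This is a minimal sketch under assumed names; the `EvalManifest` class, its fields, and the example values are illustrative, not any vendor's published schema.

```python
# Minimal sketch of the reproducible-eval manifest argued for in the
# 2026-04-19 item. The schema is hypothetical, not any vendor's actual format.
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class EvalManifest:
    benchmark: str            # e.g. "GPQA Diamond"
    dataset_sha256: str       # hash of the exact question file evaluated
    prompt_template: str      # the full prompt, declared verbatim
    random_seed: int          # fixed seed so sampling is repeatable
    contamination_check: str  # method used, e.g. n-gram overlap screening

    def fingerprint(self) -> str:
        """Stable hash of the whole run config, publishable next to raw logs."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

manifest = EvalManifest(
    benchmark="GPQA Diamond",
    dataset_sha256="<sha256 of the released question set>",
    prompt_template="Answer the following multiple-choice question ...",
    random_seed=1234,
    contamination_check="n-gram overlap vs. pretraining corpus",
)
# Anyone re-running the eval from the same manifest should reproduce this value.
print(manifest.fingerprint())
```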
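
Elo expected-score sketch (re the 2026-04-18 item): the textbook Elo formula gives intuition for what a rating like 1,753 implies head-to-head. GDPval-AA's exact rating variant is not published in the item, and the rival rating below is hypothetical.

```python
# Textbook Elo expected-score formula, for intuition about rating gaps like
# the 1,753 reported for the GDPval-AA leader. GDPval-AA's exact scoring
# variant is not specified in the item above.

def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats one rated r_b (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Illustrative only: the reported 1,753 leader vs. a hypothetical 1,653 rival.
print(f"{elo_expected(1753, 1653):.3f}")  # ~0.640: a 100-point gap is ~64% win rate
```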
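
Version-normalization sketch (re the 2026-04-14 item): the analysis reduces to dividing a benchmark delta by the nominal version delta. The version numbers and scores below are invented placeholders, since Mollick's underlying data is not reproduced in the summary.

```python
# Sketch of the "GPQA gain per 0.1 version step" normalization from the
# 2026-04-14 item. All version numbers and scores here are invented
# placeholders; Mollick's underlying data is not reproduced in the summary.

def gain_per_step(v_old: float, v_new: float,
                  score_old: float, score_new: float) -> float:
    """Benchmark-point gain per 0.1 increment of the nominal version number."""
    steps = (v_new - v_old) / 0.1
    return (score_new - score_old) / steps

# Hypothetical: a 3.5 -> 3.7 release that jumps 18 benchmark points spans two
# nominal 0.1 steps, i.e. 9 points per step, the kind of outsized step change
# that makes a "3.7" label look like a larger-version-class release.
print(f"{gain_per_step(3.5, 3.7, 65.0, 83.0):.1f}")  # 9.0
```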