List of AI News about IFBench
| Time | Details |
|---|---|
| 2026-04-18 00:56 | GDPval AA Benchmark Criticized: Ethan Mollick Challenges Gemini 3.1 Judging Method in Artificial Analysis Index <br> According to @emollick, GDPval-AA is not a meaningful benchmark because it uses Gemini 3.1 to judge model outputs on public GDPval questions, which he argues adds little signal about true capability. As reported by Artificial Analysis, Claude Opus 4.7 leads GDPval-AA with 1,753 Elo and tops the Artificial Analysis Intelligence Index at 57.3, narrowly ahead of Gemini 3.1 Pro at 57.2 and GPT-5.4 at 56.8; the firm states GDPval-AA spans 44 occupations and 9 industries using an agentic loop with shell and browsing via the Stirrup harness. According to Artificial Analysis, Opus 4.7 improves on IFBench (+5.5 p.p.), TerminalBench Hard (+5.3 p.p.), HLE (+2.9 p.p.), SciCode (+2.6 p.p.), and GPQA Diamond (+1.8 p.p.), while reducing hallucinations to 36% and using ~35% fewer output tokens than Opus 4.6 to run the suite. For businesses, the dispute over GDPval-AA's evaluator design highlights the need to diversify benchmarks (e.g., HLE, GPQA Diamond, TerminalBench, AA-Omniscience) and to audit judge-model dependence to avoid evaluator bias and overfitting, as indicated by both Ethan Mollick's critique and Artificial Analysis's published methodology. |
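The core of Mollick's critique is that an Elo leaderboard built from a single judge model inherits that judge's preferences. A minimal sketch of how pairwise judge verdicts turn into Elo ratings illustrates the dependence: the function, K-factor, seed, and the assumed 70% judge-preference rate below are all hypothetical illustration, not Artificial Analysis's actual scoring pipeline.

```python
import random

def elo_update(r_a, r_b, score_a, k=32.0):
    """Standard Elo update for one pairwise comparison.

    score_a is 1.0 if the judge preferred model A's output,
    0.0 if it preferred model B's, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Simulate a judge model that prefers model A's outputs 70% of the
# time (an assumed rate, for illustration only).
random.seed(0)
ra, rb = 1500.0, 1500.0
for _ in range(2000):
    win = 1.0 if random.random() < 0.7 else 0.0
    ra, rb = elo_update(ra, rb, win)

print(round(ra), round(rb))
```

Note that the final gap is determined entirely by the judge's preference rate, not by any ground-truth quality signal; if the judge is biased toward one model's style, the Elo gap encodes that bias, which is exactly the evaluator-dependence concern raised about using Gemini 3.1 as the judge.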