Latest Update

6/18/2026 11:51:00 PM

AA-Briefcase Benchmark Shows Claude in the Lead and Wide Cost Gaps

According to Ethan Mollick (@emollick), AA-Briefcase ranks Claude Fable 5 at the top by Elo and shows wide price gaps between models. He also asks where the human comparison scores are.

Analysis

Artificial Analysis announced AA-Briefcase on June 18, 2026, as a new benchmark designed to evaluate AI models on long-horizon knowledge work tasks drawn from realistic multi-week projects. The benchmark stands out because it uses private hold-out tests and complex scenarios built by industry experts from Google, McKinsey, and BCG.

Key Takeaways

AA-Briefcase tests models across thousands of fragmented inputs, including emails, Slack messages, and documents, requiring sustained reasoning over weeks of project context rather than single prompts.
Claude Fable 5 leads with an Elo score of 1587, while cost per task ranges from over 31 dollars down to just four cents, highlighting wide price-performance gaps.
Even the top model satisfies all rubric criteria on only three percent of tasks, showing that real-world knowledge work remains a significant challenge for current AI systems.
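To put the cost spread in perspective, the short calculation below works out the ratio between the two per-task figures quoted above. It is only an illustrative sketch: the function name cost_ratio is hypothetical, and "over 31 dollars" is treated as exactly 31.00 dollars, an assumption that slightly understates the real gap.

```python
def cost_ratio(high_cents: int, low_cents: int) -> float:
    """Return how many times more expensive the costly option is."""
    if low_cents <= 0:
        raise ValueError("low_cents must be positive")
    return high_cents / low_cents


# Per-task figures quoted in the takeaways, expressed in cents.
# "Over 31 dollars" is treated as exactly 3100 cents (an assumption),
# so the true ratio is somewhat higher than the value computed here.
HIGH_CENTS = 3100
LOW_CENTS = 4

ratio = cost_ratio(HIGH_CENTS, LOW_CENTS)
print(f"Cost per task differs by at least {ratio:.0f}x")
```

Working in whole cents keeps the division exact and avoids floating-point rounding on values such as 0.04.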

Deep Dive into AA-Briefcase Structure

The benchmark features four private scenarios that mirror corporate projects with deliverables such as financial models and board presentations. A public fifth scenario called AA-Briefcase Lite is available on Hugging Face for demonstration only. Tasks build sequentially and draw on shared institutional context that includes contradictions and messy data typical of actual organizations.

Evaluation Methodology

AA-Briefcase combines binary rubric checks for factual correctness with pairwise grading on analytical and presentation quality. This dual approach reveals cases where polished outputs lack rigor or accuracy. Pass rates decline sharply as the number of required input files increases, exposing limitations in handling large, fragmented context.
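The two scoring ideas described above can be sketched in a few lines. The article does not give Artificial Analysis's actual formulas, so everything here is an assumption for illustration: the function names are hypothetical, the strict pass rule simply requires every binary rubric check to succeed, and the pairwise part uses the textbook Elo update with a K-factor of 32 and a 400-point scale.

```python
from typing import Dict, List


def task_passes(rubric_results: List[bool]) -> bool:
    """Strict rule: a task counts only if every binary rubric check passed.

    An empty rubric is treated as a failure rather than a vacuous pass.
    """
    return bool(rubric_results) and all(rubric_results)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0
) -> None:
    """Apply one pairwise judgment in place; k=32 is an assumed K-factor."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and a single pairwise judgment between them.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
update_elo(ratings, winner="model_a", loser="model_b")
print(ratings)

# One failed rubric check is enough to fail the whole task.
print(task_passes([True, True, False]))
```

Keeping the rubric check separate from the Elo rating mirrors the point made above: an output can win pairwise comparisons on polish while still failing a factual check.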

Business Impact and Opportunities

Companies can use AA-Briefcase results to select models for high-stakes knowledge work such as strategy consulting and product management. Open-weight models like GLM-5.2 offer strong performance at lower cost, creating monetization opportunities for fine-tuning services and specialized agent platforms. Implementation challenges include managing high token costs and integrating models into existing workflows while maintaining compliance with data privacy rules. Organizations can start with pilot projects on the public Lite version, which is offered for demonstration only, before scaling to private evaluations.

Future Outlook

AA-Briefcase signals a shift toward benchmarks that better reflect enterprise needs. As models improve on long-horizon tasks, competitive pressure will increase on providers to reduce costs and raise reliability. Regulatory considerations around AI accountability in professional outputs will grow, while ethical best practices will emphasize transparency about model limitations in ambiguous contexts. The benchmark is one to watch for ongoing progress in agentic capabilities.

Frequently Asked Questions

What makes AA-Briefcase different from earlier agentic evaluations?

It uses private hold-out tests and realistic multi-week projects built from thousands of fragmented source files instead of isolated prompts.

Which model currently leads the benchmark?

Claude Fable 5 leads with an Elo score of 1587, according to the Artificial Analysis announcement.

Are human performance scores available for comparison?

No human comparison score was included in the initial release.

How does task difficulty scale with input volume?

Pass rates drop as the number of required source files increases across all models tested.

Ethan Mollick

@emollick

Professor @Wharton studying AI, innovation & startups. Democratizing education using tech