Harvey Unveils Initial Results of Legal AI Benchmark LAB

Harvey has released the first results from its Legal Agent Benchmark (LAB), an open-source framework designed to evaluate AI agents on complex, long-horizon legal tasks. The initial findings, published May 26, 2026, underscore significant limitations in current-generation AI models. Despite rapid advancements, frontier models completed less than 10% of LAB tasks end-to-end under a strict all-or-nothing evaluation standard.

LAB, launched earlier this month, evaluates AI agents on more than 1,200 tasks spanning 24 legal practice areas. Each task mirrors a real-world law firm workflow and requires the model to produce review-ready legal work product, which is graded against 75,000 expert-created rubric criteria. Harvey's "all-pass" scoring system demands perfection: every rubric criterion must be satisfied for a task to pass.
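To make the strictness of all-pass scoring concrete, the following is a minimal sketch, not Harvey's actual grading code. The names TaskResult, task_passes, all_pass_rate, and criterion_rate are hypothetical, the sample data is invented for illustration, and treating a task with no criteria as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record: one boolean per rubric criterion for a task."""
    task_id: str
    criteria_met: list[bool]


def task_passes(result: TaskResult) -> bool:
    # All-pass rule: a single unmet criterion fails the whole task.
    # Assumption: a task with no criteria counts as a failure.
    return len(result.criteria_met) > 0 and all(result.criteria_met)


def all_pass_rate(results: list[TaskResult]) -> float:
    # Fraction of tasks in which every criterion was satisfied.
    if not results:
        return 0.0
    return sum(task_passes(r) for r in results) / len(results)


def criterion_rate(results: list[TaskResult]) -> float:
    # Fraction of individual criteria satisfied, pooled across all tasks.
    total = sum(len(r.criteria_met) for r in results)
    met = sum(sum(r.criteria_met) for r in results)
    return met / total if total else 0.0


# Invented sample data: three tasks, only the first is flawless.
results = [
    TaskResult("task-a", [True, True, True]),
    TaskResult("task-b", [True, True, False]),
    TaskResult("task-c", [True] * 9 + [False]),
]

print(f"all-pass rate:  {all_pass_rate(results):.1%}")
print(f"criterion rate: {criterion_rate(results):.1%}")
```

In this invented sample, 14 of 16 criteria are met (87.5%), yet only one of three tasks passes (33.3%). That gap is one reason an all-pass score can sit in the single digits even when a model satisfies most individual criteria.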

Key Findings: Frontier AI Falls Short

Among the evaluated models, Claude Opus 4.7 led with a 7.1% success rate, followed by Sonnet 4.6 at 5.4%, Opus 4.6 at 4.2%, GPT-5.5 at 2.1%, and Gemini 3.5 Flash at just 0.8%. While these figures suggest progress, they also highlight how far legal AI lags behind human capabilities. "Legal work is far from saturated," the report notes, especially given the high stakes and precision required in domains like corporate law, IP, and regulatory compliance.

The findings also revealed uneven competence across practice areas. Models displayed "jagged intelligence," excelling in some specialties while failing catastrophically in others. For instance, GPT-5.5 performed well in regulated and emerging-company tasks reliant on heavy research, while Opus 4.7 outperformed in corporate transactions requiring synthesis and analysis. No single model dominated across all categories, reinforcing the need for multi-model strategies in AI deployments.

Cost and Latency Constraints

Another major hurdle is operational efficiency. The best-performing model, Opus 4.7, costs approximately $50.90 per task and takes 22 minutes to complete one, which is far from feasible for high-volume legal operations. Faster alternatives like Gemini 3.5 Flash finish in under six minutes but sacrifice accuracy, with a success rate of just 0.8%. These trade-offs present challenges for firms looking to deploy AI in production environments, where cost and speed must be balanced against quality.
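One way to read these figures together is to ask what a single passing task costs on average. The sketch below is a back-of-the-envelope estimate and not a number from Harvey's report. It assumes that the reported cost is an average over every attempted task, passing or failing, and that the cost of failed attempts is spread over the passing ones. The function name cost_per_passed_task is hypothetical.

```python
def cost_per_passed_task(cost_per_task: float, pass_rate: float) -> float:
    """Average spend per passing task, assuming every attempt costs the same.

    Assumption: failed attempts are paid for but yield nothing usable,
    so their cost is spread over the passing tasks.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_task / pass_rate


# Figures reported for Opus 4.7: about $50.90 per task, 7.1% all-pass rate.
opus_cost = cost_per_passed_task(50.90, 0.071)
print(f"Opus 4.7: about ${opus_cost:,.2f} per passing task")
```

Under these assumptions, the figure for Opus 4.7 works out to roughly $717 per passing task. The same calculation cannot be run for Gemini 3.5 Flash from the figures above, because only its latency and success rate are given.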

Behavioral Insights: What Sets Successful Models Apart

Harvey’s study also analyzed agent behavior, identifying key patterns that improve task performance. The most effective agents demonstrated behaviors akin to those of skilled human associates: thorough research before drafting, post-draft validation, and iterative revisions. For example, agents that validated and revised their outputs after drafting improved their pass rates by 1.5 points on average. In contrast, skipping review steps led to a 1.2-point drop in success rates.

Interestingly, models like Opus 4.7 showed strong self-corrective tendencies, frequently revising drafts and achieving higher scores on drafting-related tasks. Meanwhile, GPT-5.5 excelled in research-intensive activities, leveraging extensive document search capabilities to outperform competitors in knowledge-heavy domains.

The Road Ahead

Harvey's LAB represents a significant step toward domain-specific AI benchmarking, but the results are a sobering reminder of the gap between current AI capabilities and the demands of professional environments like law. The benchmark's next phases will focus on expanding its task library, improving cost-efficiency, and fostering collaboration with AI labs to refine model performance.

For law firms and enterprises considering AI adoption, LAB provides a crucial lens into where AI can realistically add value today. Multi-model strategies, combining specialized capabilities from different AI families, are likely to dominate in the near term. However, the high cost and latency of frontier models remain barriers to widespread deployment, limiting AI’s potential to fully automate high-stakes legal work.

