List of Flash News about evals
| Time | Details |
|---|---|
| 2025-10-16 16:56 | Andrew Ng on AI Agents: Evals and Error Analysis Are the Biggest Predictor of Progress (Best Practices and Metrics for Agentic Workflows)<br>According to @AndrewYNg, the strongest predictor of how quickly a team improves its AI agents is a disciplined process for evals and error analysis, rather than ad hoc fixes or chasing buzzy tools; such a process enables faster, measurable improvement in production systems. Source: Andrew Ng on X, Oct 16, 2025. He explains that generative AI has a larger output space and more failure modes than supervised learning, so iterative evals tailored to the application matter more than relying solely on standard metrics such as accuracy, precision, recall, F1, and ROC. Source: Andrew Ng on X, Oct 16, 2025. For enterprise workflows such as automated invoice processing, he recommends prototyping quickly, inspecting outputs manually, and then building objective or LLM-as-judge metrics that target high-risk fields such as due date, amount, address, and currency, as well as API call correctness. Source: Andrew Ng on X, Oct 16, 2025. He advises building evals first to quantify system performance and then using error analysis to focus development; detailed guidance appears in Module 4 of the Agentic AI course and in The Batch Issue 323 on deeplearning.ai. Source: deeplearning.ai (Agentic AI Module 4; The Batch Issue 323, https://www.deeplearning.ai/the-batch/issue-323/). |
| 2025-10-06 17:35 | Greg Brockman unveils agentkit: build AI agents in 8 minutes with visual builder, evals, and guardrails<br>According to @gdb, agentkit is a toolkit for building high-quality AI agents for any vertical, combining a visual builder, evals, guardrails, and other tools; a live demo showed a working agent built in 8 minutes. Source: @gdb on X, Oct 6, 2025, https://twitter.com/gdb/status/1975253703180623921. Crypto and AI equity traders can use the announcement time and the stated feature set as reference points when monitoring sentiment around AI agent tooling. Source: @gdb on X, Oct 6, 2025, https://twitter.com/gdb/status/1975253703180623921. |
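
The invoice-processing item above describes building objective metrics that target high-risk fields. As a rough illustration only, such a metric might be sketched as per-field exact-match accuracy plus a list of mismatches for manual error analysis. The field names, the `normalize` rule, and both function names are hypothetical and do not come from Andrew Ng's post or the deeplearning.ai material.

```python
# Hypothetical field names, chosen to mirror the high-risk fields named in the item above.
FIELDS = ("due_date", "amount", "currency", "address")


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are not errors."""
    return " ".join(str(value).lower().split())


def field_accuracy(predictions, references, fields=FIELDS):
    """Return per-field exact-match accuracy over paired prediction/reference dicts."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one example is required")
    correct = {field: 0 for field in fields}
    for pred, ref in zip(predictions, references):
        for field in fields:
            if normalize(pred.get(field, "")) == normalize(ref.get(field, "")):
                correct[field] += 1
    return {field: correct[field] / len(references) for field in fields}


def collect_errors(predictions, references, fields=FIELDS):
    """List (example index, field, predicted, expected) for every mismatch."""
    errors = []
    for index, (pred, ref) in enumerate(zip(predictions, references)):
        for field in fields:
            predicted, expected = pred.get(field, ""), ref.get(field, "")
            if normalize(predicted) != normalize(expected):
                errors.append((index, field, predicted, expected))
    return errors
```

Calling `field_accuracy` on a small hand-labeled set gives one score per field, and `collect_errors` gives the concrete mismatches to read through. Exact match is a deliberately strict choice here; fields such as free-form addresses may be better served by a fuzzier comparison or an LLM-as-judge metric, as the item itself suggests.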