Anthropic has announced a new initiative to fund third-party evaluations that better assess AI capabilities and risks, a response to what the company describes as growing demand in the field.
Addressing Current Evaluation Challenges
The current landscape of AI evaluations is limited, and developing high-quality, safety-relevant assessments remains challenging. With demand for such evaluations outpacing supply, Anthropic is introducing this initiative to fund third-party organizations that can effectively measure advanced AI capabilities. The goal is to elevate the field of AI safety by providing tools that benefit the entire ecosystem.
Focus Areas
Anthropic's initiative prioritizes three key areas:
- AI Safety Level assessments
- Advanced capability and safety metrics
- Infrastructure, tools, and methods for developing evaluations
AI Safety Level Assessments
Anthropic is seeking evaluations that measure the AI Safety Levels (ASLs) defined in its Responsible Scaling Policy. These evaluations are crucial for ensuring responsible development and deployment of AI models. The focus areas include:
- Cybersecurity: Evaluations assessing models' capabilities in assisting or acting autonomously in cyber operations.
- Chemical, Biological, Radiological, and Nuclear (CBRN) Risks: Evaluations that assess models' abilities to enhance or create CBRN threats.
- Model Autonomy: Evaluations focusing on models' capabilities for autonomous operation.
- National Security Risks: Evaluations identifying and assessing emerging risks in national security, defense, and intelligence operations.
- Social Manipulation: Evaluations measuring models' potential to amplify persuasion-related threats.
- Misalignment Risks: Evaluations monitoring models' abilities to pursue dangerous goals and deceive human users.
Advanced Capability and Safety Metrics
Beyond ASL assessments, Anthropic aims to develop evaluations that assess advanced model capabilities and relevant safety criteria. These metrics will provide a comprehensive understanding of models' strengths and potential risks. Key areas include:
- Advanced Science: Developing evaluations that challenge models with graduate-level knowledge and autonomous research projects.
- Harmfulness and Refusals: Enhancing evaluations of classifiers' abilities to detect harmful outputs.
- Improved Multilingual Evaluations: Supporting capability benchmarks across multiple languages.
- Societal Impacts: Developing nuanced assessments targeting concepts like biases, economic impacts, and psychological influence.
Infrastructure, Tools, and Methods for Developing Evaluations
Anthropic is interested in funding tools and infrastructure that streamline the development of high-quality evaluations. This includes:
- Templates/No-code Evaluation Platforms: Enabling subject-matter experts without coding skills to develop robust evaluations.
- Evaluations for Model Grading: Improving models' abilities to review and score outputs using complex rubrics.
- Uplift Trials: Running controlled trials to measure models' impact on task performance.
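To make the uplift-trial idea concrete, the following is a minimal sketch of how such a trial's headline number might be computed. It assumes a simple two-group design (a control group working without a model and a group working with model assistance) and compares mean task scores. The function name, the scoring scale, and the sample scores are hypothetical illustrations, not part of Anthropic's announcement, and a real trial would also need randomization and significance testing.

```python
from statistics import mean


def uplift(control_scores, assisted_scores):
    """Return (absolute, relative) uplift in mean task score.

    Assumption: both lists hold per-participant scores on the same task
    and the same scale. This is an illustrative sketch only.
    """
    if not control_scores or not assisted_scores:
        raise ValueError("both groups need at least one score")
    control_mean = mean(control_scores)
    assisted_mean = mean(assisted_scores)
    absolute = assisted_mean - control_mean
    # Relative uplift is undefined when the control mean is zero.
    relative = absolute / control_mean if control_mean else None
    return absolute, relative


# Hypothetical task scores on a 0-100 scale.
control = [40, 50, 60]    # participants without model access
assisted = [55, 65, 75]   # participants with model access
absolute, relative = uplift(control, assisted)
```

Reporting both absolute and relative uplift is a design choice: the absolute difference shows the size of the effect on the task's own scale, while the relative figure makes results comparable across tasks with different baselines.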
Principles of Good Evaluations
Anthropic emphasizes several characteristics of good evaluations, including sufficient difficulty, exclusion from training data, efficiency, scalability, and domain expertise. The company also recommends documenting the development process and iterating on initial evaluations to ensure they capture the desired behaviors and risks.
Submitting Proposals
Anthropic invites interested parties to submit proposals through its application form. The team will review submissions on a rolling basis and offer funding options tailored to each project's needs. Teams behind selected proposals will have the opportunity to work with domain experts from various teams within Anthropic to refine their evaluations.
This initiative aims to advance the field of AI evaluation, setting industry standards and fostering a safer and more reliable AI ecosystem.