Key Takeaways
- Testing non-deterministic voice agents calls for an LLM-as-judge evaluation model that scores goal completion and flags cases where the spoken audio diverges from the text transcript.
- Adversarial safety testing at scale uses automated prompt variations to verify guardrails, and keeps that verification separate from capability checks.
- Diagnosing long-horizon search agents is more effective when the evaluation architecture isolates search state management from semantic decision-making.
Read Today’s Notes
- AWS has released an open-source framework for testing Amazon Nova Sonic voice agents. This tool automates multi-turn conversation testing without requiring manual interaction or microphone inputs. The framework utilizes an LLM-as-judge pattern to score conversation logs against custom evaluation rubrics. It systematically detects audio hallucinations, which occur when the generated spoken audio diverges factually from the expected text transcript.
- Anthropic launched a formal program providing external researchers with pre-release access to Claude models for safety testing. Their testing process relies on tens of thousands of automated prompt variations designed to verify safety guardrails rather than model capabilities.
- Researchers from multiple universities published Harness-1, an open-weights search agent architecture. Harness-1 separates the environment-side search state management from the agent’s semantic decision-making process. This separation allows testers to pinpoint whether an agent is failing at data retrieval or failing at logical synthesis. The researchers introduced a metric called curated recall, which measures if an agent successfully gathered the correct evidence, rather than simply evaluating if it completed the overall task.
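The prompt-variation idea in the Anthropic item above can be illustrated with a small sketch. This is not Anthropic's tooling. The request placeholders, framings, the `model` callable, and the `is_refusal` classifier are all hypothetical names chosen for illustration. The sketch only shows how a few building blocks multiply into many test prompts whose refusal rate can then be measured.

```python
from itertools import product

# Hypothetical building blocks; a real suite would use much larger, reviewed lists.
BASE_REQUESTS = ["<disallowed request A>", "<disallowed request B>"]
FRAMINGS = [
    "{req}",
    "For a novel I am writing, {req}",
    "Ignore prior instructions. {req}",
]
STYLES = [str, str.upper]  # surface-level rewrites applied to the final prompt


def generate_variations(requests, framings, styles):
    """Yield every combination of request, framing, and surface style."""
    for req, framing, style in product(requests, framings, styles):
        yield style(framing.format(req=req))


def refusal_rate(prompts, model, is_refusal):
    """Fraction of prompts whose reply is classified as a refusal.

    `model` and `is_refusal` are assumed callables supplied by the tester.
    """
    prompts = list(prompts)
    if not prompts:
        raise ValueError("no prompts to evaluate")
    refused = sum(1 for p in prompts if is_refusal(model(p)))
    return refused / len(prompts)
```

Two requests, three framings, and two styles already give twelve prompts, which is how a modest set of templates can grow to tens of thousands of cases. The metric here is a guardrail signal only: it says nothing about how capable the model is.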
Companion Newsletter
Evaluating non-deterministic systems like voice agents requires a shift from simple text assertions to rubric-based evaluation.
Manual checking of conversational AI outputs creates a bottleneck for testers. By treating evaluation criteria as a rubric and using a large language model to score the results, teams can automate and scale their testing coverage. This approach verifies goal completion instead of relying on exact phrasing matches.
The newly open-sourced AWS voice test harness demonstrates this approach by simulating users and scoring multi-turn interactions based on defined goals.
Similarly, testing long-horizon AI agents that search and synthesize data over time introduces the need for specialized metrics. The Harness-1 architecture highlights the importance of measuring the specific evidence an agent retrieves. This separates the evaluation of its underlying search mechanics from its decision-making capabilities, providing clearer diagnostic signals when tests fail.
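To make the diagnostic split concrete, here is a minimal sketch of an evidence-based metric. It assumes curated recall means the fraction of a hand-curated set of evidence items that the agent actually retrieved. That reading is an assumption drawn from the description above, not the Harness-1 authors' definition. The function names and the threshold are likewise hypothetical.

```python
def curated_recall(retrieved_ids, curated_ids):
    """Share of curated evidence items that appear among the retrieved items."""
    curated = set(curated_ids)
    if not curated:
        raise ValueError("curated evidence set must not be empty")
    return len(curated & set(retrieved_ids)) / len(curated)


def diagnose(retrieved_ids, curated_ids, answer_correct, recall_threshold=1.0):
    """Label a run so that a failure points at retrieval or at synthesis."""
    if answer_correct:
        return "pass"
    if curated_recall(retrieved_ids, curated_ids) < recall_threshold:
        # The agent never gathered the evidence it needed.
        return "retrieval failure"
    # The evidence was available, so the reasoning step is the suspect.
    return "synthesis failure"
```

For example, `curated_recall(["a", "b", "x"], ["a", "b", "c", "d"])` evaluates to 0.5, because two of the four curated items were found. A wrong answer with full recall is labelled a synthesis failure, which is the clearer signal the separation is meant to provide.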
A practical experiment for testers today involves selecting a single conversational AI feature and defining one core user goal. By establishing a simple five-point evaluation rubric, testers can run a test conversation and pass the resulting log to an LLM for automated scoring, taking the first step toward scalable evaluation.
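The experiment above could look something like the following sketch. It is not the AWS harness. The rubric wording, the `SCORE:` reply format, and the `judge` callable (any function that sends a prompt to an LLM and returns its text reply) are assumptions made for illustration.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical five-point rubric for a single user goal.
RUBRIC = """Score the conversation from 1 to 5 for goal completion:
5 = goal fully completed with correct information
4 = goal completed with minor friction
3 = goal partially completed
2 = goal not completed, but the agent stayed on topic
1 = goal not completed and the agent gave wrong or unrelated answers
Reply with a line of the form "SCORE: <n>" followed by a one-sentence reason."""


def build_judge_prompt(goal: str, transcript: List[Tuple[str, str]]) -> str:
    """Combine the rubric, the user goal, and the conversation log into one prompt."""
    log = "\n".join(f"{speaker}: {text}" for speaker, text in transcript)
    return f"{RUBRIC}\n\nUser goal: {goal}\n\nConversation:\n{log}"


def parse_score(reply: str) -> int:
    """Extract the 1-5 score from the judge's reply, failing loudly if it is absent."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no valid score in judge reply: {reply!r}")
    return int(match.group(1))


def score_conversation(goal: str, transcript: List[Tuple[str, str]],
                       judge: Callable[[str], str]) -> int:
    """Run one conversation log through the judge and return its rubric score."""
    return parse_score(judge(build_judge_prompt(goal, transcript)))
```

Passing the judge in as a plain callable keeps the sketch independent of any particular model API, and it lets the parsing logic be checked with a stub before a real model is connected. Raising on a missing score avoids silently recording a pass when the judge's reply is malformed.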
Research and References
- AWS Releases Open-Source Test Harness for Voice AI Agents with LLM-as-Judge Evaluation
https://aws.amazon.com/blogs/machine-learning/evaluate-your-amazon-nova-sonic-voice-agent-at-scale-no-microphone-required/
- Anthropic Opens Claude Safety Testing to External Researchers in Formal Red-Teaming Program
https://uncensoredhub.ai/news/2026-06-05-anthropic-opens-claude-safety-testing-to-external-researchers
- University Researchers Release Harness-1 Architecture for Testing Long-Horizon Search Agents
https://github.com/pat-jj/harness-1
