Home

A daily 5-minute podcast and newsletter for software testing practitioners and leaders—helping you spot blind spots, think clearer, test smarter, keep up with AI-driven change and new ways of working, and get ideas you can use the same day—narrated by an AI twin trained on my voice, reasoning style, and coaching approach.

What · Why · Who →

Today podcast:

  • Evaluation Is the New Foundation of AI Testing

    Key Takeaways

    Testing AI systems requires structured evaluation, not intuition.
    Whether you’re shipping a chatbot, RAG system, or AI-generated test suite, measurable checks like faithfulness, hallucination detection, and prompt injection testing must be integrated into your workflow.

    Read Today’s Notes

    1️⃣ Shipping a Chatbot Without Evals: What Went Wrong

    Source: Data Science Collective

    A team launched a conversational LLM chatbot without a formal evaluation framework. The result:

    • Unexpected behavior in edge conversations
    • Manual firefighting in production
    • No visibility into quality drift

    Core Lessons

    • You cannot test conversational AI like a CRUD application.
    • Conversation state matters:
      • User profile
      • Prior messages
      • Context variables
    • Evaluation must be continuous, not one-time.

    Their Fix: Two-Part Evaluation System

    Unit Evaluations (YAML-based)

    • Define conversation states in structured files
    • Assert expected response characteristics
    • Lightweight and reproducible

    User Simulation + Production Monitoring

    • Simulated user flows
    • Observability in live traffic

    2️⃣ Three Production Evaluation Patterns

    Source: DEV Community case study using DeepEval

    After an AI agent fabricated product features in production, a team introduced three CI-integrated checks:

    Pattern 1: Hallucination Detection

    • Validates factual consistency
    • Flags invented information

    Pattern 2: Faithfulness Scoring (for RAG)

    • Ensures answers align with retrieved documents
    • Detects unsupported claims

    Pattern 3: Tool Correctness

    • Validates:
      • Correct function selection
      • Correct parameters
      • Correct execution order

    Result:

    • Deployed across 80+ production agents
    • AI quality moved from “gut feeling” to measurable metrics

    3️⃣ Security: Prompt Injection Is Not Optional to Test

    According to OWASP Top 10 for LLM Applications, Prompt Injection is the #1 risk.

    Tool spotlight: Promptfoo Red Teaming

    Capabilities:

    • Automated adversarial testing
    • 50+ vulnerability types
    • Jailbreaks
    • PII leakage detection
    • Instruction override attempts

    Security testing for LLMs is now a QA responsibility.

    4️⃣ Playwright AI Agents: Reality Check

    Source: ScrollTest experiment

    A week-long evaluation of Playwright AI agents:

    • Planner → Excellent at mapping test scenarios
    • Generator → Good draft code, requires human cleanup
    • Healer → Strong at fixing broken locators

    Key insight:
    AI is a productivity multiplier, not a replacement for judgment.

    Recommended usage pattern:

    1. Use Planner for exploration
    2. Review generated spec
    3. Use Generator carefully
    4. Human review required

    5️⃣ Practical 30-Minute Experiment

    Pick ONE:

    • Create 5–10 YAML conversation test cases
    • Install DeepEval and run Faithfulness metric
    • Run Promptfoo red-team setup
    • Use Playwright Planner to generate a test map

    Small eval > No eval.

    Companion Newsletter

    Evaluation Is Now Your Core Testing Skill

    • We are past the phase of “let’s ship the AI and see what happens.”
    • Three independent stories this week point to the same shift:
    • AI quality must be engineered, not observed.

    The Pattern Behind All the Failures

    • A chatbot was released without structured evaluation → production chaos.
    • An AI agent fabricated product features → trust damage.
    • A tester evaluated AI-generated Playwright tests → found they still require review.
    • Different systems. Same problem:
    • No measurable definition of “correct.”
    • Traditional testing relies on deterministic expectations.
    • AI systems require probabilistic validation.

    That means we must define:

    • What counts as hallucination?
    • What counts as faithful?
    • What counts as correct tool use?
    • What counts as secure prompt handling?

    If you cannot answer those, you are not testing AI — you are observing it.


    Three Questions Every Tester Should Start Asking

    • Where does hallucination show up in my system?
    • If I use RAG, how do I prove responses are grounded?
    • If my AI calls tools or APIs, how do I validate correctness?

    If those checks are not automated, they will not scale.


    Why This Matters for Your Career

    AI evaluation is becoming a specialization.

    The teams integrating:

    • Faithfulness scoring
    • Tool correctness checks
    • Red-team automation
    • CI-integrated eval metrics

    are moving from reactive QA to AI reliability engineering.

    This is a skill gap right now.


    What You Can Try Today (30 Minutes)

    Choose one:

    Chatbot?
    Write 5 YAML test cases capturing:

    • Context
    • Expected intent
    • Forbidden output

    RAG system?
    Install DeepEval.
    Run Faithfulness metric on 3 queries.
    See what fails.

    Using Playwright?
    Run the Planner agent.
    Audit the exploration map it generates.

    Security concerned?
    Install Promptfoo red-team setup.
    Run one injection scenario.

    The goal isn’t perfection.

    It’s moving from intuition to instrumentation.


    If you walk away with one mindset shift:

    AI testing starts when you define the metric.

    Research & References


Subscribe to AI in Testing Daily