Key Takeaways
As AI agents increasingly write and modify code, QA becomes the final authority on safety, correctness, and release readiness.
Modern testing now requires security scanning, explainable evaluation, and structured gatekeeping, not just execution.
Read Today’s Notes
What changed this week
Three signals point to the same shift: AI-driven development only works if QA evolves into an enforcement and evaluation layer.
Signal breakdown
1. Promptfoo brings LLM security into CI/CD
- New scanner detects prompt injection, PII leakage, and excessive agent autonomy
- Works as GitHub Action, VS Code extension, or CLI
- Traces user input → prompt construction → model invocation
- Enables true “shift-left” security testing for LLM apps
Why it matters:
Traditional SAST/DAST tools cannot see prompt-level vulnerabilities. QA now owns this gap.
2. QA as the “Gate Keeper” in agentic coding workflows
- Emerging Kanban-style workflow:
- Dev agent writes code
- QA agent validates:
- unit tests exist and pass
- integration tests
- UI automation
- QA becomes a decision gate, not a downstream executor
Why it matters:
Speed without validation creates silent risk. QA defines the stop/go criteria for AI output.
3. TIGERScore enables explainable, reference-free evaluation
- Evaluates AI-generated text without golden answers
- Uses instruction-based rubrics (accuracy, relevance, comprehension)
- Produces multi-dimensional, explainable scores
Why it matters:
Most real-world LLM outputs don’t have a “correct” answer. TIGERScore gives QA a defensible way to say why something failed.
Core insight
AI doesn’t remove QA—it formalizes QA as governance.
Companion Newsletter
QA Is Becoming the Gatekeeper of AI Systems
AI agents can now write code, generate tests, and ship features faster than ever.
But speed introduces a new problem: who decides what is safe to release?
This week’s signals show that decision increasingly belongs to QA.
Promptfoo’s new code scanner makes LLM-specific vulnerabilities visible inside CI/CD pipelines—something traditional security tools simply miss. At the same time, practitioner-led agentic workflows explicitly position QA as the gate that all AI-generated code must pass through.
Finally, TIGERScore addresses a long-standing pain point in AI testing: judging outputs when no perfect answer exists. By using explainable, reference-free metrics, QA teams can evaluate quality without relying on subjective “vibe checks.”
Together, these trends redefine the role of testing. QA is no longer about executing steps after development. It is about enforcing standards, explaining failures, and deciding when AI output is trustworthy enough to ship.
What to try today
- Add LLM security scanning to your CI/CD
- Define explicit quality gates for AI-generated code
- Use explainable evaluation frameworks instead of binary pass/fail
QA isn’t slowing AI down.
QA is what makes AI usable in production.
Research & References
- Promptfoo Code Scanning Documentation
https://www.promptfoo.dev/docs/code-scanning/
- Agentic Coding Workflow & QA Gatekeeping
https://medium.com/@mathieu.veron_70170/writing-my-own-ai-agent-coding-method-4b0ea46d83aa
- TIGERScore: Explainable Evaluation for AI-Generated Text
https://arxiv.org/abs/2310.00752?utm_source=chatgpt.com
- Promptfoo VS Code Integration
https://www.promptfoo.dev/docs/code-scanning/vscode-extension/?utm_source=chatgpt.com
- Promptfoo GitHub Action
https://www.promptfoo.dev/docs/integrations/github-action/
- OWASP Top 10 for LLM Applications
https://owasp.org/www-project-top-10-for-large-language-model-applications/
