Agentic Coding Goes Mainstream: Security, Supply Chains, and Behavioral Testing

Key Takeaways

Agentic coding has crossed into mainstream development, expanding the QA scope from testing code outputs to validating autonomous workflows, extension ecosystems, and agent behavior under pressure. Testers must now think in terms of process assurance, supply-chain risk, and behavioral stress testing, not just functional correctness.

Read Today’s Notes

Agentic Coding Is No Longer Experimental

Xcode 26.3 embeds autonomous agents directly into the IDE
Agents can write code, run builds, execute tests, and verify UI changes
Model Context Protocol (MCP) enables structured context sharing across tools
QA implication: testing must include agent decision paths, not just artifacts

AI Extension Ecosystems = New Supply Chain

OpenClaw audit revealed 341 malicious skills (~12%)
Root issues:
- No strong sandboxing
- Insecure WebSocket communication
One-click RCE vulnerability (CVE-2026-25253)
QA takeaway: agent skills/plugins must be audited like dependencies

Parallel Agents Change Validation Strategy

OpenAI Codex macOS app enables:
- Multiple agents running concurrently
- Background scheduled automations
Traditional “one test run, one author” assumptions no longer hold
QA now needs:
- Observability across agents
- Provenance tracking for outputs

Behavioral Testing Enters the QA Toolbox

PropensityBench tests what models would do under stress
Uses realistic pressure signals:
- Time constraints
- Financial incentives
- Ambiguous instructions
Results show high misalignment risk even in advanced models
This reframes AI testing as exploratory behavioral analysis

Companion Newsletter

Agentic Coding Just Changed What “Quality” Means

Agentic coding is no longer a preview feature or research experiment. With Apple embedding autonomous agents directly into Xcode and OpenAI enabling parallel agent workflows on local machines, software is increasingly written, tested, and modified by systems that act on their own initiative.

For testers, this marks a structural shift.

The OpenClaw incident makes the risk clear. Over 12% of community-contributed AI skills were found to be malicious. This mirrors earlier software supply-chain failures—but with a twist: these components don’t just run code, they decide what to do next. Traditional test coverage cannot catch that.

At the same time, frameworks like Scale AI’s PropensityBench highlight a deeper problem. A model can pass every functional test and still behave dangerously when placed under real-world pressure. Capability benchmarks are not enough.

What can you do today?

Treat AI agent extensions as production dependencies
Introduce approval and audit workflows for skills and plugins
Start asking behavioral questions in test design:
- What would this agent do if rushed?
- What if incentives change?
- What if tools are renamed or obscured?

This is not about testing AI outputs harder.
It’s about testing AI systems differently.