Behavioral Issues
Empirical results from running three concurrent agentic software builds reveal that without behavioral contracts, AI coding agents drift from requirements, hallucinate features, and produce code that cannot be meaningfully tested. Behavior-Driven Development with Gherkin Given/When/Then specifications provides the missing engineering discipline: a machine-readable, human-auditable contract that constrains agent output to verifiable behaviors. The paper documents specific failure modes — scope creep, phantom dependencies, untestable side effects — and shows how BDD eliminates each one.
Key Contributions
- BDD as engineering discipline for AI coding agents
- Gherkin specifications for agent behavior contracts
- Empirical results from concurrent agentic builds
Explainers
What is BDD?
Behavior-Driven Development structures requirements as executable specifications in Gherkin syntax: Given a precondition, When an action occurs, Then an observable outcome results. Unlike unit tests that verify implementation details, BDD specs define behavior from the outside in — which is exactly the constraint surface AI agents need to stay on task.
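A hypothetical spec illustrates the outside-in framing (the feature and values here are invented for illustration, not drawn from the paper's builds):

```gherkin
# Hypothetical example: behavior is specified from the outside in,
# with no reference to implementation details.
Feature: Password reset
  Scenario: Requesting a reset link for a registered address
    Given a registered user with email "dev@example.com"
    When the user requests a password reset
    Then a single-use reset link is emailed to "dev@example.com"
    And the link expires after 30 minutes
```

Nothing in the scenario names a class, function, or database table, so any implementation that produces the observable outcome satisfies the contract.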
Why does this matter for AI agents?
Without behavioral contracts, AI agents treat every ambiguity as creative license. They invent requirements that were never specified, introduce dependencies no one asked for, and produce code that works in isolation but fails in integration. BDD gives agents a falsifiable target: either the Given/When/Then spec passes or it does not. Drift becomes mechanically detectable.
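The "falsifiable target" property can be sketched in plain Python: each Given/When/Then step either passes or raises, so a drifting agent fails loudly instead of silently. This is a minimal illustrative harness, not the paper's tooling; the cart scenario and function names are invented for the example.

```python
# Minimal sketch (not the paper's harness): each step either passes
# or raises, so agent drift is mechanically detectable.

def given_cart_with_items(prices):
    # Given: a cart containing the listed item prices
    return {"items": list(prices)}

def when_total_is_computed(cart):
    # When: the total is computed
    cart["total"] = sum(cart["items"])
    return cart

def then_total_equals(cart, expected):
    # Then: the observable outcome must match the spec exactly
    assert cart["total"] == expected, f"spec violated: {cart['total']} != {expected}"

cart = given_cart_with_items([3.0, 4.5])
cart = when_total_is_computed(cart)
then_total_equals(cart, 7.5)
print("scenario passed")
```

Either the assertion holds or the run fails; there is no ambiguous middle ground for an agent to exploit.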
What failure modes were observed?
Across three concurrent builds, agents exhibited scope creep (adding unrequested features that broke existing functionality), phantom dependencies (importing libraries never declared in the project's dependency manifest), and untestable side effects (writing code whose correctness could only be verified by visual inspection). All three failure modes disappeared once Gherkin specs were introduced as hard constraints on each task.
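Of the three failure modes, phantom dependencies are the easiest to gate mechanically. A minimal sketch of such a guard, assuming a project-level allow-list (the `ALLOWED` set and module names below are hypothetical, not from the paper):

```python
# Sketch of one mechanical guard (an assumption, not the paper's tooling):
# reject agent output that imports anything outside the project's allow-list.
import ast

ALLOWED = {"json", "logging", "pathlib"}  # hypothetical dependency allow-list

def phantom_imports(source: str) -> set:
    """Return top-level modules imported by `source` but absent from ALLOWED."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - ALLOWED

agent_code = "import json\nimport leftpadlib\n"
print(sorted(phantom_imports(agent_code)))  # -> ['leftpadlib']
```

Run as a pre-merge check, this turns "the agent invented a dependency" from a code-review discovery into an automatic rejection.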