When Adopt AI upgraded its internal reporting assistant from Claude Sonnet 4.0 to 4.5, a sizable portion of requests began to fail. The model started inserting the request payload into the description field and, for ambiguous queries, returned clarifying questions instead of the expected JSON. Because the surrounding infrastructure assumed every model call would produce a complete API call, missing parameters caused back‑end services to return all‑time data, empty results, or HTTP 500 errors.
The root cause was an under‑specified prompt. Earlier model versions inferred that the description should remain a plain‑text string, but 4.5 interpreted the instruction more helpfully, merging structured data into the narrative. The team had no guardrails for such semantic drift, and their regression suite lacked tests for this failure mode. The incident forced a costly rollback and a re‑qualification of newly added integrations against the older model.
The authors argue that reliable LLM‑backed systems require treating evaluation suites as the formal specification. By defining input‑output invariants—such as prohibiting serialized payloads in free‑form fields—and gating model upgrades behind those tests, teams can bound the "blast radius" of changes. As AI agents take on more autonomous tasks, robust eval‑driven CI/CD will become essential for production safety.



