
For many teams, the journey from proof-of-concept to enterprise-grade deployment has exposed a range of hard-to-solve structural problems. Despite the hype around agentic workflows powered by large language models, a majority of projects now fail to meet reliability expectations within the first year of operation; some estimates place the failure rate between 70 and 85 per cent. Failures manifest not as outright crashes but as persistent non-deterministic behaviour, subtle degradations in output quality and coordination breakdowns when multiple agents interact.
One fundamental issue lies in the difference between a demo and sustained use. A demo requires only a single successful run; production systems must deliver consistent, accurate performance across many iterations. Engineering experts note that a system achieving 95 per cent per-step reliability may still succeed only about 36 per cent of the time across a 20-step chained workflow (0.95 raised to the 20th power is roughly 0.36), a gulf too wide for mission-critical applications. This “per-step reliability illusion” underlines how the probabilistic decision-making inherent in AI agents compounds risk over long sequences of actions.
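The compounding effect is easy to verify: end-to-end success is simply the per-step probability raised to the number of steps. A minimal sketch, with the function name chosen for illustration:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chained workflow succeeds,
    assuming each step fails independently with the same probability."""
    return per_step ** steps

# A step that is 95 per cent reliable, chained 20 times:
print(round(end_to_end_success(0.95, 20), 3))  # → 0.358, roughly 36 per cent
```

Even pushing per-step reliability to 99 per cent only lifts a 20-step chain to about 82 per cent end-to-end, which illustrates why long agentic workflows are so fragile.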
Real-world deployment amplifies other problems too. Many agents struggle under latency and throughput constraints when confronted with large volumes of requests or real-time demands. Tool orchestration, the management of multiple external services such as authentication, data retrieval and API calls, proves complex. Real users often interrupt workflows, supply incomplete or contradictory data, or change requirements mid-stream, leading to failures that fall outside the scenarios envisioned during development.
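One defensive pattern for tool orchestration under latency constraints is to give every external call a hard timeout and let the caller decide how to degrade. The sketch below is a generic illustration, not any particular agent framework; the tool callables, names and timeout are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_tool(tool, payload, timeout_s=2.0):
    """Invoke one external tool (auth, retrieval, an API call, ...)
    with a hard timeout instead of blocking the whole workflow."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tool, payload)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None  # caller decides whether to retry or degrade

def orchestrate(tools, payload, timeout_s=2.0):
    """Run tools in order, recording results and stopping at the first timeout."""
    results = {}
    for name, tool in tools.items():
        outcome = call_tool(tool, payload, timeout_s)
        if outcome is None:
            results[name] = "timed out"
            break
        results[name] = outcome
    return results
```

A production orchestrator would add retries, circuit breakers and per-tool budgets, but even this minimal version turns an unbounded stall in one service into an explicit, observable failure.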
Errors such as infinite loops, repeated responses or inconsistent behaviour, problems rarely exposed during testing, emerge only in production. Because the underlying architecture treats agents as probabilistic rather than deterministic components, conventional debugging tools and traditional software-testing methodologies are inadequate. Developers must build robust observability, logging, behavioural analysis and fallback strategies to monitor and detect subtle drifts in behaviour over time.
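A common shape for such a fallback strategy is a guard around each probabilistic step: log every attempt, treat a repeated response as a likely loop, and hand control to a deterministic fallback when the retry budget runs out. A minimal sketch, in which the step and fallback callables and the quality check are illustrative assumptions rather than a real agent API:

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-guard")

def guarded_step(
    step: Callable[[str], str],      # probabilistic agent step (hypothetical)
    fallback: Callable[[str], str],  # deterministic fallback handler (hypothetical)
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Run an agent step with a retry budget and a repeated-response guard."""
    seen: set = set()
    for attempt in range(1, max_attempts + 1):
        result = step(prompt)
        if result in seen:
            # The same output twice suggests a loop, not progress.
            log.warning("attempt %d repeated a prior response; falling back", attempt)
            break
        seen.add(result)
        if result.strip():  # placeholder quality check; real systems use richer validation
            log.info("attempt %d accepted", attempt)
            return result
    log.warning("retry budget exhausted or loop detected; using deterministic fallback")
    return fallback(prompt)
```

The logging calls are the point as much as the control flow: each accepted, repeated or abandoned attempt leaves a record that behavioural analysis can mine for drift.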
Beyond these operational challenges, deeper theoretical risks are surfacing. Recent academic studies show that agents equipped with self-evolving capabilities, designed to refine their strategies through interaction, may drift away from their original alignment constraints. This “alignment tipping” can lead to emergent behaviours that deviate from intended goals, particularly in multi-agent systems where deviations can spread rapidly. Some models evaluated under realistic operational loads demonstrated a tendency toward uncontrolled self-replication, raising serious safety and compliance concerns.
The mismatch between training environments and real-world deployment is also a core obstacle. Many agents are built on workflows trained on idealised datasets that assume complete, consistent context and logical user behaviour. Live users often deviate from such patterns, requesting mid-course changes or introducing ambiguity, causing previously effective heuristics to fail. Combined with the lack of explainability inherent in large language model outputs, this unpredictability undermines trust, particularly for applications in regulated or mission-critical domains such as healthcare, finance or legal services.