The brief was simple: build an AI support agent that handles the majority of customer queries automatically, without degrading the quality of support customers receive. The target was 70% automation. After six months in production, it's handling 80%, with customer satisfaction scores higher than they were with a fully human team.
Here's how we built it — and more importantly, the design decisions that made the difference between a useful agent and a chatbot that frustrates everyone.
The knowledge base is the product
Most AI support agents fail at the knowledge base stage. They're trained on vague documentation, inconsistent internal wikis, or — worst of all — raw conversation transcripts with no quality filtering. The agent learns bad answers as well as good ones.
We spent three weeks on knowledge base construction before writing any code. That involved:
- Categorizing and analyzing the last 12 months of support tickets by query type and resolution
- Identifying the 50 query types that covered 85% of volume
- Writing clean, structured answer documents for each of those 50 types
- Tagging every document with metadata: topic, product area, confidence level, escalation trigger
- Building a validation set of 200 real queries with known correct answers to test retrieval quality
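As a rough illustration of the last two steps, the sketch below shows one way a tagged answer document and a retrieval check against the validation set could look. The `AnswerDoc` field types, the `hit_rate_at_k` helper, and the shape of the `retrieve` callable are assumptions made for this example, not the structures used in the project.

```python
from dataclasses import dataclass


@dataclass
class AnswerDoc:
    """One structured answer document with the metadata tags listed above."""
    doc_id: str
    topic: str
    product_area: str
    confidence_level: str     # assumed scale, e.g. "high" / "medium" / "low"
    escalation_trigger: bool  # assumed flag: True if this topic always goes to a human
    body: str


def hit_rate_at_k(validation_set, retrieve, k=5):
    """Fraction of validation queries whose expected document is in the top k.

    validation_set: list of (query, expected_doc_id) pairs.
    retrieve: callable (query, k) -> list of doc ids, best match first.
    """
    if not validation_set:
        return 0.0
    hits = 0
    for query, expected_id in validation_set:
        if expected_id in retrieve(query, k)[:k]:
            hits += 1
    return hits / len(validation_set)
```

Running a check like this after every knowledge base change gives a single number to watch: if the hit rate on the validation queries drops, retrieval has regressed before any customer sees it.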
The knowledge base isn't static. We built a review pipeline in which a human reviews flagged agent responses every week. Good new answers get added to the knowledge base. Edge cases get documented. The agent improves continuously.
RAG, not fine-tuning
We used RAG (retrieval-augmented generation) rather than fine-tuning a model. This was a deliberate choice. Fine-tuning bakes knowledge into model weights — and when that knowledge changes (product updates, policy changes, pricing changes), you have to retrain. RAG retrieves current documents at query time, so the knowledge base is always live.
The technical implementation: a customer query comes in and is embedded; the top 5 most semantically similar documents are retrieved from the knowledge base; those documents are passed as context to the LLM; and the LLM generates a response grounded in that context. If no retrieved document meets the similarity threshold, the query escalates to a human.
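The flow can be sketched in a few lines of Python. This is a minimal illustration, not the production code: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a hypothetical callable wrapping the LLM, and the 0.75 default is the similarity cutoff cited among the escalation triggers.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=5):
    """Return the k (score, doc) pairs most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(d)), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def answer(query, docs, generate, threshold=0.75, k=5):
    """Generate a grounded reply, or return None to signal escalation."""
    top = retrieve(query, docs, k)
    if not top or top[0][0] < threshold:
        return None  # no document is a confident match: hand off to a human
    context = [doc for _, doc in top]
    return generate(query, context)
```

Returning `None` rather than a low-confidence reply keeps the escalation decision in one place, so the caller never has to guess whether a generated answer was grounded.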
One non-obvious design decision: we don't let the agent improvise outside the retrieved documents. If the answer isn't in the knowledge base, the agent says "I don't have a confident answer for this — let me connect you with the team." This sacrifices some automation rate but dramatically reduces the rate of confident wrong answers, which are much more damaging than admissions of uncertainty.
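One way to enforce this is at the prompt level, paired with a check on the model's output. The sketch below is an assumption about how that could look: the `NO_ANSWER` sentinel, the prompt wording, and the function names are all hypothetical, and the fallback text paraphrases the message quoted above.

```python
# Paraphrase of the fallback message quoted in the text.
FALLBACK_MESSAGE = (
    "I don't have a confident answer for this. Let me connect you with the team."
)
NO_ANSWER = "NO_ANSWER"  # hypothetical sentinel the model is told to emit


def build_prompt(query, documents):
    """Build a prompt that restricts the model to the retrieved documents."""
    context = "\n\n".join(
        f"[{i}] {doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the customer using ONLY the documents below.\n"
        f"If the documents do not contain the answer, reply with exactly {NO_ANSWER}.\n\n"
        f"Documents:\n{context}\n\n"
        f"Customer query: {query}"
    )


def finalize(model_output):
    """Map the sentinel (or an empty reply) to the fallback message."""
    text = model_output.strip()
    if not text or NO_ANSWER in text:
        return FALLBACK_MESSAGE
    return text
```

A prompt instruction alone does not guarantee the model stays within the documents, so in practice a check like `finalize` would sit alongside the retrieval threshold rather than replace it.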
The escalation logic is where most agents break
Bad AI support agents either escalate too rarely (damaging customer trust with wrong answers) or too often (defeating the purpose of automation). Getting this right requires explicit escalation triggers, not just a general confidence threshold.
We built five categories of escalation triggers:
- Confidence threshold — retrieval similarity below 0.75 on the best matching document
- Sentiment trigger — detected frustration or urgency in the query ("furious", "lawyer", "cancel", "never again")
- Topic trigger — certain topics always go to humans: legal, billing disputes over a threshold, safety-related queries
- History trigger — customer has contacted support three or more times on the same issue
- Explicit request — customer asks for a human
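The five triggers above can be expressed as independent checks that each contribute a reason. The sketch below is illustrative only: the `QueryContext` fields, the topic labels, and the human-request phrases are assumptions, keyword matching stands in for whatever sentiment detection is actually used, and the "over a threshold" condition on billing disputes is simplified to a topic label.

```python
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.75
FRUSTRATION_TERMS = ("furious", "lawyer", "cancel", "never again")
HUMAN_ONLY_TOPICS = {"legal", "billing_dispute", "safety"}  # hypothetical labels
HUMAN_REQUEST_PHRASES = ("human", "real person", "speak to an agent")  # assumed
REPEAT_CONTACT_LIMIT = 3


@dataclass
class QueryContext:
    text: str
    best_similarity: float          # similarity of the best matching document
    topic: str                      # label from an upstream topic classifier
    prior_contacts_same_issue: int  # assumed to come from ticket history


def escalation_reasons(ctx):
    """Return the names of every trigger that fires; an empty list means no escalation."""
    reasons = []
    lowered = ctx.text.lower()
    if ctx.best_similarity < SIMILARITY_THRESHOLD:
        reasons.append("confidence")
    if any(term in lowered for term in FRUSTRATION_TERMS):
        reasons.append("sentiment")
    if ctx.topic in HUMAN_ONLY_TOPICS:
        reasons.append("topic")
    if ctx.prior_contacts_same_issue >= REPEAT_CONTACT_LIMIT:
        reasons.append("history")
    if any(phrase in lowered for phrase in HUMAN_REQUEST_PHRASES):
        reasons.append("explicit_request")
    return reasons
```

Returning all matching reasons, rather than stopping at the first, makes it easier to see later which triggers fire together and which are doing most of the work.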
When an escalation triggers, the agent summarizes what the customer has said and what it has tried, so the human agent starts with full context rather than asking the customer to repeat themselves.
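A handoff note of this kind could be as simple as a structured text block. The sketch below uses a plain template, and the function name and layout are hypothetical; in practice the summary might instead be generated by the LLM.

```python
def handoff_note(customer_messages, attempted_steps, reasons):
    """Format a plain-text note for the human agent who picks up the conversation."""
    lines = ["Escalation reasons: " + ", ".join(reasons)]
    lines.append("Customer said:")
    lines += [f"  - {message}" for message in customer_messages]
    lines.append("Agent tried:")
    # Fall back to a placeholder so the section is never silently empty.
    lines += [f"  - {step}" for step in attempted_steps] or ["  - (nothing yet)"]
    return "\n".join(lines)
```

Keeping the escalation reasons on the first line lets the human agent see at a glance why the conversation was handed over.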
What drove the 80% figure
When we analyzed where automation succeeded vs. failed, the pattern was clear. The agent handled these well: status queries, how-to questions, policy questions, account information requests, and troubleshooting flows with known solutions. Together these covered 80% of volume.
The 20% that went to humans: billing disputes, complex multi-product issues, first-time serious complaints, and anything requiring action in systems the agent wasn't connected to at the time. The ratio will improve as we connect more systems.
What we'd do differently
Two things we'd change. First, we'd start building the human review feedback loop earlier: it was the most valuable input to agent improvement, but we built it six weeks after launch instead of from day one. Second, we'd spend more time on the handoff experience. The moment when the agent passes to a human is visible to the customer, and if it's handled poorly, it damages confidence in the product even when the subsequent human support is excellent.
Both are fixable. Neither stops the agent from being effective. But they represent the gap between 80% automation and where we expect to take it.