Designing the AI layer behind 27 agents and 6 regions for a national 3PL.
"The client wanted AI to write replies. I argued they needed tagging first."
Two years of work: email tagging, AI-drafted and auto-sent replies, dashboards for the client team, and a Root Cause Engine that helped leadership see the operational problems behind their inbound.
The client thought their problem was reply speed. They asked for AI to write replies for them. Sales agreed and proposed a phased plan: AI drafts first, then full auto-reply later.
The client's ask and the phased plan made the same mistake: both assumed a faster reply was the fix.
You can't safely automate replies when nobody knows what's actually in the inbox. Leadership couldn't answer basic questions about their own operation: which region had the most issues, what percentage of emails were the same repeated question, which agents handled the most volume.
Faster replies would have made the inbox quicker, but not clearer. So I pushed for tagging first. The phased plan became a real journey instead of just a delivery schedule.
I had one day before the build started. I went into the client's actual email data (the full archive, not a sample), used AI to surface the patterns and category distribution, and brought the findings to our internal review with the Debales CEO and team.
The data did the work. It showed exactly what leadership couldn't see on their own: which region was overloaded, what percentage of emails were the same repeated issue, where the volume was actually coming from. Once the CEO saw this, the case for tagging-first was easy. He took it to the client, and they agreed quickly.
A few days later, in a working session where I walked the client through the proposed tags, they added their own. Operation-specific ones like "Yellow Alert" and "Good Morning message," native to how their team thought about the work.
The pivot wasn't political. It was timing. The pilot was scoped for AI drafts and auto-reply, with a small dashboard planned for those features. We hadn't written code yet. I got in early enough that the change was just expanding the foundation. From proposal to tagging running in production: about a week.
Tag every email by intent and region. Make the inbox legible.
AI suggests, agent approves. Tier behavior by category.
High-confidence categories only. Under 60s response.
Changing the scope had a real cost. Sales had budgeted for the original phased rollout and its single small dashboard. What I was proposing was bigger: tagging as the foundation, multiple dashboards on top of it, and later a Root Cause Engine.
To keep the project on track, I had to be clear about what was in the build and what wasn't. The hard part was saying no to features that were tempting but came too early.
These are the actual operator-facing surfaces used in production: overview, tag analysis, AI inbox, and the Root Cause Engine. All four use the same tag system and visual language.
The classifier was the foundation. Every other surface depended on it.
I built the initial tag system by running the client's full email archive through AI to find patterns, then reviewing by hand to fix what AI got wrong and split up emails that didn't fit one category. Then I brought it to the client. Not for them to approve, but for a working session where we'd shape it together.
The hard call wasn't accuracy. It was granularity. Sixty fine-grained categories look comprehensive on a slide and are useless in production. Six categories tell you nothing.
I landed on a two-level structure. Top-level for routing and reply, sub-causes for analysis.
Engineering checked the AI's accuracy. I decided which category got which behavior, and what would move a category between behaviors. We didn't use a strict accuracy percentage early on. Instead, we reviewed flagged replies with the client every week. A single bad reply in a sensitive category was enough to move that category back to "human only."
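A minimal sketch of how that two-level structure and the per-category tiering could fit together; the category names, sub-causes, and policy assignments below are illustrative, not the client's actual taxonomy.

```python
# Sketch only: category names, sub-causes, and policy assignments are illustrative.
from enum import Enum

class ReplyBehavior(Enum):
    HUMAN_ONLY = "human_only"   # agent writes the reply themselves
    AI_DRAFT = "ai_draft"       # AI drafts, agent approves before sending
    AUTO_REPLY = "auto_reply"   # AI replies directly; thread gets an AI-Replied marker

# Top level drives routing and reply behavior; sub-causes exist for analysis only.
TAXONOMY = {
    "Scheduling": ["Window mismatch", "Double booking", "Rebook request"],
    "ETA Request": ["No delivery notification", "Tracking question"],
    "Billing": ["Invoice dispute", "Rate question"],
}

# Behavior is assigned per top-level category and revisited in the weekly review.
REPLY_POLICY = {
    "ETA Request": ReplyBehavior.AUTO_REPLY,
    "Scheduling": ReplyBehavior.AI_DRAFT,
    "Billing": ReplyBehavior.HUMAN_ONLY,
}

def demote_to_human(category: str) -> None:
    """One bad flagged reply in a sensitive category moves it back to human-only."""
    REPLY_POLICY[category] = ReplyBehavior.HUMAN_ONLY
```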
For Scheduling and ETA Request, the email AI works alongside the voice AI agent. If a customer's reply needs a phone confirmation or a quick rebook, the system can shift channels without losing context. The customer doesn't see the handoff. They just get the answer. Agents see all of this through Outlook itself: AI tags on every email, AI drafts ready in their reply field, an AI-Replied marker on auto-handled threads. No new tool for agents, no migration, no training needed.
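As a rough sketch of what "without losing context" means for that email-to-voice handoff: the eligible tags, field names, and payload shape here are assumptions, not the production schema.

```python
# Sketch only: eligible tags, field names, and payload shape are assumptions.
from dataclasses import dataclass

VOICE_ELIGIBLE_TAGS = {"Scheduling", "ETA Request"}

@dataclass
class ThreadContext:
    thread_id: str
    tag: str
    region: str
    summary: str            # what the email AI already knows about the request
    needs_phone_step: bool  # e.g. a confirmation or quick rebook

def hand_off_to_voice(ctx: ThreadContext) -> dict | None:
    """Package the email thread's context so the voice agent starts the call informed."""
    if ctx.tag not in VOICE_ELIGIBLE_TAGS or not ctx.needs_phone_step:
        return None
    return {
        "thread_id": ctx.thread_id,
        "tag": ctx.tag,
        "region": ctx.region,
        "briefing": ctx.summary,
    }
```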
Component patterns, reply states, visual hierarchy decisions, and the Outlook integration model. Annotated screens with the why behind each decision.
This was the design call I had to argue for, and the surface I'm most confident about in retrospect.
I proposed an overview-and-drill-down structure: shallow overview that answers "is anything wrong," dedicated dashboards that answer "what do I do about it."
I defined the initial KPI set, then ran a working session with the CEO and CTO to align business KPIs against the operational ones. Some of what they cared about, like volume trends as a leading indicator for renewal conversations, wasn't visible in the operational data and needed its own treatment.
In the first dashboard review, the CEO asked for everything visible on one page: all six regions, all twelve tags, all agent metrics. The reasoning was reasonable. He didn't want to hunt through screens to find what mattered.
I pushed back. The single-page version would be unreadable at this density, and any summary that fit would tell him nothing he could act on. I proposed the overview-and-drill-down structure instead: a shallow overview that answered "is anything wrong," with each section clickable into a full drill-down dashboard answering "what now." That structure shipped, and is shown below.
Overview to drill-down, regional and tag-level views, KPI hierarchy, and the rejected single-dashboard approach in full detail. Annotated screens with the IA reasoning.
The engine groups the classifier's sub-categories into bigger operational problems, ranked by how often they appear and whether they're getting worse. Most of these are everyday issues that come up dozens of times a week: delivery windows that don't match what drivers can actually do, technicians being booked twice for the same time, customers not being told when their delivery is coming.
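A minimal sketch of that roll-up logic, assuming each classified email carries a sub-cause and a received date; the groupings and the 30-day trend window are illustrative, not the engine's actual rules.

```python
# Sketch only: groupings, field names, and the 30-day trend window are illustrative.
from collections import Counter
from datetime import date, timedelta

ROOT_CAUSES = {
    "Unrealistic delivery windows": ["Window mismatch", "Rebook request"],
    "Double-booked technicians": ["Double booking"],
    "Customers not told when delivery is coming": ["No delivery notification"],
}

def rank_root_causes(emails: list[dict], today: date) -> list[dict]:
    """Roll sub-causes up into operational problems, ranked by volume and trend."""
    cutoff = today - timedelta(days=30)
    recent, prior = Counter(), Counter()
    for email in emails:
        bucket = recent if email["received"] >= cutoff else prior
        bucket[email["sub_cause"]] += 1

    ranked = []
    for cause, sub_causes in ROOT_CAUSES.items():
        count_now = sum(recent[s] for s in sub_causes)
        count_before = sum(prior[s] for s in sub_causes)
        ranked.append({
            "cause": cause,
            "count": count_now,
            "worsening": count_now > count_before,
        })
    return sorted(ranked, key=lambda r: r["count"], reverse=True)
```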
This changed the conversation with leadership. They stopped asking "how do we reply faster?" and started asking "why are we getting these emails in the first place?" That was the better question.
The hardest part was holding back. The dashboard could have suggested fixes. For example, "fix the scheduling window problem first, it's a quarter of your inbound." I chose not to add this. The system shows what's happening; people decide what to fix.
If the engine had recommended a fix and got it wrong, leadership would have stopped trusting everything else it told them.
Tag Analysis surface, reasons leaderboard, regional heatmaps, the reason-tree breakdown, and the cost-forecasting layer. The full UI walkthrough of how 17,000+ classified emails turn into ~25 operational signals leadership can act on.
A friction moment we didn't see coming.
For the first two months after launch, leadership used the dashboard heavily. They saw the patterns we'd surfaced: repeated scheduling issues, regional volume gaps, agent load differences. And they started fixing things on their end. That was the point of the dashboard.
After those two months, dashboard views dropped sharply. The obvious problems had been fixed, and leadership had less reason to keep checking.
But new operational issues kept appearing. Without someone watching the dashboard, the system would notice them but no one would see them.
The fix was distribution. We added a daily alert email to the client manager and a weekly operations report to leadership. The dashboard didn't change. We just stopped expecting the client to come to it.
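A rough sketch of that push model, assuming the daily alert reads from the same ranked signals as the dashboard; the threshold and signal shape are assumptions.

```python
# Sketch only: the threshold and the shape of each signal are assumptions.
def build_daily_alert(signals: list[dict], threshold: int = 20) -> str:
    """Turn ranked root-cause signals into a short email body, not a dashboard visit."""
    lines = [
        f"- {s['cause']}: {s['count']} emails, {'getting worse' if s['worsening'] else 'steady'}"
        for s in signals
        if s["count"] >= threshold or s["worsening"]
    ]
    return "Nothing unusual today." if not lines else "\n".join(lines)
```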
The takeaway was uncomfortable: we'd designed a tool that answered the right questions and assumed the client would ask them. They wouldn't, until we made the answers come to them.
The platform grew into two adjacent agents on the same operational backbone. WhatsApp for driver-side coordination, Voice for inbound calls. Each had its own design problem (drivers reply one-handed on the road; voice has to know when not to try) but both used the same tag system and dashboard structure, so context followed the customer across channels.
Conversation flow design, driver and helper threads, status states, and the bridge that lets a WhatsApp reply close an email loop.
Softphone interface, call logs, agent management, and cross-channel handoff from email when a phone call is the faster path.
At first I treated the email categories as a setup task. Something engineering would handle. But every part of the system used these categories: dashboards, replies, the Root Cause Engine. When I had to change a category later, every screen and every feature using it had to change too. It was the most important decision I made, not the smallest one.
The contract said AI auto-reply. The data said tagging. AI product design, I've come to think, is partly about renegotiating the brief once you can see what the customer actually has.
Telling the team the AI was accurate didn't move adoption. Showing them they could veto any reply, no friction, did.
When the CEO asked for one dashboard, his instinct was right. The execution he had in mind wouldn't have worked. What we shipped was a shallow overview that you could scan in thirty seconds, with four drill-downs underneath for the people whose job actually needed the detail.
If Root Cause had recommended fixes, the first wrong one would have killed it. The system can show what's happening. It shouldn't tell people what to do, because the engine doesn't have the full context for what's actually fixable. That's the discipline.
We built the dashboard. The dashboard had the answers. The client didn't open it. The fix was a daily email.