DEBALES · LOGISTICS · 2024–2026

From Shared Inbox to Operations Layer

Designing the AI layer behind 27 agents and 6 regions for a national 3PL.

AI · 27 Agents · 6 Regions · Tagging · Auto-reply · Dashboard · Root Cause · Logistics
[Hero mockup · debales.app/dashboard · nav: Dashboard, Regions, Tag Analysis, Agents, AI Inbox, Root Cause]
FRT 158m · CSAT 4.2 · Emails 2,488 · Saved 39h

An AI operations layer built for a national 3PL. It turned a 27-agent shared inbox into a system that classifies, replies, and surfaces upstream issues. No new hires.

Project overview

"The client wanted AI to write replies. I argued they needed tagging first."

Two years of work: email tagging, AI-drafted and auto-sent replies, dashboards for the client team, and a Root Cause Engine that helped leadership see the operational problems behind their inbound.

About the work

Context

A 3PL handling last-mile white-glove delivery for a major national brand across six US regions. Twenty-seven agents worked the same shared inbox with no routing, tagging, or prioritization. Worst-case first response stretched past ten days.

Solution

An AI layer that classifies email by intent and region, drafts and sends replies, and gives leadership operational visibility they never had. All without forcing the team into a new tool.

By the numbers
The headline
10+ days → under 5 minutes
First response time, before AI vs. after. The transformation the rest of the system was built around.
17,000+ Emails classified by the AI tag system
~67% First-touch replies handled by AI
4.2 / 5 CSAT, first time the operation measured it
0 New hires. Volume grew during deployment.
The Brief 01 Reframe

The problem wasn't speed. It was that the system was scoped to the wrong fix.

Decision · Tag the inbox before automating replies. It meant a slower path to auto-reply, but the team could trust what they were automating. Without tagging first, every other feature would be built on guesses about what was actually in the inbox.

The client thought their problem was reply speed. They asked for AI to write replies for them. Sales agreed and proposed a phased plan: AI drafts first, then full auto-reply later.

Both versions of the plan made the same mistake. They assumed a faster reply was the fix.

You can't safely automate replies when nobody knows what's actually in the inbox. Leadership couldn't answer basic questions about their own operation: which region had the most issues, what percentage of emails were the same repeated question, which agents handled the most volume.

Faster replies would have made the inbox quicker, but not clearer. So I pushed for tagging first. The phased plan became a real journey instead of just a delivery schedule.

How the reframe actually happened

I had one day before the build started. I went into the client's actual email data (the full archive, not a sample), used AI to find patterns and category distribution, and brought the findings to our internal review with the Debales CEO and team.

The data did the work. The client's leadership couldn't answer basic questions about their own inbox: which region was overloaded, what percentage of emails were the same repeated issue, where the volume was actually coming from. Once the CEO saw this, the case for tagging-first was easy. He took it to the client, and they agreed quickly.

A few days later, in a working session where I walked the client through the proposed tags, they added their own. Operation-specific ones like "Yellow Alert" and "Good Morning message," native to how their team thought about the work.

The pivot wasn't political. It was timing. The pilot was scoped for AI drafts and auto-reply, with a small dashboard planned for those features. We hadn't written code yet. I got in early enough that the change was just expanding the foundation. From proposal to tagging running in production: about a week.

Phase 1 · Classification · Shipped
Tag every email by intent and region. Make the inbox legible.

Phase 2 · Drafts & approval · Shipped
AI suggests, agent approves. Tier behavior by category.

Phase 3 · Auto-reply · Shipped, scoped
High-confidence categories only. Under 60s response.

Strategic Focus 02 Tradeoffs

Make the inbox better fast. Don't try to fix everything at once.

Changing the scope had a real cost. Sales had budgeted for a phased rollout: AI drafts, then auto-reply, with a small dashboard planned to track those two features. What I was proposing was bigger. Tagging as the foundation, multiple dashboards on top of it, and later a Root Cause Engine.

To keep the project on track, I had to be clear about what was in the build and what wasn't. The hard part was saying no to features that were tempting but came too early.

In scope

What I built.

  • Tag system, two-level structure
  • AI visibility in two surfaces: AI Inbox UI for managers, Outlook tags & drafts for agents
  • Three reply behavior tiers, scoped by category
  • Overview dashboard with four sections
  • Four dedicated drill-down dashboards
  • Root Cause Engine surfacing upstream issues
  • Distribution layer: daily alerts & weekly reports
Out of scope

What I deferred.

  • Reply functionality inside the AI Inbox UI (action lives in Outlook by design — agents see AI tags & drafts there, not here)
  • AI confidence scores in the client-facing UI (kept internal — surfacing them caused anxiety without category context)
  • "Recommended fix" suggestions in Root Cause
  • One global accuracy score controlling all replies
  • Predicting future email volume
Ownership split · who designed, built, validated
Designed by me: Tag system structure · Reply tier definitions · Dashboard IA · Root Cause UX · AI Journey framing (per CEO request) · Distribution layer (alerts, weekly reports)
Built & validated by engineering: Classifier scoring · Routing logic · Score validation · Infrastructure
Co-defined with CEO / CTO: Business KPIs (I led, CEO guided) · Scope tradeoffs
Co-designed with client leadership: Tag refinement (working session) · Dashboard KPI selection
Selected Screens 03 Walkthrough

Four surfaces, one operational brain.

These are the actual operator-facing surfaces used in production: overview, tag analysis, AI inbox, and the Root Cause engine. All four use the same tag system and visual language.

Overview

[Overview dashboard · Apr 02 to May 01 · all regions]
AI Journey · Level 3 · 99% complete (Classification → Drafts → Auto-reply)
Total emails 2,488 (↑ 12.4% MoM) · First response 158m (↓ 99.97% YoY) · CSAT 4.2/5 (↑ first measure) · Tagging saved 39.75h this period
Emails by tag: top categories, current period · Emails by region: 6 regions, distribution

Tag analysis

[Tag Analysis dashboard · 12 tags · all time]
Volume by tag (top-level) · drill into any tag for sub-cause breakdown
Follow-Up Required 554 · AI Auto Replied 391 · Urgent 277 · Scheduling Request 155 · Technician Assignment 118 · ETA Request 81 · Delivery Issue 61
Selected · Scheduling Request · sub-cause breakdown, 155 emails: Window unclear 62 · Reschedule needed 48 · Confirm slot 45

AI inbox

[AI Inbox · all states · Region 3]
Robert Chen · "When will my delivery arrive?" · Auto-sent, 47s
Maria Jensen · "Technician didn't arrive at scheduled window" · Draft ready, 2m
David Wilcox · "Third missed delivery, considering canceling" · Human only, 5m
Aisha Khan · "ETA confirmation for tomorrow's order" · Auto-sent, 12m
Brendan Lopez · "Need to reschedule, out of town next week" · Draft ready, 14m
Sarah Patel · "Refund request for damaged unit" · Human only, 22m
AI suggestion · Maria Jensen: "Hi Maria, I'm sorry the technician missed your window. I've reached out to dispatch to confirm a new slot…"

Root Cause Engine

[Root Cause Engine · by volume · last 90d]
Upstream issues · 25 detected
01 Scheduling window mismatch · customer expectation vs. driver allocation, all regions · 624 emails · ↑ 14%
02 Technician double-booking · conflicting assignments, concentrated in Region 3 · 281 emails · ↑ 8%
03 Pre-delivery confirmation gap · customers not notified of arrival, Regions 4 and 5 · 194 emails · → flat
04 Damaged unit rate · above benchmark, Travel Markets, last 30d · 112 emails · ↑ 22%
No "recommended action" column. By design.
01 Overview Dashboard: operational vitals at a glance.
Surface · Foundation 04 Classifier

The classifier categories.

The classifier was the foundation. Every other surface depended on it.

I built the initial tag system by running the client's full email archive through AI to find patterns, then reviewing by hand to fix what AI got wrong and split up emails that didn't fit one category. Then I brought it to the client. Not for them to approve, but for a working session where we'd shape it together.

The hard call wasn't accuracy. It was granularity. Sixty fine-grained categories look comprehensive on a slide and are useless in production. Six categories tell you nothing.

I landed on a two-level structure. Top-level for routing and reply, sub-causes for analysis.
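
To make that concrete, here's a minimal sketch of the two-level structure in TypeScript. The tag names are the ones that shipped; the field names and shapes are my illustration, not the production schema.

```typescript
// A minimal sketch of the two-level tag structure. Tag names are from
// this case study; field names and shapes are assumptions.

// The 12 top-level tags that shipped. These drive routing and reply tier.
type TopLevelTag =
  | "Follow-Up Required"
  | "AI Auto Replied"
  | "Urgent"
  | "Scheduling Request"
  | "Technician Assignment"
  | "Parts / Equipment Issue"
  | "ETA Request"
  | "Delivery Issue"
  | "Missing Merchandise"
  | "Refund / Disposal"
  | "Complaint / Service Failure"
  | "COI Request";

// One classified email. Sub-causes exist for analysis only; they never
// change routing or reply behavior.
interface TaggedEmail {
  topLevel: TopLevelTag;  // routing + reply tier
  subCause?: string;      // e.g. "Window unclear" on a Scheduling Request
  region: number;         // 1-6, feeds the regional dashboards
}

const example: TaggedEmail = {
  topLevel: "Scheduling Request",
  subCause: "Reschedule needed",
  region: 3,
};
```

The point of the shape: sub-causes hang off a top-level tag and never influence routing, so analysis could get more granular over time without destabilizing reply behavior.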

Working session · first-pass categories · 23 raw categories, before consolidation
Customer asking about delivery time → ETA Request
Customer wants to reschedule · Window not specified by driver · Time slot confirmation needed → Scheduling Request
Driver running late · Driver no-show · Damage on delivery → Delivery Issue
Wrong item delivered · Item missing from order → Missing Merchandise
Item return needed · Refund request · Disposal / pickup request → Refund / Disposal
Customer angry about service · Customer angry about technician · Service failure complaint → Complaint / Service Failure
Technician confirmation needed → Technician Assignment
Parts damaged on arrival · Wrong parts shipped · Technician parts question → Parts / Equipment
COI / Certificate of Insurance · Insurance documents needed · Compliance / regulatory → COI Request
General inquiry · Need additional info → Follow-Up Required
First cut had 23 overlapping categories. A working session with client leadership consolidated them into 12 top-level tags. That's the structure that shipped, with sub-causes for analysis underneath.
Top-level categories by volume · N = 2,488 · single month
Follow-Up Required 554
AI Auto Replied 391
Urgent 277
Scheduling Request 155
Technician Assignment 118
Parts / Equipment Issue 89
ETA Request 81
Delivery Issue 61
Missing Merchandise 60
Refund / Disposal 31
Complaint / Service Failure 16
COI Request 3
Surface · Reply behavior 05 Tiering

Reply behavior, scoped by category.

Decision · Three reply behaviors, decided by email category, not by one universal accuracy score. My job was deciding which behavior each category got, and what would move a category between them. The goal wasn't to make the AI as accurate as possible. It was to make sure that when it got something wrong, the wrongness was acceptable for that type of email.
Full resolution · automation: end-to-end
ETA Request · Scheduling Request
High pattern repetition. AI handles the request from first reply through resolution. The email AI works with the voice AI agent when a call is the faster path to closing the loop.

First-level auto-reply · automation: first touch
Follow-Up Required · Technician Assignment · Parts / Equipment · Delivery Issue · Missing Merchandise
AI sends the first reply automatically. Acknowledgment, status, or what we know. It continues if it can. Hands off to a human when the customer needs something the AI can't safely do.

Human-only · automation: none
Complaint / Service Failure · Refund / Disposal
Client asked us not to auto-reply here. Tone and stakes too high. These route straight to a human.

Engineering checked the AI's accuracy. I decided which category got which behavior, and what would move a category between behaviors. We didn't use a strict accuracy percentage early on. Instead, we reviewed flagged replies with the client every week. A single bad reply in a sensitive category was enough to move that category back to "human only."

For Scheduling and ETA Request, the email AI works alongside the voice AI agent. If a customer's reply needs a phone confirmation or a quick rebook, the system can shift channels without losing context. The customer doesn't see the handoff. They just get the answer. Agents see all of this through Outlook itself: AI tags on every email, AI drafts ready in their reply field, an AI-Replied marker on auto-handled threads. No new tool for agents, no migration, no training needed.
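
A minimal sketch of the tier logic as described above, with assumed names; the production routing and scoring lived with engineering, so treat this as an illustration of the rules, not the system:

```typescript
// A sketch of the tiering and demotion rules. Category names are from
// the case study; everything else is an assumed shape.

type ReplyTier = "full-resolution" | "first-touch" | "human-only";

const tierByCategory: Record<string, ReplyTier> = {
  "ETA Request": "full-resolution",
  "Scheduling Request": "full-resolution",
  "Follow-Up Required": "first-touch",
  "Technician Assignment": "first-touch",
  "Parts / Equipment": "first-touch",
  "Delivery Issue": "first-touch",
  "Missing Merchandise": "first-touch",
  "Complaint / Service Failure": "human-only", // client mandate: never auto-reply
  "Refund / Disposal": "human-only",           // client mandate: never auto-reply
};

// The weekly-review rule: one bad reply in a sensitive category is
// enough to move that whole category back to human-only.
function demoteToHumanOnly(category: string): void {
  tierByCategory[category] = "human-only";
}

// Routing: the tier comes from the category, never from a global
// accuracy score. Unknown categories default to the safest behavior.
function tierFor(category: string): ReplyTier {
  return tierByCategory[category] ?? "human-only";
}
```

The design choice the sketch encodes: there is no single accuracy threshold anywhere. Behavior is a per-category property, and the only global rule is the safe default.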

Inbox · three reply behaviors, side by side
Aisha Khan, Region 2 Central · "Need confirmation on tomorrow's installation" · First reply sent, 2m · monitoring
Robert Chen, Region 3 NorthEast · "When will my delivery arrive? Order #4421-A" · AI resolved, 47s · 8 min ago
Maria Jensen, Region 1 West · "Technician didn't arrive at scheduled window, need reschedule" · AI handled, 4m → voice callback 2pm
David Wilcox, Region 4 SouthEast · "This is the third missed delivery. Considering canceling our contract." · Human only · routed → mbarriero
Deep Dive · 01
// Email — Inbox UI

The inbox surface, in detail.

Component patterns, reply states, visual hierarchy decisions, and the Outlook integration model. Annotated screens with the why behind each decision.

Component system · Reply states · Outlook integration
Surface · Visibility 06 Dashboard system

The dashboard system.

Decision · The overview answers one question: "Is anything wrong right now?" The drill-down dashboards answer the next one: "What do I do about it?" Leadership wanted everything on one screen. I pushed back. A single dashboard trying to serve regional managers, team coaches, and executive review all at once ends up serving none of them well.

This was the design call I had to argue for, and the surface I'm most confident about in retrospect.

I proposed an overview-and-drill-down structure: shallow overview that answers "is anything wrong," dedicated dashboards that answer "what do I do about it."

I defined the initial KPI set, then ran a working session with the CEO and CTO to align business KPIs against the operational ones. Some of what they cared about, like volume trends as a leading indicator for renewal conversations, wasn't visible in the operational data and needed its own treatment.

Original request, rejected: "Everything on one screen"
[Mock of the rejected single-view operations dashboard · 6 regions · 27 agents · all metrics]
KPIs: Total emails 2,488 · FRT avg 158m · FRT max 6,860m · Auto-reply 67% · CSAT 4.2 · Unanswered 249 · Active agents 27 · Hours saved 39.7
Regions: West 412 · Central 387 · NorthEast 524 · SouthEast 461 · TCA Build. 298 · Travel Mkts 406
Tags: Follow-Up Req. 554 · AI Auto Replied 391 · Urgent 277 · Scheduling 155 · Tech Assign. 118 · Parts/Equip. 89 · ETA Request 81 · Delivery Iss. 61 · Missing Mdse. 60 · Refund/Disp. 31 · Complaint 16 · COI Request 3
Annotations: 8 KPIs at 60px wide, illegible · regional split has no room for trend · 12 tags reduced to bare numbers, no action

In the first dashboard review, the CEO asked for everything visible on one page: all six regions, all twelve tags, all agent metrics. The reasoning was reasonable. He didn't want to hunt through screens to find what mattered.

I pushed back. The single-page version would be unreadable at this density, and any summary that fit would tell him nothing he could act on. I proposed a level-based structure instead: a shallow overview that answered "is anything wrong," with each section clickable into a full drill-down dashboard answering "what now." That structure shipped, and is shown below.

Layer 01 · Overview Dashboard · single-screen, scannable in 30s
AI Journey: Level 1 / 2 / 3 progress on the path to autonomy.
Executive Snapshot: FRT, CSAT, agent gap, unanswered count, hours saved.
Email Distribution: emails by tag, by region. The two cuts leadership reached for.
Daily Flow: hourly volume by tag, agent, region. Drives shift planning.
↓ drill down
Layer 02 · four drill-down dashboards
Region Overview: for regional managers
Tag Analysis: sub-cause breakdown
Agents: per-agent coaching
AI Inbox: live email review
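
A sketch of that two-layer IA as a config object, with assumed identifiers. The section and dashboard names are from this case study, but the mapping of each overview section to a specific drill-down is my illustrative guess:

```typescript
// A sketch of the overview-and-drill-down IA. Titles and dashboard
// names are from the case study; the section-to-drill-down mapping
// and all field names are assumptions.

interface OverviewSection {
  title: string;     // Layer 01 card
  answers: string;   // what a 30-second scan should tell you
  drillDown: string; // the Layer 02 dashboard it opens
}

const overviewDashboard: OverviewSection[] = [
  { title: "AI Journey",         answers: "How far along is automation?", drillDown: "AI Inbox" },
  { title: "Executive Snapshot", answers: "Are the vitals healthy?",      drillDown: "Region Overview" },
  { title: "Email Distribution", answers: "Where is volume coming from?", drillDown: "Tag Analysis" },
  { title: "Daily Flow",         answers: "Do shifts match the load?",    drillDown: "Agents" },
];
```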
Deep Dive · 02
// Email — Dashboard system

Dashboard architecture, end to end.

Overview to drill-down, regional and tag-level views, KPI hierarchy, and the rejected single-dashboard approach in full detail. Annotated screens with the IA reasoning.

Overview + 4 drill-downs · IA breakdown · KPI hierarchy
Surface · Upstream 07 Root Cause

The Root Cause Analysis Engine.

What changed · The system shifted from responding to emails to fixing operations. Replying faster is the obvious win. Reducing the reasons the emails happen at all is harder, but more valuable. It's what made leadership stop asking "how do we reply faster" and start asking "what's making this happen."

The engine groups the classifier's sub-categories into bigger operational problems, ranked by how often they appear and whether they're getting worse. Most of these are everyday issues that come up dozens of times a week: delivery windows that don't match what drivers can actually do, technicians being booked twice for the same time, customers not being told when their delivery is coming.
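
A minimal sketch of that grouping and ranking, under assumed shapes; the real scoring was engineering's work, but the logic it had to express is simple:

```typescript
// A sketch of the Root Cause grouping and ranking. All shapes and
// names here are assumptions, not the production implementation.

interface SubCauseCount {
  subCause: string;
  emails: number;      // current 90-day window
  prevEmails: number;  // previous window, for the trend arrow
}

interface UpstreamIssue {
  name: string;        // e.g. "Scheduling window mismatch"
  subCauses: string[]; // classifier sub-causes that roll up into it
}

function rankIssues(issues: UpstreamIssue[], counts: SubCauseCount[]) {
  return issues
    .map((issue) => {
      const rows = counts.filter((c) => issue.subCauses.includes(c.subCause));
      const emails = rows.reduce((sum, c) => sum + c.emails, 0);
      const prev = rows.reduce((sum, c) => sum + c.prevEmails, 0);
      // Positive trend means the issue is getting worse.
      const trendPct = prev > 0 ? Math.round(((emails - prev) / prev) * 100) : 0;
      return { name: issue.name, emails, trendPct };
    })
    // "By volume" is the default sort in the UI; trend breaks ties.
    .sort((a, b) => b.emails - a.emails || b.trendPct - a.trendPct);
}
```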

This changed the conversation with leadership. They stopped asking "how do we reply faster" and started asking "why are we getting these emails in the first place?" That was the better question.

The hardest part was holding back. The dashboard could have suggested fixes. For example, "fix the scheduling window problem first, it's a quarter of your inbound." I chose not to add this. The system shows what's happening; people decide what to fix.

If the engine had recommended a fix and got it wrong, leadership would have stopped trusting everything else it told them.

17,000+ classified emails (raw inbound) → 12 top-level tags (routed & replied) → ~25 upstream issues (what leadership fixes)
Deep Dive · 03
// Email — Root Cause Engine

From categories to operational fixes.

Tag Analysis surface, reasons leaderboard, regional heatmaps, the reason-tree breakdown, and the cost-forecasting layer. The full UI walkthrough of how 17,000+ classified emails turn into ~25 operational signals leadership can act on.

Tag Analysis surface · Reason tree · Cost forecasting
Friction 08 Six months in

The system worked. The dashboard didn't.

A friction moment we didn't see coming.

What we learned · The dashboard worked. So well that, after two months, they had less reason to check it. Leadership saw the patterns, fixed the obvious operational issues on their side, and dashboard views naturally dropped. The system was doing its job. But new issues kept coming up, and without distribution, no one would see them.

For the first two months after launch, leadership used the dashboard heavily. They saw the patterns we'd surfaced: repeated scheduling issues, regional volume gaps, agent load differences. And they started fixing things on their end. That was the point of the dashboard.

After those two months, dashboard views dropped sharply. The obvious problems had been fixed, and leadership had less reason to keep checking.

But new operational issues kept appearing. Without someone watching the dashboard, the system would notice them but no one would see them.

The fix was distribution. We added a daily alert email to the client manager and a weekly operations report to leadership. The dashboard didn't change. We just stopped expecting the client to come to it.
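
A sketch of what the distribution layer computes, with assumed names and an assumed alert threshold (the case study specifies neither):

```typescript
// A sketch of the distribution layer: push the dashboard's answers
// out instead of waiting for someone to come look. Names and the
// 10% threshold are assumptions.

interface IssueSnapshot {
  name: string;
  emails: number;
  trendPct: number; // change vs. the previous period
}

// Daily alert to the client manager: only what moved, not everything.
function dailyAlertLines(issues: IssueSnapshot[], minTrendPct = 10): string[] {
  return issues
    .filter((i) => i.trendPct >= minTrendPct)
    .map((i) => `${i.name}: ${i.emails} emails, up ${i.trendPct}% vs. last period`);
}

// Weekly operations report to leadership: top issues, worst first.
function weeklyReport(issues: IssueSnapshot[], top = 5): string {
  return [...issues]
    .sort((a, b) => b.emails - a.emails)
    .slice(0, top)
    .map((i, n) => `${n + 1}. ${i.name} (${i.emails} emails)`)
    .join("\n");
}
```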

Client manager: before, 4 views/mo → after, daily logins
Leadership: before, 0 views/mo → after, weekly review
650% increase in dashboard engagement, after we stopped expecting the client to come to it.
From 4 views/mo → daily (manager) · 0 → weekly (leadership)

The takeaway was uncomfortable: we'd designed a tool that answered the right questions and assumed the client would ask them. They wouldn't, until we made the answers come to them.

Adjacent 09 What came next

The email layer became the foundation.

The platform grew into two adjacent agents on the same operational backbone. WhatsApp for driver-side coordination, Voice for inbound calls. Each had its own design problem (drivers reply one-handed on the road; voice has to know when not to try) but both used the same tag system and dashboard structure, so context followed the customer across channels.

Email · this case
Classifier, tiered drafts, dashboards, Root Cause Engine.
Auto / Suggest / Manual tiers · 17K+ emails classified · ~67% AI first-touch

WhatsApp · adjacent
Proactive customer ETAs. Driver coordination via structured replies on the road.
3 buttons: Confirm / Late / Help · 3-min auto-nudge · 7-min human escalation

Voice · adjacent
AI screens inbound calls and resolves common inquiries.
<60s resolution target · linked to email context · routes complex calls to humans

Shared layer · inherited by all agents: tag system · dashboard structure · customer context · escalation patterns
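
A minimal sketch of that shared layer, with assumed field names; the point is that a handoff rewrites the channel, not the context:

```typescript
// A sketch of the customer context shared across email, WhatsApp, and
// voice. Field names and shapes are assumptions, not the production model.

type Channel = "email" | "whatsapp" | "voice";

interface CustomerContext {
  customerId: string;
  region: number;
  openTag?: string;      // current top-level tag, if a thread is open
  lastChannel?: Channel; // where the customer was last touched
  history: { channel: Channel; summary: string; at: string }[];
}

// A channel handoff the customer never sees: the receiving agent gets
// the full history, e.g. the email AI booking a voice callback.
function handoff(ctx: CustomerContext, to: Channel, summary: string): CustomerContext {
  return {
    ...ctx,
    lastChannel: to,
    history: [
      ...ctx.history,
      { channel: ctx.lastChannel ?? "email", summary, at: new Date().toISOString() },
    ],
  };
}
```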
Closing 10 Lessons

What I'd carry to the next system.

  1. The categories are the foundation, not setup work.

    At first I treated the email categories as a setup task. Something engineering would handle. But every part of the system used these categories: dashboards, replies, the Root Cause Engine. When I had to change a category later, every screen and every feature using it had to change too. It was the most important decision I made, not the smallest one.

  2. Reframing the scope is part of the job.

    The contract said AI auto-reply. The data said tagging. AI product design, I've come to think, is partly about renegotiating the brief once you can see what the customer actually has.

  3. Operators trust override, not accuracy.

    Telling the team the AI was accurate didn't move adoption. Showing them they could veto any reply, no friction, did.

  4. Leadership wants one number. The operation runs on five.

    When the CEO asked for one dashboard, his instinct was right. The execution he had in mind wouldn't have worked. What we shipped was a shallow overview that you could scan in thirty seconds, with four drill-downs underneath for the people whose job actually needed the detail.

  5. Don't recommend, surface.

    If Root Cause had recommended fixes, the first wrong one would have killed it. The system can show what's happening. It shouldn't tell people what to do, because the engine doesn't have the full context for what's actually fixable. That's the discipline.

  6. Distribution beats availability.

    We built the dashboard. The dashboard had the answers. The client didn't open it. The fix was a daily email.