Designing the AI layer behind 27 agents and 6 regions for a national 3PL.
"The client wanted AI to write replies. I argued they needed tagging first."
Two years of work: email tagging, AI-drafted and auto-sent replies, dashboards for the client team, and a Root Cause Engine that helped leadership see the operational problems behind their inbound.
The client thought their problem was reply speed. They asked for AI to write replies for them. Sales agreed and proposed a phased plan: AI drafts first, then full auto-reply later.
The client's ask and the phased plan made the same mistake: both assumed a faster reply was the fix.
You can't safely automate replies when nobody knows what's actually in the inbox. Leadership couldn't answer basic questions about their own operation: which region had the most issues, what percentage of emails were the same repeated question, which agents handled the most volume.
Faster replies would have made the inbox quicker, but not clearer. So I pushed for tagging first. The phased plan became a real journey instead of just a delivery schedule.
I had one day before the build started. I went into the client's actual email data (the full archive, not a sample), used AI to surface the patterns and category distribution, and brought the findings to our internal review with the Debales CEO and team.
The data did the work. It showed exactly what leadership couldn't see on their own: which region was overloaded, what percentage of emails were the same repeated issue, where the volume was actually coming from. Once the CEO saw this, the case for tagging-first was easy. He took it to the client, and they agreed quickly.
A few days later, in a working session where I walked the client through the proposed tags, they added their own. Operation-specific ones like "Yellow Alert" and "Good Morning message," native to how their team thought about the work.
The pivot wasn't political. It was timing. The pilot was scoped for AI drafts and auto-reply, with a small dashboard planned for those features. We hadn't written code yet. I got in early enough that the change was just expanding the foundation. From proposal to tagging running in production: about a week.
Tag every email by intent and region. Make the inbox legible.
AI suggests, agent approves. Tier behavior by category.
High-confidence categories only. Under 60s response.
Changing the scope had a real cost. Sales had budgeted for the original phased rollout and its single small dashboard. What I was proposing was bigger: tagging as the foundation, multiple dashboards on top of it, and later a Root Cause Engine.
To keep the project on track, I had to be clear about what was in the build and what wasn't. The hard part was saying no to features that were tempting but came too early.
These are the actual operator-facing surfaces used in production: overview, tag analysis, AI inbox, and the Root Cause Engine. All four use the same tag system and visual language.
The classifier was the foundation. Every other surface depended on it.
I built the initial tag system by running the client's full email archive through AI to find patterns, then reviewing by hand to fix what AI got wrong and split up emails that didn't fit one category. Then I brought it to the client. Not for them to approve, but for a working session where we'd shape it together.
The hard call wasn't accuracy. It was granularity. Sixty fine-grained categories look comprehensive on a slide and are useless in production. Six categories tell you nothing.
I landed on a two-level structure. Top-level for routing and reply, sub-causes for analysis.
Engineering checked the AI's accuracy. I decided which category got which behavior, and what would move a category between behaviors. We didn't use a strict accuracy percentage early on. Instead, we reviewed flagged replies with the client every week. A single bad reply in a sensitive category was enough to move that category back to "human only."
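A minimal sketch of how that two-level structure and the per-category tiering could fit together; the category names, sub-causes, and policy assignments below are illustrative, not the client's actual taxonomy.

```python
# Sketch only: category names, sub-causes, and policy assignments are illustrative.
from enum import Enum

class ReplyBehavior(Enum):
    HUMAN_ONLY = "human_only"   # agent writes the reply themselves
    AI_DRAFT = "ai_draft"       # AI drafts, agent approves before sending
    AUTO_REPLY = "auto_reply"   # AI replies directly; thread gets an AI-Replied marker

# Top level drives routing and reply behavior; sub-causes exist for analysis only.
TAXONOMY = {
    "Scheduling": ["Window mismatch", "Double booking", "Rebook request"],
    "ETA Request": ["No delivery notification", "Tracking question"],
    "Billing": ["Invoice dispute", "Rate question"],
}

# Behavior is assigned per top-level category and revisited in the weekly review.
REPLY_POLICY = {
    "ETA Request": ReplyBehavior.AUTO_REPLY,
    "Scheduling": ReplyBehavior.AI_DRAFT,
    "Billing": ReplyBehavior.HUMAN_ONLY,
}

def demote_to_human(category: str) -> None:
    """One bad flagged reply in a sensitive category moves it back to human-only."""
    REPLY_POLICY[category] = ReplyBehavior.HUMAN_ONLY
```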
For Scheduling and ETA Request, the email AI works alongside the voice AI agent. If a customer's reply needs a phone confirmation or a quick rebook, the system can shift channels without losing context. The customer doesn't see the handoff. They just get the answer. Agents see all of this through Outlook itself: AI tags on every email, AI drafts ready in their reply field, an AI-Replied marker on auto-handled threads. No new tool for agents, no migration, no training needed.
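As a rough sketch of what "without losing context" means for that email-to-voice handoff: the eligible tags, field names, and payload shape here are assumptions, not the production schema.

```python
# Sketch only: eligible tags, field names, and payload shape are assumptions.
from dataclasses import dataclass

VOICE_ELIGIBLE_TAGS = {"Scheduling", "ETA Request"}

@dataclass
class ThreadContext:
    thread_id: str
    tag: str
    region: str
    summary: str            # what the email AI already knows about the request
    needs_phone_step: bool  # e.g. a confirmation or quick rebook

def hand_off_to_voice(ctx: ThreadContext) -> dict | None:
    """Package the email thread's context so the voice agent starts the call informed."""
    if ctx.tag not in VOICE_ELIGIBLE_TAGS or not ctx.needs_phone_step:
        return None
    return {
        "thread_id": ctx.thread_id,
        "tag": ctx.tag,
        "region": ctx.region,
        "briefing": ctx.summary,
    }
```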
Component patterns, reply states, visual hierarchy decisions, and the Outlook integration model. Annotated screens with the why behind each decision.
This was the design call I had to argue for, and the surface I'm most confident about in retrospect.
I proposed an overview-and-drill-down structure: shallow overview that answers "is anything wrong," dedicated dashboards that answer "what do I do about it."
I defined the initial KPI set, then ran a working session with the CEO and CTO to align business KPIs against the operational ones. Some of what they cared about, like volume trends as a leading indicator for renewal conversations, wasn't visible in the operational data and needed its own treatment.
In the first dashboard review, the CEO asked for everything visible on one page: all six regions, all twelve tags, all agent metrics. The reasoning was reasonable. He didn't want to hunt through screens to find what mattered.
I pushed back. The single-page version would be unreadable at this density, and any summary that fit would tell him nothing he could act on. I proposed the overview-and-drill-down structure instead: a shallow overview that answered "is anything wrong," with each section clickable into a full drill-down dashboard answering "what now." That structure shipped, and is shown below.
Overview to drill-down, regional and tag-level views, KPI hierarchy, and the rejected single-dashboard approach in full detail. Annotated screens with the IA reasoning.
The engine groups the classifier's sub-categories into bigger operational problems, ranked by how often they appear and whether they're getting worse. Most of these are everyday issues that come up dozens of times a week: delivery windows that don't match what drivers can actually do, technicians being booked twice for the same time, customers not being told when their delivery is coming.
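A minimal sketch of that roll-up logic, assuming each classified email carries a sub-cause and a received date; the groupings and the 30-day trend window are illustrative, not the engine's actual rules.

```python
# Sketch only: groupings, field names, and the 30-day trend window are illustrative.
from collections import Counter
from datetime import date, timedelta

ROOT_CAUSES = {
    "Unrealistic delivery windows": ["Window mismatch", "Rebook request"],
    "Double-booked technicians": ["Double booking"],
    "Customers not told when delivery is coming": ["No delivery notification"],
}

def rank_root_causes(emails: list[dict], today: date) -> list[dict]:
    """Roll sub-causes up into operational problems, ranked by volume and trend."""
    cutoff = today - timedelta(days=30)
    recent, prior = Counter(), Counter()
    for email in emails:
        bucket = recent if email["received"] >= cutoff else prior
        bucket[email["sub_cause"]] += 1

    ranked = []
    for cause, sub_causes in ROOT_CAUSES.items():
        count_now = sum(recent[s] for s in sub_causes)
        count_before = sum(prior[s] for s in sub_causes)
        ranked.append({
            "cause": cause,
            "count": count_now,
            "worsening": count_now > count_before,
        })
    return sorted(ranked, key=lambda r: r["count"], reverse=True)
```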
This changed the conversation with leadership. They stopped asking "how do we reply faster?" and started asking "why are we getting these emails in the first place?" That was the better question.
The hardest part was holding back. The dashboard could have suggested fixes. For example, "fix the scheduling window problem first, it's a quarter of your inbound." I chose not to add this. The system shows what's happening; people decide what to fix.
If the engine had recommended a fix and got it wrong, leadership would have stopped trusting everything else it told them.
Tag Analysis surface, reasons leaderboard, regional heatmaps, the reason-tree breakdown, and the cost-forecasting layer. The full UI walkthrough of how 17,000+ classified emails turn into ~25 operational signals leadership can act on.
A friction moment we didn't see coming.
For the first two months after launch, leadership used the dashboard heavily. They saw the patterns we'd surfaced: repeated scheduling issues, regional volume gaps, agent load differences. And they started fixing things on their end. That was the point of the dashboard.
After those two months, dashboard views dropped sharply. The obvious problems had been fixed, and leadership had less reason to keep checking.
But new operational issues kept appearing. Without someone watching the dashboard, the system would notice them but no one would see them.
The fix was distribution. We added a daily alert email to the client manager and a weekly operations report to leadership. The dashboard didn't change. We just stopped expecting the client to come to it.
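A rough sketch of that push model, assuming the daily alert reads from the same ranked signals as the dashboard; the threshold and signal shape are assumptions.

```python
# Sketch only: the threshold and the shape of each signal are assumptions.
def build_daily_alert(signals: list[dict], threshold: int = 20) -> str:
    """Turn ranked root-cause signals into a short email body, not a dashboard visit."""
    lines = [
        f"- {s['cause']}: {s['count']} emails, {'getting worse' if s['worsening'] else 'steady'}"
        for s in signals
        if s["count"] >= threshold or s["worsening"]
    ]
    return "Nothing unusual today." if not lines else "\n".join(lines)
```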
The takeaway was uncomfortable: we'd designed a tool that answered the right questions and assumed the client would ask them. They wouldn't, until we made the answers come to them.
The platform grew into two adjacent agents on the same operational backbone. WhatsApp for driver-side coordination, Voice for inbound calls. Each had its own design problem (drivers reply one-handed on the road; voice has to know when not to try) but both used the same tag system and dashboard structure, so context followed the customer across channels.
Conversation flow design, driver and helper threads, status states, and the bridge that lets a WhatsApp reply close an email loop.
Softphone interface, call logs, agent management, and cross-channel handoff from email when a phone call is the faster path.
At first I treated the email categories as a setup task. Something engineering would handle. But every part of the system used these categories: dashboards, replies, the Root Cause Engine. When I had to change a category later, every screen and every feature using it had to change too. It was the most important decision I made, not the smallest one.
The contract said AI auto-reply. The data said tagging. AI product design, I've come to think, is partly about renegotiating the brief once you can see what the customer actually has.
Telling the team the AI was accurate didn't move adoption. Showing them they could veto any reply, no friction, did.
When the CEO asked for one dashboard, his instinct was right. The execution he had in mind wouldn't have worked. What we shipped was a shallow overview that you could scan in thirty seconds, with four drill-downs underneath for the people whose job actually needed the detail.
If Root Cause had recommended fixes, the first wrong one would have killed it. The system can show what's happening. It shouldn't tell people what to do, because the engine doesn't have the full context for what's actually fixable. That's the discipline.
We built the dashboard. The dashboard had the answers. The client didn't open it. The fix was a daily email.