Inbox · Rebuilt

From 35,000 unread to inbox zero

An email inbox rebuilt as a production system — Gmail API, a SQLite sender catalog, a 4×/day cron, a human-in-the-loop over Telegram, and Claude Code as the interface layer. Ten days of focused evenings closed a battle I'd been losing for years.

Inbox Zero — a cosmic architecture diagram showing the 4×/day agent loop with Gmail, OAuth, cron, catalog, and Telegram

35,000

Unread at start

Unread today

4×/day

Cron cadence

Domain filters

60+

Senders decided

~5 min

Upkeep / week

🧭The reframe that actually worked

Gmail's UI treats every email as a decision. The real decisions are about senders.

For years I built up filters one promotional email at a time, usually at 1 AM after a particularly irritating newsletter. Forty-plus filters later, I couldn't tell what they did without reading each one. When Google Pay sent me a confirmation that I'd added a card, I couldn't find the email.

The unlock was realizing my inbox isn't a list of messages — it's a flow of senders. Each message from a sender I've already decided about doesn't need to be decided again. So I made the sender the primary key: a SQLite catalog with one row per sender and a stored decision (archive / mark_read / keep_inbox / unsubscribe). Every other piece of the system became a dumb executor of that catalog.

🔑The core idea, in one sentence

Gmail becomes dumb storage. The catalog becomes the brain. Every decision I make is leveraged across every future email from that sender — the work compounds instead of repeating.

🏗The architecture, in four pieces

Gmail API · SQLite catalog · triage script (+ sweeps) · me.

1. Gmail API

OAuth2 to a GCP project, scopes: mail.google.com, gmail.send, gmail.settings.basic. Token cached on disk, refresh-token-based. App published to production mode so tokens don't expire every 7 days.

2. Sender catalog (SQLite)

One row per sender. Columns: email, domain, decision, decided_by (user / rule / gemini), confidence, lifetime counts, sample subjects. This file is the source of truth.

3. Triage script + sweeps

Core loop: fetch unread → look each sender up in catalog → apply decision. Plus two sweeps: spam purge (create domain filters for new primary domains, permanently delete current spam) and badge sweep (clear UNREAD on archived-but-unread messages so Gmail's sidebar badges don't lie).

4. Me, when it matters

New senders enter a review queue. Security keywords always surface even from auto-archived senders (so password-reset emails don't slip through). Telegram pings only when there's something genuinely new — silence is the feature.

text

email           = 'specials@eml.cigar.com'
domain          = 'eml.cigar.com'
display_name    = 'CIGAR.com Specials'
decision        = 'archive'         # archive | mark_read | keep_inbox | unsubscribe
category        = 'Marketing'
decided_by      = 'user'            # user | rule | gemini | inferred
confidence      = 100
total_count     = 483               # lifetime
unread_count    = 0                 # this run
sample_subjects = ['50% off...', ...]
first_seen      = 2024-02-12
last_seen       = 2026-04-22

↕ Scroll

A single catalog row

🔍Four nuances that cost me days to work out

The kind of invariants you don't know you need until you ship the bug.

#1 — Primary-domain collapse

Filtering specials@eml.cigar.com, support@cigar.com, and rewards@cigar.com as three separate rules is bureaucracy. Collapse the sender to its primary registered domain (cigar.com) using a public-suffix-aware heuristic. One domain, one rule.

#2 — The SKIP list (a.k.a. the day I almost nuked every gmail.com sender)

An early version of the script auto-created a filter for every primary domain in Spam. One spam message came from a gmail.com address. The script happily created a filter that would have trashed every future email from any gmail.com sender. I also caught:

insidetracker.com — a paid subscription
ne.jp / org.bd / edu.pl — public suffixes, not registered domains (like .co.uk)
frame.io — a tool I use; only the marketing subdomain should be filtered

The fix: two persistent lists baked into the script — a public-suffix list so the collapse doesn't over-shorten, and a DEFAULT_SKIP list of multi-tenant email hosts (gmail, yahoo, icloud, outlook…) and real services I use. Never blanket-filter these.

#3 — Auto-mark-read for transactional mail

A huge chunk of a real inbox isn't spam — it's transactional receipts. Shipping notifications, order confirmations, utility bills, card alerts. I don't want them archived (I'll need to search later), but I also don't want them unread (false urgency). The mark_read decision handles exactly this class: stay in inbox, don't contribute to the counter. This was the single biggest contributor to going from 35,000 to zero.

#4 — The sidebar-badge lie Gmail doesn't talk about

One morning my Updates sidebar badge said 2. I clicked — empty. Gmail's sidebar counts every message with CATEGORY_UPDATES + UNREAD, regardless of INBOX state. But the Updates tab filters to INBOX. Some older filters had archived messages without also clearing UNREAD, leaving phantom, unreachable, permanently-unread messages lighting up the badge.

python

# Clear UNREAD on any archived-but-unread message — these are unreachable from any tab
resp = svc.users().messages().list(
    userId='me',
    q='is:unread -in:inbox -in:trash -in:spam -in:drafts',
    maxResults=500,
).execute()
ids = [m['id'] for m in resp.get('messages', [])]
if ids:
    svc.users().messages().batchModify(
        userId='me',
        body={'ids': ids, 'removeLabelIds': ['UNREAD']},
    ).execute()

↕ Scroll

The ten-line fix — runs at the end of every triage

🤝Human-in-the-loop, designed for silence

A Telegram bot that pings you only when something actually needs you.

Every triage run classifies mail into three buckets: known senders (rules apply automatically), new senders (added as pending and queued for my review), and security items (always surfaced, even if the sender is auto-archived — because a password-reset email from a marketing-archived domain still matters).

When I'm at my desk, I review the pending queue by typing something like “archive joolca, keep apple card, unsubscribe starlink” and the system translates that into --decide calls. When I'm not at my desk, Telegram pings me — only if there's a new sender, a security item, or a pending unread count above a threshold. A bot that pings every four hours with “nothing new” trains you to ignore the bot. Silence is the feature.

The human-in-the-loop design matters more than the specific messaging channel. The point is: the agent handles anything it's seen before; novel decisions route to me with just enough context to decide quickly (sample subjects + lifetime volume).

⏰4×/day, about three seconds each

launchd cron → triage → spam sweep → badge sweep → notify (maybe).

Scheduled via launchd at 07:00, 12:00, 17:00, and 22:00 — the hours of my day when checking an inbox is actually useful. Each run does four things in order: apply the catalog's decisions to current unread, scan Spam and purge (creating filters for new primary domains in the process), sweep the ghost-unread archive, and fire a Telegram notification only if there's something worth firing about.

The wrapper is 30 lines of bash. The triage itself is around 400 lines of Python. The spam sweep is 150. The badge sweep is ten. All the interesting behavior lives in the catalog schema and the decision logic — the plumbing is boring on purpose.

🧠Claude Code as the interface layer

Natural-language triage beats remembering shell commands.

I didn't build this in one sitting. I built it through conversation. When I type “check gmail”, Claude invokes the triage skill and returns structured output: N unread, M auto-archived, K pending. If there are pending senders, it surfaces them with sample subjects, and I can reply “archive X, keep Y” in plain English. Claude translates that into a batch of --decide calls.

When I have ideas like “always delete everything in the spam folder after you apply the rules”, Claude understands that's a standing behavior change, wires it into the launchd wrapper, and tests it end-to-end. The next scheduled run at 17:00 just does it. This is what I mean when I call LLMs engineering partners rather than autocomplete.

🪤Mistakes I made, and what's next

The stuff I'd do differently if I started again.

Mistakes

✗Filtering by sender address instead of primary domain — 40 rules where I needed 8
✗No SKIP list at first (the gmail.com near-miss)
✗Treating the catalog as code instead of data — every rule change was a commit
✗Building a terminal UI before realizing Claude Code is a better UI
✗Not logging run-level stats until a week in

What's next

→Gmail-filter-to-catalog import — merge ~30 legacy Gmail filters into the single source of truth
→Thread-level dedup (right now a 7-message thread counts 7×)
→Mobile-editable sender preferences (a way to append decisions from my phone)
→Deeper Telegram flow — approve/reject queued senders without a desk

💬The lesson

The reason 35,000 emails felt impossible wasn't volume. It was that I was being asked to make decisions about individual messages, and most of those decisions weren't about the message — they were about the sender. Turn a list-of-messages problem into a table-of-senders problem and the same inbox becomes tractable.

Related Architecture

🧩

Agent Frameworks — compared →

LangGraph, CrewAI, AutoGen, LlamaIndex, Swarm, Pydantic AI — an apples-to-apples comparison with code.

🏭

The AI Factory Pattern →

How to industrialize an agent workflow: input queue → agent processing → validated output, at scale.

Back to AI/ML Overview