AI Decision Architecture

Design Rationale, Competitive Research & Implementation Notes

2026-03-19 · Internal Document

Table of Contents

1. Three-Layer Decision Architecture
2. Override Flywheel — Corrected Design
3. Data Retrieval Strategy
4. Sparse Reward Problem
5. Cross-Industry Competitive Research
6. SaaS Value Layer Analysis

1. Three-Layer Decision Architecture

The agent system uses a three-layer hierarchy for decision-making, not a binary rules-vs-LLM split:

Layer	Coverage	Mechanism	Cost
L1: Rules Engine	~80%	Deterministic SOP state machines, configurable per store	$0
L2: Contextual Bandit	~15%	Thompson Sampling, cross-store shared model, learns from Yes/No + business outcomes	$0 (CPU-only)
L3: LLM	~5%	Claude Sonnet for truly novel/unknown situations	~$12/store/month
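
A minimal sketch of how one request might fall through the three layers, cheapest first. It assumes each layer returns None when it cannot decide; that convention, the function names, and the demo actions are illustrative assumptions, not the production interface.

```python
from typing import Callable, Optional

Context = dict
Decider = Callable[[Context], Optional[str]]


def decide(context: Context, rules: Decider, bandit: Decider,
           llm: Callable[[Context], str]) -> tuple[str, str]:
    """Return (layer, action), trying the cheapest layer first."""
    action = rules(context)      # L1: deterministic SOP lookup
    if action is not None:
        return "L1", action
    action = bandit(context)     # L2: learned choice among known actions
    if action is not None:
        return "L2", action
    return "L3", llm(context)    # L3: novel situation, pay for an LLM call


# Hypothetical stand-ins for the real layers.
def demo_rules(ctx: Context) -> Optional[str]:
    return "assign_next_in_rotation" if ctx.get("service") == "manicure" else None


def demo_bandit(ctx: Context) -> Optional[str]:
    return "assign_senior" if ctx.get("service") == "gel_extension" else None


def demo_llm(ctx: Context) -> str:
    return "escalate_to_manager"
```

Because the LLM is only reached when both cheaper layers decline, its cost scales with the share of genuinely novel situations rather than with total traffic.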

Why Contextual Bandit (L2) Instead of Direct LLM?

Cost: Bandit runs on CPU — no inference cost. LLM calls cost tokens.
Latency: Bandit decision < 1ms. LLM decision 1–3 seconds.
Learning: Bandit explicitly learns from each Yes/No signal. LLM is stateless (unless fine-tuned).
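Determinism: Thompson Sampling draws are random, but given the same context, the same Bandit state, and the same random seed, the recommendation is reproducible. LLM output can vary from call to call.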

Candidate Pre-Filtering (Spotify Pattern)

The rules engine first narrows candidates (e.g., 3–5 eligible technicians based on skill, availability, certification). The Bandit then selects among qualified candidates only. This limits the negative impact of exploration — the Bandit cannot recommend someone unqualified.
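
The pre-filter-then-explore pattern can be sketched as follows. This is a minimal illustration, assuming one Beta posterior per technician and a simple skill/availability/certification check; the `Technician` fields and function names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Technician:
    name: str
    skills: frozenset
    available: bool = True
    certified: bool = True


def eligible(techs, service):
    """L1 rules: drop anyone who lacks the skill, certification, or free time."""
    return [t for t in techs if service in t.skills and t.available and t.certified]


def thompson_pick(candidates, posteriors, rng):
    """L2 bandit: draw from each candidate's Beta posterior, keep the best draw.

    posteriors maps technician name -> (alpha, beta). Unseen names start at
    Beta(1, 1), a uniform prior (an assumption of this sketch).
    """
    if not candidates:
        return None

    def draw(tech):
        alpha, beta = posteriors.get(tech.name, (1.0, 1.0))
        return rng.betavariate(alpha, beta)

    return max(candidates, key=draw)


def recommend(techs, service, posteriors, rng=None):
    """Filter first, then explore only among qualified candidates."""
    if rng is None:
        rng = random.Random()
    return thompson_pick(eligible(techs, service), posteriors, rng)
```

Even if an unqualified technician has a very strong posterior, exploration can never surface them, because sampling only runs over the filtered list.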

Reference: Spotify pre-selects 100 most relevant items before contextual bandit explores, limiting UX impact. Result: 36.6% improvement in impression efficiency.

Two-Layer Reward Signal

reward = w1 × immediate_signal + w2 × delayed_signal

immediate_signal: store manager Yes=1, No=0 (available in seconds)
delayed_signal:   business outcome metrics (batch-computed daily)
  - customer satisfaction score
  - service duration vs. expected
  - rebooking rate
  - same-day revenue impact

Why Two Layers Matter

	Immediate = Yes	Immediate = No (Override)
Delayed = Good outcome	Strong positive — AI correct, human agrees	Weak negative — override worked, but AI may also have been fine
Delayed = Bad outcome	Most valuable case — AI wrong, human missed it too (blind spot)	Strong negative — AI wrong, override also didn't help

The lower-left quadrant (accepted but bad outcome) is the most valuable training signal — it reveals systematic blind spots in both AI and human judgment.
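
A small sketch of the reward formula and the four quadrants. The weights 0.3 and 0.7 are placeholder assumptions, as is the idea that the delayed metrics have already been normalized into a single score in [0, 1]; the document does not specify either.

```python
W_IMMEDIATE = 0.3  # assumed value for w1
W_DELAYED = 0.7    # assumed value for w2


def combined_reward(immediate: float, delayed: float,
                    w1: float = W_IMMEDIATE, w2: float = W_DELAYED) -> float:
    """reward = w1 * immediate_signal + w2 * delayed_signal, both in [0, 1]."""
    return w1 * immediate + w2 * delayed


def quadrant(accepted: bool, good_outcome: bool) -> str:
    """Label a decision with the matching cell of the 2x2 table."""
    if accepted and good_outcome:
        return "strong_positive"   # AI correct, human agrees
    if accepted:
        return "blind_spot"        # accepted, but the outcome was bad
    if good_outcome:
        return "weak_negative"     # override worked; AI may also have been fine
    return "strong_negative"       # AI wrong, override did not help either
```

Weighting the delayed signal more heavily than the manager's click is what lets a "blind_spot" decision end up with a low reward even though it was accepted.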

Reference: DoorDash uses daily batch reward computation rather than instant feedback. Stitch Fix uses stylist curation (immediate) + customer keep/return (delayed) as two-layer signal.

Cross-Store Sharing

All stores contribute to a single Bandit model with store_id as a context feature. Because every store's feedback updates the same model, the training signal available for a given context grows roughly with the number of active stores.

Queens #3 learns something → Flushing #1 benefits immediately
50 stores × daily decisions = sufficient signal for convergence
New stores joining get immediate benefit from the shared model

Reference: ServiceTitan uses industry benchmarks as priors for first 3–4 months, then transitions to per-company models. Toast leverages 130K+ locations for cross-location insights.

2. Override Flywheel — Corrected Design

The original product-definition.md stated: "Every time an employee rejects an AI suggestion → system learns a new rule." This was oversimplified. A single override is noise, not signal. The corrected design below was established 2026-03-19.

What Overrides Are NOT

A single override is not a rule. It's a data point.
A single override is not "experience." It could be personal preference, a one-time situation, or a mistake.
Overrides should not automatically become rules — this would inject noise into the rules engine.

What Overrides ARE

Each override is a (state, action, reward) tuple for the Contextual Bandit
Aggregated overrides reveal patterns through statistical convergence
The aggregation happens automatically via Bandit probability distributions — no hardcoded thresholds

The Correct Flywheel

Stage 1: Data Collection (continuous, zero AI cost)
  Agent makes decision → Employee Yes/No → Store as raw structured record
  → Bandit distributions update immediately

Stage 2: Bandit Learning (automatic)
  Distributions converge over many observations
  → Recommendations improve incrementally
  → No explicit analysis needed

Stage 3: Convergence Detection (automatic)
  When a distribution becomes highly concentrated
  (one action dominates with high confidence, across multiple stores)
  → System generates "suggested rule" event

Stage 4: Human-in-the-Loop Rule Graduation
  Super Admin reviews suggested rule with supporting data:
  - Frequency and consistency across stores
  - Business outcome correlation
  - Which stores contributed data
  → Admin adopts / modifies / dismisses

Stage 5: Cost Reduction
  Adopted rules move from L2 (Bandit) to L1 (Rules Engine)
  → Zero marginal cost, deterministic execution
  → The system gets cheaper over time

The flywheel compounds on three axes: accuracy (more data → better Bandit), cost (rule graduation → fewer Bandit/LLM calls), and switching cost (accumulated decision data is non-portable).
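
Stages 1 through 3 can be sketched with a Beta-Bernoulli update and a Monte Carlo dominance check. This is an illustration under stated assumptions: posteriors are tracked per store so that "across multiple stores" can be checked, and the `confidence` and `min_stores` values are placeholders for whatever concentration criterion the real system derives from the distribution shape.

```python
import random


def update(posterior, accepted):
    """Stages 1-2: Beta-Bernoulli update from a single Yes/No."""
    alpha, beta = posterior
    return (alpha + 1, beta) if accepted else (alpha, beta + 1)


def prob_best(arms, rng, draws=2000):
    """Monte Carlo estimate of P(action has the highest acceptance rate)."""
    wins = dict.fromkeys(arms, 0)
    for _ in range(draws):
        sample = {a: rng.betavariate(al, be) for a, (al, be) in arms.items()}
        wins[max(sample, key=sample.get)] += 1
    return {a: w / draws for a, w in wins.items()}


def suggested_rule(arms_by_store, rng, confidence=0.95, min_stores=2):
    """Stage 3: return the action that dominates in enough stores, else None."""
    dominant = {}
    for store, arms in arms_by_store.items():
        probs = prob_best(arms, rng)
        action = max(probs, key=probs.get)
        if probs[action] >= confidence:
            dominant.setdefault(action, []).append(store)
    for action, stores in dominant.items():
        if len(stores) >= min_stores:
            return {"action": action, "stores": stores}
    return None
```

The returned dictionary is only a suggestion event; adoption still goes through the Stage 4 human review.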

3. Data Retrieval Strategy

Core Principle: No Vector Search Needed

Override data is highly structured (service type, time, employee, action). Use SQL, not embeddings.

Progressive Relaxation Query

When the Bandit has insufficient data for a specific context (cold start for a new combination), the system retrieves historical overrides using progressively relaxed SQL queries:

Example A, Round 1: Exact match
  service=gel_extension AND day=friday AND hour=19
  → 3 results → sufficient, use these

Example B, Round 1: Exact match
  service=acrylic_fullset AND day=thursday AND hour=15
  → 0 results → relax

Example B, Round 2: Relax service dimension, match on duration
  estimated_minutes > 60 AND day=thursday AND hour=15
  → 0 results → relax further

Example B, Round 3: Relax time dimension, match on traffic tier
  estimated_minutes > 60 AND is_peak_hour=true
  → 8 results → sufficient

Each round is a SQL query (< 10ms). Results are aggregated into a compact summary (< 100 tokens) before being fed to the LLM as few-shot context.
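
The relaxation rounds can be sketched with the standard-library sqlite3 module. The `overrides` table, its columns, the `MIN_RESULTS` cutoff of 3, and the demo rows are all assumptions made for illustration; the real schema may differ.

```python
import sqlite3

# Each round drops or generalizes one dimension. The clauses are fixed
# constants, never user input, so formatting them into the query is safe.
ROUNDS = [
    ("exact", "service = :service AND day = :day AND hour = :hour"),
    ("by_duration", "estimated_minutes > 60 AND day = :day AND hour = :hour"),
    ("by_traffic_tier", "estimated_minutes > 60 AND is_peak_hour = 1"),
]
MIN_RESULTS = 3  # assumed meaning of "sufficient"


def find_overrides(conn, service, day, hour):
    """Return (round_name, rows) from the first round with enough matches."""
    params = {"service": service, "day": day, "hour": hour}
    for name, where in ROUNDS:
        # Bind only the parameters this round's clause actually uses.
        used = {k: v for k, v in params.items() if f":{k}" in where}
        rows = conn.execute(
            f"SELECT id, service, action FROM overrides WHERE {where}", used
        ).fetchall()
        if len(rows) >= MIN_RESULTS:
            return name, rows
    return None, []


def make_demo_db():
    """In-memory table with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE overrides (id INTEGER PRIMARY KEY, service TEXT, day TEXT,"
        " hour INTEGER, estimated_minutes INTEGER, is_peak_hour INTEGER, action TEXT)"
    )
    rows = [("gel_extension", "friday", 19, 90, 1, "junior_to_senior")] * 3
    rows += [("gel_extension", "saturday", 18, 90, 1, "junior_to_senior")] * 5
    conn.executemany(
        "INSERT INTO overrides (service, day, hour, estimated_minutes,"
        " is_peak_hour, action) VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    return conn
```

With the demo data, `find_overrides(make_demo_db(), "acrylic_fullset", "thursday", 15)` finds nothing in the first two rounds and returns the eight long peak-hour rows from the third, mirroring Example B.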

Why NOT Vector Search

Override data has clear structured dimensions — SQL is faster and more precise
Similarity thresholds for vector search are model-dependent and scenario-dependent — no universal correct value
Vector DB is additional infrastructure to maintain
Progressive relaxation achieves "fuzzy matching" through dimension hierarchy, not embedding similarity

Context Compression for LLM

❌ Wrong: Feed 20 raw override records (wastes tokens)
"3/7 Amy→Lisa, 3/14 Mike→Chen, 3/21 Amy→Lisa..."

✅ Right: Pre-aggregate in SQL, feed summary
"Past 60 days, Fri 18-21 gel extension walk-in assignment
 overridden 14 times, 86% junior→senior.
 Across 3 stores, 4 managers. Last: 3/21 Queens #3."

Result: < 100 tokens, same information density
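
A sketch of the pre-aggregation step, again using sqlite3. The column names (`from_level`, `to_level`, `store`, `manager`, `overridden_on`) are hypothetical, and the 60-day window from the example above is left out for brevity.

```python
import sqlite3


def summarize(conn, service, day, hour_from, hour_to):
    """Aggregate matching overrides in SQL and return a short text summary."""
    total, junior_to_senior, stores, managers, last = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(from_level = 'junior' AND to_level = 'senior'), 0),
               COUNT(DISTINCT store),
               COUNT(DISTINCT manager),
               MAX(overridden_on)
        FROM overrides
        WHERE service = ? AND day = ? AND hour BETWEEN ? AND ?
        """,
        (service, day, hour_from, hour_to),
    ).fetchone()
    if total == 0:
        return "No matching overrides."
    share = round(100 * junior_to_senior / total)
    return (
        f"{day} {hour_from}-{hour_to} {service}: overridden {total} times, "
        f"{share}% junior→senior. Across {stores} stores, {managers} managers. "
        f"Last: {last}."
    )
```

The LLM prompt then carries one sentence per context instead of one line per override, so prompt size stays flat as the override history grows.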

4. Sparse Reward Problem

The Challenge

With ~1M daily operations across 50 stores, only a tiny fraction contain valuable learning signals. Processing every record with AI is wasteful.

Solution: Let Signal Emerge Through Bandit Convergence

The Contextual Bandit naturally solves the sparse signal problem:

No write-time processing: Store raw structured records only, zero AI cost
No batch analysis needed: Bandit distributions update on each Yes/No, so signal accumulates automatically. The only batch step is the daily computation of the delayed outcome signal.
No hardcoded thresholds: Convergence is detected by the shape of the probability distribution, not by arbitrary cutoffs
Small-sample robustness: Beta distributions are naturally conservative with low sample counts (high variance → don't over-commit)
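
To make the small-sample point concrete, the mean and standard deviation of a Beta posterior can be computed in closed form. The sketch below assumes a uniform Beta(1, 1) prior; the function name is illustrative.

```python
from math import sqrt


def beta_mean_std(yes, no, prior_alpha=1.0, prior_beta=1.0):
    """Mean and standard deviation of Beta(prior_alpha + yes, prior_beta + no)."""
    a, b = prior_alpha + yes, prior_beta + no
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


few = beta_mean_std(yes=2, no=1)       # 3 observations
many = beta_mean_std(yes=200, no=100)  # 300 observations, same 2:1 ratio
```

With 2 Yes and 1 No the posterior is Beta(3, 2): mean 0.6, standard deviation 0.2. With 200 Yes and 100 No the standard deviation falls to about 0.027, so Thompson Sampling keeps exploring in the first case and commits in the second.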

Previous approaches considered and rejected:

LLM tagging at write time: Rejected. The generated tags would be larger than the original records, and tokens would be spent on the 99.99% of records that are noise
Fixed statistical thresholds (e.g., "override rate > 30%"): Rejected — different scenarios have different baselines, hardcoded thresholds cannot adapt
Periodic batch SQL aggregation: Partially useful for reporting but not needed for the core learning loop — Bandit handles this automatically

5. Cross-Industry Competitive Research

Researched: 2026-03-19

Beauty/Salon Vertical — No AI Decision Loops Exist

Platform	AI Approach	Learns from Feedback?
Zenoti	Rule-based segmentation + NLP receptionist + Smart Marketing	No
Mindbody	Trigger-based automation + Attentive partnership	No
Phorest	Behavior-based triggers (Client Reconnect)	No
Boulevard	Manual tags, no AI decision layer	No
Vagaro	AI receptionist only	No
MaSe	None	No

Cross-Industry References

ServiceTitan — Dispatch Pro (Closest Analogy)

ML-powered dispatch simulation for home service technicians
Signals: actual revenue vs predicted, technician close rate, drive time accuracy
Cold start: industry benchmarks as priors for first 3–4 months, then per-company model
Directly applicable to Celoria's technician assignment problem

DoorDash — MAB Platform

Thompson Sampling for experimentation optimization
Daily batch reward computation (not real-time)
Chosen over UCB for robustness to delayed feedback
LLM-powered cross-vertical cold start (restaurant → grocery preferences)

Spotify — Contextual Bandits (Mar 2025)

Epsilon-greedy for homepage content-type calibration
Context: time of day, day of week, device, user/content embeddings
Key design: pre-select 100 candidates, then explore within that set
Result: 36.6% improvement in impression efficiency

Netflix — Artwork Personalization

Thompson Sampling with Doubly Adaptive correction (DATS)
20M+ personalized image requests/second at peak
Offline replay evaluation before online deployment

Stitch Fix — Human-in-the-Loop Gold Standard

AI generates candidates → human stylist curates → customer keep/return
Two-layer feedback: stylist selection (immediate) + customer decision (delayed)
Cold start via style quiz (initial profiling) + collaborative filtering
2024–2025: Added GenAI Style Assistant for pre-purchase dialogue feedback

Industry Maturity Spectrum

Approach	Who Uses It	Celoria Relevance
Rule-based automation	All salon SaaS (Zenoti, Phorest, etc.)	Our L1 — table stakes, not differentiator
Batch retraining	Healthcare (Viz.ai, Aidoc), Uber	Too heavy for our stage; relevant post-scale
Multi-armed bandits	DoorDash, Netflix, Stitch Fix	Our L2 — proven at scale, lightweight
Contextual bandits	Spotify, Netflix	Our L2 target — context-aware decisions
Full RL (value iteration)	Uber (matching), DoorDash (dispatch)	Future consideration for multi-store orchestration
Self-improving agent loops	Forethought, Presto	Interesting for SOP auto-generation

6. SaaS Value Layer Analysis

Three-Layer Value Model

┌─────────────────────────────────────────────┐
│  Layer 1: UI / Interaction                  │  ← AI Agents replacing this layer
│  (Booking, scheduling, admin CRUD, POS)     │
├─────────────────────────────────────────────┤
│  Layer 2: Business Logic / Permissions      │  ← Agent tools can partially replace
│  (RBAC, multi-tenant, workflow automation)  │
├─────────────────────────────────────────────┤
│  Layer 3: Data / Intelligence / Compliance  │  ← Enduring value, new access layer
│  (Domain models, decision data, audit)      │
└─────────────────────────────────────────────┘

Celoria's Current Position (as of 2026-03-19)

Layer 1–2: Live and validated — 35 locations, 26K+ transactions, 1,133 APIs, 4 frontends
Layer 3: Architecture designed (three-layer decision system, override flywheel, 106 defined actions), but no agent code has been written yet

"SaaS is Dead" — Nuanced Take

UI-heavy CRUD SaaS: Genuinely threatened by AI agents
Data-heavy vertical SaaS: AI becomes a new access layer, not a replacement
Celoria's play: AI-native from Day 1 means the agent layer IS the product, not an add-on. The SaaS platform is the data/compliance substrate that agents operate on.

The key question for any SaaS in the AI era: "What percentage of your value is in Layer 3?" If most value is in Layers 1–2, you're vulnerable. If Layer 3 is where your differentiation lives, AI agents are your distribution channel, not your competitor.