Design Rationale, Competitive Research & Implementation Notes
2026-03-19 · Internal Document

The agent system uses a three-layer hierarchy for decision-making, not a binary rules-vs-LLM split:
| Layer | Coverage | Mechanism | Cost |
|---|---|---|---|
| L1: Rules Engine | ~80% | Deterministic SOP state machines, configurable per store | $0 |
| L2: Contextual Bandit | ~15% | Thompson Sampling, cross-store shared model, learns from Yes/No + business outcomes | $0 (CPU-only) |
| L3: LLM | ~5% | Claude Sonnet for truly novel/unknown situations | ~$12/store/month |
The rules engine first narrows candidates (e.g., 3–5 eligible technicians based on skill, availability, certification). The Bandit then selects among qualified candidates only. This limits the negative impact of exploration: the Bandit cannot recommend someone unqualified.
Reference: Spotify pre-selects the 100 most relevant items before its contextual bandit explores, which limits UX impact. Result: 36.6% improvement in impression efficiency.
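
The filter-then-explore flow can be sketched as follows. This is a minimal illustration, not the production design: it uses a per-technician Beta-Bernoulli Thompson Sampler with no context features, and the `Technician`, `BetaArm`, `eligible`, and `choose` names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Technician:
    name: str
    skills: set[str]
    available: bool
    certified: bool


@dataclass
class BetaArm:
    # Beta(1, 1) is a uniform prior: no opinion before any feedback arrives.
    alpha: float = 1.0
    beta: float = 1.0

    def sample(self, rng: random.Random) -> float:
        return rng.betavariate(self.alpha, self.beta)

    def update(self, reward: float) -> None:
        # Reward is assumed to lie in [0, 1]; fractional rewards update both counts.
        self.alpha += reward
        self.beta += 1.0 - reward


def eligible(techs: list[Technician], service: str) -> list[Technician]:
    """L1: deterministic filter on skill, availability, and certification."""
    return [t for t in techs if service in t.skills and t.available and t.certified]


def choose(techs, service, arms: dict[str, BetaArm], rng: random.Random):
    """L2: Thompson Sampling restricted to the candidates L1 allowed."""
    candidates = eligible(techs, service)
    if not candidates:
        return None  # nothing qualified; a real system would escalate
    # Draw one plausible success rate per candidate and pick the highest draw.
    return max(candidates, key=lambda t: arms.setdefault(t.name, BetaArm()).sample(rng))
```

Because `choose` only ever samples from the output of `eligible`, exploration can pick a less-proven technician but never an unqualified one.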
reward = w1 × immediate_signal + w2 × delayed_signal
immediate_signal: store manager Yes=1, No=0 (available in seconds)
delayed_signal: business outcome metrics (batch-computed daily)
- customer satisfaction score
- service duration vs. expected
- rebooking rate
- same-day revenue impact
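
A minimal sketch of the reward formula above. The weights (`w1 = 0.3`, `w2 = 0.7`), the equal weighting of the four delayed metrics, and the normalization ranges are all assumptions chosen for illustration; the document does not specify them.

```python
from dataclasses import dataclass


@dataclass
class DelayedOutcome:
    satisfaction: float     # customer satisfaction, assumed normalized to [0, 1]
    duration_ratio: float   # actual / expected service duration
    rebooked: bool          # whether the customer rebooked
    revenue_delta: float    # same-day revenue impact, assumed normalized to [-1, 1]


def delayed_signal(o: DelayedOutcome) -> float:
    """Collapse the four daily metrics into one score in [0, 1]."""
    # 1.0 when on or under schedule, falling linearly to 0.0 at twice the expected time.
    on_time = max(0.0, min(1.0, 2.0 - o.duration_ratio))
    # Map revenue impact from [-1, 1] to [0, 1].
    revenue = (max(-1.0, min(1.0, o.revenue_delta)) + 1.0) / 2.0
    parts = [o.satisfaction, on_time, 1.0 if o.rebooked else 0.0, revenue]
    return sum(parts) / len(parts)


def reward(accepted: bool, outcome: DelayedOutcome, w1: float = 0.3, w2: float = 0.7) -> float:
    """reward = w1 * immediate_signal + w2 * delayed_signal."""
    immediate_signal = 1.0 if accepted else 0.0
    return w1 * immediate_signal + w2 * delayed_signal(outcome)
```

Since the delayed signal is batch-computed daily, a real pipeline would likely apply the immediate part first and add the delayed part once the daily job has run.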
| | Immediate = Yes | Immediate = No (Override) |
|---|---|---|
| Delayed = Good outcome | Strong positive: AI correct, human agrees | Weak negative: override worked, but AI may also have been fine |
| Delayed = Bad outcome | Most valuable case: AI wrong, human missed it too (blind spot) | Strong negative: AI wrong, override also didn't help |
The lower-left quadrant (accepted but bad outcome) is the most valuable training signal: it reveals systematic blind spots in both AI and human judgment.
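
The four quadrants can be expressed as a small classifier. This is a sketch only: the `Signal` names and the `good_threshold` cutoff of 0.5 on the delayed signal are hypothetical.

```python
from enum import Enum


class Signal(Enum):
    STRONG_POSITIVE = "strong_positive"  # accepted, good outcome
    WEAK_NEGATIVE = "weak_negative"      # overridden, good outcome
    BLIND_SPOT = "blind_spot"            # accepted, bad outcome
    STRONG_NEGATIVE = "strong_negative"  # overridden, bad outcome


def classify(accepted: bool, delayed_signal: float, good_threshold: float = 0.5) -> Signal:
    """Map one (immediate, delayed) pair onto the quadrant table."""
    good = delayed_signal >= good_threshold
    if accepted:
        return Signal.STRONG_POSITIVE if good else Signal.BLIND_SPOT
    return Signal.WEAK_NEGATIVE if good else Signal.STRONG_NEGATIVE
```

Records classified as `BLIND_SPOT` are the ones worth surfacing for review, since neither the AI nor the manager flagged them at decision time.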
Reference: DoorDash uses daily batch reward computation rather than instant feedback. Stitch Fix uses stylist curation (immediate) plus customer keep/return (delayed) as a two-layer signal.
All stores contribute to a single Bandit model with store_id as a context feature. This effectively multiplies training signal by the number of active stores.
Reference: ServiceTitan uses industry benchmarks as priors for first 3–4 months, then transitions to per-company models. Toast leverages 130K+ locations for cross-location insights.
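
One way to treat `store_id` as just another context feature is to one-hot encode it alongside the shared features. The sketch below assumes a linear-style model that consumes a flat feature vector; the `one_hot` and `encode_context` helpers and the vocabulary layout are hypothetical.

```python
def one_hot(value: str, vocabulary: list[str]) -> list[float]:
    # Unknown values (e.g., a brand-new store) encode as all zeros,
    # so the model falls back on the features shared across stores.
    return [1.0 if value == v else 0.0 for v in vocabulary]


def encode_context(ctx: dict[str, str], vocab: dict[str, list[str]]) -> list[float]:
    """Build one feature vector; store_id is one block among the others."""
    features: list[float] = [1.0]  # bias term
    for name, values in vocab.items():
        features.extend(one_hot(ctx[name], values))
    return features
```

With this layout, weights on shared features such as service type are trained by every store's data, while the `store_id` block lets the model learn per-store adjustments.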
(state, action, reward) tuple for the Contextual Bandit

Stage 1: Data Collection (continuous, zero AI cost)
Agent makes decision → Employee Yes/No → Store as raw structured record
→ Bandit distributions update immediately
Stage 2: Bandit Learning (automatic)
Distributions converge over many observations
→ Recommendations improve incrementally
→ No explicit analysis needed
Stage 3: Convergence Detection (automatic)
When a distribution becomes highly concentrated
(one action dominates with high confidence, across multiple stores)
→ System generates "suggested rule" event
Stage 4: Human-in-the-Loop Rule Graduation
Super Admin reviews suggested rule with supporting data:
- Frequency and consistency across stores
- Business outcome correlation
- Which stores contributed data
→ Admin adopts / modifies / dismisses
Stage 5: Cost Reduction
Adopted rules move from L2 (Bandit) to L1 (Rules Engine)
→ Zero marginal cost, deterministic execution
→ The system gets cheaper over time
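
Stage 3 (convergence detection) could look roughly like the sketch below. It assumes Beta posteriors per action, estimates by Monte Carlo the probability that each action is the best one, and emits a suggestion only when one action dominates and enough stores contributed. The thresholds (`min_confidence = 0.95`, `min_stores = 3`) and function names are hypothetical.

```python
import random


def prob_best(posteriors: dict[str, tuple[float, float]],
              draws: int = 2000, seed: int = 0) -> dict[str, float]:
    """Monte Carlo estimate of P(action has the highest success rate)."""
    rng = random.Random(seed)
    wins = {action: 0 for action in posteriors}
    for _ in range(draws):
        samples = {a: rng.betavariate(alpha, beta) for a, (alpha, beta) in posteriors.items()}
        wins[max(samples, key=samples.get)] += 1
    return {a: w / draws for a, w in wins.items()}


def suggest_rule(posteriors: dict[str, tuple[float, float]],
                 stores_by_action: dict[str, set[str]],
                 min_confidence: float = 0.95,
                 min_stores: int = 3):
    """Return a 'suggested rule' event, or None if the Bandit has not converged."""
    probs = prob_best(posteriors)
    action = max(probs, key=probs.get)
    n_stores = len(stores_by_action.get(action, set()))
    if probs[action] >= min_confidence and n_stores >= min_stores:
        return {"action": action, "confidence": probs[action], "stores": n_stores}
    return None
```

The returned event carries only the evidence summary; the adopt / modify / dismiss decision stays with the Super Admin in Stage 4.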
Override data is highly structured (service type, time, employee, action). Use SQL, not embeddings.
When the Bandit has insufficient data for a specific context (cold start for a new combination), the system retrieves historical overrides using progressively relaxed SQL queries:
Round 1: Exact match
service=gel_extension AND day=friday AND hour=19
→ 3 results → sufficient, use these
Round 1: Exact match (different case)
service=acrylic_fullset AND day=thursday AND hour=15
→ 0 results → relax
Round 2: Relax service dimension, match on duration
estimated_minutes > 60 AND day=thursday AND hour=15
→ 0 results → relax further
Round 3: Relax time dimension, match on traffic tier
estimated_minutes > 60 AND is_peak_hour=true
→ 8 results → sufficient
Each round is a SQL query (< 10ms). Results are aggregated into a compact summary (< 100 tokens) before being fed to the LLM as few-shot context.
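
The relaxation rounds above can be sketched with Python's built-in `sqlite3` module. The `overrides` table, its column names, and the `min_results = 3` sufficiency threshold are assumptions made for illustration; the real schema and threshold may differ.

```python
import sqlite3

# Ordered from most to least specific; each entry is (label, WHERE clause).
ROUNDS = [
    ("exact", "service = :service AND day = :day AND hour = :hour"),
    ("relax_service", "estimated_minutes > :min_minutes AND day = :day AND hour = :hour"),
    ("relax_time", "estimated_minutes > :min_minutes AND is_peak_hour = :is_peak_hour"),
]


def find_overrides(conn: sqlite3.Connection, ctx: dict, min_results: int = 3):
    """Run each round in order and stop at the first one with enough matches."""
    for label, where in ROUNDS:
        # WHERE clauses come from the fixed ROUNDS list; values are bound as parameters.
        rows = conn.execute(f"SELECT * FROM overrides WHERE {where}", ctx).fetchall()
        if len(rows) >= min_results:
            return label, rows
    return None, []  # nothing sufficient even at the loosest round
```

Returning the round label alongside the rows lets the caller tell the LLM how closely the retrieved history matches the current case.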
Wrong: Feed 20 raw override records (wastes tokens)
"3/7 Amy→Lisa, 3/14 Mike→Chen, 3/21 Amy→Lisa..."
Right: Pre-aggregate in SQL, feed summary
"Past 60 days, Fri 18-21 gel extension walk-in assignment
overridden 14 times, 86% junior→senior.
Across 3 stores, 4 managers. Last: 3/21 Queens #3."
Result: < 100 tokens, same information density
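
A sketch of the pre-aggregation step, again using `sqlite3`. The column names (`from_level`, `to_level`, `store_id`, `manager_id`, `override_date`) and the summary wording are hypothetical; dates are assumed to be stored as ISO-8601 text so that `MAX` and `>=` compare correctly.

```python
import sqlite3

SUMMARY_SQL = """
SELECT COUNT(*) AS n,
       ROUND(100.0 * SUM(from_level = 'junior' AND to_level = 'senior') / COUNT(*)) AS pct,
       COUNT(DISTINCT store_id) AS stores,
       COUNT(DISTINCT manager_id) AS managers,
       MAX(override_date) AS last_date
FROM overrides
WHERE service = :service AND day = :day
  AND hour BETWEEN :hour_from AND :hour_to
  AND override_date >= :since
"""


def summarize(conn: sqlite3.Connection, params: dict) -> str:
    """Aggregate matching overrides into one short line for the LLM prompt."""
    n, pct, stores, managers, last = conn.execute(SUMMARY_SQL, params).fetchone()
    if not n:
        return "No matching overrides."
    return (
        f"{params['day']} {params['hour_from']}-{params['hour_to']} {params['service']}: "
        f"overridden {n} times, {pct:.0f}% junior->senior. "
        f"Across {stores} stores, {managers} managers. Last: {last}."
    )
```

The database does the counting, so the prompt receives a single sentence regardless of how many raw override rows matched.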
With ~1M daily operations across 50 stores, only a tiny fraction contain valuable learning signals. Processing every record with AI is wasteful.
The Contextual Bandit naturally solves the sparse signal problem:
Researched: 2026-03-19
| Platform | AI Approach | Learns from Feedback? |
|---|---|---|
| Zenoti | Rule-based segmentation + NLP receptionist + Smart Marketing | No |
| Mindbody | Trigger-based automation + Attentive partnership | No |
| Phorest | Behavior-based triggers (Client Reconnect) | No |
| Boulevard | Manual tags, no AI decision layer | No |
| Vagaro | AI receptionist only | No |
| MaSe | None | No |
| Approach | Who Uses It | Celoria Relevance |
|---|---|---|
| Rule-based automation | All salon SaaS (Zenoti, Phorest, etc.) | Our L1: table stakes, not differentiator |
| Batch retraining | Healthcare (Viz.ai, Aidoc), Uber | Too heavy for our stage; relevant post-scale |
| Multi-armed bandits | DoorDash, Netflix, Stitch Fix | Our L2: proven at scale, lightweight |
| Contextual bandits | Spotify, Netflix | Our L2 target: context-aware decisions |
| Full RL (value iteration) | Uber (matching), DoorDash (dispatch) | Future consideration for multi-store orchestration |
| Self-improving agent loops | Forethought, Presto | Interesting for SOP auto-generation |
┌─────────────────────────────────────────────┐
│  Layer 1: UI / Interaction                  │ ← AI Agents replacing this layer
│  (Booking, scheduling, admin CRUD, POS)     │
├─────────────────────────────────────────────┤
│  Layer 2: Business Logic / Permissions      │ ← Agent tools can partially replace
│  (RBAC, multi-tenant, workflow automation)  │
├─────────────────────────────────────────────┤
│  Layer 3: Data / Intelligence / Compliance  │ ← Enduring value, new access layer
│  (Domain models, decision data, audit)      │
└─────────────────────────────────────────────┘