Shared Infrastructure Audit

RBAC | Messaging | Audit Log | Multi-Tenancy | Agent System

2026-03-22 | Spec 098-business-dag-audit

✓ Pass (fully covered)

⚠ Partial (gaps exist)

✗ Fail (missing / broken)

Overall Coverage

Total Checks: 40

Pass: 28

Partial: 9

Fail: 3

1. RBAC & Permission System

All protected API routes must use permission-middleware.js, not the deprecated auth-middleware.js. The middleware checks resource:action pairs via requirePermission, requireAnyPermission, or requireAllPermissions.

655 permission middleware calls

77 API files with RBAC guards

~1197 total route definitions (153 files)

Status	Check Item	Verification Result
✓	Core API RBAC coverage	77 of ~106 non-public API files have `requirePermission` guards. Major modules (appointments, employees, schedules, marketing, reports, admin, stores, payroll, loyalty, settings) all covered.
✓	Payment routes permission guard	`payment/index.js` applies `createExplicitRoutePermissionGuard` with 40+ explicit route-permission rules. All JWT-protected payment routes go through this guard before handler execution.
✓	POS device authentication	POS device routes (`pos-device-routes.js`) use `authenticateDevice()` from the `device-auth.js` middleware. These routes are intentionally mounted before the permission guard because they use device-token auth, not JWT.
✓	Public routes intentionally unprotected	`api/public/booking.js` (21 routes) and `api/public/ipad-showcase.js` (1 route) are intentionally public, protected by rate limiters (`publicGeneralLimiter`, `publicSlotsLimiter`, `publicConfirmLimiter`) and optional guest JWT where needed.
✓	Webhook routes intentionally unprotected	Webhook endpoints (`webhooks/stripe.js`, `webhooks/twilio-sms.js`, `webhooks/resend-email.js`) validate requests via provider-specific signature verification (e.g., Stripe signature, Twilio signature). No user JWT needed.
✓	Agent API RBAC coverage	`api/agent/content.js` (8), `api/agent/seo.js` (9), `api/agent/oauth.js` (5), `api/agent/review.js` (7) all have `requirePermission` guards.
⚠	Guest auth routes	`api/guest-auth.js` has 16 route definitions but only 4 use `tenantDb`/`platformDb`. Guest auth routes are intentionally public-facing (login, register, send-code) but some internal management endpoints may lack permission checks. Needs manual review.
⚠	Deprecated auth-middleware usage	`auth-middleware.js` still exports permission-related methods (with deprecation warnings). Some older files may still import from it. Codebase search shows most have migrated, but legacy references remain in a few modules.
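
The three guard helpers named in this section can be pictured with a minimal sketch. It assumes Express-style middleware and a `req.user.permissions` array of `resource:action` strings. Both are assumptions made for illustration, and the bodies below are not taken from `permission-middleware.js`.

```javascript
// Hypothetical sketch of resource:action guards; not the real permission-middleware.js.
// Assumes req.user.permissions is an array of "resource:action" strings.

function makeGuard(required, mode) {
  return function permissionGuard(req, res, next) {
    const granted = new Set((req.user && req.user.permissions) || []);
    const matched = required.filter((p) => granted.has(p));
    // 'all' needs every required permission; 'any' needs at least one.
    const allowed = mode === 'all'
      ? matched.length === required.length
      : matched.length > 0;
    if (!allowed) {
      return res.status(403).json({ error: 'forbidden', required });
    }
    return next();
  };
}

const requirePermission = (resource, action) => makeGuard([`${resource}:${action}`], 'all');
const requireAnyPermission = (...perms) => makeGuard(perms, 'any');
const requireAllPermissions = (...perms) => makeGuard(perms, 'all');
```

Under these assumptions a route would be guarded as `router.post('/refund', requirePermission('payment', 'refund'), handler)`, where the route path and handler are placeholders.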

2. Messaging & Circuit Breakers

All external service calls must be wrapped with CircuitBreaker. Breakers must not be bypassed. Multi-provider services must auto-failover when primary circuit opens.

16 files with CircuitBreaker

4 messaging clients with send-guard

13+ registered breakers

Status	Check Item	Verification Result
✓	SMS circuit breakers	`twilio-client.js` has `twilioBreaker` (CircuitBreaker 'twilio'). `telnyx-client.js` has `telnyxBreaker` (CircuitBreaker 'telnyx'). Both exported and checked in `sms-sender.js` for failover.
✓	SMS auto-failover	`sms-sender.js` `getActiveClient()` checks both readiness AND `breaker.isOpen()`. If preferred provider circuit is open, automatically falls back to alternate provider.
✓	Email circuit breakers	`ses-client.js` has `sesBreaker` (CircuitBreaker 'ses'). `resend-client.js` has `resendBreaker` (CircuitBreaker 'resend').
⚠	Email auto-failover	`email-sender.js` only initializes `sesClient`. Unlike SMS sender, there is no automatic failover from SES to Resend when SES circuit opens. Resend client exists but is not wired into the high-level sender as a fallback.
✓	Send-guard coverage (SMS)	Both `twilio-client.js` (line 203) and `telnyx-client.js` (line 215) call `checkPhone()` from `send-guard.js` before sending. All SMS sends go through the guard.
✓	Send-guard coverage (Email)	Both `ses-client.js` (line 290) and `resend-client.js` (line 169) call `checkEmail()` from `send-guard.js` before sending. All email sends go through the guard.
✓	AI service circuit breakers	All AI services wrapped: `ai-service.js` ('openai-email-ai'), `deepseek-template-service.js` ('deepseek-template-ai'), `insight-llm-service.js` ('insight-llm'), `llm-processor.js` ('openai-voice-processor'), `intent-handler.js` ('openai-voice-intent').
✓	Acquisition service circuit breakers	`gbp-sync-service.js` ('gbp_api'), `gbp-review-poller.js` ('gbp_api'), `review-reply-service.js` ('anthropic-sonnet'), `content-publisher.js` ('meta_api' + 'gbp_posts_api'). All external API calls wrapped.
✓	Voice/Voicebot circuit breakers	`retell-client.js` ('retell-ai'), `llm-processor.js` ('openai-voice-processor'). External voice AI calls wrapped.
✓	Holiday API circuit breaker	`holidayService.js` has `nager-holiday-api` breaker for external Nager.Date API calls.
✗	CodePay payment gateway circuit breaker	`pos-payment-service.js` and `codepay-query-service.js` make HTTP calls to CodePay API but have NO `CircuitBreaker` wrapper. This is a constitution violation for a critical payment path.
✓	Circuit breaker registry	`circuit-breaker-registry.js` provides global registry with `getAllStatuses()`, `hasOpenCircuit()`, `getOpenCircuits()`. Pre-registers `gbp_api`, `anthropic_sonnet`, `google_places`. Monitoring endpoint: `/api/monitoring/circuits`.
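
The failover behavior described for `getActiveClient()` (readiness plus `breaker.isOpen()`) can be sketched as follows. The function signature and the stub clients are assumptions for illustration; the real `sms-sender.js` may structure this differently.

```javascript
// Hypothetical sketch of breaker-aware provider selection. The names mirror the
// audit text, but the bodies are illustrative and not copied from sms-sender.js.

function getActiveClient(clients, preferredName) {
  // Preferred provider first, then the remaining providers in declaration order.
  const ordered = [
    ...clients.filter((c) => c.name === preferredName),
    ...clients.filter((c) => c.name !== preferredName),
  ];
  // A provider is usable only if it is configured AND its circuit is closed.
  return ordered.find((c) => c.isReady() && !c.breaker.isOpen()) || null;
}

// Stub clients: isReady() and breaker.isOpen() stand in for the real checks.
const twilio = { name: 'twilio', isReady: () => true, breaker: { isOpen: () => true } };
const telnyx = { name: 'telnyx', isReady: () => true, breaker: { isOpen: () => false } };

const active = getActiveClient([twilio, telnyx], 'twilio');
console.log(active.name); // telnyx, because the twilio circuit is open
```

Applying the same selection to an SES client and a Resend client is one possible way to close the email failover gap noted in the table.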

3. Audit Log System

Critical operations (payment, refund, permission changes, auth events) must be logged to the audit trail. Global audit middleware captures write operations automatically.

457 routes in audit-route-map

51 files reference auditLogger

Status	Check Item	Verification Result
✓	Global audit middleware	`server.js` (line 1159) loads `createAuditMiddleware` from `middleware/audit-middleware.js`. Applied globally to all routes. Automatically logs POST/PUT/PATCH/DELETE operations.
✓	Audit route map completeness	`config/audit-route-map.js` maps 457 write operations with explicit `{ resource, action, summary }` metadata. Covers auth, appointments, payments, permissions, employees, stores, marketing, and more.
✓	Permission change audit	`services/permission-audit-service.js` and `api/admin/permissions.js` both reference audit logging. Role/permission changes are tracked.
⚠	Payment/refund explicit audit	Payment routes in `api/payment/` have no explicit `auditLogger` calls. Relies entirely on global audit middleware auto-capture. While functional, explicit audit calls with enriched context (amount, transaction ID, refund reason) would provide higher-quality audit trails for financial operations.
✓	Audit alert system	`services/audit/audit-alert-service.js` + `jobs/auditAlertJob.js` provide automated alerting on suspicious audit patterns. `admin/audit-alerts.js` API for alert management.
✓	Tenant-level audit isolation	`services/audit/tenant-audit.js` ensures audit logs are stored within tenant schemas. Cross-tenant audit leakage prevented by schema isolation.
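
How a global audit middleware could combine a route map with optional enriched context is sketched below. The key format `"<METHOD> <path>"`, the `writeEntry` callback, the example route, and the `req.auditContext` field are all assumptions for illustration. Matching of parameterized routes is omitted.

```javascript
// Hypothetical sketch of a global audit middleware; the real
// middleware/audit-middleware.js and config/audit-route-map.js may differ.

const WRITE_METHODS = new Set(['POST', 'PUT', 'PATCH', 'DELETE']);

// Assumed key format "<METHOD> <path>"; the route and metadata are invented examples.
const auditRouteMap = {
  'POST /api/payment/refund': { resource: 'payment', action: 'refund', summary: 'Refund issued' },
};

function createAuditMiddleware({ routeMap, writeEntry }) {
  return function auditMiddleware(req, res, next) {
    // Read-only requests are not audited.
    if (!WRITE_METHODS.has(req.method)) return next();

    // Log once the response is finished so the final status code is known.
    res.on('finish', () => {
      const meta = routeMap[`${req.method} ${req.path}`] || {
        resource: 'unmapped',
        action: req.method.toLowerCase(),
        summary: '',
      };
      writeEntry({
        ...meta,
        statusCode: res.statusCode,
        tenantId: req.tenant ? req.tenant.id : null,
        // Optional enrichment set by a handler, e.g. { amount, transactionId, refundReason }.
        context: req.auditContext || {},
      });
    });
    return next();
  };
}
```

With a hook like the hypothetical `req.auditContext`, payment handlers could attach amount, transaction ID, and refund reason without each calling the logger directly.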

4. Multi-Tenancy & Data Access Layer

All database queries must use the unified data access layer (tenantDb.query / platformDb.query). Legacy pool.query calls violate tenant isolation guarantees.

1849 tenantDb/platformDb calls (213 files)

103 files with legacy pool.query

9 API files still using pool.query

Status	Check Item	Verification Result
✓	Unified data access layer exists	`database/data-access/` provides `tenantDb` and `platformDb` with schema-aware query routing, transaction support, and `queryWithTenant()` for background jobs. 1849 usages across 213 files.
✗	Legacy pool.query elimination	103 non-test files still use `pool.query` directly. Breakdown: 9 API files, ~40 service files, ~16 job files, plus scripts, middleware, and utilities. Major offenders include `sms-sender.js` (4 calls), `email-sender.js` (4 calls), `booking-service.js`, `sync-engine.js` (21 calls), multiple loyalty services.
⚠	API layer migration	9 API files still use `pool.query`: `public/booking.js` (64 calls, the largest offender), `dev.js` (9), `voice/outbound.js` (9), `voice/streaming-stt.js` (7), `logs.js` (6), `webhooks/resend-email.js` (5), `webhooks/twilio-sms.js` (2), `tenant-activate.js` (1), `tenants-validate.js` (1).
⚠	Service layer migration	40 service files still use `pool.query`. Includes critical paths: `booking-service.js`, `sms-sender.js`, `email-sender.js`, `sync-engine.js`, `checkinService.js`, multiple loyalty services, voice services. These risk cross-tenant data leakage.
⚠	Background job migration	16 job files use `pool.query`. Includes `appointmentReminderJob.js`, `benchmarkCalculationJob.js`, `subscriptionExpirationJob.js`, `trial-expiration-job.js` (3 calls), etc. Jobs should use `tenantDb.queryWithTenant()`.
✓	Tenant context middleware	`middleware/tenant-context.js` extracts tenant from JWT and injects into `req.tenant`. Applied globally before route handlers.
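
To illustrate why a direct `pool.query` call bypasses tenant isolation, here is a minimal sketch of what a schema-aware `queryWithTenant()` might do. It assumes PostgreSQL with one schema per tenant and a `pg`-style pool (`pool.connect()` returning a client with `query()` and `release()`). These are assumptions, and the real `database/data-access/` layer may work differently.

```javascript
// Hypothetical sketch of schema-aware query routing; not the real data access layer.
// Assumes PostgreSQL, one schema per tenant, and a pg-style connection pool.

function createTenantDb(pool) {
  async function queryWithTenant(tenantSchema, sql, params = []) {
    // Schema names cannot be bound as parameters, so validate before interpolating.
    if (!/^[a-z_][a-z0-9_]*$/.test(tenantSchema)) {
      throw new Error(`invalid tenant schema: ${tenantSchema}`);
    }
    const client = await pool.connect();
    try {
      await client.query('BEGIN');
      // SET LOCAL scopes the search_path to this transaction only, so the
      // setting cannot leak to the next user of the pooled connection.
      await client.query(`SET LOCAL search_path TO "${tenantSchema}"`);
      const result = await client.query(sql, params);
      await client.query('COMMIT');
      return result;
    } catch (err) {
      await client.query('ROLLBACK');
      throw err;
    } finally {
      client.release();
    }
  }
  return { queryWithTenant };
}
```

A bare `pool.query(sql)` skips the schema step entirely and runs against whatever `search_path` the connection happens to have, which is the isolation risk flagged in the table.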

5. Agent System

Agent routing is handled by store-network.js and orchestrator-network.js. Routing has three levels: a deterministic ROUTE_MAP, SOP escalation rules, and LLM classification (currently a stub). Escalation goes through the escalation.js utility.

5 agent definitions

12 event types in ROUTE_MAP

3 acquisition agents not in network

Status	Check Item	Verification Result
✓	Store agent coverage	Store network `agents` array includes 4 agents: `appointment-agent`, `day-sop-agent`, `supervisor-agent`, `finance-agent`. All have corresponding agent files in `agents/agents/`.
✗	Acquisition agents in store network	ROUTE_MAP references `review_agent`, `content_agent`, `local_seo_agent` for 5 event types, but these 3 agents are NOT in the `agents: [...]` array of `storeNetwork`. Route resolution will fail: `network.agents.find(a => a.name === 'review_agent')` returns `undefined` because the acquisition agents are not registered.
✓	Orchestrator network	`orchestrator-network.js` routes all multi-store commands through `orchestratorAgent`. Single-agent routing via `network.agents[0]`. Simple and correct.
✓	SOP definitions loaded	5 SOP definitions loaded at module level: `appointment-checkout`, `day-start`, `day-end`, `walkin-to-appointment`, `daily-reconciliation`. Used for Level 2 escalation rule matching.
⚠	Escalation rules coverage	`escalation.js` provides generic `createEscalationContext()` and `getEscalationPrompt()`. It is a utility library, not a rule engine. Escalation scenarios are embedded in agent system prompts via `getEscalationPrompt()`. No structured rule definitions exist for specific failure scenarios (e.g., payment fails, SMS fails, appointment conflict).
✓	Agent permission checker	`agents/lib/permission-checker.js` exists for agent-level permission validation. Uses `tenantDb` for queries.
✓	Agent event logging	`agents/lib/log-decision.js` and `agents/lib/event-emitter.js` provide structured logging and event emission for agent decisions. Uses `tenantDb`.
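⚠	ROUTE_MAP multi-target support	TODO comment in `store-network.js`: `checkout.completed` needs multi-target routing (both review_agent and appointment-agent). The current architecture supports only a single agent per event type. Workaround: the event routes to review_agent only. Because review_agent is itself not registered in the store network (see the acquisition agents check above), this event is also affected by that resolution failure.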
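
The unregistered-agent failure described in this table can be reproduced with a reduced model of the store network, together with a check that would flag it. The `agents` array matches the four store agents listed above. The `appointment.created` event name and the helper function names are invented for illustration.

```javascript
// Hypothetical reduction of store-network.js. Only checkout.completed and the agent
// names come from the audit; appointment.created is an invented event name.

const storeNetwork = {
  agents: [
    { name: 'appointment-agent' },
    { name: 'day-sop-agent' },
    { name: 'supervisor-agent' },
    { name: 'finance-agent' },
  ],
};

const ROUTE_MAP = {
  'checkout.completed': 'review_agent',
  'appointment.created': 'appointment-agent',
};

// Level 1 routing: deterministic lookup, then resolution against registered agents.
function resolveAgent(network, eventType) {
  const target = ROUTE_MAP[eventType];
  return network.agents.find((a) => a.name === target);
}

// Consistency check: every ROUTE_MAP target must be a registered agent.
function findUnregisteredTargets(network, routeMap) {
  const registered = new Set(network.agents.map((a) => a.name));
  return Object.entries(routeMap)
    .filter(([, target]) => !registered.has(target))
    .map(([eventType, target]) => ({ eventType, target }));
}

console.log(resolveAgent(storeNetwork, 'checkout.completed')); // undefined
console.log(findUnregisteredTargets(storeNetwork, ROUTE_MAP));
```

Running a check like `findUnregisteredTargets` at module load or in a unit test would surface the mismatch before an event reaches the router.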

Loose Points Summary

Critical (must fix)

CodePay circuit breaker missing: pos-payment-service.js and codepay-query-service.js call CodePay API without CircuitBreaker. Constitution mandates all external service calls be wrapped. This is the payment critical path.
Acquisition agents not registered in store network: store-network.js ROUTE_MAP maps 5 event types to acquisition agents (review_agent, content_agent, local_seo_agent) but these agents are not in the agents array. Agent resolution returns undefined.
Legacy pool.query in 103 files: Direct pool.query calls bypass tenant schema isolation. Highest risk in public/booking.js (64 calls), sync-engine.js (21 calls), and multiple loyalty/payment services.
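
For the missing CodePay breaker listed above, the general shape of the fix is to route every gateway call through a breaker that fails fast once the gateway is unhealthy. The sketch below uses a small stand-in breaker so that it is self-contained. The project's own `CircuitBreaker` class should be used in practice, and its API is not shown in this audit. `createBreaker`, `fire`, and `callCodePay` are hypothetical names.

```javascript
// Hypothetical stand-in breaker; not the project's CircuitBreaker implementation.

function createBreaker(name, { failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null;

  function isOpen() {
    if (openedAt === null) return false;
    // After the cool-down, let one trial call through (half-open behavior).
    return now() - openedAt < resetTimeoutMs;
  }

  async function fire(fn) {
    if (isOpen()) throw new Error(`circuit "${name}" is open`);
    try {
      const result = await fn();
      failures = 0;
      openedAt = null;
      return result;
    } catch (err) {
      failures += 1;
      if (failures >= failureThreshold) openedAt = now();
      throw err;
    }
  }

  return { name, isOpen, fire };
}

// Usage sketch: callCodePay is a hypothetical stand-in for the real HTTP call.
const codepayBreaker = createBreaker('codepay');

function chargeWithBreaker(callCodePay, payload) {
  return codepayBreaker.fire(() => callCodePay(payload));
}
```

Failing fast matters on a payment path because a hung gateway otherwise ties up request handlers while callers keep retrying.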

Warnings (should fix)

Email failover not wired: email-sender.js only uses SES. Unlike SMS sender, no automatic failover to Resend when SES circuit opens. Resend client exists but is standalone.
Payment audit enrichment: Payment/refund routes rely on global audit middleware. No explicit audit calls with financial context (amount, transaction ID, refund reason). Audit trail for financial operations could be richer.
ROUTE_MAP single-target limitation: checkout.completed event can only route to one agent. Multi-target support needed for events that trigger multiple agent workflows.
Escalation rules are implicit: Escalation scenarios are embedded in LLM system prompts, not structured as inspectable/testable rules. Hard to audit coverage.
Guest auth route review needed: Some guest auth management endpoints may lack permission checks. Manual review required.
Background jobs using pool.query: 16 job files bypass tenant data access layer. Should migrate to tenantDb.queryWithTenant() for proper schema routing.
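
For the ROUTE_MAP single-target limitation listed above, one possible shape for multi-target routing is to let a map entry hold either one agent name or a list of names. The sketch below is an assumption about how that could look, not the design planned in `store-network.js`. The `resolveAgents` helper and the two-agent network are illustrative.

```javascript
// Hypothetical sketch of multi-target routing. The checkout.completed targets follow
// the TODO described in the audit; the helper and network are illustrative.

const MULTI_ROUTE_MAP = {
  'checkout.completed': ['review_agent', 'appointment-agent'],
};

function resolveAgents(network, routeMap, eventType) {
  // Accept a single name or a list so existing single-target entries keep working.
  const targets = [].concat(routeMap[eventType] || []);
  const resolved = [];
  const missing = [];
  for (const name of targets) {
    const agent = network.agents.find((a) => a.name === name);
    if (agent) {
      resolved.push(agent);
    } else {
      missing.push(name);
    }
  }
  // Report unregistered targets instead of silently dropping them.
  return { resolved, missing };
}

const exampleNetwork = { agents: [{ name: 'appointment-agent' }] };
const routing = resolveAgents(exampleNetwork, MULTI_ROUTE_MAP, 'checkout.completed');
console.log(routing.resolved.map((a) => a.name), routing.missing);
```

Returning the missing names alongside the resolved agents keeps the registration problem visible even after multi-target support is added.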

Verification Commands

Run these commands to reproduce the audit findings:

What to check	Command
RBAC permission middleware usage count	`grep -r "requirePermission\|requireAnyPermission\|requireAllPermissions" backend/api/ | wc -l`
Files with CircuitBreaker	`grep -rl "CircuitBreaker" backend/services/`
Legacy pool.query in non-test files	`grep -rl "pool\.query" backend/ --include="*.js" | grep -v test | grep -v node_modules | wc -l`
Send-guard integration	`grep -rn "checkPhone\|checkEmail" backend/services/`
Audit route map entry count	`grep -E "^ '(POST|PUT|PATCH|DELETE)" backend/config/audit-route-map.js | wc -l`
Agent ROUTE_MAP entries	`grep -A1 "ROUTE_MAP" backend/agents/networks/store-network.js`
RBAC permission-map test	`cd backend && npm run test:permission-map`
Explicit permission guard test	`cd backend && npm run test:explicit-permission`