Platform Design

Architecture Decisions

Living ADRs that capture the platform design decisions — content & versioning, the real-time sim engine, and infrastructure — with their context, the decision, and the trade-offs.

How to read these

Each record is a lightweight ADR: the context (the forces at play), the decision, and the consequences. Status is Accepted (we're building toward it) or Proposed (agreed in principle, details still open). These evolve as the design matures.

Decision records

ADR-0001 Accepted

Internal content model is structured and semantic

Context: Lessons need granular, meaningful version history and must serve authoring, runtime rendering, and AI grounding from one source.
Decision: Model a lesson as structured, typed fields — not opaque blobs or HTML strings — stored with full revision history.
Consequences: Diffs become semantic ('Vitals: HR 92 → 110'), enabling field-level merges and friendly history. One store serves authors, clients, and AI agents. Requires a well-designed content schema up front.
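
A minimal sketch of how a field-level diff over two revisions might produce the semantic history described above. The lesson shape (`title`, `vitals.hr`) and the helper names `diffFields` and `formatChange` are illustrative assumptions, not the actual content schema.

```typescript
// Hypothetical typed lesson fields; the real content schema is not specified here.
type FieldValue = string | number | boolean | FieldMap;
interface FieldMap {
  [key: string]: FieldValue;
}

interface FieldChange {
  path: string;
  before: FieldValue | undefined;
  after: FieldValue | undefined;
}

function isMap(v: FieldValue | undefined): v is FieldMap {
  return typeof v === "object" && v !== null;
}

// Compare two revisions and report changes per leaf field, not per blob.
function diffFields(before: FieldMap, after: FieldMap, prefix = ""): FieldChange[] {
  const changes: FieldChange[] = [];
  const keys = Array.from(new Set([...Object.keys(before), ...Object.keys(after)])).sort();
  for (const key of keys) {
    const path = prefix ? `${prefix}.${key}` : key;
    const a: FieldValue | undefined = before[key];
    const b: FieldValue | undefined = after[key];
    if (isMap(a) && isMap(b)) {
      changes.push(...diffFields(a, b, path)); // recurse into nested field groups
    } else if (a !== b) {
      changes.push({ path, before: a, after: b });
    }
  }
  return changes;
}

// Render one change as a human-readable history line.
function formatChange(c: FieldChange): string {
  const show = (v: FieldValue | undefined): string =>
    v === undefined ? "(none)" : isMap(v) ? JSON.stringify(v) : String(v);
  return `${c.path}: ${show(c.before)} → ${show(c.after)}`;
}
```

Because changes are keyed by field path, the same output could feed both a history view and a field-level merge.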

ADR-0002 Accepted

Independently versioned planes (content vs capability)

Context: Shipping everything as one blob forces every customer to upgrade at once, breaking forked lessons and frustrating buyers.
Decision: Split the product into independently versioned axes: platform/runtime, content schema, content package, and tenant pin/channel.
Consequences: Upgrades become explicit per-axis bumps, never forced. Adds version-resolution complexity that must be hidden behind clear UX and tooling.

ADR-0003 Proposed

Package-manager distribution model

Context: Tenants copy our default-library scenarios and customize them, then can't benefit from our later updates without losing changes.
Decision: Treat the library like upstream packages and tenant customizations like forks with lockfiles: publish, fork, sync-from-upstream (merge), pin, and subscribe to release channels.
Consequences: Decouples our release cadence from tenant adoption and preserves customizations across updates. Requires a registry, semantic versioning discipline, and merge tooling.

ADR-0004 Accepted

Capability contracts govern compatibility

Context: New runtime/service features can break older content, and heterogeneous clients (web, VR, game) update on cadences we don't control.
Decision: Content declares required capabilities/schema; runtimes advertise provided capabilities. A lesson launches only if satisfiable. Breaking changes require a major version bump and an explicit, migration-gated opt-in.
Consequences: Old content keeps working; failures surface as clear messages instead of silent breaks. Requires a capability registry and negotiation at launch.
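
The launch-time negotiation could be sketched as below. The "major.minor" version convention, the compatibility rule (same major, runtime minor at least the required minor), and the names `CapabilitySet` and `checkLaunch` are assumptions for illustration; the ADR does not fix them.

```typescript
// Assumed convention: capability name -> "major.minor" version string.
type CapabilitySet = Record<string, string>;

interface LaunchCheck {
  ok: boolean;
  problems: string[]; // clear messages instead of silent breaks
}

function parseVersion(v: string): { major: number; minor: number } {
  const parts = v.split(".");
  return { major: Number(parts[0]), minor: parts.length > 1 ? Number(parts[1]) : 0 };
}

// A lesson launches only if every required capability is satisfiable.
function checkLaunch(required: CapabilitySet, provided: CapabilitySet): LaunchCheck {
  const problems: string[] = [];
  for (const name of Object.keys(required).sort()) {
    const needStr = required[name];
    const haveStr = provided[name];
    if (needStr === undefined) continue;
    if (haveStr === undefined) {
      problems.push(`${name}: not provided by this runtime`);
      continue;
    }
    const need = parseVersion(needStr);
    const have = parseVersion(haveStr);
    if (have.major !== need.major) {
      // A different major version is treated as a breaking change.
      problems.push(`${name}: content needs major ${need.major}, runtime provides ${have.major}`);
    } else if (have.minor < need.minor) {
      problems.push(`${name}: content needs >= ${needStr}, runtime provides ${haveStr}`);
    }
  }
  return { ok: problems.length === 0, problems };
}
```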

ADR-0005 Proposed

Wikipedia-style authoring UX over git-like semantics

Context: Authors are clinicians, not developers, but we want full traceability, transparency, and revert.
Decision: Use git-like semantics underneath (immutable revisions, branch/merge, lineage) but present a Wikipedia-style surface: page history, who-changed-what, diffs, one-click revert, update prompts.
Consequences: Authors get powerful versioning without VCS literacy. Upstream sync becomes a friendly field-level three-way merge. Requires custom diff/merge UI rather than exposing raw VCS.
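
A field-level three-way merge can be illustrated on flat fields. This is a sketch under assumptions: fields are flat key/value pairs, and a conflicting field keeps the fork's value while being flagged for the author. The real merge policy and the names used here are not specified by the ADR.

```typescript
type Fields = Record<string, string | number>;

interface MergeResult {
  merged: Fields;
  conflicts: string[]; // field names the author must resolve in the UI
}

// base: the common ancestor revision; upstream: our new release; fork: the tenant's edits.
function mergeFields(base: Fields, upstream: Fields, fork: Fields): MergeResult {
  const merged: Fields = {};
  const conflicts: string[] = [];
  const keys = Array.from(
    new Set([...Object.keys(base), ...Object.keys(upstream), ...Object.keys(fork)])
  );
  for (const key of keys) {
    const b: string | number | undefined = base[key];
    const u: string | number | undefined = upstream[key];
    const f: string | number | undefined = fork[key];
    let value: string | number | undefined;
    if (u === f) {
      value = u; // both sides agree (or both deleted the field)
    } else if (f === b) {
      value = u; // only upstream changed: take the update
    } else if (u === b) {
      value = f; // only the fork changed: keep the customization
    } else {
      conflicts.push(key); // both changed differently
      value = f; // assumed policy: keep the fork's value until resolved
    }
    if (value !== undefined) merged[key] = value;
  }
  return { merged, conflicts };
}
```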

ADR-0006 Accepted

Standards posture: clean core, standards at the edges

Context: Interop with customer LMSs matters, but legacy standards (SCORM) would contaminate a multi-client core.
Decision: Keep a clean internal model. Use xAPI as the universal tracking spine and cmi5 as the launch/packaging contract; provide SCORM export/import and LTI 1.3 connect as edge adapters only.
Consequences: Core fits web, VR, and game clients uniformly. Interop is satisfied at boundaries. Adapters add surface area but isolate legacy concerns.

ADR-0007 Accepted

Provider-neutral AI abstraction with capability profiles

Context: AI models change constantly (e.g. ElevenLabs v2→v3 adds emotion tags but breaks escaping), and language support varies across LLM/TTS/STT providers.
Decision: Lessons depend on capabilities, not model strings. Configure LLM/TTS/STT per lesson and per agent via constraint-driven selection (language + features filter to compatible providers). Provider adapters are versioned.
Consequences: Pinned lessons keep working; capability opt-in is explicit. Avoids runtime language/feature footguns. Requires an abstraction layer (e.g. Vercel AI SDK for LLM) and maintained capability profiles.
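
Constraint-driven selection might look like the filter below. The profile fields, the provider ids, and the feature names are hypothetical; the point is only that a lesson states a language and features, and compatible providers fall out of the capability profiles.

```typescript
type Modality = "llm" | "tts" | "stt";

interface ProviderProfile {
  id: string; // hypothetical adapter id, not a vendor model string
  adapterVersion: string; // provider adapters are versioned
  modality: Modality;
  languages: string[]; // language tags such as "en-US"
  features: string[];
}

interface Requirement {
  modality: Modality;
  language: string;
  features: string[];
}

// Keep only providers that support the language and every requested feature.
function selectProviders(profiles: ProviderProfile[], req: Requirement): ProviderProfile[] {
  return profiles.filter(
    (p) =>
      p.modality === req.modality &&
      p.languages.includes(req.language) &&
      req.features.every((f) => p.features.includes(f))
  );
}
```

An empty result would surface at authoring time as "no compatible provider", rather than as a runtime failure.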

ADR-0008 Accepted

BYOK — lessons declare, tenants supply, runtime resolves

Context: Customers may bring their own provider keys/endpoints, but lessons must stay portable, shareable, and exportable.
Decision: Lessons reference providers abstractly with no secrets. Tenants store BYOK credentials in an encrypted, tenant-scoped secret store. The runtime binds abstract refs to concrete credentials at launch; vendor keys are the default with BYOK override.
Consequences: Content stays portable and groundable; credentials stay isolated in the PHI/runtime plane. Adds a secret-resolution step and per-subprocessor BAA considerations.
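
The declare/supply/resolve split could be sketched as follows. The in-memory maps stand in for an encrypted, tenant-scoped secret store, and all names (`ProviderRef`, `resolveCredential`, the `secret://` handles) are hypothetical.

```typescript
// What a lesson carries: an abstract reference, never a secret.
interface ProviderRef {
  capability: string; // e.g. "tts.default" (hypothetical name)
}

interface ResolvedCredential {
  source: "byok" | "vendor";
  secretId: string; // a handle into the secret store, not the key itself
}

// tenantSecrets: tenant id -> (capability -> secret handle), supplied by tenants.
// vendorDefaults: capability -> secret handle for our own vendor keys.
function resolveCredential(
  ref: ProviderRef,
  tenantId: string,
  tenantSecrets: Map<string, Map<string, string>>,
  vendorDefaults: Map<string, string>
): ResolvedCredential {
  const byok = tenantSecrets.get(tenantId)?.get(ref.capability);
  if (byok !== undefined) return { source: "byok", secretId: byok }; // BYOK overrides
  const vendor = vendorDefaults.get(ref.capability);
  if (vendor === undefined) throw new Error(`no credential available for ${ref.capability}`);
  return { source: "vendor", secretId: vendor }; // vendor key is the default
}
```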

ADR-0009 Accepted

Localization is a first-class axis

Context: Localization touches UI labels, lesson content, and AI capability/prompts — and retrofitting it later is very costly.
Decision: Treat language as a first-class axis across three layers: UI chrome (ICU i18n), content (per-locale variants with per-locale version lineage), and AI (language-constrained provider selection + per-language prompt tuning).
Consequences: Stale-translation detection and language-aware provider selection work natively. Increases content-model and authoring complexity from the start.

ADR-0010 Proposed

Isolation strategy differs per plane

Context: Content and learner data have very different sensitivity and distribution needs in a healthcare (HIPAA) context.
Decision: Use a different isolation model per plane: pooled/CDN-distributable for the content/catalog plane; strong (silo/bridge) isolation for the runtime/results plane where PHI and BYOK secrets live.
Consequences: Optimizes cost and reach for content while concentrating compliance effort where PHI lives. Requires a clear data-classification boundary between planes.

ADR-0011 Accepted

Sim engine is a separate authoritative real-time service

Context: The simulation state (vitals, 3D positions, events) has a different scaling profile and failure domain than the LiveKit media SFU, and we want to extract this logic out of the Unreal game engine.
Decision: Build the sim engine as a standalone, server-authoritative, ticking service — not inside the LiveKit container. Clients are thin renderers that send inputs and interpolate; the server owns world state.
Consequences: Media and simulation scale and fail independently. Logic leaves the game engine, enabling AI-built iteration and freeing clients (Unreal, Unity, R3F, Expo, chat/voice-only) from app-store distribution friction. Adds a service to operate.

ADR-0012 Accepted

Sim state syncs over LiveKit data channels

Context: Clients already hold one authenticated WebRTC connection that traverses NAT/TURN; a parallel transport would duplicate auth and connectivity concerns.
Decision: The sim engine joins each room as a server-side participant and publishes state over LiveKit data channels: lossy/unordered packets for high-frequency 3D transforms, reliable packets for discrete events.
Consequences: One transport, one auth model, one connectivity story for media and sim state. Couples the sim transport to LiveKit; mitigated by keeping the engine core transport-agnostic.
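
One way to keep the engine core transport-agnostic is a small publishing seam that only knows about delivery modes. The `DataPublisher` interface and message shapes below are hypothetical; a LiveKit-backed implementation would map "lossy" and "reliable" onto LiveKit's lossy and reliable data packets.

```typescript
type SimMessage =
  | { kind: "transform"; entityId: string; position: [number, number, number] }
  | { kind: "event"; name: string; simTimeMs: number };

type Delivery = "lossy" | "reliable";

// High-frequency transforms tolerate loss; discrete events must arrive.
function deliveryFor(msg: SimMessage): Delivery {
  return msg.kind === "transform" ? "lossy" : "reliable";
}

// Hypothetical seam: the engine core depends on this, not on a concrete transport.
interface DataPublisher {
  publish(payload: string, delivery: Delivery): void;
}

function send(publisher: DataPublisher, msg: SimMessage): void {
  publisher.publish(JSON.stringify(msg), deliveryFor(msg));
}
```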

ADR-0013 Accepted

ECS + fixed-timestep authoritative tick with an authoritative clock

Context: The engine must track entities (patient, avatars, equipment), fire time-based events ('code at t=95s'), and support debrief replay.
Decision: Model world state as an Entity-Component-System; run a fixed-timestep, drift-corrected tick loop; derive an authoritative sim clock from a monotonic source, decoupled from wall-clock to allow pause/resume and time-scaling.
Consequences: Deterministic-ish, AI-legible state; accurate event timing well within medical-sim tolerances. Requires disciplined separation of sim logic from transport and rendering.
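
A fixed-timestep loop with a decoupled sim clock might be sketched like this. The class name `SimClock` and its API are assumptions; the caller is expected to pass timestamps from a monotonic source. Carrying the leftover time in an accumulator is what keeps timer jitter from accumulating as drift.

```typescript
class SimClock {
  paused = false;
  timeScale = 1; // 2 means sim time runs twice as fast as real time
  private readonly stepMs: number;
  private simTimeMs = 0;
  private accumulatorMs = 0;
  private lastMonoMs: number | null = null;

  constructor(stepMs: number) {
    this.stepMs = stepMs;
  }

  // Authoritative sim time; it advances only in whole fixed steps.
  now(): number {
    return this.simTimeMs;
  }

  // Feed a monotonic timestamp; runs zero or more fixed ticks and returns the count.
  advance(monoNowMs: number, tick: (simTimeMs: number) => void): number {
    const elapsed = this.lastMonoMs === null ? 0 : monoNowMs - this.lastMonoMs;
    this.lastMonoMs = monoNowMs;
    if (this.paused) return 0; // real time passes, sim time does not
    this.accumulatorMs += elapsed * this.timeScale;
    let ticks = 0;
    while (this.accumulatorMs >= this.stepMs) {
      this.accumulatorMs -= this.stepMs;
      this.simTimeMs += this.stepMs;
      tick(this.simTimeMs);
      ticks++;
    }
    return ticks;
  }
}
```

A time-based event such as 'code at t=95s' would be checked inside `tick` against the sim time it receives, so it fires correctly across pauses and time-scaling. A production loop would likely also cap the number of catch-up ticks per call; that is omitted here.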

ADR-0014 Proposed

Event-sourced sim state for replay and xAPI

Context: Debrief is a core modality, and the standards posture (ADR-0006) already defines xAPI as the universal tracking spine.
Decision: Log every state-changing event against sim time as an append-only stream. Replay reconstructs sessions for debrief; the same stream projects into xAPI statements.
Consequences: Debrief replay and analytics share one substrate. Adds storage and event-schema versioning concerns tied to the capability contract.
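
The "one stream, two projections" idea can be sketched as below. The event shapes are invented for illustration, the statement type is a simplified xAPI-like shape rather than a complete xAPI statement, and the `https://example.com/xapi` IRIs are placeholders. The sketch also assumes the log is ordered by sim time because it is append-only.

```typescript
type SimEvent =
  | { simTimeMs: number; type: "vitals.changed"; field: string; value: number }
  | { simTimeMs: number; type: "intervention"; actorId: string; action: string; targetId: string };

interface SessionState {
  vitals: Record<string, number>;
  interventions: string[];
}

// Projection 1: rebuild session state as of a given sim time (debrief scrubbing).
function replay(log: readonly SimEvent[], untilMs: number): SessionState {
  const state: SessionState = { vitals: {}, interventions: [] };
  for (const e of log) {
    if (e.simTimeMs > untilMs) break; // assumed: log is ordered by sim time
    if (e.type === "vitals.changed") state.vitals[e.field] = e.value;
    else state.interventions.push(`${e.actorId}:${e.action}`);
  }
  return state;
}

// Simplified, xAPI-like statement shape (not a full xAPI statement).
interface StatementLike {
  actor: { account: { homePage: string; name: string } };
  verb: { id: string };
  object: { id: string };
  context: { extensions: Record<string, number> };
}

const IRI_BASE = "https://example.com/xapi"; // placeholder IRI base

// Projection 2: turn learner actions from the same log into tracking statements.
function toStatements(log: readonly SimEvent[]): StatementLike[] {
  const out: StatementLike[] = [];
  for (const e of log) {
    if (e.type !== "intervention") continue;
    out.push({
      actor: { account: { homePage: IRI_BASE, name: e.actorId } },
      verb: { id: `${IRI_BASE}/verbs/${e.action}` },
      object: { id: `${IRI_BASE}/entities/${e.targetId}` },
      context: { extensions: { [`${IRI_BASE}/ext/sim-time-ms`]: e.simTimeMs } },
    });
  }
  return out;
}
```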

ADR-0015 Accepted

Sim engine language: TypeScript with a native escape hatch

Context: The engine must be built and run by AI agents, integrate with LiveKit Node Agents, and serve thin clients — while real-time demands are modest (low-frequency physiology, client-originated transforms, few server-driven entities).
Decision: Write the sim engine in TypeScript/Node to unify the stack (web, R3F, Expo, agents, infra), share types end-to-end, and maximize AI-codegen velocity. Keep the core transport-agnostic and ECS-structured so any CPU hot-path can later be extracted to a Go sidecar or Rust/WASM module.
Consequences: Fast iteration and one language company-wide; shared client/server schemas. Accepts a tick-rate/entity-count ceiling from Node's GC and single thread, with a defined path to promote hot paths to native code if profiling demands it.

ADR-0016 Proposed

Sim engine runs as an autoscaled worker pool

Context: Many sessions run concurrently and must scale with demand; a container per session is slow to start and resource-heavy.
Decision: Run the sim engine as a pool of worker processes, each hosting several sessions, with sessions sharded across workers and the pool autoscaled — mirroring LiveKit's own node and agent-worker model. A control plane assigns sessions to workers at start.
Consequences: Cheaper and faster than per-session containers. Requires session scheduling and a clear in-process isolation story; per-session pods remain an option if hard isolation is later needed.
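
The control plane's assignment step could be as simple as least-loaded placement with a capacity limit. The `SimWorker` shape and the policy itself are assumptions; the ADR only states that sessions are sharded across workers and assigned at start.

```typescript
interface SimWorker {
  id: string;
  capacity: number; // maximum sessions this worker process may host
  sessions: Set<string>;
}

// Place the session on the least-loaded worker that still has room.
// Returns null when every worker is full, which is the signal to scale the pool up.
function assignSession(workers: SimWorker[], sessionId: string): SimWorker | null {
  let best: SimWorker | null = null;
  for (const w of workers) {
    if (w.sessions.size >= w.capacity) continue;
    if (best === null || w.sessions.size < best.sessions.size) best = w;
  }
  if (best !== null) best.sessions.add(sessionId);
  return best;
}
```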

ADR-0017 Accepted

Scenarios are data-driven, runtime-composable ECS worlds

Context: Customers (with our AI agents) must author and modify any scenario — define the environment (e.g. an ICU with 2 patients, 3 nurses, a doctor, a family member, and equipment) and spawn, place, or remove entities both at design time and live at runtime (e.g. spawn a defib in VR and set it on the med cart).
Decision: Represent a scenario as declarative data describing an initial ECS world (entities + components) plus scheduled events and rules. Authoring is the same operation as runtime mutation: adding, removing, or repositioning entities edits the entity/component graph. The engine that runs the sim is the engine that authors it.
Consequences: One model serves both authoring and runtime; live in-VR authoring is natural; AI agents author by generating and editing the graph. Requires stable entity/component schemas and a clear prefab/instantiation model.
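
A minimal sketch of "authoring is the same operation as runtime mutation": loading a scenario and spawning at runtime both go through the same `spawn` call, and the live world can be serialized back to scenario data. The component names (`placement`, `vitals`) and the `World` API are hypothetical, and scheduled events and rules are left out.

```typescript
type ComponentData = Record<string, number | string>;
type Components = Record<string, ComponentData>;

interface EntitySpec {
  id: string;
  components: Components;
}

interface Scenario {
  name: string;
  entities: EntitySpec[];
}

class World {
  private entities = new Map<string, Components>();

  spawn(spec: EntitySpec): void {
    if (this.entities.has(spec.id)) throw new Error(`duplicate entity ${spec.id}`);
    // Copy so later edits to the world never mutate the source scenario data.
    this.entities.set(spec.id, JSON.parse(JSON.stringify(spec.components)) as Components);
  }

  remove(id: string): boolean {
    return this.entities.delete(id);
  }

  setComponent(id: string, name: string, data: ComponentData): void {
    const components = this.entities.get(id);
    if (components === undefined) throw new Error(`unknown entity ${id}`);
    components[name] = { ...data };
  }

  // Serialize the live world back into declarative scenario data.
  toScenario(name: string): Scenario {
    const entities: EntitySpec[] = [];
    this.entities.forEach((components, id) => {
      entities.push({ id, components: JSON.parse(JSON.stringify(components)) as Components });
    });
    return { name, entities };
  }
}

// Loading a scenario is just a sequence of spawns.
function loadScenario(scenario: Scenario): World {
  const world = new World();
  for (const e of scenario.entities) world.spawn(e);
  return world;
}
```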

ADR-0018 Accepted

Entity prefabs form a versioned, capability-gated catalog

Context: Authors spawn equipment and actors (defib, BVM, nurse, patient) from a menu; each needs a consistent component set, client-side render/animation assets, and may require specific client capabilities.
Decision: Define entity types as prefabs (component templates); spawning instantiates a prefab. Prefabs are versioned content packages with declared capability requirements (per the content model and capability contract), so a scenario only loads prefabs its runtime and clients can support.
Consequences: Reuses the content versioning + capability machinery; thin clients map prefab ids to local render assets. Requires keeping prefab schemas and client asset bundles in sync via the contract.

ADR-0019 Proposed

Customer/AI-authored behavior via declarative rules + sandboxed scripts

Context: Customers and their AI agents will define scenario logic (event triggers, vitals transitions, branching). Running arbitrary author-supplied code on our servers is a security and stability risk in a multi-tenant, healthcare context.
Decision: Prefer declarative rules / event graphs for common logic; where scripting is needed, execute it in a sandboxed, resource-limited environment exposing a capability-scoped API to the ECS. AI agents generate rules and scripts against this safe, versioned surface.
Consequences: Safe multi-tenant authoring with reviewable, deterministic-ish logic. Requires building a rules engine and/or a hardened script sandbox, with the scripting API versioned by the capability contract.

ADR-0020 Accepted

Kubernetes is the workload orchestrator; Pulumi is the single IaC layer

Context: Requirements demand multi-cloud portability (AWS/GCP/Azure, EU sovereignty), elastic autoscaling for 50–200 concurrent sessions, geographic distribution, multiple concurrent app versions, AI-provisionable infra, and security by design.
Decision: Use Kubernetes (managed control planes — AKS/GKE/EKS) as the workload orchestrator, with Pulumi (TypeScript) as the single IaC layer provisioning both cloud resources and cluster workloads (including Helm releases via the Pulumi Kubernetes provider). Pulumi and K8s are complementary layers, not alternatives.
Consequences: K8s provides the portability + autoscaling substrate; Pulumi keeps everything as AI-writable TypeScript IaC. Adds K8s operational complexity, mitigated by managed control planes. LiveKit ships a Helm chart that fits this model.

ADR-0021 Accepted

One single-tenant, region-pinned cluster per customer

Context: Customers operate in a single region and run their own independent version of the app (e.g. one on 1.4.5, another on 1.4.9). LiveKit clusters are homogeneous (one version), and healthcare tenants need strong isolation.
Decision: The deployment cell is a per-customer, single-tenant cluster pinned to the customer's region, running that customer's pinned version. A lightweight control plane routes a customer's sessions to their cluster; cross-region routing within a single customer is not required. Within a customer's region, LiveKit's region-aware node selector remains available but is typically unnecessary.
Consequences: Clean version independence, data residency, and tenant isolation (aligns with HIPAA and the per-plane isolation ADR), with a small blast radius per customer. Higher baseline cost/ops from many clusters — mitigated by IaC templating, managed control planes, and scale-to-low node pools; namespace-per-tenant was the cheaper alternative consciously traded away for version and isolation independence.

ADR-0022 Accepted

LiveKit media plane: host networking, one pod per node, via the official Helm chart

Context: LiveKit's K8s docs require host networking so the node's rtc.udp/tcp ports are handled directly by livekit-server; this limits deployment to one LiveKit pod per node (other workloads may co-reside).
Decision: Deploy via the official LiveKit Helm chart with host networking (one livekit-server pod per node, scaled at node granularity). Terminate signal-connection TLS at the cloud load balancer/Ingress (GKE LB or AWS ALB via the AWS Load Balancer Controller; nginx-ingress + cert-manager elsewhere), and provide an SSL cert for the embedded TURN/TLS server. Sim-engine and agent workers run as ordinary autoscaled pods.
Consequences: Matches LiveKit's documented K8s model; media path is correct and performant; media scales at node granularity while compute workers scale as normal pods. Requires per-cloud LB/Ingress setup and certificate management.

ADR-0023 Accepted

Multi-version delivery via per-cell Helm releases + native draining

Context: Customers take updates on their own schedule, so multiple application versions run simultaneously; a single LiveKit cluster is homogeneous (one version), so versions are separated by cell (cluster/namespace/release).
Decision: Ship versioned container images and per-cell Helm releases; the control plane binds a tenant's pinned version/channel to the cell running it, selecting the runtime that satisfies the lesson's capability requirements (per the capability contract, ADR-0004). Use LiveKit's native connection draining for upgrades: on SIGTERM it keeps active rooms, rejects new ones, and shuts down when empty (the Helm chart sets terminationGracePeriodSeconds to 5 hours).
Consequences: Versions coexist cleanly per cell, runtime selection reuses the capability contract, and upgrades do not interrupt active sessions that end within the termination grace period. Requires image/version lifecycle management, cross-cell routing, and accounting for long pod-termination windows in autoscaling and cost.
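
The draining behavior described in the decision can be modeled as a small state machine. This is a toy model of the semantics (keep active rooms, reject new ones, shut down when empty), not LiveKit code; the class and method names are hypothetical.

```typescript
class DrainableNode {
  private rooms = new Set<string>();
  private draining = false;

  // New rooms are accepted only while the node is not draining.
  openRoom(roomId: string): boolean {
    if (this.draining) return false;
    this.rooms.add(roomId);
    return true;
  }

  closeRoom(roomId: string): void {
    this.rooms.delete(roomId);
  }

  // What a SIGTERM handler would call: stop accepting rooms, keep existing ones.
  beginDrain(): void {
    this.draining = true;
  }

  // The process may exit once draining has started and the last room has ended.
  canShutDown(): boolean {
    return this.draining && this.rooms.size === 0;
  }
}
```

The termination grace period bounds how long a node may stay in the draining state before the orchestrator stops it regardless.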

ADR-0024 Accepted

Observability via Prometheus/Grafana + OpenTelemetry

Context: We need dashboards showing live instances, session counts, geographic distribution, and performance.
Decision: Scrape LiveKit's native Prometheus metrics and instrument the sim engine, agents, and control plane with OpenTelemetry; visualize in Grafana (live sessions, per-region, per-version, node/room utilization, worker queue depth) with alerting.
Consequences: Unified, portable observability across clouds. Adds a metrics/tracing stack to operate, with managed options available.

ADR-0025 Accepted

Security designed in from the start

Context: Healthcare (HIPAA/PHI) and multi-tenant operation require security as a baseline, not an afterthought.
Decision: Default-deny NetworkPolicies; namespace isolation aligned to the content vs runtime/PHI planes (ADR-0010); KMS-backed secret stores for BYOK and platform secrets; private clusters, RBAC, encryption in transit and at rest, and audit logging; cloud BAAs per region.
Consequences: Compliance and tenant isolation become structural. Adds policy and key-management overhead that must be enforced in the Pulumi/K8s definitions.

ADR-0026 Accepted

Per-customer infrastructure provisioned automatically at onboarding

Context: Each new organization needs its own region-pinned, single-tenant cluster and supporting resources, sometimes with special requirements (a specific cloud, region, or version).
Decision: Model each customer as a parameterized Pulumi stack (cloud, region, version, sizing). Onboarding a new org triggers automated provisioning of that stack; AI agents can generate or adjust per-customer stacks for special requirements.
Consequences: Repeatable, auditable, AI-operable onboarding where special requirements become stack parameters. Requires a robust stack template, per-customer state/secret management, and a provisioning pipeline tied to org signup.

ADR-0027 Proposed

Tiered isolation: dedicated cluster vs shared namespace-isolated tier

Context: A dedicated cluster per customer (ADR-0021) duplicates a ~$600/mo fixed baseline (control plane, Redis, ingress, observability, system + HA-floor nodes) for every customer, regardless of load. For a small org on a $30–50K license this baseline dominates cost; meanwhile egress and real compute are usage-proportional and not reducible by sharing. Most customers don't require physical isolation, but all still need version independence and logical isolation.
Decision: Offer two deployment tiers behind the same control plane and Pulumi templates. Tier A — Dedicated cluster (ADR-0021): full physical isolation, any version, region-pinned; priced for high-security/military/strict-residency customers. Tier B — Shared, namespace-isolated: shared generic infrastructure (LiveKit media pool, managed control plane, Redis with per-tenant logical isolation, ingress, observability) with each customer in its own namespace running its own pinned sim-engine/agent version. Version independence lives at the app layer (per-namespace Helm releases/images), so it is preserved in both tiers; LiveKit server is treated as shared infrastructure, not the per-customer product. PHI/strict-isolation customers are placed on Tier A.
Consequences: Tier B amortizes the fixed baseline across N tenants (≈ baseline/N per customer), making the unit economics work for the majority while keeping logical isolation and version independence; Tier A remains a paid feature for those who need physical isolation. Adds a second deployment template, namespace-level NetworkPolicy/RBAC/secret isolation, per-tenant Redis keyspace/ACL separation, and a placement policy deciding which tier a customer lands on. Shared LiveKit must run a single homogeneous server version across co-located tenants.

ADR-0028 Proposed

Two-plane Postgres topology: a global control plane and per-tenant data planes

Context: The platform needs two very different kinds of data: a small, highly available registry of customers and how to provision/route them (read and written by our IaC and control plane), and large, sensitive per-customer runtime data (users, sessions, forked lesson library, PHI) that must respect each org's cloud, region, residency, and isolation tier.
Decision: Run two distinct Postgres planes. (1) Control plane — a single global Postgres in our Azure home region holding the customer registry, infra/provisioning state, Auth0 mappings, and billing/license; it contains NO PHI. (2) Data plane — one Postgres per customer, deployed where that org's cluster lives (cloud + region + version pin), holding users, sessions, content forks, and PHI. Dedicated DB on Tier A; logically isolated on Tier B (ADR-0027). The hard rule: the control plane never stores PHI, and a tenant DB never stores another tenant's data.
Consequences: Clean compliance and blast-radius story; data residency follows the tenant DB; the control plane stays small and globally available. Costs a cross-plane access pattern — the control plane holds connection references (secret refs) to tenant DBs and the app resolves the right tenant DB per request. Requires per-tenant migration orchestration so tenants on pinned versions can migrate on their own schedule.

ADR-0029 Proposed

Organizations registry is the central control point and onboarding seam

Context: Adding a customer means choosing an isolation tier, cloud, region, and version; provisioning infrastructure; wiring SSO; and forking the starter content — all of which must be coordinated and auditable, eventually self-serve from a sign-up/trial button.
Decision: Make `organizations` the top-level entity in the control-plane DB and the system of record for every customer. A row carries identity (name, slug, status), placement (tier, cloud, region, version_pin), infra binding (pulumi_stack + written-back outputs like cluster endpoint and tenant-DB connection ref), BYO-cloud credential references (secret store, never raw), Auth0 binding (auth0_org_id, connection ids, verified domains), and compliance flags. Onboarding is an ordered pipeline keyed off this row: insert org → trigger its Pulumi stack → fork Lumeto starter content into its tenant DB → create its Auth0 Organization + SSO ticket → write all outputs back onto the org.
Consequences: One auditable place ties together IaC (ADR-0026), tiers (ADR-0027), identity (ADR-0030), and content forking. Special requirements ('must be AWS', 'pin to 1.4.5') become org fields/stack parameters. Requires the org row to be the durable orchestration record (status per pipeline step, idempotent retries) rather than a passive table.
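
The org row as a durable orchestration record might be sketched like this. Step names, the `OrgRow` shape, and the synchronous step functions are simplifications for illustration; real steps would be asynchronous and the row would be persisted in the control-plane DB after each step.

```typescript
type StepName = "provision_stack" | "fork_content" | "create_auth0_org";
type StepStatus = "pending" | "done" | "failed";

interface OrgRow {
  slug: string;
  steps: Record<StepName, StepStatus>; // status per pipeline step
  outputs: Record<string, string>; // outputs written back onto the org
}

// A step does its work and returns the outputs to record on the org row.
type Step = (org: OrgRow) => Record<string, string>;

const STEP_ORDER: StepName[] = ["provision_stack", "fork_content", "create_auth0_org"];

// Safe to call repeatedly: completed steps are skipped, so a retry resumes
// from the first step that has not succeeded yet.
function runOnboarding(org: OrgRow, steps: Record<StepName, Step>): OrgRow {
  for (const name of STEP_ORDER) {
    if (org.steps[name] === "done") continue;
    try {
      Object.assign(org.outputs, steps[name](org));
      org.steps[name] = "done";
    } catch {
      org.steps[name] = "failed";
      break; // later steps depend on this one; stop and wait for a retry
    }
  }
  return org;
}
```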

ADR-0030 Proposed

Auth0 Organizations + Self-Service Enterprise Configuration for delegated SSO

Context: Customers are hospitals and medical schools that require SSO against their own directories (Entra ID, Okta, Google Workspace, ADFS, PingFederate, generic OIDC/SAML) and want to configure it themselves. We have an Auth0 subscription with unlimited enterprise connections and want a hands-off, documentation-driven setup.
Decision: Model one Auth0 Organization per customer org (1:1 with the `organizations` row). Delegate SSO setup using Auth0 Self-Service Enterprise Configuration. A profile is a reusable whitelist (max 20), not a per-customer object; the per-customer objects are the access ticket and the resulting connection. Start with a SINGLE permissive profile enabling Entra ID (OIDC), Generic SAML, and Google Workspace, so customers (mostly Microsoft Entra) can choose OIDC or configure Entra as SAML themselves. Add a second profile only to RESTRICT (e.g. a regulated profile that forces SAML + requires SCIM and domain verification), never to enable per-customer variation. At onboarding, mint an access ticket from a profile bound to the customer's Auth0 Organization; send the URL plus our docs; the customer's IT admin uses the assistant to create their IdP app, map claims, assign access, and test SSO. On completion we store auth0_org_id, connection_id, and matching_domains back on the org. App login is org-aware server-side OIDC.
Consequences: Near-zero-touch enterprise onboarding that scales with unlimited connections; customers own their IdP config and testing. Starting with one permissive profile avoids the 20-profile ceiling and template sprawl. We must maintain the profile(s) and a Management-API integration to mint Organizations and tickets. Auth0 Organizations is the membership/login anchor; our app trusts the Auth0 user id (sub) as the identity key. Enterprise customers are SSO-only; database connections are reserved for Lumeto staff and self-serve trials.

ADR-0031 Proposed

Domain Discovery for email routing; SCIM + a defined User Attribute Profile for provisioning

Context: With many enterprise connections we need users to reach their own IdP without picking a tenant manually, and we need user lifecycle (create/update/deactivate) and a known set of attributes to populate the tenant `users` table reliably.
Decision: Enable Organization Domain Discovery so verified domains populate both the Auth0 Organization's domain list and the connection's matching_domains — a user enters their work email and is routed to their company IdP (home-realm discovery). Define a User Attribute Profile (UAP) capturing the claims we depend on (email, name, and a role/group signal for instructor vs learner). Use SCIM (offered in the self-service assistant) as the primary user-lifecycle path into the tenant, with JIT-on-login provisioning as the fallback for connections without SCIM.
Consequences: Frictionless email-based login and proper deprovisioning when a customer disables a user in their IdP. Requires us to define and version the UAP and to handle SCIM inbound provisioning mapped to the correct tenant DB, plus a JIT path on first login. Role mapping depends on customers emitting a usable group/role claim; where they don't, roles are assigned in-app.
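
Home-realm discovery reduces to a lookup from email domain to organization. Auth0 performs this routing itself during login; the sketch below only illustrates the lookup against data we would hold on the org row, and the `OrgDomains` shape is an assumption.

```typescript
interface OrgDomains {
  auth0OrgId: string;
  connectionId: string;
  verifiedDomains: string[]; // lowercase, verified by the customer
}

// Map a work email to the organization whose verified domains include its domain.
function routeByEmail(email: string, orgs: OrgDomains[]): OrgDomains | null {
  const at = email.lastIndexOf("@");
  if (at < 1 || at === email.length - 1) return null; // not a usable address
  const domain = email.slice(at + 1).toLowerCase();
  const match = orgs.find((o) => o.verifiedDomains.includes(domain));
  return match === undefined ? null : match;
}
```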

ADR-0032 Proposed

Lumeto is modeled as its own organization with control-plane staff roles

Context: Lumeto's own employees need to authenticate and operate the platform (provisioning, support, cross-tenant access). Lumeto runs day-to-day on Google Workspace and also has an Entra directory. We must decide the source of truth for staff identity and where staff authorization lives.
Decision: Model Lumeto as its own Auth0 Organization backed by a single Google Workspace enterprise connection — the same pattern customers use (dogfooding). Google Workspace is the single source of truth for staff identity (not split with Entra; our Entra directory is kept only as an internal test IdP). Staff are control-plane operators: their identities and elevated roles (provisioning, support, cross-tenant access) live in the control plane, NOT as users in any customer's tenant DB. Authentication (who) comes from Google Workspace; authorization (what they may do to a given org/infra) comes from control-plane roles.
Consequences: One place to deprovision a departing employee; a consistent identity model across Lumeto and customers; clean separation of staff operators from customer learners/instructors. Requires a control-plane roles/permissions model and audit logging for cross-tenant staff actions.

ADR-0033 Accepted

Control plane is a pnpm-workspaces TypeScript monorepo

Context: The control plane spans a web client, an API, a database layer, shared domain logic, and infrastructure — all TypeScript, and they must share types end to end. The existing web app already uses pnpm.
Decision: Adopt a single pnpm-workspaces monorepo: apps `web/` (React Router 7) and `api/` (Hono), shared `packages/db` (Drizzle schema + migrations + client), `packages/core` (domain/service layer), `packages/api-client` (typed client + OpenAPI spec), and `infra/` (existing Pulumi + a new control-plane stack). A shared tsconfig.base.json and `pnpm -r` scripts drive typecheck/build/test.
Consequences: End-to-end type sharing and one install/CI surface. Only `api/` may import `packages/db`/`core`; `web/` speaks HTTP — keeping the boundary honest. Requires workspace tooling discipline (source-TS exports, bundling at build).

ADR-0034 Accepted

Control plane is API-first with a separate Hono API and React Router client

Context: The product vision is an API-first, client-agnostic realtime sim engine: game engines, multiple web UIs, and AI agents all connect via the API, and we also build our own clients. We want to practice and prove that topology starting with the control plane.
Decision: Run the control-plane API (Hono) and web client (React Router 7) as two separate services/containers. The web app is 'just another client' that speaks HTTP to the API; only the API touches the database via the service layer. This deliberately establishes the API-first discipline (clean contract, versioning, uniform auth) even though the control plane itself is low-traffic CRUD — the realtime performance lessons will come from the sim-engine API, which follows the same pattern.
Consequences: Honest client/server boundary and a reusable pattern for the sim engine; the web client validates the API like any external consumer. Costs two deployables, an internal network hop, and two-service CI/CD. Not the place we'll learn realtime throughput — that's the sim engine.

ADR-0035 Accepted

OpenAPI-first API contract via @hono/zod-openapi

Context: A client-agnostic API must serve TypeScript clients (our web app) and non-TS clients (Unreal/Unity game engine, other languages). We need runtime validation and a machine-readable contract.
Decision: Define API routes with Zod schemas using @hono/zod-openapi: this gives runtime request/response validation and an auto-generated OpenAPI document (served with Swagger UI). The TS web client uses a typed client generated from the spec; non-TS clients codegen from the same OpenAPI document. The Zod schemas are the single source of truth for the contract.
Consequences: One contract drives validation, docs, and every client. Slightly more ceremony than Hono RPC, but it is the only option that cleanly serves non-TS clients — which is the whole point of API-first. Schemas live with the API and are versioned under /v1.

ADR-0036 Accepted

Token-based API auth: Auth0 JWT bearer for all clients; web uses OIDC + BFF

Context: Every kind of client (browser, game engine, automation) must authenticate against the same API. Browsers also have stricter token-handling security needs than native clients.
Decision: The API authenticates every caller uniformly via Auth0-issued JWT access tokens, validated against Auth0's JWKS (org-aware via ADR-0030). The web client uses OIDC login and a Backend-for-Frontend pattern: the React Router server holds the session and attaches the access token when its loaders/actions call the API server-side, so tokens never reach browser JS. Native clients acquire tokens via the appropriate OAuth flow and call the same API.
Consequences: Uniform, client-agnostic authorization with no special-casing in the API; browser security preserved by keeping tokens server-side. Requires JWKS validation + audience/issuer checks in the API and an OIDC/session layer in the web server. In local dev, auth can be bypassed behind an explicit flag.
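
The audience/issuer checks mentioned above could be sketched as a pure function over already-decoded claims. This deliberately omits signature verification: the token's signature must be verified against the JWKS first (with a JWT library) before any claim is trusted. The expected values and function name are placeholders.

```typescript
interface AccessTokenClaims {
  iss: string;
  aud: string | string[];
  exp: number; // seconds since the Unix epoch
  sub: string; // the identity key the app trusts
  org_id?: string; // organization the token was issued for
}

interface ExpectedClaims {
  issuer: string;
  audience: string;
  orgId: string;
}

// Returns a list of problems; an empty list means the claims are acceptable.
// Assumes the signature was already verified against the issuer's JWKS.
function checkClaims(claims: AccessTokenClaims, expected: ExpectedClaims, nowSec: number): string[] {
  const problems: string[] = [];
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (claims.iss !== expected.issuer) problems.push("issuer mismatch");
  if (!audiences.includes(expected.audience)) problems.push("audience mismatch");
  if (claims.exp <= nowSec) problems.push("token expired");
  if (claims.org_id !== expected.orgId) problems.push("organization mismatch");
  return problems;
}
```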

ADR-0037 Proposed

One OCI image per service on Azure Container Apps via GHCR, with GitHub Actions CI/CD

Context: The control plane is ours to deploy however we like, but we want to practice the same cloud-portable, container-based path we'll use for customer apps — deployable to Azure now and AWS/GCP later, favoring one way over many.
Decision: Build one cloud-neutral OCI image per service and push to GHCR (neutral registry). Deploy to Azure Container Apps now (closest managed model to AWS App Runner / GCP Cloud Run, so the same image runs elsewhere); Postgres is Azure Database for PostgreSQL Flexible Server; local dev uses docker compose. A new Pulumi `infra/control-plane` stack wires it, mirroring the existing per-cloud pattern. GitHub Actions runs PR checks (typecheck, lint, test, build, migration validation); on squash-merge to main it builds and pushes images, runs migrations, and runs `pulumi up` against the dev environment, with promotion to staging/prod gated.
Consequences: The deployable artifact is identical across clouds (portability proof), only the Pulumi wiring differs. Practices the customer deploy path on our own low-risk service. Azure Container Apps is an Azure-specific control surface, but the image and the mental model stay portable.

ADR-0038 Proposed

Control plane is deployed in Canada; data residency may force per-region control-plane replicas

Context: Lumeto operates in Canada, so some data must reside here regardless. Strict customers (e.g. US, EU) may require that ALL data about their organization — potentially including control-plane records that describe the org — stay within their own country. Today the control plane is a single global registry.
Decision: Deploy the control plane in a Canadian Azure region (Canada Central; Canada East lacks Container Apps) as the default home. Treat the control plane as potentially regionalizable: keep the per-organization data it stores minimal and clearly catalogued, so that if a regulated customer requires in-country residency we can stand up a per-region control-plane replica (or shard) for those tenants rather than redesigning. Re-evaluate once we know how much org-identifying data the control plane actually holds.
Consequences: Canada-first satisfies our own baseline. Deferring multi-region avoids premature complexity, but we must consciously limit and document control-plane PII so a future regional split stays tractable. Cross-region operation (staff in Canada operating EU tenants) will need explicit data-handling rules.