Zebra Labs Backend Default Design — Outline (v1)

August 19, 2025

by William Warne, Software Engineer | Fractional CTO | Founder

Introduction

This document describes our default approach to backend API code design and architecture with FastAPI. The focus is on outcomes: reliability, maintainability, performance, security, and developer speed. Every choice should tie back to those goals and be easy to reason about.

We prefer Domain-Driven Design (DDD) and a Hexagonal Architecture (also called Ports and Adapters). We adopt Command Query Responsibility Segregation (CQRS) to separate write and read concerns in the application layer. We also adopt a code‑first approach to data (types, identifiers, constraints, and migrations are defined in code), and we capture audit data for changes by default.

Aim

To provide an opinionated, practical baseline for building FastAPI services at Zebra Labs that can be used as the default starting point for new APIs. Deviations are welcome when they are justified by a project’s needs.

Our Goals

Fast local and full local setup — one command to run the API and dependencies.
Stable, explicit contracts — predictable request/response shapes and error envelopes.
Domain-first code — clear boundaries that scale with features and teams.
Code-first data — identifiers, constraints, and migrations defined in code; reproducible environments.
Operational excellence — structured logs, metrics, and traces with actionable alerts.
Security by default — authentication, authorization, rate limits, and safe defaults.
Smooth deploys — Infrastructure as Code (IaC), Fly.io as default, Amazon Web Services (AWS) as alternative, with a migration path.

Technology Constraints & Baseline

Language & runtime: Python 3.12+
Web framework: FastAPI (ASGI)
Data modelling & validation: Pydantic v2 (request/response schemas)
Database access: SQLAlchemy 2.x (asynchronous engine and sessions)
Migrations: Alembic (generated from models, reviewed in code)
HTTP client: httpx (async)
Server/Process model: Uvicorn (development) / Gunicorn with Uvicorn workers (production)
Cache/queues: Redis (caching, rate limiting, lightweight queues/locks)
Primary database: PostgreSQL
Logging: structlog (JSON logs with correlation IDs)
Tracing & metrics: OpenTelemetry (traces) + Prometheus-compatible metrics endpoint
Lint/format/type: Ruff, Black, MyPy
Testing: Pytest, pytest-asyncio, coverage, Schemathesis (OpenAPI-based contract testing)
Configuration: pydantic-settings (environment-variable driven)

Why these? Mature ecosystem, async performance, strong typing/validation, first-class observability, and simple containerization.

Code Design (Domain-Driven, Hexagonal, CQRS)

We separate concerns into layers and keep dependencies pointing inward. The application layer adopts CQRS: commands (writes) and queries (reads) have separate handlers and DTOs (Data Transfer Objects). This improves clarity and allows different optimization strategies for reading versus writing.

Folder & Code Organization

app/
  domain/                     # Pure domain code: entities, value objects, domain services
    activities/
      entities.py
      value_objects.py
      services.py
      repository.py           # Ports: repository interfaces
  application/                # Use-cases (CQRS): commands/queries and handlers
    activities/
      dto.py                  # Input/Output DTOs
      commands.py             # Command models (write intents)
      command_handlers.py     # Handlers for commands (use-cases)
      queries.py              # Query models (read intents)
      query_handlers.py       # Handlers for queries
      uow.py                  # Unit of Work (transaction boundary) interface
  infrastructure/             # Adapters (implement ports), external systems
    db/
      models.py               # SQLAlchemy models (code-first)
      repositories.py         # Repository implementations (SQLAlchemy)
      uow.py                  # Unit of Work implementation (sessions/transactions)
      migrations/             # Alembic migrations
    messaging/
      outbox.py               # Outbox pattern for reliable events
    http/
      clients.py              # httpx clients
    auth/
      oidc.py                 # OpenID Connect helpers (optional)
  presentation/
    api/
      deps.py                 # FastAPI dependencies (auth, db session, request id)
      errors.py               # Error mappers and exception handlers
      pagination.py           # Cursor pagination helpers
      activities/
        router.py             # Routes composing application handlers
  settings.py                 # pydantic-settings
  main.py                     # FastAPI app factory, middleware, routes

scripts/
  seed.py                     # Dev data seed

infra/
  fly/                        # Fly.io configuration and deploy scripts
  aws/                        # Terraform modules and environment configs (alt hosting)

tests/
  unit/                       # Pure domain tests
  integration/                # Adapters (db, http) tests
  api_contract/               # Schemathesis/OpenAPI contract tests
  e2e/                        # End-to-end smoke tests

Makefile                      # developer tasks
.env.example                  # environment variables

Rules

Inner layers (domain, application) do not import outer layers (presentation, infrastructure).
Repositories and Unit of Work are defined as interfaces (ports) in the domain/application; infrastructure implements them.
CQRS by default: command handlers do not return domain entities; they return minimal results (IDs, summaries) or nothing. Query handlers return read-optimized DTOs.
No cross-domain imports except via a public API module per domain.
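
The first rule can be checked mechanically. The following is a minimal sketch, assuming the package prefixes from the folder layout above; the function name `forbidden_imports` and the constant `OUTER_LAYERS` are hypothetical, and relative imports are deliberately ignored to keep the example short.

```python
import ast

# Hypothetical package prefixes, taken from the folder layout above.
OUTER_LAYERS = ("app.infrastructure", "app.presentation")


def forbidden_imports(source: str, forbidden: tuple[str, ...] = OUTER_LAYERS) -> list[str]:
    """Return the absolute imports in `source` that point at a forbidden layer."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue  # not an import, or a relative import (skipped in this sketch)
        for name in names:
            if any(name == prefix or name.startswith(prefix + ".") for prefix in forbidden):
                found.append(name)
    return found
```

A test could read every file under app/domain and app/application, pass its text to this function, and fail when the returned list is not empty.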

Example: Unit of Work and Repository Port

# app/application/activities/uow.py
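from __future__ import annotations
from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:  # imported for type hints only; no runtime dependency
    from app.domain.activities.entities import Activity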

class AbstractActivitiesRepo(Protocol):
    async def add(self, activity: "Activity") -> None: ...
    async def get(self, activity_id: str) -> "Activity | None": ...
    async def list(self, *, cursor: str | None, limit: int) -> tuple[list["Activity"], str | None]: ...

class UnitOfWork(Protocol):
    activities: AbstractActivitiesRepo
    async def __aenter__(self) -> "UnitOfWork": ...
    async def __aexit__(self, exc_type, exc, tb) -> None: ...
    async def commit(self) -> None: ...
    async def rollback(self) -> None: ...

# app/infrastructure/db/uow.py
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
from .repositories import ActivitiesRepo

class SqlAlchemyUoW:
    def __init__(self, session_factory: async_sessionmaker[AsyncSession]):
        self._session_factory = session_factory
        self.session: AsyncSession | None = None
        self.activities: ActivitiesRepo | None = None

    async def __aenter__(self):
        self.session = self._session_factory()
        self.activities = ActivitiesRepo(self.session)
        return self

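    async def __aexit__(self, exc_type, exc, tb):
        # Commit on success, roll back on any exception, and always release the session.
        try:
            if exc_type is None:
                await self.session.commit()
            else:
                await self.session.rollback()
        finally:
            await self.session.close()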

    async def commit(self):
        await self.session.commit()

    async def rollback(self):
        await self.session.rollback()

CQRS Handlers (Commands and Queries)

# app/application/activities/commands.py
from pydantic import BaseModel

class CreateActivity(BaseModel):
    id: str
    title: str
    description: str | None = None

# app/application/activities/command_handlers.py
from .uow import UnitOfWork
from .commands import CreateActivity
from app.domain.activities.entities import Activity

async def handle_create(cmd: CreateActivity, uow: UnitOfWork) -> str:
    activity = Activity(id=cmd.id, title=cmd.title, description=cmd.description)
    async with uow:
        await uow.activities.add(activity)
    return activity.id

# app/application/activities/queries.py
from pydantic import BaseModel

class GetActivity(BaseModel):
    id: str

# app/application/activities/query_handlers.py
from .uow import UnitOfWork
from .queries import GetActivity

async def handle_get(q: GetActivity, uow: UnitOfWork):
    async with uow:
        return await uow.activities.get(q.id)

Code‑First Data (IDs, Constraints, Migrations) and Audit

We manage identifiers and constraints in code. Database defaults are avoided for IDs; instead, we generate them in the application layer to guarantee repeatability across environments and better testability.

Identifier Strategy

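Default: ULID (Universally Unique Lexicographically Sortable Identifier) generated in code. ULIDs sort by creation time to millisecond precision, which keeps primary-key index inserts mostly sequential; ordering within the same millisecond is only guaranteed if a monotonic generator is used.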
Alternative: UUID v4 (random) or UUID v7 (time-ordered) if your project prefers. The key is: generated in code, not by the database.

# app/domain/shared/ids.py
import ulid

def new_id() -> str:
    return str(ulid.new())

SQLAlchemy Model (code-first, no DB-generated IDs)

# app/infrastructure/db/models.py
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy import String, DateTime, Integer, JSON
from datetime import datetime, timezone
from app.domain.shared.ids import new_id

class Base(DeclarativeBase):
    pass

class ActivityModel(Base):
    __tablename__ = "activities"
    id: Mapped[str] = mapped_column(String(26), primary_key=True, default=new_id)  # ULID string
    title: Mapped[str] = mapped_column(String(200))
    description: Mapped[str | None]

    # Audit & concurrency (code-managed)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    created_by: Mapped[str | None] = mapped_column(String(128))
    updated_by: Mapped[str | None] = mapped_column(String(128))
    version: Mapped[int] = mapped_column(Integer, default=1, nullable=False)

Update Hooks (keep timestamps/version in code)

# app/infrastructure/db/repositories.py
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from datetime import datetime, timezone
from .models import ActivityModel

class ActivitiesRepo:
    def __init__(self, session: AsyncSession):
        self.session = session

    async def add(self, a):
        # Assumes the domain entity exposes an `audit` value object that carries the acting user.
        model = ActivityModel(
            id=a.id,
            title=a.title,
            description=a.description,
            created_by=a.audit.actor,
            updated_by=a.audit.actor,
        )
        self.session.add(model)

    async def get(self, activity_id: str):
        res = await self.session.execute(select(ActivityModel).where(ActivityModel.id == activity_id))
        return res.scalar_one_or_none()

    async def touch_update(self, model: ActivityModel, actor: str | None):
        model.updated_at = datetime.now(timezone.utc)
        model.updated_by = actor
        model.version += 1
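
The version column is only useful if writes check it. The sketch below shows the optimistic-concurrency idea with sqlite3 from the standard library so it runs anywhere: the UPDATE succeeds only when the stored version still equals the version the caller read. `StaleVersionError` and `update_title` are hypothetical names; with SQLAlchemy the same effect can be had by adding the version to the WHERE clause or through the mapper's `version_id_col` option.

```python
import sqlite3


class StaleVersionError(Exception):
    """Raised when the row changed since it was read."""


def update_title(conn: sqlite3.Connection, activity_id: str, new_title: str, expected_version: int) -> int:
    """Update only if the stored version matches; return the new version."""
    cur = conn.execute(
        "UPDATE activities SET title = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_title, activity_id, expected_version),
    )
    if cur.rowcount != 1:
        # Either the row is missing or another writer bumped the version first.
        raise StaleVersionError(f"activity {activity_id} is not at version {expected_version}")
    return expected_version + 1


# Tiny in-memory table so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activities (id TEXT PRIMARY KEY, title TEXT NOT NULL, version INTEGER NOT NULL)")
conn.execute("INSERT INTO activities VALUES ('a1', 'Draft', 1)")
```

A caller that catches the stale-version error can re-read the row and retry, or surface a 409 Conflict to the client.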

Audit Log (append-only, code-managed)

We capture who did what and when, and (optionally) before/after snapshots for sensitive entities.

# app/infrastructure/db/models.py (continued)
class AuditLog(Base):
    __tablename__ = "audit_log"
    id: Mapped[str] = mapped_column(String(26), primary_key=True, default=new_id)
    occurred_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    actor: Mapped[str | None] = mapped_column(String(128))
    action: Mapped[str] = mapped_column(String(64))  # e.g., "activity.created"
    entity_type: Mapped[str] = mapped_column(String(64))
    entity_id: Mapped[str] = mapped_column(String(64))
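    # "metadata" is a reserved attribute name on declarative classes, so the column is mapped under another attribute.
    meta: Mapped[dict | None] = mapped_column("metadata", JSON)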
    before: Mapped[dict | None] = mapped_column(JSON)
    after: Mapped[dict | None] = mapped_column(JSON)

Audit writes are emitted by repositories or the Unit of Work. Prefer code-level capture to database triggers so behavior is explicit, testable, and portable across environments.

Migrations

Use Alembic autogenerate from SQLAlchemy models; review each generated diff, hand-edit the revision where autogenerate misses or misreads a change, and keep all revisions in version control.
Commands: make migrate (create revision), make upgrade, make downgrade.

API Design Conventions

Routing & versioning: Prefix routes with /v1. Group by bounded context (domain).
Consistency: Python attributes use snake_case; JSON payloads are serialized as camelCase (for example requestId and nextCursor), which can be done with Pydantic aliasing. Whichever wire convention a project picks, enforce it on every endpoint.
Error envelope: Return { code, message, details?, requestId } with appropriate HTTP status codes. Map exceptions centrally.
Pagination: Cursor-based pagination is preferred. Response includes { items, nextCursor }.
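Idempotency: For requests that create or mutate resources and are not idempotent by definition (POST, and PATCH where applicable), accept an Idempotency-Key header and de-duplicate on the server. PUT is already idempotent in HTTP semantics, so a key is optional there.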
Correlation IDs: Generate a requestId per request and log it.

# app/presentation/api/errors.py
from fastapi import Request
from fastapi.responses import JSONResponse

class ApiError(Exception):
    def __init__(self, code: str, message: str, status_code: int = 400, details: dict | None = None):
        self.code = code
        self.message = message
        self.status_code = status_code
        self.details = details or {}

async def api_error_handler(request: Request, exc: ApiError):
    body = {
        "code": exc.code,
        "message": exc.message,
        "details": exc.details,
        # Falls back to None if the request-ID middleware did not run for this request.
        "requestId": getattr(request.state, "request_id", None),
    }
    return JSONResponse(status_code=exc.status_code, content=body)

# app/presentation/api/deps.py
import uuid
from fastapi import Request

async def request_id_middleware(request: Request, call_next):
    request.state.request_id = uuid.uuid4().hex
    response = await call_next(request)
    response.headers["X-Request-Id"] = request.state.request_id
    return response

Security & Privacy (Defaults)

Authentication: JSON Web Token (JWT) bearer tokens by default; optional OpenID Connect integration.
Authorization: Role-Based Access Control helpers. Keep permission checks in application layer handlers.
CORS: Restrictive by default; allow-list per environment.
Rate limiting: Redis-backed token bucket per IP and per user.
Input limits: Request body size limits on server. Timeouts on upstream calls.
Secrets: Environment variables (development), Fly.io secrets in production; AWS Secrets Manager when on AWS.
Data protection: Encrypt in transit (TLS) and at rest (Postgres, S3). Redact sensitive fields in logs. Classify personal data; avoid logging it.

Performance & Scalability

Prefer asynchronous I/O. Size database and HTTP connection pools for expected concurrency.
Avoid “N+1” queries via eager loading or dedicated read models in query handlers.
Cache expensive reads (in-process or Redis). Invalidate on write via Unit of Work events.
Use background workers for heavy tasks; retries with exponential backoff.
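
Retries with exponential backoff can be sketched in a few lines of asyncio. The delay ceiling doubles on each attempt and the actual wait is drawn at random below that ceiling (often called full jitter) so that many workers do not retry in lockstep. The function name, defaults, and the injectable `sleep` parameter are assumptions made for this example.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call `op` until it succeeds or `attempts` is exhausted."""
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            await sleep(random.uniform(0, ceiling))  # full jitter
    raise RuntimeError("attempts must be >= 1")
```

In real workers it is usually better to catch only errors known to be transient (timeouts, connection resets) rather than every exception, and to make the retried operation idempotent.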

Monitoring, Observability & Alerting

Structured Logging

# app/main.py (excerpt)
import structlog
from fastapi import FastAPI

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])
logger = structlog.get_logger()

app = FastAPI()

@app.middleware("http")
async def access_log(request, call_next):
    response = await call_next(request)
    logger.info(
        "http_request",
        method=request.method,
        path=request.url.path,
        status=response.status_code,
        # None if the request-ID middleware is not installed.
        request_id=getattr(request.state, "request_id", None),
    )
    return response

Metrics (Prometheus-compatible)

# app/main.py (excerpt)
from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app, endpoint="/metrics")

Tracing (OpenTelemetry)

# app/main.py (excerpt)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "zebra-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)

Health & Readiness

# app/presentation/api/router_health.py
from fastapi import APIRouter
router = APIRouter()

@router.get("/healthz")
async def liveness():
    return {"status": "ok"}

@router.get("/readyz")
async def readiness():
    # Optionally check DB/connectivity here
    return {"status": "ready"}
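
The optional dependency check mentioned in the readiness handler could look like the sketch below: each dependency contributes an async check, every check gets a timeout, and the endpoint is ready only if all pass. `run_readiness` and the `Check` alias are hypothetical; the real checks would be something like a `SELECT 1` against the database and a Redis ping.

```python
import asyncio
from collections.abc import Awaitable, Callable

# A check passes if it returns without raising.
Check = Callable[[], Awaitable[None]]


async def run_readiness(checks: dict[str, Check], timeout: float = 1.0) -> tuple[bool, dict[str, str]]:
    """Run all checks concurrently and report a status string per dependency."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"
        except Exception as exc:
            return f"error: {type(exc).__name__}"

    results = await asyncio.gather(*(run_one(check) for check in checks.values()))
    report = dict(zip(checks.keys(), results))
    return all(result == "ok" for result in results), report
```

A /readyz handler built on this would return the report with status 200 when ready and 503 otherwise, so the platform stops routing traffic to an instance that cannot reach its dependencies.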

Dashboards & Alerts

Provide Grafana dashboards: request rate, latency (p50/p95/p99), error rate, DB pool saturation, cache hit rate.
Alerts: error rate > 1% for 5 minutes, p95 latency above SLO for 10 minutes, readiness failing.

Local Development & Environments

Easy local: docker compose with PostgreSQL, Redis, Mailpit (SMTP testing), and the API.
Full local: add OpenTelemetry Collector + Grafana/Prometheus for local observability.
Make targets:

run: ## Run API locally (reload)
	uvicorn app.main:app --reload --port 8000

up: ## Start local stack
	docker compose up -d

down: ## Stop local stack and delete its volumes (local data is lost)
	docker compose down -v

lint:
	ruff check . && black --check .

typecheck:
	mypy app

test:
	pytest -q --maxfail=1 --disable-warnings

migrate:
	alembic revision --autogenerate -m "change"

upgrade:
	alembic upgrade head

downgrade:
	alembic downgrade -1

.env.example: include database URL, Redis URL, secret keys, tracing exporter URL.

Testing Strategy

Unit tests: pure domain logic and application handlers (CQRS) with fake repositories.
Integration tests: SQLAlchemy repositories against a test database (transaction rollbacks).
API contract tests: Schemathesis against the running FastAPI app.
End-to-end smoke: start stack with Docker, run a minimal flow.

CI/CD & Release

Pipeline: lint → typecheck → unit/integration → build image → scan → contract tests → push image → (optional) deploy preview.
Version images and migrations together; keep migrations additive and reversible where possible.

Infrastructure & Hosting

Principles

Portability first: Everything runs in a container image; configuration comes from environment variables; no provider‑specific code in the domain or application layers. Observability uses open standards (OpenTelemetry for traces/metrics, JSON logs), so it works on Fly.io and on Amazon Web Services (AWS) the same way.
Security by default: Private networking, least‑privilege identities, encrypted data at rest and in transit, minimal surface on the public internet, and immutable images built in Continuous Integration (CI) and deployed by Continuous Delivery (CD).
Infrastructure as Code (IaC): All cloud resources are defined in Terraform and versioned. Remote state uses Amazon Simple Storage Service (S3) with state locking in Amazon DynamoDB.
Environments: dev, stage, and prod are separate stacks with identical modules and different parameters.

Fly.io (Default)

What we run: one container for the FastAPI app behind Fly’s Anycast edge; optional Fly Postgres. Health checks target /readyz and /healthz.
Secrets: Fly secrets and environment variables; rotate on deploys.
Observability: same app image exports Prometheus metrics at /metrics and traces via OpenTelemetry (OTLP) to a collector you run locally or remotely.
Scaling: per‑region horizontal scaling + memory/CPU sizing; keep the number of regions small unless latency needs demand wider distribution.

AWS (Alternative, Terraform)

Default choice: Amazon Elastic Container Service (ECS) on AWS Fargate (serverless containers). Simpler alternative: AWS App Runner (managed HTTP service) — fine for small apps; fewer knobs, faster to stand up. Not default: Amazon Elastic Kubernetes Service (EKS) — only when you need Kubernetes explicitly.

Reference Architecture (ECS Fargate)

Networking: one Virtual Private Cloud (VPC), at least two Availability Zones. Public subnets for the load balancer, private subnets for tasks and databases. Network Address Translation (NAT) gateways for egress to the internet.
Ingress: Application Load Balancer (ALB) with TLS certificates from AWS Certificate Manager (ACM). Optional AWS Web Application Firewall (WAF) for Layer‑7 protections.
Compute: ECS service on Fargate with Auto Scaling (CPU/memory metrics and optionally request rate). Task definitions run the API container and (optionally) a sidecar OpenTelemetry Collector.
Data: Amazon Relational Database Service (RDS) for PostgreSQL (Multi‑AZ), Amazon ElastiCache for Redis, Amazon S3 for object storage.
Identity & Secrets: AWS Identity and Access Management (IAM) roles for tasks (least privilege). Secrets in AWS Secrets Manager. Keys in AWS Key Management Service (KMS) for encryption at rest.
Observability: Logs to Amazon CloudWatch Logs; metrics via Prometheus scraping (self‑managed in ECS) or Amazon Managed Service for Prometheus; dashboards in Grafana (self‑hosted or Amazon Managed Grafana). Traces exported via OTLP to your collector or AWS Distro for OpenTelemetry.
DNS: Amazon Route 53. Optionally use weighted records for canary/blue‑green.

Terraform Layout

infra/aws/terraform/
  modules/
    network/        # VPC, subnets, NAT, routing
    alb/            # ALB, listeners, target groups, WAF (optional)
    acm/            # TLS certificates
    ecs_cluster/    # ECS cluster + capacity providers (Fargate)
    ecs_service/    # Task definition, service, autoscaling
    rds_postgres/   # PostgreSQL (Multi‑AZ), parameter groups, security groups
    redis/          # ElastiCache Redis
    s3_bucket/      # App buckets (e.g., uploads), versioning, lifecycle
    secrets/        # Secrets Manager entries and IAM policies
    otel/           # OpenTelemetry collector task and IAM
    dns/            # Route 53 zone/records
  envs/
    dev/
      main.tf variables.tf outputs.tf backend.tf
    stage/
      ...
    prod/
      ...
  providers.tf
  versions.tf

Security Defaults (AWS)

Network isolation: Application Load Balancer is public; ECS tasks, RDS, and Redis are private. Only the load balancer can reach the tasks; only tasks can reach the database and Redis.
Encryption: TLS via ACM on the load balancer; at rest with KMS for RDS, ElastiCache, and S3. Force TLS for client connections to RDS/Redis.
Identity: Separate IAM roles for the task execution (pulling images/secrets) and for the application (only the permissions it needs). No access keys in containers.
Secrets: Store in Secrets Manager; inject via ECS task definitions. Rotate DB credentials and JWT secrets on a schedule.
Containers: Run as non‑root, read‑only root filesystem, drop Linux capabilities, no privileged mode. Scan images in Amazon Elastic Container Registry (ECR) and in CI (for example, Grype/Trivy). Optionally sign images (Cosign) and verify in the pipeline.
Edge protections: Optional AWS WAF rules for common attacks; AWS Shield Standard provides basic Distributed Denial of Service (DDoS) protection.
Cost awareness: Network Address Translation gateways are billed per hour + data; prefer VPC endpoints for S3/ECR to reduce egress.

Observability on AWS

Logs: ship to CloudWatch Logs with a retention policy; include requestId, user/tenant (when present), latency, status.
Metrics: scrape /metrics with a Prometheus server in ECS or use Amazon Managed Service for Prometheus; alert from latency percentiles and error rates.
Traces: export OTLP to a collector; store and explore with AWS X‑Ray or any vendor supporting OTLP.

Migration: Fly.io → AWS

Goals

Security: improve isolation (private subnets), managed secrets, encryption, and per‑service identities.
Portability: keep the application unchanged — only environment and infrastructure differ. Observability remains standards‑based.
Predictability: rehearsed cutover with rollback.

Step 1 — Discovery & Readiness

Inventory services, environment variables, secrets, database schemas, object storage, background jobs, cron, ingress/egress rules, and external integrations (allowlists, webhooks).
Confirm the application generates identifiers in code and not in the database (our default) to simplify migration.

Step 2 — Stand Up AWS Staging with Terraform

Create VPC, subnets, NAT, security groups, ECS, RDS PostgreSQL, ElastiCache Redis, S3, Secrets Manager, Route 53, and observability stack using the modules above.
Deploy the same container image to staging (push to ECR). Verify /healthz, /readyz, /metrics, OpenAPI, and logging/tracing.

Step 3 — Data Migration (Database)

Option A: Near‑zero downtime with AWS Database Migration Service (DMS)

Provision RDS in Multi‑AZ.
Use DMS to perform full load then Change Data Capture (CDC) from Fly Postgres to RDS.
Validate row counts and checksums; point read-only application traffic (for example, a staging instance) at RDS to verify behavior before cutover.
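
Row-count and checksum validation can be as simple as computing the same fingerprint on both sides and comparing. The sketch below uses sqlite3 so it is runnable without a database server; against Postgres the same idea applies with an async or sync driver on each side. `table_fingerprint` and `make_db` are hypothetical names, and table and column names are assumed to come from a trusted list, not user input.

```python
import hashlib
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str, key: str) -> tuple[int, str]:
    """Return (row_count, sha256 over all rows read in primary-key order)."""
    digest = hashlib.sha256()
    count = 0
    # Identifiers are interpolated, so they must come from a trusted allow-list.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"'):
        digest.update(repr(row).encode())
        count += 1
    return count, digest.hexdigest()


def make_db(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Build a small in-memory table so the sketch is self-contained."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE activities (id TEXT PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO activities VALUES (?, ?)", rows)
    return conn
```

Comparing the fingerprint of each table on the source and the target catches both missing rows and silently altered values; for very large tables the comparison can be done per key range instead of per table.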

Option B: Planned downtime with pg_dump / pg_restore

Quiesce writes on Fly (maintenance mode or feature flag to block mutations).
Dump from Fly Postgres with pg_dump --format=custom.
Restore to RDS; run migrations; verify.

Step 4 — Object Storage & Files

If using Fly volumes/local disk: sync to S3 (for example, rclone or a one‑off container job). Verify checksums and metadata (content‑type, cache‑control).

Step 5 — Secrets & Configuration

Create Secrets Manager entries. Update ECS task definition environment variables and secret refs. Rotate keys on cutover.

Step 6 — Traffic Cutover

Bring up production ECS service behind ALB. Warm tasks.
Use Route 53 weighted DNS: start with a small percentage to AWS (for example, 5%), monitor latency/error budgets, then ramp to 100%.
Alternatively, use blue‑green with two target groups on the ALB and swap.

Step 7 — Finalize & Rollback Plan

Keep the Fly deployment hot for fast rollback until AWS has run cleanly for an agreed stable period. Rollback path: send DNS back to Fly. Note: if writes occurred on AWS, rolling back data requires either dual‑write during ramp‑up or accepting data loss up to the last verified sync. Prefer short “read‑only” windows during cutover to avoid divergence.
After the stable period, decommission DMS (if Option A was used) and tear down Fly Postgres.

Security Checklist for Migration

✅ Private subnets for tasks and databases; only ALB is public.
✅ TLS everywhere; require SSL for Postgres connections (on RDS, set the rds.force_ssl parameter).
✅ Least‑privilege IAM roles; no static credentials in containers.
✅ Secrets in Secrets Manager with rotation policies.
✅ KMS encryption for RDS/ElastiCache/S3; S3 buckets private + least‑privilege access.
✅ CloudWatch alarms on p95 latency and error rate; health checks on ALB.
✅ Web Application Firewall rules on ALB (rate limit, common exploit rules) if exposure warrants it.

Portability Guardrails

Keep provider specifics in the infrastructure layer (Terraform modules, task environment).
Use environment variables for configuration; avoid hardcoding AWS resource identifiers in application code.
Telemetry through OpenTelemetry (OTLP), not provider‑specific SDKs.
Object storage via S3 API — use MinIO locally to stay portable.
Image & registry: push to Amazon Elastic Container Registry (ECR); update deploy workflow.
Environment parity: copy secrets and environment variables; verify CORS, origins, and DNS.
Database move: either logical replication to Amazon RDS or pg_dump/pg_restore with planned downtime.
Object storage: MinIO/Fly volumes → S3 sync.
DNS cutover: Route 53 with health checks and quick rollback; keep old stack hot until verified.

Coming soon — detailed cutover runbook with commands

This section is being polished.

Final Thoughts

This baseline is designed for real-world delivery: clear domain boundaries, code-first data and migrations, CQRS for clarity and scalability, and strong operational visibility. Deviations are welcome when reasoned—and improvements are expected as we learn.