Zebra Labs Backend Default Design — Outline (v1)
by William Warne, Software Engineer | Fractional CTO | Founder
Introduction
This document describes our default approach to backend API code design and architecture with FastAPI. The focus is on outcomes: reliability, maintainability, performance, security, and developer speed. Every choice should tie back to those goals and be easy to reason about.
We prefer Domain-Driven Design (DDD) and a Hexagonal Architecture (also called Ports and Adapters). We adopt Command Query Responsibility Segregation (CQRS) to separate write and read concerns in the application layer. We also adopt a code‑first approach to data (types, identifiers, constraints, and migrations are defined in code), and we capture audit data for changes by default.
Aim
To provide an opinionated, practical baseline for building FastAPI services at Zebra Labs that can be used as the default starting point for new APIs. Deviations are welcome when justified by a project’s needs.
Our Goals
- Fast local and full local setup — one command to run the API and dependencies.
- Stable, explicit contracts — predictable request/response shapes and error envelopes.
- Domain-first code — clear boundaries that scale with features and teams.
- Code-first data — identifiers, constraints, and migrations defined in code; reproducible environments.
- Operational excellence — structured logs, metrics, and traces with actionable alerts.
- Security by default — authentication, authorization, rate limits, and safe defaults.
- Smooth deploys — Infrastructure as Code (IaC), Fly.io as default, Amazon Web Services (AWS) as alternative, with a migration path.
Technology Constraints & Baseline
- Language & runtime: Python 3.12+
- Web framework: FastAPI (ASGI)
- Data modelling & validation: Pydantic v2 (request/response schemas)
- Database access: SQLAlchemy 2.x (asynchronous engine and sessions)
- Migrations: Alembic (generated from models, reviewed in code)
- HTTP client: httpx (async)
- Server/Process model: Uvicorn (development) / Gunicorn with Uvicorn workers (production)
- Cache/queues: Redis (caching, rate limiting, lightweight queues/locks)
- Primary database: PostgreSQL
- Logging: structlog (JSON logs with correlation IDs)
- Tracing & metrics: OpenTelemetry (traces) + Prometheus-compatible metrics endpoint
- Lint/format/type: Ruff, Black, MyPy
- Testing: Pytest, pytest-asyncio, coverage, Schemathesis (OpenAPI-based contract testing)
- Configuration: pydantic-settings (environment-variable driven)
Why these? Mature ecosystem, async performance, strong typing/validation, first-class observability, and simple containerization.
Code Design (Domain-Driven, Hexagonal, CQRS)
We separate concerns into layers and keep dependencies pointing inward. The application layer adopts CQRS: commands (writes) and queries (reads) have separate handlers and DTOs (Data Transfer Objects). This improves clarity and allows different optimization strategies for reading versus writing.
Folder & Code Organization
app/
  domain/ # Pure domain code: entities, value objects, domain services
    activities/
      entities.py
      value_objects.py
      services.py
      repository.py # Ports: repository interfaces
  application/ # Use-cases (CQRS): commands/queries and handlers
    activities/
      dto.py # Input/Output DTOs
      commands.py # Command models (write intents)
      command_handlers.py # Handlers for commands (use-cases)
      queries.py # Query models (read intents)
      query_handlers.py # Handlers for queries
      uow.py # Unit of Work (transaction boundary) interface
  infrastructure/ # Adapters (implement ports), external systems
    db/
      models.py # SQLAlchemy models (code-first)
      repositories.py # Repository implementations (SQLAlchemy)
      uow.py # Unit of Work implementation (sessions/transactions)
      migrations/ # Alembic migrations
    messaging/
      outbox.py # Outbox pattern for reliable events
    http/
      clients.py # httpx clients
    auth/
      oidc.py # OpenID Connect helpers (optional)
  presentation/
    api/
      deps.py # FastAPI dependencies (auth, db session, request id)
      errors.py # Error mappers and exception handlers
      pagination.py # Cursor pagination helpers
      activities/
        router.py # Routes composing application handlers
  settings.py # pydantic-settings
  main.py # FastAPI app factory, middleware, routes
scripts/
  seed.py # Dev data seed
infra/
  fly/ # Fly.io configuration and deploy scripts
  aws/ # Terraform modules and environment configs (alt hosting)
tests/
  unit/ # Pure domain tests
  integration/ # Adapters (db, http) tests
  api_contract/ # Schemathesis/OpenAPI contract tests
  e2e/ # End-to-end smoke tests
Makefile # developer tasks
.env.example # environment variables
Rules
- Inner layers (domain, application) do not import outer layers (presentation, infrastructure).
- Repositories and Unit of Work are defined as interfaces (ports) in the domain/application; infrastructure implements them.
- CQRS by default: command handlers do not return domain entities; they return minimal results (IDs, summaries) or nothing. Query handlers return read-optimized DTOs.
- No cross-domain imports except via a public API module per domain.
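The first rule (inner layers never import outer layers) can be checked mechanically in CI. The following is a minimal sketch, assuming the package names from the folder layout above; `find_layer_violations` is a hypothetical helper, not part of any existing tool, and it only looks at absolute imports.

```python
import ast

# Assumed package names, taken from the folder layout in this document.
OUTER_LAYERS = ("app.presentation", "app.infrastructure")
INNER_LAYERS = ("app.domain", "app.application")


def find_layer_violations(module_name: str, source: str) -> list[str]:
    """Return outer-layer modules imported by an inner-layer module."""
    if not module_name.startswith(INNER_LAYERS):
        return []  # outer layers are allowed to import inward
    imported: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported += [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are ignored in this sketch.
            imported.append(node.module)
    return [name for name in imported if name.startswith(OUTER_LAYERS)]
```

A test under tests/unit/ could walk every file in app/domain and app/application and assert that the returned list is empty.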
Example: Unit of Work and Repository Port
# app/application/activities/uow.py
from __future__ import annotations

from typing import TYPE_CHECKING, Protocol

if TYPE_CHECKING:  # imported for type hints only
    from app.domain.activities.entities import Activity


class AbstractActivitiesRepo(Protocol):
    async def add(self, activity: "Activity") -> None: ...
    async def get(self, activity_id: str) -> "Activity | None": ...
    async def list(self, *, cursor: str | None, limit: int) -> tuple[list["Activity"], str | None]: ...


class UnitOfWork(Protocol):
    activities: AbstractActivitiesRepo

    async def __aenter__(self) -> "UnitOfWork": ...
    async def __aexit__(self, exc_type, exc, tb) -> None: ...
    async def commit(self) -> None: ...
    async def rollback(self) -> None: ...
# app/infrastructure/db/uow.py
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

from .repositories import ActivitiesRepo


class SqlAlchemyUoW:
    def __init__(self, session_factory: async_sessionmaker[AsyncSession]):
        self._session_factory = session_factory
        self.session: AsyncSession | None = None
        self.activities: ActivitiesRepo | None = None

    async def __aenter__(self):
        self.session = self._session_factory()
        self.activities = ActivitiesRepo(self.session)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        try:
            # Commit on success, roll back if the block raised.
            if exc_type is not None:
                await self.session.rollback()
            else:
                await self.session.commit()
        finally:
            # Always release the connection, even if commit/rollback fails.
            await self.session.close()

    async def commit(self):
        await self.session.commit()

    async def rollback(self):
        await self.session.rollback()
CQRS Handlers (Commands and Queries)
# app/application/activities/commands.py
from pydantic import BaseModel


class CreateActivity(BaseModel):
    id: str
    title: str
    description: str | None = None

# app/application/activities/command_handlers.py
from .uow import UnitOfWork
from .commands import CreateActivity
from app.domain.activities.entities import Activity


async def handle_create(cmd: CreateActivity, uow: UnitOfWork) -> str:
    activity = Activity(id=cmd.id, title=cmd.title, description=cmd.description)
    async with uow:
        await uow.activities.add(activity)
    return activity.id

# app/application/activities/queries.py
from pydantic import BaseModel


class GetActivity(BaseModel):
    id: str

# app/application/activities/query_handlers.py
from .uow import UnitOfWork
from .queries import GetActivity


async def handle_get(q: GetActivity, uow: UnitOfWork):
    async with uow:
        return await uow.activities.get(q.id)
Code‑First Data (IDs, Constraints, Migrations) and Audit
We manage identifiers and constraints in code. Database defaults are avoided for IDs; instead, we generate them in the application layer to guarantee repeatability across environments and better testability.
Identifier Strategy
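- Default: ULID (Universally Unique Lexicographically Sortable Identifier) generated in code. ULIDs sort lexicographically by creation time at millisecond precision (ordering within the same millisecond is only guaranteed by a monotonic generator), and any node can generate them without coordination, which suits sharded setups.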
- Alternative: UUID v4 (random) or UUID v7 (time-ordered) if your project prefers. The key is: generated in code, not by the database.
# app/domain/shared/ids.py
import ulid


def new_id() -> str:
    return str(ulid.new())
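To show why a ULID fits in a 26-character column and sorts by time, here is a standard-library sketch of the layout: a 48-bit millisecond timestamp followed by 80 random bits, written in Crockford base32. The names `encode_ulid` and `new_id_sketch` are illustrative only; the baseline itself keeps using the `ulid` library shown above.

```python
import os
import time

CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def encode_ulid(timestamp_ms: int, randomness: bytes) -> str:
    """Encode a 48-bit millisecond timestamp and 80 random bits as 26 characters."""
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    if len(randomness) != 10:
        raise ValueError("randomness must be exactly 10 bytes (80 bits)")
    value = (timestamp_ms << 80) | int.from_bytes(randomness, "big")
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


def new_id_sketch() -> str:
    # Not monotonic within one millisecond; random bits decide the order there.
    return encode_ulid(time.time_ns() // 1_000_000, os.urandom(10))
```

Because the timestamp occupies the most significant bits, plain string comparison orders IDs by creation time across different milliseconds.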
SQLAlchemy Model (code-first, no DB-generated IDs)
# app/infrastructure/db/models.py
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column
from sqlalchemy import String, DateTime, Integer, JSON
from datetime import datetime, timezone
from app.domain.shared.ids import new_id


class Base(DeclarativeBase):
    pass


class ActivityModel(Base):
    __tablename__ = "activities"

    id: Mapped[str] = mapped_column(String(26), primary_key=True, default=new_id)  # ULID string
    title: Mapped[str] = mapped_column(String(200))
    description: Mapped[str | None]

    # Audit & concurrency (code-managed)
    created_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    updated_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    created_by: Mapped[str | None] = mapped_column(String(128))
    updated_by: Mapped[str | None] = mapped_column(String(128))
    version: Mapped[int] = mapped_column(Integer, default=1, nullable=False)
Update Hooks (keep timestamps/version in code)
# app/infrastructure/db/repositories.py
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import select
from datetime import datetime, timezone
from .models import ActivityModel


class ActivitiesRepo:
    def __init__(self, session: AsyncSession):
        self.session = session

    async def add(self, a):
        model = ActivityModel(
            id=a.id,
            title=a.title,
            description=a.description,
            created_by=a.audit.actor,
            updated_by=a.audit.actor,
        )
        self.session.add(model)

    async def get(self, activity_id: str):
        res = await self.session.execute(select(ActivityModel).where(ActivityModel.id == activity_id))
        return res.scalar_one_or_none()

    async def touch_update(self, model: ActivityModel, actor: str | None):
        model.updated_at = datetime.now(timezone.utc)
        model.updated_by = actor
        model.version += 1
Audit Log (append-only, code-managed)
We capture who did what and when, and (optionally) before/after snapshots for sensitive entities.
# app/infrastructure/db/models.py (continued)
class AuditLog(Base):
    __tablename__ = "audit_log"

    id: Mapped[str] = mapped_column(String(26), primary_key=True, default=new_id)
    occurred_at: Mapped[datetime] = mapped_column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc), nullable=False)
    actor: Mapped[str | None] = mapped_column(String(128))
    action: Mapped[str] = mapped_column(String(64))  # e.g., "activity.created"
    entity_type: Mapped[str] = mapped_column(String(64))
    entity_id: Mapped[str] = mapped_column(String(64))
    # "metadata" is a reserved attribute name on Declarative classes, so the
    # attribute is called meta and mapped to the "metadata" column.
    meta: Mapped[dict | None] = mapped_column("metadata", JSON)
    before: Mapped[dict | None] = mapped_column(JSON)
    after: Mapped[dict | None] = mapped_column(JSON)
Audit writes are emitted by repositories or the Unit of Work. Prefer code-level capture to database triggers so behavior is explicit, testable, and portable across environments.
Migrations
- Use Alembic autogenerate from SQLAlchemy models; review each generated diff, hand-edit the revision where needed, and keep revisions in version control.
- Commands: make migrate (create revision), make upgrade, make downgrade.
API Design Conventions
- Routing & versioning: Prefix routes with /v1. Group by bounded context (domain).
- Consistency: JSON field naming is snake_case in Python but serialized as camelCase in responses (configure Pydantic aliasing if desired). Pick one and enforce it.
- Error envelope: Return { code, message, details?, requestId } with appropriate HTTP status codes. Map exceptions centrally.
- Pagination: Cursor-based pagination is preferred. Response includes { items, nextCursor }.
- Idempotency: For POST/PUT that create or mutate resources, accept an Idempotency-Key header and de-duplicate on the server.
- Correlation IDs: Generate a requestId per request and log it.
# app/presentation/api/errors.py
from fastapi import Request
from fastapi.responses import JSONResponse


class ApiError(Exception):
    def __init__(self, code: str, message: str, status_code: int = 400, details: dict | None = None):
        self.code = code
        self.message = message
        self.status_code = status_code
        self.details = details or {}


async def api_error_handler(request: Request, exc: ApiError):
    body = {
        "code": exc.code,
        "message": exc.message,
        "details": exc.details,
        "requestId": request.state.request_id,
    }
    return JSONResponse(status_code=exc.status_code, content=body)

# app/presentation/api/deps.py
import uuid
from fastapi import Request


async def request_id_middleware(request: Request, call_next):
    request.state.request_id = uuid.uuid4().hex
    response = await call_next(request)
    response.headers["X-Request-Id"] = request.state.request_id
    return response
Security & Privacy (Defaults)
- Authentication: JSON Web Token (JWT) bearer tokens by default; optional OpenID Connect integration.
- Authorization: Role-Based Access Control helpers. Keep permission checks in application layer handlers.
- CORS: Restrictive by default; allow-list per environment.
- Rate limiting: Redis-backed token bucket per IP and per user.
- Input limits: Request body size limits on server. Timeouts on upstream calls.
- Secrets: Environment variables (development), Fly.io secrets in production; AWS Secrets Manager when on AWS.
- Data protection: Encrypt in transit (TLS) and at rest (Postgres, S3). Redact sensitive fields in logs. Classify personal data; avoid logging it.
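The rate-limiting default above names a token bucket. The sketch below shows the algorithm in process, with an injectable clock so it can be tested; it is an illustration only, and the `TokenBucket` class is hypothetical. The Redis-backed version described above would keep the same two values (tokens and last update time) per IP or user key and update them atomically.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TokenBucket:
    capacity: float                 # maximum burst size
    refill_per_second: float        # sustained request rate
    clock: Callable[[], float] = time.monotonic
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start full
        self.updated_at = self.clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A request that gets `False` would typically be answered with HTTP 429 through the central error envelope.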
Performance & Scalability
- Prefer asynchronous I/O. Size database and HTTP connection pools for expected concurrency.
- Avoid “N+1” queries via eager loading or dedicated read models in query handlers.
- Cache expensive reads (in-process or Redis). Invalidate on write via Unit of Work events.
- Use background workers for heavy tasks; retries with exponential backoff.
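Retries with exponential backoff can be sketched as follows. The defaults (five attempts, 0.5 s base, 30 s cap) and the use of full jitter are assumptions for illustration, and `retry` and `backoff_delays` are hypothetical helpers rather than part of any library used here.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bound of the delay before each retry: base * 2**n, capped."""
    return [min(cap, base * 2**n) for n in range(attempts - 1)]


async def retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return await op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the computed delay.
            await sleep(random.uniform(0, delays[attempt]))
    raise AssertionError("unreachable")
```

In practice you would narrow the `except` clause to errors that are safe to retry (timeouts, connection resets) and only retry idempotent operations.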
Monitoring, Observability & Alerting
Structured Logging
# app/main.py (excerpt)
import structlog
from fastapi import FastAPI

structlog.configure(processors=[
    structlog.processors.TimeStamper(fmt="iso"),
    structlog.processors.add_log_level,
    structlog.processors.JSONRenderer(),
])
logger = structlog.get_logger()
app = FastAPI()


@app.middleware("http")
async def access_log(request, call_next):
    response = await call_next(request)
    logger.info("http_request", method=request.method, path=request.url.path, status=response.status_code, request_id=request.state.request_id)
    return response
Metrics (Prometheus-compatible)
# app/main.py (excerpt)
from prometheus_fastapi_instrumentator import Instrumentator
Instrumentator().instrument(app).expose(app, endpoint="/metrics")
Tracing (OpenTelemetry)
# app/main.py (excerpt)
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
provider = TracerProvider(resource=Resource.create({"service.name": "zebra-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)
Health & Readiness
# app/presentation/api/router_health.py
from fastapi import APIRouter

router = APIRouter()


@router.get("/healthz")
async def liveness():
    return {"status": "ok"}


@router.get("/readyz")
async def readiness():
    # Optionally check DB/connectivity here
    return {"status": "ready"}
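If the readiness endpoint does check dependencies, one way to structure it is to run each check concurrently with a timeout and report per-dependency results. This is a framework-free sketch; `run_readiness_checks` and the check names are hypothetical, and the route handler would translate the returned status code into the HTTP response.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[None]]  # a check passes if it returns without raising


async def run_readiness_checks(checks: dict[str, Check], timeout: float = 1.0) -> tuple[int, dict]:
    """Run all checks concurrently; return (HTTP status, response body)."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout=timeout)
            return "ok"
        except Exception as exc:  # includes timeouts raised by wait_for
            return f"failed: {type(exc).__name__}"

    results = await asyncio.gather(*(run_one(c) for c in checks.values()))
    report = dict(zip(checks, results))
    ready = all(r == "ok" for r in results)
    body = {"status": "ready" if ready else "not_ready", "checks": report}
    return (200 if ready else 503), body
```

Liveness should stay dependency-free so that a database outage marks the instance as not ready without triggering restarts.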
Dashboards & Alerts
- Provide Grafana dashboards: request rate, latency (p50/p95/p99), error rate, DB pool saturation, cache hit rate.
- Alerts: error rate > 1% for 5 minutes, p95 latency above SLO for 10 minutes, readiness failing.
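The "for 5 minutes" part of an alert means the condition must hold across the whole window, not just in one sample. A small sketch of that rule, assuming per-minute (errors, requests) samples; `error_rate_alert` is a hypothetical function used only to make the semantics concrete, since the real rule would live in your alerting system.

```python
def error_rate_alert(
    samples: list[tuple[int, int]],
    threshold: float = 0.01,
    window: int = 5,
) -> bool:
    """samples: per-minute (errors, requests), oldest first.

    Fires only if every one of the last `window` minutes exceeds the threshold.
    """
    if len(samples) < window:
        return False  # not enough history to say the condition was sustained
    recent = samples[-window:]
    return all(total > 0 and errors / total > threshold for errors, total in recent)
```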
Local Development & Environments
- Easy local: docker compose with PostgreSQL, Redis, Mailpit (SMTP testing), and the API.
- Full local: add OpenTelemetry Collector + Grafana/Prometheus for local observability.
- Make targets:
run: ## Run API locally (reload)
	uvicorn app.main:app --reload --port 8000
up: ## Start local stack
	docker compose up -d
down: ## Stop local stack (-v also removes volumes, including local data)
	docker compose down -v
lint:
	ruff check . && black --check .
typecheck:
	mypy app
test:
	pytest -q --maxfail=1 --disable-warnings
migrate:
	alembic revision --autogenerate -m "change"
upgrade:
	alembic upgrade head
downgrade:
	alembic downgrade -1
- .env.example: include database URL, Redis URL, secret keys, tracing exporter URL.
Testing Strategy
- Unit tests: pure domain logic and application handlers (CQRS) with fake repositories.
- Integration tests: SQLAlchemy repositories against a test database (transaction rollbacks).
- API contract tests: Schemathesis against the running FastAPI app.
- End-to-end smoke: start stack with Docker, run a minimal flow.
CI/CD & Release
- Pipeline: lint → typecheck → unit/integration → build image → scan → contract tests → push image → (optional) deploy preview.
- Version images and migrations together; keep migrations additive and reversible where possible.
Infrastructure & Hosting
Principles
- Portability first: Everything runs in a container image; configuration comes from environment variables; no provider‑specific code in the domain or application layers. Observability uses open standards (OpenTelemetry for traces/metrics, JSON logs), so it works on Fly.io and on Amazon Web Services (AWS) the same way.
- Security by default: Private networking, least‑privilege identities, encrypted data at rest and in transit, minimal surface on the public internet, and immutable images built in Continuous Integration (CI) and deployed by Continuous Delivery (CD).
- Infrastructure as Code (IaC): All cloud resources are defined in Terraform and versioned. Remote state uses Amazon Simple Storage Service (S3) with state locking in Amazon DynamoDB.
- Environments: dev, stage, and prod are separate stacks with identical modules and different parameters.
Fly.io (Default)
- What we run: one container for the FastAPI app behind Fly’s Anycast edge; optional Fly Postgres. Health checks pin to /readyz and /healthz.
- Secrets: Fly secrets and environment variables; rotate on deploys.
- Observability: same app image exports Prometheus metrics at /metrics and traces via OpenTelemetry (OTLP) to a collector you run locally or remotely.
- Scaling: per‑region horizontal scaling + memory/CPU sizing; keep the number of regions small unless latency needs demand wider distribution.
AWS (Alternative, Terraform)
- Default choice: Amazon Elastic Container Service (ECS) on AWS Fargate (serverless containers).
- Simpler alternative: AWS App Runner (managed HTTP service). Fine for small apps; fewer knobs, faster to stand up.
- Not default: Amazon Elastic Kubernetes Service (EKS). Use it only when you need Kubernetes explicitly.
Reference Architecture (ECS Fargate)
- Networking: one Virtual Private Cloud (VPC), at least two Availability Zones. Public subnets for the load balancer, private subnets for tasks and databases. Network Address Translation (NAT) gateways for egress to the internet.
- Ingress: Application Load Balancer (ALB) with TLS certificates from AWS Certificate Manager (ACM). Optional AWS Web Application Firewall (WAF) for Layer‑7 protections.
- Compute: ECS service on Fargate with Auto Scaling (CPU/memory metrics and optionally request rate). Task definitions run the API container and (optionally) a sidecar OpenTelemetry Collector.
- Data: Amazon Relational Database Service (RDS) for PostgreSQL (Multi‑AZ), Amazon ElastiCache for Redis, Amazon S3 for object storage.
- Identity & Secrets: AWS Identity and Access Management (IAM) roles for tasks (least privilege). Secrets in AWS Secrets Manager. Keys in AWS Key Management Service (KMS) for encryption at rest.
- Observability: Logs to Amazon CloudWatch Logs; metrics via Prometheus scraping (self‑managed in ECS) or Amazon Managed Service for Prometheus; dashboards in Grafana (self‑hosted or Amazon Managed Grafana). Traces exported via OTLP to your collector or AWS Distro for OpenTelemetry.
- DNS: Amazon Route 53. Optionally use weighted records for canary/blue‑green.
Terraform Layout
infra/aws/terraform/
  modules/
    network/ # VPC, subnets, NAT, routing
    alb/ # ALB, listeners, target groups, WAF (optional)
    acm/ # TLS certificates
    ecs_cluster/ # ECS cluster + capacity providers (Fargate)
    ecs_service/ # Task definition, service, autoscaling
    rds_postgres/ # PostgreSQL (Multi‑AZ), parameter groups, security groups
    redis/ # ElastiCache Redis
    s3_bucket/ # App buckets (e.g., uploads), versioning, lifecycle
    secrets/ # Secrets Manager entries and IAM policies
    otel/ # OpenTelemetry collector task and IAM
    dns/ # Route 53 zone/records
  envs/
    dev/
      main.tf variables.tf outputs.tf backend.tf
    stage/
      ...
    prod/
      ...
  providers.tf
  versions.tf
Security Defaults (AWS)
- Network isolation: Application Load Balancer is public; ECS tasks, RDS, and Redis are private. Only the load balancer can reach the tasks; only tasks can reach the database and Redis.
- Encryption: TLS via ACM on the load balancer; at rest with KMS for RDS, ElastiCache, and S3. Force TLS for client connections to RDS/Redis.
- Identity: Separate IAM roles for the task execution (pulling images/secrets) and for the application (only the permissions it needs). No access keys in containers.
- Secrets: Store in Secrets Manager; inject via ECS task definitions. Rotate DB credentials and JWT secrets on a schedule.
- Containers: Run as non‑root, read‑only root filesystem, drop Linux capabilities, no privileged mode. Scan images in Amazon Elastic Container Registry (ECR) and in CI (for example, Grype/Trivy). Optionally sign images (Cosign) and verify in the pipeline.
- Edge protections: Optional AWS WAF rules for common attacks; AWS Shield Standard provides basic Distributed Denial of Service (DDoS) protection.
- Cost awareness: Network Address Translation gateways are billed per hour + data; prefer VPC endpoints for S3/ECR to reduce egress.
Observability on AWS
- Logs: ship to CloudWatch Logs with a retention policy; include requestId, user/tenant (when present), latency, status.
- Metrics: scrape /metrics with a Prometheus server in ECS or use Amazon Managed Service for Prometheus; alert from latency percentiles and error rates.
- Traces: export OTLP to a collector; store and explore with AWS X‑Ray or any vendor supporting OTLP.
Migration: Fly.io → AWS
Goals
- Security: improve isolation (private subnets), managed secrets, encryption, and per‑service identities.
- Portability: keep the application unchanged — only environment and infrastructure differ. Observability remains standards‑based.
- Predictability: rehearsed cutover with rollback.
Step 1 — Discovery & Readiness
- Inventory services, environment variables, secrets, database schemas, object storage, background jobs, cron, ingress/egress rules, and external integrations (allowlists, webhooks).
- Confirm the application generates identifiers in code and not in the database (our default) to simplify migration.
Step 2 — Stand Up AWS Staging with Terraform
- Create VPC, subnets, NAT, security groups, ECS, RDS PostgreSQL, ElastiCache Redis, S3, Secrets Manager, Route 53, and observability stack using the modules above.
- Deploy the same container image to staging (push to ECR). Verify /healthz, /readyz, /metrics, OpenAPI, and logging/tracing.
Step 3 — Data Migration (Database)
Option A: Near‑zero downtime with AWS Database Migration Service (DMS)
- Provision RDS in Multi‑AZ.
- Use DMS to perform full load then Change Data Capture (CDC) from Fly Postgres to RDS.
- Validate row counts and checksums; point a read‑only instance of the application at RDS to verify behavior before cutover.
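The row-count and checksum validation can be sketched as a per-table fingerprint computed on both sides and compared. This is an illustration only: `table_fingerprint` is a hypothetical helper, it hashes the Python `repr` of each row, and it therefore assumes both databases are read with the same driver and column order so that values render identically.

```python
import hashlib
from typing import Iterable


def table_fingerprint(rows: Iterable[tuple]) -> tuple[int, str]:
    """Return (row count, order-independent checksum) for a table's rows."""
    count = 0
    combined = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        # Summing modulo 2**256 makes the result independent of row order.
        combined = (combined + int.from_bytes(digest, "big")) % 2**256
        count += 1
    return count, f"{combined:064x}"
```

Running this against the same query on Fly Postgres and on RDS and comparing the two tuples gives a quick signal; for large tables, fingerprint by primary-key range so a mismatch can be narrowed down.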
Option B: Planned downtime with pg_dump / pg_restore
- Quiesce writes on Fly (maintenance mode or feature flag to block mutations).
- Dump from Fly Postgres with pg_dump --format=custom.
- Restore to RDS; run migrations; verify.
Step 4 — Object Storage & Files
- If using Fly volumes/local disk: sync to S3 (for example, rclone or a one‑off container job). Verify checksums and metadata (content‑type, cache‑control).
Step 5 — Secrets & Configuration
- Create Secrets Manager entries. Update ECS task definition environment variables and secret refs. Rotate keys on cutover.
Step 6 — Traffic Cutover
- Bring up production ECS service behind ALB. Warm tasks.
- Use Route 53 weighted DNS: start with a small percentage to AWS (for example, 5%), monitor latency/error budgets, then ramp to 100%.
- Alternatively, use blue‑green with two target groups on the ALB and swap.
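The weighted ramp can be expressed as a small decision function: advance to the next weight while the error rate stays within budget, otherwise send all traffic back to Fly. Only the 5% starting point comes from the text above; the remaining steps, the 1% error threshold, and the name `next_weight` are assumptions for illustration.

```python
def next_weight(
    current: int,
    error_rate: float,
    *,
    steps: tuple[int, ...] = (5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Return the next percentage of traffic to route to AWS.

    Falls back to 0 (all traffic on Fly) when the error budget is exceeded.
    """
    if error_rate > max_error_rate:
        return 0
    for step in steps:
        if step > current:
            return step
    return current  # already at the final step
```

An operator or pipeline job would call this after each observation window and apply the result to the Route 53 weighted records.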
Step 7 — Finalize & Rollback Plan
- After stable period, decommission DMS or tear down Fly Postgres.
- Keep the Fly deployment hot for fast rollback. Rollback path: send DNS back to Fly. Note: if writes occurred on AWS, rolling back data requires either dual‑write during ramp‑up or accepting data loss up to the last verified sync. Prefer short “read‑only” windows during cutover to avoid divergence.
Security Checklist for Migration
- ✅ Private subnets for tasks and databases; only ALB is public.
- ✅ TLS everywhere; enforce SSL on Postgres (rds.force_ssl on RDS for PostgreSQL; sslmode=require or stricter on clients).
- ✅ Least‑privilege IAM roles; no static credentials in containers.
- ✅ Secrets in Secrets Manager with rotation policies.
- ✅ KMS encryption for RDS/ElastiCache/S3; S3 buckets private + least‑privilege access.
- ✅ CloudWatch alarms on p95 latency and error rate; health checks on ALB.
- ✅ Web Application Firewall rules on ALB (rate limit, common exploit rules) if exposure warrants it.
Portability Guardrails
- Keep provider specifics in the infrastructure layer (Terraform modules, task environment).
- Use environment variables for configuration; avoid hardcoding AWS resource identifiers in application code.
- Telemetry through OpenTelemetry (OTLP), not provider‑specific SDKs.
- Object storage via S3 API; use MinIO locally to stay portable.
- Image & registry: push to Amazon Elastic Container Registry (ECR); update deploy workflow.
- Environment parity: copy secrets and environment variables; verify CORS, origins, and DNS.
- Database move: either logical replication to Amazon RDS or pg_dump/pg_restore with planned downtime.
- Object storage: MinIO/Fly volumes → S3 sync.
- DNS cutover: Route 53 with health checks and quick rollback; keep old stack hot until verified.
Coming soon — detailed cutover runbook with commands
This section is being polished.
Final Thoughts
This baseline is designed for real-world delivery: clear domain boundaries, code-first data and migrations, CQRS for clarity and scalability, and strong operational visibility. Deviations are welcome when reasoned—and improvements are expected as we learn.