Operator guide
Daily ops for whoever owns the Erold deployment. Pairs with EROLD.md (overview) + erold-backend/DEPLOY.md (deploy runbook).
Live endpoints + IDs
- Backend (prod): https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud (also https://api.erold.dev once the CNAME is in)
- Scaleway project: 4393b09f-8abe-4336-acae-4870da6033db
- Project DB (prod): ccd7c08f-8ac3-4572-84cf-01a840434940 (db-gp-xs, HA, fr-par-1)
- Project DB (dev): d6560b85-29b8-48e8-ac9e-47f3eb346c49 (db-dev-s, fr-par-1)
- Container prod: d03ae584-eb78-4179-b16c-06e673a902fc (min=1, max=5)
- Container staging: f6dff64e-d5f8-4a16-b497-fed1999a7df5 (min=0, max=2)
- Registry: rg.fr-par.scw.cloud/erold
- Gitea: https://git.chut.me/sid/erold-backend
- State-id files (gitignored) at erold-backend/infra/.{project,vpc,pn,rdb-{dev,prod},registry-ns,container-{ns,prod,staging}}-id.
Daily health checks

    # Is prod responding?
    curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health | jq .

    # Container status
    scw container container get d03ae584-eb78-4179-b16c-06e673a902fc region=fr-par -o json \
      | jq '{status, error_message, registry_image}'

    # DB instance status
    scw rdb instance list region=fr-par -o json | jq '.[] | select(.tags[]? == "project=erold") | {name, status}'

Expected: /health reports status=ok, the container reports status=ready, and the RDB instances report ready.
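
The three checks above can be folded into a single pass/fail result. The following is a minimal sketch, not part of the Erold tooling: the function names are hypothetical, and it assumes /health returns a JSON body with a top-level "status" string, which the expected values suggest but the source does not spell out.

```shell
# Return 0 when a /health response body reports status "ok".
# Assumption: the body is JSON with a "status" string field.
health_is_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Return 0 when a resource status string equals "ready".
status_is_ready() {
  [ "$1" = "ready" ]
}

# Combine the three results: print OK or one FAIL line per failed check.
# $1 = /health body, $2 = container status, $3 = DB instance status
daily_check() {
  rc=0
  health_is_ok "$1"    || { echo "FAIL health: $1"; rc=1; }
  status_is_ready "$2" || { echo "FAIL container: $2"; rc=1; }
  status_is_ready "$3" || { echo "FAIL db: $3"; rc=1; }
  [ "$rc" -eq 0 ] && echo "OK"
  return "$rc"
}
```

The arguments would come from the curl and scw commands above, for example by piping each through jq -r '.status' before calling daily_check.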
Logs (Scaleway Cockpit / Loki)

    # One-time: create a Cockpit read-token (cached after first run)
    secret has erold.cockpit.token || {
      scw cockpit token create name=erold-ops \
        "token-scopes.0=read_only_logs" \
        project-id="$(cat erold-backend/infra/.project-id)" \
        -o json | jq -r '.secret_key' | secret set erold.cockpit.token
    }

    # Query last 10 min of prod container logs
    TOKEN=$(secret get erold.cockpit.token)
    DS=$(scw cockpit data-source list project-id="$(cat erold-backend/infra/.project-id)" -o json \
      | jq -r '.[] | select(.type=="logs") | .url')
    END=$(date +%s); START=$((END-600))
    curl -s -G "$DS/loki/api/v1/query_range" -H "X-Token: $TOKEN" \
      --data-urlencode 'query={resource_type="serverless_container",resource_id="d03ae584-eb78-4179-b16c-06e673a902fc"}' \
      --data-urlencode "start=${START}000000000" --data-urlencode "end=${END}000000000" \
      --data-urlencode "limit=50" \
      | jq -r '.data.result[].values[][1]' | tail -50

Cost dashboard
| Line item | Monthly |
|---|---|
| RDB prod db-gp-xs HA | ~€25 |
| RDB dev db-dev-s | ~€6 |
| Serverless Container prod (min=1) | ~€14 |
| Serverless Container staging (min=0, idle) | ~€2 (only when scaled up) |
| Container Registry erold (~5 GB) | ~€1 |
| Object Storage (3 buckets, < 10 GB) | ~€2 |
| Secret Manager (3 secrets) | ~€1 |
| Total | ~€51 |
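
To confirm that the Total row still matches the line items after an edit, the table can be summed mechanically. This is a rough sketch with a hypothetical helper name, assuming each row keeps the pipe layout used here and that the first integer in the Monthly cell is the amount in euros.

```shell
# Sum the Monthly column of markdown table rows read from stdin.
sum_monthly() {
  awk -F'|' '
    # Skip the Total row so it is not counted twice.
    $2 ~ /Total/ { next }
    # Take the first integer in the Monthly cell; rows without one are ignored.
    match($3, /[0-9]+/) { sum += substr($3, RSTART, RLENGTH) }
    END { print sum + 0 }
  '
}

sum_monthly <<'EOF'
| Line item | Monthly |
|---|---|
| RDB prod db-gp-xs HA | ~€25 |
| RDB dev db-dev-s | ~€6 |
| Serverless Container prod (min=1) | ~€14 |
| Serverless Container staging (min=0, idle) | ~€2 (only when scaled up) |
| Container Registry erold (~5 GB) | ~€1 |
| Object Storage (3 buckets, < 10 GB) | ~€2 |
| Secret Manager (3 secrets) | ~€1 |
| Total | ~€51 |
EOF
```

With the rows above, the line items add up to 25 + 6 + 14 + 2 + 1 + 2 + 1 = 51, which agrees with the Total row.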
Scaling levers:
- Container min_scale=1 keeps prod warm (~€14). Setting min=0 saves the €14 but adds ~5 s cold-start to the first request after idle.
- Embedding compute (OpenAI) scales with fragment ingestion. Monitor via the OpenAI dashboard; switch to a Scaleway-hosted model if the monthly cost exceeds €40.
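
The two levers reduce to simple arithmetic. The sketch below uses the figures from this section; the function and variable names are hypothetical, and amounts are whole euros.

```shell
TOTAL_EUR=51            # Total row of the cost table
WARM_PROD_EUR=14        # cost of keeping prod at min_scale=1
EMBEDDING_LIMIT_EUR=40  # monthly OpenAI spend that triggers the switch

# Projected monthly total if prod is allowed to scale to zero.
total_with_cold_prod() {
  echo $((TOTAL_EUR - WARM_PROD_EUR))
}

# Suggest an action for a given monthly embedding spend ($1, whole euros).
# "Exceeds" is read as strictly greater than the limit.
embedding_action() {
  if [ "$1" -gt "$EMBEDDING_LIMIT_EUR" ]; then
    echo "switch-to-scaleway-hosted"
  else
    echo "keep-openai"
  fi
}
```

Here total_with_cold_prod prints 37, the price of accepting the ~5 s cold start.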
Rotation cadence
Rotate high-value secrets every 90 days; set a calendar reminder.

    # JWT signing key
    openssl rand -hex 64 | secret set erold.prod.jwt-signing-key
    (cd erold-backend && bash infra/scripts/06-secret-manager.sh --update)
    # Then update container env to fetch the new SM version.
    # Caution: environment-variables.* replaces the whole env map, so pass the
    # other six variables in the same call (see "Push a new backend version").
    scw container container update d03ae584-... \
      region=fr-par "environment-variables.JWT_SIGNING_KEY=$(secret get erold.prod.jwt-signing-key)"
    scw container container deploy d03ae584-... region=fr-par

    # OpenAI API key: rotate at platform.openai.com first, then:
    echo 'sk-NEW_KEY' | secret set erold.prod.openai-api-key
    (cd erold-backend && bash infra/scripts/06-secret-manager.sh --update)

    # Scaleway IAM keys (5 apps)
    (cd erold-backend && bash infra/scripts/01-iam.sh --rotate)

    # Erold API keys per tenant: UI at app.erold.dev → Settings → API Keys

Where the data lives
| What | Where |
|---|---|
| Raw events (append-only) | events table, prod RDB. Compressed to fragments by outbox worker. |
| Fragments + embeddings | fragments table (pgvector 0.8.2). HNSW index on (embedding) WHERE embedding_status='embedded'. |
| Tasks / Bugs / Deploys / Decisions / CredentialRefs | Separate tables, RLS per tenant. Soft-delete (deleted_at) on Tasks/Bugs; immutable Deploys + Decisions. |
| Local plugin outbox | ~/.erold/outbox/events.jsonl (each developer). Daemon flushes async. |
| Dead-letter | ~/.erold/outbox/dead-letter.jsonl (events that hit 4xx or exceeded retry budget). |
| Object Storage | erold-raw-events (30-day lifecycle), erold-deploy-logs (90-day), erold-attachments (app-managed). |
| Secrets | Scaleway SM erold-prod-{database-url,jwt-signing-key,openai-api-key}; macOS Keychain erold.{dev,prod}.* for operator use. |
Common operations
Push a new backend version

    cd erold-backend
    # 1. Edit code, commit
    git add -A && git commit -m "feat: …" && git push
    # 2. Build amd64 image + push to registry
    docker buildx build --platform=linux/amd64 --build-arg GIT_SHA="$(git rev-parse --short HEAD)" \
      -t rg.fr-par.scw.cloud/erold/api:v0.1.X --push .
    # 3. Roll prod
    scw container container update d03ae584-eb78-4179-b16c-06e673a902fc \
      region=fr-par registry-image=rg.fr-par.scw.cloud/erold/api:v0.1.X
    scw container container deploy d03ae584-eb78-4179-b16c-06e673a902fc region=fr-par
    # 4. Verify SHA
    curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health | jq '.sha'

Note: update environment-variables.* REPLACES the env map. To change one var, pass them all (DATABASE_URL, JWT_SIGNING_KEY, OPENAI_API_KEY, APP_ENV, SCW_REGION, SCW_PROJECT_ID, SECRET_MANAGER_ENDPOINT).
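
Because a partial map silently drops the other variables, it can help to build the full argument list in one place and refuse to proceed if anything is missing. The following is a sketch under that assumption; build_env_args is a hypothetical helper, not an existing script in erold-backend, and it expects the seven values to be present as shell variables.

```shell
# The seven variables named in the note above.
REQUIRED_ENV_VARS="DATABASE_URL JWT_SIGNING_KEY OPENAI_API_KEY APP_ENV SCW_REGION SCW_PROJECT_ID SECRET_MANAGER_ENDPOINT"

# Print one environment-variables.NAME=value argument per line.
# Returns non-zero and names the missing variables if any is unset or empty,
# so a partial env map is never produced.
build_env_args() {
  missing=""
  for name in $REQUIRED_ENV_VARS; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing env vars:$missing" >&2
    return 1
  fi
  for name in $REQUIRED_ENV_VARS; do
    eval "value=\${$name}"
    printf 'environment-variables.%s=%s\n' "$name" "$value"
  done
}
```

Each output line is meant to be passed as one argument to the update call; values containing newlines are not handled by this sketch.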
Add a new tenant

    -- Provision a temp ACL + LB endpoint on the prod DB, then insert the tenant.
    -- See erold-backend/DEPLOY.md §"Local image smoke test" for the open/close pattern.
    -- Then via psql or the admin UI:
    INSERT INTO tenants (id, name) VALUES (gen_random_uuid(), 'new-tenant-name');
    -- Generate an API key for them (via the admin UI, or insert directly into api_keys).

Rollback prod

    # Show the currently deployed tag, then pick the previous good one
    scw container container get d03ae584-... region=fr-par -o json | jq -r '.registry_image'
    # Roll back
    scw container container update d03ae584-... region=fr-par \
      registry-image=rg.fr-par.scw.cloud/erold/api:v0.1.0
    scw container container deploy d03ae584-... region=fr-par

Inspect outbox state (developer side)

    wc -l ~/.erold/outbox/events.jsonl       # queued, not yet flushed
    wc -l ~/.erold/outbox/dead-letter.jsonl  # permanently failed (4xx / retry-exhausted)
    tail -n 20 ~/.erold/error.log            # transmission errors (host, status, exit code)

Monitoring + alerting (recommended but not yet wired)
- Container health: Scaleway Cockpit auto-collects HTTP status codes. Add an alert if 5xx > 5/min for 5 min.
- DB connections: alert if the connection count in pg_stat_activity exceeds 80% of max_connections.
- Outbox lag: alert if a fragment's created_at is more than 60 s old while embedding_status = 'pending' (worker is stuck).
- Dead-letter rate: alert if events_outbox WHERE status='dead' grows by > 10/hour.
- Cross-tenant leak: weekly synthetic check in which tenant B searches for a tenant A marker and must get 0 results (the test exists in erold-backend/tests/integration/test_tenancy.py).
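
Since none of these alerts is wired yet, the thresholds can be written down as small predicates first. This is only a sketch: the function names are hypothetical, the inputs are assumed to be collected elsewhere (for example from pg_stat_activity or a count on events_outbox), and each function returns 0 when the alert should fire.

```shell
# DB connections: fire when current exceeds 80% of max_connections.
# Integer arithmetic: current/max > 0.8 is the same as current*100 > max*80.
# $1 = current connections, $2 = max_connections
db_connections_alert() {
  [ $(( $1 * 100 )) -gt $(( $2 * 80 )) ]
}

# Outbox lag: fire when a pending fragment is older than 60 seconds.
# $1 = created_at as epoch seconds, $2 = now as epoch seconds
outbox_lag_alert() {
  [ $(( $2 - $1 )) -gt 60 ]
}

# Dead-letter rate: fire when the count grew by more than 10 in the last hour.
# $1 = count one hour ago, $2 = count now
dead_letter_alert() {
  [ $(( $2 - $1 )) -gt 10 ]
}
```

For example, db_connections_alert 81 100 fires, while db_connections_alert 80 100 does not, because 80% is the limit rather than a breach.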
When things break
- Container in error state. Pull logs via Cockpit (recipe above). Most common cause: an env var dropped during an update call (the update replaces the env map). Re-set all 7 env vars together.
- /health returns 500. The DB is unreachable. Check the ACL list, the private-network attachment, and the status reported by scw rdb instance get.
- Slow search. Compare the count of embedding_status='embedded' against 'pending'. If pending is large, the embedding worker is stalled, usually because of an OpenAI 429 or an expired key. Rotate the key + redeploy.
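
For the slow-search case, the two counts are easier to compare as a single share. The helper below is a hypothetical sketch; the counts are assumed to come from a query on the fragments table grouped by embedding_status, and this guide does not define what share counts as large.

```shell
# Percentage of fragments still pending, as a whole number from 0 to 100.
# $1 = pending count, $2 = embedded count
embedding_backlog_pct() {
  total=$(( $1 + $2 ))
  if [ "$total" -eq 0 ]; then
    # No fragments at all: report an empty backlog instead of dividing by zero.
    echo 0
  else
    echo $(( $1 * 100 / total ))
  fi
}
```

With 25 pending and 75 embedded fragments, embedding_backlog_pct 25 75 prints 25.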
Pointers
- Plan + decisions: plans/plugin-mcp-v3.md
- Backend deploy + smoke test: erold-backend/DEPLOY.md
- Infrastructure scripts: erold-backend/infra/README.md
- Cost forecast (live): erold-backend/infra/COSTS.md
- Validation checklist: erold-backend/infra/VERIFY.md