Operator guide

Daily ops for whoever owns the Erold deployment. Pairs with EROLD.md (overview) + erold-backend/DEPLOY.md (deploy runbook).

Backend (prod): https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud
(also https://api.erold.dev once the CNAME is in)
Scaleway project: 4393b09f-8abe-4336-acae-4870da6033db
Project DB (prod): ccd7c08f-8ac3-4572-84cf-01a840434940 (db-gp-xs, HA, fr-par-1)
Project DB (dev): d6560b85-29b8-48e8-ac9e-47f3eb346c49 (db-dev-s, fr-par-1)
Container prod: d03ae584-eb78-4179-b16c-06e673a902fc (min=1, max=5)
Container staging: f6dff64e-d5f8-4a16-b497-fed1999a7df5 (min=0, max=2)
Registry: rg.fr-par.scw.cloud/erold
Gitea: https://git.chut.me/sid/erold-backend

State-id files (gitignored) at erold-backend/infra/.{project,vpc,pn,rdb-{dev,prod},registry-ns,container-{ns,prod,staging}}-id.

# Is prod responding?
curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health | jq .
# Container status
scw container container get d03ae584-eb78-4179-b16c-06e673a902fc region=fr-par -o json \
| jq '{status, error_message, registry_image}'
# DB instance status
scw rdb instance list region=fr-par -o json | jq '.[] | select(.tags[]? == "project=erold") | {name, status}'

Expected: /health reports status=ok, the container reports status=ready, and every RDB instance reports status ready.
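Right after a deploy or a cold start, a single probe can fail even though prod is fine. A minimal sketch of a retry wrapper for the checks above; the `retry` name, the attempt count, and the delay are illustrative assumptions, not part of the Erold tooling.

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Hypothetical helper; not part of the Erold repo.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "retry: '$*' still failing after $attempts attempts" >&2
  return 1
}

# Example (not executed here): probe prod /health up to 5 times, 3 s apart.
# retry 5 3 curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health
```

The wrapper only looks at the exit code, so it pairs naturally with `curl -f`, which exits non-zero on HTTP errors.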

# One-time: create a Cockpit read-token (cached after first run)
secret has erold.cockpit.token || {
scw cockpit token create name=erold-ops \
"token-scopes.0=read_only_logs" \
project-id="$(cat erold-backend/infra/.project-id)" \
-o json | jq -r '.secret_key' | secret set erold.cockpit.token
}
# Query last 10 min of prod container logs
TOKEN=$(secret get erold.cockpit.token)
DS=$(scw cockpit data-source list project-id="$(cat erold-backend/infra/.project-id)" -o json \
| jq -r '.[] | select(.type=="logs") | .url')
END=$(date +%s); START=$((END-600))
curl -s -G "$DS/loki/api/v1/query_range" -H "X-Token: $TOKEN" \
--data-urlencode 'query={resource_type="serverless_container",resource_id="d03ae584-eb78-4179-b16c-06e673a902fc"}' \
--data-urlencode "start=${START}000000000" --data-urlencode "end=${END}000000000" \
--data-urlencode "limit=50" \
| jq -r '.data.result[].values[][1]' | tail -50
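The log recipe above hard-codes a 10-minute window and builds the nanosecond timestamps inline. A small sketch that factors the window arithmetic into a function so other ranges are easy to query; `loki_window` is a hypothetical name introduced here.

```shell
# loki_window <minutes> [end_epoch_seconds]
# Prints "<start_ns> <end_ns>" for the start/end parameters of query_range.
# Hypothetical helper; not part of the Erold repo.
loki_window() {
  local minutes="$1"
  local end="${2:-$(date +%s)}"
  local start="$((end - minutes * 60))"
  # The recipe sends nanosecond timestamps: epoch seconds followed by nine zeros.
  printf '%s000000000 %s000000000\n' "$start" "$end"
}

# Example: a 30-minute window ending now.
window="$(loki_window 30)"
START_NS="${window% *}"
END_NS="${window#* }"
echo "start=$START_NS end=$END_NS"
```

The two values can be passed to the `--data-urlencode "start=..."` and `--data-urlencode "end=..."` arguments of the curl call in place of `${START}000000000` and `${END}000000000`.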
| Line item | Monthly |
|---|---|
| RDB prod db-gp-xs HA | ~€25 |
| RDB dev db-dev-s | ~€6 |
| Serverless Container prod (min=1) | ~€14 |
| Serverless Container staging (min=0, idle) | ~€2 (only when scaled up) |
| Container Registry erold (~5 GB) | ~€1 |
| Object Storage (3 buckets, < 10 GB) | ~€2 |
| Secret Manager (3 secrets) | ~€1 |
| Total | ~€51 |

Scaling levers:

  • Container min_scale=1 keeps prod warm (~€14). Setting min=0 saves the €14 but adds ~5 s cold-start to the first request after idle.
  • Embedding compute (OpenAI) scales with fragment ingestion. Monitor it via the OpenAI dashboard; switch to a Scaleway-hosted model if monthly spend exceeds €40.
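To make the min_scale trade-off concrete, here is a rough sketch that sums the line items from the cost table with prod either warm or scaled to zero. The figures are the approximate monthly values listed above; `monthly_total` is a hypothetical name, and the model assumes the full ~€14 disappears at min=0, as the first lever states.

```shell
# monthly_total <prod_min_scale>
# Sums the cost-table line items (approximate whole euros per month).
# Hypothetical helper for illustration only.
monthly_total() {
  local prod_min_scale="$1"
  local rdb_prod=25 rdb_dev=6 container_staging=2
  local registry=1 object_storage=2 secret_manager=1
  # The ~14 EUR warm-instance cost only applies while prod keeps min_scale >= 1.
  local container_prod=14
  if [ "$prod_min_scale" -eq 0 ]; then
    container_prod=0
  fi
  echo $((rdb_prod + rdb_dev + container_prod + container_staging + registry + object_storage + secret_manager))
}

echo "min_scale=1: ~EUR $(monthly_total 1)/month"
echo "min_scale=0: ~EUR $(monthly_total 0)/month"
```

OpenAI embedding spend is billed separately and is not part of this sum.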

Rotate high-value secrets every 90 days; set a calendar reminder.

# JWT signing key
openssl rand -hex 64 | secret set erold.prod.jwt-signing-key
cd erold-backend && bash infra/scripts/06-secret-manager.sh --update
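# Then update container env to fetch the new SM version.
# Caution: update replaces the whole env map, so pass all 7 variables in this call, not only JWT_SIGNING_KEY.
scw container container update d03ae584-... \
region=fr-par "environment-variables.JWT_SIGNING_KEY=$(secret get erold.prod.jwt-signing-key)"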
scw container container deploy d03ae584-... region=fr-par
# OpenAI API key — rotate at platform.openai.com first, then:
echo 'sk-NEW_KEY' | secret set erold.prod.openai-api-key
cd erold-backend && bash infra/scripts/06-secret-manager.sh --update
# Scaleway IAM keys (5 apps)
cd erold-backend && bash infra/scripts/01-iam.sh --rotate
# Erold API keys per tenant — UI at app.erold.dev → Settings → API Keys
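A calendar reminder is easy to miss, so a scripted check can back it up. The sketch below decides whether a secret has passed the 90-day mark from the epoch timestamp of its last rotation. How that timestamp is recorded is an assumption; the file path in the example is hypothetical and nothing in the rotation commands above writes it.

```shell
# rotation_due <last_rotated_epoch> [now_epoch] [max_age_days]
# Exit 0 if the secret is at least max_age_days old (default 90), else exit 1.
# Hypothetical helper; not part of the Erold repo.
rotation_due() {
  local last="$1"
  local now="${2:-$(date +%s)}"
  local max_days="${3:-90}"
  local age_days="$(( (now - last) / 86400 ))"
  [ "$age_days" -ge "$max_days" ]
}

# Example (not executed here), with a hypothetical timestamp file written at rotation time:
# date +%s > ~/.erold/jwt-rotated-at
# rotation_due "$(cat ~/.erold/jwt-rotated-at)" && echo "JWT signing key is due for rotation"
```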
| What | Where |
|---|---|
| Raw events (append-only) | events table, prod RDB. Compressed to fragments by outbox worker. |
| Fragments + embeddings | fragments table (pgvector 0.8.2). HNSW index on (embedding) WHERE embedding_status='embedded'. |
| Tasks / Bugs / Deploys / Decisions / CredentialRefs | Separate tables, RLS per tenant. Soft-delete (deleted_at) on Tasks/Bugs; immutable Deploys + Decisions. |
| Local plugin outbox | ~/.erold/outbox/events.jsonl (each developer). Daemon flushes async. |
| Dead-letter | ~/.erold/outbox/dead-letter.jsonl (events that hit 4xx or exceeded retry budget). |
| Object Storage | erold-raw-events (30-day lifecycle), erold-deploy-logs (90-day), erold-attachments (app-managed). |
| Secrets | Scaleway SM erold-prod-{database-url,jwt-signing-key,openai-api-key}; macOS Keychain erold.{dev,prod}.* for operator use. |

# Ship a new backend version to prod
cd erold-backend
# 1. Edit code, commit
git add -A && git commit -m "feat: …" && git push
# 2. Build amd64 image + push to registry
docker buildx build --platform=linux/amd64 --build-arg GIT_SHA=$(git rev-parse --short HEAD) \
-t rg.fr-par.scw.cloud/erold/api:v0.1.X --push .
# 3. Roll prod
scw container container update d03ae584-eb78-4179-b16c-06e673a902fc \
region=fr-par registry-image=rg.fr-par.scw.cloud/erold/api:v0.1.X
scw container container deploy d03ae584-eb78-4179-b16c-06e673a902fc region=fr-par
# 4. Verify SHA
curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health | jq '.sha'
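Step 4 prints the deployed SHA but leaves the comparison to the eye. A minimal sketch of that comparison, assuming /health returns the value passed as GIT_SHA at build time; `sha_matches` is a hypothetical name. It accepts a short and a full SHA as equal when one is a prefix of the other.

```shell
# sha_matches <deployed_sha> <local_sha>
# True when either value is a prefix of the other, so a short SHA from
# `git rev-parse --short HEAD` compares equal to a full 40-character SHA.
# Hypothetical helper; not part of the Erold repo.
sha_matches() {
  local deployed="$1" local_sha="$2"
  # Empty values never match (for example when /health could not be fetched).
  [ -n "$deployed" ] && [ -n "$local_sha" ] || return 1
  case "$deployed" in "$local_sha"*) return 0 ;; esac
  case "$local_sha" in "$deployed"*) return 0 ;; esac
  return 1
}

# Example (not executed here):
# deployed=$(curl -fs https://eroldapi2093acff-erold-api-prod.functions.fnc.fr-par.scw.cloud/health | jq -r '.sha')
# sha_matches "$deployed" "$(git rev-parse --short HEAD)" || echo "prod is not running HEAD" >&2
```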

Note: passing environment-variables.* to scw container container update REPLACES the whole env map. To change one var, pass all seven (DATABASE_URL, JWT_SIGNING_KEY, OPENAI_API_KEY, APP_ENV, SCW_REGION, SCW_PROJECT_ID, SECRET_MANAGER_ENDPOINT).
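Because the update call replaces the whole map, it helps to build the argument list from one place and refuse to proceed if anything is missing. The sketch below is one way to do that, assuming the seven values are already exported in the operator's shell; `env_update_args` and `REQUIRED_ENV` are hypothetical names.

```shell
# The seven variables named in the note above.
REQUIRED_ENV="DATABASE_URL JWT_SIGNING_KEY OPENAI_API_KEY APP_ENV SCW_REGION SCW_PROJECT_ID SECRET_MANAGER_ENDPOINT"

# env_update_args: prints one "environment-variables.NAME=value" argument per
# line, reading values from the current shell environment. Prints nothing and
# fails if any variable is unset or empty.
# Hypothetical helper; not part of the Erold repo.
env_update_args() {
  local name value missing=""
  for name in $REQUIRED_ENV; do
    # Names come from the fixed list above, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "env_update_args: missing:$missing" >&2
    return 1
  fi
  for name in $REQUIRED_ENV; do
    eval "value=\${$name}"
    printf 'environment-variables.%s=%s\n' "$name" "$value"
  done
}
```

Each output line is one argument for `scw container container update`. Values that contain newlines would need different handling.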

-- Provision a temp ACL + LB endpoint on the prod DB, then insert the tenant.
-- See erold-backend/DEPLOY.md §"Local image smoke test" for the open/close pattern.
-- Then via psql or the admin UI:
INSERT INTO tenants (id, name) VALUES (gen_random_uuid(), 'new-tenant-name');
-- Generate an API key for them (via the admin UI, or insert directly into api_keys).
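When the tenant name comes from a ticket or a customer email, quoting it by hand in the INSERT is error-prone. A small sketch that renders the statement with the name escaped as a SQL string literal; `sql_quote` and `tenant_insert_sql` are hypothetical names, and the table and columns are the ones used in the statement above.

```shell
# sql_quote <string>: prints the string as a single-quoted SQL literal,
# doubling any embedded single quotes.
sql_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# tenant_insert_sql <name>: prints the INSERT statement for a new tenant.
# Hypothetical helper; not part of the Erold repo.
tenant_insert_sql() {
  printf 'INSERT INTO tenants (id, name) VALUES (gen_random_uuid(), %s);\n' "$(sql_quote "$1")"
}

tenant_insert_sql "new-tenant-name"

# Example (not executed here; assumes DATABASE_URL points at the temporary endpoint):
# tenant_insert_sql "acme" | psql "$DATABASE_URL"
```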

# Rollback: show the currently deployed image, then pick the previous good tag
scw container container get d03ae584-... region=fr-par -o json | jq -r '.registry_image'
# Roll back
scw container container update d03ae584-... region=fr-par \
registry-image=rg.fr-par.scw.cloud/erold/api:v0.1.0
scw container container deploy d03ae584-... region=fr-par
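The get call shows what is deployed now; choosing the tag to roll back to is a separate step. The sketch below picks the tag just below the current one from a list supplied on stdin, assuming tags follow the v0.1.X scheme used above and that `sort -V` is available (GNU coreutils and recent BSD sort). `previous_tag` is a hypothetical name, and where the tag list comes from is left to the operator.

```shell
# previous_tag <current_tag>: reads candidate tags on stdin (one per line) and
# prints the highest tag that sorts below <current_tag> in version order.
# Fails if the current tag is not in the list or has no predecessor.
# Hypothetical helper; not part of the Erold repo.
previous_tag() {
  local current="$1"
  # sort -V orders v0.1.9 before v0.1.10; awk keeps the line seen just before
  # the current tag and ignores everything after it.
  sort -V -u | awk -v cur="$current" '
    found { next }
    $0 == cur { found = 1; next }
    { prev = $0 }
    END { if (!found || prev == "") exit 1; print prev }'
}

# Example with a literal tag list:
printf 'v0.1.0\nv0.1.10\nv0.1.9\n' | previous_tag v0.1.10
```

"Previous" here means previous in version order, not "last known good"; a bad intermediate release still has to be skipped by hand.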

# Inspect the local plugin outbox on a developer machine
wc -l ~/.erold/outbox/events.jsonl # queued, not yet flushed
wc -l ~/.erold/outbox/dead-letter.jsonl # permanently failed (4xx / retry-exhausted)
tail -n 20 ~/.erold/error.log # transmission errors (host, status, exit code)
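The three commands above can be folded into one check that also works on a machine where the files do not exist yet. A minimal sketch, assuming a missing file means zero events; `outbox_report` is a hypothetical name.

```shell
# outbox_report [dir]: prints queued and dead-letter counts for an outbox
# directory (default ~/.erold/outbox). Exits non-zero if anything is dead-lettered.
# Hypothetical helper; not part of the Erold plugin.
outbox_report() {
  local dir="${1:-$HOME/.erold/outbox}"
  local queued=0 dead=0
  # Assumption: a missing file means nothing is queued or nothing has failed yet.
  if [ -f "$dir/events.jsonl" ]; then
    queued="$(wc -l < "$dir/events.jsonl" | tr -d '[:space:]')"
  fi
  if [ -f "$dir/dead-letter.jsonl" ]; then
    dead="$(wc -l < "$dir/dead-letter.jsonl" | tr -d '[:space:]')"
  fi
  echo "queued=$queued dead=$dead"
  [ "$dead" -eq 0 ]
}

# Example (not executed here):
# outbox_report || tail -n 20 ~/.erold/error.log
```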
Monitoring + alerting (recommended but not yet wired)
  • Container health: Scaleway Cockpit auto-collects HTTP-status. Add an alert if 5xx > 5/min for 5 min.
  • DB connections: alert if the number of rows in pg_stat_activity exceeds 80% of max_connections.
  • Outbox lag: alert if any fragment has embedding_status = 'pending' and a created_at older than 60 s (the worker is stuck).
  • Dead-letter rate: alert if events_outbox WHERE status='dead' grows by > 10/hour.
  • Cross-tenant leak: weekly synthetic check — tenant B searches for a tenant A marker, must return 0 (the test exists in erold-backend/tests/integration/test_tenancy.py).
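Until these alerts are wired into Cockpit, the DB-connection rule can be approximated from a cron job. The sketch below implements only the threshold comparison; `over_pct` is a hypothetical name, and the commented psql calls assume DATABASE_URL points at a reachable endpoint.

```shell
# over_pct <value> <limit> <percent>: true when value exceeds percent% of limit.
# Integer math only, so it compares value*100 against limit*percent.
# Hypothetical helper; not part of the Erold repo.
over_pct() {
  local value="$1" limit="$2" percent="$3"
  [ "$((value * 100))" -gt "$((limit * percent))" ]
}

# Example (not executed here; assumes psql can reach the DB via $DATABASE_URL):
# active=$(psql "$DATABASE_URL" -Atc 'SELECT count(*) FROM pg_stat_activity')
# max=$(psql "$DATABASE_URL" -Atc 'SHOW max_connections')
# over_pct "$active" "$max" 80 && echo "ALERT: DB connections above 80% of max_connections"
```

A single sample is noisy; the "for 5 min" style of condition used for the 5xx rule would need several consecutive samples before firing.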
  1. Container in error state. Pull logs via Cockpit (recipe above). Most common cause: env var dropped during an update call (the update replaces the env map). Re-set all 7 env vars together.
  2. /health returns 500. The DB is unreachable. Check the ACL list, the private-network attachment, and the status reported by scw rdb instance get.
  3. Slow search. Check embedding_status='embedded' count vs 'pending'. If pending is large, embedding worker is stalled — usually OpenAI 429 or expired key. Rotate the key + redeploy.
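For the first failure mode, the quickest confirmation is to diff the variable names currently set on the container against the seven it needs. A minimal sketch of that diff; `missing_env_vars` is a hypothetical name, and the jq path in the example assumes the container JSON exposes the map under `environment_variables`.

```shell
# The seven variables the container needs (from the env-map note in this guide).
REQUIRED_ENV_NAMES="DATABASE_URL JWT_SIGNING_KEY OPENAI_API_KEY APP_ENV SCW_REGION SCW_PROJECT_ID SECRET_MANAGER_ENDPOINT"

# missing_env_vars: reads the variable names currently set (one per line on
# stdin) and prints each required name that is absent.
# Hypothetical helper; not part of the Erold repo.
missing_env_vars() {
  local present name
  # Flatten stdin to a space-separated list for whole-word matching.
  present="$(tr '\n' ' ')"
  for name in $REQUIRED_ENV_NAMES; do
    case " $present " in
      *" $name "*) ;;
      *) echo "$name" ;;
    esac
  done
}

# Example (not executed here; the .environment_variables field name is an assumption):
# scw container container get d03ae584-... region=fr-par -o json \
#   | jq -r '.environment_variables | keys[]' | missing_env_vars
```

An empty output means all seven names are present; it says nothing about whether the values are still valid.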