Two different questions an operator asks: is the system up? and is the workforce productive? RoboCo answers the first with two HTTP probes, and the second with an operational Metrics view. Neither has anything to do with token cost — for spend, see cost & usage.
Liveness and readiness probes
The orchestrator exposes two probes behind nginx (so they're reachable at localhost:3000/api/..., or directly on the orchestrator):
| Endpoint | Checks | Returns |
|---|---|---|
GET /api/health | The process is up | Always 200 with { status: "ok", version, environment } once the app is serving |
GET /api/ready | Database (SELECT 1) and Redis (PING) | 200 with { status, database, redis }; status is ok only when both pass, otherwise degraded |
1curl -s http://localhost:3000/api/health2curl -s http://localhost:3000/api/ready/api/health is a liveness probe — it stays 200 as long as the process answers, so it can't tell you a dependency is down. /api/ready is the real readiness signal: it actively pings Postgres and Redis and reports degraded (still HTTP 200, but status: "degraded" with the failing dependency's error in the database/redis field) when either is unreachable. Point your uptime monitor at /api/ready and alert on degraded.
The app-level root /health used during container startup is additionally gated on the in-house RAG engine being operational, which is why the orchestrator can take a minute to report healthy after a cold start while it indexes documents. See deployment for the full startup sequence.
Those probes cover the infrastructure; individual agents have their own liveness story. An agent whose MCP gateway breaks (a corrupted /app venv firing no verb) stays "up" as a container but does no work, and its verb-heartbeat goes stale. With ROBOCO_GATEWAY_HEALTH_ENABLED (default on) the reaper probes such a stale-but-live agent's gateway out-of-band and, once it's been broken past ROBOCO_GATEWAY_HEALTH_GRACE_SECONDS, kills and respawns it instead of protecting it forever — so a wedged agent self-heals without operator action. You don't monitor this directly; it shows up as a "broken-gateway agent killed and evicted" line in the orchestrator log.
The Metrics → Performance view
The Metrics page has a Performance tab driven by tasks, messages, and notifications — your read on whether work is actually flowing. See the panel walkthrough in Metrics.
What it surfaces:
| Group | Metrics |
|---|---|
| Velocity | Tasks completed vs created, average completion time (hours), completion rate |
| Blockers | Active blockers, average and longest blocked time, blockers by team |
| Team & agent | Per-team and per-agent performance |
| Communication | Message/notification volume |
The ok / slow / critical health signal
Org and per-team health roll up to a single status driven mainly by the blocked-task ratio:
A team with more than 30% of its tasks blocked reads critical; over 15% reads slow; below that, ok. A heuristic also flags a team sitting on stale active tasks with zero completions. When you see slow or critical, the Blockers panel tells you where the work is jammed and which team owns it — chase the longest-blocked tasks first.
This status is computed from task state, not exceptions or crashes. A green org-health with a degraded /api/ready means the infrastructure is wobbling even though the backlog looks healthy — watch both signals, they answer different questions.
The Metrics → Delivery view
Where Performance answers is work flowing, the Delivery tab answers where the time goes and how clean the work is — reconstructed from the recorded task-transition history, so it costs no extra bookkeeping. Use it when velocity dips and you need the why:
| Panel | Operational read |
|---|---|
| Cycle time by stage | Average dwell per lifecycle stage — a tall stage is where work waits (e.g. review vs coding) |
| Bottlenecks | The single worst stage by total time absorbed + how many tasks are parked in each stage right now |
| Rework | How often work bounces to needs_revision, by cell and by agent (the bounce charged to the QA/PR-reviewer who sent it back), plus its token cost |
| Cell scorecards | Per-cell completed / avg cycle / rework / cost over 7 days |
The full reading guide — especially how rework attribution works — is in the panel walkthrough at Metrics → Delivery.
Next
- Walk the panel surface in Metrics and the command center.
- For token spend and the cost dashboard, see cost & usage.
- For the startup sequence and what each container needs, see deployment.