Ona runners expose Prometheus metrics for runner health, environment lifecycle, and resource utilization.
Enabling metrics collection
- Go to Settings → Runners
- Select your runner
- Toggle Enable metrics collection
- Enter your configuration:
| Parameter | Required | Description |
|---|---|---|
| Metrics collector URL | Yes | Prometheus remote write endpoint |
| Username | No | Basic auth username |
| Password | No | Basic auth password |
- Click Save Configuration
Your credentials are encrypted at rest and transmitted securely. They are never exposed in logs or the dashboard.
Metrics start flowing as soon as you save the configuration. Make sure outbound HTTPS (port 443) from the runner to your endpoint is allowed.
Network requirements: AWS | GCP
What to monitor
Runners expose many metrics, but not all require your attention. Some indicate issues you can resolve directly in your cloud account. Others signal problems that require Ona support. The rest provide visibility into usage and system health.
Act on these
These metrics reflect infrastructure you control. Set up alerts and respond directly.
| Metric | What it means | What to do |
|---|---|---|
| up == 0 | Runner is unreachable | Check network connectivity, security groups, and firewall rules |
| gitpod_gateway_proxy_up == 0 | Proxy is down | Check runner logs and network configuration |
| gitpod_gateway_proxy_http_requests_total | Proxy request errors | Filter for status_code 4xx/5xx to identify failing requests; may indicate misconfigured clients or network issues |
These metrics indicate issues within the runner itself. You can’t resolve them directly, but they help you know when to reach out.
| Metric | What it means | What to tell support |
|---|---|---|
| gitpod_runnerkit_function_errors_total | Total internal operation failures | Share the error rate trend and affected time window |
| workqueue_unfinished_work_seconds | Processing is stuck (value stays elevated) | Note how long it’s been elevated and any correlated symptoms |
| environment_error_errors_total | Environment creation/operation failures | Include the error_code and component labels from the metric |
These metrics provide visibility but don’t typically require action.
| Metric | What it shows |
|---|---|
| gitpod_runnerkit_active_instances | Current environment count by state |
Example alerts
These alerts cover high-signal scenarios that directly impact your users. Runner and proxy availability determine whether users can access environments at all. Proxy error rates indicate active failures during environment connections. These are the first things to know about when something goes wrong.
Runner unreachable
Check your network configuration, security groups, and firewall rules.
- alert: RunnerUnreachable
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Runner is unreachable"
    runbook: "Check network connectivity and security groups"
Proxy error rate elevated
Indicates users are experiencing failures connecting to environments.
- alert: ProxyErrorRateElevated
  expr: |
    sum(rate(gitpod_gateway_proxy_http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(gitpod_gateway_proxy_http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Proxy 5xx error rate above 5%"
    runbook: "Check proxy logs and backend connectivity"
This alert signals an issue you can’t fix directly. Contact support with the time window and error details.
- alert: HighErrorRate
  expr: |
    rate(gitpod_runnerkit_function_errors_total[5m])
    / rate(gitpod_runnerkit_function_calls_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Runner error rate elevated"
    runbook: "Contact Ona support with time window and error_code labels"
Available metrics
All metrics include these common labels:
| Label | Description |
|---|---|
| stack | Runner stack name (e.g., Ona-AWS-US-East---Enterprise) |
| account_id | Cloud provider account ID |
| region | Deployment region |
| instance | Container hostname |
| job | Prometheus job name (ec2_runner, runner_manager, proxy) |
The tables below list additional metric-specific labels where applicable.
Standard
Common metrics available on all runners.
| Metric | Type | Labels | Description |
|---|---|---|---|
| up | Gauge | — | Target health (1 = up, 0 = down) |
| gitpod_gateway_proxy_up | Gauge | — | Proxy health (1 = up, 0 = down) |
| gitpod_ec2_runner_version_info | Gauge | version | Runner version (AWS) |
| gitpod_runner_version | Gauge | version, kind | Runner version (GCP) |
Environment (gitpod_runnerkit_*)
Metrics for environment lifecycle operations including creation, supervision, and state management.
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_runnerkit_active_instances | Gauge | state | Environments by state |
| gitpod_runnerkit_environment_operation_duration_seconds | Histogram | operation | Operation duration |
| gitpod_runnerkit_function_calls_total | Counter | function | Function calls |
| gitpod_runnerkit_function_duration_seconds | Histogram | function | Function duration |
| gitpod_runnerkit_function_errors_total | Counter | function | Function errors |
| gitpod_runnerkit_supervisor_status_events_total | Counter | — | Supervisor events |
| gitpod_runnerkit_supervisor_watch_starts_total | Counter | — | Watch starts |
| gitpod_runnerkit_supervisor_watch_closes_total | Counter | reason | Watch closes |
| gitpod_runnerkit_supervisor_watch_duration_seconds | Histogram | — | Watch duration |
Snapshots (snapshot_*)
Metrics for environment snapshot operations used for persistence and restore.
| Metric | Type | Labels | Description |
|---|---|---|---|
| snapshot_reconcile_duration_seconds | Histogram | phase, result | Processing time by phase |
| snapshot_in_progress | Gauge | phase | Active snapshots |
| snapshot_timeouts_total | Counter | — | Timeouts |
| snapshot_deletions_total | Counter | result | Deletions |
Work queue (workqueue_*)
Internal task queue metrics. A growing workqueue_depth or high workqueue_unfinished_work_seconds may indicate the runner is falling behind on processing. Contact support if these remain elevated.
| Metric | Type | Labels | Description |
|---|---|---|---|
| workqueue_depth | Gauge | name | Queue depth |
| workqueue_adds_total | Counter | name | Items added |
| workqueue_queue_duration_seconds | Histogram | name | Time in queue |
| workqueue_work_duration_seconds | Histogram | name | Processing time |
| workqueue_unfinished_work_seconds | Gauge | name | Stuck work indicator |
| workqueue_longest_running_processor_seconds | Gauge | name | Longest processor |
| workqueue_retries_total | Counter | name | Retries |
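The "stays elevated" condition on workqueue_unfinished_work_seconds can be expressed as an alert. The following is a minimal sketch of a complete Prometheus rule file, not an Ona-recommended configuration: the group name, alert name, 300-second threshold, and 15-minute hold time are all assumptions to tune against your own baseline.

```yaml
groups:
  - name: ona-runner-workqueue          # hypothetical group name
    rules:
      - alert: RunnerWorkqueueStuck     # hypothetical alert name
        # Fires when a queue has carried unfinished work for more than
        # 300 s, continuously for 15 minutes. Both values are assumed
        # starting points, not documented thresholds.
        expr: max by (stack, name) (workqueue_unfinished_work_seconds) > 300
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Work queue {{ $labels.name }} on {{ $labels.stack }} appears stuck"
          runbook: "Contact Ona support with the time window and queue name"
```

Because this metric reflects runner internals, the alert is meant as a prompt to contact support rather than something to remediate directly.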
Errors (environment_error_*)
Tracks environment-level errors. Use the error_code and component labels when reporting issues to support.
| Metric | Type | Labels | Description |
|---|---|---|---|
| environment_error_errors_total | Counter | instance_id, error_code, component | Errors by instance/code/component |
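To surface the error_code and component values that support will ask for, an alert can group errors by those labels. This is a sketch under assumptions: the group name, alert name, 15-minute window, and threshold of 5 errors are illustrative and not taken from Ona documentation.

```yaml
groups:
  - name: ona-runner-environment-errors   # hypothetical group name
    rules:
      - alert: EnvironmentErrorsElevated  # hypothetical alert name
        # Counts new errors over the last 15 minutes per error_code and
        # component, summing across instance_id. The threshold of 5 is an
        # assumption; pick one that matches your normal error volume.
        expr: |
          sum by (stack, error_code, component) (
            increase(environment_error_errors_total[15m])
          ) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Environment errors: {{ $labels.error_code }} in {{ $labels.component }}"
          runbook: "Contact Ona support with the error_code, component, and time window"
```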
Gateway proxy (gitpod_gateway_proxy_*)
The gateway proxy handles all traffic between users and environments. It runs as a container named proxy in the same ECS task as the runner (AWS deployments).
HTTP requests
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_http_requests_total | Counter | protocol, status_code | Total HTTP requests processed |
| gitpod_gateway_proxy_http_request_duration_seconds | Histogram | protocol | Request duration |
| gitpod_gateway_proxy_http_requests_in_flight | Gauge | protocol | Active requests |
| gitpod_gateway_proxy_http_request_size_bytes | Histogram | protocol | Request size |
| gitpod_gateway_proxy_http_response_size_bytes | Histogram | protocol | Response size |
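Histogram metrics such as the request duration are usually read as percentiles. The sketch below records a 95th-percentile latency per stack and protocol. It assumes the histogram is exposed in the classic Prometheus form, with a _bucket series and an le label; the group and record names are hypothetical.

```yaml
groups:
  - name: ona-proxy-latency   # hypothetical group name
    rules:
      # p95 proxy request latency over a 5-minute window.
      # Assumes classic histogram buckets (the _bucket suffix and le label).
      - record: stack_protocol:gitpod_gateway_proxy_http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (stack, protocol, le) (
              rate(gitpod_gateway_proxy_http_request_duration_seconds_bucket[5m])
            )
          )
```

The le label has to survive the aggregation, since histogram_quantile uses it to interpolate between buckets.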
Connections
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_http_connections_in_flight | Gauge | protocol | Active connections |
| gitpod_gateway_proxy_http_connection_errors_total | Counter | protocol, error_type | Connection errors |
| gitpod_gateway_proxy_http_connection_duration_seconds | Histogram | protocol | Connection duration |
Backend
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_http_backend_request_duration_seconds | Histogram | protocol | Backend request duration |
| gitpod_gateway_proxy_http_backend_failures_total | Counter | error_type | Backend failures |
| gitpod_gateway_proxy_http_backend_connections_in_flight | Gauge | protocol | Active backend connections |
DNS
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_dns_resolution_duration_seconds | Histogram | — | DNS resolution time |
| gitpod_gateway_proxy_dns_cache_hits_total | Counter | — | DNS cache hits |
| gitpod_gateway_proxy_dns_cache_misses_total | Counter | — | DNS cache misses |
| gitpod_gateway_proxy_dns_errors_total | Counter | error_type | DNS errors |
| gitpod_gateway_proxy_gitpod_proxy_dns_negative_cache_hits_total | Counter | — | Negative cache hits |
| gitpod_gateway_proxy_gitpod_proxy_dns_failures_by_code_total | Counter | code | DNS failures by HTTP status |
| gitpod_gateway_proxy_gitpod_proxy_dns_cache_invalidations_total | Counter | — | Cache invalidations |
| gitpod_gateway_proxy_gitpod_proxy_dns_cache_invalidations_batch_total | Counter | — | Batch cache invalidations |
| gitpod_gateway_proxy_gitpod_proxy_environment_not_found_total | Counter | — | Requests for non-existent environments |
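The cache hit and miss counters are easier to interpret as a ratio. A possible recording rule is sketched below, assuming both counters are incremented once per lookup; the group and record names are hypothetical.

```yaml
groups:
  - name: ona-proxy-dns   # hypothetical group name
    rules:
      # Share of DNS lookups served from cache over the last 5 minutes:
      # hits / (hits + misses). Yields no value while there are no lookups.
      - record: stack:gitpod_gateway_proxy_dns_cache_hit_ratio:rate5m
        expr: |
          sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_hits_total[5m]))
          /
          (
            sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_hits_total[5m]))
            + sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_misses_total[5m]))
          )
```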
TLS
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_tls_handshake_duration_seconds | Histogram | protocol_version | TLS handshake duration |
| gitpod_gateway_proxy_tls_errors_total | Counter | error_type | TLS errors |
Security
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gateway_proxy_suspicious_requests_total | Counter | type | Suspicious requests detected |
GCP-specific (gitpod_gcp_*)
Metrics specific to GCP runner deployments, tracking compute and Redis connectivity.
| Metric | Type | Labels | Description |
|---|---|---|---|
| gitpod_gcp_compute_network_errors_total | Counter | error_type | Network errors |
| gitpod_gcp_compute_redis_connection_errors_total | Counter | — | Redis errors |
| gitpod_gcp_compute_redis_connection_health | Gauge | — | Redis health |
| gitpod_gcp_network_connection_health | Gauge | — | Network health |
Troubleshooting
Metrics not appearing?
- Check network connectivity to your endpoint
- Verify authentication credentials
- Check runner logs for errors
Network requirements: AWS | GCP
High cardinality?
- Aggregate at runner level instead of per-environment
- Adjust retention for high-volume metrics
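One way to aggregate at the runner level is a recording rule that sums away per-environment labels such as instance_id. The sketch below assumes rules are evaluated in your own Prometheus-compatible backend; the group and record names are hypothetical.

```yaml
groups:
  - name: ona-runner-aggregates   # hypothetical group name
    rules:
      # Error rate per runner stack, error code, and component.
      # instance_id is dropped because it is not listed in the by() clause,
      # so the recorded series count no longer grows with each environment.
      - record: stack_error_code_component:environment_error_errors:rate5m
        expr: |
          sum by (stack, error_code, component) (
            rate(environment_error_errors_total[5m])
          )
```

Dashboards and alerts can then query the recorded series instead of the raw metric, which allows a shorter retention period for the raw, high-cardinality data.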