Ona runners expose Prometheus metrics for runner health, environment lifecycle, and resource utilization.
Enabling metrics collection
Go to Settings → Runners
Select your runner
Toggle Enable metrics collection
Enter your configuration:
| Parameter | Required | Description |
| --- | --- | --- |
| Metrics collector URL | Yes | Prometheus remote write endpoint |
| Username | No | Basic auth username |
| Password | No | Basic auth password |
Click Save Configuration
Your credentials are encrypted at rest and transmitted securely. They are never exposed in logs or the dashboard.
Metrics start flowing as soon as you save the configuration. Make sure outbound HTTPS (port 443) to your endpoint is allowed.
Network requirements: AWS | GCP
What to monitor
Runners expose many metrics, but not all require your attention. Some indicate issues you can resolve directly in your cloud account. Others signal problems that require Ona support. The rest provide visibility into usage and system health.
Act on these
These metrics reflect infrastructure you control. Set up alerts and respond directly.
| Metric | What it means | What to do |
| --- | --- | --- |
| up == 0 | Runner is unreachable | Check network connectivity, security groups, and firewall rules |
| gitpod_gateway_proxy_up == 0 | Proxy is down | Check runner logs and network configuration |
| gitpod_gateway_proxy_http_requests_total | Proxy request errors | Filter for status_code 4xx/5xx to identify failing requests; may indicate misconfigured clients or network issues |
These metrics indicate issues within the runner itself. You can’t resolve them directly, but they help you know when to reach out.
| Metric | What it means | What to tell support |
| --- | --- | --- |
| gitpod_runnerkit_function_errors_total | Total internal operation failures | Share the error rate trend and affected time window |
| workqueue_unfinished_work_seconds | Processing is stuck (value stays elevated) | Note how long it’s been elevated and any correlated symptoms |
| environment_error_errors_total | Environment creation/operation failures | Include the error_code and component labels from the metric |
These metrics provide visibility but don’t typically require action.
| Metric | What it shows |
| --- | --- |
| gitpod_runnerkit_active_instances | Current environment count by state |
Example alerts
These alerts cover high-signal scenarios that directly impact your users. Runner and proxy availability determine whether users can access environments at all. Proxy error rates indicate active failures during environment connections. These are the first things to know about when something goes wrong.
Runner unreachable
Check your network configuration, security groups, and firewall rules.
- alert: RunnerUnreachable
  expr: up == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Runner is unreachable"
    runbook: "Check network connectivity and security groups"
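The "Act on these" table also lists gitpod_gateway_proxy_up == 0 as a condition you can respond to directly. A companion rule might look like the sketch below. The group name, alert name, 5m window, and severity are assumptions chosen to mirror the runner rule, not values published by Ona.

```yaml
groups:
  - name: ona-runner-availability   # hypothetical group name
    rules:
      - alert: GatewayProxyDown     # hypothetical alert name
        # Fires when the proxy reports itself down (0) for 5 minutes.
        expr: gitpod_gateway_proxy_up == 0
        for: 5m
        labels:
          severity: critical        # assumed; match your own paging policy
        annotations:
          summary: "Gateway proxy is down"
          runbook: "Check runner logs and network configuration"
```

If the series can disappear entirely when the proxy stops, absent(gitpod_gateway_proxy_up) is a standard PromQL way to catch that case. Whether it applies here depends on how the metric is emitted, which this page does not state.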
Proxy error rate elevated
Indicates users are experiencing failures connecting to environments.
- alert: ProxyErrorRateElevated
  expr: |
    sum(rate(gitpod_gateway_proxy_http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(gitpod_gateway_proxy_http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Proxy 5xx error rate above 5%"
    runbook: "Check proxy logs and backend connectivity"
This alert signals an issue you can’t fix directly. Contact support with the time window and error details.
- alert: HighErrorRate
  expr: |
    rate(gitpod_runnerkit_function_errors_total[5m])
    / rate(gitpod_runnerkit_function_calls_total[5m]) > 0.1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Runner error rate elevated"
    runbook: "Contact Ona support with time window and error_code labels"
Available metrics
All metrics include these common labels:
| Label | Description |
| --- | --- |
| stack | Runner stack name (e.g., Ona-AWS-US-East---Enterprise) |
| account_id | Cloud provider account ID |
| region | Deployment region |
| instance | Container hostname |
| job | Prometheus job name (ec2_runner, runner_manager, proxy) |
The tables below list additional metric-specific labels where applicable.
Standard
Common metrics available on all runners.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| up | Gauge | - | Target health (1 = up, 0 = down) |
| gitpod_gateway_proxy_up | Gauge | - | Proxy health (1 = up, 0 = down) |
| gitpod_ec2_runner_version_info | Gauge | version | Runner version (AWS) |
| gitpod_runner_version | Gauge | version, kind | Runner version (GCP) |
Environment (gitpod_runnerkit_*)
Metrics for environment lifecycle operations including creation, supervision, and state management.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_runnerkit_active_instances | Gauge | state | Environments by state |
| gitpod_runnerkit_environment_operation_duration_seconds | Histogram | operation | Operation duration |
| gitpod_runnerkit_function_calls_total | Counter | function | Function calls |
| gitpod_runnerkit_function_duration_seconds | Histogram | function | Function duration |
| gitpod_runnerkit_function_errors_total | Counter | function | Function errors |
| gitpod_runnerkit_supervisor_status_events_total | Counter | - | Supervisor events |
| gitpod_runnerkit_supervisor_watch_starts_total | Counter | - | Watch starts |
| gitpod_runnerkit_supervisor_watch_closes_total | Counter | reason | Watch closes |
| gitpod_runnerkit_supervisor_watch_duration_seconds | Histogram | - | Watch duration |
Snapshots (snapshot_*)
Metrics for environment snapshot operations used for persistence and restore.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| snapshot_reconcile_duration_seconds | Histogram | phase, result | Processing time by phase |
| snapshot_in_progress | Gauge | phase | Active snapshots |
| snapshot_timeouts_total | Counter | - | Timeouts |
| snapshot_deletions_total | Counter | result | Deletions |
Work queue (workqueue_*)
Internal task queue metrics. A growing workqueue_depth or high workqueue_unfinished_work_seconds may indicate the runner is falling behind on processing. Contact support if these remain elevated.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| workqueue_depth | Gauge | name | Queue depth |
| workqueue_adds_total | Counter | name | Items added |
| workqueue_queue_duration_seconds | Histogram | name | Time in queue |
| workqueue_work_duration_seconds | Histogram | name | Processing time |
| workqueue_unfinished_work_seconds | Gauge | name | Stuck work indicator |
| workqueue_longest_running_processor_seconds | Gauge | name | Longest processor |
| workqueue_retries_total | Counter | name | Retries |
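To notice when the queue stays elevated rather than spiking briefly, a pair of rules along these lines could help. This is a sketch only: the group and alert names are hypothetical, and the 300-second threshold and the 15m/30m windows are illustrative assumptions that should be tuned against your own baseline.

```yaml
groups:
  - name: ona-runner-workqueue        # hypothetical group name
    rules:
      - alert: WorkqueueUnfinishedWorkElevated   # hypothetical alert name
        # 300s is an assumed threshold, not an Ona recommendation.
        expr: max by (stack, name) (workqueue_unfinished_work_seconds) > 300
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Work queue {{ $labels.name }} has elevated unfinished work"
          runbook: "Contact Ona support with the time window and queue name"
      - alert: WorkqueueDepthGrowing             # hypothetical alert name
        # deriv() estimates the per-second slope of a gauge over the window;
        # a slope that stays positive means the queue keeps growing.
        expr: deriv(workqueue_depth[30m]) > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Work queue {{ $labels.name }} depth has been growing for 30 minutes"
          runbook: "Contact Ona support with the time window and queue name"
```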
Errors (environment_error_*)
Tracks environment-level errors. Use the error_code and component labels when reporting issues to support.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| environment_error_errors_total | Counter | instance_id, error_code, component | Errors by instance/code/component |
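Because support asks for the error_code and component labels, it can be convenient to have the alert carry them. The rule below is a minimal sketch: the names are hypothetical, and the threshold of 0.1 errors per second sustained for 10 minutes is an assumption to adjust for your traffic.

```yaml
groups:
  - name: ona-runner-environment-errors   # hypothetical group name
    rules:
      - alert: EnvironmentErrorsElevated  # hypothetical alert name
        # Sum away instance_id so one alert fires per error_code/component pair.
        expr: |
          sum by (stack, error_code, component) (
            rate(environment_error_errors_total[10m])
          ) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Environment errors elevated: {{ $labels.error_code }} in {{ $labels.component }}"
          runbook: "Contact Ona support with the time window, error_code, and component"
```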
Gateway proxy (gitpod_gateway_proxy_*)
The gateway proxy handles all traffic between users and environments. It runs as a container named proxy in the same ECS task as the runner (AWS deployments).
HTTP requests
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_http_requests_total | Counter | protocol, status_code | Total HTTP requests processed |
| gitpod_gateway_proxy_http_request_duration_seconds | Histogram | protocol | Request duration |
| gitpod_gateway_proxy_http_requests_in_flight | Gauge | protocol | Active requests |
| gitpod_gateway_proxy_http_request_size_bytes | Histogram | protocol | Request size |
| gitpod_gateway_proxy_http_response_size_bytes | Histogram | protocol | Response size |
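The duration histogram can be turned into a latency percentile with a recording rule. The sketch below assumes the histogram is exported in the classic Prometheus form with a _bucket series and an le label, which this page does not confirm. The group and record names are hypothetical.

```yaml
groups:
  - name: ona-proxy-latency             # hypothetical group name
    rules:
      # 95th percentile request duration per stack and protocol over 5 minutes.
      - record: stack_protocol:gitpod_gateway_proxy_http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, stack, protocol) (
              rate(gitpod_gateway_proxy_http_request_duration_seconds_bucket[5m])
            )
          )
```

The le label has to survive the aggregation, which is why it appears in the by clause alongside stack and protocol.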
Connections
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_http_connections_in_flight | Gauge | protocol | Active connections |
| gitpod_gateway_proxy_http_connection_errors_total | Counter | protocol, error_type | Connection errors |
| gitpod_gateway_proxy_http_connection_duration_seconds | Histogram | protocol | Connection duration |
Backend
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_http_backend_request_duration_seconds | Histogram | protocol | Backend request duration |
| gitpod_gateway_proxy_http_backend_failures_total | Counter | error_type | Backend failures |
| gitpod_gateway_proxy_http_backend_connections_in_flight | Gauge | protocol | Active backend connections |
DNS
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_dns_resolution_duration_seconds | Histogram | - | DNS resolution time |
| gitpod_gateway_proxy_dns_cache_hits_total | Counter | - | DNS cache hits |
| gitpod_gateway_proxy_dns_cache_misses_total | Counter | - | DNS cache misses |
| gitpod_gateway_proxy_dns_errors_total | Counter | error_type | DNS errors |
| gitpod_gateway_proxy_gitpod_proxy_dns_negative_cache_hits_total | Counter | - | Negative cache hits |
| gitpod_gateway_proxy_gitpod_proxy_dns_failures_by_code_total | Counter | code | DNS failures by HTTP status |
| gitpod_gateway_proxy_gitpod_proxy_dns_cache_invalidations_total | Counter | - | Cache invalidations |
| gitpod_gateway_proxy_gitpod_proxy_dns_cache_invalidations_batch_total | Counter | - | Batch cache invalidations |
| gitpod_gateway_proxy_gitpod_proxy_environment_not_found_total | Counter | - | Requests for non-existent environments |
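The hit and miss counters can be combined into a cache hit ratio for dashboards. This is an illustrative recording rule with a hypothetical group and record name; it assumes hits and misses together account for all cache lookups.

```yaml
groups:
  - name: ona-proxy-dns                 # hypothetical group name
    rules:
      # Fraction of DNS lookups served from cache, per stack, over 5 minutes.
      - record: stack:gitpod_gateway_proxy_dns_cache_hit:ratio_rate5m
        expr: |
          sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_hits_total[5m]))
          /
          (
            sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_hits_total[5m]))
            + sum by (stack) (rate(gitpod_gateway_proxy_dns_cache_misses_total[5m]))
          )
```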
TLS
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_tls_handshake_duration_seconds | Histogram | protocol_version | TLS handshake duration |
| gitpod_gateway_proxy_tls_errors_total | Counter | error_type | TLS errors |
Security
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gateway_proxy_suspicious_requests_total | Counter | type | Suspicious requests detected |
Warm pools (warm_pool_*)
Metrics for warm pool scaling, instance lifecycle, and claim performance. All metrics include a warm_pool_id label.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| warm_pool_instances | Gauge | warm_pool_id | Current number of running instances |
| warm_pool_instances_by_state | Gauge | warm_pool_id, state | Instance count by lifecycle state (in_service, stopped) |
| warm_pool_target_size | Gauge | warm_pool_id | Current desired instance count (set by the scaling policy) |
| warm_pool_min_size | Gauge | warm_pool_id | Configured minimum pool size |
| warm_pool_max_size | Gauge | warm_pool_id | Configured maximum pool size |
| warm_pool_claims_total | Counter | warm_pool_id, result | Claim attempts (success, instance_not_found, error) |
| warm_pool_claim_instance_age_seconds | Histogram | warm_pool_id | Age of instances at claim time (time since launch) |
| warm_pool_oldest_instance_age_seconds | Gauge | warm_pool_id | Age of the oldest running instance |
| warm_pool_instances_created_total | Counter | warm_pool_id | Total instances launched |
| warm_pool_instances_terminated_total | Counter | warm_pool_id | Total instances terminated |
| warm_pool_reconcile_duration_seconds | Histogram | phase, result | Reconciliation duration by pool phase |
Key metrics to watch:
warm_pool_claims_total with result="instance_not_found" : Indicates users hit a cold start because no warm instance was available. If this happens frequently, increase max-size or min-size (the pool may be scaling down too aggressively).
warm_pool_claim_instance_age_seconds : Shows how long instances waited before being claimed. Very short ages may indicate the pool is too small for demand.
warm_pool_instances_by_state : Compare in_service vs stopped counts to verify the pool is scaling as expected.
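The cold-start signal described above can be expressed as the share of claims that ended in instance_not_found. The rule below is a sketch under stated assumptions: the names are hypothetical, and the 20% threshold and 30m window are placeholders rather than recommended values.

```yaml
groups:
  - name: ona-warm-pools                # hypothetical group name
    rules:
      - alert: WarmPoolColdStartsElevated   # hypothetical alert name
        # Share of claim attempts that found no warm instance, per pool.
        expr: |
          sum by (stack, warm_pool_id) (
            rate(warm_pool_claims_total{result="instance_not_found"}[30m])
          )
          /
          sum by (stack, warm_pool_id) (
            rate(warm_pool_claims_total[30m])
          )
          > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Warm pool {{ $labels.warm_pool_id }}: more than 20% of claims hit a cold start"
          runbook: "Review the pool's min-size and max-size against demand"
```

A pool with no claims in the window produces no result for the ratio, so idle pools do not trigger the alert.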
AWS-specific (gitpod_ec2_runner_*, gitpod_ecs_*)
Metrics specific to AWS runner deployments. The gitpod_ecs_* series are sourced from the ECS Task Metadata Endpoint and report task- and container-level resource usage. They are emitted on every ECS launch type (EC2 and Fargate) and can be used to detect a hot or memory-pressured runner task before the symptoms surface elsewhere.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_ecs_task_cpu_utilized_percent | Gauge | service, task_family | Task CPU utilization, summed across all containers in the task |
| gitpod_ecs_task_memory_utilized_bytes | Gauge | service, task_family | Task memory used in bytes, summed across all containers |
| gitpod_ecs_task_memory_limit_bytes | Gauge | service, task_family | Task memory limit (only emitted when the task definition sets a task-level memory limit) |
| gitpod_ecs_task_network_rx_bytes_total | Counter | service, task_family | Bytes received on the task’s network interfaces |
| gitpod_ecs_task_network_tx_bytes_total | Counter | service, task_family | Bytes transmitted on the task’s network interfaces |
| gitpod_ecs_container_cpu_utilized_percent | Gauge | service, task_family, container_name, container_type | Per-container CPU utilization |
| gitpod_ecs_container_memory_utilized_bytes | Gauge | service, task_family, container_name, container_type | Per-container memory used |
| gitpod_ecs_container_network_rx_bytes_total | Counter | service, task_family, container_name, container_type | Per-container bytes received |
| gitpod_ecs_container_network_tx_bytes_total | Counter | service, task_family, container_name, container_type | Per-container bytes transmitted |
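One way to catch a memory-pressured runner task is to compare used memory against the task limit. The sketch below uses hypothetical names and an assumed 90% threshold. It only produces results when gitpod_ecs_task_memory_limit_bytes is emitted, that is, when the task definition sets a task-level memory limit.

```yaml
groups:
  - name: ona-runner-ecs                # hypothetical group name
    rules:
      - alert: RunnerTaskMemoryPressure # hypothetical alert name
        # Both gauges carry the same label set, so the division matches
        # one-to-one per task without extra matching modifiers.
        expr: |
          gitpod_ecs_task_memory_utilized_bytes
          / gitpod_ecs_task_memory_limit_bytes
          > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Runner task {{ $labels.task_family }} is using more than 90% of its memory limit"
          runbook: "Check per-container memory via gitpod_ecs_container_memory_utilized_bytes"
```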
GCP-specific (gitpod_gcp_*)
Metrics specific to GCP runner deployments, tracking compute and Redis connectivity.
| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gitpod_gcp_compute_network_errors_total | Counter | error_type | Network errors |
| gitpod_gcp_compute_redis_connection_errors_total | Counter | - | Redis errors |
| gitpod_gcp_compute_redis_connection_health | Gauge | - | Redis health |
| gitpod_gcp_network_connection_health | Gauge | - | Network health |
GCP Runner alerts and dashboards
If you are using a GCP Runner, pre-built Grafana alerts and a dashboard are available that you can import directly. See Alerts and Dashboards for details.
Troubleshooting
Check network connectivity to your endpoint
Verify authentication credentials
Check runner logs for errors
Network requirements: AWS | GCP
Aggregate at runner level instead of per-environment
Adjust retention for high-volume metrics
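As an illustration of runner-level aggregation, a recording rule can sum away the per-environment instance_id label so that dashboards and alerts query a much smaller series set. This is a sketch with hypothetical group and record names. It reduces query-time cardinality only; dropping or shortening retention of the raw series is a separate step in your metrics backend and is not shown.

```yaml
groups:
  - name: ona-runner-aggregates         # hypothetical group name
    rules:
      # Error rate per runner stack, error code, and component.
      # instance_id is intentionally left out of the by clause.
      - record: stack:environment_error_errors:rate5m
        expr: |
          sum by (stack, error_code, component) (
            rate(environment_error_errors_total[5m])
          )
```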