Zone failover for capacity errors
GCP Runner · 20260508.526

If a GCP zone lacks capacity to create an environment, the runner now retries in a different zone automatically. This reduces the impact of zonal capacity exhaustion, though it does not eliminate it entirely. Runners configured with multiple zones benefit most.

Infrastructure upgrade required
This release requires a Terraform module upgrade to v2.0.1 (Terraform Registry).

Key infrastructure changes:
- SSH access restricted to IAP-only (port 22 no longer open to `0.0.0.0/0`).
- Shielded VM hardening enabled with Secure Boot, vTPM, and integrity monitoring. Project-wide SSH keys blocked on runner and proxy VMs.
- Flow logging added to security-critical firewall rules.
- Memory and CPU limits added to all Docker containers on the runner VM.
- TLS certificate rotation fixed for the auth proxy.
- Honeycomb API key removed from Terraform configuration and VM metadata.
- Managed metrics direct push enabled for the metrics pipeline.
Upgrade steps
- Update the `version` constraint in your `main.tf` module block to `v2.0.1`. See the release page for details.
- Run `terraform init -upgrade` to fetch the new module.
- Run `terraform plan -out=tfplan` and review the changes, paying attention to firewall and shielded VM settings.
- Run `terraform apply tfplan`.
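
The zone failover behavior that headlines this release can be pictured with a small sketch. This is not the runner's implementation: `ZoneCapacityError`, `create_with_zone_failover`, `fake_create`, and the zone names are hypothetical and exist only to illustrate "retry in a different zone when one lacks capacity".

```python
from typing import Callable, Optional, Sequence


class ZoneCapacityError(Exception):
    """Hypothetical error raised when a zone cannot supply the requested VM."""


def create_with_zone_failover(
    zones: Sequence[str],
    create_in_zone: Callable[[str], str],
) -> str:
    """Try each configured zone in order and return the first success."""
    last_error: Optional[Exception] = None
    for zone in zones:
        try:
            return create_in_zone(zone)
        except ZoneCapacityError as err:
            # Capacity exhaustion is zonal, so move on to the next zone.
            last_error = err
    # Every zone was exhausted (or none was configured): failover cannot help.
    raise ZoneCapacityError(f"no capacity in any of {list(zones)}") from last_error


def fake_create(zone: str) -> str:
    # Stand-in for a real provisioning call: only one zone has capacity here.
    if zone != "us-central1-b":
        raise ZoneCapacityError(zone)
    return f"vm-in-{zone}"
```

With a single configured zone there is nothing to fail over to, which is why runners configured with multiple zones benefit most.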
What else is in this release
New
- The Terraform module version used to provision your runner infrastructure is now displayed on the runner details page in the dashboard.
- External user IDs are now resolved for Bitbucket and GitLab auth tokens, enabling user attribution in Insights across all SCM providers.
Improvements
- Environments that fail to start within 10 minutes (supervisor never connects) are now stopped automatically instead of staying in “starting” indefinitely.
- Workspace folder path is correctly reported during environment creation when dotfiles are configured.
- Supervisor retries asset downloads on SHA-256 mismatch instead of failing permanently.
- File watch self-healing works reliably under Docker-in-Docker (fuse-overlayfs) after file unlink and recreate.
- Agent conversations no longer stall silently when the model pauses mid-turn.
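
The asset download change listed above (retry on SHA-256 mismatch instead of failing permanently) follows a common pattern, sketched here under stated assumptions. The function name `download_verified`, the `fetch` callback, and the `max_attempts` default are hypothetical; the release notes do not say how many times the supervisor retries.

```python
import hashlib
from typing import Callable


def download_verified(
    fetch: Callable[[], bytes],
    expected_sha256: str,
    max_attempts: int = 3,  # assumed value, not taken from the release notes
) -> bytes:
    """Call fetch() until the payload matches expected_sha256 or attempts run out."""
    for _ in range(max_attempts):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # A mismatch may be a truncated or corrupted transfer, so fetch again.
    raise ValueError(f"SHA-256 mismatch after {max_attempts} attempts")
```

A transient corruption on the first attempt is absorbed by the loop, while a persistently wrong payload still surfaces as an error once the attempts are used up.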
Security
- Credentials (AWS keys, GitHub tokens, basic-auth URLs, bearer tokens, JWTs) are now redacted from environment status messages, on-disk state files, and process-output logs.
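
To give a rough sense of what redacting these credential types involves, here is a minimal sketch using regular expressions. The patterns, the `redact` function, and the `[REDACTED]` placeholder are illustrative assumptions, not the rules the runner applies; for example, this sketch only matches AWS access key IDs, not secret access keys.

```python
import re

# Illustrative patterns only; the runner's actual rules are not documented here.
_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # GitHub tokens such as ghp_..., gho_..., ghs_...
    re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    # JWTs: three dot-separated base64url segments, the first starting with "eyJ".
    re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    # Bearer tokens: whatever follows "Bearer " in Authorization-style text.
    re.compile(r"(?<=bearer )[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
    # Basic-auth URLs: the user:password part of scheme://user:password@host.
    re.compile(r"(?<=://)[^/\s:@]+:[^/\s@]+(?=@)"),
]


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every match of the patterns above with the placeholder."""
    for pattern in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

For instance, `redact("clone https://user:s3cret@example.com/repo.git")` keeps the host and path but replaces `user:s3cret` with the placeholder.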
Warm pools now available on GCP
Warm pools keep pre-initialized Compute Engine instances in a suspended state, ready to resume when you create an environment. Instead of provisioning a new VM and loading the prebuild snapshot from scratch, Ona claims a suspended instance and resumes it. Startup drops from minutes to around 10 seconds.

Enable warm pools per environment class in your project’s prebuild settings. The runner dynamically scales the pool between your configured minimum and maximum based on demand, and rotates instances automatically when new prebuilds complete.

Requires an Enterprise plan. See the warm pools documentation for prerequisites and setup instructions.

Infrastructure upgrade required
This release requires a Terraform module upgrade to v2.0.0 to enable warm pools and apply IAM changes.

New IAM permissions added to the runner custom role:

| Permission | Purpose |
|---|---|
| `compute.autoscalers.create` | Manage MIG autoscalers for dynamic warm pool scaling |
| `compute.autoscalers.delete` | Clean up autoscalers when warm pools are removed |
| `compute.autoscalers.get` | Read autoscaler state during reconciliation |
| `compute.autoscalers.update` | Adjust autoscaler targets as demand changes |
| `compute.instanceGroupManagers.use` | Required for autoscaler to manage MIG instances |
| `compute.instances.listReferrers` | Discover which MIG owns a VM during warm pool operations |
| `compute.instances.resume` | Resume suspended warm pool VMs on claim |
| `monitoring.timeSeries.create` | Publish scaling metrics that drive the autoscaler |
- The project-level `iam.serviceAccounts.actAs` and `iam.serviceAccounts.getAccessToken` permissions have been removed from the runner custom role.
- Instead, the runner SA is granted `roles/iam.serviceAccountUser` on three specific service accounts: `runner_sa`, `environment_vm_sa`, and `proxy_vm_sa`. This limits impersonation to only the SAs the runner attaches to instances.
- The runner assets bucket role has been elevated from `roles/storage.objectViewer` to `roles/storage.objectAdmin` to support writing managed metrics audit payloads.
- Unused service accounts (`build_cache`, `secret_manager`, `pubsub_processor`) are removed.
- Environment UDP egress is now restricted to DNS, NTP, and QUIC.
Upgrade steps
- Update the `version` constraint in your `main.tf` module block to `v2.0.0`. See the release page for details.
- Run `terraform init -upgrade` to fetch the new module.
- Run `terraform plan -out=tfplan` and review the changes, paying attention to IAM and firewall rule updates.
- Run `terraform apply tfplan`.
- If you use pre-created service accounts, you must:
  - Add the new custom role permissions listed above.
  - Grant `roles/iam.serviceAccountUser` on the `runner_sa`, `environment_vm_sa`, and `proxy_vm_sa` service accounts to the runner SA.
What else is in this release
New
- Managed metrics pipeline lets you export runner metrics via Prometheus `remote_write` for monitoring runner health, environment lifecycle, and resource utilization. Contact your account team to enable it.
- Quota and capacity errors from GCP are now surfaced as clear machine failure messages instead of generic errors.
- Automation services support a configurable readiness timeout, preventing services from hanging indefinitely when a health check never passes.
- Orphaned MIGs, autoscalers, instance templates, and warm pool instances are automatically cleaned up, preventing resource leaks.
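
The configurable readiness timeout for automation services, listed above, amounts to polling a health check against a deadline. The sketch below shows that idea only; `wait_until_ready`, `timeout_s`, and `interval_s` are hypothetical names, and the actual configuration option is not named in these notes.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 0.5,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # The health check never passed: give up instead of hanging.
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval_s, remaining))
```

Using a monotonic clock keeps the deadline unaffected by wall-clock adjustments, and a `False` return lets the caller mark the service as failed rather than waiting indefinitely.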
Improvements
- Environment startup is faster. Supervisor initialization steps now run concurrently, disk pre-warming prioritizes startup-critical paths, and git configuration runs in fewer round trips.
- Warm pool claim reliability is improved. The runner picks the oldest available instance, skips in-flight instances, and recovers the default network route after resuming a suspended VM.
- Async VM creation failures are now surfaced via Pub/Sub instead of silently failing.
- Log line ordering within the same timestamp is now preserved.
- The agent operations proxy is more resilient to transient connection failures.
- Prebuild snapshots no longer carry stale git identity from the prebuild executor.
- File watch self-healing works correctly when a denylisted file is unlinked and recreated inside Docker-in-Docker.
- The runner recovers gracefully from stale gitconfig lock files.
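
The warm pool claim rule mentioned above (pick the oldest available instance, skip in-flight ones) can be sketched as a simple selection function. The `PoolInstance` fields and `pick_instance` are hypothetical, and treating "available" as "suspended and not in flight" is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PoolInstance:
    name: str
    created_at: float  # creation time as a Unix timestamp
    suspended: bool    # True once the instance is parked and ready to resume
    in_flight: bool    # True while another claim or operation is in progress


def pick_instance(instances: Iterable[PoolInstance]) -> Optional[PoolInstance]:
    """Return the oldest claimable instance, or None if the pool has none."""
    candidates = [i for i in instances if i.suspended and not i.in_flight]
    return min(candidates, key=lambda i: i.created_at, default=None)
```

Returning `None` for an empty pool leaves the fallback decision, such as provisioning a fresh VM, to the caller.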
Security
- Updated `go-jose/v4` to v4.1.4 (High severity, GHSA-78h2-9frx-2jm8).
- Updated `go.opentelemetry.io/otel/sdk` to v1.43.0 (High severity).
- Updated Node.js to v24.14.1 (High severity).
- Updated base container images and Prometheus for CVE fixes.
- Go toolchain bumped to go1.26.2 (fixes CVE-2026-27143).