Troubleshooting GCP runners

If you encounter any issues while setting up or operating a Runner, please follow these steps:

1. Review the common problems.
2. If the issue persists, reach out to support.

Contacting Support

To start a support chat, use the bubble icon located in the bottom right corner of the application. When contacting support, please include the following information:

- Any error messages and relevant screenshots.
- Runner ID, Runner version, and GCP region.
- Runner logs.

Copy Runner ID and Version

1. Navigate to Settings > Runners.
2. Locate your Runner card.
3. Click ... in the top right corner and select Copy ID.
4. The Runner Version is displayed as the last item in the menu.

Find Terraform State

1. Navigate to Settings > Runners.
2. Open the Runner card to find the deployment details including region and project ID.
3. Check your Terraform state file or remote state backend for additional deployment information.

Retrieve Runner Logs (Compute Engine Logs)

You can adjust the log level of your Runner from the Runner Configuration section to get more detailed logs for troubleshooting. See GCP Runner setup for log level configuration options.

Using GCP Console

To view the logs for the Runner using the GCP Console:

1. Navigate to the GCP Compute Engine console.
2. Locate the runner instances by filtering for your runner name or project.
3. Select the instance associated with the Runner.
4. Go to the Logs tab or click View logs to access Cloud Logging.
5. Filter logs by the runner service or container name.

Note that runner instances have multiple containers: one for the Runner itself and another for monitoring. For troubleshooting, you need the logs from the Runner container.

Using gcloud CLI

To look up the instance name and view logs with the gcloud CLI, use the following commands.

To list all instances and find your runner instances by name pattern:

gcloud compute instances list --filter="name~'.*runner.*'"

To view logs for a specific instance:

gcloud logging read "resource.type=gce_instance AND resource.labels.instance_name=INSTANCE_NAME" --limit=50

To view logs for the runner service specifically:

gcloud logging read "resource.type=gce_instance AND jsonPayload.container_name=runner" --limit=50
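
The two filters above can be combined so that only the Runner container's logs from a single instance are returned. The following is a minimal sketch that assumes the field names used in the commands above; the function names are hypothetical helpers, not part of gcloud.

```shell
# Build a Cloud Logging filter for the runner container on one instance.
# Combines the two filters shown above; function names are illustrative.
runner_log_filter() {
  instance_name="$1"
  printf 'resource.type=gce_instance AND resource.labels.instance_name=%s AND jsonPayload.container_name=runner\n' "$instance_name"
}

# Read up to 50 matching entries from the last hour.
read_runner_logs() {
  gcloud logging read "$(runner_log_filter "$1")" --limit=50 --freshness=1h
}

# Usage: read_runner_logs INSTANCE_NAME
```

Narrowing the time window with --freshness keeps the output focused on the period in which the problem occurred.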

Monitoring and Metrics

If you have configured metrics collection, your monitoring system will receive Runner metrics. The GCP Runner includes Prometheus metrics collection on port 9090. For information on configuring metrics collection, see GCP Runner setup.

Common Problems

Network misconfigurations and IAM permission issues are the most frequent causes of installation issues. Please refer to the infrastructure prerequisites to ensure all requirements are met. Below are common problems along with their diagnostics.

Terraform Deployment Fails

Symptoms:
- Terraform apply fails with resource creation errors.
- Error messages related to missing VPC, subnets, or insufficient permissions.
- Terraform state shows failed resource creation with status reasons:
  - Error creating instance: googleapi: Error 400: Invalid value for field 'resource.networkInterfaces[0].subnetwork'
  - Error creating forwarding rule: googleapi: Error 403: Insufficient Permission
  - Error creating service account: googleapi: Error 403: Permission denied
Diagnostics:
- Verify your service account has all required IAM roles from the Access Requirements page.
- Ensure the VPC and subnet specified in terraform.tfvars exist and are accessible.
- Check that the specified zones are available in your selected region.
- For internal load balancers, verify the proxy-only subnet exists with purpose REGIONAL_MANAGED_PROXY.

Runner Instance Fails to Start

Symptoms:
- Terraform deployment succeeds but runner instances fail to start or remain unhealthy.
- Health check validation fails during Terraform apply.
- Runner instances show as “RUNNING” but fail health checks.
- Cloud Logging shows container startup errors or authentication failures.
Diagnostics:
- Verify that the VPC has Cloud NAT or external IP addresses configured for internet access.
- Check that firewall rules allow the required outbound traffic to Ona services.
- Ensure the runner token and runner ID are correct and haven’t expired.
- For proxy environments, verify the proxy configuration is correct and accessible.
- Check that required container images are accessible from your project.
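
When an instance is running but unhealthy, its serial console output can show startup errors that happen before logs reach Cloud Logging. The sketch below fetches that output and keeps only lines that look like failures. The keyword list and function names are assumptions chosen for illustration, not patterns the Runner is known to emit.

```shell
# Keep only lines that look like startup or authentication failures.
# The keyword list is an illustrative guess; adjust it to what you see.
filter_startup_errors() {
  grep -iE 'error|denied|unauthorized|failed|timeout' || true
}

# Fetch the serial console output of an instance and filter it.
startup_errors() {
  gcloud compute instances get-serial-port-output "$1" --zone="$2" \
    | filter_startup_errors
}

# Usage: startup_errors INSTANCE_NAME ZONE_NAME
```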

Machine Type Not Available

If you encounter an error stating that the requested machine type is unavailable in a specific zone (e.g., “The zone ‘projects/PROJECT_ID/zones/us-central1-c’ does not have enough resources available to fulfill the request”), this is often due to regional or zone-specific availability constraints within GCP.

Some zones may experience resource shortages more frequently. If possible, avoid using single zones exclusively and instead deploy your runners across multiple zones for better availability.

Here’s how you can address this:

Deploy to a Different Region:
- Some machine types may be unavailable in certain regions or zones due to resource constraints. Refer to GCP machine type availability for details. If necessary, deploy runners to a different GCP region that supports your preferred machine type.
Select Multiple Zones:
- When configuring your Terraform deployment, ensure that you specify multiple zones in the zones variable. For example, instead of restricting your deployment to only us-central1-c, include zones like us-central1-a and us-central1-b to improve availability.
  - You can update the zones variable in your terraform.tfvars and run terraform apply to update the deployment.
Use an Alternate Machine Type:
- If the desired machine type (e.g., n2-standard-4) is unavailable, consider using a different machine type, such as e2-standard-4 or n1-standard-4, which may have better availability.
- To update, modify the runner_vm_config.machine_type in your Terraform configuration and apply the changes.
Retry Later:
- Machine type availability can be transient. If none of the above options resolve the issue, wait and try again later, as GCP resources might become available after a brief period.
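
Before changing the zones variable, it can help to confirm which candidate zones offer the machine type at all. The sketch below uses gcloud compute machine-types list; the function names are hypothetical. Note that it only shows where the type is offered, not whether capacity is currently available, so a zone it lists can still return the resource error above.

```shell
# Join zone names into the comma-separated form that --zones expects.
join_zones() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Print the zones, among those given, that offer a machine type.
zones_offering() {
  machine_type="$1"
  shift
  gcloud compute machine-types list \
    --filter="name=${machine_type}" \
    --zones="$(join_zones "$@")" \
    --format="value(zone)"
}

# Usage: zones_offering n2-standard-4 us-central1-a us-central1-b us-central1-c
```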

Unexpected Costs

Symptoms:
- You notice unexpected charges in your GCP bill that you believe are related to the Runner infrastructure.
- You continue receiving bills for resources even after destroying the Terraform deployment.
Diagnostics:
- Use the GCP Billing console to investigate the specific GCP resources contributing to the charges.
- After running terraform destroy, verify that all resources have been fully deleted. Check for any residual resources such as:
  - Compute Engine instances or persistent disks
  - Load balancer components (forwarding rules, backend services)
  - Cloud Storage buckets
  - Memorystore Redis instances
- Use the GCP Console or gcloud CLI to manually delete any remaining resources if necessary to avoid ongoing costs.
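
The residual-resource check can be scripted roughly as follows. This is a sketch under assumptions: it presumes your Runner resources share a recognizable name pattern (passed as the first argument), which may not hold for your deployment, and the function names are illustrative.

```shell
# Summarize a resource listing read from stdin (one resource per line).
report_leftovers() {
  label="$1"
  count="$(grep -c . || true)"
  printf '%s: %s remaining\n' "$label" "$count"
}

# Count resources matching a name pattern that may survive terraform destroy.
check_leftovers() {
  name_pattern="$1"
  region="$2"
  gcloud compute instances list --filter="name~${name_pattern}" --format="value(name)" \
    | report_leftovers "instances"
  gcloud compute disks list --filter="name~${name_pattern}" --format="value(name)" \
    | report_leftovers "disks"
  gcloud compute forwarding-rules list --filter="name~${name_pattern}" --format="value(name)" \
    | report_leftovers "forwarding rules"
  gcloud compute backend-services list --filter="name~${name_pattern}" --format="value(name)" \
    | report_leftovers "backend services"
  gcloud storage buckets list --format="value(name)" | grep -- "${name_pattern}" \
    | report_leftovers "buckets"
  gcloud redis instances list --region="${region}" --filter="name~${name_pattern}" --format="value(name)" \
    | report_leftovers "Redis instances"
}

# Usage: check_leftovers NAME_PATTERN REGION
```

A count above zero is only a prompt to look closer; confirm that a resource belongs to the destroyed deployment before deleting it.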

Load Balancer Connectivity Issues

Symptoms:
- Cannot access the runner domain or environments through the load balancer.
- Health checks fail for the load balancer backend services.
- SSL/TLS certificate errors when accessing the runner domain.
- DNS resolution fails for the runner domain.
Diagnostics:
- For internal load balancers: Verify that your corporate network has routing to the VPC and can reach the internal IP address.
- For external load balancers: Check that DNS records point to the correct external IP address.
- Verify SSL certificate configuration:
  - For internal LB: Check that the certificate is properly stored in Secret Manager
  - For external LB: Verify the Certificate Manager certificate is valid and covers your domain
- Test connectivity from different network locations to isolate routing issues.
- Check that firewall rules allow HTTPS traffic (port 443) to the load balancer.

IAM Permission Issues

Symptoms:
- Terraform deployment fails with permission denied errors.
- Runner instances cannot access required GCP services (Secret Manager, Cloud Storage, etc.).
- Error messages like Error 403: Insufficient Permission or Permission denied.
Diagnostics:
- Verify your deployment service account has all required roles from the Access Requirements page.
- Check that service accounts created by Terraform have the correct IAM bindings.
- For organizations with custom IAM policies, ensure they don’t block required permissions.
- Test service account permissions using gcloud auth activate-service-account and attempting the failing operations manually.
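
One way to compare the roles a service account holds against the roles it needs is sketched below. The function names are hypothetical, and the required-role list is deliberately left as a placeholder: take it from the Access Requirements page. The lookup only covers project-level bindings, so roles inherited from a folder or organization, or granted on individual resources, will not appear.

```shell
# Print the required roles that are absent from the granted roles.
# $1: required roles, newline-separated; stdin: granted roles, one per line.
missing_roles() {
  required="$1"
  granted="$(cat)"
  printf '%s\n' "$required" | grep -vxF -e "$granted" || true
}

# List the project-level roles bound to a service account.
granted_roles() {
  project_id="$1"
  sa_email="$2"
  gcloud projects get-iam-policy "$project_id" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${sa_email}" \
    --format="value(bindings.role)"
}

# Usage (REQUIRED_ROLES is a placeholder you fill in yourself):
#   granted_roles PROJECT_ID SA_EMAIL | missing_roles "$REQUIRED_ROLES"
```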

Network Connectivity Issues

If you experience connectivity issues with your GCP Runner, follow these troubleshooting steps to diagnose and resolve common networking problems.

Common Network Issues

Work through the following checks in order:

Verify firewall rules
- Ensure port 29222 is open for SSH access to development environments
- Check that outbound rules allow HTTPS traffic to required endpoints
- Verify internal communication ports are allowed between runner components
Check VPC and subnet configuration
- Confirm Private Google Access is enabled on the runner subnet
- Verify Cloud NAT or external IP addresses are configured for internet access
- Ensure proxy-only subnet exists for internal load balancers
Validate DNS resolution
- Test DNS resolution for app.gitpod.io and required endpoints
- Verify corporate DNS can resolve your runner domain
- Check that VPC DNS settings are properly configured
Test connectivity to Ona services
- From a Compute Engine instance in your runner’s subnet, test connectivity to required endpoints
- Use tools like curl or telnet to verify connectivity
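
Two of the checks above lend themselves to a quick script: whether Private Google Access is enabled on the subnet, and whether a hostname resolves from an instance in that subnet. This is a sketch with hypothetical function names; it assumes gcloud is authenticated and that getent is available, as it is on most Linux images.

```shell
# Interpret the value gcloud prints for a subnet's privateIpGoogleAccess field.
pga_status() {
  case "$1" in
    True|true) echo "enabled" ;;
    *)         echo "disabled" ;;
  esac
}

# Report whether Private Google Access is enabled on a subnet.
check_private_google_access() {
  pga_status "$(gcloud compute networks subnets describe "$1" \
    --region="$2" --format="value(privateIpGoogleAccess)")"
}

# Report whether a hostname resolves from the current machine.
check_dns() {
  if getent hosts "$1" >/dev/null; then
    echo "$1 resolves"
  else
    echo "$1 does not resolve"
  fi
}

# Usage:
#   check_private_google_access SUBNET_NAME REGION
#   check_dns app.gitpod.io
```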

Health Endpoint Connectivity Test

Test the health endpoint to verify network connectivity and load balancer functionality:

# Test health endpoint connectivity (returns HTTP 200 on success)
curl -v https://your-domain.com/_health

# For internal load balancers, test from within your corporate network
# (-k skips certificate verification, so this call does not validate the certificate)
curl -k https://your-internal-domain.com/_health

Replace your-domain.com with the domain name configured during setup. A successful response returns an HTTP 200 status code, indicating that:

- DNS resolution is working correctly
- Load balancer is accessible from your network
- SSL/TLS certificate is properly configured
- Basic network connectivity is established

If this test fails, check:

- DNS configuration and propagation
- Firewall rules allowing HTTPS traffic
- Load balancer health and backend service status
- SSL certificate validity and domain matching
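
To turn the health check into a quick triage step, the status code can be captured and mapped to the likely cause. This is a sketch: the function names are hypothetical, and the interpretation of 5xx responses is an assumption about typical load balancer behavior rather than documented Runner behavior.

```shell
# Map the status code reported by curl to a short diagnosis.
# curl reports 000 when no HTTP response was received at all, which
# includes DNS failures, blocked connections, and TLS handshake errors.
interpret_health_code() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "no response: check DNS, firewall rules, routing, and certificate" ;;
    5??) echo "load balancer reachable, backend may be unhealthy" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Request the health endpoint of a domain and print the diagnosis.
check_health() {
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "https://$1/_health")"
  interpret_health_code "$code"
}

# Usage: check_health your-domain.com
```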

Required Endpoints Connectivity Test

Test connectivity to these critical endpoints from your runner’s subnet:

# Test HTTPS connectivity to Ona services
curl -I https://app.gitpod.io
curl -I https://releases.gitpod.io

# Test connectivity to GCP services
curl -I https://storage.googleapis.com
curl -I https://secretmanager.googleapis.com

Restarting Runner Instances After Networking Changes

After applying networking changes (such as firewall rule updates, VPC modifications, or proxy configurations), you may need to restart the runner instances to ensure the changes take effect.

Using the GCP Console

1. Navigate to the GCP Compute Engine console
2. Filter instances by your runner name or project
3. Select the runner instances you want to restart
4. Click Stop and wait for instances to stop completely
5. Click Start to restart the instances with updated networking configuration

The managed instance group will automatically recreate instances if needed.

Using gcloud CLI

You can also restart runner instances using the gcloud CLI:

# List runner instances
gcloud compute instances list --filter="name~'.*runner.*'"

# Delete an instance (the managed instance group will automatically recreate it)
gcloud compute instances delete INSTANCE_NAME --zone=ZONE_NAME

# Or restart the entire managed instance group
gcloud compute instance-groups managed rolling-action restart INSTANCE_GROUP_NAME --region=REGION

Using Terraform

For a complete refresh of the deployment:

# Recreate all instances with updated configuration
terraform apply -replace="google_compute_region_instance_group_manager.runner"
terraform apply -replace="google_compute_region_instance_group_manager.proxy"
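
The address passed to -replace must match the resource address in your state exactly. If the Runner resources are declared inside a module, the address carries a module prefix. The sketch below looks up the full addresses; it assumes you run it from the directory that holds your Terraform configuration, and the function names are illustrative.

```shell
# Keep only instance group manager addresses from `terraform state list` output.
filter_igm_addresses() {
  grep 'google_compute_region_instance_group_manager\.' || true
}

# Print the full addresses to pass to `terraform apply -replace=...`.
list_igm_addresses() {
  terraform state list | filter_igm_addresses
}

# Usage: list_igm_addresses
```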

Verification Steps

After making networking changes and restarting instances:

Check Runner status in Ona
- Go to Settings > Runners in your Ona dashboard
- Verify the Runner shows as “Connected”
Test Environment creation
- Create a new Environment using the Runner
- Verify the Environment starts successfully
Monitor Cloud Logging
- Check Compute Engine logs for any connectivity errors
- Look for successful connections to Ona services

Proxy Configuration Issues

Symptoms:
- Runner instances cannot reach external services through corporate proxy.
- Container image pulls fail through proxy.
- SSL/TLS certificate validation errors in proxy environments.
Diagnostics:
- Verify proxy configuration in terraform.tfvars includes all required settings:
  - http_proxy, https_proxy, no_proxy variables
- Check that no_proxy includes required internal domains and IP ranges.
- For custom CA certificates, verify the certificate is properly configured and accessible.
- Test proxy connectivity from a test instance in the same subnet.
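
The no_proxy and proxy connectivity checks can be approximated from a test instance as sketched below. The function names are hypothetical, and the matching rule (exact host or domain suffix) is a simplification: tools differ in how they treat wildcards, CIDR ranges, and ports in no_proxy, so a match here does not guarantee every client bypasses the proxy.

```shell
# Return success if HOST is covered by a comma-separated no_proxy list.
# Matching is exact-or-suffix only; this is an approximation.
in_no_proxy() {
  host="$1"
  rest="$2"
  while [ -n "$rest" ]; do
    entry="${rest%%,*}"
    if [ "$entry" = "$rest" ]; then rest=""; else rest="${rest#*,}"; fi
    entry="${entry#.}"   # ".example.com" and "example.com" match alike
    [ -n "$entry" ] || continue
    case "$host" in
      "$entry"|*."$entry") return 0 ;;
    esac
  done
  return 1
}

# Send a HEAD request to a URL through the proxy in $https_proxy.
test_via_proxy() {
  curl -sS -o /dev/null -I -x "$https_proxy" --max-time 10 "$1"
}

# Usage:
#   in_no_proxy metadata.google.internal "$no_proxy" && echo "bypasses proxy"
#   test_via_proxy https://app.gitpod.io
```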

CMEK Encryption Issues

Symptoms:
- Terraform fails to create encrypted resources.
- Error messages related to KMS key access or encryption.
- Resources fail to start due to encryption key unavailability.
Diagnostics:
- Verify KMS key exists and is in the correct region.
- Check that service accounts have roles/cloudkms.cryptoKeyEncrypterDecrypter permission on the KMS key.
- For automatically created CMEK keys, ensure the deployment service account has KMS admin permissions.
- Verify the key is not disabled or scheduled for destruction.
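
The key checks above can be run with the gcloud CLI roughly as follows. The function names are hypothetical; the key, key ring, and location arguments are your own values.

```shell
# Decide whether a key version state allows encrypt and decrypt operations.
# Any state other than ENABLED (for example DISABLED or DESTROY_SCHEDULED)
# is treated as unusable.
key_state_usable() {
  [ "$1" = "ENABLED" ]
}

# Print the state of a key's primary version.
primary_key_state() {
  gcloud kms keys describe "$1" --keyring="$2" --location="$3" \
    --format="value(primary.state)"
}

# Show which members hold which roles on the key.
key_iam_policy() {
  gcloud kms keys get-iam-policy "$1" --keyring="$2" --location="$3"
}

# Usage:
#   key_state_usable "$(primary_key_state KEY_NAME KEYRING LOCATION)" || echo "key not usable"
#   key_iam_policy KEY_NAME KEYRING LOCATION
```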
