- Review the common problems.
- If the issue persists, reach out to support.
Contacting Support
To start a support chat, use the bubble icon located in the bottom right corner of the application. When contacting support, please include the following information:
- Any error messages and relevant screenshots.
- Runner ID, Runner version, and GCP region.
- Runner logs.
Copy Runner ID and Version
- Navigate to Settings > Runners.
- Locate your Runner card.
- Click ... in the top right corner and select Copy ID.
- The Runner Version is displayed as the last item in the menu.
Find Terraform State
- Navigate to Settings > Runners.
- Open the Runner card to find the deployment details including region and project ID.
- Check your Terraform state file or remote state backend for additional deployment information.
Retrieve Runner Logs (Compute Engine Logs)
You can adjust the log level of your Runner from the Runner Configuration section to get more detailed logs for troubleshooting. See GCP Runner setup for log level configuration options.
Using GCP Console
To view the logs for the Runner using the GCP Console:
- Navigate to the GCP Compute Engine console.
- Locate the runner instances by filtering for your runner name or project.
- Select the instance associated with the Runner.
- Go to the Logs tab or click View logs to access Cloud Logging.
- Filter logs by the runner service or container name.
Note that runner instances run multiple containers: one for the Runner itself and another for monitoring. Use the logs from the Runner container.
Using gcloud CLI
To look up the instance name and view logs using gcloud CLI, follow these commands:
- To list all instances and find your runner instances by name pattern:
- To view logs for a specific instance:
- To view logs for the runner service specifically:
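The lookups above can be sketched with standard gcloud commands. This is a minimal sketch, not the exact commands from the setup guide: the function names are hypothetical, the name pattern and instance ID are placeholders, and it assumes gcloud is installed and authenticated against the runner's project. The filter for the runner service depends on the container name used in your deployment, which is not given here, so it is left as an optional extra filter clause.

```shell
# List instances whose name matches a pattern (regex match via ~).
# $1: name pattern, e.g. part of your runner name (placeholder)
list_runner_instances() {
  gcloud compute instances list \
    --filter="name~$1" \
    --format="table(name,zone.basename(),status,id)"
}

# Read recent Cloud Logging entries for one instance.
# $1: numeric instance ID (the id column from the listing above)
# $2: optional extra filter clause, e.g. 'severity>=ERROR'
read_instance_logs() {
  filter="resource.type=\"gce_instance\" AND resource.labels.instance_id=\"$1\""
  if [ -n "${2:-}" ]; then
    filter="$filter AND $2"
  fi
  gcloud logging read "$filter" --limit=100 --freshness=1h
}
```

For example, list_runner_instances my-runner prints the matching instances, and read_instance_logs followed by one of the listed IDs shows that instance's entries from the last hour.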
Monitoring and Metrics
If you have configured metrics collection, your monitoring system will receive Runner metrics. The GCP Runner includes Prometheus metrics collection on port 9090. For information on configuring metrics collection, see GCP Runner setup.
Common Problems
Network misconfigurations and IAM permission issues are the most frequent causes of installation issues. Please refer to the infrastructure prerequisites to ensure all requirements are met. Below are common problems along with their diagnostics.
Terraform Deployment Fails
- Symptoms:
  - Terraform apply fails with resource creation errors.
  - Error messages related to missing VPC, subnets, or insufficient permissions.
  - Terraform state shows failed resource creation with status reasons:
    Error creating instance: googleapi: Error 400: Invalid value for field 'resource.networkInterfaces[0].subnetwork'
    Error creating forwarding rule: googleapi: Error 403: Insufficient Permission
    Error creating service account: googleapi: Error 403: Permission denied
- Diagnostics:
  - Verify your service account has all required IAM roles from the Access Requirements page.
  - Ensure the VPC and subnet specified in terraform.tfvars exist and are accessible.
  - Check that the specified zones are available in your selected region.
  - For internal load balancers, verify the proxy-only subnet exists with purpose REGIONAL_MANAGED_PROXY.
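The subnet checks above can be sketched as follows. The function names are hypothetical and the region is a placeholder. The sketch assumes gcloud is authenticated against the project that hosts the VPC.

```shell
# List subnets in a region with their network, range, and purpose,
# to confirm that the subnet named in terraform.tfvars exists.
# $1: region, e.g. us-central1
list_subnets() {
  gcloud compute networks subnets list \
    --filter="region:$1" \
    --format="table(name,network.basename(),ipCidrRange,purpose)"
}

# Print only proxy-only subnets in a region (needed for internal load balancers).
# An empty result means no subnet with purpose REGIONAL_MANAGED_PROXY was found.
list_proxy_only_subnets() {
  gcloud compute networks subnets list \
    --filter="region:$1 AND purpose=REGIONAL_MANAGED_PROXY" \
    --format="value(name)"
}
```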
Runner Instance Fails to Start
- Symptoms:
  - Terraform deployment succeeds but runner instances fail to start or remain unhealthy.
  - Health check validation fails during Terraform apply.
  - Runner instances show as “RUNNING” but fail health checks.
  - Cloud Logging shows container startup errors or authentication failures.
- Diagnostics:
  - Verify that the VPC has Cloud NAT or external IP addresses configured for internet access.
  - Check that firewall rules allow the required outbound traffic to Ona services.
  - Ensure the runner token and runner ID are correct and haven’t expired.
  - For proxy environments, verify the proxy configuration is correct and accessible.
  - Check that required container images are accessible from your project.
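Two of these checks can be sketched from the command line. The function names are hypothetical, and the router, region, instance, and zone values are placeholders. The sketch assumes Cloud NAT is attached to a Cloud Router you can name; it does not cover deployments that use external IP addresses instead.

```shell
# List the Cloud NAT gateways configured on a Cloud Router.
# $1: Cloud Router name, $2: region
list_cloud_nat() {
  gcloud compute routers nats list --router="$1" --region="$2"
}

# Print the serial console output of an instance, which often contains
# boot and container startup errors.
# $1: instance name, $2: zone
show_boot_output() {
  gcloud compute instances get-serial-port-output "$1" --zone="$2"
}
```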
Machine Type Not Available
If you encounter an error stating that the requested machine type is unavailable in a specific zone (e.g., “The zone ‘projects/PROJECT_ID/zones/us-central1-c’ does not have enough resources available to fulfill the request”), this is often due to regional or zone-specific availability constraints within GCP.
Some zones may experience resource shortages more frequently. If possible, avoid using single zones exclusively and instead deploy your runners across multiple zones for better availability.
Here’s how you can address this:
- Deploy to a Different Region:
  - Some machine types may be unavailable in certain regions or zones due to resource constraints. Refer to GCP machine type availability for details. If necessary, deploy runners to a different GCP region that supports your preferred machine type.
- Select Multiple Zones:
  - When configuring your Terraform deployment, ensure that you specify multiple zones in the zones variable. For example, instead of restricting your deployment to only us-central1-c, include zones like us-central1-a and us-central1-b to improve availability.
    - You can update the zones variable in your terraform.tfvars and run terraform apply to update the deployment.
- Use an Alternate Machine Type:
  - If the desired machine type (e.g., n2-standard-4) is unavailable, consider using a different machine type, such as e2-standard-4 or n1-standard-4, which may have better availability.
  - To update, modify the runner_vm_config.machine_type in your Terraform configuration and apply the changes.
- Retry Later:
  - Machine type availability can be transient. If none of the above options resolve the issue, wait and try again later, as GCP resources might become available after a brief period.
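Before changing zones or machine types, it can help to see which zones offer a given machine type. The sketch below uses a hypothetical function name and placeholder arguments. Note that it only shows where a machine type is offered, not whether capacity is currently available there, so a listed zone can still return the resource error above.

```shell
# Print the zones in a region that offer a given machine type.
# $1: machine type, e.g. n2-standard-4
# $2: region prefix, e.g. us-central1
list_zones_with_machine_type() {
  gcloud compute machine-types list \
    --filter="name=$1 AND zone~^$2" \
    --format="value(zone)"
}
```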
Unexpected Costs
- Symptoms:
  - You notice unexpected charges in your GCP bill that you believe are related to the Runner infrastructure.
  - You continue receiving bills for resources even after destroying the Terraform deployment.
- Diagnostics:
  - Use the GCP Billing console to investigate the specific GCP resources contributing to the charges.
  - After running terraform destroy, verify that all resources have been fully deleted. Check for any residual resources such as:
    - Compute Engine instances or persistent disks
    - Load balancer components (forwarding rules, backend services)
    - Cloud Storage buckets
    - Memorystore Redis instances
  - Use the GCP Console or gcloud CLI to manually delete any remaining resources if necessary to avoid ongoing costs.
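A quick inventory of the resource types listed above can be sketched as follows. The function name is hypothetical, and the sketch assumes the deployment's resources share a recognizable name pattern, which may not hold for your setup. Buckets are listed without a name filter, so review that output manually.

```shell
# List resources that may remain after terraform destroy.
# $1: name pattern shared by the deployment's resources (assumption)
# $2: region used for the Memorystore Redis lookup
list_residual_resources() {
  gcloud compute instances list --filter="name~$1"
  gcloud compute disks list --filter="name~$1"
  gcloud compute forwarding-rules list --filter="name~$1"
  gcloud compute backend-services list --filter="name~$1"
  gcloud storage buckets list --format="value(name)"
  gcloud redis instances list --region="$2"
}
```

The function only lists resources. Deleting anything it reports is left as a deliberate manual step.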
Load Balancer Connectivity Issues
- Symptoms:
  - Cannot access the runner domain or environments through the load balancer.
  - Health checks fail for the load balancer backend services.
  - SSL/TLS certificate errors when accessing the runner domain.
  - DNS resolution fails for the runner domain.
- Diagnostics:
  - For internal load balancers: Verify that your corporate network has routing to the VPC and can reach the internal IP address.
  - For external load balancers: Check that DNS records point to the correct external IP address.
  - Verify SSL certificate configuration:
    - For internal LB: Check that the certificate is properly stored in Secret Manager
    - For external LB: Verify the Certificate Manager certificate is valid and covers your domain
  - Test connectivity from different network locations to isolate routing issues.
  - Check firewall rules allow HTTPS traffic (port 443) to the load balancer.
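The DNS and certificate checks can be sketched with dig and openssl, run from a machine that should be able to reach the load balancer. The function names are hypothetical and the domain is a placeholder. The sketch assumes both tools are installed.

```shell
# Print the addresses the runner domain resolves to.
# Compare the result with the load balancer's internal or external IP.
# $1: runner domain
resolve_domain() {
  dig +short "$1"
}

# Print the subject and validity dates of the certificate served on port 443.
# $1: runner domain
show_certificate() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null |
    openssl x509 -noout -subject -dates
}
```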
IAM Permission Issues
- Symptoms:
  - Terraform deployment fails with permission denied errors.
  - Runner instances cannot access required GCP services (Secret Manager, Cloud Storage, etc.).
  - Error messages like Error 403: Insufficient Permission or Permission denied.
- Diagnostics:
  - Verify your deployment service account has all required roles from the Access Requirements page.
  - Check that service accounts created by Terraform have the correct IAM bindings.
  - For organizations with custom IAM policies, ensure they don’t block required permissions.
  - Test service account permissions using gcloud auth activate-service-account and attempting the failing operations manually.
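To compare a service account's roles with the Access Requirements page, the project-level bindings can be listed as sketched below. The function name is hypothetical, and the project ID and service account email are placeholders. Roles granted on individual resources or at the folder or organization level do not appear in this output.

```shell
# Print the project-level roles bound to a service account.
# $1: project ID
# $2: service account email
list_roles_for_service_account() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$2" \
    --format="value(bindings.role)"
}
```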
Network Connectivity Issues
If you experience connectivity issues with your GCP Runner, follow these troubleshooting steps to diagnose and resolve common networking problems.
Common Network Issues
If you experience connectivity issues:
- Verify firewall rules
  - Ensure port 29222 is open for SSH access to development environments
  - Check that outbound rules allow HTTPS traffic to required endpoints
  - Verify internal communication ports are allowed between runner components
- Check VPC and subnet configuration
  - Confirm Private Google Access is enabled on the runner subnet
  - Verify Cloud NAT or external IP addresses are configured for internet access
  - Ensure proxy-only subnet exists for internal load balancers
- Validate DNS resolution
  - Test DNS resolution for app.gitpod.io and required endpoints
  - Verify corporate DNS can resolve your runner domain
  - Check that VPC DNS settings are properly configured
- Test connectivity to Ona services
  - From a Compute Engine instance in your runner’s subnet, test connectivity to required endpoints
  - Use tools like curl or telnet to verify connectivity
Health Endpoint Connectivity Test
Test the health endpoint to verify network connectivity and load balancer functionality. When running the test, replace your-domain.com with your actual domain name configured during setup. A successful response returns an HTTP 200 status code, indicating that:
- DNS resolution is working correctly
- Load balancer is accessible from your network
- SSL/TLS certificate is properly configured
- Basic network connectivity is established
If the test fails, check:
- DNS configuration and propagation
- Firewall rules allowing HTTPS traffic
- Load balancer health and backend service status
- SSL certificate validity and domain matching
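A health check along these lines can be sketched with curl. The function name is hypothetical, and the health endpoint path is not given in this section, so the full URL is passed in as an argument rather than assumed. The sketch assumes curl is installed on the machine you test from.

```shell
# Request a URL and report whether it answered with HTTP 200.
# $1: full health URL on your runner domain (path taken from your setup guide)
check_health() {
  # curl reports 000 as the status code when no HTTP response was received,
  # for example on DNS, connection, or TLS failures.
  status="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1")"
  if [ "$status" = "200" ]; then
    echo "healthy ($status)"
  else
    echo "unhealthy ($status)"
    return 1
  fi
}
```

A status of 000 points to DNS, routing, firewall, or certificate problems, while a non-200 HTTP status points to the load balancer or its backends.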
Required Endpoints Connectivity Test
Test connectivity to these critical endpoints from your runner’s subnet:
Restarting Runner Instances After Networking Changes
After applying networking changes (such as firewall rule updates, VPC modifications, or proxy configurations), you may need to restart the runner instances to ensure the changes take effect.
Using the GCP Console
- Navigate to the GCP Compute Engine console
- Filter instances by your runner name or project
- Select the runner instances you want to restart
- Click Stop and wait for instances to stop completely
- Click Start to restart the instances with updated networking configuration
- The managed instance group will automatically recreate instances if needed
Using gcloud CLI
You can also restart runner instances using the gcloud CLI:
Using Terraform
For a complete refresh of the deployment:
Verification Steps
After making networking changes and restarting instances:
- Check Runner status in Ona
  - Go to Settings > Runners in your Ona dashboard
  - Verify the Runner shows as “Connected”
- Test Environment creation
  - Create a new Environment using the Runner
  - Verify the Environment starts successfully
- Monitor Cloud Logging
  - Check Compute Engine logs for any connectivity errors
  - Look for successful connections to Ona services
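A gcloud-based restart followed by a log check can be sketched as below. This is a minimal sketch rather than the guide's own commands: the function names are hypothetical, and the instance name, zone, and instance ID are placeholders. Because the instances belong to a managed instance group, the group may recreate an instance while it is stopped.

```shell
# Stop an instance and start it again once the stop has completed.
# $1: instance name, $2: zone
restart_instance() {
  gcloud compute instances stop "$1" --zone="$2" &&
    gcloud compute instances start "$1" --zone="$2"
}

# Show error-level log entries for an instance from the last 15 minutes.
# $1: numeric instance ID
recent_instance_errors() {
  gcloud logging read \
    "resource.type=\"gce_instance\" AND resource.labels.instance_id=\"$1\" AND severity>=ERROR" \
    --limit=50 --freshness=15m
}
```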
Proxy Configuration Issues
- Symptoms:
  - Runner instances cannot reach external services through corporate proxy.
  - Container image pulls fail through proxy.
  - SSL/TLS certificate validation errors in proxy environments.
- Diagnostics:
  - Verify proxy configuration in terraform.tfvars includes all required settings:
    - http_proxy, https_proxy, no_proxy variables
  - Check that no_proxy includes required internal domains and IP ranges.
  - For custom CA certificates, verify the certificate is properly configured and accessible.
  - Test proxy connectivity from a test instance in the same subnet.
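The proxy connectivity test can be sketched with curl from a test instance in the runner's subnet. The function name is hypothetical and the proxy URL is a placeholder. The target used in the example, app.gitpod.io, is the endpoint named earlier in this guide.

```shell
# Request a URL through an explicit proxy and print the HTTP status code.
# curl prints 000 when the proxy or the target could not be reached.
# $1: proxy URL, e.g. the value of https_proxy from terraform.tfvars
# $2: target URL to test through the proxy
check_via_proxy() {
  # If the proxy re-signs TLS traffic with a custom CA, curl's --cacert
  # option can be added here to point at that CA certificate file.
  curl --silent --output /dev/null --max-time 10 \
    --proxy "$1" --write-out '%{http_code}\n' "$2"
}
```

For example, check_via_proxy with your proxy URL and https://app.gitpod.io should print an HTTP status code rather than 000 when the proxy path works.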
CMEK Encryption Issues
- Symptoms:
  - Terraform fails to create encrypted resources.
  - Error messages related to KMS key access or encryption.
  - Resources fail to start due to encryption key unavailability.
- Diagnostics:
  - Verify KMS key exists and is in the correct region.
  - Check that service accounts have roles/cloudkms.cryptoKeyEncrypterDecrypter permission on the KMS key.
  - For automatically created CMEK keys, ensure the deployment service account has KMS admin permissions.
  - Verify the key is not disabled or scheduled for destruction.
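The key checks above can be sketched with the gcloud kms commands. The function names are hypothetical, and the key, key ring, and location values are placeholders for the ones used in your deployment.

```shell
# Show a key's metadata, confirming that it exists in the given location.
# $1: key name, $2: key ring, $3: location
describe_kms_key() {
  gcloud kms keys describe "$1" --keyring="$2" --location="$3"
}

# Show the IAM policy on the key, to look for
# roles/cloudkms.cryptoKeyEncrypterDecrypter bindings.
kms_key_policy() {
  gcloud kms keys get-iam-policy "$1" --keyring="$2" --location="$3"
}

# List key versions with their state (e.g. ENABLED, DISABLED, DESTROY_SCHEDULED).
kms_key_versions() {
  gcloud kms keys versions list --key="$1" --keyring="$2" --location="$3" \
    --format="table(name.basename(),state)"
}
```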