Christian Weichel, Alejandro de Brito Fontes · October 31, 2024 · Platform Engineering

We’re leaving Kubernetes

We are moving away from Kubernetes for cloud development environments after six years of experience at scale. Learn about Gitpod Flex and our new approach to development infrastructure.

Kubernetes seems like the obvious choice for building out remote, standardized and automated development environments. We thought so too, and spent six years building the most popular cloud development environment platform at internet scale: 1.5 million users, with thousands of development environments running on a regular day. In that time, we found that Kubernetes is not the right choice for building development environments.

This is our journey of experiments, failures and dead ends building development environments on Kubernetes. Over the years, we experimented with many ideas involving SSDs, PVCs, eBPF, seccomp notify, TC and io_uring, shiftfs, FUSE and idmapped mounts, ranging from microVMs and KubeVirt to vCluster.

All of it in pursuit of the optimal infrastructure to balance security, performance and interoperability, while wrestling with the unique challenges of building a system that scales, remains secure while handling arbitrary code execution, and is stable enough for developers to work in.

This is not a story of whether or not to use Kubernetes for production workloads; that’s a whole separate conversation. As is the topic of how to build a comprehensive, soup-to-nuts developer experience for shipping applications on Kubernetes.

This is the story of how (not) to build development environments in the cloud.

Why are development environments unique?

Before we dive in, it’s crucial to understand what makes development environments unique compared to production workloads:

- They are highly stateful and interactive: source code, build caches and running processes live inside them, so they cannot simply be recreated without disrupting the developer.
- Their resource usage is unpredictable and spiky: long periods of near-idleness are followed by CPU- and IO-intensive builds, tests and indexing runs.
- They require far-reaching permissions and capabilities: developers expect to install additional tools, run Docker, or even start a Kubernetes cluster inside their environment.

These characteristics set development environments apart from typical application workloads and significantly influence the infrastructure decisions we’ve made along the way.

The system today: obviously it’s Kubernetes

When we started Gitpod, Kubernetes seemed like the ideal choice for our infrastructure. Its promise of scalability, container orchestration, and rich ecosystem aligned perfectly with our vision for cloud development environments. However, as we scaled and our user base grew, we encountered several challenges around security and state management that pushed Kubernetes to its limits. Fundamentally, Kubernetes is built to run well-controlled application workloads, not unruly development environments.

Managing Kubernetes at scale is complex. While managed services like GKE and EKS alleviate some pain points, they come with their own set of restrictions and limitations. We found that many teams looking to operate a CDE underestimate the complexity of Kubernetes, which led to a significant support load for our previous self-managed Gitpod offering.

Resource management struggles

One of the most significant challenges we faced was resource management, particularly CPU and memory allocation per environment. At first glance, running multiple environments on a node seems attractive as a way to share resources (such as CPU, memory, IO and network bandwidth) between those environments. In practice, this incurs significant noisy neighbor effects leading to a detrimental user experience.

CPU challenges

CPU time seems like the simplest candidate to share between environments. Most of the time development environments don’t need much CPU, but when they do, they need it quickly. Latency becomes immediately apparent to users when their language server starts to lag or their terminal becomes choppy. This spiky nature of the CPU requirements of development environments (periods of inactivity followed by intensive builds) makes it difficult to predict when CPU time is needed.

In search of a solution, we experimented with various CFS (Completely Fair Scheduler (https://docs.kernel.org/scheduler/sched-design-CFS.html)) based schemes, implementing a custom controller using a DaemonSet. A core challenge is that we cannot predict when CPU bandwidth is needed, but only understand when it would have been needed (by observing nr_throttled in the cgroup’s cpu.stat).
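
To make that signal concrete, here is a minimal sketch (not our production controller) of reading those counters from a cgroup v2 cpu.stat file; the cgroup path is a placeholder and the surrounding control loop is omitted:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuStat holds the throttling counters exposed by cgroup v2 in cpu.stat.
type cpuStat struct {
	NrPeriods   uint64 // enforcement intervals that have elapsed
	NrThrottled uint64 // intervals in which the cgroup was throttled
}

// readCPUStat parses cpu.stat for the given cgroup directory.
func readCPUStat(cgroupPath string) (cpuStat, error) {
	f, err := os.Open(cgroupPath + "/cpu.stat")
	if err != nil {
		return cpuStat{}, err
	}
	defer f.Close()

	var st cpuStat
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "nr_periods":
			st.NrPeriods = v
		case "nr_throttled":
			st.NrThrottled = v
		}
	}
	return st, sc.Err()
}

func main() {
	// Hypothetical workspace cgroup path; adjust to your node's layout.
	st, err := readCPUStat("/sys/fs/cgroup/workspace-1234")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A rising nr_throttled between two samples tells us, after the fact,
	// that the environment would have needed more CPU bandwidth.
	fmt.Printf("periods=%d throttled=%d\n", st.NrPeriods, st.NrThrottled)
}
```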

Even when using static CPU resource limits, challenges arise because, unlike application workloads, a development environment runs many processes in the same container. These processes compete for the same CPU bandwidth, which can lead to, for example, VS Code disconnects because the VS Code server is starved of CPU time.

We have attempted to solve this problem by adjusting the priorities of individual processes, e.g. increasing the priority of bash or vscode-server. However, these process priorities apply to the entire process group (depending on your kernel’s autogroup scheduling configuration), and hence also to the resource-hungry compilers started in a VS Code terminal. Using process priorities to counter terminal lag requires a carefully written control loop to be effective.
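
As a rough illustration of that idea, the sketch below raises the priority of selected interactive processes by name via setpriority. The process names and nice values are assumptions, not our production configuration:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"syscall"
)

// boostNice lowers the nice value (raises priority) of processes whose
// command name matches one of the given names, e.g. the shell or IDE server.
func boostNice(names map[string]int) {
	procs, _ := filepath.Glob("/proc/[0-9]*/comm")
	for _, commPath := range procs {
		data, err := os.ReadFile(commPath)
		if err != nil {
			continue
		}
		comm := strings.TrimSpace(string(data))
		nice, ok := names[comm]
		if !ok {
			continue
		}
		pid, _ := strconv.Atoi(filepath.Base(filepath.Dir(commPath)))
		// PRIO_PROCESS targets a single process; note that with autogroup
		// scheduling enabled, the effective weight is shared by the group.
		if err := syscall.Setpriority(syscall.PRIO_PROCESS, pid, nice); err != nil {
			fmt.Fprintf(os.Stderr, "renice %s (%d): %v\n", comm, pid, err)
		}
	}
}

func main() {
	// Hypothetical priorities: favor the shell and the IDE server.
	boostNice(map[string]int{
		"bash":          -5,
		"vscode-server": -5,
	})
}
```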

We introduced custom CFS- and process-priority-based control loops built on cgroup v1, and moved to cgroup v2 once it became more readily available on managed Kubernetes platforms with version 1.24. Dynamic resource allocation (https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/), introduced with Kubernetes 1.26, means one no longer needs to deploy a DaemonSet and modify cgroups directly, possibly at the expense of control loop speed and hence effectiveness. All the schemes outlined above rely on second-by-second readjustment of CFS limits and niceness values.
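
For completeness, this is roughly what “modifying cgroups directly” looks like under cgroup v2: a sketch, with made-up numbers, that rewrites a workspace’s CFS quota by writing to cpu.max:

```go
package main

import (
	"fmt"
	"os"
)

// setCPUMax writes a CFS quota/period pair to a cgroup v2 cpu.max file.
// Both values are in microseconds; a quota of "max" would mean unlimited.
func setCPUMax(cgroupPath string, quotaUs, periodUs int) error {
	payload := fmt.Sprintf("%d %d", quotaUs, periodUs)
	return os.WriteFile(cgroupPath+"/cpu.max", []byte(payload), 0644)
}

func main() {
	// Hypothetical: grant a throttled workspace 6 CPUs (600ms of quota per
	// 100ms period). A real control loop would derive this from the
	// nr_throttled signal shown earlier and re-evaluate every second.
	if err := setCPUMax("/sys/fs/cgroup/workspace-1234", 600000, 100000); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```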

Memory management

Memory management presented its own set of challenges. Assigning every environment a fixed amount of memory, so that under maximum occupancy each environment gets its fixed share, is straightforward but very limiting. In the cloud, RAM is one of the more expensive resources, hence the desire to overbook memory.

Until swap space became available in Kubernetes 1.22, memory overbooking was nearly impossible to do, because reclaiming memory inevitably means killing processes. With the addition of swap space, the need to overbook memory has somewhat gone away, since swap works well in practice for hosting development environments.

Storage performance optimization

Storage performance is important for the startup performance and experience of development environments. We found that IOPS and latency in particular affect the experience within an environment, whereas IO bandwidth directly impacts workspace startup performance, specifically when creating/restoring backups or extracting large workspace images.

We experimented with various setups to find the right balance between speed and reliability, cost and performance.

Backing up and restoring local disks proved to be an expensive operation. We implemented a solution using a DaemonSet that uploads and downloads uncompressed tar archives to/from S3. This approach required careful balancing of IO, network bandwidth, and CPU usage: for example, (de)compressing archives consumes most of the available CPU on a node, whereas the extra traffic produced by uncompressed backups usually doesn’t consume all available network bandwidth (if the number of concurrently starting/stopping workspaces is carefully controlled).
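
A simplified sketch of that streaming approach is shown below. The uploadToS3 function is a stand-in for whatever object-storage uploader is used, and the paths and keys are hypothetical:

```go
package main

import (
	"archive/tar"
	"context"
	"io"
	"log"
	"os"
	"path/filepath"
)

// tarWorkspace streams the workspace directory as an uncompressed tar archive
// into w. Skipping compression trades network bandwidth for CPU headroom.
func tarWorkspace(dir string, w io.Writer) error {
	tw := tar.NewWriter(w)
	defer tw.Close()

	return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if !info.Mode().IsRegular() {
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
}

// uploadToS3 is a placeholder for an object-storage uploader (e.g. a multipart
// upload); here it simply consumes the stream.
func uploadToS3(ctx context.Context, key string, r io.Reader) error {
	_, err := io.Copy(io.Discard, r)
	return err
}

func main() {
	pr, pw := io.Pipe()
	go func() {
		pw.CloseWithError(tarWorkspace("/workspace", pw))
	}()
	if err := uploadToS3(context.Background(), "backups/workspace-1234.tar", pr); err != nil {
		log.Fatal(err)
	}
}
```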

IO bandwidth on a node is shared across workspaces. We found that unless we limited the IO bandwidth available to each workspace, other workspaces might starve for IO bandwidth and cease to function; content backup/restore in particular produced this problem. To solve it, we implemented a cgroup-based IO limiter which imposes fixed IO bandwidth limits per environment.
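
The limiter follows the same write-a-file pattern as the CPU controls. Below is a sketch (device numbers and limits are made up) that caps a workspace cgroup’s read/write bandwidth via cgroup v2 io.max:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// limitIO writes a fixed bandwidth limit for one block device into a cgroup v2
// io.max file. major:minor identifies the device (see /proc/partitions);
// rbps/wbps are read/write bytes per second.
func limitIO(cgroupPath string, major, minor int, rbps, wbps uint64) error {
	entry := fmt.Sprintf("%d:%d rbps=%d wbps=%d", major, minor, rbps, wbps)
	return os.WriteFile(cgroupPath+"/io.max", []byte(entry), 0644)
}

func main() {
	// Hypothetical: cap a workspace to 100 MiB/s reads and 50 MiB/s writes on
	// the node's SSD (device 259:0), so one environment's backup/restore
	// cannot starve its neighbors.
	if err := limitIO("/sys/fs/cgroup/workspace-1234", 259, 0, 100<<20, 50<<20); err != nil {
		log.Fatal(err)
	}
}
```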

Autoscaling and startup time optimization

Our primary goal was to minimize startup time at all costs. Unpredictable wait times can significantly impact productivity and user satisfaction. However, this goal often conflicted with our desire to pack workspaces densely to maximize machine utilization.

We initially thought that running multiple workspaces on one node would help with startup times due to shared caches. However, this didn’t pan out as expected. The reality is that Kubernetes imposes a lower bound on startup time because of all the content operations that need to happen: content needs to be moved into place, and that takes time.

Short of keeping workspaces in hot standby (which would be prohibitively expensive), we had to find other ways to optimize startup times.

Scaling ahead: evolution of our approach

To minimize startup time, we explored various approaches to scale up and ahead:

Proportional autoscaling for peak loads

To handle peak loads more effectively, we implemented a proportional autoscaling system. This approach controls the rate of scale-up as a function of the rate of starting development environments. It works by launching empty pods using the pause image, allowing us to quickly increase our capacity in response to demand spikes.
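
A rough sketch of that idea follows, assuming a “ballast” Deployment of pause-image pods already exists; the namespace, Deployment name and proportionality factor are placeholders, and the start-rate metric is stubbed out:

```go
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// recentStartsPerMinute would come from your own telemetry; hardcoded here.
func recentStartsPerMinute() int { return 5 }

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	const (
		namespace  = "workspace-system" // hypothetical namespace
		deployment = "ballast"          // Deployment running pause-image pods
		factor     = 2                  // ballast pods per environment start/minute
	)

	for range time.Tick(30 * time.Second) {
		// Scale the ballast proportionally to the recent start rate; the
		// cluster autoscaler then adds nodes before real workspaces arrive.
		desired := int32(recentStartsPerMinute() * factor)

		scale, err := cs.AppsV1().Deployments(namespace).GetScale(context.Background(), deployment, metav1.GetOptions{})
		if err != nil {
			log.Printf("get scale: %v", err)
			continue
		}
		if scale.Spec.Replicas == desired {
			continue
		}
		scale.Spec.Replicas = desired
		if _, err := cs.AppsV1().Deployments(namespace).UpdateScale(context.Background(), deployment, scale, metav1.UpdateOptions{}); err != nil {
			log.Printf("update scale: %v", err)
		}
	}
}
```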

Image pull optimization: a tale of many attempts

Another crucial aspect of startup time optimization was improving image pull times. Workspace container images (i.e. all the tools available to a developer) can grow to more than 10 gigabytes uncompressed. Downloading and extracting this amount of data for every workspace considerably taxes a node’s resources, so we explored numerous strategies to speed up image pulls.

There is no one-size-fits-all solution for image caching, only a set of trade-offs with respect to complexity, cost, and the restrictions imposed on users (which images they can use). We found that homogeneity of workspace images is the most straightforward way to optimize startup times.
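
One strategy in that trade-off space is simply pre-pulling frequently used workspace images on every node. A minimal sketch using the containerd Go client could look like the following; the socket path, namespace and image list are assumptions:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Talk to the node's containerd, the same runtime the kubelet uses.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Kubernetes-managed images live in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Hypothetical list of commonly used workspace images to keep warm.
	images := []string{
		"docker.io/gitpod/workspace-full:latest",
	}
	for _, ref := range images {
		// WithPullUnpack also unpacks layers into the snapshotter, so a later
		// container start does not pay the extraction cost.
		img, err := client.Pull(ctx, ref, containerd.WithPullUnpack)
		if err != nil {
			log.Printf("pull %s: %v", ref, err)
			continue
		}
		log.Printf("pre-pulled %s", img.Name())
	}
}
```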

Networking complexities

Networking in Kubernetes introduced its own set of challenges, for example around sharing network bandwidth fairly between the environments on a node.

Security and isolation: balancing flexibility and protection

One of the most significant challenges we faced in our Kubernetes-based infrastructure was providing a secure environment while giving users the flexibility they need for development. Users want the ability to install additional tools (e.g., using apt-get install), run Docker, or even set up a Kubernetes cluster within their development environment. Balancing these requirements with robust security measures proved to be a complex undertaking.

The naive approach: root access

The simplest solution would be to give users root access to their containers. However, this approach quickly reveals its flaws: root inside the container is effectively root on the node, which undermines both the isolation between environments and the security of the host system.

Clearly, a more sophisticated approach was needed.

User namespaces: a more nuanced solution

To address these challenges, we turned to user namespaces, a Linux kernel feature that provides fine-grained control over the mapping of user and group IDs inside containers. This approach allows us to give users “root-like” privileges within their container without compromising the security of the host system.
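
The kernel primitive underneath is straightforward to demonstrate. Below is a sketch (not Gitpod’s actual component) that starts a process in a new user namespace where UID 0 inside maps to an unprivileged UID range on the host; the range is an assumption, and writing such a mapping typically requires privileges on the host:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell that believes it is root, inside a fresh user and mount
	// namespace. On the host, its UID 0 is actually 100000 (an assumed,
	// unprivileged range such as one configured in /etc/subuid).
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWNS,
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: 100000, Size: 65536},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: 100000, Size: 65536},
		},
	}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```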

While Kubernetes introduced support for user namespaces in version 1.25, we had already implemented our own solution starting with Kubernetes 1.22. Our implementation involved several complex components, including filesystem UID shifting (where we iterated through shiftfs, FUSE-based approaches and idmapped mounts) and syscall handling via seccomp notify.

Implementing this security model came with its own set of challenges.

The micro-VM experiment

As we grappled with the challenges of Kubernetes, we began exploring micro-VM (uVM) technologies like Firecracker, Cloud Hypervisor, and QEMU as a potential middle ground. This exploration was driven by the promise of improved resource isolation, compatibility with other workloads (e.g. Kubernetes) and security, while potentially maintaining some of the benefits of containerization.

The promise of micro-VMs

Micro-VMs offered several enticing benefits that aligned well with our goals for cloud development environments: stronger resource isolation, a harder security boundary, and the ability to run workloads such as Docker or even Kubernetes inside the environment.

Challenges with micro-VMs

However, our experiments with micro-VMs revealed several significant challenges.

Lessons from the uVM experiment

While micro-VMs didn’t ultimately become our primary infrastructure solution, the experiment provided valuable insights.

Kubernetes is immensely challenging as a development environment platform

As I mentioned at the beginning, for development environments we need a system that respects the uniquely stateful nature of development environments. We need to give the necessary permissions for developers to be productive, whilst ensuring secure boundaries. And we need to do all of this whilst keeping operational overhead low and not compromising security.

Today, achieving all of the above with Kubernetes is possible—but comes at a significant cost. We learned the difference between application and system workloads the hard way.

Kubernetes is incredible. It’s supported by an engaged and passionate community, which builds a truly rich ecosystem. If you’re running application workloads, Kubernetes continues to be a fine choice. However, for system workloads like development environments, Kubernetes presents immense challenges in both security and operational overhead. Micro-VMs and clear resource budgets help, but they make cost a more dominant factor.

So after many years of effectively reverse-engineering and forcing development environments onto the Kubernetes platform, we took a step back to think about what we believe a future development architecture needs to look like. In January 2024 we set out to build it. In October, we shipped it: Gitpod Flex.

More than six years of incredibly hard-won insights for running development environments securely at internet scale went into the architectural foundations.

The future of development environments

In Gitpod Flex, we carried over the foundational aspects of Kubernetes, such as the liberal application of control theory and declarative APIs, whilst simplifying the architecture and improving the security foundation.

We orchestrate development environments using a control plane heavily inspired by Kubernetes. We introduced some necessary abstraction layers that are specific to development environments and cast aside much of the infrastructure complexity that we didn’t need—all whilst putting zero-trust security first.


Security boundaries of Gitpod Flex.

This new architecture allows us to integrate devcontainers seamlessly. We also unlocked the ability to run development environments on your desktop. Now that we’re no longer carrying the heavy weight of the Kubernetes platform, Gitpod Flex can be deployed self-hosted in less than three minutes and in any number of regions, giving more fine-grained control over compliance and added flexibility when modeling organizational boundaries and domains.

When it comes to building a platform for standardized, automated and secure development environments, choose a system because it improves your developer experience, eases your operational burden and improves your bottom line. You are not choosing Kubernetes versus something else; you are choosing a system because it improves the experience for the teams you support.

UPDATE: Since publishing, we’ve received a lot of questions about what our post-Kubernetes architecture looks like. There is a lot to cover, so I recorded a deep-dive session. You can watch it here → (https://ona.com/events/gitpod-flex-demo)

- Thanks for the interest and kind words, Chris (blog author)
