It’s no secret that every service provider aims to avoid service interruptions at all costs.
And this isn’t just about unexpected failures — like hardware crashes or software bugs. What if you simply want to deploy a new version of your service?
How do you achieve zero downtime?
In theory, you’d keep the old server instance running alongside the new one for a short time. New connections would be routed to the new instance, while existing connections continue to drain on the old.
But here’s the catch: not every service is fronted by a load balancer that can seamlessly handle this transition. In those cases, ensuring uninterrupted service becomes a much harder problem.
In this post, we’ll explore the eBPF SO_REUSEPORT program type, which lets you implement custom load-balancing logic to decide which socket in a SO_REUSEPORT group should handle each incoming connection.
⚠️ Note: There are many useful resources that explain the socket and eBPF basics better than I can, so I’m only going to focus on preparing the groundwork for the code example.
First off, what is SO_REUSEPORT, and how does it work?
SO_REUSEPORT
Briefly, there are three main ways to configure your service to listen on a single host port:
Single listen socket, single application/worker process.
Single listen socket, multiple application/worker processes.
Multiple worker processes, each with its own listen socket (all bound to the same host port).
The last configuration is what you end up with if you configure your services to use SO_REUSEPORT. For more details, take a look at this post.
While this might seem really cool, by default you have no say in the scheduling: the kernel picks a socket using a hash of each connection’s source and destination addresses and ports, so incoming traffic is spread roughly evenly among the listening services.
However, it would be great if we could define our own logic for how and when packets are handed to each of the listening services — the way a dedicated load balancer lets us.
Imagine running both an old and a new version of your application — and wanting to gradually shift traffic to the new one. With fine-grained control, you could route only a subset of connections to the new instance until all clients are smoothly migrated.
While demonstrating this exact scenario can be a bit complex, I decided to walk you through a simpler example that’s easier to grasp — but still highlights the mechanics behind achieving zero-downtime restarts.
Hot Standby Load Balancing
In this example, we’ll deploy an eBPF program to control the traffic distribution between two services: a primary service and a standby service.
The goal is to have both services running side by side, sharing the same port via SO_REUSEPORT — but the standby service should only handle traffic when the primary service is down.
As mentioned above, SO_REUSEPORT spreads connections evenly across the group by default, meaning both services would get an equal share of the traffic.
So how can we override this behavior to ensure that the standby service only receives traffic when the primary service is unavailable?
The answer lies in eBPF, specifically using the SO_REUSEPORT program type.
By attaching an eBPF program to the sk_reuseport hook point, we can override the default load balancing and implement custom logic that controls which application instance each request is routed to.
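To give a flavor of what such a program looks like, here’s a hedged sketch of the hot-standby policy as a sk_reuseport eBPF program (the map and function names are illustrative, not the repository’s; userspace would register the primary socket at index 0 and the standby at index 1):

```c
// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Sockets registered by userspace: index 0 = primary, 1 = standby.
struct {
    __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} targets SEC(".maps");

SEC("sk_reuseport")
int hot_standby(struct sk_reuseport_md *ctx)
{
    __u32 primary = 0, standby = 1;

    // Try the primary first; bpf_sk_select_reuseport() returns 0
    // only if a usable socket sits at that map index.
    if (bpf_sk_select_reuseport(ctx, &targets, &primary, 0) == 0)
        return SK_PASS;

    // Primary is gone -- fall back to the standby socket.
    if (bpf_sk_select_reuseport(ctx, &targets, &standby, 0) == 0)
        return SK_PASS;

    return SK_DROP;
}

char _license[] SEC("license") = "GPL";
```

The key design point: selection fails cleanly when the primary’s socket has been removed from the map (e.g., its process exited), so failover to the standby needs no health-check machinery in userspace.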
⚠️ Note: There are many tools for deploying eBPF programs, but I prefer the ebpf-go framework, so don’t feel constrained by my choice.
I find code examples render tediously in Substack, so I’ll refer you to my GitHub repository with the code and test results.
Here’s the link.
As with my previous posts, I intentionally left the technical details in the code comments, which I believe makes the concepts slightly clearer.
I hope you find this resource as enlightening as I did. Stay tuned for more exciting developments and updates in the world of eBPF in next week's newsletter.
Until then, keep 🐝-ing!
Warm regards, Teodor