Introduction

I’ve been learning Rust recently after moving teams at work. My new team is primarily Rust focussed, which mirrors the general company direction: practically every new greenfield project that I’m aware of has been written in Rust, there has been heavy investment in the language and, by the looks of it, we’re going all in.

After speaking with a few of my colleagues, I noticed a general trend: most people who are really interested in Rust have done a fair amount of C++ in the past, or some sort of low-level systems programming in general. This is where I was completely lacking; I had very little context for what was going on from the ground up. By this point, you could say I was struggling to “get it” with Rust, and being thrown into the deep end with an async Rust project probably didn’t help. I knew that one way to greatly help myself was to take on a project of my own, where I could continuously fail without breaching any deadlines.

As for a project, I landed on a load balancer. The initial thought for this came from codingchallenges.fyi, but that lasted for about 5 minutes as I decided to “wing it” and build the project based on how I conceptually understood a load balancer to function, filling in any gaps or uncertainty using documentation from Cloudflare and others. I also thought it would be great to create a project that I could actually use. Granted, in its current form it would need a tonne of extra work, but sitting it behind a production-grade load balancer (e.g. nginx) would work nicely.

Now for the toughest part of the project: naming it. I decided on gruglb, inspired by the grug brained developer post, largely because I had a strong belief that it would turn out incredibly badly, with terrible performance; however, all in all it is actually half-way decent considering I had absolutely no idea what I was doing in Rust to begin with. So I started off, pausing to re-read particular chapters from the Book for any pieces of Rust-specific knowledge that I had forgotten.

I can’t believe it, it actually works!

Starting off, I had grand ideas of using tokio for an asynchronous load balancer. From a design/conceptual perspective this makes sense: handling connections asynchronously is great for performance, as the load balancer can do other work whilst connection handling is awaited. In reality, this didn’t work out. I got confused and couldn’t get things working, largely because things didn’t work the way I thought they did; I had a major gap in my understanding.

To fix this, I started in the simplest way I knew: fall back to the standard library. Performance doesn’t matter for a personal project and there is no need to do anything fancy; aiming for crazy performance could come later, after much refactoring. This worked out nicely. By starting simple, I had a general concept of how I wanted it to work: accept a single argument, --config, pointing at a configuration file which defines a number of targets to route to. Introducing clap and using mostly the standard library got the project off to a great start. Using multiple threads for handling listeners and incoming connections made sure the load balancer actually did what it set out to do: bind to a port, accept incoming connections, and route them to the specified backends.
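To give a flavour of that entrypoint, here is a minimal sketch using clap’s derive feature; the Cli struct, its field, and the parsing step are illustrative rather than the exact gruglb code:

use clap::Parser;
use std::path::PathBuf;

// Illustrative CLI sketch, not the exact gruglb code.
#[derive(Parser)]
struct Cli {
    /// Path to the configuration file defining targets and their backends.
    #[arg(long)]
    config: PathBuf,
}

fn main() {
    let cli = Cli::parse();
    let contents = std::fs::read_to_string(&cli.config).expect("unable to read config file");
    // Deserialising `contents` into a struct of targets would happen here.
    println!("loaded {} bytes of config", contents.len());
}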

The design behind this was to use the constructs of a Target and a Backend, which look like the following:

pub struct Target {
    /// Port that the load balancer listens on for this target.
    pub listener: u16,
    /// Backend servers that traffic for this target is routed to.
    pub backends: Option<Vec<Backend>>,
}

pub struct Backend {
    pub host: String,
    pub port: String,
    /// Path used when health checking this backend.
    pub healthcheck_path: String,
}

A Target acts as an encapsulating concept for metadata around a labelled group of servers, with the actual addresses encoded within the Backend struct, each representing a backend server that will have traffic routed to it. This is still in place in the current load balancer implementation. To maintain the active lifecycle of targets, we need health checks and a structure to hold the targets to route to; I opted for a HashMap, with channels as the means of communicating the success and failure of backends. No doubt I made a very poor design choice in the initial implementation, as the entire HashMap was sent through the Sender channel itself and a replacement was forced whenever a message was received. The initial implementation came together here.
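To illustrate the shape of that initial design (with placeholder key and value types rather than the real gruglb ones): the health check side builds a fresh map of healthy backends and sends the whole thing, and the routing side swaps its copy wholesale on every message.

use std::collections::HashMap;
use std::sync::mpsc;

// A rough sketch of the original approach; types and names are placeholders.
fn sketch() {
    let (tx, rx) = mpsc::channel::<HashMap<String, Vec<String>>>();

    // Health check side: build the latest view and send the entire map.
    let healthy = HashMap::from([("web".to_string(), vec!["127.0.0.1:8081".to_string()])]);
    tx.send(healthy).unwrap();

    // Routing side: throw away the previous map and swap in the new one
    // whenever an update arrives.
    let current = rx.recv().expect("health check channel closed");
    println!("{} healthy targets", current.len());
}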

It’s incredibly slow…

As I started off simple, this meant using std::thread: spawning a thread for each listener and a new thread to handle each connection. An issue that presented itself relatively quickly was that, without an asynchronous runtime, the reqwest library had to use its blocking feature, making it synchronous. Unfortunately, this meant that performance under any sort of load was abysmal, as blocking connections continually built up until the load balancer could no longer accept any more incoming connections.

At most, this topped out at around 35 requests per second before grinding to a halt entirely.
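For illustration, the synchronous model looked roughly like the following sketch; the backend address is a placeholder and writing the response back to the client is elided, so this is not the exact gruglb code.

use std::net::TcpListener;
use std::thread;

// Sketch of the thread-per-connection model.
fn run_listener(port: u16) -> std::io::Result<()> {
    let listener = TcpListener::bind(("0.0.0.0", port))?;
    for stream in listener.incoming() {
        let _stream = stream?;
        thread::spawn(move || {
            // reqwest's blocking client holds this thread until the backend
            // responds, which is what caused connections to pile up under load.
            if let Ok(resp) = reqwest::blocking::get("http://127.0.0.1:8081/") {
                let _body = resp.text().unwrap_or_default();
                // Writing the body back over the accepted stream is elided.
            }
        });
    }
    Ok(())
}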

Performance improvements: a life of its own

tokio

When I first started this project, I began with the idea of using tokio, but this quickly became overwhelming whilst trying to learn Rust itself, let alone the async ecosystem on top.

After completing my initial goal for the project using primarily the standard library, this no longer seemed like a crazy feat. So I started swapping everything that made sense into its asynchronous tokio equivalent. This gave a reasonable performance increase, which makes sense considering that the large majority of time spent on the network is waiting for responses; flipping to an asynchronous model means other tasks can make progress whilst a response is awaited.
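As a rough sketch of the async equivalent (with the address and the proxying itself as placeholders), the accept loop becomes a tokio task per connection:

use tokio::net::TcpListener;

// Async accept loop sketch; not the exact gruglb code.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080").await?;
    loop {
        let (socket, _addr) = listener.accept().await?;
        tokio::spawn(async move {
            // Awaiting the backend response here yields this task, letting the
            // runtime make progress on other connections in the meantime.
            drop(socket);
        });
    }
}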

Health check enums

A large point of contention, and a major drain on performance, was my early decision to pass the state of healthy targets around as a whole HashMap. At the time this made sense and it worked; I still believe that passing updates through channels was correct here, but sending an entire HashMap of healthy targets was not.

I found that this could be represented in a much better way with an enum:

pub struct CheckState {
    pub target_name: String,
    pub backend: Backend,
}

pub enum Health {
    /// The backend passed its health check.
    Success(CheckState),
    /// The backend failed its health check.
    Failure(CheckState),
}

With this, the health check process receives a batch of results as a Vec<Health> and records which backends were healthy or unhealthy after each interval.
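A sketch of how such a batch might be applied, using the structs above; the map type and the de-duplication details are simplified rather than taken from gruglb:

use std::collections::HashMap;

// Apply a batch of health check results to the map of healthy backends.
fn apply_results(results: Vec<Health>, healthy: &mut HashMap<String, Vec<Backend>>) {
    for result in results {
        match result {
            Health::Success(state) => {
                // Mark the backend as routable for its target.
                healthy.entry(state.target_name).or_default().push(state.backend);
            }
            Health::Failure(state) => {
                // Drop the failing backend from the target's routable set.
                if let Some(backends) = healthy.get_mut(&state.target_name) {
                    backends.retain(|b| !(b.host == state.backend.host && b.port == state.backend.port));
                }
            }
        }
    }
}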

At the same time, I also came across concurrent hashmaps, namely DashMap, which acted as a drop-in replacement for my original use of Arc<Mutex<HashMap<T>>>.

More information on this can be gleaned from this pull request: https://github.com/jdockerty/gruglb/pull/14
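As a rough illustration of the DashMap swap (the key and value types here are placeholders), the explicit outer lock disappears:

use dashmap::DashMap;
use std::sync::Arc;

// Sketch of replacing Arc<Mutex<HashMap<..>>> with DashMap.
fn sketch() {
    let healthy: Arc<DashMap<String, Vec<String>>> = Arc::new(DashMap::new());

    // Inserts and updates no longer require locking an outer Mutex.
    healthy.insert("web".to_string(), vec!["127.0.0.1:8081".to_string()]);

    // Reads take a short-lived shard guard internally rather than a global lock.
    if let Some(backends) = healthy.get("web") {
        println!("{} backends for web", backends.len());
    }
}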

Scoping a RwLock for enormous gains

The single largest performance increase came from realising that there was major lock contention over the lifetime of a connection and understanding a footgun that I had introduced.

This can be better understood by seeing a trimmed-down version of the original code:

let mut idx = routing_idx.write().await; // write guard acquired here
if *idx >= backend_count {
    *idx = 0;
}
let http_backend = format!(...);
*idx += 1;
match method.as_str() {
    // connection handling here, whilst the write guard is still held
}

The routing_idx was my incredibly naive implementation of a ring buffer, whereby the index “wraps around” once it exceeds the length bound - used to do basic round-robin load balancing. The problem lies with the idx variable: it holds an RwLock write guard, and the lock is only released when that guard is dropped at the end of its scope. However, this function was a little too eager, in that the connection handling also happens within that same scope.

What does this mean? The write lock was held across the entire connection handling. This still allowed a few hundred requests per second to be handled, but once the lock was scoped down, throughput skyrocketed.
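The shape of the fix is to confine the write guard to the smallest possible scope, so it is dropped before any connection handling begins. A sketch of that idea (the backend list, function name, and types here are illustrative, not the exact code from the pull request):

use std::sync::Arc;
use tokio::sync::RwLock;

// Pick the next backend in round-robin order, holding the write lock only
// for the index update.
async fn next_backend(routing_idx: Arc<RwLock<usize>>, backends: &[String]) -> String {
    let chosen = {
        let mut idx = routing_idx.write().await;
        if *idx >= backends.len() {
            *idx = 0;
        }
        let current = *idx;
        *idx += 1;
        current
        // The write guard is dropped here, at the end of this block.
    };
    // Connection handling now happens without the lock held.
    backends[chosen].clone()
}

With the lock released before connection handling, the benchmark numbers looked like this: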

bombardier -l http://127.0.0.1:8080
Bombarding http://127.0.0.1:8080 for 10s using 125 connection(s)
[========================================================================================] 10s
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec     31758.79    2860.77   37445.29
  Latency        3.93ms   393.27us    21.97ms

The full code for this change is available here: https://github.com/jdockerty/gruglb/pull/18

NOTE: Reading over this later and from doing more with Rust, I think this RwLock can be entirely removed by using an Atomic value instead.
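For completeness, a sketch of that idea, again with illustrative names and types: an AtomicUsize counter removes the lock entirely.

use std::sync::atomic::{AtomicUsize, Ordering};

// Round-robin selection without any lock: fetch_add hands out an ever
// increasing counter and the modulo wraps it onto the backend list.
fn next_backend_atomic(routing_idx: &AtomicUsize, backends: &[String]) -> String {
    let idx = routing_idx.fetch_add(1, Ordering::Relaxed) % backends.len();
    backends[idx].clone()
}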

Conclusion

This project has been exceptionally useful for me, both in learning the language and in becoming more interested in distributed systems as a whole - I’ve recently delved further into this by writing my own distributed key-value store through the PingCAP talent plan course. It has exposed me to a variety of concepts in Rust, but I believe the most useful has been concurrency control: understanding more about Mutex, RwLock, and DashMap (a sharded, concurrent hashmap, which is fascinating).