
A Ruby Performance Experiment for the Modern Cloud - AWS Recon

24 July 2020

Is Ruby too slow to be taken seriously in modern, fast-paced, enterprise-scale cloud environments? According to Twitter, StackOverflow, and all the cool kids except this one, the answer seems to be “Yes!” Surely the right tool these days would be Node, Go, Rust, or even Python. Nothing fast is built with Ruby, right? Let’s find out…

Background

The majority of our clients run on either AWS or Google Cloud. Google Cloud provides the ability to export a Cloud Asset Inventory (CAI) with every resource in the organization presented in a single structured JSON collection. Having a predictable and consistent inventory makes a lot of downstream activities (detecting misconfigurations, monitoring for compliance, or examining security posture to name a few) much more feasible.

Unfortunately, AWS doesn’t provide a similar inventory. You may be thinking AWS Config does exactly this, but that’s not quite the case. AWS Config does track resources, but as of this writing, it only tracks 86 of the many hundreds of different AWS resource types. For the resources it does track, it offers some unique capabilities, like enforcing compliance rules and triggering auto-remediation to fix certain issues. However, Config has historically lagged well behind on supporting many services and their resources. Even today, EKS and ECS resources are not tracked by AWS Config. This deficiency makes AWS Config a non-starter for us.

Goals

We set out to solve the problem of getting a full AWS asset inventory in a way that would support workflow automation and integrate with our other tooling. When we started, we thought it would turn out to be a solved problem. Surely there was an open-source package that could help with this? It turns out there are a few tools that come close - some open-source and even a few commercial configuration management database tools. Everything we found was either much too complicated to integrate into our workflow, or simply didn’t have the comprehensive resource coverage we needed. One of our favorite cloud audit tools, CloudSploit, was the closest to meeting our needs, but it didn’t cover all of the AWS services we needed and was tricky to debug and extend. It was clear we were going to have to build our own.

Our goals for the project were pretty straightforward: a complete resource inventory across every region and service we use, collected into a single, consistent JSON structure that the rest of our tooling could consume.

Actually, multi-threading wasn’t a goal initially. That surfaced later for performance reasons that will become evident shortly.

Approach

When we’re building prototype tooling, we tend to reach for scripting languages like Node.js, Ruby, and Python first. Scripting languages are easier to iterate on quickly. If performance becomes a limiting factor, we revisit the language choice later.

First Attempt

Since we anticipated that querying dozens of API endpoints in various regions all over the world would be slow due to Internet connection latency, we decided to reach first for Node.js. The asynchronous nature of Node.js seemed like a good fit for this situation. As an added benefit, Node.js tends to be fast and excels at handling JSON, which is what we’d be dealing with when interacting with AWS APIs.

$ node recon.js 

Finished in 5 seconds. Saving resources to output.json.

Initial results looked promising on a small account. However, once we tested on larger accounts, it quickly became apparent that we would need flexible retry logic, rate limiting, and support for response paging to handle large result sets.

Some AWS calls provide paged responses to limit the amount of data returned with each response. A page of data represents up to 1,000 items.

The AWS JavaScript SDK does support paging, but it quickly becomes cumbersome to deal with page requests inside of callbacks while also making multiple follow-on nested API requests.

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

s3.listObjects({ Bucket: 'bucket' }).on('success', function handlePage(response) {
    // do something with response.data
    if (response.hasNextPage()) {
        // manually request the next page and re-register this same handler
        response.nextPage().on('success', handlePage).send();
    }
}).send();

One of the challenging aspects of working with the AWS APIs is the lack of consistency across different services. For some services (e.g. EC2), the API returns all of the resources along with most of their associated attributes. For others (e.g. S3), the API returns a simple list of resource (bucket) names, which you then have to use as parameters in follow-on API requests to retrieve the detailed attributes. Some cases even require chaining API requests three levels deep.
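For example, pulling S3 details means listing bucket names first and then making one or more follow-on calls per bucket. Here is a minimal sketch of that pattern using the Ruby SDK (not code from the actual tool; the specific detail call is just illustrative):

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new

# The first call returns only the bucket names...
s3.list_buckets.buckets.each do |bucket|
  # ...so each detail attribute requires a follow-on request per bucket
  location = s3.get_bucket_location(bucket: bucket.name).location_constraint
  puts "#{bucket.name}: #{location}"
  # (repeat for tagging, encryption, policy, logging, versioning, ...)
end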

It was at this point that we started to weigh the speed benefits of a tool written in Node.js against the complexity of managing lots of nested JavaScript callbacks. We decided to take a look at the aforementioned CloudSploit to get a frame of reference for the performance we could expect from a Node.js collection tool. Though CloudSploit doesn’t support all of the resources we need, it does cover a significant number of AWS services.

On a small test account, we can collect the inventory across all regions globally in under a minute.

$ time node index.js 
INFO: Determining API calls to make...
INFO: API calls determined.
INFO: Collecting metadata. This may take several minutes...
...
real	0m46.520s
user	0m6.391s
sys	0m0.634s

On a larger account with ~20,000 resources, collection takes a little over two minutes.

$ time node index.js 
INFO: Determining API calls to make...
INFO: API calls determined.
INFO: Collecting metadata. This may take several minutes...
...
real	2m4.239s
user	0m14.402s
sys	0m1.424s

So now we had a performance goal to aim for, but by this point, we had decided not to build this tool with Node.js.

Second Attempt

We had originally planned to write this tool in Ruby, our language of choice. However, our hunch was that overall speed was important enough to warrant taking a different approach, which is why we tested the waters with Node.js first. We tend to prefer Ruby for custom tooling because it’s just plain easy to work with. Lots of people are quick to point to Ruby’s lackluster performance, but in many cases the problem isn’t the language itself, but rather poor solution design.

Somewhat surprisingly, the AWS Ruby SDK has automatic retry logic and response paging built in. This was a welcome change from the JavaScript SDK and made prototyping the Ruby version of the tool much faster.

require 'aws-sdk-s3'

s3 = Aws::S3::Client.new

# Iterating the response enumerates every page, fetching new pages as needed
s3.list_objects({ bucket: 'aws-sdk' }).each do |response|
  puts response.contents.map(&:key)
end

Notice how there is no next_page method? As long as you use the built-in enumerator in the response object, the client automatically calls the next page for you if needed. It’s also trivial to implement a progressive retry backoff to stay within API rate limits.

s3 = Aws::S3::Client.new({
  retry_limit: 5,
  # sleep a little longer after each successive failed attempt
  retry_backoff: ->(context) { sleep(5 * context.retries + 1) }
})

There is an adaptive retry mode in the latest version of the AWS Ruby SDK, but our experience was more consistent using the legacy mode shown above.
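For comparison, opting into the newer behavior is just a different pair of client options. A minimal sketch, using the SDK’s documented retry_mode and max_attempts options rather than the configuration we actually shipped:

# Let the SDK manage retry backoff and client-side rate limiting adaptively
s3 = Aws::S3::Client.new({
  retry_mode: 'adaptive',
  max_attempts: 5
})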

Performance

We convinced ourselves that the performance trade-off in switching to Ruby was worth the improved development experience for our team. We were about to see just how much of a trade-off we were making. During development, we tested the functionality of each AWS service module as we wrote it. Once we had built out collection modules for about 50 AWS services, we started testing them in aggregate, across multiple regions.

$ ./recon.rb

Finished in 1464 seconds. Saving resources to output.json.

Ouch. This was the first test on a moderately sized account. Using the Ruby tool, we can collect ~20,000 resources in about 24 minutes, compared to just over 2 minutes for a comparable Node.js tool. Ruby haters, rejoice: 12x slower performance seems to prove them right. Note that the Ruby tool did have more AWS service coverage than our reference tool at this point, but nowhere near enough to account for a 12x difference.

Multi-threading

Luckily, we have not exhausted all of our options just yet. Let’s see if we can leverage some basic parallelization to improve our performance situation. According to the Parallel gem, we can:

Run any code in parallel processes (use multiple CPUs) or threads (speedup blocking operations).

Since we’re not CPU bound, we probably don’t need parallel processes. But since we’re I/O bound, parallel threads may help significantly. We can make a simple, one-line change to our code to try to parallelize the requests.

# Serial version: collect each service for the current region, one at a time
services.each do |service|
  collect(service, @region)
end

We simply wrap the main loop in a Parallel enumerable.

require 'parallel'

# Same loop, fanned out across a pool of worker threads
Parallel.map(services.each, in_threads: num_threads) do |service|
  collect(service, @region)
end

Now we can run the same test from earlier, but with 8 threads instead of a single thread.

$ ./recon.rb -t8

Finished in 756 seconds. Saving resources to output.json.

We cut our processing time almost exactly in half. Much better, but still a long way off from our Node.js tests. Let’s bump up the thread limit even higher.

$ ./recon.rb -t64

Finished in 115 seconds. Saving resources to output.json.

Whoa! We’ve managed to match the performance of our Node.js benchmark with Ruby! In this case, multi-threading yields a massive performance improvement. This makes sense since we are making over 1,000 API calls spread out over 16 regions in various parts of the world. Network latency makes the response times of those requests vary wildly from one to the next. This type of I/O bound activity is a perfect case for a multi-threaded client, where each thread’s performance stays isolated from the rest (up to the max thread limit).

More threads are always better, right? Why not use even more threads for even better performance? Well, yes and no. On the surface, doing more things in parallel should be faster, but multi-threading immediately brings more complexity into the tool - multiple threads are harder to manage and troubleshoot when things aren’t working as expected. Luckily, the Parallel gem handles a lot of this for us. But then there’s the provider API (AWS in this case). The APIs understandably all have quotas.

AWS uses the token bucket algorithm to implement API throttling. With this algorithm, your account has a bucket that holds a specific number of tokens. The number of tokens in the bucket represents your throttling limit at any given second.

Overwhelming the APIs will quickly lead to rate limiting and throttling, which would ultimately slow down our client as we wait for request quotas to refresh.
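If the token bucket idea is unfamiliar, a toy version in Ruby (our own illustration, not AWS’s actual implementation) shows the behavior:

# Toy token bucket: tokens refill at a fixed rate and each request spends one
class TokenBucket
  def initialize(capacity:, refill_per_second:)
    @capacity = capacity
    @tokens = capacity
    @refill_per_second = refill_per_second
    @last_refill = Time.now
  end

  # Returns true if the request may proceed, false if it should be throttled
  def allow_request?
    refill
    return false if @tokens < 1

    @tokens -= 1
    true
  end

  private

  def refill
    now = Time.now
    @tokens = [@capacity, @tokens + (now - @last_refill) * @refill_per_second].min
    @last_refill = now
  end
end

Once the bucket is empty, every additional request is refused until tokens trickle back in, so piling on even more threads past that point mostly turns into waiting.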

At the end of this experiment, we were pleasantly surprised that we were able to get comparable performance out of our Ruby client when compared to a Node.js implementation. The fastest code in the world doesn’t make a difference if you’re waiting for I/O the majority of the time.

Uses

So, what can you do with a full inventory of your AWS account resources parsed into a single, consistent JSON structure? You could feed it into the downstream activities we mentioned earlier: detecting misconfigurations, monitoring for compliance, or examining your security posture, to name a few.

There are a lot of challenges in the cloud-native world that are made harder by the lack of asset inventory and visibility. When you have a consistent and predictable asset inventory, you can focus on solving some of the more interesting problems in the cloud landscape.

Try It

We’ve released AWS Recon as an open-source tool to help with AWS inventory collection. Hopefully others will find it as useful as we do.