Testing TLS Cipher Performance

Last updated on August 17, 2018

As part of my investigation of TLS performance, I decided to benchmark various ciphers and hashing algorithms on my dev server. My dev machine is a Xeon E3-1220 v2 with 8GB of RAM. For these tests I set the CPU governor to performance to ensure I wasn’t seeing effects from SpeedStep throttling the CPU up or down.

The short of it is that I was seeing significantly higher baseline CPU load after enabling H2 on my VPS than I expected: up from 0.5% to 2-3%. AWS t2.micro instances are burstable configurations designed to operate at a baseline CPU load of 10%, so going from <1% to ~3% was pretty significant. Not a deal killer, but with no change in traffic, that increase in compute load would dramatically decrease the headroom I had to grow before I had to consider a higher tier instance.
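To put a number on that lost headroom, here is a rough sketch of the arithmetic. It assumes CPU load scales linearly with traffic, which is a simplification of mine and not something I measured; the function name is just for illustration.

```python
# Rough headroom estimate for a burstable instance.
# Assumption: CPU load grows linearly with traffic (a simplification).
BASELINE_PCT = 10.0  # t2.micro baseline CPU share mentioned above


def growth_headroom(load_pct: float, baseline_pct: float = BASELINE_PCT) -> float:
    """Return how many times traffic could grow before hitting the baseline."""
    return baseline_pct / load_pct


if __name__ == "__main__":
    for load in (0.5, 2.0, 3.0):
        print(f"{load:.1f}% load -> about {growth_headroom(load):.1f}x room to grow")
```

Under that assumption, 0.5% load leaves room for roughly a 20x increase in traffic, while 3% load leaves only a little over 3x.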

I appear to have resolved the production problem by applying a simple principle: encryption strength is proportional to computational complexity, so if there’s a lot of computational load, turning down the encryption strength may improve performance. What I didn’t do was much in the way of actual controlled testing to see if my premise was reliable.

My initial configuration was the one recommended on this page.

ssl_ciphers 'EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH';

Which is shorthand that expands to the following.

ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA
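If you want to see how a cipher string expands on your own machine, Python’s standard ssl module can ask the local OpenSSL library. This is a minimal sketch; the helper name is mine, and the output depends on the OpenSSL build Python is linked against, so it may not match the list above exactly (newer builds often drop the DSS suites, for example).

```python
import ssl


def expand_cipher_string(spec: str) -> list[str]:
    """Return the suite names the local OpenSSL selects for a cipher string."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.set_ciphers(spec)  # raises ssl.SSLError if nothing matches
    # TLS 1.3 suites (names starting with "TLS_") are not governed by this
    # string, so filter them out to see only what the string itself selected.
    return [c["name"] for c in ctx.get_ciphers() if not c["name"].startswith("TLS_")]


if __name__ == "__main__":
    spec = "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH"
    for name in expand_cipher_string(spec):
        print(name)
```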

The Mozilla foundation also has a really nice SSL/TLS configuration generator. I also tried their intermediate cipher suite, before really thinking about the problem. It follows below.

ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256

In both cases, AES256 is strongly preferred as the bulk encryption algorithm.
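One way to check that claim is to pull the bulk cipher out of each suite name and list them in order of first appearance. The sketch below is a naive parser of OpenSSL-style names that I wrote for these two lists only; it treats any AES suite without GCM in its name as CBC, which holds for the suites here but is not a general rule.

```python
INITIAL = (
    "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:"
    "ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:"
    "DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:"
    "DHE-DSS-AES128-GCM-SHA256:DHE-RSA-AES128-GCM-SHA256:"
    "ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:"
    "ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA"
)
MOZILLA = (
    "ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:"
    "ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:"
    "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:"
    "ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA384:"
    "ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA256"
)


def bulk_cipher(suite: str) -> str:
    """Extract the bulk cipher and mode from an OpenSSL-style suite name."""
    parts = suite.split("-")
    for i, part in enumerate(parts):
        if part.startswith("AES"):
            # Assumption: an AES suite without an explicit GCM token is CBC.
            is_gcm = i + 1 < len(parts) and parts[i + 1] == "GCM"
            return f"{part}-{'GCM' if is_gcm else 'CBC'}"
        if part == "CHACHA20":
            return "CHACHA20-POLY1305"
    return "unknown"


def preference_order(spec: str) -> list[str]:
    """List bulk ciphers in the order they first appear in the suite list."""
    return list(dict.fromkeys(bulk_cipher(s) for s in spec.split(":")))


if __name__ == "__main__":
    print("initial:", preference_order(INITIAL))
    print("mozilla:", preference_order(MOZILLA))
```

Both lists put AES256-GCM first, which is why a client that supports it will always end up negotiating the 256-bit cipher.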

There are a number of places along the process where tradeoffs are made, though in some cases they don’t appear to matter much or are simply outweighed by other factors.

For example, take key exchange algorithms.

RSA is really fast, especially with smallish keys of 2048 bits or fewer (though keys shorter than 2048 bits are considered weak and keys shorter than 1024 bits are considered insecure). But it doesn’t support perfect forward secrecy (PFS). ECDHE is slightly slower, but it supports PFS. DHE is considerably slower than either RSA or ECDHE, and supports PFS.

However, the handshake generally only happens once per connection. Moreover, many servers support TLS session caches and TLS session tickets, both of which allow clients that have already undergone the handshake process to use an abridged handshake to request further data from the server.

When the handshake only happens once, it’s not crucially important to use the absolute fastest algorithm. However, ECDHE yields both good performance and a desirable feature in PFS, and should probably be preferred even if it’s a few percent slower than small-keyed RSA.

Note: I’m having a hard time finding more recent benchmarks on key exchanges. If I find one, I may have to revise these conclusions.

Far more time is spent in the bulk encryption; that’s what’s used to encrypt all the data being transferred. Lots of data means a lot of encryption, which means a lot more time spent by the CPU encrypting.

There are four parameters here that I want to look at.

Hardware acceleration for AES
Cipher mode (GCM or CBC)
Cipher strength (128 or 256)
Hashing algorithm

I’ll start with hardware acceleration.

Most modern CPUs from AMD and Intel support AES-NI instructions, which accelerate the AES processing. The performance advantages for having AES-NI should not be underestimated. On my Xeon E3-1220 v2, running the OpenSSL benchmark turning off AES-NI less than halv

type          16 bytes    64 bytes     256 bytes    1024 bytes   8192 bytes
aes-ni-off    88047.90k   101570.11k   238936.75k   254754.47k   258738.86k
aes-ni        339001.88k  878209.19k   1210872.32k  1331487.74k  1365303.30k
              385%        864%         507%         523%         528%

aes-ni (AWS)  372152.35k  877413.76k   1806772.76k  2260340.33k  2612296.03k
              423%        864%         756%         887%         971%

For workloads I think we’d expect in an SSL system, that is blocks 1KB or bigger, AES-NI improves AES-128 performance by between 5 and 10 times.
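For anyone who wants to reproduce this kind of comparison, here is a sketch of how the two runs might be driven from Python. I’m assuming the openssl binary is on the PATH and that the EVP interface is used via openssl speed -evp; the exact commands behind the table above aren’t recorded here, so treat this as one plausible way to do it. The mask value is the one given in OpenSSL’s OPENSSL_ia32cap documentation for hiding the AES-NI and PCLMULQDQ capability bits; the helper names are mine.

```python
import os
import subprocess

# Capability mask from the OPENSSL_ia32cap documentation: clears the
# AES-NI and PCLMULQDQ bits so OpenSSL falls back to software AES.
NO_AESNI_MASK = "~0x200000200000000"


def speed_command(cipher: str) -> list[str]:
    """Build the openssl speed invocation for one EVP cipher."""
    return ["openssl", "speed", "-evp", cipher]


def speed_env(aesni: bool) -> dict[str, str]:
    """Return the environment for the run, masking AES-NI if requested."""
    env = dict(os.environ)
    if not aesni:
        env["OPENSSL_ia32cap"] = NO_AESNI_MASK
    return env


def run_speed(cipher: str, aesni: bool = True) -> str:
    """Run the benchmark and return the raw text report from openssl."""
    result = subprocess.run(
        speed_command(cipher),
        env=speed_env(aesni),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_speed("aes-128-gcm", aesni=True))
    print(run_speed("aes-128-gcm", aesni=False))
```

Each run takes a while, since openssl speed times every block size for a few seconds.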

The other big takeaway here is that either having 30MB of L3 cache or Haswell’s microarchitecture makes a big difference. Of course, my AWS instances run on Amazon’s custom Xeons, which may have more special sauce in them that’s not readily obvious.

Moving on to GCM v. CBC, GCM is also a pretty substantial win at realistic work sizes.

type          16 bytes    64 bytes     256 bytes    1024 bytes   8192 bytes
aes-128-cbc   634449.49k  676019.24k   686977.62k   689722.03k   690656.60k
aes-128-gcm   339001.88k  878209.19k   1210872.32k  1331487.74k  1365303.30k
              53%         130%         176%         193%         198%

While CBC is more consistent, GCM does way better even just past 64 bytes. At blocks that are the size of small buffers (or what look like they might be reasonable for buffers), the performance is almost twofold better.

Then there’s the big question, AES128 v. AES256.

type          16 bytes    64 bytes     256 bytes    1024 bytes    8192 bytes
aes-128-gcm   336307.95k  880610.24k   1209631.32k  1330737.83k   1365079.38k
aes-256-gcm   279421.27k  758233.43k   1080570.45k  1182194.69k   1210840.41k
              83%         86%          89%          89%           89%

So AES256 is ~11% slower than AES128.
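The ~11% figure comes straight from the table. As a quick check, this snippet recomputes the slowdown per block size from the numbers above (throughput in thousands of bytes per second, as reported by openssl speed); the variable and function names are just for illustration.

```python
BLOCK_SIZES = [16, 64, 256, 1024, 8192]
AES128_GCM = [336307.95, 880610.24, 1209631.32, 1330737.83, 1365079.38]
AES256_GCM = [279421.27, 758233.43, 1080570.45, 1182194.69, 1210840.41]


def slowdown_percent(fast: float, slow: float) -> float:
    """Percent of throughput lost when moving from the fast to the slow cipher."""
    return (1.0 - slow / fast) * 100.0


if __name__ == "__main__":
    for size, a128, a256 in zip(BLOCK_SIZES, AES128_GCM, AES256_GCM):
        print(f"{size:>5} bytes: AES256 is {slowdown_percent(a128, a256):.1f}% slower")
```

The gap is larger at the smallest block sizes and settles at about 11% from 256 bytes upward, which is the range that matters for bulk transfer.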

Finally the hashing algorithm. On my 64-bit server I get the following results.

type          16 bytes    64 bytes     256 bytes    1024 bytes    8192 bytes
sha256        65269.04k   146854.27k   257154.99k   319237.31k    340817.24k
sha384        38362.10k   153405.99k   270212.10k   407485.22k    477689.17k
              56%         104%         105%         128%          140%

Interestingly, SHA512 (which SHA384 is derived from) is faster on 64-bit machines than SHA256. This post suggests that SHA512 is faster than SHA256 because it processes twice as much data per block (128 bytes versus 64), even though each block takes a few more rounds (80 versus 64). It’s also worth pointing out that SHA384 is immune to certain attacks that work against SHA512, namely length-extension attacks.
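You can get a rough feel for this on your own hardware with Python’s hashlib, which calls into OpenSSL for these hashes on most builds. This is only a sketch: the function name and the 16MB default are arbitrary choices of mine, and the numbers will not match openssl speed exactly. On CPUs with dedicated SHA-256 instructions the ordering may well flip.

```python
import hashlib
import time


def throughput_mb_s(name: str, block_size: int = 8192,
                    total_bytes: int = 16 * 1024 * 1024) -> float:
    """Hash total_bytes in block_size chunks and return MB/s."""
    block = b"\x00" * block_size
    rounds = total_bytes // block_size
    h = hashlib.new(name)
    start = time.perf_counter()
    for _ in range(rounds):
        h.update(block)
    h.digest()
    elapsed = time.perf_counter() - start
    return rounds * block_size / elapsed / 1e6


if __name__ == "__main__":
    for algo in ("sha256", "sha384", "sha512"):
        print(f"{algo}: {throughput_mb_s(algo):.0f} MB/s")
```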

So what does this all tell us?

Well…..

It does show that AES256 is slower than AES128 by around 11%, though that doesn’t explain why I would see a 500% increase in CPU load when it was running.

AES-NI also accelerates AES256; in fact, it seems to do even better with AES256 than it does with AES128 in terms of the performance boost. So that shouldn’t have been a factor. Moreover, I see a 5.4x speedup for AES-NI. An 11% performance difference times 5.4x if AES-NI isn’t running would give me a 60% performance difference, not a 500% one.
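Spelled out, the back-of-the-envelope comparison looks like this. It uses only the figures quoted in this post (0.5% to about 3% CPU load, an 11% AES256 penalty, and a roughly 5.4x AES-NI speedup); multiplying the penalty by the speedup is my crude worst-case bound, not a measured result.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


AES256_PENALTY = 0.11   # AES256 vs AES128 slowdown from the benchmark table
AESNI_SPEEDUP = 5.4     # approximate AES-NI speedup quoted above

# Observed: baseline CPU load went from 0.5% to roughly 3%.
observed = percent_increase(0.5, 3.0)

# Crude worst case: the AES256 penalty scaled up as if AES-NI were unavailable.
worst_case = AES256_PENALTY * AESNI_SPEEDUP * 100.0

if __name__ == "__main__":
    print(f"observed increase:   {observed:.0f}%")
    print(f"worst-case estimate: {worst_case:.0f}%")
```

Even the pessimistic estimate is almost an order of magnitude short of what I observed.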

Given the benchmark data, SHA2 hashes would appear to be the bigger limiting factor, but SHA384 is faster than SHA256, so that doesn’t account for the performance problems either.

In short, I’m stumped. I know the performance was related to real traffic, as the increase in CPU load directly corresponded to network traffic, so it wasn’t some runaway process doing something in the background. Moreover, that network traffic corresponded to my typical daily traffic curves, so I think it was legitimate traffic.