As part of my investigation of TLS performance, I decided to benchmark various ciphers and hashing algorithms on my dev server. My dev machine is a Xeon E3-1220 v2 with 8GB of RAM. For these tests I set the CPU governor to performance to ensure I wasn’t seeing effects from SpeedStep throttling the CPU up or down.
The short of it is that I was seeing a significantly higher baseline CPU load than I expected after enabling H2 on my VPS: up from 0.5% to 2-3%. AWS t2.micro instances are burstable configurations designed to operate at a baseline CPU load of 10%, so going from <1% to ~3% was pretty significant. Not a deal killer, but with no change in traffic that increase in compute load would dramatically decrease the headroom I had to grow before I had to consider a higher tier instance.
I appear to have resolved the production problem by applying a simple principle: encryption strength is proportional to computational complexity, so if there’s a lot of computational load, turning down the encryption strength may improve performance. What I didn’t do was much in the way of actual controlled testing to see if my premise was reliable.
My initial configuration was the one recommended on this page.
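To put the headroom concern in numbers, here is a rough sketch. It assumes CPU load scales linearly with traffic, which is a simplification, and the function names are just for illustration.

```python
# Baseline CPU budget for a t2.micro, as stated above (percent).
BASELINE_BUDGET = 10.0


def budget_used(load_percent, budget=BASELINE_BUDGET):
    """Fraction of the baseline CPU budget consumed by a steady load."""
    return load_percent / budget


def growth_headroom(load_percent, budget=BASELINE_BUDGET):
    """How many times the load could grow before reaching the budget.

    Assumes load scales linearly with traffic (a simplification).
    """
    return budget / load_percent


for label, load in [("before H2", 0.5), ("after H2", 3.0)]:
    print(f"{label}: {budget_used(load):.0%} of baseline used, "
          f"{growth_headroom(load):.1f}x room to grow")
```

Under that assumption, the room to grow drops from about 20x to a little over 3x.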
Which is shorthand that expands to the following.
The Mozilla foundation also has a really nice SSL/TLS configuration generator. I also tried their intermediate cipher suite, before really thinking about the problem. It follows below.
In both cases, AES256 is the vastly preferred bulk encryption algorithm.
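If you want to check what a cipher string expands to on your own machine, Python’s `ssl` module can ask the local OpenSSL for the resulting suite list. This is only a sketch: the cipher string below is a hypothetical placeholder, not either of the configurations discussed here, and the output depends on the OpenSSL build (TLS 1.3 suites are typically listed regardless of the string).

```python
import ssl


def expand_cipher_string(cipher_string):
    """Return the names of the suites OpenSSL enables for a cipher string."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Raises ssl.SSLError if the string selects no usable cipher.
    context.set_ciphers(cipher_string)
    return [suite["name"] for suite in context.get_ciphers()]


# Hypothetical cipher string for illustration only.
for name in expand_cipher_string("ECDHE+AESGCM"):
    print(name)
```

The list should come back in the context’s preference order, which makes it easy to see whether AES256 or AES128 suites are tried first.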
There are a number of places along the process where tradeoffs are made, though in some cases they don’t appear to matter much or are simply outweighed by other factors.
For example, take key exchange algorithms.
RSA is really fast, especially with smallish keys of 2048 bits or smaller (though keys less than 2048 bits are considered weak and keys less than 1024 bits are considered insecure). But it doesn’t support perfect forward secrecy (PFS). ECDHE is slightly slower, but it supports PFS. DHE is considerably slower than either RSA or ECDHE, and supports PFS.
However, the handshake generally only happens once per connection. Moreover, many servers support TLS session caches and TLS session tickets, both of which allow clients that have already undergone the handshake process to use an abridged handshake to request further data from the server.
When the handshake only happens once, it’s not crucially important to use the absolute fastest algorithm. However, ECDHE yields both good performance and a desirable feature in PFS, and should probably be preferred even if it’s a few percent slower than small keyed RSA.
Note: I’m having a hard time finding more recent benchmarks on key exchanges; if I find one I may have to revise these conclusions.
Far more time is spent in the bulk encryption; that’s what’s used to encrypt all the data being transferred. Lots of data means a lot of encryption, which means a lot more time spent by the CPU encrypting.
There are four parameters here that I want to look at.
- Hardware acceleration for AES
- Cipher mode (GCM or CBC)
- Cipher strength (128 or 256)
- Hashing algorithm
I’ll start with hardware acceleration.
Most modern CPUs from AMD and Intel support AES-NI instructions, which accelerate the AES processing. The performance advantages for having AES-NI should not be underestimated. On my Xeon E3-1220 v2, running the OpenSSL benchmark turning off AES-NI less than halv
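| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
|---|---|---|---|---|---|
| aes-ni-off | 88047.90k | 101570.11k | 238936.75k | 254754.47k | 258738.86k |
| aes-ni | 339001.88k | 878209.19k | 1210872.32k | 1331487.74k | 1365303.30k |
| aes-ni / aes-ni-off | 385% | 864% | 507% | 523% | 528% |
| aes-ni (AWS) | 372152.35k | 877413.76k | 1806772.76k | 2260340.33k | 2612296.03k |
| aes-ni (AWS) / aes-ni-off | 423% | 864% | 756% | 887% | 971% |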
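For reference, one common way to produce numbers like these is `openssl speed -evp`, with the `OPENSSL_ia32cap` environment variable used to hide AES-NI from OpenSSL. The sketch below only builds and runs that command. It is not necessarily the exact invocation behind the table: the mask value (which hides AES-NI and PCLMULQDQ on x86-64) and the choice of `aes-128-gcm` are assumptions, and the helper names are hypothetical.

```python
import os
import subprocess

# Assumption: capability mask commonly used to hide AES-NI and PCLMULQDQ
# from OpenSSL on x86-64. The table may have been produced differently.
AESNI_OFF_MASK = "~0x200000200000000"


def speed_invocation(cipher="aes-128-gcm", disable_aesni=False):
    """Build the argv and environment for an `openssl speed` run."""
    argv = ["openssl", "speed", "-evp", cipher]
    env = dict(os.environ)
    if disable_aesni:
        env["OPENSSL_ia32cap"] = AESNI_OFF_MASK
    return argv, env


def run_speed(cipher="aes-128-gcm", disable_aesni=False):
    """Run the benchmark and return its text output (this takes a while)."""
    argv, env = speed_invocation(cipher, disable_aesni)
    result = subprocess.run(argv, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout
```

Calling `run_speed(disable_aesni=True)` and `run_speed()` back to back gives a comparable pair of rows, assuming the `openssl` binary is on the path.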
For workloads I think we’d expect in an SSL system, that is blocks 1KB or bigger, AES-NI improves AES-128 performance by between 5 and 10 times.
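The speedup on my Xeon can be recomputed directly from the raw throughput figures. A small sketch, with the numbers copied from the aes-ni-off and aes-ni rows (the helper name is just for illustration):

```python
# Throughput figures copied from the table above (the "k" suffix dropped).
BLOCK_SIZES = [16, 64, 256, 1024, 8192]
AES_NI_OFF = [88047.90, 101570.11, 238936.75, 254754.47, 258738.86]
AES_NI_ON = [339001.88, 878209.19, 1210872.32, 1331487.74, 1365303.30]


def speedups(baseline, accelerated):
    """Ratio of accelerated to baseline throughput for each block size."""
    return [fast / slow for slow, fast in zip(baseline, accelerated)]


for size, ratio in zip(BLOCK_SIZES, speedups(AES_NI_OFF, AES_NI_ON)):
    print(f"{size:>5} bytes: {ratio:.2f}x")
```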
The other big takeaway here is that either having 30MB of L3 cache or Haswell’s microarchitecture makes a big difference. Of course my AWS instances run on Amazon’s custom Xeons, which may have more special sauce in them that’s not readily obvious.
Moving on to GCM v. CBC, GCM is also a pretty substantial win at realistic work sizes.
| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
|---|---|---|---|---|---|
| aes-128-cbc | 634449.49k | 676019.24k | 686977.62k | 689722.03k | 690656.60k |
| aes-128-gcm | 339001.88k | 878209.19k | 1210872.32k | 1331487.74k | 1365303.30k |
| gcm / cbc | 53% | 130% | 176% | 193% | 198% |
While CBC is more consistent, GCM does way better at 64 bytes and beyond. At blocks that are the size of small buffers (or what look like they might be reasonable sizes for buffers), the performance is almost twofold better.
Then there’s the big question, AES128 v. AES256.
| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
|---|---|---|---|---|---|
| aes-128-gcm | 336307.95k | 880610.24k | 1209631.32k | 1330737.83k | 1365079.38k |
| aes-256-gcm | 279421.27k | 758233.43k | 1080570.45k | 1182194.69k | 1210840.41k |
| aes-256 / aes-128 | 83% | 86% | 89% | 89% | 89% |
So AES256 is ~11% slower than AES128.
Finally the hashing algorithm. On my 64-bit server I get the following results.
| type | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes |
|---|---|---|---|---|---|
| sha256 | 65269.04k | 146854.27k | 257154.99k | 319237.31k | 340817.24k |
| sha384 | 38362.10k | 153405.99k | 270212.10k | 407485.22k | 477689.17k |
| sha384 / sha256 | 56% | 104% | 105% | 128% | 140% |

Interestingly, SHA512 (which SHA384 is derived from) is faster on 64-bit machines than SHA256. This post suggests that SHA512 is faster than SHA256 because it processes 2x as much data per cycle, even though there are a few more cycles. It’s also worth pointing out that SHA384, being a truncated hash, is immune to length-extension attacks that work against SHA512.
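A quick way to check this on another machine is a small timing loop with Python’s `hashlib`. This is a rough sketch rather than a replacement for the OpenSSL benchmark: the function name, block size, and duration are arbitrary choices, and the results depend heavily on the CPU and on how the underlying library was built.

```python
import hashlib
import time


def throughput_mb_s(algorithm, block_size=8192, duration=0.5):
    """Hash fixed-size blocks for roughly `duration` seconds; return MB/s."""
    block = b"\x00" * block_size
    processed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        # One complete digest per block, so per-call overhead is included.
        hashlib.new(algorithm, block).digest()
        processed += block_size
    elapsed = time.perf_counter() - start
    return processed / elapsed / 1e6


for name in ("sha256", "sha384", "sha512"):
    print(f"{name}: {throughput_mb_s(name):.0f} MB/s")
```

Passing a smaller `block_size`, such as 16 or 64, should show the same kind of per-call overhead that dominates the left-hand columns of the table.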
So what does this all tell us?
It does show that AES256 is slower than AES128 by around 11%, though that doesn’t explain why I would see a 500% increase in CPU load when it was running.
AES-NI also accelerates AES256; in fact it seems to do even better with AES256 than it does with AES128 in terms of the performance boost. So that shouldn’t have been a factor. Moreover, I see a 5.4x speedup for AES-NI. An 11% performance difference times 5.4x if AES-NI isn’t running would give me a 60% performance difference, not a 500% one.
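Spelled out as arithmetic, using the 1024 byte column from the AES128 v. AES256 table and the load figures from the top of the post (3% is taken as the upper end of the 2-3% range):

```python
# Throughput for 1024 byte blocks, copied from the AES128 v. AES256 table.
AES128_GCM = 1330737.83
AES256_GCM = 1182194.69

# AES-NI speedup figure quoted in the text.
AESNI_SPEEDUP = 5.4

# Fraction by which AES256 is slower than AES128.
slowdown = 1 - AES256_GCM / AES128_GCM

# Pessimistic estimate: the slowdown magnified by the AES-NI factor.
worst_case = slowdown * AESNI_SPEEDUP

# Observed change in baseline CPU load, 0.5% to 3%.
observed_increase = (3.0 - 0.5) / 0.5

print(f"AES256 slowdown:       {slowdown:.0%}")
print(f"worst case, no AES-NI: {worst_case:.0%}")
print(f"observed increase:     {observed_increase:.0%}")
```

Even the pessimistic estimate is nowhere near the observed change.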
Given the benchmark data, SHA2 hashes would appear to be the bigger limiting factor, but SHA384 is faster than SHA256, so that doesn’t account for the performance problems either.
In short, I’m stumped. I know the performance was related to real traffic, as the increase in CPU load directly corresponded to network traffic, so it wasn’t some runaway process doing something in the background. Moreover, that network traffic corresponded to my typical daily traffic curves, so I think it was legitimate traffic.