Category: Google Cloud Functions

Azure Functions – Significant Improvements in HTTP Trigger Scaling

A while back I wrote about the improvements Microsoft were working on in regard to the HTTP trigger function scaling issues. The Functions team got in touch with me this week to let me know that they had an initial set of improvements rolling out to Azure.

To get an idea of how significant these improvements are I’m first going to contrast this new update to Azure Functions with my previous measurements and then re-examine Azure Functions in the wider context of the other cloud vendors. I’m specifically separating out the Azure vs Azure comparison from the Azure vs Other Cloud Vendors comparison as while the former is interesting given where Azure found itself in the last set of tests and to highlight how things have improved but isn’t really relevant in terms of a “here and now” vendor comparison.

A quick refresh on the tests – the majority of them are run with a representative typical real world mix of a small amount of compute and a small level of IO though tests are included that remove these and involve no IO and practically no computer (return a string).

Although the improvements aren’t yet enabled by default towards the end of this post I’ll highlight how you can enable these improvements for your own Function Apps.

Azure Function Improvements

First I want to take a look at Azure Functions in isolation and see just how the new execution and scaling model differs from the one I tested in January. For consistency the tests are conducted against the exact same app I tested back in January using the same VSTS environment.

Gradual Ramp Up

This test case starts with 1 user and adds 2 users per second up to a maximum of 500 concurrent users to demonstrate a slow and steady increase in load.

This is the least demanding of my tests but we can immediately see how much better the new Functions model performs. When I ran these tests in January the response time was very spiky and averaged out around the 0.5 second mark – the new model holds a fairly steady 0.2 seconds for the majority of the run with a slight increase at the tail and manages to process over 50% more requests.

Rapid Ramp Up

This test case starts with 10 users and adds 10 users every 2 seconds up to a maximum of 1000 concurrent users to demonstrate a more rapid increase in load and a higher peak concurrency.

In the previous round of tests Azure Functions really struggled to keep up with this rate of growth. After a significant period of stability in user volume it eventually reached a state of being semi-acceptable but the data vividly showed a system really straining to respond and gave me serious concerns about its ability to handle traffic spikes. In contrast the new model grows very evenly with the increasing demand and, other than a slight spike early on, maintaining a steady response time throughout.

Immediate High Demand

This test case starts immediately with 400 concurrent users and stays at that level of load for 5 minutes demonstrating the response to a sudden spike in demand.

Again this test highlights what a significant improvement has been made in how Azure Functions responds to demand – the new model is able to deal with the sudden influx of users immediately, whereas in January it took nearly the full execution of the test for the system to catch up with the demand.

Stock Functions

This test uses the stock “return a string” function provided by each platform (I’ve captured the code in GitHub for reference) with the immediate high demand scenario: 400 concurrent users for 5 minutes.

The minimalist nature of this test (return a string) very much highlights the changes made to the Azure Functions hosting model and we can see that not only is there barely any lag in growing to meet the 400 user demand but that response time has been utterly transformed. It’s, to say the least, a significant improvement over what I saw in January when even with essentially no code to execute and no IO to perform Functions suffered from horrendous performance in this test.

Percentile Performance

I was unable to obtain this data from VSTS and so resorted to running Apache Benchmarker. For this test I used settings of 100 concurrent requests for a total of 10000 requests, collected the raw data, and processed it in Excel. It should be noted that the network conditions were less predictable for these tests and I wasn’t always as geographically close to the cloud function as I was in other tests though repeated runs yielded similar patterns:

Yet again we can see the massive improvements made by the Azure Functions team – performance remains steady up until 99.9th percentile. Full credit to the team – the improvement here is so significant that I actually had to add in the fractional percentiles to uncover the fall off.

Revised Comparison With Other Vendors

We can safely say by now that this new hosting model for Azure Functions is a dramatic improvement for HTTP triggered functions – but how does it compare with the other vendors? Last time round Functions was barely at the party – this time… lets see!

Gradual Ramp Up

On our gradual ramp up test Azure still lags behind both AWS and Google in terms of response time but actually manages a higher throughput than Google. As demand grows Azure is also experiencing a slight deterioration in response time where the other vendors remain more constant.

Rapid Ramp Up

Response time and throughput results for our rapid ramp up test are not massively dissimilar to the gradual ramp up test. Azure experiences a significant fall in performance around the 3 minute mark as the number of users approaches 1000 – but as I said earlier the Functions team are working on further area at this level of scale and beyond and I would assume at this point that some form of resource reallocation is causing this that needs smoothing out.

It’s also notable that although some way behind AWS Lambda Azure manages a reasonably higher throughput that Google Cloud – in fact it’s almost half way between the two competing vendors so although response times are longer there seems to be more overall capacity which could be an important factor in any choice between those two platforms.

Immediate High Demand

Again we see very much the same pattern – AWS Lambda is the clear leader in both response time and throughput while 2nd place for response time goes to Google and 2nd place for throughput goes to Azure.

Stock Functions

Interestingly in this comparison of stock functions (returning a string and so very isolated) we can see that Azure Functions has drawn extremely close to AWS Lambda and ahead of Google Cloud which really is an impressive improvement.

This suggests that other factors are now playing a proportionally bigger factor in the scaling tests than Functions capability to scale – previously this was clearly driving the results. Additional tests would need to be run to isolate if this is the case and whether or not this is related to the IO capabilities of the Functions host or the capabilities of external dependencies.

Percentile Performance

The percentile comparison shows some very interesting differences between the three platforms. At lower percentiles AWS and Google outperform Azure however as we head into the later percentiles they both deteriorate while Azure deteriorates more gradually with the exception of the worst case response time.

Across the graph Azure gives a more generally even performance suggesting that if consistent performance across a broader percentile range is more important than outright response time speed it may be a better choice for you.

Enabling The Improvements

The improvements I’ve measured and highlighted here are not yet enabled by default, but will be with the next release. In the meantime you can give them a go by adding an App Setting with the name WEBSITE_HTTPSCALEV2_ENABLED to 1.

Conclusions

In my view the Azure Functions team have done some impressive work in a fairly short space of time to transform the performance of Azure Functions triggered by HTTP requests. Previously the poor performance made them difficult to recommend except in a very limited range of scenarios but the work the team have done has really opened this up and made this a viable platform for many more scenarios. Performance is much more predictable and the system scales quickly to deal with demand – this is much more in line with what I’d hoped for from the platform.

I was sceptical about how much progress was possible without significant re-architecture but, as an Azure customer and someone who wants great experiences for developers (myself included), I’m very happy to have been wrong.

In the real world representative tests there is still a significant response time gap for HTTP triggered compute between Azure Functions and AWS Lambda however it is not clear from these tests alone if this is related to Functions or other Azure components. Time allowing I will investigate this further.

Finally my thanks to the @azurefunctions team, @jeffhollan and @davidebbo both for their work on improving Azure Functions but also for the ongoing dialogue we’ve had around serverless on Azure – it’s great to see a team so focused on developer experience and transparent about the platform.

If you want to discuss my findings or tech in general then I can be found on Twitter: @azuretrenches.

Azure Functions vs AWS Lambda vs Google Cloud Functions – JavaScript Scaling Face Off

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

I had a lot of interesting conversations and feedback following my recent post on scaling a serverless .NET application with Azure Functions and AWS Lambda. A common request was to also include Google Cloud Functions and a common comment was that the runtimes were not the same: .NET Core on AWS Lambda and .NET 4.6 on Azure Functions. In regard to the latter point I certainly agree this is not ideal but continue to contend that as these are your options for .NET and are fully supported and stated as scalable serverless runtimes by each vendor its worth understanding and comparing these platforms as that is your choice as a .NET developer. I’m also fairly sure that although the different runtimes might make a difference to outright raw response time, and therefore throughput and the ultimate amount of resource required, the scaling issues with Azure had less to do with the runtime and more to do with the surrounding serverless implementation.

Do I think a .NET Core function in a well architected serverless host will outperform a .NET Framework based function in a well architected serverless host? Yes. Do I think .NET Framework is the root cause of the scaling issues on Azure? No. In my view AWS Lambda currently has a superior way of managing HTTP triggered functions when compared to Azure and Azure is hampered by a model based around App Service plans.

Taking all that on board and wanting to better evidence or refute my belief that the scaling issues are more host than framework related I’ve rewritten the test subject as a tiny Node / JavaScript application and retested the platforms on this runtime – Node is supported by all three platforms and all three platforms are currently running Node JS 6.x.

My primary test continues to be a mixed light workload of CPU and IO (load three blobs from the vendors storage offering and then compile and run a handlebars template), the kind of workload its fairly typical to find in a HTTP function / public facing API. However I’ve also run some tests against “stock” functions – the vendor samples that simply return strings. Finally I’ve also included some percentile based data which I obtained using Apache Benchmark and I’ve covered off cold start scenarios.

I’ve also managed to normalise the axes this time round for a clearer comparison and the code and data can all be found on GitHub:

https://github.com/JamesRandall/serverlessJsScalingComparison

(In the last week AWS have also added full support for .NET Core 2.0 on Lambda – expect some data on that soon)

Gradual Ramp Up

This test case starts with 1 user and adds 2 users per second up to a maximum of 500 concurrent users to demonstrate a slow and steady increase in load.

The AWS and Azure results for JavaScript are very similar to those seen for .NET with Azure again struggling with response times and never really competing with AWS when under load. Both AWS and Azure exhibit faster response times when using JavaScript than .NET.

Google Cloud Functions run fairly close to AWS Lambda but can’t quite match it for response time and fall behinds on overall throughput where it sits closer to Azure’s results. Given the difference in response time this would suggest Azure is processing more concurrent incoming requests than Google allowing it to have a similar throughput after the dip Azure encounters at around the 2:30 mark – presumably Azure allocates more resource at that point. That dip deserves further attention and is something I will come back to in a future post.

Rapid Ramp Up

This test case starts with 10 users and adds 10 users every 2 seconds up to a maximum of 1000 concurrent users to demonstrate a more rapid increase in load and a higher peak concurrency.

Again AWS handles the increase in load very smoothly maintaining a low response time throughout and is the clear leader.

Azure struggles to keep up with this rate of request increase. Response times hover around the 1.5 second mark throughout the growth stage and gradually decrease towards something acceptable over the next 3 minutes. Throughput continues to climb over the full duration of the test run matching and perhaps slightly exceeding Google by the end but still some way behind Amazon.

Google has two quite distinctively sharp drops in response time early on in the growth stageas the load increases before quickly stabilising with a response time around 140ms and levels off with throughput in line with the demand at the end of the growth phase.

I didn’t run this test with .NET, instead hitting the systems with an immediate 1000 users, but nevertheless the results are inline with that test particularly once the growth phase is over.

Immediate High Demand

This test case starts immediately with 400 concurrent users and stays at that level of load for 5 minutes demonstrating the response to a sudden spike in demand.

Both AWS and Google scale quickly to deal with the sudden demand both hitting a steady and low response time around the 1 minute mark but AWS is a clear leader in throughput – it is able to get through many more requests per second than Google due to its lower response time.

Azure again brings up the rear – it takes nearly 2 minutes to reach a steady response time that is markedly higher than both Google and AWS. Throughput continues to increase to the end of the test where it eventually peaks slightly ahead of Google but still some way behind AWS. It then experiences a fall off which is difficult to explain from the data available.

Stock Functions

This test uses the stock “return a string” function provided by each platform (I’ve captured the code in GitHub for reference) with the immediate high demand scenario: 400 concurrent users for 5 minutes.

With the functions essentially doing no work and no IO the response times are, as you would expect, smaller across the board but the scaling patterns are essentially unchanged from the workload function under the same load. AWS and Google respond quickly while Azure ramps up more slowly over time.

Percentile Performance

I was unable to obtain this data from VSTS and so resorted to running Apache Benchmarker. For this test I used settings of 100 concurrent requests for a total of 10000 requests, collected the raw data, and processed it in Excel. It should be noted that the network conditions were less predictable for these tests and I wasn’t always as geographically close to the cloud function as I was in other tests though repeated runs yielded similar patterns:

AWS maintains a pretty steady response time up to and including the 98th percentile but then shows marked dips in performance in the 99th and 100th percentiles with a worst case of around 8.5 seconds.

Google dips in performance after the 97th percentile with it’s 99th percentile roughly equivalent to AWSs 100th percentile and it’s own 100th percentile being twice as slow.

Azure exhibits a significant dip in performance at the 96th percentile with a sudden drop in response time from a not great 2.5 seconds to 14.5 seconds – in AWSs 100th percentile territory. Beyond the 96th percentile their is a fairly steady decrease in performance of around 2.5 seconds per percentile.

Cold Starts

All the vendors solutions go “cold” after a time leading to a delay when they start. To get a sense for this I left each vendor idle overnight and then had 1 user make repeat requests for 1 minute to illustrate the cold start time but also get a visual sense of request rate and variance in response time:

Again we have some quite striking results. AWS has the lowest cold start time of around 1.5 seconds, Google is next at 2.5 seconds and Azure again the worst performer at 9 seconds. All three systems then settle into a fairly consistent response time but it’s striking in these graphs how AWS Lambda’s significantly better performance translates into nearly 3x as many requests as Google and 10x more requests than Azure over the minute.

It’s worth noting that the cold start time for the stock functions is almost exactly the same as for my main test case – the startup is function related and not connected to storage IO.

Conclusions

AWS Lambda is the clear leader for HTTP triggered functions – on all the runtimes I’ve tried it has the lowest response times and, at least within the volumes tested, the best ability to deal with scale and the most consistent performance. Google Cloud Functions are not far behind and it will be interesting to see if they can close the gap with optimisation work over the coming year – if they can get their flat our response times reduced they will probably pull level with AWS. The results are similar enough in their characteristics that my suspicion is Google and AWS have similar underlying approaches.

Unfortunately, like with the .NET scenarios, Azure is poor at handling HTTP triggered functions with very similar patterns on show. The Azure issues are not framework based but due to how they are hosting functions and handling scale. Hopefully over the next few months we’ll see some improvements that make Azure a more viable host for HTTP serverless / API approaches when latency matters.

By all means use the above as a rough guide but ultimately whatever platform you choose I’d encourage you to build out the smallest representative vertical slice of functionality you can and test it.

Thanks for reading – hopefully this data is useful.

Recent Posts

Recent Tweets

Recent Comments

Archives

Categories

Meta

GiottoPress by Enrique Chavez