
Writing and Testing Azure Functions with Function Monkey – Part 3

Part 3 of my series on writing Azure Functions with Function Monkey focuses on writing tests using the newly released testing package – while this is by no means required, it does make writing high value acceptance tests that exercise your application’s full runtime quick and easy.

Lessons Learned

It really is amazing how quickly time passes when you’re talking and coding – I really hadn’t realised I’d recorded over an hour’s footage until I came to edit the video. I thought about splitting it in two but the content really belonged together so I’ve left it as is.

Bike Reminders – A breakdown of a real Azure application (Part 1)

I’ve been meaning to write about a real cloud based project for some time but the criteria a good candidate project needs to meet are challenging:

  • Significant enough to illustrate numerous design and implementation decisions
  • Not so large that the time investment for a reader to get into it is prohibitive
  • I need to own, or have free access to, the intellectual property
  • It needs to be something I want, or am contracted, to build for reasons beyond writing about it

To expand upon that last point a little – I don’t have the time to build something just for a series of blog posts and if I did I suspect it would be too artificial and would essentially end up as a strawman.

The real world and real development are constrained and messy; you come across things that you can’t economically solve in an ivory-towered fashion. You can’t always predict everything in advance, you get things wrong, and you don’t always have the time available to start again, so you have to do the best that you can with what you have.

In the case of this project I hadn’t really thought about it as a candidate for writing about until I neared the end of building the MVP and so it comes, rather handily, with warts and all. For sure I’ve refactored things but no more than you’d expect to on any time and budget constrained project.

My intention is, over the course of a series of posts, to explore this application in an end to end fashion: the requirements, the architecture, the code, testing, deployment – pretty much its end to end lifecycle. Hopefully this will contain useful nuggets of information that can be applied on other projects and help those new to Azure get up and running.

About the project

So what does the project do?

If you’re a keen cyclist you’ll know that you need to check various components on your bike at regular intervals. You’ll also know that some of the components last just long enough that you’ll forget about them – my personal nemesis is chain wear: more than once I’ve taken that to the point where it is likely to start damaging the rear cassette, having completely forgotten about it.

I’m fortunate enough to have a rather nice bike and so there is nothing cheap about replacing anything – really not a mistake you want to be making. Many bikes also now contain components that need charging – Di2 and eTap are increasingly common – and though I’ve yet to get caught out on a ride I’ve definitely run it closer than I realised.

After the last time I made this mistake I decided to do something about it and thus was born Bike Reminders: a website that links up with Strava to send you reminders as you accrue mileage on each of your bikes. While not a substitute for regularly checking your bike I’m hopeful it will at least give me a prod over chain wear! I contemplated going direct to Garmin but they seem to want circa $5000 for API access and that’s a lot of component damage before I break even – ouch.

In terms of an MVP that distilled out into a handful of high level requirements:

  • Authenticate with Strava
  • Access a user’s bikes in Strava
  • Allow a mileage based maintenance schedule to be set up against a bike
  • Allow email reminders to be dismissed / reset
  • Allow email reminders to be snoozed
  • Update the progress towards each reminder based on rider activity in Strava

There were also some requirements I wanted to keep in mind for the future:

  • Time based reminders
  • “First ride of the week” type reminders
  • Allow reminders to be sent via push notifications
  • Predictive information – based on a rider’s history, when is a reminder likely to be triggered? This is useful if, for example, you’re going away on a training camp and want to get maintenance done before you go

Setting off on the project I set a number of overarching goals / non-functional requirements for it:

  • Keep it small enough that it could be built alongside a two (expanded to three!) week cycling training block in Mallorca
  • To have a very low cost to run both in terms of minimum footprint (cost to run 1 user) and per user cost as the system scales up
  • To require little to no maintenance and a fully automated delivery mechanism
  • To support multiple client types (initially web but to be followed up with a Flutter app)
  • Keep personal data out of it as far as possible
  • As far as possible spin out any work that isn’t specific to the problem domain as open source (I’m fairly likely to reuse it myself if nothing else)

And although I try not to jump ahead of myself that mapped nicely onto using Azure as a cloud provider with Azure Functions for compute and Azure DevOps and Application Insights for the operational side of things.

Architecture

The next step was to figure out what I’d need to build – initially I worked this through on a “mental beermat” while out cycling but I like to use the C4 Model to describe software systems. It gives a basic structure and just enough tools to think about and describe systems at different levels of architecture without disappearing up its own backside in complexity and becoming an end in and of itself.

System Context

For this fairly simple and greenfield system establishing the big picture was fairly straight forward. It’s initially going to comprise a website accessed by cyclists with their Strava logins, connecting to Strava for tracking mileage, and sending emails, for which I chose SendGrid due to existing familiarity with it.

Containers

Breaking this down into more detail forced me to start making some additional decisions. If I was going to build an interactive website / app I’d need some kind of API for which I decided to use Azure Functions. I’ve done a lot of work with them, have a pretty good library for building REST APIs with them (Function Monkey) and they come with a generous free usage allowance which would help me meet my low cost to operate criterion. The event based programming model would also lend itself to handling things like processing queues which is how I envisaged sending emails (hence a message broker – the Azure Service Bus).

For storage I wanted something simple – although at an early stage it seemed to me that I’d be able to store all the key details about cyclists, their bikes and reminders in a JSON document keyed off the cyclist’s ID. And if something more complex emerged I reasoned it would be easy to convert this kind of format into another. Again cost was a factor and as I couldn’t see, based on my simple requirements, any need for complex queries I decided to at least start with plain old Azure Storage Blob Containers and a filename based on the ID. This would have the advantage of being really simple and really cheap!
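To make that concrete, a minimal repository over a blob container might look something like the sketch below. This is illustrative only rather than the actual Bike Reminders code – it assumes the WindowsAzure.Storage and Json.NET packages, and the class, container and document names are hypothetical:

using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Newtonsoft.Json;

// hypothetical document shape - the real one holds the cyclist, their bikes and reminders
public class AthleteDocument
{
    public long AthleteId { get; set; }
    public string Name { get; set; }
}

public class AthleteDocumentRepository
{
    private readonly CloudBlobContainer _container;

    public AthleteDocumentRepository(string connectionString)
    {
        // one container with one JSON blob per athlete, named after their ID
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        _container = account.CreateCloudBlobClient().GetContainerReference("athletes");
    }

    public async Task SaveAsync(long athleteId, AthleteDocument document)
    {
        CloudBlockBlob blob = _container.GetBlockBlobReference($"{athleteId}.json");
        await blob.UploadTextAsync(JsonConvert.SerializeObject(document));
    }

    public async Task<AthleteDocument> GetAsync(long athleteId)
    {
        CloudBlockBlob blob = _container.GetBlockBlobReference($"{athleteId}.json");
        if (!await blob.ExistsAsync())
        {
            return null;
        }
        return JsonConvert.DeserializeObject<AthleteDocument>(await blob.DownloadTextAsync());
    }
}

There are no queries to worry about at all – every read and write is a single blob operation against a well known name.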

The user interface was a simple decision: I’ve done a lot of work with React and I saw no reason it wouldn’t work for this project. Over the last few months I’ve been experimenting with TypeScript and I’ve found it of help with the maintainability of JavaScript projects and so decided to use that from the start on this project.

Finally I needed to figure out how I’d most likely interact with the Strava API to track changes in mileage. They do have a push API that is available by email request but I wanted to start quickly (and this was Christmas and I had no idea how soon I’d hear back from them) and I’d probably have to do some buffering around the ingestion anyway to prevent confusing short term adjustments – when you upload a route it’s not necessarily associated with the right bike (for example my Zwift rides always end up on my main road bike, not my turbo trainer mounted bike).

So to begin with I decided to poll Strava once a day for updates which would require some form of scheduling. While I wasn’t expecting huge amounts of overnight traffic for the website, Strava do rate limit their API and so I couldn’t use a simple timer function in Azure as that would run the risk of overloading the API quite easily. Instead I figured I could use enqueue visibility on the Service Bus and spread out athletes so that the API would never be overloaded. I’ve faced a similar issue before and so I figured this might also make for a useful piece of open source (it did).
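The mechanics of spreading the athletes out are straightforward because Service Bus lets you schedule a message to become visible at a future point in time. A rough sketch of the idea is below, assuming the Microsoft.Azure.ServiceBus package – the queue name and the interval are illustrative rather than the real implementation:

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus;

public static class PollSchedulingSketch
{
    public static async Task ScheduleAthletePollsAsync(string connectionString, long[] athleteIds)
    {
        var queueClient = new QueueClient(connectionString, "athlete-poll");

        // stagger one message per athlete so the downstream Strava calls trickle
        // through rather than arriving in a single burst when the schedule fires
        DateTimeOffset visibleAt = DateTimeOffset.UtcNow;
        foreach (long athleteId in athleteIds)
        {
            var message = new Message(Encoding.UTF8.GetBytes(athleteId.ToString()));
            await queueClient.ScheduleMessageAsync(message, visibleAt);
            visibleAt = visibleAt.AddSeconds(5);
        }

        await queueClient.CloseAsync();
    }
}

Each message then triggers a queue function that polls Strava for a single athlete, so the rate at which the API is hit is governed by the spacing of the scheduled messages rather than by how quickly Functions decides to scale out.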

All this is summarised in the diagram below:

Azure Topology

Mapped (largely) onto Azure I expected the system to look something like the below:

The notable exception is the introduction of Netlify for my static site hosting. While you can host static sites on Azure it is inelegant at best (and the Azure Storage SPA support is useless as you can’t use SSL and a custom domain) and so a few months back I went searching for an alternative and came across Netlify. It makes building, deploying and hosting sites ridiculously easy and so I’ve been gradually switching my work over to here.

I also, currently, don’t have API Management in front of the Azure Functions that present the REST API – the provisioned approach is simply too expensive for this system at the moment and the consumption model, at least at the time of writing, has a horrific cold start time. I do plan to revisit this.

Next Steps

In the next part we’ll break out the code and begin by taking a look at how I structured the Azure Function app.

SPA Hosting on Azure / Can We Have More Boring Stuff Please

Microsoft announced the long awaited SPA hosting support for Azure yesterday. Before rushing in and adopting it make sure to read the small print – there is no support for custom domain names with SSL, which, given SSL is becoming almost mandatory (and about to become super-prominent as a warning in Chrome), makes it pretty useless for public facing websites in 2018.

It’s based on Azure Storage, which has outstanding user voice requests from 2012 asking for SSL support with custom domains, so it may also not be something that appears soon. I would love to be wrong on that.

The options for resolving it are much like they were before this feature was introduced: proxies of one form or another (CDN, Functions, CloudFlare etc.).

If you’re happy with an Azure storage based domain name, so probably working on an internal facing system I guess, it may still work for you.

As someone who does a lot of work with SPAs I must admit I’m really disappointed in this – it seems pretty ridiculous to me that in 2018, amidst all the “big” Azure announcements (machine learning on the edge, globally distributed planet scale data etc.), hosting a single page app, essentially static files, cheaply and easily is still awkward on the platform and requires multiple services to deliver.

I’d love to see the Azure teams really focus on, and really finish, some of the “boring stuff” like this – it might not get marketing column inches, and might not make for a snazzy new “you can do this in 5 minutes” video to impress purchasing CTOs but it would save time and effort amongst hundreds of thousands of engineers and development shops and would deliver real value to those people. I’m often reminded of the Saving Lives story that Andy Hertzfeld tells.

Using ReactJS with Azure AD B2C

Azure AD B2C is Microsoft’s identity provider for social and enterprise logins allowing you to, for example, unify the login process across Twitter, Facebook, and Azure AD / Office 365. It comes with a generous free tier and beyond that the pricing is reasonable, particularly compared to the pricing for “enterprise” logins with some of the competition.

However the downside is that the documentation for B2C and its integration with specific technologies isn’t that clear – there’s nothing particularly strange about B2C, ultimately it’s just an OpenID Connect identity provider, but there is some nuance to it.

In parallel Microsoft provide MSAL (the Microsoft Authentication Library) for handling authentication from JavaScript clients and here the documentation is clearer but still a little incomplete, and it can be difficult to figure out the implementation required for a particular scenario – not helped by the library reference having no content other than repeated method definitions.

I’m currently working with a handful of projects based around React JS, Azure AD B2C, and a combination of ASP.Net Core MVC and Azure Functions and found myself grappling with this. What I was doing seemed eminently reusable (and I hope useful) and so I set some time aside to take what I’d learned and create a B2C specific npm package – react-azure-adb2c.

To install it if you’re using npm:

npm install react-azure-adb2c --save

Or if you’re using yarn:

yarn add react-azure-adb2c

Before continuing you’ll need to set up Azure AD B2C for API access and the three tutorials here are a reasonably easy to follow guide on how to do that. At the end of that process you should have a tenant name, a sign in and/or up policy, an application ID, and one or more scopes.

The first change to make in your app to use the package is to initialize it with your B2C details:

import authentication from 'react-azure-adb2c';

authentication.initialize({
    // your B2C tenant
    tenant: 'myb2ctenant.onmicrosoft.com',
    // the policy to use to sign in, can also be a sign up or sign in policy
    signInPolicy: 'mysigninpolicy',
    // the ID of the B2C application you want to authenticate with
    applicationId: '75ee2b43-ad2c-4366-9b8f-84b7d19d776e',
    // where MSAL will store state - localStorage or sessionStorage
    cacheLocation: 'sessionStorage',
    // the scopes you want included in the access token
    scopes: ['https://myb2ctenant.onmicrosoft.com/management/admin'],
    // optional, the URI to redirect to after logout
    postLogoutRedirectUri: 'http://myapp.com'
});

There are then two main ways you can use the library. You can either protect the entire application (for example if you have a React app that is launched from another landing area) or specific components. To protect the entire application simply wrap the app startup code in index.js as shown below:

authentication.run(() => {
  ReactDOM.render(<App />, document.getElementById('root'));
  registerServiceWorker();  
});

To require authentication for specific components the react-azure-adb2c library provides a function that will wrap a component in a higher order component as shown in the example below:

import React, { Component } from 'react';
import authentication from 'react-azure-adb2c'
import { BrowserRouter as Router, Route, Switch } from "react-router-dom";
import HomePage from './Homepage'
import MembersArea from './MembersArea'

class App extends Component {
  render() {
    return (
      <Router basename={process.env.PUBLIC_URL}>
        <Switch>
          <Route exact path="/" component={HomePage} />
          <Route exact path="/membersArea" component={authentication.required(MembersArea)} />
        </Switch>
      </Router>
    );
  }
}

And finally to get the access token to use with API calls:

import authentication from 'react-azure-adb2c'

// ...

const token = authentication.getAccessToken();

If you find any issues please let me know over on GitHub.

Hopefully that’s useful and takes some of the pain out of using ReactJS with Azure AD B2C and as ever I can be reached on Twitter for discussion.

Azure Functions – Microsoft Feedback on HTTP Trigger Scaling

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

Following the analysis I published on Azure Functions and the latency in scaling HTTP triggered functions the Microsoft development team got in touch to discuss my findings and provide some information about the future which they were happy for me to share.

Essentially the team are already at work making improvements in this area. Understandably they were unable to commit to timescales or make specific claims as to how significant those improvements will be but my sense is we’re looking at a handful of months and so, hopefully, the first half of this year. They are going to get in touch with me once something is available and I’ll rerun my tests.

I must admit I’m slightly sceptical as to whether they’ll be able to match the scaling capability of AWS Lambda (and to be clear they did not make any such claim), which is what I’d like to see, as that looks to me as if it would require a radical uprooting of the Functions runtime model rather than an evolution – but ultimately I’m just a random, slightly informed, punter. Hopefully they can at least get close enough that Azure Functions can be used in more latency critical and spiky scenarios.

I’d like to thank @jeffhollan and the team for the call – as a predominantly Azure and .NET developer it’s both helpful and encouraging to be able to have these kinds of dialogues around the platform so critical to our success.

In the interim I’m still finding I can use HTTP functions – I just have to be mindful of their current limitations – and have some upcoming blog posts on patterns that make use of them.

Azure Functions – Scaling with a Dedicated App Service Plan

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions. I’ve not yet had the opportunity to test performance on dedicated app service plans but please see this post for a revised comparison on the Consumption Plan.

After my last few posts on the scaling of Azure Functions I was intrigued to see if they would perform any better running on a dedicated App Service Plan. Hosting them in this way allows the functions to take full advantage of App Service features but, to my mind, is no longer a serverless approach as rather than being billed based on usage you are essentially renting servers and are fully responsible for scaling.

I conducted a single test scenario: an immediate load of 400 concurrent users running for 5 minutes against the “stock” JavaScript function (no external dependencies, just returns a string) on 4 configurations:

  1. Consumption Plan – billed based on usage – approximately $130 per month
    (based on running constantly at the tested throughput that is around 648 million functions per month)
  2. Dedicated App Service Plan with 1 x S1 server – $73.20 per month
  3. Dedicated App Service Plan with 2 x S1 servers – $146.40 per month
  4. Dedicated App Service Plan with 4 x S1 servers – $292.80 per month

I also included AWS Lambda as a reference point.

The results were certainly interesting:

With immediately available resource all 3 App Service Plan configurations begin with response times slightly ahead of the Consumption Plan but at around the 1 minute mark the Consumption Plan overtakes our single instance configuration and at 2 minutes creeps ahead of the double instance configuration and, while the advantage is slight, at 3 minutes begins to consistently outperform our 4 instance configuration. However AWS Lambda remains some way out in front.

From a throughput perspective the story is largely the same with the Consumption Plan taking time to scale up and address the demand but ultimately proving more capable than even the 4x S1 instance configuration and knocking on the door of AWS Lambda. What I did find particularly notable is the low impact on throughput of moving from 2 to 4 instances – for incurring twice the cost we are barely getting 50% more throughput, which is massively disappointing. I have insufficient data to understand why this is happening but do have some tests in mind that, time allowing, I will run and see if I can provide further information.

At this kind of load (650 million requests per month) from a bang per buck point of view Azure Functions on the Consumption Plan come out strongly compared to App Service instances even if we don’t allow for quiet periods when Functions would incur less cost. If your scale profile falls within the capabilities of the service it’s worth considering, though it’s worth remembering there isn’t really an SLA around Functions at the moment when running on the Consumption Plan (and to be fair the same applies to AWS Lambda).

If you don’t want to take advantage of the additional features that come with a dedicated App Service Plan then, although dedicated instances can be provisioned to avoid the slow ramp up of the Consumption Plan, they are expensive in comparison.

Azure Functions vs AWS Lambda vs Google Cloud Functions – JavaScript Scaling Face Off

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

I had a lot of interesting conversations and feedback following my recent post on scaling a serverless .NET application with Azure Functions and AWS Lambda. A common request was to also include Google Cloud Functions and a common comment was that the runtimes were not the same: .NET Core on AWS Lambda and .NET 4.6 on Azure Functions. In regard to the latter point I certainly agree this is not ideal but continue to contend that as these are your options for .NET and are fully supported and stated as scalable serverless runtimes by each vendor, it’s worth understanding and comparing these platforms as that is your choice as a .NET developer. I’m also fairly sure that although the different runtimes might make a difference to outright raw response time, and therefore throughput and the ultimate amount of resource required, the scaling issues with Azure had less to do with the runtime and more to do with the surrounding serverless implementation.

Do I think a .NET Core function in a well architected serverless host will outperform a .NET Framework based function in a well architected serverless host? Yes. Do I think .NET Framework is the root cause of the scaling issues on Azure? No. In my view AWS Lambda currently has a superior way of managing HTTP triggered functions when compared to Azure and Azure is hampered by a model based around App Service plans.

Taking all that on board and wanting to better evidence or refute my belief that the scaling issues are more host than framework related I’ve rewritten the test subject as a tiny Node / JavaScript application and retested the platforms on this runtime – Node is supported by all three platforms and all three platforms are currently running Node JS 6.x.

My primary test continues to be a mixed light workload of CPU and IO (load three blobs from the vendor’s storage offering and then compile and run a Handlebars template), the kind of workload it’s fairly typical to find in an HTTP function / public facing API. However I’ve also run some tests against “stock” functions – the vendor samples that simply return strings. Finally I’ve also included some percentile based data which I obtained using Apache Benchmark and I’ve covered off cold start scenarios.

I’ve also managed to normalise the axes this time round for a clearer comparison and the code and data can all be found on GitHub:

https://github.com/JamesRandall/serverlessJsScalingComparison

(In the last week AWS have also added full support for .NET Core 2.0 on Lambda – expect some data on that soon)

Gradual Ramp Up

This test case starts with 1 user and adds 2 users per second up to a maximum of 500 concurrent users to demonstrate a slow and steady increase in load.

The AWS and Azure results for JavaScript are very similar to those seen for .NET with Azure again struggling with response times and never really competing with AWS when under load. Both AWS and Azure exhibit faster response times when using JavaScript than .NET.

Google Cloud Functions run fairly close to AWS Lambda but can’t quite match it for response time and fall behind on overall throughput, where they sit closer to Azure’s results. Given the difference in response time this would suggest Azure is processing more concurrent incoming requests than Google, allowing it to have a similar throughput after the dip Azure encounters at around the 2:30 mark – presumably Azure allocates more resource at that point. That dip deserves further attention and is something I will come back to in a future post.

Rapid Ramp Up

This test case starts with 10 users and adds 10 users every 2 seconds up to a maximum of 1000 concurrent users to demonstrate a more rapid increase in load and a higher peak concurrency.

Again AWS handles the increase in load very smoothly maintaining a low response time throughout and is the clear leader.

Azure struggles to keep up with this rate of request increase. Response times hover around the 1.5 second mark throughout the growth stage and gradually decrease towards something acceptable over the next 3 minutes. Throughput continues to climb over the full duration of the test run matching and perhaps slightly exceeding Google by the end but still some way behind Amazon.

Google has two quite distinctive sharp drops in response time early on in the growth stage as the load increases before quickly stabilising with a response time of around 140ms, and levels off with throughput in line with the demand at the end of the growth phase.

I didn’t run this test with .NET, instead hitting the systems with an immediate 1000 users, but nevertheless the results are in line with that test particularly once the growth phase is over.

Immediate High Demand

This test case starts immediately with 400 concurrent users and stays at that level of load for 5 minutes demonstrating the response to a sudden spike in demand.

Both AWS and Google scale quickly to deal with the sudden demand both hitting a steady and low response time around the 1 minute mark but AWS is a clear leader in throughput – it is able to get through many more requests per second than Google due to its lower response time.

Azure again brings up the rear – it takes nearly 2 minutes to reach a steady response time that is markedly higher than both Google and AWS. Throughput continues to increase to the end of the test where it eventually peaks slightly ahead of Google but still some way behind AWS. It then experiences a fall off which is difficult to explain from the data available.

Stock Functions

This test uses the stock “return a string” function provided by each platform (I’ve captured the code in GitHub for reference) with the immediate high demand scenario: 400 concurrent users for 5 minutes.

With the functions essentially doing no work and no IO the response times are, as you would expect, smaller across the board but the scaling patterns are essentially unchanged from the workload function under the same load. AWS and Google respond quickly while Azure ramps up more slowly over time.

Percentile Performance

I was unable to obtain this data from VSTS and so resorted to running Apache Benchmark. For this test I used settings of 100 concurrent requests for a total of 10000 requests, collected the raw data, and processed it in Excel. It should be noted that the network conditions were less predictable for these tests and I wasn’t always as geographically close to the cloud function as I was in other tests, though repeated runs yielded similar patterns:

AWS maintains a pretty steady response time up to and including the 98th percentile but then shows marked dips in performance in the 99th and 100th percentiles with a worst case of around 8.5 seconds.

Google dips in performance after the 97th percentile with its 99th percentile roughly equivalent to AWS’s 100th percentile and its own 100th percentile being twice as slow.

Azure exhibits a significant dip in performance at the 96th percentile with a sudden jump in response time from a not great 2.5 seconds to 14.5 seconds – in AWS’s 100th percentile territory. Beyond the 96th percentile there is a fairly steady decrease in performance of around 2.5 seconds per percentile.

Cold Starts

All the vendors solutions go “cold” after a time leading to a delay when they start. To get a sense for this I left each vendor idle overnight and then had 1 user make repeat requests for 1 minute to illustrate the cold start time but also get a visual sense of request rate and variance in response time:

Again we have some quite striking results. AWS has the lowest cold start time of around 1.5 seconds, Google is next at 2.5 seconds and Azure again the worst performer at 9 seconds. All three systems then settle into a fairly consistent response time but it’s striking in these graphs how AWS Lambda’s significantly better performance translates into nearly 3x as many requests as Google and 10x more requests than Azure over the minute.

It’s worth noting that the cold start time for the stock functions is almost exactly the same as for my main test case – the startup is function related and not connected to storage IO.

Conclusions

AWS Lambda is the clear leader for HTTP triggered functions – on all the runtimes I’ve tried it has the lowest response times and, at least within the volumes tested, the best ability to deal with scale and the most consistent performance. Google Cloud Functions are not far behind and it will be interesting to see if they can close the gap with optimisation work over the coming year – if they can get their flat out response times reduced they will probably pull level with AWS. The results are similar enough in their characteristics that my suspicion is Google and AWS have similar underlying approaches.

Unfortunately, like with the .NET scenarios, Azure is poor at handling HTTP triggered functions with very similar patterns on show. The Azure issues are not framework based but due to how they are hosting functions and handling scale. Hopefully over the next few months we’ll see some improvements that make Azure a more viable host for HTTP serverless / API approaches when latency matters.

By all means use the above as a rough guide but ultimately whatever platform you choose I’d encourage you to build out the smallest representative vertical slice of functionality you can and test it.

Thanks for reading – hopefully this data is useful.

Azure Functions vs AWS Lambda – Scaling Face Off

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

If you’ve been following my blog recently you’ll know I’ve been spending a lot of time with Azure Functions – Microsoft’s implementation of a serverless platform. The idea behind serverless appeals to me massively and seems like the natural next evolution of compute on the cloud with scaling and pricing being, so the premise goes, fully dynamic and consumption based.

The use of App Service Plans (more later) as a host mechanism for Azure Functions gave me some concern about how “serverless” Azure Functions might actually be and so to verify suitability for my use cases I’ve been running a range of different tests around response time and latency that culminated in the “real” application I described in my last blog post and some of the performance tests I ran along the way. I quickly learned that the hosting implementation is not particularly dynamic and so wanted to run comparable tests on AWS Lambda.

To do this I’ve ported the serverless blog over to AWS Lambda, S3 and DynamoDB (the, rather scruffy, code is in a branch on GitHub – I will tidy this up but the aim was to get the tests running) and then I’ve run a number of user volume scenarios against a single test case: loading the homepage. The operations involved in this are:

  1. A GET request to a serverless HTTP endpoint that:
    1. Loads 3 resources from storage (Blob Storage on Azure, S3 on AWS) in an asynchronous batch.
    2. Combines them together using a Handlebars template
    3. Returns the response as a string of type text/html.
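For orientation, a minimal sketch of that kind of handler on the Azure side is shown below. It’s illustrative only – the real code is in the GitHub branch mentioned above – and assumes the WindowsAzure.Storage SDK and Handlebars.Net, with hypothetical blob names:

using System.Net;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using HandlebarsDotNet;
using Microsoft.WindowsAzure.Storage.Blob;

public static class HomepageRenderingSketch
{
    public static async Task<HttpResponseMessage> RenderAsync(CloudBlobContainer container)
    {
        // load the three resources (a template and two content fragments) as an asynchronous batch
        Task<string> templateTask = container.GetBlockBlobReference("homepage.hbs").DownloadTextAsync();
        Task<string> headerTask = container.GetBlockBlobReference("header.html").DownloadTextAsync();
        Task<string> postsTask = container.GetBlockBlobReference("posts.html").DownloadTextAsync();
        await Task.WhenAll(templateTask, headerTask, postsTask);

        // compile and run the Handlebars template over the loaded content
        var template = Handlebars.Compile(templateTask.Result);
        string html = template(new { header = headerTask.Result, posts = postsTask.Result });

        // return the rendered page as a string of type text/html
        return new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(html, Encoding.UTF8, "text/html")
        };
    }
}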

On Azure I’m using .NET 4.6 on the v1 runtime while on AWS I’m using the same code running under .NET Core 1.0. It’s worth noting that latency on blob access remained minimal throughout all these tests (6ms on average across all loads) and when removing blob access from the tests it made little difference to the patterns.

Although the .NET 4.6 and Core runtimes are different (and accepted may exhibit different behaviours) these are the current general availability options for implementing serverless on the two platforms using .NET and both vendors claim full support for them. In Microsoft’s case some of the languages supported on the v1 Azure Functions runtime, the one tested here (v2 is in preview and has serious performance issues with .NET Core), are experimental and documented as having scale problems but C# (which runs under full framework .NET) is not one of them. Both vendors have .NET Core 2.0 support on the way and in preview but given the issues I’m waiting until they go on general availability until I compare them.

The results are, frankly, pretty damning when it comes to Azure Functions ability to scale dynamically and so let’s get into the data and then look at why.

A quick note on the graphs: I’ve pulled these from VSTS, it’s quite hard (or at least I don’t know how!) to equalise the scales and so please do look at the numbers carefully – the difference is quite startling.

Add 2 Users per Second

In this test scenario I’ve started with a single user and then added 2 users per second over a 5 minutes run time up to a maximum of 500 users:

We can see from this test that AWS matches the growth in user load almost exactly, it has no issue dealing with the growing demand and page request times hover around the 100ms mark. Contrast this with Azure which always lags a little behind the demand, is spikier, and has a much higher response time hovering around the 700ms mark.

This is backed up by the average stats from the run:

It’s interesting to note just how many more requests AWS dealt with as a result of its better performance: 215271 as opposed to Azure’s 84419. Well over twice as many.

Constant Load of 400 Concurrent Users

This test hits the application with 400 concurrent users from a standing start and runs over a 10 minute period simulating a sudden spike or influx of traffic and looking at how quickly each serverless environment is able to deal with the load. Neither environment was completely cold as I’d been refreshing the view in the browser but neither had had any significant traffic for some time. The contrast is significant to say the least:

Let’s cover AWS first as it’s so simple: it quickly absorbs the load and hits a steady response time of around 80ms again in under a minute.

Azure, on the other hand, is more complex. Average response time doesn’t fall under a second until the test has been running for 7 minutes and it’s only around then that the system is able to get near the throughput AWS put out in a minute. Pretty disappointing and backed up by the overall stats for the run:

Again it’s striking just how much better the AWS stats are than the Azure figures.

Constant Load of 1000 Concurrent Users

Same scenario as the last test but this time with 1000 users. Let’s get into the data:

Again we can see a similar pattern with Azure slow to scale up to meet the demand while with AWS it is business as usual in under a minute. Interestingly at this level of concurrency AWS also errored heavily during the early scaling:

It should be noted that AWS specifically instructs you to implement retry and backoff handlers on the client which in the load test I am not doing; additionally at this point I am seeing throttle events in the logging for the AWS function – this is something I will look to come back to in the future. However it’s interesting to note the contrasting approaches of the two systems: Azure inflates its response time while AWS prefers to throw errors.

The average stats for the run:

Azure Functions

I don’t think there’s much point dancing around the issue: the above numbers are disappointing. Azure is slow to scale its HTTP triggered functions and once we get beyond the 100 concurrent user point the response times are never great and the experience is generally uneven. For customer facing API / web serving where low latency and response time are critical to a smooth user experience this really rules it out as an option. And it’s not just the .NET 4.6 variant that is poor, as can be seen from my previous posts where I stripped test cases down to the most basic scenarios and used a variety of frameworks. The best case for Azure scaling I’ve found is using a CSX approach to return a string but even that lags behind AWS doing real work as the test cases in this post do:

using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    log.Info("C# HTTP trigger function processed a request.");

    var response = req.CreateResponse();
    response.StatusCode = HttpStatusCode.OK;
    response.Content = new StringContent("<html><head><title>Blog</title></head><body>Hello world</body></html>", System.Text.Encoding.UTF8, "text/html");

    return response;
}

With 1000 concurrent users over 5 minutes:

And with the add 2 users per second scenario:

Even in this final case, and remember this Azure Function is only returning a string, we can see the response time creeping up as the user load increases and the total number of requests served is only 77514 to AWS’s 215271 over the same period with a much lower number of requests per second.

In an additional attempt to validate my conclusion that the Azure Function system is poor at scaling I pointed the AWS Lambda installation at Azure Blob Storage instead of S3. In this test other than the function entry point semantics the code running on AWS is now taking exactly the same branches as the Azure tests and using the same underlying storage mechanism, albeit with a hop across the Internet to access the storage. I ran this scenario using the 400 concurrent user scenario:

We can see from this that other than a slightly increased response time due to the storage being hosted in another data centre AWS continues to perform well and scales up almost immediately, and response time remains steady and low. We can also see there is no issue with Azure Blob Storage – if there was an issue there we’d expect to see it impact these results.

With these additional validation tests (an empty workload and AWS running against Blob Storage) that pretty much isolates the issue to the Azure Function runtime.

And it’s a shame as the developer experience is great, there is solid documentation, and plenty of samples, and the development team on Twitter are ludicrously responsive – to the point that I feel bad saying what I need to say here. I will reach out to them for feedback.

Why is this the case? Well I’d suggest the root of the issue is how the system has been built on top of App Service Plans. It’s not all that, well, serverless and you still find yourself worrying about, well, servers.

On Azure an App Service Plan is essentially a collection of rented servers / reserved compute power of a given spec (CPU, memory) and capabilities. Microsoft have layered what they call a Consumption Plan over this for Azure Functions which provides for automatic scaling and consumption based pricing. Unfortunately if you track what is going on, your Functions are running on a limited number of these servers – which you can evidence by tracking the instance ID and by sharing state between your functions (to be clear: this is not good!).
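If you want to see this for yourself, a trivial C# script function along the lines of the sketch below (illustrative only) will report the same instance ID and a steadily growing counter across requests rather than fresh state per invocation:

using System.Net;
using System.Threading;

// static state declared in run.csx survives across invocations that land on the same
// underlying App Service instance - on a truly per-invocation model this counter would never grow
private static Guid instanceId = Guid.NewGuid();
private static int invocationsOnThisInstance;

public static HttpResponseMessage Run(HttpRequestMessage req, TraceWriter log)
{
    int count = Interlocked.Increment(ref invocationsOnThisInstance);
    return req.CreateResponse(HttpStatusCode.OK, $"Instance {instanceId} has handled {count} requests");
}

Run that under load and the small set of instance IDs that comes back gives a fairly clear picture of how many servers are actually sitting behind your “serverless” functions.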

Essentially the level of granularity for scaling your functions remains, as in a traditional hosting model, at the server level and as your system scales up instances are slowly being added – but this is throttled tightly presumably to prevent Microsoft’s costs from spiralling out of control.

Now because they run on App Service Plans you can switch hosting away from the Consumption Plan onto a standard plan (which allows additional Azure features to be used) but this, to me, completely defeats the point of serverless. I’m paying for reserved compute again and managing server instance counts. I may as well not have bothered in the first place!

It’s hard to escape the feeling that Microsoft had to play catch up with AWS Lambda (it launched as a preview in late 2014 and went into general release in April 2015 whereas Azure Functions launched as a preview in March 2016) and built something they could market as serverless computing as quickly as they could by reusing existing compute and scaling systems on Azure.

Would I still use Azure Functions? Yes sure – in back end scenarios where latency isn’t all that important they’re a great fit. Anything that impacts user experience? No. Definitely not at this point.

It will be interesting to see if Microsoft revise the hosting model, I suspect if they do it’s some time off as currently they seem focused on the v2 runtime which isn’t a hosting change (as far as I can see) but rather giving Functions the ability to support more languages and .NET Core.

AWS Lambda

I’ll preface this by saying I am absolutely not an AWS expert so it’s harder for me to speculate about the underlying architecture of Lambda however… the numbers don’t lie: AWS manages to respond to changes in demand very quickly and, until I started to hit throttle limits (which I would need to speak to AWS Support to have lifted), is very consistent in response times.

I’ve not tried any state sharing but I would expect it to fail: it looks like Amazon have containerised at the Function level, rather than the host server, and this is what allows them to operate as you’d expect a serverless environment to. Both scaling and billing can then be at the function level.

Would I use AWS Lambda? Yes. But as most of my development work is on Azure I’m really hoping Microsoft bridge the capability gap.

Wrap Up and Next Steps

If you’ve followed this far – thanks! I’m a big fan of the serverless model but the Azure implementation of serverless looks like something of a compromised offering at this point and I’d be cautious of recommending it without understanding in detail the usage requirements as you will quickly hit choppy water.

I am planning on repeating similar experiments with the queue processing I began some time ago and if I get any information from Microsoft around this topic will make any corrections as appropriate. This is one of those times I’d love to have got things wrong.

Azure Functions v2 Preview Performance Issues (.NET Core / Standard)

I’ve been spending a little time building out a serverless web application as a small holiday project and as this is just a side project I’d taken the opportunity to try out the new .NET Core based v2 runtime for Azure Functions and the new tooling and support in Visual Studio 2017.

As soon as I had an end to end vertical slice I wanted to run some load tests to ensure it would scale up reliably – the short version is that it didn’t. The .NET Core v2 runtime is still in preview (and you are warned not to use this environment for production workloads due to potential breaking changes) so you would hope that this will get fixed by general release but right now there seem to be some serious shortcomings in the scalability and performance of this environment rendering it fairly unusable.

I used the VSTS load testing system to hit a single URL initially with a high volume of users for a few minutes. In isolation (i.e. if I run it from a browser with no activity) this function runs in less than 100ms and normally around the 70ms mark however as the number of users increases performance quickly takes a serious nosedive with requests taking seconds to return as can be seen below:

After things settled down a little (hitting a system like this from cold with a high concurrency is going to cause some chop while things scale out) average request times began to range between 3 and 9 seconds and the anecdotal experience (me running it in a browser / Postman while the test was going on) gave me highly variable performance. Some requests would take just a few hundred milliseconds while others would take over 20 seconds.

Worryingly no matter how long the test was run this never improved.

I began by looking at my code assuming I’d made a silly mistake but I couldn’t see anything and so boiled things down to a really simple test case, essentially the one that is created for you by the Visual Studio template:

[FunctionName("GetString")]
public static IActionResult Run([HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = null)]HttpRequest req, TraceWriter log)
{
    log.Info("C# HTTP trigger function processed a request.");

    var result = new OkObjectResult("hello world");

    return (IActionResult)result;
}

I expected this to scale and perform much better as it’s as simple as it gets: return a hard coded string. However to my surprise this exhibited very similar issues:

The response time, to return a string!, hovered around the 7 second mark and the system never really scaled sufficiently, leading to a small percentage of failures due to the volume.

Having run a fair few tests and racking up a lot of billable virtual user minutes on my credit card I tweaked the test slightly at this point moving to a 5 minute test length with step up concurrent user growth. Running this on the same simple test gave me, again, poor results with average response times of between 1.5 and 2 seconds for 100 concurrent users and a function that is as close to doing nothing as it gets (the response time is hidden by the page time in the performance chart below, it tracks almost exactly). The step up of users to a fairly low volume eliminates the errors, as you’d expect.

What these graphs don’t show are variance around this average response time which still ranged from a few hundred milliseconds up to around 15 seconds.

At this point I was beginning to suspect the Functions 2.0 preview runtime might be the issue and so created myself a standard Functions 1.0 runtime and deployed this simple function as a CSX script:

using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    var response = req.CreateResponse();
    response.StatusCode = HttpStatusCode.OK;
    response.Content = new StringContent("hello world", System.Text.Encoding.UTF8, "text/plain");

    return response;
}

Running the same ramp up test as above shows that this function behaves much more as you’d expect with average response times in the 300ms to 400ms range when running at 100 concurrent users:

Intrigued I did run a short 5 minute 400 concurrent user test with no ramp up and again the csx based function behaved much more in line with what I think are reasonable expectations with it taking a short time to scale up to deal with the sudden demand but doing so without generating errors and eventually settling down to a response time similar to the test above:

Finally I deployed a .NET 4.6 based function into a new 1.0 runtime Function app. I made a slight mistake when setting up this test and ramped it up to 200 users rather than 100 but it scales much more as you’d expect and holds a fairly steady response time of around 150ms. Interestingly this gives longer response times than .NET Core for single requests run in isolation: around 170ms for .NET 4.6 vs. 70ms for .NET Core.

At this point I felt fairly confident that the issue I was seeing in my application was due to the v2 Function runtime and so made a quick change to target .NET 4.6 instead and spun up a new v1 runtime and ran my initial 400 concurrent user test again:

As the system scales up, giving no errors, this test eventually settles at around the 500ms average response time mark which is something I can move ahead with. I’d like to get it closer to 150ms and it will be interesting to see what I can tweak so I can stay on the consumption plan as I think I’m starting to bump up against some of the other limits with Functions (ironically resolving that involves taking advantage of what is actually going on with the Functions runtime implementation and accepting that it’s a somewhat flawed serverless implementation as it stands today).

As a more general conclusion the only real takeaway I have from the above (beyond the general point that it’s always worth doing some basic load testing even on what you assume to be simple code) is that the Azure Function 2.0 runtime has some way to go before it comes out of Preview. What’s running in Azure currently is suitable only for the most trivial of workloads – I wouldn’t feel able to run this even in a beta system today.

Something else I’d like to see from Azure Functions is a more aggressive approach to scaling up/out, for spiky workloads where low latency is important there is a significant drag factor at the moment. While you can run on an App Service Plan and handle the scaling yourself this kind of flies in the face of the core value proposition of serverless computing – I’m back to renting servers. A reserved throughput or Premium Consumption offering might make more sense.

I do plan on running these tests again once the runtime moves out of preview – I’m confident the issue will be fixed, after all to be usable as a service it basically has to be.

Azure Functions – Queue Trigger Scaling (Part 1)

I’m a big fan of the serverless compute model (known as Azure Functions on the Azure platform), but in some ways its greatest strength is also its weakness: the serverless model essentially asks you to run small units of code inside a fully managed environment on a pay for what you need basis that in theory will scale infinitely in response to demand. With the increased granularity this is the next evolution in cloud elasticity with no more need to buy and reserve CPUs which sit partially idle until the next scaling point is reached. However as a result you lose control over the levers you might be used to pulling in a more traditional cloud compute environment – it’s very much a black box. Using typical queue processing patterns as an example this includes the number of “threads” or actors looking at a queue and the length of the back off timings.

Most of the systems I’ve transitioned onto Azure Functions to date have been more focused on cost than scale and have had no particular latency requirements and so I’ve just been happy to reduce my costs without a particularly close examination. However I’m starting to look at moving spikier higher volume queue systems onto Azure Functions and so I’ve been looking to understand the opaque aspects more fully through running a series of experiments.

Before continuing it’s worth noting that Microsoft continue to evolve the runtime host for Azure Functions and so the results are only really valid at the time they are run – run them again in 6 months and you’re likely to see, hopefully subtle and improved, changes in behaviour.

Most of my higher volume requirements are light in terms of compute power but heavy on volume and so I’ve created a simple function that pulls a message from a Service Bus queue and writes it, along with a timestamp, onto an event hub:

[FunctionName("DequeAndForwardOn")]
[return: EventHub("results", Connection = "EhConnectionString")]
public static string Run([ServiceBusTrigger("testqueue", Connection = "SbConnectionString")]string myQueueItem, TraceWriter log)
{
    log.Info($"C# ServiceBus queue trigger function processed message: {myQueueItem}");
    EventHubOutput message = JsonConvert.DeserializeObject<EventHubOutput>(myQueueItem);
    message.ProcessedAtUtc = DateTime.UtcNow;
    string json = JsonConvert.SerializeObject(message);
    return json;
}

Once the items are on the Event Hub I’m using a Streaming Analytics job to count the number of dequeues per second with a tumbling window and output them to table storage:

SELECT
    System.TimeStamp AS TbPartitionKey,
    '' as TbRowKey,
    SUBSTRING(CAST(System.TimeStamp as nvarchar(max)), 12, 8) as Time,
    COUNT(1) AS totalProcessed
INTO
    resultsbysecond
FROM
    [results]
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY TUMBLINGWINDOW(second,1)

For this initial experiment I’m simply going to pre-load the Service Bus queue with 1,000,000 messages and analyse the dequeue rate. Taking all the above gives us a workflow that looks like this:

Executing all this gave some interesting results as can be seen from the graph below:

From a cold start it took just under 13 minutes to dequeue all 1,000,000 messages with a fairly linear, if spiky, approach to scaling up the dequeue rate from a low at the beginning of 23 dequeues per second to a peak of over 3000, increasing at a very rough rate of 3.2 messages per second. It seems entirely likely that this will go on until we start to hit IO limits around the Service Bus. We’d need to do more experiments to be certain but it looks like the runtime is allocating more queue consumers while all existing consumers continue to find there are items on the queue to process.

In the next part we’re going to run a few more experiments to help us understand the scaling rate better and how it is impacted by quiet periods.
