Category: Azure Functions

Azure Functions v2 Preview Performance Issues (.NET Core / Standard)

I’ve been spending a little time building out a serverless web application as a small holiday project and as this is just a side project I’d taken the opportunity to try out the new .NET Core based v2 runtime for Azure Functions and the new tooling and support in Visual Studio 2017.

As soon as I had an end to end vertical slice I wanted to run some load tests to ensure it would scale up reliably – the short version is that it didn’t. The .NET Core v2 runtime is still in preview (and you are warned not to use this environment for production workloads due to potential breaking changes) so you would hope that this will get fixed by general release but right now there seem to be some serious shortcomings in the scalability and performance of this environment rendering it fairly unusable.

I used the VSTS load testing system to hit a single URL initially with a high volume of users for a few minutes. In isolation (i.e. if I run it from a browser with no activity) this function runs in less than 100ms and normally around the 70ms mark however as the number of users increases performance quickly takes a serious nosedive with requests taking seconds to return as can be seen below:

After things settled down a little (hitting a system like this from cold with a high concurrency is going to cause some chop while things scale out) average request time began to range in the 3 to 9 seconds and the anecdotal experience (me running it in a browser / PostMan while the test was going on) gave me highly variable performance. Some requests would take just a few hundred milliseconds while others would take over 20 seconds.

Worryingly no matter how long the test was run this never improved.

I began by looking at my code assuming I’d made a silly mistake but I couldn’t see anything and so boiled things down to a really simple test case, essentially the one that is created for you by the Visual Studio template:

[FunctionName("GetString")]
public static IActionResult Run([HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = null)]HttpRequest req, TraceWriter log)
{
    log.Info("C# HTTP trigger function processed a request.");

    var result = new OkObjectResult("hello world");

    return (IActionResult)result;
}

I expected this to scale and perform much better as it’s as simple as it gets: return a hard coded string. However to my surprise this exhibited very similar issues:

The response time, to return a string!, hovered around the 7 second mark and the system never really scaled sufficiently to deal with a small percentage of failures due to the volume.

Having run a fair few tests and racking up a lot of billable virtual user minutes on my credit card I tweaked the test slightly at this point moving to a 5 minute test length with step up concurrent user growth. Running this on the same simple test gave me, again, poor results with average response times of between 1.5 and 2 seconds for 100 concurrent users and a function that is as close to doing nothing as it gets (the response time is hidden by the page time in the performance chart below, it tracks almost exactly). The step up of users to a fairly low volume eliminates the errors, as you’d expect.

What these graphs don’t show are variance around this average response time which still ranged from a few hundred milliseconds up to around 15 seconds.

At this point I was beginning to suspect the Functions 2.0 preview runtime might be the issue and so created myself a standard Functions 1.0 runtime and deployed this simple function as a CSX script:

using System.Net;

public static async Task<HttpResponseMessage> Run(HttpRequestMessage req, TraceWriter log)
{
    var response = req.CreateResponse();
    response.StatusCode = HttpStatusCode.OK;
    response.Content = new StringContent("hello world", System.Text.Encoding.UTF8, "text/plain");

    return response;
}

Running the same ramp up test as above shows that this function behaves much more as you’d expect with average response times in the 300ms to 400ms range when running at 100 concurrent users:

Intrigued I did run a short 5 minute 400 concurrent user test with no ramp up and again the csx based function behaved much more in line with what I think are reasonable expectations with it taking a short time to scale up to deal with the sudden demand but doing so without generating errors and eventually settling down to a response time similar to the test above:

Finally I deployed a .NET 4.6 based function into a new 1.0 runtime Function app. I made a slight mistake when setting up this test and ramped it up to 200 users rather than 100 but it scales much more as you’d expect and holds a fairly steady response time of around 150ms. Interestingly this gives longer response times than .NET Core for single requests run in isolation around 170ms for .NET 4.6 vs. 70ms for .NET Core.

At this point I felt fairly confident that the issue I was seeing in my application was due to the v2 Function runtime and so made a quick change to target .NET 4.6 instead and spun up a new v1 runtime and ran my initial 400 concurrent user test again:

As the system scales up, giving no errors, this test eventually settles at around the 500ms average request per second mark which is something I can move ahead with. I’d like to get it closer to 150ms and it will be interesting to see what I can tweak so I can on the consumption plan as I think I’m starting to bump up against some of the other limits with Functions (ironically resolving that involves taking advantage of what is actually going on with the Functions runtime implementation and accepting that its a somewhat flawed serverless implementation as it stands today).

As a more general conclusion the only real takeaway I have from the above (beyond the general point that it’s always worth doing some basic load testing even on what you assume to be simple code) is that the Azure Function 2.0 runtime has some way to go before it comes out of Preview. What’s running in Azure currently is suitable only for the most trivial of workloads – I wouldn’t feel able to run this even in a beta system today.

Something else I’d like to see from Azure Functions is a more aggressive approach to scaling up/out, for spiky workloads where low latency is important there is a significant drag factor at the moment. While you can run on an App Service Plan and handle the scaling yourself this kind of flies in the face of the core value proposition of serverless computing – I’m back to renting servers. A reserved throughput or Premium Consumption offering might make more sense.

I do plan on running these tests again once the runtime moves out of preview – I’m confident the issue will be fixed, after all to be usable as a service it basically has to be.

Azure Functions – Expect Significant Clock Skew

While running the experiment I posted about on Sunday I annotated the message I sent on to the event hub with the Azure Functions view of the current time (basically I set a property to DateTime.UtcNow) and, out of curiosity, grouped my results by second long tumbling windows based on that date. This gave me results that were observably different than when doing the same with the enqueue date and time logged by the Event Hub (as an aside there is some interesting information about Event Hubs and clock skew here). My experiments didn’t need massively accurate time tracking as I was really just looking for trends over a long, relative to the clock skew, period of time however I looked at some of the underlying numbers and became suspicious that there was a significant degree of clock skew across my functions.

I reached out to the @AzureFunctions team on Twitter asking how they handled clock sync on the platform and one of the engineers, Chris Anderson, replied confirming what I suspected: there are no guarantees about clock sync on the Azure Functions platform and, further, you should expect the time skew to be large.

That means you can’t really obtain a consistent view of “now” from within an Azure Function. You could go and get it from an external source but that in and of itself is going to introduce other inaccuracies. Essentially you can’t handle dynamic time reliably inside a function with any precision and you’re limited to working with reference points obtained upstream and passed in.

Definitely something to be aware of when designing systems that make use of Azure Functions as it would be easy to use them in scenarios applying timestamps expecting some sense of temporal sequence.

This isn’t, of course, a new problem – dealing with precise and accurate time in a distributed system is always a challenge and requires careful consideration but it does underline the importance of understanding your cloud vendors various runtime environments.

Azure Functions – Queue Trigger Scaling (Part 1)

I’m a big fan of the serverless compute model (known as Azure Functions on the Azure platform), but in some ways its greatest strength is also its weakness: the serverless model essentially asks you to run small units of code inside a fully managed environment on a pay for what you need basis that in theory will scale infinitely in response to demand. With the increased granularity this is the next evolution in cloud elasticity with no more need to buy and reserve CPUs which sit partially idle until the next scaling point is reached. However as a result you lose control over the levers you might be used to pulling in a more traditional cloud compute environment – it’s very much a black box. Using typical queue processing patterns as an example this includes the number of “threads” or actors looking at a queue and the length of the back off timings.

Most of the systems I’ve transitioned onto Azure Functions to date have been more focused on cost than scale and have had no particular latency requirements and so I’ve just been happy to reduce my costs without a particularly close examination. However I’m starting to look at moving spikier higher volume queue systems onto Azure Functions and so I’ve been looking to understand the opaque aspects more fully through running a series of experiments.

Before continuing it’s worth noting that Microsoft continue to evolve the runtime host for Azure Functions and so the results are only really valid at the time they are run – run them again in 6 months and you’re likely to see, hopefully subtle and improved, changes in behaviour.

Most of my higher volume requirements are light in terms of compute power but heavy on volume and so I’ve created a simple function that pulls a message from a Service Bus queue and writes it, along with a timestamp, onto an event hub:

[FunctionName("DequeAndForwardOn")]
[return: EventHub("results", Connection = "EhConnectionString")]
public static string Run([ServiceBusTrigger("testqueue", Connection = "SbConnectionString")]string myQueueItem, TraceWriter log)
{
    log.Info($"C# ServiceBus queue trigger function processed message: {myQueueItem}");
    EventHubOutput message = JsonConvert.DeserializeObject<EventHubOutput>(myQueueItem);
    message.ProcessedAtUtc = DateTime.UtcNow;
    string json = JsonConvert.SerializeObject(message);
    return json;
}

Once the items are on the Event Hub I’m using a Streaming Analytics job to count the number of dequeues per second with a tumbling window and output them to table storage:

SELECT
    System.TimeStamp AS TbPartitionKey,
    '' as TbRowKey,
    SUBSTRING(CAST(System.TimeStamp as nvarchar(max)), 12, 8) as Time,
    COUNT(1) AS totalProcessed
INTO
    resultsbysecond
FROM
    [results]
TIMESTAMP BY EventEnqueuedUtcTime
GROUP BY TUMBLINGWINDOW(second,1)

For this initial experiment I’m simply going to pre-load the Service Bus queue with 1,000,000 messages and analyse the dequeue rate. Taking all the above gives us a workflow that looks like this:

Executing all this gave some interesting results as can be seen from the graph below:

From a cold start it took just under 13 minutes to dequeue all 1,000,000 messages with a fairly linear, if spiky, approach to scaling up the dequeue rate from a low at the beginning of 23 dequeues per second to a peak of over 3000 increasing at a very rough rate of 3.2 messages per second. It seems entirely likely that this will go on until we start to hit IO limits around the Service Bus. We’d need to do more experiments to be certain but it looks like the runtime is allocating more queue consumers while all existing consumers continue to find their are items on the queue to process.

In the next part we’re going to run a few more experiments to help us understand the scaling rate better and how it is impacted by quiet periods.

Contact

  • If you're looking for help with C#, .NET, Azure, Architecture, or would simply value an independent opinion then please get in touch here or over on Twitter.

Recent Posts

Recent Tweets

Recent Comments

Archives

Categories

Meta

GiottoPress by Enrique Chavez