Author: James

Cosmos, Oh Cosmos – Graph API as a .NET Developer

I’ve done a lot of work with Cosmos over the last year as a document database and generally found it to be a rock solid experience – it does what it says on the tin – so when I found myself with a project that was a great fit for a graph database my first port of call was Cosmos DB.

I’d done a little work with it as a Graph API but as this was a new project I visited the Azure website to refresh myself on using it as a graph database from .NET and found that Microsoft are now recommending the Tinkerpop Gremlin.NET library. There are some pieces on the old Microsoft.Azure.Graphs package but it never made it out of preview and the direction of travel looks to be elsewhere.

While Gremlin.NET is easy to get started with, if you try to use it in a realistic sense you quickly run into a couple of serious limitations due to its current design around error and response handling. It seems to be designed to support console applications rather than real world services:

  1. Vendor specific attributes of the responses such as RU costs, communicated as header values, are hidden from you.
  2. Errors from the server are presented only as text messages. Rather than expose status codes and interpretable values the Gremlin .NET library first converts these into messages designed for consumption by people. To interpret the server error you need to parse these strings.

The above two issues combine into a very unfortunate situation for a real world Cosmos Graph API application: when Cosmos rate limits you it returns a 500 error to the caller with the 429 error being communicated in an x-ms-status-code header. This doesn’t play well with resilience libraries such as Polly as you end up having to fish around in the response text for keywords.
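
To give a flavour of the kind of workaround this forces on you today, here’s a minimal sketch using Polly that retries when throttling-related keywords appear in the exception text – the keyword list, back-off values and helper name are illustrative rather than anything official:

using System;
using System.Threading.Tasks;
using Polly;

public static class GremlinThrottlingRetry
{
    // Cosmos signals throttling with a 429 in an x-ms-status-code header, but Gremlin.NET only
    // surfaces the failure as a text message, so the predicate has to match on keywords.
    public static Task<T> ExecuteAsync<T>(Func<Task<T>> gremlinCall)
    {
        return Policy
            .Handle<Exception>(ex =>
                ex.Message.Contains("429") ||
                ex.Message.Contains("TooManyRequests") ||
                ex.Message.Contains("Request rate is large"))
            .WaitAndRetryAsync(5, attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)))
            .ExecuteAsync(gremlinCall);
    }
}

// Usage, e.g. wrapping a Gremlin.NET submission:
// var results = await GremlinThrottlingRetry.ExecuteAsync(() => gremlinClient.SubmitAsync<dynamic>("g.V().count()"));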

I initially raised this as a documentation issue on GitHub and Microsoft have confirmed they are moving to open source libraries and working to improve them, but as best I can tell, for the moment, you’ve got two choices:

  1. Continue to use the Microsoft.Azure.Graphs package – I’ve not used this in anger but I understand it has issues of its own related to client side performance and is a bit of a dead end.
  2. Use Gremlin.NET and work around the issues.

For the time being I’ve opted to go with option (2) as I don’t want to unpick an already obsolete package from new code. To support this I’ve forked the Gremlin.NET library and introduced a couple of changes that allow attributes and response codes to be inspected for regular requests and for exceptions. I’ve done this in a non-breaking way – you should be able to swap the official Gremlin.NET package for this fork and your code should continue to work just fine, but you can more easily implement resilience patterns. You can find it on GitHub here.

If I was designing the API afresh to work in a real world situation I would probably expose a different API surface – at the moment I really have followed the path of least non-breaking resistance in terms of getting these things visible. That makes me uncertain as to whether or not this is an appropriate Pull Request to submit – I probably will; if nothing else hopefully that will start a conversation.

Developing Fluent Flowchart – A Retrospective

I recently released a new app to the Windows Store called Fluent Flowchart – a side project I’ve been working on since, my commit history tells me, late June 2017. It struck me it might be interesting to write up a short retrospective on the development process. So without further ado a bit of a warts and all look at development follows.

Motivations

I can’t remember exactly what I was doing when I started development but I do remember it involved a lot of Visio and that I’d recently finished work on an update to Fluent Mindmap. Visio and its so-called drawing aids have a habit of sending me into an incandescent rage – it’s akin to playing a game of Dark Souls only without the satisfying payoff.

I vividly recall looking at the code for Fluent Mindmap, looking at Visio, and thinking to myself that surely it wouldn’t be much of an effort to adapt one diagramming tool into another. In fact, I recall thinking to myself, without having to auto-manage a complex graph of recursive connections it would be a lot easier.

25 years of professional software development has clearly taught me nothing. Not a damn thing.

To be fair to myself I wasn’t that wrong – what I did forget as I optimistically forged off into the code was that polishing an app, particularly a highly interactive app, takes a lot of time and that although you might get the core running quickly all those “little” features that make it usable do add up.

Phases of Development

There were definitely two distinct phases of development over an 8 month period, which is borne out by the change stats I’ve graphed below.

It’s worth noting that this wasn’t solid effort – even on the days where I did commit code most of the work was done on the train or late at night in hotel rooms, with some crunching over weekends. I’m tempted to run a correlation between weekend commit rate and the weather – living in the UK there’s a fair chance that the weekend crunches align with bleak rain and grey skies.

I pulled the stats for the above graph from the git log with a PowerShell script – I don’t pretend to be a PowerShell expert and I’m pretty sure it could be expressed much more concisely but if you want to get stats like this for your project you can find the script here.

Phase 1 – The Rapid Rise and Steady Decline of Optimism

Fueled by irritation with Visio and that wonderful “early project” feeling, development shot out of the gate. The first order of business was to copy the Fluent Mindmap codebase and strip out all the code I wouldn’t need. While doing this I remembered that the Mindmap code wasn’t that great – that project was a port of an iOS app from Xamarin into UWP and I’d used it to first learn Xamarin and then learn about UWP. You can begin to imagine… I’d also tried some experiments with a couple of patterns that didn’t really pan out. It works – but I’m not particularly proud of it (though there is a crash to do with connectors that I’ve never been able to replicate but know users hit – the stack trace isn’t helpful, so if you try the app and have the issue please let me know!).

Deleting code turned out to be an ongoing theme – as I replaced mind map optimised systems with flowchart systems I would remove more code.

I found myself faced with an early decision – try and make the patterns I’d used work or ditch them and adopt what I knew to be a better approach. I decided to ditch the broken patterns and replace them with a better approach on an “as I go” basis. This worked out ok but did leave me with a slightly disjoint codebase in the middle section of the project.

With all that done the first thing I needed to get working was a palette. My first stab at this, complete with garish colours, can be seen in the screenshot below:

The next step was to implement drag and drop from the palette. My first effort at this used the built in UWP drag and drop feature but that gave me a horrendous user experience. I spent some time trying to tweak it but it was miles away from what I wanted so I ended up implementing a custom approach that makes use of a transparent canvas overlaid over the entire editor. This gave me the fine level of control I wanted and better visuals but took a fair amount of effort. Still – it’s core to the experience and so the effort was worth it.

The next step was to introduce a grid for aligning elements – for many diagrams it’s really all I need to lay things out correctly. I find a grid to be predictable and intuitive as there is no second guessing about what a more “intelligent” alignment tool is going to do. At this point I was also starting to sort out the command bars with icons that made sense for the planned features but you can see I hadn’t yet implemented grab handles on shapes – the highlight is still based on that in the mind map tool. The screenshot below shows the app at this stage of its development:

At this point I could drop shapes but next I needed a means of connecting them. My initial attempt at this was really clunky – I was trying to make something touch friendly and not fiddly but this really was grim: you tap the connector button on the tool bar and then tap the connector points on the canvas (the small circles).

About the same time I also introduced grab handles for resizing shapes and another side bar for editing shape properties – the Mindmap tool uses a bottom app / command bar but I knew I would have too many editable properties for a bottom bar to work. Additionally although that fit into Windows 8 styling quite well I don’t really like it and it really doesn’t feel right on Windows 10. The screenshot shows this and represents development about a month after starting.

Next up was adding the ability to select connectors – this involves inflating a line into a polygon so that there is a practical hit test area. It’s just basic geometry but this proved rather taxing at 7 in the morning on the train! Eventually I got this done and also added a development flag so I can see all the hit test areas for connectors. Over the next week or two I also added some basic multi-select support and spent time expanding out the property sidebar further. For some reason (again early in the morning!) I really struggled to implement the arrow head picker in the sidebar. The implementation is actually quite simple but I really struggled to wrap my head round it.
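
As a rough illustration of the geometry involved (a sketch of the idea rather than the actual Fluent Flowchart code), you offset the line perpendicular to its direction and join the offsets into a quadrilateral:

using System.Numerics;

public static class ConnectorHitTesting
{
    // Inflates a connector segment into a four point polygon so that taps and clicks near
    // the line register as hits; thickness is the total width of the hit test area.
    public static Vector2[] InflateToPolygon(Vector2 start, Vector2 end, float thickness)
    {
        Vector2 direction = Vector2.Normalize(end - start);
        // Rotate the direction 90 degrees to get a perpendicular and scale it to half the width
        Vector2 offset = new Vector2(-direction.Y, direction.X) * (thickness / 2f);
        return new[] { start + offset, end + offset, end - offset, start - offset };
    }
}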

At this point there was no hiding from the realisation that completing the application was going to take quite a bit longer particularly working with such limited time as I had. And the next task I had to tackle was updating the theme editor which really did feel like laborious work and for some reason I got stuck on arrow heads again!

I knew I’d come back to the project (probably after another grapple with Visio) but I shut the lid on my laptop on the train one morning and didn’t come back to Fluent Flowchart for some time. The screenshot below shows the last thing I worked on at this stage – the theme editor.

Phase 2 – Renewed Hope and The Grind

Around 3 months later I had a bit of time on my hands during the Christmas break and had a strong urge to resume development – I really don’t like leaving things unfinished, Visio was still annoying me, and I figured that if I knuckled down and gave a final big push I could complete the app quite quickly. I started by getting myself organised: I made a list of the features and bugs I knew were outstanding in a Markdown file (I like to have offline access) and set it up so that as I completed things I would move them to a DONE section. My cheap and cheerful version of a typical SCRUM or Kanban board.

I quickly completed the theme editor I’d been struggling with, though again this felt like a grind. I figured adding a richer set of flowchart shapes would be a nice set of quick wins to really get me going again and would help me work through the remaining user experience and design issues – of which there were many, including a delightfully garish set of colours and an ugly palette.

Next up was sorting out two long standing issues – the horrible oversized palette and the unpleasant way connecting shapes together functioned. As such core parts of the experience I really ought to have tackled them earlier but one of the problems with working on projects by yourself is that you get used to, and blind to, these things. There was still a lot left to do but these two tasks were the last big pieces of work I had on my list (my list, I should add, kept on growing). They took a while to implement and get right but once they were the experience was transformed:

Next there was a lot of bug fixing and small nips and tucks – a screenshot feature, undo support, more connector tweaks, copy and paste, and branding. At numerous points in this process I thought I was nearing completion before doing another full pass and deciding something wasn’t quite right or was broken.

Working by myself this was definitely a bit of a grind – I got through it by setting myself a target of moving one thing, no matter how small, from the outstanding to done section of my task list. That way it always felt like I had momentum and eventually I arrived at the app shown in the screenshot below which was starting to look much cleaner and was a fairly stable app.

Another pass through and I found yet more things I wasn’t happy with so ran through more nips and tucks.

Finally. After 7 months of on and off part time development squeezed into train journeys, hotels, and weekends I had a finished app that I could ship to the store.

Leftovers

I did ship the project with a few features missing that I wanted:

  • Right angle connectors
  • Free standing connectors (by which I mean connectors that aren’t connected to shapes)
  • SVG export

However at some point you have to ship and I really wanted to start getting feedback.

Observations and Lessons Learned

As with every project there were things I learned and observed throughout the process.

  1. Time yourself. I’ve started using Harvest to track time on all my projects whether fun side-projects, open source, or commercial. Like most developers I’m pretty poor at estimating and so occasionally create a rod for my own back – I do at least know it though sometimes this causes me to lurch into absolute pessimism as a defence mechanism. I know that too. Over the years I’ve managed to deliver the majority of my projects on time, some even early, but on occasion I’ve had to work like a demon to ensure that’s the case. I’d like to do better and I think the biggest issue I have in terms of estimating my own capacity is a lack of data. My hope is that by capturing this I’ll have a corpus of data I can use as a reference point for future work.
  2. Look through your commit log from time to time and go back to old builds. To pull out the stats for this blog post and some screenshots I went back through the commit log for the project. It was really interesting to look at the changes I’d made and see the steady progress over time.

    Before I did this I felt somewhat dissatisfied with how development had gone. The stop start nature made it seem like an epic scale development for something fairly modest – but looking at the project like this I can see that steady progress was made in fairly limited time and instead I now feel a sense of satisfaction with the process.

  3. I know this from previous projects but it’s always worth a reminder: user experience takes time and iteration to get right and improve and sometimes you just need to use it and feel it to know if it’s right. This of course takes time. If you’re working on a project that demands a high level of user experience and you don’t allow time to fail and iterate you’re just not going to get it.
  4. Lowball / easy tasks can be really useful for getting you in the zone and moving again after an impasse.
  5. Optimising something for both touch and keyboard / mouse is really hard. In the end I focused on keyboard and mouse as it’s the main way I interact with creative tools but I want to improve touch support.
  6. Even if you’re working on your own maintain a task list. Ticking things off, no matter how small, can be a real motivator and I definitely found my “one task a day” goal helped me get through the grind of bug fixing and polishing towards the end of development.
  7. When you finally ship your app it’s super rewarding!

Hope that’s interesting. If there’s anything you want to discuss I can, as ever, be reached on Twitter.

Azure Functions – Significant Improvements in HTTP Trigger Scaling

A while back I wrote about the improvements Microsoft were working on in regard to the HTTP trigger function scaling issues. The Functions team got in touch with me this week to let me know that they had an initial set of improvements rolling out to Azure.

To get an idea of how significant these improvements are I’m first going to contrast this new update to Azure Functions with my previous measurements and then re-examine Azure Functions in the wider context of the other cloud vendors. I’m specifically separating out the Azure vs Azure comparison from the Azure vs Other Cloud Vendors comparison because, while the former is interesting given where Azure found itself in the last set of tests and highlights how things have improved, it isn’t really relevant in terms of a “here and now” vendor comparison.

A quick refresher on the tests – the majority of them are run with a representative real world mix of a small amount of compute and a small level of IO, though tests are included that remove these and involve no IO and practically no compute (return a string).

Although the improvements aren’t yet enabled by default, towards the end of this post I’ll highlight how you can enable them for your own Function Apps.

Azure Function Improvements

First I want to take a look at Azure Functions in isolation and see just how the new execution and scaling model differs from the one I tested in January. For consistency the tests are conducted against the exact same app I tested back in January using the same VSTS environment.

Gradual Ramp Up

This test case starts with 1 user and adds 2 users per second up to a maximum of 500 concurrent users to demonstrate a slow and steady increase in load.

This is the least demanding of my tests but we can immediately see how much better the new Functions model performs. When I ran these tests in January the response time was very spiky and averaged out around the 0.5 second mark – the new model holds a fairly steady 0.2 seconds for the majority of the run with a slight increase at the tail and manages to process over 50% more requests.

Rapid Ramp Up

This test case starts with 10 users and adds 10 users every 2 seconds up to a maximum of 1000 concurrent users to demonstrate a more rapid increase in load and a higher peak concurrency.

In the previous round of tests Azure Functions really struggled to keep up with this rate of growth. After a significant period of stability in user volume it eventually reached a state of being semi-acceptable but the data vividly showed a system really straining to respond and gave me serious concerns about its ability to handle traffic spikes. In contrast the new model grows very evenly with the increasing demand and, other than a slight spike early on, maintains a steady response time throughout.

Immediate High Demand

This test case starts immediately with 400 concurrent users and stays at that level of load for 5 minutes demonstrating the response to a sudden spike in demand.

Again this test highlights what a significant improvement has been made in how Azure Functions responds to demand – the new model is able to deal with the sudden influx of users immediately, whereas in January it took nearly the full execution of the test for the system to catch up with the demand.

Stock Functions

This test uses the stock “return a string” function provided by each platform (I’ve captured the code in GitHub for reference) with the immediate high demand scenario: 400 concurrent users for 5 minutes.

The minimalist nature of this test (return a string) very much highlights the changes made to the Azure Functions hosting model and we can see that not only is there barely any lag in growing to meet the 400 user demand but that response time has been utterly transformed. It’s, to say the least, a significant improvement over what I saw in January when even with essentially no code to execute and no IO to perform Functions suffered from horrendous performance in this test.

Percentile Performance

I was unable to obtain this data from VSTS and so resorted to running Apache Benchmark. For this test I used settings of 100 concurrent requests for a total of 10000 requests, collected the raw data, and processed it in Excel. It should be noted that the network conditions were less predictable for these tests and I wasn’t always as geographically close to the cloud function as I was in other tests, though repeated runs yielded similar patterns:

Yet again we can see the massive improvements made by the Azure Functions team – performance remains steady up until the 99.9th percentile. Full credit to the team – the improvement here is so significant that I actually had to add in the fractional percentiles to uncover the fall off.

Revised Comparison With Other Vendors

We can safely say by now that this new hosting model for Azure Functions is a dramatic improvement for HTTP triggered functions – but how does it compare with the other vendors? Last time round Functions was barely at the party – this time… let’s see!

Gradual Ramp Up

On our gradual ramp up test Azure still lags behind both AWS and Google in terms of response time but actually manages a higher throughput than Google. As demand grows Azure also experiences a slight deterioration in response time where the other vendors remain more constant.

Rapid Ramp Up

Response time and throughput results for our rapid ramp up test are not massively dissimilar to the gradual ramp up test. Azure experiences a significant fall in performance around the 3 minute mark as the number of users approaches 1000 – but as I said earlier the Functions team are working on further improvements at this level of scale and beyond, and I would assume at this point that some form of resource reallocation that needs smoothing out is causing this.

It’s also notable that although some way behind AWS Lambda, Azure manages a reasonably higher throughput than Google Cloud – in fact it’s almost halfway between the two competing vendors, so although response times are longer there seems to be more overall capacity, which could be an important factor in any choice between those two platforms.

Immediate High Demand

Again we see very much the same pattern – AWS Lambda is the clear leader in both response time and throughput while 2nd place for response time goes to Google and 2nd place for throughput goes to Azure.

Stock Functions

Interestingly in this comparison of stock functions (returning a string and so very isolated) we can see that Azure Functions has drawn extremely close to AWS Lambda and moved ahead of Google Cloud, which really is an impressive improvement.

This suggests that other factors are now playing a proportionally bigger part in the scaling tests than Functions’ capability to scale – previously this was clearly driving the results. Additional tests would need to be run to establish if this is the case and whether it is related to the IO capabilities of the Functions host or the capabilities of external dependencies.

Percentile Performance

The percentile comparison shows some very interesting differences between the three platforms. At lower percentiles AWS and Google outperform Azure; however, as we head into the later percentiles they both deteriorate while Azure degrades more gradually, with the exception of the worst case response time.

Across the graph Azure gives a more generally even performance suggesting that if consistent performance across a broader percentile range is more important than outright response time speed it may be a better choice for you.

Enabling The Improvements

The improvements I’ve measured and highlighted here are not yet enabled by default, but will be with the next release. In the meantime you can give them a go by adding an App Setting named WEBSITE_HTTPSCALEV2_ENABLED with a value of 1.

Conclusions

In my view the Azure Functions team have done some impressive work in a fairly short space of time to transform the performance of Azure Functions triggered by HTTP requests. Previously the poor performance made them difficult to recommend except in a very limited range of scenarios but the work the team have done has really opened this up and made this a viable platform for many more scenarios. Performance is much more predictable and the system scales quickly to deal with demand – this is much more in line with what I’d hoped for from the platform.

I was sceptical about how much progress was possible without significant re-architecture but, as an Azure customer and someone who wants great experiences for developers (myself included), I’m very happy to have been wrong.

In the real world representative tests there is still a significant response time gap for HTTP triggered compute between Azure Functions and AWS Lambda however it is not clear from these tests alone if this is related to Functions or other Azure components. Time allowing I will investigate this further.

Finally my thanks to the @azurefunctions team, @jeffhollan and @davidebbo both for their work on improving Azure Functions but also for the ongoing dialogue we’ve had around serverless on Azure – it’s great to see a team so focused on developer experience and transparent about the platform.

If you want to discuss my findings or tech in general then I can be found on Twitter: @azuretrenches.

C# Cloud Application Architecture – Commanding via a Mediator (Part 5)

Over the last 4 parts of this series we’ve taken a simple application built around a layered architecture and restructured it into an application based around dispatching queries and commands as state through a mediator.

We’ve seen many of the advantages this can bring to a codebase reducing repetition and allowing for a clear decomposition into business, or service, oriented modules.

In this final part I’ll demonstrate how this pattern can support an application through the various stages of its lifecycle. The early stages of a software development project are often susceptible to a high degree of change. If it’s a new product under development then the challenge is often around establishing market fit (be that internal or external) without burning through the entire budget. Additionally if the problem domain is new it’s likely that the first attempt at drawing out bounded contexts will contain errors, and if the system is built as fully isolated components change can be expensive. In either case keeping the cost of development and change low in the early phases of the project can lead to much more effective use of a project’s budget.

In the system we’ve been developing we’ve built three sub-systems: a checkout, a shopping cart and a product store – essentially we have a modular monolith.

In this part we’re going to assume that we’re finding that our product store is coming under a lot of strain and we are going to pull it out into a micro-service so that we can scale it independently. And we’re going to make this change without altering any consuming business logic code at all.

In our system we make use of the store in two places through the dispatch of GetStoreProductQuery queries. Firstly it is represented in the primary API as an endpoint that can be called by clients in the ProductController class:

[Route("api/[controller]")]
public class ProductController : AbstractCommandController
{
    public ProductController(ICommandDispatcher dispatcher) : base(dispatcher)
    {
            
    }

    [HttpGet("{productId}")]
    [ProducesResponseType(typeof(StoreProduct), 200)]
    public async Task<IActionResult> Get([FromRoute] GetStoreProductQuery query) => await ExecuteCommand(query);
}

Secondly it is also used to provide validation of products within the handler for the AddToCartCommand in the AddToCartCommandHandler class:

public async Task<CommandResponse> ExecuteAsync(AddToCartCommand command, CommandResponse previousResult)
{
    Model.ShoppingCart cart = await _repository.GetActualOrDefaultAsync(command.AuthenticatedUserId);

    StoreProduct product = (await _dispatcher.DispatchAsync(new GetStoreProductQuery{ProductId = command.ProductId})).Result;

    if (product == null)
    {
        _logger.LogWarning("Product {0} can not be added to cart for user {1} as it does not exist", command.ProductId, command.AuthenticatedUserId);
        return CommandResponse.WithError($"Product {command.ProductId} does not exist");
    }
    List<ShoppingCartItem> cartItems = new List<ShoppingCartItem>(cart.Items);
    cartItems.Add(new ShoppingCartItem
    {
        Product = product,
        Quantity = command.Quantity
    });
    cart.Items = cartItems;
    await _repository.UpdateAsync(cart);
    return CommandResponse.Ok();
}

To make our change the first thing we need to do is to be able to execute our command inside a different host – we’ll use an Azure Function that accepts the ProductID required by our GetStoreProductQuery query. The code for this function is shown below:

public static class GetStoreProduct
{
    private static readonly IServiceProvider ServiceProvider;
    private static readonly AsyncLocal<ILogger> Logger = new AsyncLocal<ILogger>();
        
    static GetStoreProduct()
    {
        IServiceCollection serviceCollection = new ServiceCollection();
        MicrosoftDependencyInjectionCommandingResolver resolver = new MicrosoftDependencyInjectionCommandingResolver(serviceCollection);
        ICommandRegistry registry = resolver.UseCommanding();
        serviceCollection.UseCoreCommanding(resolver);
        serviceCollection.UseStore(() => ServiceProvider, registry, ApplicationModeEnum.Server);
        serviceCollection.AddTransient((sp) => Logger.Value);
        ServiceProvider = resolver.ServiceProvider = serviceCollection.BuildServiceProvider();
    }

    [FunctionName("GetStoreProduct")]
    public static async Task<IActionResult> Run([HttpTrigger(AuthorizationLevel.Anonymous, "get", "post", Route = null)]HttpRequest req, ILogger logger)
    {
        Logger.Value = logger;
        logger.LogInformation("C# HTTP trigger function processed a request.");
            
        IDirectCommandExecuter executer = ServiceProvider.GetService<IDirectCommandExecuter>();

        GetStoreProductQuery query = new GetStoreProductQuery
        {
            ProductId = Guid.Parse(req.GetQueryParameterDictionary()["ProductId"])
        };
        CommandResponse<StoreProduct> result = await executer.ExecuteAsync(query);
        return new OkObjectResult(result);
    }
}

Our static constructor sets up our IoC container (Azure Functions actually run on app service instances and you can share state between them – though there are few guarantees and you can debate at length how “serverless” this makes things – AWS Lambda is much the same) and should be fairly familiar code by now.

Our function entry point does something different – it creates an instance of our GetStoreProductQuery from the query parameters supplied but rather than dispatch it through the ICommandDispatcher interface we’ve seen before it executes it using a reference to an IDirectCommandExecuter resolved from our IoC container. This instructs the command framework to execute the command without any dispatch semantics – that means that any logging of the dispatch portions of the command flow won’t be replicated by this function and it is slightly more efficient (it’s worth noting that you can dispatch again here if you need to – though generally you would take the approach I am showing here).

To support this new approach I’ve also made a change to the IServiceCollectionsExtensions UseStore registration method inside the Store.Application project so that it can be supplied an enum that determines how our command should be handled: in process (as we’ve been doing up until now), as a client of a remote service, or as a server (as we have done above).
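
The enum itself is nothing special – a sketch of it (the exact definition lives in the Store.Application project alongside UseStore) looks like this:

// Determines whether the store's commands execute locally or are remoted
public enum ApplicationModeEnum
{
    InProcess,
    Client,
    Server
}

The enum is used to register the command in one of two ways and this registration logic is the key change to the existing code that enables us to remote the command: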

if (applicationMode == ApplicationModeEnum.InProcess || applicationMode == ApplicationModeEnum.Server)
{
    commandRegistry.Register<GetStoreProductQueryHandler>();
}
else if (applicationMode == ApplicationModeEnum.Client)
{
    // this configures the command dispatcher to send the command over HTTP and wait for the result
    Uri functionUri = new Uri("http://localhost:7071/api/GetStoreProduct");
    commandRegistry.Register<GetStoreProductQuery, CommandResponse<StoreProduct>>(() =>
    {
        IHttpCommandDispatcherFactory httpCommandDispatcherFactory = serviceProvider().GetService<IHttpCommandDispatcherFactory>();
        return httpCommandDispatcherFactory.Create(functionUri, HttpMethod.Get);
    });
}

Both the in-process and server modes continue to register the handler as they have done before; however, when the application mode is set to client the registration takes a different form. Rather than register the handler we supply the type of the command and the type of the result as generic type parameters and then set up a lambda that will resolve an instance of an IHttpCommandDispatcherFactory and create an HTTP dispatcher with the URI of the function and the HTTP verb to use. These interfaces can be found within the NuGet package AzureFromTheTrenches.Commanding.Http which I’ve added to the Store.Application project.

Registering in this way instructs the commanding system to dispatch the command using the, in this case, HTTP dispatcher rather than attempt to execute it locally. All the other framework features around the dispatch process continue to behave as usual and as we saw earlier you can pick this up on the other side of the HTTP call with the IDirectCommandExecuter.

I have shifted some other code around inside the solution to support code sharing with the Azure Function but that is really the extent of the code change. We’ve changed no business logic or consuming application code – we’ve simply moved where the command runs and the calling semantics are seamless – and essentially split the store out as a micro-service running inside an Azure Function. As long as you build your sub-systems as isolated units as we have here this same approach can be used with queues and other forms of remote call.
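
To make that concrete, the only change in the consuming Web API host is the mode passed to the store’s registration – a sketch, reusing the UseStore signature we saw in the Azure Function above (the exact composition root in the Web API host will differ):

// Previously the Web API host ran the store in process:
// serviceCollection.UseStore(() => ServiceProvider, registry, ApplicationModeEnum.InProcess);

// Switching to client mode causes GetStoreProductQuery to be dispatched over HTTP to the
// Azure Function instead - no consuming business logic changes:
serviceCollection.UseStore(() => ServiceProvider, registry, ApplicationModeEnum.Client);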

I’ve found this approach to be massively powerful – in the early stages of a project you can make changes within a codebase and an operational environment that is fairly simple, easy to manage, and supported by tooling, and as long as you have the tests to go with it refactoring a solution like this is straightforward and well supported by tools like Resharper. Then, as you begin to lock things down or the solution grows, you can pull out the sub-systems into fully independent micro-services without significant code change – it’s largely just configuration as we’ve seen above.

I wrote the commanding framework I’ve been using specifically to enable this approach and you can find it, and documentation, on GitHub here.

I hope this series has been interesting and presented (or refreshed) a different way of thinking about C# application architecture. There’s a fair chance I’ll swing back round and talk a bit about commanding result caching and some other scenarios that this approach enables so watch this space.

In the meantime if you have any questions about the approach or my commanding framework please do get in touch over on Twitter.

Finally the code for this final part can be found on GitHub here:

https://github.com/JamesRandall/CommandMessagePatternTutorial/tree/master/Part5

Other Parts in the Series

Part 4
Part 3
Part 2
Part 1

AzureFromTheTrenches.Commanding 6.1.0 – 10x Performance Improvement

I spent some time today looking at the performance of my commanding / mediator framework. Although I did a little performance work early on I’ve made a lot of changes since then and been very focused on getting the feature set and API where I want it.

As a target I wanted to get near to the performance of Mediatr – an excellent framework that describes itself as a “simple, unambitious mediator implementation”. When I began work on my framework I had flexibility as a key goal: I wanted it to support persistent event based models (event sourcing) and an evolutionary approach to architecture and development enabling the seamless movement between command handlers that run locally and remotely. There’s usually a performance price to pay for flexibility and features and so although I’d used some performance focused techniques in the code it seemed unlikely I’d be able to equal the performance of a smaller simpler framework. I decided getting within 20% the performance of Mediatr would be a reasonable price to pay for the additional functionality and flexibility.

Despite starting off in a pretty dismal place – nearly 10x slower than Mediatr – I’ve improved the performance of the framework so it is now about 10% faster than Mediatr as can be seen below (the numbers are from running large numbers of commands through both frameworks):

Framework                               Commands    Time Taken (ms)   Per Command (ms)
AzureFromTheTrenches.Commanding 6.1.0   10000000    11695             0.0011695
Mediatr 4.0.1                           10000000    12818             0.0012818
AzureFromTheTrenches.Commanding 6.0.0   10000000    127709            0.0127709
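
For context, the timings above come from a loop of roughly this shape – a sketch only, with the commanding framework usings omitted and NoOpCommand standing in for the trivial command and handler used in the real benchmark:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class DispatchBenchmark
{
    // NoOpCommand is assumed to be a do-nothing command with a matching do-nothing handler
    // registered in the usual way; the loop simply measures raw dispatch overhead.
    public static async Task RunAsync(ICommandDispatcher dispatcher, int iterations)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int index = 0; index < iterations; index++)
        {
            await dispatcher.DispatchAsync(new NoOpCommand());
        }
        stopwatch.Stop();
        Console.WriteLine($"Total: {stopwatch.ElapsedMilliseconds}ms, per command: {stopwatch.ElapsedMilliseconds / (double)iterations}ms");
    }
}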

I’m really pleased by that but I would suggest the numbers are sufficiently close that unless you have an extreme scenario you would be better choosing between the two frameworks based on other factors – predominantly how well they address your specific domain.

For those interested in how I improved the performance of the framework I’ll be documenting my process in an upcoming post (as well as highlighting a blooper that illustrates the need to always test performance in code where it is important).

Fixing a Common IoC Container Anti-pattern – the every class is public problem

An anti-pattern I’ve seen a lot over the last few years involves the registration of dependencies in an IoC container at the root of a project (or in a dedicated “IoC” project) – an approach enabled by making every single class in every assembly in the codebase public. It’s amazing how common it is and you see it in codebases that are poor in general and codebases that are otherwise well constructed. As such I find myself talking about it frequently and so it seemed a ripe topic for a blog post.

There are numerous issues with the “every class is public” approach:

  1. As someone reading or using the code I can no longer differentiate between the public API of a subsystem and the interfaces and classes designed for internal consumption.
  2. The registering project (for example an ASP.Net application) is making decisions about the lifecycle of components in another assembly and sub-system – and therefore about the internal implementation of that sub-system. This often leads to things getting out of sync and the issues arising from this kind of lifecycle registration / implementation mismatch can be subtle.
  3. The registering project has to be aware of every single thing in the system and reference every subsystem. One of the effective techniques to police code architecture is by looking at the dependency map and this is heavily polluted if you’re doing this.
  4. The scope of a code change is often larger than it should be and spans sub-systems when it doesn’t need to – if a project takes the root registration approach then adding a class and interface for internal use means I also have to visit the root project.
  5. If sub-systems are run within multiple hosts (for example a Web API and a queue processor) then registration is either duplicated in both root projects or an “IoC configuration” project is introduced: we’ve got ourselves in such a pickle that we now need a whole project dedicated to understanding both internal and external dependencies of sub-systems.

Encapsulation is a good thing – it shouldn’t be thrown away when moving from the class to the assembly level. It’s just as important there – perhaps even more so in modern codebases which are formed of many small classes with few methods rather than large classes with many methods.

I’ve provided a simple example of this common issue in the project you can find here:

https://github.com/JamesRandall/IoCAntipatternFix/tree/master/TheProblem

Conceptually in this project we’ve got three assemblies:

  1. A console app (ConsoleApp) that depends on (2)
  2. An assembly (Calendar) providing calendar functionality to the console app that depends on (3)
  3. An assembly (Notifications) providing notification functionality to the calendar assembly

From a required dependency point of view it looks like this:

But because of the every class is public issue it is actually implemented like this:

You can see the anti-pattern manifest itself in code in the RegisterDependencies method of Program.cs in the console app:

static IServiceProvider RegisterDependencies()
{
    IServiceCollection services = new ServiceCollection();
    services.AddTransient<Calendar.DataAccess.ICalendarRepository, Calendar.DataAccess.CalendarRepository>();
    services.AddTransient<Calendar.ICalendarManager, Calendar.CalendarManager>();
    services.AddSingleton<Notifications.INotifier, Notifications.Notifier>();
    services.AddTransient<Notifications.Channel.IEmail, Notifications.Channel.Email>();

    return services.BuildServiceProvider();
}

Does the console app have any business knowing that the ICalendarRepository is implemented by the CalendarRepository class? Should it even know about the email channel? Can it safely register the INotifier implementation as a singleton? The answer to all of those questions is no. Absolutely not.

The fix for this is pretty simple and it was great to see Microsoft adopt a version of it in ASP.Net Core as part of their formalisation of dependency inversion in that framework. All you need do is encapsulate the registration logic inside your sub systems – and if you need to conditionally configure the registration then pass through an options block (an example of this can be seen in my commanding framework).

I’m going to show two versions of the fix – one based on using the container’s registration interface, which has the byproduct of your assemblies becoming tied to an IoC container, and another that doesn’t require this.

Solution with a container interface

The approach adopted by Microsoft in the ASP.Net Core assemblies and the related packages is to use extension methods on the container interface (in the Microsoft case that’s IServiceCollection). If we take this approach the registration in our console app now looks like this:

static IServiceProvider RegisterDependencies()
{
    IServiceCollection services = new ServiceCollection();
    services.AddCalendar();

    return services.BuildServiceProvider();
}

Additionally our console app no longer has a reference to the notification sub-system as this is now dealt with by the calendar’s AddCalendar registration method:

public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddCalendar(this IServiceCollection serviceCollection)
    {
        serviceCollection.AddTransient<ICalendarRepository, CalendarRepository>();
        serviceCollection.AddTransient<ICalendarManager, CalendarManager>();

        serviceCollection.AddNotifications();

        return serviceCollection;
    }
}

Inside the calendar project only the interfaces intended for external consumption are marked as public with the rest moving to internal. It’s no longer possible to access the assembly’s private implementation from the outside and we’ve moved the lifecycle and registration logic closer to the code that is written in line with those expectations.
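
To illustrate (types from the sample project with bodies elided – the access modifiers are the point here), the calendar assembly now looks something like this:

// Part of the calendar sub-system's public API
public interface ICalendarManager
{
}

// Internal implementation detail - no longer visible to the console app
internal class CalendarManager : ICalendarManager
{
}

// Data access is purely an internal concern
internal interface ICalendarRepository
{
}

internal class CalendarRepository : ICalendarRepository
{
}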

And finally the notification assembly takes the same approach:

public static class ServiceCollectionExtensions
{
    public static IServiceCollection AddNotifications(this IServiceCollection serviceCollection)
    {
        serviceCollection.AddSingleton<INotifier, Notifier>();
        serviceCollection.AddTransient<IEmail, Email>();
        return serviceCollection;
    }
}

With this approach we’ve addressed the concerns I raised at the start of this piece and have moved back to a place where encapsulation is used to help us both read the code and use it safely.

It could well be argued that having your sub-systems reference and be aware of the specific IoC container in use is itself another anti-pattern. I’d tend towards agreeing but it can be a pragmatic choice for an internal codebase – though it’s flawed if you are creating packages for others to use: you’ve built in a hard dependency on a specific IoC container. You can solve this by defining your own interface for proxying over a container and having people implement it (there’s a sketch of that idea after the link below) or by using a functional approach, which we’ll look at next.

The code for the above approach can be found here:

https://github.com/JamesRandall/IoCAntipatternFix/tree/master/ContainerInterfaceSolution
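
For completeness, the proxying interface idea mentioned above might look something like the sketch below – the interface and the ServiceCollectionRegistrar name are entirely illustrative and not part of any framework. The host implements it once over its chosen container and the sub-systems only ever see the abstraction:

using Microsoft.Extensions.DependencyInjection;

// A hypothetical container-agnostic registration abstraction that sub-systems can depend on
public interface IDependencyRegistrar
{
    IDependencyRegistrar AddTransient<TInterface, TImplementation>() where TImplementation : TInterface;
    IDependencyRegistrar AddSingleton<TInterface, TImplementation>() where TImplementation : TInterface;
}

// Implemented once by the host over its chosen container - here Microsoft's IServiceCollection
public class ServiceCollectionRegistrar : IDependencyRegistrar
{
    private readonly IServiceCollection _services;

    public ServiceCollectionRegistrar(IServiceCollection services)
    {
        _services = services;
    }

    public IDependencyRegistrar AddTransient<TInterface, TImplementation>() where TImplementation : TInterface
    {
        _services.AddTransient(typeof(TInterface), typeof(TImplementation));
        return this;
    }

    public IDependencyRegistrar AddSingleton<TInterface, TImplementation>() where TImplementation : TInterface
    {
        _services.AddSingleton(typeof(TInterface), typeof(TImplementation));
        return this;
    }
}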

Solution with functions

An alternative to the interface approach is to use a functional style passing down lambda expressions. If we take this approach our console application’s registration method now looks like this:

static IServiceProvider RegisterDependencies()
{
    IServiceCollection services = new ServiceCollection();
    Dependencies.AddCalendar(
        (iface, impl) => services.AddTransient(iface, impl),
        (iface, impl) => services.AddSingleton(iface, impl));

    return services.BuildServiceProvider();
}

We simply wrap the relevant lifecycle registration methods on IServiceCollection inside lambda expressions and pass them down to a registration method in our calendar sub system:

public static class Dependencies
{
    public static void AddCalendar(
        Action<Type, Type> addTransient,
        Action<Type, Type> addSingleton)
    {
        addTransient(typeof(ICalendarRepository), typeof(CalendarRepository));
        addTransient(typeof(ICalendarManager), typeof(CalendarManager));

        Notifications.Dependencies.AddNotifications(addTransient, addSingleton);
    }
}

This registers our types using the lambda expressions and passes them on to the notification dependency:

public static class Dependencies
{
    public static void AddNotifications(
        Action<Type, Type> addTransient,
        Action<Type, Type> addSingleton)
    {
        addSingleton(typeof(INotifier), typeof(Notifier));
        addTransient(typeof(IEmail), typeof(Email));
    }
}

Again this approach addresses the concerns that arise when implementation classes are made public and registration is centralised but with the added advantage that the sub-systems are independent of any specific IoC container. In my experience this also discourages people from misusing many of the “advanced” capabilities that can be found on IoC containers – but that’s a topic for another post.

The code for this approach can be found here:

https://github.com/JamesRandall/IoCAntipatternFix/tree/master/FunctionalSolution

Wrap up

Hopefully in the above I’ve highlighted a common pitfall and demonstrated two solutions to it. There are of course many other variants you can apply depending on your specific project. If you disagree or have any questions please feel free to reach out on Twitter.

Azure Functions – Microsoft Feedback on HTTP Trigger Scaling

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

Following the analysis I published on Azure Functions and the latency in scaling HTTP triggered functions the Microsoft development team got in touch to discuss my findings and provide some information about the future which they were happy for me to share.

Essentially the team are already at work making improvements in this area. Understandably they were unable to commit to timescales or make specific claims as to how significant those improvements will be, but my sense is we’re looking at a handful of months and so, hopefully, the first half of this year. They are going to get in touch with me once something is available and I’ll rerun my tests.

I must admit I’m slightly sceptical as to whether they’ll be able to match the scaling capability of AWS Lambda (and to be clear they did not make any such claim), which is what I’d like to see, as that looks to me as if it would require a radical uprooting of the Functions runtime model rather than an evolution – but ultimately I’m just a random, slightly informed, punter. Hopefully they can at least get close enough that Azure Functions can be used in more latency critical and spiky scenarios.

I’d like to thank @jeffhollan and the team for the call – as a predominantly Azure and .NET developer it’s both helpful and encouraging to be able to have these kinds of dialogues around the platform so critical to our success.

In the interim I’m still finding I can use HTTP functions – I just have to be mindful of their current limitations – and have some upcoming blog posts on patterns that make use of them.

Azure Functions – Scaling with a Dedicated App Service Plan

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions. I’ve not yet had the opportunity to test performance on dedicated app service plans but please see this post for a revised comparison on the Consumption Plan.

After my last few posts on the scaling of Azure Functions I was intrigued to see if they would perform any better running on a dedicated App Service Plan. Hosting them in this way allows the functions to take full advantage of App Service features but, to my mind, is no longer a serverless approach as, rather than being billed based on usage, you are essentially renting servers and are fully responsible for scaling.

I conducted a single test scenario: an immediate load of 400 concurrent users running for 5 minutes against the “stock” JavaScript function (no external dependencies, just returns a string) on 4 configurations:

  1. Consumption Plan – billed based on usage – approximately $130 per month
    (based on running constantly at the tested throughput that is around 648 million functions per month)
  2. Dedicated App Service Plan with 1 x S1 server – $73.20 per month
  3. Dedicated App Service Plan with 2 x S1 servers – $146.40 per month
  4. Dedicated App Service Plan with 4 x S1 servers – $292.80 per month

I also included AWS Lambda as a reference point.

The results were certainly interesting:

With immediately available resource all 3 App Service Plan configurations begin with response times slightly ahead of the Consumption Plan but at around the 1 minute mark the Consumption Plan overtakes our single instance configuration and at 2 minutes creeps ahead of the double instance configuration and, while the advantage is slight, at 3 minutes begins to consistently outperform our 4 instance configuration. However AWS Lambda remains some way out in front.

From a throughput perspective the story is largely the same with the Consumption Plan taking time to scale up and address the demand but ultimately proving more capable than even the 4x S1 instance configuration and knocking on the door of AWS Lambda. What I did find particularly notable is how little impact moving from 2 to 4 instances has on throughput – the improvement is massively disappointing: for incurring twice the cost we barely get 50% more throughput. I have insufficient data to understand why this is happening but do have some tests in mind that, time allowing, I will run and see if I can provide further information.

At this kind of load (650 million requests per month) from a bang per buck point of view Azure Functions on the Consumption Plan come out strongly compared to App Service instances, even if we don’t allow for quiet periods when Functions would incur less cost. If your scale profile falls within the capabilities of the service it’s worth considering, though it’s worth remembering there isn’t really an SLA around Functions at the moment when running on the Consumption Plan (and to be fair the same applies to AWS Lambda).

If you don’t need any of the additional features that come with a dedicated App Service Plan then, although dedicated instances can be provisioned to avoid the slow ramp up of the Consumption Plan, they are expensive in comparison.

Azure Functions vs AWS Lambda vs Google Cloud Functions – JavaScript Scaling Face Off

Since I published this piece Microsoft have made significant improvements to HTTP scaling on Azure Functions and the below is out of date. Please see this post for a revised comparison.

I had a lot of interesting conversations and feedback following my recent post on scaling a serverless .NET application with Azure Functions and AWS Lambda. A common request was to also include Google Cloud Functions and a common comment was that the runtimes were not the same: .NET Core on AWS Lambda and .NET 4.6 on Azure Functions. In regard to the latter point I certainly agree this is not ideal but continue to contend that as these are your options for .NET, and are fully supported and stated as scalable serverless runtimes by each vendor, it’s worth understanding and comparing these platforms as that is your choice as a .NET developer. I’m also fairly sure that although the different runtimes might make a difference to outright raw response time, and therefore throughput and the ultimate amount of resource required, the scaling issues with Azure had less to do with the runtime and more to do with the surrounding serverless implementation.

Do I think a .NET Core function in a well architected serverless host will outperform a .NET Framework based function in a well architected serverless host? Yes. Do I think .NET Framework is the root cause of the scaling issues on Azure? No. In my view AWS Lambda currently has a superior way of managing HTTP triggered functions when compared to Azure and Azure is hampered by a model based around App Service plans.

Taking all that on board and wanting to better evidence or refute my belief that the scaling issues are more host than framework related I’ve rewritten the test subject as a tiny Node / JavaScript application and retested the platforms on this runtime – Node is supported by all three platforms and all three platforms are currently running Node JS 6.x.

My primary test continues to be a mixed light workload of CPU and IO (load three blobs from the vendor’s storage offering and then compile and run a handlebars template), the kind of workload it’s fairly typical to find in an HTTP function / public facing API. However I’ve also run some tests against “stock” functions – the vendor samples that simply return strings. Finally I’ve also included some percentile based data which I obtained using Apache Benchmark and I’ve covered off cold start scenarios.

I’ve also managed to normalise the axes this time round for a clearer comparison and the code and data can all be found on GitHub:

https://github.com/JamesRandall/serverlessJsScalingComparison

(In the last week AWS have also added full support for .NET Core 2.0 on Lambda – expect some data on that soon)

Gradual Ramp Up

This test case starts with 1 user and adds 2 users per second up to a maximum of 500 concurrent users to demonstrate a slow and steady increase in load.

The AWS and Azure results for JavaScript are very similar to those seen for .NET with Azure again struggling with response times and never really competing with AWS when under load. Both AWS and Azure exhibit faster response times when using JavaScript than .NET.

Google Cloud Functions run fairly close to AWS Lambda but can’t quite match it for response time and fall behind on overall throughput, where they sit closer to Azure’s results. Given the difference in response time this would suggest Azure is processing more concurrent incoming requests than Google, allowing it to have a similar throughput after the dip Azure encounters at around the 2:30 mark – presumably Azure allocates more resource at that point. That dip deserves further attention and is something I will come back to in a future post.

Rapid Ramp Up

This test case starts with 10 users and adds 10 users every 2 seconds up to a maximum of 1000 concurrent users to demonstrate a more rapid increase in load and a higher peak concurrency.

Again AWS handles the increase in load very smoothly maintaining a low response time throughout and is the clear leader.

Azure struggles to keep up with this rate of request increase. Response times hover around the 1.5 second mark throughout the growth stage and gradually decrease towards something acceptable over the next 3 minutes. Throughput continues to climb over the full duration of the test run matching and perhaps slightly exceeding Google by the end but still some way behind Amazon.

Google has two quite distinctly sharp drops in response time early on in the growth stage as the load increases, before quickly stabilising with a response time around 140ms and levelling off with throughput in line with the demand at the end of the growth phase.

I didn’t run this test with .NET, instead hitting the systems with an immediate 1000 users, but nevertheless the results are in line with that test, particularly once the growth phase is over.

Immediate High Demand

This test case starts immediately with 400 concurrent users and stays at that level of load for 5 minutes demonstrating the response to a sudden spike in demand.

Both AWS and Google scale quickly to deal with the sudden demand both hitting a steady and low response time around the 1 minute mark but AWS is a clear leader in throughput – it is able to get through many more requests per second than Google due to its lower response time.

Azure again brings up the rear – it takes nearly 2 minutes to reach a steady response time that is markedly higher than both Google and AWS. Throughput continues to increase to the end of the test where it eventually peaks slightly ahead of Google but still some way behind AWS. It then experiences a fall off which is difficult to explain from the data available.

Stock Functions

This test uses the stock “return a string” function provided by each platform (I’ve captured the code in GitHub for reference) with the immediate high demand scenario: 400 concurrent users for 5 minutes.

With the functions essentially doing no work and no IO the response times are, as you would expect, smaller across the board but the scaling patterns are essentially unchanged from the workload function under the same load. AWS and Google respond quickly while Azure ramps up more slowly over time.

Percentile Performance

I was unable to obtain this data from VSTS and so resorted to running Apache Benchmark. For this test I used settings of 100 concurrent requests for a total of 10000 requests, collected the raw data, and processed it in Excel. It should be noted that the network conditions were less predictable for these tests and I wasn’t always as geographically close to the cloud function as I was in other tests, though repeated runs yielded similar patterns:

AWS maintains a pretty steady response time up to and including the 98th percentile but then shows marked dips in performance in the 99th and 100th percentiles with a worst case of around 8.5 seconds.

Google dips in performance after the 97th percentile with its 99th percentile roughly equivalent to AWS’s 100th percentile and its own 100th percentile being twice as slow.

Azure exhibits a significant dip in performance at the 96th percentile with a sudden jump in response time from a not great 2.5 seconds to 14.5 seconds – in AWS’s 100th percentile territory. Beyond the 96th percentile there is a fairly steady decrease in performance of around 2.5 seconds per percentile.

Cold Starts

All the vendors’ solutions go “cold” after a time, leading to a delay when they next start. To get a sense for this I left each vendor idle overnight and then had 1 user make repeat requests for 1 minute to illustrate the cold start time but also get a visual sense of request rate and variance in response time:

Again we have some quite striking results. AWS has the lowest cold start time of around 1.5 seconds, Google is next at 2.5 seconds and Azure again the worst performer at 9 seconds. All three systems then settle into a fairly consistent response time but it’s striking in these graphs how AWS Lambda’s significantly better performance translates into nearly 3x as many requests as Google and 10x more requests than Azure over the minute.

It’s worth noting that the cold start time for the stock functions is almost exactly the same as for my main test case – the startup is function related and not connected to storage IO.

Conclusions

AWS Lambda is the clear leader for HTTP triggered functions – on all the runtimes I’ve tried it has the lowest response times and, at least within the volumes tested, the best ability to deal with scale and the most consistent performance. Google Cloud Functions are not far behind and it will be interesting to see if they can close the gap with optimisation work over the coming year – if they can get their flat out response times reduced they will probably pull level with AWS. The results are similar enough in their characteristics that my suspicion is Google and AWS have similar underlying approaches.

Unfortunately, like with the .NET scenarios, Azure is poor at handling HTTP triggered functions with very similar patterns on show. The Azure issues are not framework based but due to how they are hosting functions and handling scale. Hopefully over the next few months we’ll see some improvements that make Azure a more viable host for HTTP serverless / API approaches when latency matters.

By all means use the above as a rough guide but ultimately whatever platform you choose I’d encourage you to build out the smallest representative vertical slice of functionality you can and test it.

Thanks for reading – hopefully this data is useful.
