Monday, 17 December 2018

Network Infrastructure

Today I had one of those frustrating conversations with one of the network engineers on the infrastructure team. They had managed to destroy one of the Octopus boxes, and the replacement couldn't see some internal NuGet servers.
First, to open the firewall I had to submit a spreadsheet based on a template I couldn't access. Then it got rejected because I had entered only the DNS entries and not the IPs, so I went to ask him why that was needed, hoping the reason was that he would feed the spreadsheet into some application to automate the rule creation. But the actual reason was "it will take too long to ping those boxes" (there were only three of them).
Hopefully he understands that this kind of behaviour makes every single developer want to move to the cloud and make him obsolete...

Thursday, 6 September 2018

TeamCity - GitHub Commit Status Publisher

This morning one of the new guys in the team asked me if he could get notified when a pull request that has been merged is built. Sometimes you just need someone to ask the right question for you to notice that you could be doing things better.
After a few clicks in TeamCity I was able to configure a Build Feature to feed the result of the build back to GitHub. The only drawback is that there is no easy way to configure it at the root of all the projects, so I'll have to add it one by one to the close to a hundred build configurations we have.



Plug-in Configuration:
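To avoid all that clicking, the same feature can probably be added through TeamCity's REST API. Here is a rough Python sketch; the server URL and token are placeholders, and the feature type id and property names are assumptions from memory, so copy the exact values from a feature configured once through the UI:

```python
import json
import urllib.request

TEAMCITY_URL = 'http://teamcity.example.com'       # placeholder server
HEADERS = {'Authorization': 'Bearer ***token***',  # placeholder token
           'Content-Type': 'application/json',
           'Accept': 'application/json'}

def commit_status_feature():
    # Build feature payload; 'commit-status-publisher' and the property
    # names are assumptions - copy them from a feature created in the UI.
    return {'type': 'commit-status-publisher',
            'properties': {'property': [
                {'name': 'publisherId', 'value': 'githubStatusPublisher'},
                {'name': 'github_host', 'value': 'https://api.github.com'},
            ]}}

def add_feature_to_all_build_configurations():
    # List every build configuration, then POST the feature to each one
    request = urllib.request.Request(TEAMCITY_URL + '/app/rest/buildTypes',
                                     headers=HEADERS)
    build_types = json.load(urllib.request.urlopen(request))
    for build_type in build_types['buildType']:
        post = urllib.request.Request(
            TEAMCITY_URL + '/app/rest/buildTypes/id:%s/features' % build_type['id'],
            data=json.dumps(commit_status_feature()).encode(),
            headers=HEADERS, method='POST')
        urllib.request.urlopen(post)
        print('Commit status publisher added to ' + build_type['id'])

# add_feature_to_all_build_configurations()  # try against a test project first
```
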



Saturday, 28 July 2018

Unexpected interview output

During the interview of a candidate for a developer position in the team, I started to worry about how little experience he had on the operations side. After some questions, I realized that with all these new cloud development tools it is perfectly possible, in some scenarios, to focus on just coding.
In his case, he had been working on a simple web application built on Asp.net Core, for which he didn't need to set up continuous build/deployment manually, nor did he need to spend much time on monitoring, instrumentation or provisioning, as in Azure it is possible to deal with these concerns from the dashboard.
Other than that he proved he could solve problems and produce clean code, so hopefully he accepts the offer and joins the team.

Saturday, 7 April 2018

One year of Akka.net in production

One year ago we went live in production with a rewrite of the core of the platform in Akka.net. Although the initial release date had to be delayed because of a bug we found in the pre-production environment in large Akka.net clusters, I have to say that after fine-tuning the remoting and clustering parameters the system has been stable, the performance improvement in soft real time processing is noticeable, and the system is more scalable.
The core of the system is formed of two services plus the lighthouse. Initially we went live with a single cluster of more than 50 instances; in the latest release this week the count has gone up to 101 (!) service instances distributed across three clusters.
After this experience we have continued using Akka.net in other parts of the platform, like dynamic clusters of pub-sub actors used to distribute the load and control the concurrency when processing messages from RabbitMq.
We have also started using Akka.net in conjunction with .net core, and I had the opportunity to contribute to the Akka.DI.Ninject repository to convert it to .net standard 2.0 so it could be used in .net core services.

Monday, 26 March 2018

Using Python and the Octopus Api to handle rollbacks

This week we have migrated a large number of servers and services to new clusters of MongoDB, Aerospike and RabbitMQ. Because all this configuration lives in Octopus variables (connection strings), we just needed to change those variables to point to the new infrastructure. All good in principle: change the variables and create new releases for each service; but that leaves us exposed to three potential problems:

  1. if an emergency release is needed, it will automatically point to the new infrastructure, which might not be ready for the switch-over.
  2. if an unexpected problem is found after the release and we want to roll back, we would need to modify all the Octopus variables and create a new release to roll back to the old infrastructure (or roll back to the old version of the code).
  3. additionally, in both cases, creating a new release after the deployment could pick up variable changes made for future releases, like feature flags.
To address those potential problems, I decided to add a rollback role to all the servers that were planned to be deployed (30+), add the same rollback role to the existing connection string variables, and then create new variables with the same roles as the originals. This way, if a new release needs to be created it will use the rollback variables. Right before the release we would remove the rollback role, so that when deploying, the new connection strings would be used. And in case a rollback is needed, we would just add back the rollback role and deploy the same release.
The only problem was adding and removing that role on such a large number of servers; this is where I put together my very first Python program, using the Octopus Api to add or remove a role. The code is a bit raw, but other members of the team have already started to use it to set up new deployment targets. I have to say I'm impressed by how natural and quick it was to write the script in Python.


import requests

OctopusUrl = 'http://***octopusserver***'
headers = {'X-Octopus-ApiKey': 'API-***********'}
newRole = 'Rollback'
environmentName = 'Production'

# environment: find the machines link for the target environment
machinesUrl = None
environments = requests.get(OctopusUrl + '/api/environments', headers=headers).json()
for environment in environments['Items']:
    if environment['Name'] == environmentName:
        machinesUrl = environment['Links']['Machines']

# machines: collect those with the main role, following pagination
machines = requests.get(OctopusUrl + machinesUrl, headers=headers).json()
machinesList = []
machineEndPage = False
while not machineEndPage:
    for machine in machines['Items']:
        if 'MainRole' in machine['Roles']:
            machinesList.append(machine)
    nextMachinesUrl = machines['Links'].get('Page.Next')
    if nextMachinesUrl:
        machines = requests.get(OctopusUrl + nextMachinesUrl, headers=headers).json()
    else:
        machineEndPage = True

# add or remove the role on each machine
for machine in machinesList:
    #if newRole not in machine['Roles']:  # Add role
    #    machine['Roles'].append(newRole)
    if newRole in machine['Roles']:  # Remove role
        machine['Roles'].remove(newRole)
        machineUrl = OctopusUrl + machine['Links']['Self']
        result = requests.put(machineUrl, json=machine, headers=headers)
        print(machine['Name'] + ' ' + str(result.status_code))

Thursday, 8 March 2018

Video: What I Wish I Had Known Before Scaling Uber to 1000 Services • Matt Ranney

This morning during breakfast I came across this great video https://youtu.be/kb-m2fasdDY about the problems Uber had to overcome when it moved to a microservice architecture. Although at a much smaller scale, it is surprising how the organisation I'm currently part of had/has the same problems: 
  • rest/json contracts need integration tests
  • too much logging
  • logging not uniform across different technologies
  • tracing agreement
  • language silos (zealots)
  • too many repositories
  • hard to coordinate multi team deployments
  • incident ownership
  • load testing is hard

Sunday, 4 March 2018

Request Linking in asp.net core


In systems where events are processed by several microservices, it is useful to maintain a request identifier to be able to link the logs from each microservice when investigating an issue or measuring performance.
So far we have relied on a custom http header and owin middleware to read the request identifier, add a new segment to it, and then set it in the log4net context so it can be written when logging an event.
For a new asp.net core service I'm working on, I quickly put together new middleware to do the same operation using the documentation, but instead of using log4net I have found that it is easier to use NLog to capture the TraceIdentifier property in the HttpContext.
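The linking scheme itself is just one appended segment per hop; a minimal Python sketch of the idea (the function name is illustrative, the real logic lives in the C# middleware below):

```python
def link_transaction(parent_id: str, local_id: str) -> str:
    # Prefix this hop's trace id with the caller's identifier, if any,
    # producing one slash-separated segment per service crossed.
    return f"{parent_id}/{local_id}" if parent_id else local_id
```

So a request that crossed services a and b before reaching c logs as "a/b/c", and searching the logs for the first segment pulls out the whole distributed trace.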

Middleware

public class TransactionLinkingMiddleware
{
    public const string ParentTransactionHttpHeader = "Transaction-Link";

    private readonly RequestDelegate _next;

    public TransactionLinkingMiddleware(RequestDelegate next)
    {
        _next = next;
    }

    public Task Invoke(HttpContext context)
    {
        if (context.Request.Headers.TryGetValue(ParentTransactionHttpHeader, out var headerValues)
            && headerValues.Any())
        {
            context.TraceIdentifier = $"{headerValues.First()}/{context.TraceIdentifier}";
        }
        // Call the next delegate/middleware in the pipeline
        return this._next(context);
    }
}

public static class TransactionLinkingMiddlewareExtensions
{
    public static IApplicationBuilder UseTransactionLinking(this IApplicationBuilder builder)
    {
        return builder.UseMiddleware<TransactionLinkingMiddleware>();
    }
}

Startup.cs

public Startup(IConfiguration configuration, IHostingEnvironment env)
{
    Configuration = configuration;
    env.ConfigureNLog("NLog.config");
}

public void Configure(IApplicationBuilder app, IHostingEnvironment env,
    ILoggerFactory loggerFactory, IApplicationLifetime applicationLifetime)
{
    if (env.IsDevelopment())
    {
        app.UseDeveloperExceptionPage();
    }
    app.UseTransactionLinking();
    loggerFactory.AddNLog();
    app.UseMvc();
}

NLog.config

<?xml version="1.0" encoding="utf-8" ?>
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      autoReload="true">
  <extensions>
    <add assembly="NLog.Web.AspNetCore"/>
  </extensions>
  <targets>
    <target xsi:type="File" name="File"
            fileName="${basedir}/logs/log.txt"
            layout="TimeStamp=${longdate} Level=${uppercase:${level}} Transaction=${aspnet-TraceIdentifier} Message=${message}" />
  </targets>
  <rules>
    <logger name="*" minlevel="Debug" writeTo="File" />
  </rules>
</nlog>

The NLog.Web and NLog.Web.AspNetCore packages need to be installed: https://github.com/NLog/NLog.Web

Saturday, 17 February 2018

Using Memory and CPU profilers to identify a production issue

As I explained in my previous post, the previous Friday I managed to fix and deploy a priority one issue that caused a prolonged outage in production, while I had the flu.
The symptoms were evident: randomly, one by one, the instances of one of the most important services in the platform started to use more memory, and after a few minutes they became unresponsive. Luckily for us, the load of this service is distributed among over twenty instances in the production environment, which allowed the team to actively identify and restart the services that were becoming unresponsive, minimizing the downtime of the system.
At first we thought this issue could be caused by some Aerospike server maintenance we had that week, causing Aerospike's .net client to misbehave and leak memory. To make things worse, we also had a production release the same week with over fifteen services deployed, so while part of the team worked on palliative measures to limit downtime, each team that had released a new version started to review all the changes made and think about whether they could be causing the issue downstream. The situation escalated quickly, and soon we started to turn off feature flags and, later, to roll back to previous versions of the services.
In parallel to helping other members of the team create and configure more boxes to distribute the load and redirect traffic, I began to investigate the problem.
After looking at logs and the typical performance indicators like number of threads, handlers, connections and so on; and because one of the symptoms was high memory usage, I waited for one of the instances to start misbehaving to disconnect it from the main cluster and then take a memory snapshot using dotMemory.

Originally I thought that I would find a large number of objects related to one of the client libraries used by the service, like Aerospike, MongoDb or RabbitMq, but instead I found a large number of Akka.net messages. These services form an Akka.net cluster of more than fifty nodes, and the actors within the system communicate with each other using messages, so looking at the type of the message I was able to identify what part of the flow uses those messages and narrow down the problem that way.
This service is a rewrite of part of the system, but unfortunately not a full rewrite, so the actor affected by the issue calls into legacy code too convoluted to easily identify the issue in. With this information, my next step was to take a CPU snapshot using dotTrace; at first the snapshot was pretty overwhelming, as there were hundreds of threads going on. Here is where I used the information I had gathered previously from the memory snapshot to filter and identify the threads that were misbehaving.

In the end I found that the method blocking the processing was three-to-five-year-old code used to perform tag replacement on users' templates. It was evident to me that the while loop below was causing the code to enter an infinite loop, taking down a thread (and also an actor) with it. The only thing left was to refactor that code and write unit tests around it to prove that the bug was there and that my fix would work. Once the pull request had the green light from all the other team members, I released it to preprod, and then production.

This loop never ends if the input string is malformed, for example "{tag} with {bad) format", and there were no validations in the UI or back-end for it...

//Remove any unmatched tags
while (name.IndexOf("{", StringComparison.Ordinal) != -1)
{
    // When there is no closing "}", IndexOf returns -1, so the second
    // Substring((-1) + 1) yields the whole string again: infinite loop.
    name = name.Substring(0, name.IndexOf("{", StringComparison.Ordinal))
        + name.Substring(name.IndexOf("}", StringComparison.Ordinal) + 1);
}
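The post doesn't show the fix itself; purely as an illustration, here is a Python sketch of one bounded way to do the same stripping (the production fix was in C# and may well differ). A single forward scan guarantees progress, so a "{" without a matching "}" is simply skipped instead of looping forever:

```python
def remove_unmatched_tags(name: str) -> str:
    # Single forward scan: every iteration advances the index, so the
    # loop is bounded by len(name) even for malformed input.
    out = []
    i = 0
    while i < len(name):
        if name[i] == '{':
            close = name.find('}', i)
            if close == -1:
                i += 1  # lone '{': drop it and carry on instead of spinning
            else:
                i = close + 1  # skip the whole '{...}' tag
        else:
            out.append(name[i])
            i += 1
    return ''.join(out)
```

For the malformed input above, remove_unmatched_tags("{tag} with {bad) format") terminates and returns " with bad) format" instead of hanging the thread.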


Friday, 9 February 2018

Friday afternoon deployment

Just finished finding, fixing and deploying to production a bug (infinite loop!) that had been there for more than three years and was causing a priority one issue. The best part is that I have only been around for a year and a half and had never even seen that code before, and I was supposed to be off because I have the flu... what could have gone wrong?

Tuesday, 6 February 2018

Span<T>, the best part of the .net core 2.1 roadmap

The newly announced .net core 2.1 performance enhancements in build time and HttpClient are great, but the best addition is the Span<T> and Memory<T> types in C# 7.2, which follow the low-memory-allocation fashion started with ValueTask<T> in C# 7.
The .net environment has become one of the top contenders performance-wise, and although these new features probably won't be extensively used by most .net developers, I can already think of all the services where I could have used them, or will.
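Span<T> is a C# type, but the core idea, a view over an existing buffer that can be sliced and passed around without copying, has a rough analogue in Python's memoryview. A loose illustration of the concept only, not the .net API:

```python
data = bytearray(b"hello world")
view = memoryview(data)[6:]     # zero-copy "slice" over the tail of the buffer
assert bytes(view) == b"world"

data[6] = ord("W")              # mutate the underlying buffer...
assert bytes(view) == b"World"  # ...and the view sees it: nothing was copied
```

Span<T> takes this further by working over stack memory and native buffers too, which is what makes it interesting for parsers and serializers.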

Thursday, 25 January 2018

Reverse Proxying Kestrel with URL Rewrite for IIS

As said in previous posts, we have been developing new services using Asp.net Core, self-hosted as Windows services using Kestrel. An interesting new problem arose: as Kestrel cannot share the Http/Https port with IIS or Owin self-hosted services, we had to use a different port to host each service. In the past we had this problem when hosting APIs with F# Suave, and had overcome it by redirecting the requests in the F5 Big Ip load balancer. That solution works, but sometimes I feel like it crosses into the domain of infrastructure instead of development. So in order to contain all the deployment and configuration in TeamCity/Octopus, I decided to use IIS to reverse proxy the calls made to these microservices, as explained in: https://weblogs.asp.net/owscott/creating-a-reverse-proxy-with-url-rewrite-for-iis
Basically, requests coming to http://+/newmicroserviceapi/ will be rewritten to http://localhost:<microserviceport>/
This is the Octopus Deploy step I created to automate all this:

PowerShell Script:
$siteName = $OctopusParameters['ReverseProxy.IISSiteName']
$pathBase = $OctopusParameters['ReverseProxy.PathBase']
$port = [int] $OctopusParameters['ReverseProxy.Port']
$site = "iis:\sites\$siteName"
$filterName="Reverse proxy inbound $pathBase"
$filterPath = "system.webServer/rewrite/rules/rule[@name='$filterName']"
Write-Host "Adding reverse proxy rule to $siteName for $pathBase/*"
Clear-WebConfiguration -pspath $site -filter $filterPath
Add-WebConfigurationProperty -pspath $site -filter "system.webServer/rewrite/rules" -name "." -value @{name=$filterName; patternSyntax='Regular Expressions'; stopProcessing='True';}
Set-WebConfigurationProperty -pspath $site -filter "$filterPath/match" -name "url" -value "$pathBase/(.*)"
Set-WebConfigurationProperty -pspath $site -filter "$filterPath/conditions" -name "logicalGrouping" -value "MatchAny"
Set-WebConfigurationProperty -pspath $site -filter "$filterPath/action" -name "type" -value "Rewrite"
Set-WebConfigurationProperty -pspath $site -filter "$filterPath/action" -name "url" -value "http://localhost:$port/{R:1}"


Octopus Step:

Note that the Application Request Routing IIS plugin needs to be installed.

Tuesday, 23 January 2018

Refactoring Monolith Services and Git Contribution Stats

Today, a friend was having a look at my GitHub account and asked me about the large number of private contributions I have. That made me look at the stats myself, and I noticed that I have been committing much more than I expected to a single repository.
Of course, this is the repository of the "big monolith" my team and I have been refactoring and splitting into multiple repositories for the last year, and the reason for that many commits to this "soon to be obsolete" repository is the approach we are taking.
Instead of rewriting the functionality in a new service straight away, I have found that for very complex parts of the system a good approach is to feature flag the functionality that is being rewritten, and then rewrite it in the same solution alongside the old code, using bridges and adapters to interface with it. This way we can easily switch back and forth between old and new code; it also allows us to write tests that can run against both implementations, but mainly it allows for a more progressive and less aggressive rewrite of the functionality.
A small trick is to create all the new projects in the same folder; that way, when the code is stable enough, that folder can be split into a new repository while maintaining its history.

Sunday, 21 January 2018

Asynchronously wait for Task to complete with timeout

This week I found the following code on Stack Overflow to await the completion of an async operation with a timeout, and I really liked the solution.

  int timeout = 1000;
  var task = SomeOperationAsync();
  if (await Task.WhenAny(task, Task.Delay(timeout)) == task) {
      // task completed within timeout
  } else {
      // timeout logic (note that the original task is still running)
  }
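For comparison, Python's asyncio bakes the same race into asyncio.wait_for, though unlike the Task.WhenAny pattern it also cancels the pending operation on timeout. A small sketch (some_operation is a stand-in for a real async call):

```python
import asyncio

async def some_operation():
    # stand-in for a real async call; completes well within the timeout
    await asyncio.sleep(0.01)
    return 42

async def main():
    try:
        # races the operation against the timeout and, unlike the C#
        # Task.WhenAny pattern, cancels the operation if time runs out
        return await asyncio.wait_for(some_operation(), timeout=1.0)
    except asyncio.TimeoutError:
        return None  # timeout logic

print(asyncio.run(main()))  # → 42
```

Whether leaving the timed-out task running (the C# snippet) or cancelling it (wait_for) is the right behaviour depends on whether the operation is safe to abandon midway.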