As I explained in my previous post, last Friday I managed to fix and deploy a priority-one issue that had caused a prolonged outage in production, all while I had the flu.
The symptoms were evident: randomly, one by one, instances of one of the most important services in the platform started to consume more and more memory and, after a few minutes, became unresponsive. Luckily for us, the load of this service is distributed among more than twenty instances in the production environment, which allowed the team to actively identify and restart the instances that were becoming unresponsive, minimizing the downtime of the system.
At first we thought the issue could be caused by some Aerospike server maintenance we had done that week, making Aerospike's .NET client misbehave and leak memory. To make things worse, we had also had a production release the same week, with over fifteen services deployed, so while part of the team worked on palliative measures to reduce downtime, each team that had released a new version started reviewing all the changes they had made and considering whether they could be causing the issue downstream. The situation escalated quickly, and soon we started turning off feature flags and, later, rolling back services to their previous versions.
In parallel with helping other members of the team create and configure more boxes to distribute the load and redirect traffic, I began to investigate the problem.
After looking at the logs and the typical performance indicators (number of threads, handles, connections and so on), and because one of the symptoms was high memory usage, I waited for one of the instances to start misbehaving, disconnected it from the main cluster and took a memory snapshot using dotMemory.
Originally I thought I would find a large number of objects related to one of the client libraries used by the service, like Aerospike, MongoDB or RabbitMQ, but instead I found a huge amount of Akka.NET messages. These services form an Akka.NET cluster of more than fifty nodes, and the actors within the system communicate with each other using messages, so by looking at the type of the accumulated messages I was able to identify which part of the flow used them and, that way, narrow down the problem.
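To illustrate why the snapshot was full of messages rather than leaked client objects: an Akka.NET actor processes its mailbox one message at a time, so when a handler blocks, every new message addressed to that actor just sits in memory. The sketch below is not the service's code, only a minimal, hypothetical reproduction (RenderTemplate, TemplateActor and BlockingLegacyCall are made-up names):

```csharp
using System;
using System.Threading;
using Akka.Actor;

// Hypothetical message type, standing in for the ones that filled the snapshot.
public sealed class RenderTemplate
{
    public RenderTemplate(string template) => Template = template;
    public string Template { get; }
}

// An actor handles one message at a time: if the handler blocks,
// nothing else is dequeued and pending messages pile up in its mailbox.
public class TemplateActor : ReceiveActor
{
    public TemplateActor()
    {
        Receive<RenderTemplate>(msg =>
        {
            // Stand-in for a blocking legacy call; a handler stuck here
            // pins its thread and lets the mailbox grow unbounded.
            BlockingLegacyCall(msg.Template);
        });
    }

    private static void BlockingLegacyCall(string template) =>
        Thread.Sleep(Timeout.Infinite); // simulates the call that never returns
}

public static class Program
{
    public static void Main()
    {
        var system = ActorSystem.Create("templates");
        var actor = system.ActorOf(Props.Create(() => new TemplateActor()), "template-actor");

        actor.Tell(new RenderTemplate("first"));  // blocks the actor "forever"
        actor.Tell(new RenderTemplate("second")); // queues up and is never processed

        Console.ReadLine();
        system.Terminate().Wait();
    }
}
```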
This service is a rewrite of part of the system, but unfortunately not a full rewrite, so the affected actor calls into legacy code too convoluted to pinpoint the issue by inspection alone. With this information, my next step was to take a CPU snapshot using dotTrace. At first the snapshot was pretty overwhelming, as there were hundreds of threads going on; this is where I used the information gathered previously from the memory snapshot to filter the threads and identify the ones that were misbehaving.
In the end I found that the method blocking the processing was three-to-five-year-old code used to perform tag replacement on users' templates. It was evident to me that the while loop below was sending the code into an infinite loop and, in doing so, taking down a thread (and an actor with it). The only thing left was to refactor that code and write unit tests around it to prove both that the bug was there and that my fix worked. Once the pull request had the green light from the rest of the team, I released it to preprod and then to production.


This loop never ends if the input string is malformed, for example "{tag} with {bad) format", and there was no validation for it in the UI or the back end. Once there is a "{" with no "}" after it, IndexOf("}") returns -1, so the second Substring keeps the whole string and the unmatched "{" is never removed.
```csharp
// Remove any unmatched tags
while (name.IndexOf("{", StringComparison.Ordinal) != -1)
{
    name = name.Substring(0, name.IndexOf("{", StringComparison.Ordinal))
         + name.Substring(name.IndexOf("}", StringComparison.Ordinal) + 1);
}
```
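The post stops at "refactor and write tests", so what follows is only a minimal sketch of one possible refactor, under the assumption that the intended behaviour is to strip well-formed "{...}" tags and leave anything unmatched alone; TagStripper and RemoveTags are made-up names, not the real ones. The key change is to stop as soon as there is no "}" after the current "{" instead of assuming one is always there.

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Removes "{...}" tag pairs from the template; text with an unmatched
    // "{" is copied through untouched instead of looping forever.
    public static string RemoveTags(string name)
    {
        if (string.IsNullOrEmpty(name))
            return name;

        var result = new StringBuilder(name.Length);
        var position = 0;

        while (position < name.Length)
        {
            var open = name.IndexOf('{', position);
            if (open == -1)
            {
                // No more tags: copy the rest and stop.
                result.Append(name, position, name.Length - position);
                break;
            }

            var close = name.IndexOf('}', open + 1);
            if (close == -1)
            {
                // Unmatched "{": keep the remaining text as-is and stop.
                result.Append(name, position, name.Length - position);
                break;
            }

            // Copy the text before the tag and skip over "{...}".
            result.Append(name, position, open - position);
            position = close + 1;
        }

        return result.ToString();
    }
}
```

And the unit tests mentioned above could look something like this (using xUnit purely as an example):

```csharp
using Xunit;

public class TagStripperTests
{
    [Theory]
    [InlineData("{tag} with {bad) format", " with {bad) format")] // malformed input no longer hangs
    [InlineData("hello {name}!", "hello !")]                      // well-formed tag is stripped
    [InlineData("no tags here", "no tags here")]                  // untagged text is unchanged
    public void RemoveTags_handles_well_formed_and_malformed_input(string input, string expected)
    {
        Assert.Equal(expected, TagStripper.RemoveTags(input));
    }
}
```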