The second part of the day was not much better. I was peacefully having my red Thai curry for lunch when I looked at the monitoring screens and saw errors taking over. Horror! RabbitMQ was down and, with it, one of the most important parts of the system.
We were quick to jump into the box to restart the service and then watch the system stabilize after a few minutes.
One thing to note is "the box": it turns out that RabbitMQ doesn't work well when clustered. At least that is what the platform guys told us a while ago, so I guess it's time to verify it myself.
Once things calmed down we proceeded with the "incident autopsy". Strangely enough, none of the many alarms for disk, memory, CPU usage and the like were triggered.
I started by looking at the RabbitMQ service logs and found a warning about memory hitting the high water mark (we have added an alarm for that now). When the high water mark is reached, RabbitMQ blocks all publishing, hoping that consumers will do their job, unacknowledged messages will be removed and the queues will empty. Unfortunately the kernel watchdog was not configured properly (the memory threshold was too low) and it decided to kill the service.
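The alarm we added boils down to something like the following sketch: poll the management API and flag any node that has raised its memory alarm or is getting close to the high water mark. The host, credentials and warning threshold below are placeholders, not our actual setup.

    # Sketch of a memory-alarm check against the RabbitMQ management API.
    # Host, credentials and threshold are placeholders (assumptions, not our setup).
    import requests

    MGMT_URL = "http://rabbit-box:15672/api/nodes"   # hypothetical host
    AUTH = ("monitoring", "secret")                  # hypothetical credentials
    WARN_RATIO = 0.8                                 # warn at 80% of the high water mark

    def check_memory_alarms():
        for node in requests.get(MGMT_URL, auth=AUTH, timeout=10).json():
            used, limit = node["mem_used"], node["mem_limit"]
            if node["mem_alarm"]:
                print(f"{node['name']}: memory alarm RAISED ({used}/{limit} bytes)")
            elif used > WARN_RATIO * limit:
                print(f"{node['name']}: at {used / limit:.0%} of the high water mark")

    if __name__ == "__main__":
        check_memory_alarms()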
Further inspection of the service log showed no sign of queues building up, so the reason for the excessive memory usage became a bigger mystery.
That is when I turned my attention to Zabbix for monitoring data. The number of connections, CPU usage and disk usage were all relatively low, so it could only be memory. The graph of historical memory usage showed the typical climbing trend of a memory leak: during busy periods the memory builds up, and when it recovers the baseline is a little higher than before.
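Just to illustrate that signature (the real "analysis" was simply eyeballing the Zabbix graph), the giveaway is that the low points after each recovery keep drifting upwards. The numbers below are made up.

    # Toy illustration of the leak signature: post-recovery baselines drift upwards.
    # The sample data is invented, not real monitoring values.
    from statistics import mean

    samples = [3.1, 3.4, 3.9, 3.2, 3.5, 4.1, 3.4, 3.8, 4.4, 3.7]  # memory in GB, made up

    def recovery_baselines(mem):
        """Local minima, i.e. memory right after a busy period has drained."""
        return [m for i, m in enumerate(mem[1:-1], 1)
                if m < mem[i - 1] and m < mem[i + 1]]

    lows = recovery_baselines(samples)
    drift = mean(b - a for a, b in zip(lows, lows[1:]))
    print(f"baselines after recovery: {lows}, average drift: {drift:+.2f} GB")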
That was not what I wanted to see, but it was there, so I followed something that a great architect I worked with in the past taught me to do. Instead of googling "RabbitMQ memory leak", I had a look at the versions of RabbitMQ and Erlang that we were using and went through the version changelogs looking for a memory leak.
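(If you need to check which versions a node is actually running, rabbitmqctl status prints both, or you can ask the management API; again, the host and credentials in this sketch are placeholders.)

    # Sketch: read the running RabbitMQ and Erlang versions from the management API.
    # Host and credentials are placeholders (assumptions, not our setup).
    import requests

    overview = requests.get("http://rabbit-box:15672/api/overview",
                            auth=("monitoring", "secret"), timeout=10).json()
    print("RabbitMQ:", overview["rabbitmq_version"])
    print("Erlang:  ", overview["erlang_version"])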
Incredibly, I only had to look at the very next release of RabbitMQ to find a memory leak that had been there since version 1.0.0. We were using 3.3.5, and in the changelog for version 3.4.0 I found:
25850 prevent excessive binary memory use when accepting or delivering large messages at high speed (since 1.0.0)
Unfortunately for us (or fortunately, as now I had an explanation) our system fits the profile, with messages ranging from a few KB up to a MB being published constantly at rates of up to thousands per second.
So I had found the probable cause of the outage and the solution: upgrade or replace the box. But that would not be as simple as I thought. I went to talk with the platform guys to ask for advice on which version we should upgrade to. Their first answer was: "we install the stable enterprise version of RabbitMQ, so there can't be any memory leak". I expected that attitude, so I took a deep breath and explained a couple more times until they gave up and looked at the version changelog. But then they came back with a different excuse not to upgrade to or install a later version: "we have to adhere to the latest stable version in the Extra Packages for Enterprise Linux (EPEL) repository; if you want a later version we won't install or support it".
Well, that was unexpected: 3.3.5 dates from 2014, and they, being so "strict", prefer to live with a rather dangerous bug rather than upgrade to a newer version. In the end our team has agreed to take full responsibility for this part of the infrastructure, and we'll create and maintain these boxes with a newer version of RabbitMQ ourselves, as we cannot really wait for them to catch up.
It seems that the story might continue in the near future...