Tuesday, 19 December 2017

IIS and certificates

The other day the last "pure" DevOps engineer in our team left the company, so the rest of the team is filling that role.
One of the tasks left pending was to move to SSL termination instead of having the certificates installed on each server, and, surprise, surprise, the active certificate expires within a week of writing this post.
So instead of addressing the issue properly, we decided to just add the new certificate. For that I created an Octopus Deploy project that simply runs the following PowerShell script:

# The certificate comes in as a Base64-encoded PFX plus its password
$base64Certificate = $OctopusParameters['Base64Certificate']
$password = $OctopusParameters['Password']
Write-Host "Adding/updating certificate in store"
# Decode the PFX and load it, keeping the private key exportable and
# persisted in the machine key store
$certBytes = [System.Convert]::FromBase64String($base64Certificate)
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certBytes, $password, "Exportable,PersistKeySet,MachineKeySet")
# Add it to the LocalMachine\My (Personal) store
$store = New-Object System.Security.Cryptography.X509Certificates.X509Store("My", "LocalMachine")
$store.Open("ReadWrite")
$store.Add($cert)
$store.Close()

I had previously worked with certificates to sign and validate web service requests, and there we referred to the certificate by its distinguishing properties rather than by thumbprint, as a thumbprint reference forces a configuration update every time the certificate is renewed. That is why I was surprised not to find a way to do the same in IIS: the only way I found to bind the SSL port to a certificate was to specify the thumbprint. So I added the following step using netsh (https://msdn.microsoft.com/en-us/library/windows/desktop/cc307236(v=vs.85).aspx).

$port = [int] $OctopusParameters['Port']
$certThumbprint = $OctopusParameters['CertThumbprint']
# Log the current binding for reference
$output = netsh http show sslcert ipport=0.0.0.0:$port | Out-String
Write-Host "Previous configuration: $output"
Write-Host "Removing SSL certificate port binding"
$output = netsh http delete sslcert ipport=0.0.0.0:$port | Out-String
# Treat error 183 ("already exists") as benign; fail the step on anything else
if (($output -like "*Error*") -and ($output -notlike "*Error: 183*"))
{
    exit 1
}
Write-Host "Ensuring SSL certificate port binding"
# netsh requires an application id; any GUID wrapped in braces will do
$guid = [guid]::NewGuid()
$guidStr = "{$guid}"
$output = netsh http add sslcert ipport=0.0.0.0:$port certhash=$certThumbprint appid=$guidStr | Out-String
Write-Host $output
if (($output -like "*Error*") -and ($output -notlike "*Error: 183*"))
{
    exit 1
}
# Log the new binding so the deployment log shows the end state
$output = netsh http show sslcert ipport=0.0.0.0:$port | Out-String
Write-Host "New configuration: $output"
exit 0

Later I found that IIS provides an option to automatically rebind renewed certificates, but I didn't go for it, as that would have meant enabling it and waiting for the expiration date to trigger the switch-over without ever having seen it work. Instead, I have set up one of the lower environment boxes to use this method.


Monday, 13 November 2017

Cordova Ionic and Content-Security-Policy

This weekend I have been looking into a problem in one of my side project's mobile applications, which I have been using for years. The problematic application is written in TypeScript with Ionic/Cordova, and records some data that it then synchronises to a node.js back-end for better querying/visualisation. I have not published the app to any store, but I used it on my old Sony Xperia Z (Android 5.1) for years without problems; recently, however, I bought a OnePlus 5 (Android 7.1.1) and the synchronisation stopped working.
The error returned when using Angular's $http service was status code = 0, reason = null, which is not very helpful. All the search results for this problem suggested that the issue could be on the server side, where Cross-Origin Resource Sharing (CORS) needs to be enabled to accept requests from another domain, but I already had that in place.
Other posts suggested that the AndroidManifest.xml could be missing the line below, but again, this line was there, as it is added by default when creating a project with Visual Studio's Tools for Apache Cordova.
<uses-permission android:name="android.permission.INTERNET" />
Others pointed at Cordova's whitelist plugin, which needs to be installed and then requires a line like the following in config.xml:
<access origin="protocol://domain" />
After checking that all the configuration was in place, I started debugging the mobile application against a development server instead, and found that the requests were not "leaving" the phone: they failed all the same no matter what endpoint was configured, even with airplane mode on.
Finally I found that the web view in the latest Android versions enforces a Content Security Policy (CSP) on top of all the previous configuration. The solution was to add the back-end domain to this policy, which looked like this by default:
<meta http-equiv="Content-Security-Policy" content="default-src 'self' data: gap: https://ssl.gstatic.com 'unsafe-eval'; style-src 'self' 'unsafe-inline'; media-src *">
And now, after adding the connect-src directive:
<meta http-equiv="Content-Security-Policy" content="default-src 'self' data: gap: https://ssl.gstatic.com 'unsafe-eval'; style-src 'self' 'unsafe-inline'; media-src *; connect-src 'self' http://mybackendapi.herokuapp.com">


Sunday, 22 October 2017

Infrastructure day - part two

The second part of the day was not much better. I was peacefully having my red Thai curry for lunch when I looked at the monitoring screens and saw errors taking over. Horror! RabbitMQ was down, and with it one of the most important parts of the system.
We were quick to jump onto the box and restart the service, and then watched the system stabilize after a few minutes.
One thing to note is "the box", as it turns out that RabbitMQ doesn't work well when clustered. At least that is what the platform guys told us a while ago, so I guess it is time to verify it myself.
Once things calmed down we proceeded with the "incident autopsy". Strangely enough, none of the many alarms for disk, memory, CPU usage and the like had been triggered.
I started by looking at the RabbitMQ service logs and found a warning about memory hitting the high watermark (we have added an alarm for that now). When the high watermark is reached, RabbitMQ stops all publishing, hoping that consumers will do their job so that unacknowledged messages get removed and the queues emptied. Unfortunately, the kernel watchdog was not configured properly (its memory threshold was too low) and it decided to kill the service.
Further inspection of the service log showed no sign of queues building up, so the reason for the excessive memory usage became a bigger mystery.
Then I turned my attention to Zabbix for monitoring data. The number of connections, CPU usage and disk usage were all relatively low, so it could only be memory. The graph of historical memory usage showed the typical climbing trend of a memory leak: during busy periods the memory builds up, and when it recovers the baseline is a little higher than before.
That was not what I wanted to see, but it was there, so I did something that a great architect I worked with in the past taught me. Instead of googling "RabbitMQ memory leak", I looked up the versions of RabbitMQ and Erlang we were running and went through the release changelogs looking for a memory leak.
Incredibly, I only had to look at the very next release of RabbitMQ to find a memory leak that had been present since version 1.0.0. We were using 3.3.5, and in the changelog for version 3.4.0 I found:

25850 prevent excessive binary memory use when accepting or delivering large messages at high speed (since 1.0.0)

Unfortunately for us (or fortunately, as now I had an explanation) our system fits that profile, with messages ranging from a few KB up to a MB being published constantly at rates of up to thousands per second.
So I had found the probable cause of the outage, and the solution: upgrade or replace the box. But that would not be as simple as I thought. I went to talk to the platform guys for advice on which version we should upgrade to, and their first answer was: "we install the stable enterprise version of RabbitMQ, so there can't be any memory leak". I expected that attitude, so I took a deep breath and explained it a couple more times until they gave up and looked at the changelog. But then they came back with a different excuse not to upgrade or install a later version: "we have to adhere to the latest stable version in the Extra Packages for Enterprise Linux (EPEL) repository; if you want a later version we won't install or support it".
Well, that was unexpected: 3.3.5 dates from 2014, and they, being so "strict", prefer to live with a rather dangerous bug than to move to a newer version. In the end our team has agreed to take full responsibility for this part of the infrastructure, and we will create and maintain these boxes with a newer version of RabbitMQ ourselves, as we cannot really wait for them to catch up.
It seems that the story might continue in the near future...

Wednesday, 11 October 2017

Infrastructure day - part one

Today I had a zero-coding day; it was all about infrastructure problems.
It all started with some internal users experiencing slowness when using a service endpoint that was released recently.
The operations on the new endpoint were expected to take around 3 seconds, as they involve payloads of up to 1 MB and are very CPU intensive.
But this morning users were reporting times of up to 30 seconds. No alert had been triggered, so I ran a quick query against the logs and found that requests were being processed in under 3 seconds.
My first thought was the serialization in the client, a VBA "app", but it has been a long while since I last did anything in VB*, so I tried something different first. The network? It is never the network... but I had to check. Together with one of the developers I reproduced the issue with the application pointing to a subdomain that resolves to a load balancer, as in the production deployment; when we changed the application to go straight to one of the boxes, the slowness was gone.
And that was all the joy of it. After that, I had to go and talk to the network administrator who, of course, told me that the problem was in our applications. After three emails, one diagram and two more visits I managed to convince him that there was a problem in the network, and after looking at the DNS records and the F5 we found that the subdomain was resolving to an external network address. This meant that our internal users' requests were travelling out of and back into our network, through enough network devices and hops to cause the slowness. Finally we changed that internal DNS record to resolve to an internal virtual IP and moved on.
I have to say that the pain of having to deal with IT was more bearable because I had the official DevOps of my team working with me; it feels good to have someone bridging both worlds to investigate and overcome this kind of problem.

Monday, 9 October 2017

C# 7 value tuples and enumerables

Today I found myself fighting with Visual Studio over how to check whether the result of a FirstOrDefault operation on a list of value tuples was null. I tried everything that follows, guided by several Stack Overflow results, and nothing worked.
I created a quick test harness to try to find a "solution" to my problem:
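It was something along these lines (a minimal sketch; the element names are illustrative):

using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var items = new List<(int Id, string Name)>
        {
            (1, "one"),
            (2, "two")
        };

        // No element matches, so FirstOrDefault returns the default value tuple
        var result = items.FirstOrDefault(i => i.Id == 3);
    }
}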

My first reflex was to check if it was null, as we usually do with generic lists of reference types:
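Something like this, which does not compile, because a value tuple is a struct and can never be null:

// Compiler error: a non-nullable value tuple cannot be compared to null
if (result == null)
{
    Console.WriteLine("Not found");
}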


Then I realized that I was so used to working with enumerables of reference types that I had completely overlooked the name of this new type: value tuple. So I tried checking for the default instead: default of the value type, default of the tuple, and the bare default literal. None of them compiled:
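Roughly these attempts (a reconstruction); none of them compile, because the == operator is simply not defined for value tuples in C# 7.0 (tuple equality only arrived later, in C# 7.3):

// "default of value type" and "default of tuple": both rejected
if (result == default(ValueTuple<int, string>)) { /* ... */ }
if (result == default((int Id, string Name))) { /* ... */ }
// the bare default literal, which additionally requires C# 7.1
if (result == default) { /* ... */ }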

For this last one I also switched the project from C# 7 to 7.1, since the default literal requires it, but the compiler continued complaining.


After a bit of googling I noticed that I was not the only one with this kind of problem; it seems that the new value tuples don't play well with Enumerable just yet. The workaround I found is far from elegant and error prone, so I went back to using reference types for this case:
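For reference, this is the kind of workaround I mean, continuing from the harness above. It compiles, but a legitimate (0, null) element is indistinguishable from "not found", which is exactly why it feels error prone:

// Equals works where == does not, but it silently conflates a real
// (0, null) element with an empty result
if (result.Equals(default(ValueTuple<int, string>)))
{
    Console.WriteLine("Not found... or a genuine (0, null) element");
}

// Casting to a nullable tuple brings back the familiar null check,
// at the cost of boxing every element and a rather noisy cast
var maybe = items.Cast<(int Id, string Name)?>()
                 .FirstOrDefault(i => i.Value.Id == 3);
if (maybe == null)
{
    Console.WriteLine("Not found");
}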


Sunday, 1 October 2017

DateTime.UtcNow please

Last week at work we were migrating some servers to a new network/virtual infrastructure. The migration is going to be gradual, so this time we moved only the services on four boxes out of a pool of more than twenty.
A couple of guys in the team and I woke up early to do the deployment, and as far as we could tell everything went fine: the services on the new boxes were processing requests and there was almost no downtime.
We were not aware of it at first, but around that time one of the DevOps found that the services on the new boxes were not writing anything to Splunk, and after some investigation he found that the new Windows servers had their timezone set to BST, unlike the existing servers, which were set to UTC. So he quickly changed those servers and the Terraform template to UTC. Job done, time for breakfast...
Not quite: after a while one of the QAs noticed that the times on some of the websites were jumping back and forth by one hour. Thankfully we are still in daylight saving time; if this had happened during the winter we would not have noticed until spring, and by then the investigation would have been much more difficult because of the lack of context.
The guys couldn't understand what was going on, as all the servers were set to use UTC (we didn't know that the timezone had been changed a few minutes earlier). Fortunately it wasn't the first time I had seen something like this, so I logged into one of the boxes, checked the Windows event viewer, and there it was: the timezone change. We restarted the services, and after forcing some updates on the affected entities everything was fine again.
On the way home I remembered reading about this in the great book CLR via C#: processes and threads hold on to their own locale settings. So I wrote a small program that writes DateTime.Now and DateTime.UtcNow in a loop, as I know that writing it will help me remember quickly next time.
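The program is tiny; a sketch of it, with illustrative output in the comments:

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // DateTime.Now converts UtcNow to local time using time zone
        // information that the process caches, so a machine-level time
        // zone change is not seen by instances that were already running.
        // Illustrative output after switching the machine from BST to UTC:
        //   already-running instance -> Now: ...T10:30:00+01:00 (still BST)
        //   freshly started instance -> Now: ...T09:30:00+00:00 (UTC)
        while (true)
        {
            Console.WriteLine($"Now: {DateTime.Now:o}  UtcNow: {DateTime.UtcNow:o}");
            Thread.Sleep(1000);
        }
    }
}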

This is the output with the timezone set to BST.


And this is the output with the timezone set back to UTC; note that the first instance still uses BST, while the newly started instance has picked up the change.

Now it is time to address this technical debt: we have to use DateTime.UtcNow everywhere in the backend, and also send dates with timezone information between services and the frontend, to avoid this kind of problem.