99.9999% availability in the mud 💀

RT · Published in scalable.africa · 5 min read · Mar 14, 2021

On Monday, 8th March 2021, I noticed that our servers at gomoney had started crashing all of a sudden. We run our platform in a small k8s cluster with 5 nodes (high-performance VMs). One of the pods (an individual server that runs code belonging to a domain) had crashed about 51 times in the 12 hours since it was last restarted. This was bad, really bad. The customer success team had not reached out to complain that customers could not log into their accounts or perform certain operations, so that was a good sign (it wasn’t affecting the platform uptime). As a stopgap, we scaled each of the affected services by 2x; not that it would solve anything, but at least requests would be routed to available pods if one of them was crashing. Anyway, to the sweet stuff 😂.

The Errors

The servers were logging several different error messages, which didn’t help narrow things down at all.

MongoError: not master
MongoError: operation was interrupted
MongoError: not master and slaveOk=false
MongoError: Plan executor error during findAndModify :: caused by :: operation was interrupted

The full crash logs look like this:

At this point, we suspected it was our mongoose (a MongoDB ODM) version, so we figured we’d just upgrade to the latest mongoose version in one of the services and monitor it to see if the crash rate reduced (we set up monitoring a while back, but the config had gotten messed up; we’ll need to fix that soon 😭). Well, this didn’t solve anything; the pods were still crashing, and at an alarming rate.

Debugging The Issue

We started Googling and checking everywhere (Stack Overflow, GitHub, MongoDB docs, Mongoose docs) to see if this was an issue other people had experienced before. They had, but never in quite the same way. The next thing we did was change the way the mongo client running on the pods was connecting to the server: we played around with the connection options.

Our connection string options, which were initially set to

retryWrites=true&w=majority

were changed to

retryWrites=true&readPreference=primary&replicaSet=our-replica-set-name
retryWrites=true&readPreference=nearest&maxStalenessSeconds=100&replicaSet=our-replica-set-name
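
For context, these options live in the connection string that each service passes to mongoose.connect. Below is a minimal sketch of how that looks; the host names, database name and file layout are placeholders for illustration, not our actual service code:

// connect.js - minimal sketch of how the connection string options get applied.
// Host names, database name and replica set name are placeholders, not our real values.
const mongoose = require('mongoose');

// One of the option variants we tried; swapping variants just means editing this query string.
const options = 'retryWrites=true&readPreference=primary&replicaSet=our-replica-set-name';

const uri =
  'mongodb://db-00-00.example.net:27017,db-00-01.example.net:27017,db-00-02.example.net:27017' +
  `/gomoney?${options}`;

async function connect() {
  // Everything in the query string (retryWrites, readPreference, replicaSet, ...)
  // is parsed and honoured by the underlying MongoDB Node.js driver.
  await mongoose.connect(uri, {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });
}

connect().catch((err) => {
  // An unhandled connection error is exactly the kind of thing that takes a pod down.
  console.error('mongo connection failed', err);
  process.exit(1);
});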

These options also didn’t solve anything. At this point, I was personally convinced it wasn’t our config, because we hadn’t changed it since the initial deployment. So why were the pods crashing all of a sudden? 😫

Finding The Cause (with help though)

We set our sights on the database cluster itself. While checking the server metrics, I noticed one of the replicas was restarting far more often than it should. The reason? I didn’t know. In fact, it restarted while we were on the Zoom call debugging. Fun times.

Server Restart Chart for 3rd Replica (This week)

The red lines in the chart above represent “server restart” events. To compare that with what it should look like, see all three servers side by side over the same duration. You can see that the 3rd replica was obviously restarting way too many times.

Server Restart Chart for 3 Replicas (2 months)

It was time to call in the big guns, a.k.a. MongoDB Support 😂, as we were obviously in way over our heads. We created a support ticket and presented all of this information.

The Revelation

Apparently, that faulty replica was being affected by a BI Connector (MongoDB’s Business Intelligence connector) process that was using up a lot of the replica’s memory. In all cases, they found that the not master and slaveOk=false errors were occurring on node 00-02, the faulty replica. There was a mongosqld (the BI Connector process) running on that node, and it was using all of the available memory, leading to instability in the cluster.

They were seeing high heartbeat response times to and from this replica, and also high replication lag (it could not get the latest data copied to it quickly enough). The frequent delays resulted in that replica calling elections, because the memory pressure meant it was not able to process heartbeat requests from the other members promptly. After an election, it could render the whole cluster useless, because it could not respond to clients in time whenever it was elected as the primary/master. I saw a case where there was no primary for about 5 seconds. 5 seconds is a really long time for a computer. Omo.

Errors also occurred when an election made another replica the primary, demoting our faulty replica to a secondary. In one instance, node 00-01 was elected to become the primary, but node 00-02 was still in the process of closing all of its current connections; this also left the clients in an unstable state and, boom, another crash.
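
If you want to see the heartbeats and replication lag for yourself, the replSetGetStatus admin command exposes each member’s state, optime and heartbeat timestamps. Here’s a rough sketch of polling it from Node through an existing mongoose connection (assuming a database user that is allowed to run the command); this isn’t something we had wired up at the time, it’s just for illustration:

// replica-status.js - rough sketch of inspecting replica set health from Node.
// Assumes mongoose is already connected and the user can run replSetGetStatus.
const mongoose = require('mongoose');

async function checkReplicaSet() {
  const status = await mongoose.connection.db
    .admin()
    .command({ replSetGetStatus: 1 });

  const primary = status.members.find((m) => m.stateStr === 'PRIMARY');

  for (const member of status.members) {
    // Replication lag: how far this member's last applied op is behind the primary's.
    const lagMs =
      primary && member.optimeDate && primary.optimeDate
        ? primary.optimeDate - member.optimeDate
        : null;

    console.log(
      member.name,
      member.stateStr,
      'lag(ms):', lagMs,
      'last heartbeat received:', member.lastHeartbeatRecv
    );
  }
}

module.exports = { checkReplicaSet };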

The Solution

We disabled the BI Connector and will be provisioning an analytics node (another read-only replica that won’t take part in the election process), then configuring the BI Connector to be hosted on the analytics node instead of on node 00-02. This event made me realise how wonderful database technology is. The election process that happens in replica sets is such an important event that it can make or break your application’s uptime. Thankfully, we had set our clients to write to the primary node and read from the secondary nodes, so all the crashes happened when apps were trying to read from that faulty replica.
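
As an aside, that “write to the primary, read from the secondaries” split is just a read preference on the client side. Here’s a hedged sketch of how it’s typically expressed with mongoose; the Transaction model and queries are made up for illustration and are not our actual code:

// read-preference.js - sketch of splitting reads and writes across a replica set.
// The Transaction model here is made up purely for illustration.
const mongoose = require('mongoose');

const Transaction = mongoose.model(
  'Transaction',
  new mongoose.Schema({ reference: String, amount: Number, status: String })
);

// Writes always go to the primary; that is just how MongoDB replica sets work.
async function recordTransaction(reference, amount) {
  return Transaction.create({ reference, amount, status: 'pending' });
}

// Reads can be pointed at secondaries per query with .read().
// 'secondaryPreferred' reads from a secondary when one is available
// and falls back to the primary otherwise.
async function listRecentTransactions() {
  return Transaction.find({})
    .sort({ _id: -1 })
    .limit(20)
    .read('secondaryPreferred')
    .exec();
}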

Takeaways

What happened to our staging environment?

It is worth mentioning that this didn’t affect our staging (test) environment at all, which raised a few eyebrows. The reason: the database cluster for our production environment runs as a replica set, while the staging environment runs as a standalone node, you know, one of those 512MB “SANDBOX” clusters that MongoDB Atlas provides. That said, I think we should have run our staging environment as a replica set too (and going forward we will). Not that it would have prevented this particular issue, but you want your staging environment to closely mirror production on all layers (database, app, transport, etc.).
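
One cheap way to at least exercise replica-set behaviour (elections, “not master” errors, retryable writes) outside production is to spin up a throwaway replica set in your tests. The sketch below uses the mongodb-memory-server package; that choice is my assumption for illustration, not something we currently run:

// local-replset.js - sketch of a throwaway 3-node replica set for local tests.
// Uses mongodb-memory-server; an illustration, not our staging setup.
const { MongoMemoryReplSet } = require('mongodb-memory-server');
const mongoose = require('mongoose');

async function main() {
  // Start three in-memory mongod processes wired together as a replica set.
  const replSet = await MongoMemoryReplSet.create({ replSet: { count: 3 } });

  // getUri() returns a connection string that already includes the replicaSet option.
  await mongoose.connect(replSet.getUri(), {
    useNewUrlParser: true,
    useUnifiedTopology: true,
  });

  console.log('connected to local replica set:', replSet.getUri());

  await mongoose.disconnect();
  await replSet.stop();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});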

Don’t be afraid to ask for help. Do your own debugging first, then package up all that research for whoever you end up asking, so they can build on what you’ve already done. This is particularly important when you’re asking support staff for help, because it helps them narrow down the issue.
