"Every night I lived in fear. Sleeping was difficult, not because of nightmares but because of alerts on my cellphone saying that our API queues were growing, response times were spiking and everything was slowly falling apart.
The problem did not occur at the same time each day, which made it more difficult to debug.
Finally our super duper architect installed MMS. On the second day we used MMS, I saw the light at the end of the tunnel. There was a clear spike in the page faults indicator at exactly 3:00am. Mongo was doing everything possible to keep working, but it eventually failed minutes or hours later, which is why we never got the alert at the same time.
So, easy, no? It's a cron job. So we went to our code, looked for all the cron jobs running at that time and found one that looped through every single user on the system, but that query already had a read preference of secondary only. So what was it?
After digging into what the cron job was doing, we found that, for each user, it ran a very, very simple query that hit random places in another collection, and this was making Mongo page fault A LOT.
Eventually the paging pushed all the hot data out of RAM on the primary, and the cluster turned into a slow wagon.
We switched that per-user query to run on secondaries as well, and now we are a happy family again. I sleep like a baby now."
Our final message is that you have to work with the tools your providers give you in order to detect problems or opportunities to improve. Do not get me wrong: like any tool, it could be improved (and knowing the MongoDB people, I am sure it will be). We love MongoDB and we believe in the capabilities of MMS.
Check it out at http://mms.mongodb.com
Thanks to Julio Viera for working together with me in the development of this post.