
TribeLab Problems - An Apology

 
by Paul Offord - Wednesday, 4 April 2018, 10:47 PM
 

You may have noticed that the TribeLab Community site has been pretty unstable over the Easter period.  A couple of months ago we moved the site to Amazon Web Services.  All had been going well but recently we have had a couple of issues, and the latest proved particularly challenging.  We now understand the cause of the most recent problem and we believe we have addressed it.

I can only apologise for the downtime we all suffered as a result.

Best regards...Paul

PS: Not wanting to miss an opportunity to learn something, and for those who are interested in cloud technology, here is what happened (subject to a bit more verification work):

  • The TribeLab Community site runs in an ECS Docker container
  • The container runs on an EC2 Linux instance
  • Several of our systems (including the TribeLab site) access an EFS volume via NFS
  • We hit the EFS data transfer quota (visible in CloudWatch) - we are not sure why yet
  • NFS requests were being delayed due to throttling (found via tcpdump and Wireshark)
  • The EC2 Linux instance started to throw NFS errors into the kernel dmesg log
  • This seems to have thrown Linux into a tailspin, with two kworker kernel worker threads each consuming about 75% of a CPU (found via top)
  • This caused the instance to run out of CPU credits (visible in CloudWatch)
  • A cron job that runs every 60 seconds failed to complete within 60 seconds, so multiple instances of the cron job started running at once, further exacerbating the problem (found with ps -aux issued in the container)
We have an ELB in front of the site, and this was triggering restarts of the container (visible in CloudWatch). Unfortunately, restarting the container didn't fix the NFS issue, so the whole problem would quickly start again.
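
For anyone wondering how the piled-up cron jobs could be avoided, one common approach is to have the job take a non-blocking file lock and simply skip a run if the previous one is still going. The following is only a minimal sketch of that idea in Python, not what we actually run on TribeLab: the names try_lock and run_once, and the lock file path, are hypothetical, and it assumes the lock file is kept on local disk rather than on the EFS volume.

```python
import fcntl


def try_lock(path):
    """Open the lock file and try to take an exclusive lock without waiting.

    Returns the open file object if the lock was obtained, or None if
    another run already holds it.
    """
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(path, job):
    """Run job() only if no other run holds the lock; return True if it ran."""
    lock = try_lock(path)
    if lock is None:
        return False  # previous run still in progress, so skip this one
    try:
        job()
        return True
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)
        lock.close()
```

A cron-driven script would wrap its work in something like run_once("/var/run/example-job.lock", do_work), where both the path and do_work are placeholders. With this in place a slow run causes later runs to be skipped rather than stacked up, which would not have cured the NFS problem but should stop the cron job from making it worse.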