You may have noticed that the TribeLab Community site has been pretty unstable over the Easter period. A couple of months ago we moved the site to Amazon Web Services. All had been going well but recently we have had a couple of issues, and the latest proved particularly challenging. We now understand the cause of the most recent problem and we believe we have addressed it.
I can only apologise for the downtime we all suffered as a result.
PS: Not wanting to miss an opportunity to learn something, and for those who are interested in cloud technology, here is what we believe happened (subject to a bit more verification work):
- The TribeLab Community site runs in an ECS Docker container
- The container runs on an EC2 Linux instance
- Several of our systems (including the TribeLab site) access an EFS volume via NFS
- We hit the EFS data transfer quota (visible in CloudWatch); we are not sure why yet
- NFS requests were being delayed due to throttling (found via tcpdump and Wireshark)
- The EC2 Linux instance started to throw NFS errors into the kernel dmesg log
- This seems to have thrown Linux into a tailspin, with two kworker kernel worker threads each consuming about 75% of a CPU (found via top)
- This caused the instance to run out of CPU credits (visible in CloudWatch)
- A cron job that runs every 60 seconds failed to complete within 60 seconds, so multiple copies of the job started running at once, further exacerbating the problem (found with ps -aux issued in the container)
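
For anyone curious about the last point, one common way to stop a slow cron job from piling up on itself is to take a non-blocking file lock at the start of each run and skip the run if the previous one still holds it. The sketch below is only an illustration in Python, not what our job actually does. The function names and the lock file path are made up for the example, and it assumes a Linux host where `fcntl.flock` is available.

```python
import fcntl


def try_acquire_lock(path):
    """Try to take an exclusive lock on `path` without waiting.

    Returns the open file handle if the lock was obtained, or None if
    another run already holds it. The lock file path is hypothetical.
    """
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def run_once(path, job):
    """Run `job` only if no other run holds the lock.

    Returns True if the job ran, False if this run was skipped.
    """
    handle = try_acquire_lock(path)
    if handle is None:
        return False  # previous run still in progress, so skip this one
    try:
        job()
        return True
    finally:
        # Release the lock so the next scheduled run can proceed.
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()
```

A cron entry would then call something like `run_once("/tmp/example-job.lock", do_work)`, where both the path and `do_work` are placeholders. It is probably sensible to keep the lock file on local disk rather than on the NFS volume, so that the guard itself does not depend on the storage that is being throttled.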