Prod DB CPU spike caused by test code

The test-code APIs that support our end-to-end tests can't be accessed in production.

A release before the spike added more unit tests. Looking at the PR, we thought nothing of it.

We later found that it had driven the production DB to 100% CPU utilization.

This was due to ...

  1. An existing bug in a factory that would eagerly create records when the module was loaded.

  2. The new code loaded the buggy module when the application started due to an import dependency.

  3. We run hundreds of processes (pods × processes per pod) for the application, and each one created records at startup, so the table was filling up with test data.

  4. These extra records exacerbated the inefficient data-integrity queries that run whenever a related entity is saved, and there were many such saves.
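
The first two causes can be sketched as follows. This is a minimal illustration, not our actual factory: `RECORDS`, `create_record`, and `AccountFactory` are hypothetical names, and an in-memory list stands in for the database table. The commented-out line shows the eager pattern (a write that happens as a side effect of importing the module); the class shows a lazy alternative that only writes when asked.

```python
# In-memory stand-in for the production table (hypothetical; a real factory would INSERT rows).
RECORDS = []


def create_record(name):
    """Pretend to insert a row and return it."""
    row = {"id": len(RECORDS) + 1, "name": name}
    RECORDS.append(row)
    return row


# Buggy pattern: a module-level call like the one below runs on every import,
# so every process that loads the module writes a row at startup.
# DEFAULT_ACCOUNT = create_record("test-account")


class AccountFactory:
    """Lazy variant: nothing is written until a caller asks for a record."""

    _default = None

    @classmethod
    def default(cls):
        # Create the record on first use only, then reuse it.
        if cls._default is None:
            cls._default = create_record("test-account")
        return cls._default
```

With the lazy variant, importing the module leaves `RECORDS` empty; the first `AccountFactory.default()` call creates one row and later calls reuse it.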

Mitigation/Fixes:

  • We stopped incoming updates to the DB (blacklisted APIs, stopped background updates); however, the load didn't subside.

  • We noticed that the table behind the slow, high-throughput queries was growing in row count when it shouldn't have been.

  • We validated that it wasn't a security concern, but noticed via our ELK logs that the test code was being executed.

  • We reverted the last change and bulk-deleted the newly created test records.
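
For the cleanup step, one possible approach is to delete in small batches so that no single transaction holds locks for long on an already loaded database. The notes above do not say how the delete was actually performed, so batching is an assumption here, as are the `accounts` table, the `is_test` marker column, and the use of SQLite as a stand-in engine.

```python
import sqlite3


def delete_test_records(conn, batch_size=1000):
    """Delete rows flagged as test data in batches; return the total deleted.

    The table name `accounts` and column `is_test` are hypothetical.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM accounts WHERE id IN ("
            "SELECT id FROM accounts WHERE is_test = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # Commit each batch so every transaction stays short.
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

A smaller `batch_size` trades total run time for less contention with live traffic.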

Takeaways

  • Occam's razor applies: the last change was the culprit, a conclusion we were slow to accept.

  • The call was chaotic and lacked leadership.

    • We should have orchestrated the engineering efforts more effectively and methodically.

    • We should have facilitated clearer reporting of findings to improve visibility.

  • We have (almost) isolated test code from production deployments.
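
One way to guard that isolation is a startup or CI check that fails if any test-only module has been imported by the application. The sketch below is an assumption about how such a check might look, not what we deployed; the prefixes in `TEST_PREFIXES` are hypothetical package names.

```python
import sys

# Hypothetical package prefixes that hold test-only code.
TEST_PREFIXES = ("tests", "e2e_helpers")


def loaded_test_modules(module_names=None, prefixes=TEST_PREFIXES):
    """Return the sorted names of loaded modules that belong to a test-only package."""
    names = list(sys.modules) if module_names is None else module_names
    return sorted(
        name
        for name in names
        if any(name == p or name.startswith(p + ".") for p in prefixes)
    )
```

Calling `loaded_test_modules()` after the application has finished importing, and failing the build or refusing to start when the result is non-empty, would have flagged the import dependency described above.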