
The Setup

My sister always told me never to commit code after 10 PM. Never has that advice felt more real than in this situation. This is an important story to tell because it is an example of how disasters can be self-inflicted rather than imposed on developers from the outside.

The Night Before

It was a Thursday night, and I was leaving for New York the next day for a wedding. I had finished packing, gotten the house in order, and was winding down on work to try and enjoy the three-day weekend. Since I wasn't planning on doing much work, I decided to relax by paying down some technical debt on the flight. I had been having some trouble with my local environment, so I decided to clean up my laptop and start fresh the next day with a clean install of the client application I was working on. So I ran the cleanup script and went to bed without thinking twice about it.

The Next Morning

The next morning I woke up early to catch my flight and noticed several missed calls from the client, along with an email saying the site was down. I had inherited this site, and outages were not infrequent. I took a look on my phone and confirmed the website was down.

OK, now I need to head to the airport and fly to New York while fixing a Sev 1 outage. I explain the situation to my wife, and we load up the car and start heading to the airport. The second I get in the car I power on my mobile hotspot, open my laptop and dive into the mobile war room.

I start restarting services, but nothing works; the site is still down. At the time, the most common outage was caused by a Redis crash and could be fixed with a few restarts. This time, no luck.

I roll up my sleeves and dive into the logs, asking myself what went wrong. I frantically look through errors while trying to communicate with and calm down the client at the same time.

I Found The Problem…

I uncover in the logs that the application can't authenticate to the database. This is a new one, I think, and I start digging in. Maybe one of the files got corrupted.

The truth was so much worse than I thought. I log in to the production database…


It wasn't there. How could this be possible? Were we hacked? Did the database get corrupted? The truth comes to me in a horrifying realization. As I feel the blood drain from my body and my stomach sink through the floor, the gravity of the situation kicks in. "I deleted the production database!" This is not a joke; this is not a drill. I am now sitting at DEFCON 1, and it's my fault. That last-minute cleanup I ran the night before was on the production database. The wrong tab was open, and my mistake has now ruined thousands of people's day and potentially lost them revenue.

At this point, I want to own what I did without flat-out executing myself. I go to talk to the client, and they say the magic words: "Let's get this fixed; we can talk about what happened later." I am simultaneously relieved and terrified of what is going to happen next, but at least we are all now working toward a common goal.

Fixing the Problem

Now that I know what is wrong, it is time to start making things better. I begin by developing a plan to fix it as my wife drives us to the airport. Now that I have a clear idea, I can channel my crisis into focus. I metaphorically roll up my sleeves and start working through our disaster recovery procedure. I go to S3 and pull down the last valid backup of the database, which is several GBs, so it takes a little while to pull from Amazon. Finally, I get the latest backup onto the server and realize there is a 12+ hour gap between the backup and the data loss.
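
For anyone curious what that recovery step looks like in practice, here is a minimal sketch of pulling the most recent dump out of S3 with boto3. The bucket name, key prefix, and local path are hypothetical stand-ins, not the actual setup from this project.

    # A rough sketch of pulling the most recent database dump from S3 with boto3.
    # The bucket name, key prefix, and local path are hypothetical placeholders.
    import boto3

    BUCKET = "example-db-backups"   # hypothetical bucket
    PREFIX = "production/"          # hypothetical prefix for production dumps

    s3 = boto3.client("s3")

    # List the dumps under the prefix and pick the most recently modified one.
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    backups = response.get("Contents", [])
    if not backups:
        raise SystemExit("No backups found -- nothing to restore from.")
    latest = max(backups, key=lambda obj: obj["LastModified"])

    # Pull the multi-GB dump down to the server; this is the slow part.
    s3.download_file(BUCKET, latest["Key"], "/tmp/latest-backup.sql.gz")
    print(f"Downloaded {latest['Key']} (last modified {latest['LastModified']})")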

I immediately start communicating the news to the client: all of their clients' work over the last day will be lost, and I will have the site back up within an hour.

So I start loading the backup into the production database and… it hangs. What do I do? There is no progress bar; it just sits there on the import. So I move the import into a screen session so it can keep running after I disconnect, close the SSH connection, and head off to go through security.

After I get through security, I open my laptop back up and wait for the database import to complete. Once it does, I restart everything and get the site back up.

The Kicker

After getting the site back up, I realize there was a newer snapshot of that server that would have reduced the data loss from 12 hours to 1 hour. By the time I recognize this, it's too late and the snapshot has rotated out of retention. Besides, I already had the site up, and I wasn't going to bring it down to go through the entire exercise again. So I cut my losses, and we start communicating the situation to the users.

Conclusion

I thought this story was important to tell because horror stories are not always inflicted on developers. Developers are not perpetual victims; sometimes we are the cause of our own personal hell on a project. When I used to tell my wife that I couldn't figure out why things would always go wrong at the end of the day, she would tell me the problem was that I would stay until something went wrong. I think it is very easy for developers to have a hero complex and want to fix the problem, but we should be cautious and try not to perpetuate our own disasters.

How Do We Improve

  • First off, NEVER keep active production database connections open, and NEVER do that one last thing late at night.
  • Proper controls and automation around touching production are essential; see the sketch after this list.
  • Incremental backups are crucial to allow for point-in-time recovery.
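
As an illustration of the second point, here is a minimal sketch of the kind of guard a destructive cleanup script could include so it refuses to run against anything that looks like production. The environment variable, host-naming convention, and function names are hypothetical, not taken from the actual project.

    # A hypothetical guard for a destructive cleanup script: refuse to touch anything
    # that looks like production unless the operator types the host name back in full.
    import os
    import sys

    PRODUCTION_MARKERS = ("prod", "production")  # assumed host-naming convention

    def confirm_not_production(db_host: str) -> None:
        """Abort unless the operator proves they really mean this host."""
        if any(marker in db_host.lower() for marker in PRODUCTION_MARKERS):
            answer = input(f"'{db_host}' looks like PRODUCTION. Type the host name to continue: ")
            if answer.strip() != db_host:
                sys.exit("Aborting: refusing to run cleanup against production.")

    if __name__ == "__main__":
        # Take the target from the environment, not from whichever tab happens to be open.
        host = os.environ.get("CLEANUP_DB_HOST", "localhost")
        confirm_not_production(host)
        print(f"Running cleanup against {host}...")  # destructive work would go here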