Please just provide a short, medium or long story

  • @dan@upvote.au
    link
    fedilink
    12
    edit-2
    2 months ago

    I took down the home page of one of the top 5 websites for around 5 minutes.

    There were two existing functions that were written by a different team: An encode method that took a name of something (only used internally, never shown to the user) and returned a numeric identifier for it, and a decode method that did the opposite.

    Some existing code already used encode, but I had to use decode in my new code. Added the code, rolled it out to 80% of employees, and it seemed to work fine. Next day, I rolled it out to 5% public and it still seemed okay.

    Once I rolled it out to everyone, it all broke.

    Turns out that while the encode function used a static map built at build-time (and was thus just an O(1) lookup at runtime), decode connected to a database that was only ever designed for internal use. The DB only had ten replicas, which was nowhere near enough to handle hundreds of thousands of concurrent users.

    Luckily, it’s commonplace to use feature flags changes, which is how I could roll it out just to employees initially. The devops team were able to find stack traces of the error from the prod logs, find my code, find the commit that added it, find the name of the killswitch, and disable my code, before I even noticed that there was a problem. No code rollback needed.

    That was probably 7 years ago now. Thankfully I haven’t made any mistakes as large as that one again!

    Always use feature flags for major changes, especially if they’re risky!