This is the next article of the series describing how we’re increasing our service availability in Citymobil (you can read the previous parts here: part 1, part 2, part 3). In further parts, I’ll talk about the accidents and outages in detail.
1. Bad release: database overload
Let me begin with a specific example of this type of outage. We deployed an optimization: added USE INDEX in an SQL query; during testing as well as in production, it sped up short queries, but the long ones — slowed down. The long queries slowdown was only noticed in production. As a result, a lot of long parallel queries caused the database to be down for an hour. We thoroughly studied the way USE INDEX worked; we described it in the Do’s and Dont’s file and warned the engineers against the incorrect usage. We also analyzed the query and realized that it retrieves mostly historical data and, therefore, can be run on a separate replica for historical requests. Even if this replica goes down due to an overload, the business will keep running.
Читать полностью »