Thursday, October 18, 2018
Hand it over to someone who wants to manage it?

I don't know if this is wise from an ethical or legal point of view. We have 15 years of forum activity - 32 million posts. Can you hand that over willy-nilly? Do we need to pay a bunch of money to a lawyer to find out? Plus, what's the benefit? Why do you need all of the old posts to relocate?
Downtime

So what a coincidence that, with all that spinning around on the forums, they go down. I woke up on Saturday morning to a ton of Pingdom emails. I got to my computer to find the database wasn't responding. Discord was blowing up with everyone talking about how unfair it was to just turn the forums off with only a few days' warning, and how Garry is a selfish, childish arsehole. I checked Azure's status page and they'd had an issue with the storage system. Relieved that it wasn't something I'd done, I left it for a while - it'd come back up on its own.
Support

It didn't come back after a few hours, so I put up a "Sorry" page explaining that the database was down - a way to let everyone know it hadn't just been deleted. And I fired off a support request. Support informed me that they were working with the team in the US. Saturday turned into Sunday and still nothing. Support let me know that the ticket was moderate severity, so they were working business hours. I asked them if business hours meant "not weekends" but didn't get a reply.
No Backups

Thinking through the worst case, I came to realise I had no backups of the forum's data. If it was fucked, it was fucked. The forums would close down for real, because there'd be no way to bring them back. I'd purposely chosen an Azure MSSQL database because it backs itself up - I didn't need to do anything. But if their fuck-up fucked up the data, we were fucked. So I opened a new ticket at the top severity level. They phoned me within 5 minutes with a LogMeIn link and watched me try some stuff, then more random phone calls during the day. At about 6pm they phoned to let me know that the problem should be fixed. Later on they emailed me a post mortem:
The database first went through a reconfiguration around 13th Oct 3:26 am UTC due to a known xstore outage in the EastUS region. This outage caused the storage on this database to become unavailable and consequently caused the compute to go down. As the compute was going through recovery, one of the replicas got stuck in its transition to primary state. Due to a bug in the failure path caused by xstore outage, the transition was stuck. Restart of the compute replica mitigated the issue.
The following updateSLO/Maxsize operations were stalled behind the unhealthy compute and completed once the compute became healthy.
The bug in the failure path has been fixed and will be deployed in production in next few weeks
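The lesson from the no-backups scare is that platform-managed backups and your own backups aren't the same thing. As a minimal sketch of a separate safety net for an Azure SQL database, you can periodically export a .bacpac to a storage account you control with the az CLI - the resource group, server, database, and storage names below are hypothetical placeholders, not anything from this incident:

```shell
# Sketch: export an Azure SQL database to a .bacpac in your own storage
# account, so you hold a copy independent of Azure's automated backups.
# Assumes the az CLI is installed and you're already logged in.
# All names and the credential variables are hypothetical placeholders.
az sql db export \
  --resource-group forum-rg \
  --server forum-sql-server \
  --name forumdb \
  --admin-user sqladmin \
  --admin-password "$SQL_ADMIN_PASSWORD" \
  --storage-key "$STORAGE_ACCOUNT_KEY" \
  --storage-key-type StorageAccessKey \
  --storage-uri "https://forumbackups.blob.core.windows.net/bacpacs/forumdb-$(date +%F).bacpac"
```

Run from a scheduled job (cron, an Azure Automation runbook, whatever), this gives you a restorable copy that survives even if the service's own storage layer is the thing that breaks.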