He Locked Players Out On Launch Day - Zachery Johnson
Download MP3

[dramatic music]
[evil laughter]
VO: Arise Diablo! Lord of Terror!
VO: Though this be our darkest hour, it may yet be your greatest moment.
Zachery Johnson: This is day two and game servers are crashing. Players are finding ways to bring down our game servers. What that means is the game server that owns the lock can't really say, "Hey, I've released that." It's dead, and so we need another paradigm.
We see there's 70,000 locked characters. We're way up from the 10,000 we were at before. And 10,000 is still extreme, when I think about 10,000 people that I'm preventing from playing a game. I ran a query to get all of the affected players. Who are the players who have their save data locked by the server? And we took down a production MySQL instance.
[dramatic music]
VO: Not even death can save you from me!
Kevin Riggle: Welcome back to the "War Stories" podcast on Critical Point. I'm Kevin Riggle. We're here with Zach Johnson to tell us about an incident he was involved with, which happened during the Diablo II: Resurrected launch when he was at Blizzard. Super excited to get into that.
One tiny bit of podcast business before we get started is that we are now available, as they say, wherever you get your podcasts. There's a link down in the description. You can go subscribe to the audio version of the podcast on Apple Podcasts, Google Podcasts, Amazon Podcasts, Spotify, and a bunch of other services.
We also have full edited transcripts up there. [Hi! Thanks for reading. :) -KR] That's a very important thing. You can also, of course, find the YouTube link, which is still the main version, the video version of the podcast. We'll talk more about that at the end of the show, but I wanted to let folks just tuning in know about that upfront. With that, we will roll the titles and get started.
[music]
Kevin: All right, we're back. Zach, it's really great to have you here. Can you tell us a little bit about yourself and what got you into a place where you could break production?
Zach: My name is Zach Johnson. I am a software engineer who primarily works on the back end, so both on the software side, but also the integration points with database and infrastructure.
I'm currently at Disruptive Games, but at the time of the incident, I was at Blizzard Entertainment working on Diablo II: Resurrected. This incident was around the time of the launch of Diablo II: Resurrected.
We had run a couple of public betas where people got to actually come in and test the game and whatnot. This was the full feature launch of the game, where we had far more players than in the betas, given their beta nature. That led to the discovery of new issues due to the larger player base.
Kevin: Interesting. What's the difference in player numbers between a beta and a launch? I know very little about the tech side of online gaming, but it's fascinating to me.
Zach: I can't recall the exact beta numbers, and I don't think I could disclose those, but I can say it's at least an order of magnitude, and probably in the range of 50 to 75 times larger for the full-scale launch than the open beta we had. This was an open beta, so anybody could download it for free. Still, quite a difference in scale there.
Kevin: Yeah, quite different. I guess often open betas, they wipe all of your character progression and items and things after the period, and so, there is a disincentive to—
Seeing somewhere between 10 and 100X the number of users, a couple orders of magnitude on the day of, all landing on your head at once. That's exciting.
Zach: Very fun.
Kevin: Yeah. What was the first thing that you noticed?
Zach: This was day two, the day I took down production. Day one had its fair share of issues, as Diablo launches historically do. This was a case where on day one, we had issues, they came up, we resolved them. Nothing was to the degree that no one could play.
It was frustrating for a few players, maybe you got disconnected for some number of minutes, but it wasn't a total outage. We came off of day one feeling relatively good about where things stood. Obviously, we wished it had gone better, we wished we had load tested this better, etc., but that's all retrospective stuff.
When we came into day two, it was, "All right, let's continue working on these known issues we have. Let's try to get the game in the most stable state possible." One of those issues was around character saves. This is where we'll have to get a little bit into implementation details to fully understand what went wrong.
The important thing we wanted to ensure was that only one game server had mutable access to your save data. This prevents the problems where, if you were somehow connected to multiple game servers, there could be a last-write-wins situation, there could be item duplication issues.
The Diablo economy is very sacred to players. Given some items are super rare, being able to duplicate them and treat them like candy could invalidate someone else's hard work. The sanctity of the economy was at the forefront of our mind. By locking...Go ahead.
Kevin: Sorry. Partly for me, because I don't have a picture of the system architecture in my head. I don't think I've played Diablo, but other Blizzard games. When I logged into WoW, for example -- World of Warcraft, the big multiplayer online game -- I would connect to a particular server.
There might be a North America server, and it would have a fancy name. Some of the servers would allow player versus player combat, some would be player versus environment, some would be roleplay servers. I, naively, might think of those as being like, "OK, there's a computer in...Blizzard is Burbank?
Zach: Irvine.
Kevin: I did the same thing the first time we chatted. Blizzard is Irvine, and I might imagine that there's actually a physical box in Irvine with that name stenciled on the front, that I am talking to. That's probably not what's going on here, but that's an initial guess.
Zach: Don't quote me on this. I'm pretty sure all of the Irvine-based compute has been de-commissioned or otherwise de-prioritized, and it is largely utilizing cloud infrastructure.
Kevin: OK, cool. Moved to Amazon's web services or Google Cloud or Microsoft Cloud or whatever. Now there is a box somewhere in Oregon, plausibly, that Amazon owns and maintains, and y'all have a piece of that, a virtual machine running on that box. When I connect to Rehoboth server, I'm connecting to an actual single, probably fairly beefy, virtual machine in AWS, for example?
Zach: These were multi-tenant VMs. The game server process is pretty small.
Kevin: Oh, interesting.
Zach: We were able to get quite a few running, and there's reasons why you wouldn't necessarily want to do multiple processes in one VM. It was definitely a possibility, because the game server process is not super demanding, especially to a lot of the logic of Diablo 2.
I should also clarify this point. Diablo II: Resurrected was a remaster of the original game that came out in early 2000s, and we wanted to keep as much of that code as possible, because the original game was beloved.
Imagine that you take a game written in early 2000s and you want to run it on modern compute. It's probably going to do pretty well. It's lacking a few of our modern constructs, there's linked lists everywhere, but that's OK.
Kevin: That's of the time.
Zach: It works.
Kevin: Absolutely. All of this stuff is probably written in C++, running on Windows, I'm guessing?
Zach: It was at the time, for early 2000s. One of our projects was, "If we want to scale at a cloud level, we need to get this running on Linux." That was something that we achieved.
Kevin: A friend of mine was at [38 Studios] back in the day, and the explanation he gave me for why they were running on Windows at the time was that it was super important for the devs to be able to run the server and the client while they were doing development on the same physical workstations. These days, the ability to have devs spinning up VMs is now so much greater that like...The world has changed a lot, and so interesting that that's an evolution that's happened in your space too. Cool.
The goal then, if I am understanding what you're saying correctly is, it's very important for the backend services. There's a shared database somewhere which is maintaining the state of which person owns which magic items. All of the servers that players are connecting to are then talking to this backend database.
You've got a leader follower set up where only one of these database servers do we want to be active and actually making updates to the database. The others exist so that...because reads in this case are much more common than writes.
You want all of the reads going to read replica and only the heavyweight write activity going to the one master or the one leader database server. Is that...
Zach: You're correct on all points except for one critical thing, and that is that, you would think that reads are more common, but actually writes are significantly more common for a game like Diablo.
This again goes back to the item duplication thing. If you have an item, it is paramount that that item exists once and only once. You might think, all right, when someone picks up that item off the ground, we're going to create a save, but Diablo II allows up to eight players per game.
You could be anywhere on the map in this game, you don't need to be playing together. You could be all over the world and you could kill a monster extremely quick in Diablo. Just snap your fingers and things are dying. You could be picking up items all the time. There's plenty of actions in a game that could invoke a save.
Really the only time a game server cares about reading your data is upon your initial join, and it's going to make amendments to that existing save over time. The result of that is we were a very write-heavy game and not a very read-heavy game. There were some architecture considerations for that.
Instead of having a single source-of-truth global database, we could have regional databases that do a write-through. I'm going to write to my regional database, which is going to be far quicker latency-wise, instead of, if you're a European player, writing over to the Americas, which would be a disaster for latency.
Kevin: Oh God. Yeah.
Zach: Let's do some write-through model where we write to the regional database and that eventually propagates back. Again, distributed systems create complexity, and there's a whole bunch of complexity there in making sure that once you log out, we persist that write. Anyways.
Kevin: Sounds like fun. Fascinating.
Zach: The key insight there is that we are a write-heavy model. That is where the system is going to fall apart.
Kevin: I think also you said something that I hadn't picked up on. I was thinking of Diablo II as being like World of Warcraft in having a bunch of people, but no, you have only up to eight people playing together. I don't know how much that changes the backend model, but yeah.
Zach: This is also an artifact of early 2000s game design, because a lot of modern games like Diablo IV have moved to something more similar to the WoW model, where there's a shared hub that everyone matchmakes into, and then you run off on your own.
Those might be their own servers hosting their own instances, but to the player, there is no concept of a game or a lobby, it's an open world they could go out and explore.
Diablo II: Resurrected did not have such a fantasy, and you're going to click a game in this list and you're going to go there, and you know who the other seven people are, and you'll have fun that way. It is a different game design.
Kevin: Different trade-offs to make there, but OK. Is it the case that you can assume that generally people will be writing to the same...will be in the same regions. I guess they're all on the same server, so it's not about where the players are, it's where the server they're connected to is and...
You just have to worry about the server's connection to the backend database rather than the player's connection to the backend database, which is nice. We've got game server, we've got the backend database. We're doing a ton of writes to it.
Every time one of those little goblins goes down and drops a pile of gold, we're adding a row to a database somewhere, maybe a few rows.
Zach: It's a lot of writes, and I should clarify that. Rather than doing a row per item drop, which is the more modern, immutable-ledger style of "here are all the events that have gotten us to this state," it was just a binary object that got stored in the database. Serialize the player's inventory, shove it in there.
Binary objects are very large, and you have to consider what our database technology is going to do with possibly 8K, 16K of data that we're shoving into a row and checking for changes in.
Kevin: And unstructured data, this is just a blob. We're updating the same row every time this changes. As a software engineer, I worry about changes to serialization and deserialization formats, indexing, all of this is starting to give me heart palpitations, but OK.
This is a somewhat different model than what I'm used to from the SQL world, where everything is very structured. What I'm also taking from this is that you're not doing queries on the information in this blob.
Zach: That's the key insight. We were using MySQL, that is the backing database for this, but...
Kevin: It's very performant.
Zach: The queries we're interested in aren't about the player's inventory. That has almost never been a business logic factor, that's more of an after-the-fact business analytics factor, and we capture that kind of data through on-demand gameplay events that get sent to telemetry.
Just your ELK stack that's running like, player picked up this item. If we want to do business insights, we do it there, and then the database is strictly the true state of the player.
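[Editor's note: Zach doesn't share the actual schema, so here is a rough sketch of what a blob-style save table like the one he describes might look like. The table and column names are hypothetical, not Blizzard's. -Ed.]

```sql
-- Hypothetical sketch, not the real schema: one row per character, with the
-- serialized character state stored as a single binary column that the game
-- server overwrites on every save.
CREATE TABLE players (
    character_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    account_id   BIGINT UNSIGNED NOT NULL,
    save_blob    MEDIUMBLOB      NOT NULL,   -- serialized inventory/state, ~8-16K
    updated_at   TIMESTAMP       NOT NULL DEFAULT CURRENT_TIMESTAMP
                                 ON UPDATE CURRENT_TIMESTAMP,
    KEY idx_account (account_id)
) ENGINE=InnoDB;

-- Every pickup, kill, or trade that triggers a save is just an overwrite of
-- that one row; nothing inside the blob is queryable by the database.
UPDATE players
   SET save_blob = ?     -- freshly serialized state from the game server
 WHERE character_id = ?;
```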
Kevin: Got it. [laughs] With that systems context, with this little system diagram in our heads, someday I'm going to do these podcasts with a whiteboard and that'll make it easier to visualize.
With this little system diagram of game server database, players updating this row and the database with this blob of player inventory data very frequently. What happened, what's next? [laughs]
Zach: Let's talk about locks. Specifically, this is an abstract concept that we are putting on the MySQL model, not necessarily like a lock you would think in SQL. I'm going to read this row, let me lock these rows.
Kevin: Not using MySQL's built in transaction row locking mechanisms.
Zach: We will use that. That's ultimately what brought down production. Spoiler alert.
Kevin: Oh, great. [laughs]
Zach: The locks here are the abstract model we're applying. It's effectively another name for a mutex, where the game server says, "I am the exclusive owner of this. When I update this player's data, it will succeed. No one else can do that, such that I know that my view of the player's data is true."
Kevin: Interesting, so you're layering...There's another table in the database which is basically like, "Which game server owns this player's data and is allowed to write to it?" At least theoretically in the game code, you're supposed to check that table, make sure that it's you as the game server before you go do the writes.
I'm guessing that something here broke. I'm guessing that that didn't work. [laughs]
Zach: Luckily, the way it broke is the way we want it to. By that, I mean players' save data is paramount. It is the thing we care about most. We don't want to corrupt saves. We don't want to duplicate items, etc.
What happened on day two was that these locks were not releasing correctly. You can imagine a case where, it's C++, memory safety is tough to do.
Kevin: [laughs] Yes.
Zach: Memory safety...
Kevin: A polite suggestion at best. [laughs]
Zach: Yeah. This is day two and game servers are crashing. Players are finding ways to bring down our game servers. What that means is the game server that owns the lock can't really say, "Hey, I've released that." It's dead, and so we need another paradigm.
We do have timeouts on locks, etc. If a game server goes down and never comes back, how do we handle that case? That was considered. You have to imagine a player is clicking and now nothing is happening. They're like, "Server crashed. I'll go find a new game." They go back, they find another game, they hit play.
Their old server has not come back up. There is no way for that server to say, "Hey, release this lock because it's not back yet. It's still dead." You can imagine some cron job that's just looking, "Should I bring back the server?"
Unfortunately, they were named servers, like server one, two, three, and not something more modern where it's like, "We don't care about the server instance."
Kevin: Oh, sure. OK, so this is servers are still pets here?
Zach: Yes. It's definitely pets and not cattle. When server 100 goes down, it's going to come back up and then release the locks it had, but prior to that, it still has the lock.
Kevin: Interesting. The server actually has to affirmatively update this row in the database saying, "I'm no longer the owner of this thing." There's not like...I was sort of imagining, when you said that the locks time out, that when I grabbed the lock as the game server, I also time-stamped it.
Another game server knows that after five minutes or whatever, if that row hasn't been updated, then it can grab it itself, but no. The server, the original game server...
Zach: That was the case.
Kevin: Oh, it was.
Zach: Yeah. We had some relatively short timeout, make it three minutes, let's say, five is also a good number. I don't recall what it was exactly.
As a player, you know your game server crashed, and you know you want to keep playing, so you're going to hit play right away, like, "Find me another game." You're like, "All right, I got a game server here. Let me join," and the game server says, "No. I can't lock your data. You don't get to play."
Now a player is sitting in the front end and being told they can't play the game because their data's locked. Unfortunately, we did put the error message that says, "Save data is locked." If you go and look at the forums or the blogs at the time, we talked about locked save data all the time.
Revealing the implementation details to players isn't a huge deal, but as a developer, you see them start talking about it as if there are solutions they could offer.
Kevin: Oh, yes.
Zach: It's like, "Just unlock it." "I forgot to..." It's just one of those things where I understand why there's very vague error messages now.
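[Editor's note: to make the lock mechanism concrete, here is a sketch of how an application-level save lock like the one Zach describes could be modeled in MySQL, building on the hypothetical players table above. The episode doesn't say whether the lock lived in its own table or as columns on the save row itself; this sketch uses columns, and the names and the five-minute timeout are illustrative. -Ed.]

```sql
-- Hypothetical lock columns: a game server "owns" a character's save by
-- writing its own name into the row.
ALTER TABLE players
    ADD COLUMN lock_owner VARCHAR(64) NULL,   -- e.g. 'gameserver-100'; NULL = unlocked
    ADD COLUMN locked_at  TIMESTAMP   NULL;   -- when the lock was taken

-- Acquire: succeed only if the save is unlocked, already ours, or the previous
-- owner's lock has aged past the timeout.
UPDATE players
   SET lock_owner = 'gameserver-100',
       locked_at  = NOW()
 WHERE character_id = ?
   AND (lock_owner IS NULL
        OR lock_owner = 'gameserver-100'
        OR locked_at < NOW() - INTERVAL 5 MINUTE);
-- If no row was updated, another live server still owns the save, and the
-- player sees the "save data is locked" error.

-- Clean release on a normal exit. A crashed server never runs this, which is
-- why the timeout (and, on launch week, humans with SQL consoles) mattered.
UPDATE players
   SET lock_owner = NULL, locked_at = NULL
 WHERE character_id = ? AND lock_owner = 'gameserver-100';
```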
Kevin: It's always been a thing. The error codes, I play a lot of Star Citizen, and for a long time the 30k error, the dreaded 30k error was just, "The game server has gone away. Have a nice day."
When things started getting iffy in-game, chat would suddenly fill up with, "30k?" until either the game server recovered or you got kicked out of the game.
I don't know. I find, as a player, it's nice to have at least a little bit more context. Otherwise, you're sitting there staring at something that says 30k and you're like...
Zach: Yeah. This is my bruised and fragile ego talking, of course, because in the moment when this bug is happening and you have players giving feedback like, "Hey, I'd like to play the game," like, "I'm trying. I'm really trying."
Kevin: They're like, like you say, "If I were a game developer, I would simply not lock the save data."
Zach: "I would just unlock the save data. If it's locked, unlock it. What are you doing?"
Kevin: "Unlock the save data. Come on, guys. What are you doing?" Distributed systems are hard. Well, I hope that this podcast maybe helps explain to some players why it is not as simple as unlocking the save data, a little bit of what's going on behind the scenes.
OK, so players are starting to see, they're starting to get kicked out of the game or the game stops responding. Then they go to log in again and they see "save data locked" and they can't continue playing.
Zach: There's no action they could take to fix this. There isn't like a, "OK, unlock me, roll back my save data. I don't care that I lost a little bit of progress." It's, "You can't play until we unlock that save data."
This was our dilemma to solve on day two. This happens, and it's unfortunate, but servers are going to crash and if you wait five minutes, you'll be unlocked and you get to play. Except some players weren't being unlocked. We saw that there was a growing number of people who were stuck in this locked state.
At first it was a thousand, and then it was 5,000 and then it was 10,000. You think, "These are players that are unable to play, period, and we can't tell them to wait it out." There's some bug here we don't understand fully.
Kevin: How did you notice this? Was it, the forum started filling up or what was the first thing that got people's attention?
Zach: The forum is a big one. Blizzard has a very good customer service department where there are people actively monitoring Twitter. Bless their hearts. That job must be miserable. They're getting player feedback and they're passing it on to the game team, like, "Hey, these players are complaining about this. What can we do about it?" We saw the volume of that increase.
Kevin: I forget, I don't think we actually said what year is this?
Zach: Oh yeah, this is September 2021.
Kevin: 2021, so this is very recent?
Zach: Very recent, yep.
Kevin: Twitter is a thing and you're getting this feedback, basically in real time, and the internal customer support processes are able to start feeding this to the devs in basically real time. OK, that's nice. You don't have to wait like we did in the old days until that forum thread blew up and the forum moderator sent an email out and, and, and.
Somebody from the social team, raises an incident in Slack basically and everybody starts working on it.
Zach: We got feedback even quicker, which is funny. Blizzard makes video games, and video games are streamed on multiple websites. There's YouTube, there's Twitch, there might've been Kick at the time, there's a lot.
Kevin: It's become huge.
Zach: Yeah, exactly.
Kevin: Who knew that watching people play video games is actually a lot of fun, and...
Zach: And I do.
[crosstalk]
Kevin: successful business. I do too.
Zach: Yeah, I watch streamers all the time. I watch e-sports, etc. When you launch a game, you boot up Twitch...
Kevin: When we met, I was in Las Vegas for Black Hat, the cybersecurity conference, but you were in Las Vegas for the Evo fighting game tournament which was happening at the same convention center. It was super fun.
Zach: It was across the hall.
Kevin: Figuring out who was there for which conference. It was great, and it made connections like this happen. Are you, internally, watching Twitch streams...
Zach: Oh yeah.
Kevin: ...of people playing? OK.
Zach: Blizzard was sponsoring certain big streamers to get them to play the game and whatnot. I had already...
Kevin: Oh, no. OK. Well, that's even better. Or worse.
Zach: It gets worse, but, there's streamers that I enjoyed.
As we developed the game, people were just like, "Oh, I've played Diablo 2 for years. They're remastering it. How cool." Those people are the passionate ones. Those are the people you're making the game for, really. This is their childhood we get to bring back to life and play with other people.
Not to Zoomer anybody out, but I was four when the game came out, I believe, originally. My mother isn't letting a four-year-old play Diablo. It's a very gory, bloody, demonic game. To be able to experience that and play the game they remember is super cool.
I wanted to see the reactions on Twitch, of people smiling and laughing and enjoying the game. I boot up a Twitch stream and it says "save data locked." It's devastating.
Kevin: You're like, "Oh, not only is this bad for them and these people I care about, this is my fault." [laughs]
Zach: Yes, "It's my fault. I got to go fix this." The way we established, not necessarily the root cause, but a symptom, was to run a SQL query and see what locks have expired, as in, are older than five minutes and are still being held, like, we haven't cleared out that row or reset the timestamp, etc. When we ran that query periodically, it was growing.
Our solution is, "We don't know why these aren't being released, but we can manually release them until we fix that bug. We're going to update the rows where the lock timer is greater than that five minutes and they haven't been released, and then players, if they retry again, should be able to get into games." That was our plan. Every X minutes, someone go check it and manually release them.
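[Editor's note: a sketch of what that periodic check-and-release routine might have looked like against the hypothetical lock columns from the earlier sketch. These are illustrative queries, not the ones the team actually ran. -Ed.]

```sql
-- Symptom check: how many characters are stuck behind a lock that should
-- already have expired?
SELECT COUNT(*)
  FROM players
 WHERE lock_owner IS NOT NULL
   AND locked_at < NOW() - INTERVAL 5 MINUTE;

-- Manual release: clear only the stale locks, leaving live sessions alone.
UPDATE players
   SET lock_owner = NULL, locked_at = NULL
 WHERE lock_owner IS NOT NULL
   AND locked_at < NOW() - INTERVAL 5 MINUTE;
```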
Kevin: Oh, wow. Somebody's sitting there with a SQL console open, logged into the...
Zach: Production database.
Kevin: ...main production database as an admin user every five minutes being like update owner for where all, whatever. We've all been there. This is having to do it every five minutes is a little bit more aggressive.
Zach: It's a little extreme. It was something that was scriptable. We didn't dare script it because doing that in a production database is real scary.
Kevin: Not in the moment. We can patch stuff over, especially for a live thing, for a live service. We can patch stuff over with somebody just pressing enter for a while. Eventually we'll give them a drinking bird that can do that automatically, and then we'll write a script and then we'll go on from there.
Zach: That was the play. We're going to do this periodically and we're going to let people play and whatnot. We were doing that as early as day one, because we saw that issue happening. We caught it in the second half, like, "This is something we'll address tomorrow on day two."
I log in for day two, good old 7:00 AM on day two of our product launch. You got to be in pretty early for your shift, and we see there's 70,000 locked characters. We're way up from the 10,000 we were at before. When I think about 10,000 people that I'm preventing from playing a game, that's extreme, but 70,000.
Kevin: 10,000 is more than lived in the town that I grew up in. [laughs] It's not a city, it's a town, but it's still a lot of people.
Zach: When you think about holding the happiness of a small town in the palm of your hand, it's like, "Ah, I need to fix this."
Kevin: "I need to fix this."
Zach: Day one, we were pretty cautious again like you mentioned. Doing these queries on a production database is dangerous work, so we ran this by our database experts, DBEs. We're like, "Hey, here's the queries we're going to run, please give them a look over and then we'll run them."
Day two, my ego had gone unchecked for all of day one, because we were fixing problems left and right and it feels good to fix problems and see people play and etc., etc. Day two we're like, let's keep trucking on, let's do what we did yesterday. I ran a query to get all of the affected players. Who are players who have their saved data locked by a server?
I was going to put this in a temporary table, and then we could start developing a script that does this unlock based on this table of data we have now. That's a temp table and not the main table. I wrote the MySQL query, select star from players where blah, blah, blah, etc., etc., and then I ran it, and then it didn't respond, and then it didn't respond.
Then a bead of sweat rolls down my face as I hear someone else on the call say, "Oh, I just disconnected from the game." My query had locked every row in the players table in order to create that temporary table. The game servers that are trying to save out find those rows locked; they're failing to save.
Now the database is getting overwhelmed, and it's a total system outage as game servers can't write. There was exponential backoff, but it was not an effective exponential backoff. The database is getting repeatedly slammed, and we took down a production MySQL instance.
Kevin: This is not getting locked at the game server level. This is not getting locked in this lock table that we have. This is the underlying MySQL locks. I'm curious why it wound up locking those rows if you were just reading from them to create the temporary table. That is surprising to me.
Zach: I'll mention that I did not get this query signed off by our DBEs, which is an important distinction to make. This is the first time we broke from procedure and did not run these queries by. It was 7:00 AM, I wanted to be helpful. I was slightly less than helpful.
Kevin: [laughs] Just slightly.
Zach: Slightly less than helpful. There was another incident in the betas that I didn't mention, but we found another bug in MySQL involving too many concurrent connections, which is a really fun one to solve, but...
Kevin: That's also very fun.
Zach: Yeah, because you imagine we have lots of game servers that all need to connect to this global database. There is a limit, we found it. Story for a different time. In that moment it's like, "This query I ran, took us down. It caused major contention of the database, we've gone dark."
[crosstalk]
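[Editor's note: Kevin's earlier question, why a query that only reads rows would end up locking them, doesn't get a detailed answer on the call, so here is one plausible mechanism, offered as an illustration rather than the confirmed cause. In InnoDB, a plain SELECT is a non-locking consistent read, but statements that write while they read, such as INSERT ... SELECT or CREATE TABLE ... SELECT, take shared locks on the rows they read from the source table under the default REPEATABLE READ isolation level. Against a huge, write-hot players table, those shared locks would block every game server's save UPDATE for as long as the copy ran. The WHERE condition below is a stand-in for the real one, which isn't given in the episode. -Ed.]

```sql
-- A plain SELECT reads a consistent snapshot and takes no row locks:
SELECT * FROM players
 WHERE lock_owner IS NOT NULL;          -- hypothetical stand-in condition

-- But a "read" that feeds a write can lock what it reads. Under MySQL's
-- default REPEATABLE READ isolation, statements like this take shared locks
-- on every source row they scan:
CREATE TEMPORARY TABLE locked_players AS
SELECT * FROM players
 WHERE lock_owner IS NOT NULL;

-- While those shared locks are held, the game servers' save UPDATEs on the
-- same rows queue up behind them, and on a write-heavy game the backlog
-- builds almost instantly.
```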
Kevin: What did you do?
Zach: In the moment, the only thing you can do is be 100 percent honest like, I ran this query and this is what happened. How do we get back to a point where anybody could play, period. This involved paging a DBE right away. There might have been one on the call at the time, but page out, create the incident.
We got the instance back online, SQL instance, and we saw game servers successfully connecting again and whatnot. Through standard operating procedure, everything self-healed to where players could occasionally jump on and play a game and we're seeing some sort of recovery.
The fallout is where things got interesting because we started with 70,000 locked characters. After things came back up, I believe we were at around 250,000 locked characters. Basically anybody in the game at the time who got disconnected is now in this garbage state that we need to figure out how to resolve.
Kevin: You've run the query and it's locked the entire player database basically. All of the game servers are like, can't talk to the database, good luck, bye. Are they down or are they just like...
Zach: That's a really good question. I believe our logic at the time was a crash. I believe the game servers crashed rather than stayed alive, waiting for the connection to come back. I'd have to dig through the archives there, but the end result for players was the same in that nothing they were doing was working. The saves were not happening.
Kevin: The database is actually also crashed. The MySQL process is just...
Zach: Has stopped running.
Kevin: Interesting. That's also surprising to me. Game servers are crashed, MySQL process is crashed. Somewhere there is a little watchdog timer, which is going to go restart these things, but the DBE can also come in and just go prod the MySQL process and start a new one.
Zach: I believe the actual recovery process there was we switched our replica database to be the primary because...
Kevin: Great. Oh yeah.
Zach: ...primary is long dead, and at least we could recover more quickly and not fiddle with starting up a process and maybe have some room to...
Kevin: Not have to deal with reading all of that data off of disk to get started...
Zach: Correct.
Kevin: ...which is why we have replicas in the first place.
Zach: [laughs]
Kevin: The failover there happened properly then. Whether that's a manual failover, an automatic failover, you go from the primary to the replica. Replica is now the new primary. Congratulations on your promotion.
[laughter]
Kevin: Now get me some fucking air support. The game servers can now connect to the new primary, but pretty much everybody in the game now is in this locked save file state. OK.
Zach: A few thoughts there. One, once players can start reconnecting, that feels good. That's a sign of recovery. Two, things are worse than when I started.
Kevin: Yes. [laughs]
Zach: We still have the locked character issues, and now there's more of them. How do we actually want to address this?
Three, this is a personal one, but I am responsible for all of this. This is also the most major screw up of my career, bar none. My actions have caused hundreds of thousands of players to feel worse. That's not why I wanted to make games or be a software engineer. I want these people to be happy.
In that moment, I am feeling a devastation I have not felt before in my professional career. Like, "This is my fault." Once things have stabilized and we know what the situation's like, I need to take a walk.
I asked the war room, "Hey, I am more emotional than I am rational right now. I need to step away, but I'll be back to help." In hindsight, that was a very healthy thing to do, and I'm glad I did. I did two laps around my neighborhood, took some deep breaths, grounded myself. I was like, "OK, I'm ready to contribute again to the problem we have."
That's when you go to the incident leader, like, "How can I be useful?" It's no longer that I want to be running the queries. I'll do it if I need to, if I need to be the one to execute them, but I want to be helpful first and foremost, so how do I be helpful?
Kevin: Good. That is very smart. So much of incident response is about managing, as a subject-matter expert, my own emotional state and keeping it in a place where I can be productive and useful. Also, this is 2021, so this is fully remote. You're working from home. Everybody else is also working from home.
Zach: The Blizzard operations team was on site. I believe the only people approved to be on site were the people who monitored the global operations. Yes, everyone else was fully remote.
Kevin: Blizzard has basically a software operation center we would call it in...
Zach: In Irvine.
Kevin: Irvine. Cool. That's fascinating. Talking with them sounds like it would be a...
Zach: They have some stories. I hope you can find an operations center person.
Kevin: I used to work at Akamai. I forget if we talked about this. Our network operations folks there were...I worked a lot with them. We actually had follow the sun coverage, and we had the big mission control space in the office in Boston that...It was a little bit for show, but it was also real.
That was the way that we sold people on working at Akamai was like, "You can be part of a thing, an organization which does this." We were delivering a quarter of the traffic on the web in that era. That was like yeah, you can be mission control for a quarter of the web.
At Blizzard, it's, you can be mission control for WoW and Diablo II and all these other games that people love and enjoy. Most of the people on this incident are remote. Is there a Zoom call? Is there a Slack channel? How are you coordinating?
Zach: For a launch in particular, we had a war room, which is just a Zoom call with stakeholders: that could be our immediate team, and then the operations center, and then C-suite level if they were interested. That typically doesn't happen until you bring down the game; then they're really interested.
Kevin: Yes, indeed. Oh, because this is a launch, this is a thing that you set up from the beginning. You're like, "We know that we're going to need to coordinate, we know that stuff is going to break, so we're going to be on the call." This was at least already something. You didn't have to wake people up to get people out of bed.
Zach: Exactly. This was 7:00 AM, so some people did not like being in the call that early, but it was a known factor. Certainly the time to respond is significantly shorter when you have the Zoom call already running when you make the mistake.
Kevin: Exactly. Yes, cool. Personally, I always prefer doing this stuff over audio rather than over Slack.
Who is the incident commander? Who's running the incident or the launch?
Zach: It is a rotation. There's someone from the various tech leads across Blizzard, and not even in Diablo, but every team. You could have the World of Warcraft tech director doing your incident, and that's to get operational knowledge across and have people be less siloed, which is, in my mind, a very good idea.
Kevin: Oh, it's incredible. Yeah, you can't do stuff at this complexity without it, I believe.
Zach: I forget who was the incident manager at the time, but a tech director who was trying to ask the right questions to get us to resolution.
Kevin: Senior enough in the organization. Yeah, good.
Zach: Leading on our team side was myself, given my knowledge of the systems we had, but also my manager, who was the more formal POC for, "Hey, how can we do this," etc. I was doing more of the grunt work than the directional, "Let's choose to do this, let's choose to do that."
Kevin: Got it. Your manager is probably also coordinating with other teams as necessary to get work to happen, make things go.
Your team is responsible for, not the databases directly, but the game servers' interaction with these databases?
Zach: When I say my team, I'll refer to the server engineering team. That was about six or seven people, and then SREs as well, very important. Those folks are monitoring health, and we're more so monitoring... I guess also health, but not system health, but gameplay health and...
[crosstalk]
Kevin: Right, application health versus the underlying health of the computers.
Zach: Yeah. My role in that incident was, "Hey, given this bug, what do we think the cause is? How can we investigate it? Let's get more info on that." This is very early in my career—I'm saying that still very early in my career, but I think...
Kevin: Two years later, sure.
Zach: Yeah, this was my second full year in the industry. I was certainly out of my depth in terms of depth of knowledge, but I knew our architecture pretty well. I could tell you what's likely going wrong. When these things happen, I'm throwing out solutions, and running queries before getting approval, and run-of-the-mill stuff like that.
Kevin: You go take a lap around the neighborhood. You go talk to the incident manager, be like, "I've got to step away. Take a deep breath, see the sky, touch grass, as the kids say, and come back in a frame of mind to be helpful." Here you are back.
Zach: Back to the desk. The first thing I observe is we are actively working on the scripting method to release these locks now, given that releasing 250,000 manually would not necessarily be wise.
Given that I had just introduced major lock contention with a single query, we were looking for a slower-burn way to release all of these locks, one that we could leave passively going as we investigated the root cause, which was somewhere in the game server.
Kevin: Given that, what was supposed to be a non-destructive, non-locking read query has just locked all of the rows and taken the game server down hard, and the database down hard.
Doing a, not destructive, but mutating, updating all of these rows in the whole player table is also probably going to lock all of those rows. It is definitely going to lock all of those rows and it's going to take the game server down hard again. Let's not do that. That's smart, yeah.
Zach: We decided, "Let's commit to doing the scripting strategy." Forced our hand on that one, but at this point, let's commit to it. Then I gave feedback there, like, "Here's what I did, here's the query I ran." I don't remember it offhand now, which would have been helpful info, but, "Here's what I ran. Here's how we could approach doing the scripted solution."
I then moved over to the C++ side and said, "Where is this bug? Let's try to get a software solution to this so we don't need to be running database queries anymore because this is madness."
Kevin: We can put the fire out and we can set up this process to patch stuff over every five minutes, but we got to solve this longer-term so that we not...The fire isn't just out, but the building is actually patched up and structural again.
Zach: The rest of my Tuesday was wallowing in guilt and simultaneously looking for the solution that gets us to a better place. That solution came on Wednesday, the day after, when we found, like, here's...
Kevin: Day three.
Zach: If it's not the root cause, it's certainly a bug and we could fix this.
Kevin: One of the contributing factors we could say in my lexicon. What was the bug?
Zach: It had to do with string comparison and...
Kevin: Classic. [laughs]
Zach: Especially too, I forget what encoding we were using for the database in particular. I'm sure it was the one that gets thrown in examples all the time, I'm sure that was the one we were supposed to be using.
Kevin: UTF-8?
Zach: There's some specific extension to that. I'll have to find it. I'll Google it. There's some encoding we were using on the database which was correct. Then what the game server was sending was slightly different variant because at the time we were not allowing Unicode player names. It was ASCII only, which was a Diablo II restriction.
Kevin: Sure. Right. Early 2000s, convert them all into ASCII.
Zach: That did eventually change because we felt like, "Hey, let's let people communicate and express themselves in languages they prefer."
Kevin: We have players outside of the US now.
Zach: Let's think globally a little, but we...We made that change later, and while we were working on the bug fixes it was like, wait, this string comparison just won't work in certain cases.
Kevin: Interesting. If it had been a pure ASCII to UTF-8 comparison, I believe that it should have worked. Now Windows defaults to UTF-16 as its native string representation for reasons. Especially if you were doing the comparison byte wise then that would break if...but that shouldn't have broken. I don't know.
I implemented Unicode support as my first project out of school. I have all of these details locked away in neurons that I access only under duress, but... [laughs]
Zach: You read 'codepoint' and it just all comes flooding back to you.
Kevin: It really does. There's a string comparison error issue where under certain circumstances, certain players' names like...and that's interesting. The server can correctly write their names or something. This comparison is happening correctly on the write of this lock, but it's not happening correctly on the release of this lock.
Zach: I believe when I say string comparison, obviously, the failure is happening because we can't associate this player with this row and unlock the data, whatever. I believe it was something due to the size of the buffer in certain cases, etc., where we're truncating off just the last few characters, whatever. The exact details are lost on me.
Kevin: Yeah, that's fine.
Zach: The ultimate cause there was, OK, this of course is a very simple fix to unlock these players once we go looking for it and run with a debugger, etc., etc.
Kevin: Straight up, there is a bug in the game. The ultimate conclusion here really is like where you started, there's a bug somewhere in the game, which is not releasing these locks. We go find where that bug is and we go fix that bug, and now the game starts releasing the locks.
Zach: This was a thing we caught in beta and we assumed as risk, because we need to release a product, we put out the date, etc., and we're accepting this risk, we know it can happen. We didn't think it would happen on the scale, which is why the bug wasn't as prioritized.
Then it happened, we prioritized the bug, and then things got fixed. That's game production in action. Typically we skip the part in the middle where we take down production.
Kevin: We try, but obviously not always. In the beta, you were seeing this in, one, two, three players out of the...what was it? 10,000... had this issue. [laughs] There are many issues which are affecting 50, 100, 250 players, which you're like, "We got to fix these first. We know that these are going to cause problems.
The ones that are affecting 1,000 players, we can't launch the game without fixing them." You're just burning down based on observed incidence of these issues. Once you get down to a certain level and the release date is next week or two weeks from now, you're like, "We've got to make the launch commit, we got to get this out."
Then you encounter the real world and it turns out that of the 190,000 more people who are suddenly showing up to your game now, like thousands of them have whatever underlying quirk of reality that causes them to hit this failure case and oops. [laughs]
Zach: Along with the lower player count of beta, we hit very few server crashes in beta, which is partly, OK, between beta and launch, we made some changes that were unstable. We've got to review our code review process because the volume increase of that is unnerving.
Kevin: That's concerning.
Zach: Then additionally, the beta only went up to character level 30 of 100 total in the full game, and only two of the acts, and there's five acts. There's significantly less content to be explored to reveal crashes. To a certain degree we were expecting more crashes, but it was a thing we had to figure out, that's for sure.
Kevin: You were hoping for this stuff to wait until people hit character level 30, which would take them at least what? A few days?
Zach: [laughs] You'd think, but man, the speed runners for this game could get to level 80 in about six hours.
Kevin: Oh wow.
Zach: I'll say that it's not a linear scale from 1 to 100, it is exponential, but still the...It's been out for a while and they know all the tricks.
Kevin: That's true. They've had 15 years or something to go figure out all of the shortcuts. You knew that you were going to hit this stuff at least, but also, I don't know, I suspect that the speed running community understands that if the beta only goes up to level 30, six hours in, they're at level 60 and then they hit some bug, they're like, "Yeah." They know what's going on.
Zach: Yeah, and the locking issue, this was one of the, "We've got to release these locks," but people having their accounts locked is just an unpleasant thing.
Because if you're playing and the server crashes and you have to wait for it to come back up and you don't know when Blizzard brings the servers back up to unlock your stuff...It was poorly architected from the beginning, is what we've learned over the months.
There's actually a few blog posts that we wrote to explain our technical changes that were coming to save data to fully eliminate locks and whatnot. Due to a bunch of hard work from other engineers on the team, the lock system got completely overhauled.
It's significantly better today, to where you could get into games super quick, and you could get into games so quick that through strictly improvements to our infrastructure and architecture, there came new ways to play the game...
Kevin: Oh, interesting.
Zach: Back in old school Diablo 2, you could only log in 20 times per hour, which seems very high, right?
Kevin: Yeah.
Zach: There's certain enemies you could kill that give a lot of XP, and then you leave the game and then you rejoin and you just kill that one enemy over and over, and so...
Kevin: OK. You can basically farm the...
Zach: Yeah.
Kevin: OK, great.
Zach: By improving our lock system, we said, "Why don't we raise the number from 20?" We could do really interesting stuff if we don't put a limit on how many times players could join a game. It was so successful that players completely changed their strategy and killed these different enemies over and over and over, and they got players to help them kill those enemies faster.
The time it took to get from level 1 to 100 was completely shattered because we made infrastructure improvements. Very interesting case, but players were upset with us, so we had to make a few changes to balance that out. I thought it was interesting.
Kevin: Yeah, fascinating. When you talk about the sanctity of the economy, and if players can farm these ultra-rare items, then it reduces the value, all sorts of secondary effects.
That's fascinating, though. The sort of second and third order effects of what don't feel like game changes. They feel like, "Oh, that's just infrastructure. It's just the stuff that supports the..." but it has an effect, too.
Zach: It felt weird that we removed a 20-year-old gameplay restriction. The community is like, "Put it back." I understand most people aren't joining 20 games in an hour, but if you do really enjoy killing that one enemy, I think we should let you.
Kevin: Sure. [laughs] I joke a lot about stuff like that when we talk about login restriction, rate limiting on account logins to make it harder in the security world for people to do password testing, for example, where some hacker knocks over a Minecraft forum and gets all of the passwords in clear text because it's a Minecraft forum.
Then is like, "Hmm, I wonder who used the same password for this Minecraft forum that they used on their bank account," and goes and runs the email addresses.
Zach: They do.
Kevin: They do. A surprising and unfortunate number of people do, and we're working to fix that, but... we've got more work to do. It's very lucrative for these hackers. The banks have largely got this figured out now, although they didn't 10, 15 years ago.
Anybody who stands up a new web service that has its own login form, eventually realizes that like, "Oh no, somebody logging in 20,000 times in an hour, that's not a human user, that is a bot." We should probably not let them do that because anybody who can type that fast can wait. [laughs]
That's interesting that, that was this. The community dynamics around that are different in the game world than they are in the more businessy oriented world where people will find different ways to exploit that in the game. Cool.
You found this bug, the database servers are back up. I guess you fix the bug, you roll out the fix to production, the game servers start releasing the locks. Was there any other cleanup work you had to do on the incident?
Zach: We preempted the cleanup by nature of having the script running as soon as I got back. If we hadn't committed to that work, then yes, there would have been quite a few locks to clean up prior to rolling out that game server fix the day after. I'm trying to recall if there was any other fallout. I don't believe so.
One thing we were really concerned about was, did we duplicate items? Are we going to have to reset everybody back four hours, let's say? Because if we've corrupted the economy in any way, this is untenable. We got to just fix that. This is two days after launch. There needs to be trust in the economy of this game.
We were all good there, but that did take some work to figure out, like, "Let's go through our telemetry and see, does anyone have a wild amount of something that they should not have?" and all good there.
Kevin: That's where you can go to those events that get kicked off by the game server when somebody kills an orc and picks up the pile of gold or the ultra-rare item and look at that, those logs and be like, "OK, yeah, we don't see somebody spamming ultra-rares on..." That's good, and you don't have to do that at the database level, which is also nice, because that would be another full table scan...
Zach: Very full table scan.
Kevin: ...and have to get into that binary blob that we were talking about.
Zach: If we wanted to do any operations on the item, that blob would have been long gone. Not even in discussion.
Kevin: Interesting. The script has been running, the script has been releasing the locks for the game server, roll the game server fix out. It sounds like rolling the server out was a fairly quick process. There wasn't a ton of QA steps or anything that had to go through to get that out the door.
Zach: Yeah, we had a good repro internally. We knew what bad data looks like. We have a bunch of rows of what looks bad. "Let's copy those into our development database. Let's recreate it." We were pretty quickly able to recreate the scenario in which the server did not unlock the data, and then test with the new version it does unlock. "All right, this is a new release. Get it out ASAP."
We did do a rolling deploy. At the time, I'm not sure if things have changed, it wasn't a super technologically advanced rollout. It wasn't a containerized server. It was just a binary we drop on a box and...
Kevin: Oh, interesting. Oh, wow. 2021, and there's no CI/CD. Somebody is uploading a...
Zach: There were CD pipelines to get it out there.
Kevin: That's continuous delivery. Sorry. My parents watched one of the first episodes, and were like, "we didn't understand all the acronyms." I'm trying to break stuff up. Yes.
Zach: That's a good call out. I'm sure I've used a lot in this episode.
Kevin: It's part of the work, and people do pick stuff up from context. There is some continuous delivery stuff.
Zach: There was a job that was, "Take this build and put it on our game servers." The way we had to roll it out was not as easy as, one push, take down all the servers. I believe our game servers at the time were not... Our services were red-blue, or red-green, or blue-green, or whatever color pair you would like to use.
Where we run on one version, and then when we roll out a new version, it gets deployed as inactive, and we make that active such that there's no downtime. We did not have that for game servers, so there was a separate process where we had to drain, and then once it hit zero, restart, and it was very manual, very tedious.
Kevin: That's drain user connections. That's like, "No new users are allowed to connect to the server. All new user connections go to a server that's running the new build of the game server. Then, once everybody finishes playing and logs off from the old server, we kill that server process. Now only new connections are being made to the new code and we can see that that's working." I suppose.
What are you monitoring to check that this isn't recurring?
Zach: The query we ran initially, before my snafu, was just the, "Check locks older than five minutes. If that is happening, there's still a bug here." Otherwise, for game servers, if there's a lock that is older than that, anyone else has free rein to, "This is an inactive lock, go ahead, go for it." We did notice that decrease.
It was not fully gone, but we couldn't reproduce the same scenario, locally, so it seemed to be a different bug.
The rollout is the tricky part for the game servers because, again, pets, not cattle. These game servers were special.
Once you hit zero, we had to bring it down and restart it on the new version, which was a confusing process where it's like, "Stage this binary, but don't run it yet. Then once you restart, forget about the old binary."
It was a little bit complicated to do that dance. That was certainly a retro learning, which is like, "Man, it sucks to deploy new versions." It should be fast, right?
Kevin: Yeah.
Zach: We have a bug we want to fix. We have to tell players, "Hold on, we're slowly restarting them all." They don't love that.
Kevin: There's some poor SRE in the back who's functioning more as a sys admin, who is logging into all of these servers and typing a bunch of console commands. It sounds like hoping that they get the console commands in the right order and they don't fat finger anything. That is a thing that we can build tooling around to automate and manage.
Zach: There is a very excellent command and control interface at Blizzard, where you could get all of your servers in one web view. It was very easy to send off these commands to multiple boxes. The RegExes we had to construct to selectively pick these game servers were nightmarish.
Kevin: Oh, God.
Zach: If you want to talk about fat fingering there, fat fingering a RegEx and having to just...RegExes are already unreadable, but trying to piece out which part is wrong, and RegEx 101 is constantly bookmarked and open, and it's like, "Whoof."
Kevin: We can offload that work to a computer, which is better at doing it than trying to do this kind of...It's not quite bitwise string comparison by eye, but it's definitely...
RegExes are regular expressions. They're a tool for doing very sophisticated searches over text, basically. We use them all the time in software. They've been around for a long time. They're deeply arcane. They look like somebody was just mashing on the keyboard and all of that keyboard mash has meaning.
If you put a slash in the wrong place, or you don't count quite right, or any of a hundred other kinds of issues, you can wind up selecting all of the things you want, or none of the things that you want, or a different subset of the things that you want, which you did not want. No fun to do that while people are over in Twitch chat being like, "Well, game's still down." [laughs] "Wonder when Blizzard's going to get around to fixing that."
Zach: "I'm working real hard on my RegExes, guys, I promise. Any second now."
Kevin: Any second now. Just got to run it through RegEx Tester one last time. [laughs]
One of the learnings was better automation tooling around doing staged rollouts of these kinds of fixes.
Zach: Yeah.
Kevin: Is that a thing that the newer code bases, newer server architectures have put into place, and this was a result of a lot of the code and server architecture being pretty legacy, or?
Zach: I would say yes. Certainly, in hindsight, I wish we could have modernized the servers even more. Having this in containers would have been a breeze. The thought process about restarting a container is basically non-existent, because you don't really care about its lifetime. To me, it's...
Kevin: That's the lovely thing about containers. That's moving from the model of servers as pets to servers as cattle, where it's like, "Oh, these servers are unhealthy? Kill them and restart with healthy servers."
All of the container ecosystem has built out the infrastructure to do a lot of the like, "We're moving connections from one of these pods of servers to another pod of servers without any kind of downtime," and so a lot of the tooling is already built for you there.
Zach: I would say the other key learning we had there was that it's particularly hard in games to really test end to end at scale. You could do your Postman for your servers if they're running HTTP or gRPC. You could get pretty good scale with just your standard web dev testing tools.
As soon as you have a custom client running its own protocol and you want to do end to end testing, we found that we severely under tested the game in regards to some of the gameplay actions players were doing.
They were finding unique ways to lock their characters that we didn't necessarily consider. It's like, "Oh, yeah, that will crash the game. We probably could have caught that with gameplay testing, but how?"
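[The gap Zach is pointing at: the HTTP half of the problem is a few lines with standard web tooling, like the made-up example below, while a headless client that speaks the game's own protocol and performs real gameplay actions has no off-the-shelf equivalent.]

```python
# Load-testing an ordinary HTTP endpoint with stock tools. The URL is made up;
# the point is how little code the web half takes compared to driving a custom
# game-client protocol through real gameplay actions.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "https://example.internal/api/character/save"  # hypothetical

def hit_endpoint(i: int) -> int:
    resp = requests.post(ENDPOINT, json={"character_id": i, "slot": 1})
    return resp.status_code

with ThreadPoolExecutor(max_workers=50) as pool:
    codes = list(pool.map(hit_endpoint, range(1000)))

print({code: codes.count(code) for code in set(codes)})
```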
I just bring that up because Diablo IV did invest very heavily into automated testing. I think it's widely agreed that Diablo IV had a very successful launch, regardless of the gameplay. The system health was very good comparatively.
Kevin: Games have always struck me as a really challenging environment. This is coming much more from the web dev world where you make a web request, and you've got a session token or whatever so the server knows that you're logged in as Kevin Riggle or whatever.
The server is not holding on to a lot of state between web requests necessarily. Generally, a lot of that state is getting persisted somewhere else. A lot of state is getting persisted locally.
It's very easy to write an automated test suite where you can run through all the API endpoints and be like, "Does this return the results that we expect?" Whereas in a game, a game client is basically a big, weird state machine, and being...
Zach: What is a game if not state?
Kevin: Yes, 100 percent. Where in the web world we are trying to remove as much shared state as possible, that is the only thing that a game is. I'm impressed that they were able to do automated testing at all because that means that they're finding...I don't even know how you break that problem down.
Zach: I would encourage you and your viewers to pay attention to how stable games are when they launch in the future. I think this is a thing the industry is coming around to, where a game that launches poorly will be ridiculed and may lose a bunch of players initially, and that has consequences in the market.
Players are no longer going to accept like, "Oh, we had server issues." "You had server issues on the last four launches. What have you learned? What haven't you learned?" I think we'll see a shift in the coming years about how stable games launch. It'll be interesting.
Kevin: That will be very interesting to see because that's a really challenging environment. Also, all of the state interacts with each other. That's a thing that makes that particularly challenging.
In a web app you can say, "These objects and these objects relate to each other in certain ways, and so will interact in certain ways, but if these objects don't have a relation to these other objects, then you don't have to worry about their interactions because they can't ever interact."
Oversimplifying. We all know the ways that that's not true, and that's the cause of the incidents in the web dev world. But, in a game, one item can interact with any other item in the game world, and any other player, and anything in the environment; any piece of state in the system can interact with any other piece of state.
You can't constrain. If you can't constrain the interactions, you have much less ability to enforce invariants over it, basically, and simplify the problem of testing it. Because you have to test all of those interactions, but that's exponential. You have to pick the important set of those, and that has always struck me as...
The more I learn about the way the games work internally, the more I'm impressed that games work at all, ever. [laughs] There's a lot of human testing, a lot of human QA involved in games, is my understanding. That only goes so far because you have what, 20 or 50 or 100 human QA testers, but 10,000, 100,000, 500,000 players.
Zach: I'll start with a prelude by saying human QA is one of the most invaluable and undervalued parts of game development that often gets overlooked. I think there's almost a stigma where they're just playing the game, because that's kind of what a parent who's not super involved with video games might think. It's like, "Oh, you're just playing video games all day?"
The level of detail these folks go to find repro cases and what specifically they were doing is just incredible. I don't mean to soapbox, but human QA is invaluable to games and should be treated better than they are being treated.
Kevin: I've heard stories of QA testers spending days or weeks running into the same wall trying to figure out why there's some bug, why under certain cases, you can clip through that wall and fall off the edge of the world and die, that level of obsession, almost.
Zach: We were often given dedicated QA. This person's very familiar with how the back end is integrated.
Kevin: Oh, interesting.
Zach: They don't have a computer science degree. They're just listening to the terms we engineers throw around. They're familiarizing themselves with them. They're actually applying them to the game. That is such an invaluable skill set to have someone who's willing to do that.
Also, such a complete lack of prioritization on engineering's part to want to help automate some of these things that human testers are doing, but I digress.
Human testers are invaluable. To your point, there's a lot of human testers on games. Hundreds, I think, is the correct number, more so than like tens for scale on one game.
Kevin: Fascinating. They also have a limited amount of time to do this. They are...
Zach: Oh, yeah. A game comes out, and let's say there's 100,000 players. Those players play for an hour. They just wiped out three months of QA time, right?
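[Back-of-envelope: 100,000 players for one hour is 100,000 player-hours. A QA team of a couple hundred testers working eight-hour days covers on the order of 30,000-plus hours a month, so that single hour of launch traffic really is roughly three months of QA effort.]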
Kevin: Oh, yeah.
Zach: Just in terms of scale...
Kevin: Pure numbers, yes.
Zach: QA can't test. There's not enough QA, so things will slip through.
Kevin: There never can be enough. The real world is just so much more enormous than any organization we can possibly build. We will always be finding things in the real world that we could have found if we had followed the right path.
It's a big pachinko machine. We're only ever going to get to the parts of it that we get to. There are always going to be things at the tails, outside of our ability to test them, until the real world shows up and we have to cope with it.
It's all about building these processes, whether it's automation, whether it's tooling, whether it's experience and skills, so organizations can respond effectively in these moments when the real world throws us a curve ball. That's my soapbox.
[laughter]
Kevin: How did the organization handle it with you?
Zach: This is probably the most influential part of the entire incident for me, was how the team handled a major outage caused by me. It was nothing but support and a pragmatism that I didn't really expect. I was expecting some form of retribution, even though the team had never been like that.
All I had to do was think a little harder and run the query by a DBE and we could have avoided this. The team understood that a blameless culture is how incidents get resolved: when you can volunteer that information without fear of retribution, you actually resolve stuff. That stuck with me.
I was 24 at the time. Again, never been in an incident of this scale. I very well could have caused financial damage to the success of the game based on this incident if people stopped playing because they lose trust.
It was, "Here's what we should have done. Here's what we'll do in the future," etc. It was never personal the entire time. I'm very appreciative of how that was in.
Kevin: Yeah, that's how it should be. That speaks very well of the culture there at Blizzard. It speaks very, very well. Well, and, you're never going to make that same mistake again? [laughs]
Zach: Exactly. I have a reverence for the database I did not have before. I understand the power it wields. We talk about single source of truth or a single point of failure, that big old box of state is what that is.
Kevin: That's exactly what it is.
Zach: You respect that box. Whatever you have to do to protect it, you protect it.
Kevin: Yes, yes, you do. It sounds like the organization learned a lot about how important and fragile that box was. Even sitting here, I'm hearing this and I'm like, "This is surprising me." There are some things going on here that I did not know.
I would have been like, "Oh, yeah, this is a totally safe query to run. I don't need to run this by a DBE. It's not changing anything." To finish my thought, "What could possibly go wrong if I press the enter key?" That gets me into asking questions like, "Oh, what's the storage engine doing under the hood?"
"What storage engine are we running?" "Why is this happening? What does this mean for other queries that we could run in the future or other queries that could be running as part of a monthly background process that could cause similar issues and where might we see a similar thing crop up in the future?"
It sounds like the organization, the folks at Blizzard, took those lessons to heart, too, and looked a lot at like, "How can we automate the recovery process and what do we need to learn from this as well?"
Anything else that stuck out to you from the whole experience?
Zach: The only thing, we wrote a blog post when we were going to refactor and get rid of the lock system. This was one of the rare times where we talked about implementation details with players, where we talked about, "We have this global database and then we have these regional databases, they're where the writes go to." etc.
Something that I was really impressed by was the community, after reading the blog post, was super appreciative that the developers were talking at all. It is easy as a game developer -- and I'll just generalize to software engineers -- we care about our users, or players, or what have you.
But you might not necessarily think about the power imbalance between you two, or the relationship. Us saying, "Hey, sorry the game has felt bad lately. Here are the steps we're taking. Here's the timeline," was super well-received.
Telling your players that things aren't working and that you're working on it typically doesn't go well on its own. But when you put some level of care into the communication -- "Here's the feedback we've heard, and here, specifically, is how we're listening to you," not the good old "We're listening," but "Here's the feedback we're acting on" -- it was super well-received.
Since that moment, I've always thought a little bit more about, what is the relationship between a developer and a player? I'd like it to be good. I'd like to get to a point where players don't actively dislike developers for adding bugs or making the game worse, which is a modern thing in gaming, which is interesting.
As developers, be completely cognizant that players are your game. If you have no players, it doesn't really matter what you've built, so respect your players and do right by them. The reaction to that blog post was very heartwarming to me.
Kevin: Good. In my corporate experience, there were times where we were really worried about legal liability if we talked too much about what was going on under the hood -- that people would sue us if we talked too much about what had happened, that we were negligent or something.
We similarly found that providing a level of context, and honestly providing enough detail that people could see, "Oh, yeah, this is a hard problem and they are working on it. They are doing the best they can," helped. When the community feels cared for, then I guess the community responds with care for the developers. Making that mutual connection happen is important.
There's studies about doctors, who have a similar kind of problem and a huge liability from medical malpractice lawsuits. Yet they've found in study after study that if the doctor comes out and says, "I'm really sorry, I screwed up. I wasn't able to save your sister, daughter, mother, father, brother, son, grandparent," whatever, that, both the family members come away from that experience being like, "Oh yeah, the doctor did the best they could." And, on the corporate side, they're much less likely to get sued. [laughs]
Convincing management of that is always an uphill battle. I'm both grateful to hear that Blizzard management was willing to talk about this stuff and that the response was so good, and I hope that...because I think it's a pole that we're always going to orbit around, people being like, "No, I don't want to talk about what went wrong.
People will blame us and think we're bad people. People will get mad at us." Just keeping that feedback loop healthy and open is going to be something that we're always working on, that we're always going to have to remind people of and reinforce. Zach, this has been super good. It's been really, really fun chatting with you. Where can people find you online?
Zach: Basically nowhere. I've scrubbed all my social media after a wonderful 2022 and 2023. It's for the best and I'm relishing it.
Kevin: Good. Enjoy freedom, touch grass and yes. [laughs] Good. Well, just have to find you when you pop up on other things. What are you working on now? I forget the name of the place that you're at.
Zach: I'm at a studio called Disruptive Games. We're working on an unannounced project that's being published by Amazon Game Studio. No details I could provide because it's all...but keep an eye out for it. Disruptive Games, really happy with what we're working on. Excited to show it off.
Kevin: Good. We'll find out when we find out. Super excited for it. This has been the "War Stories" podcast on Critical Point, with Zach Johnson. I'm Kevin Riggle and we'll see you all next time, folks.
[background music]
Kevin: Thank you all so much for watching. If you liked that, please like and subscribe down below. It helps the channel. Also, we have another great episode coming hot on the heels of this one and you aren't going to want to miss it, so do get subscribed.
Also, if you have any insight into that MySQL behavior that Zach and the team at Blizzard ran into, I would love to hear from you in the comments below. I have some guesses, but it's been a long time since I ran MySQL in production and I know some of you have been deep in the weeds on it.
Like I said at the top of the show, we are now available in audio on all the major platforms as well as video on YouTube, plus full edited transcripts, which you can find at warstories.criticalpoint.tv.
Getting professional transcripts and captions done and integrated is maybe the single most time consuming and expensive part of getting each show done and posted, but I personally want the show to be as accessible to as many people as possible, so please take advantage of them.
Intro and outro music is Senpai Funk by Paul T. Star. You can find me on Twitter @kevinriggle and on Mastodon @kevinriggle@ioc.exchange. My consulting company, Complex Systems Group, is on the web at complexsystems.group. With that folks, till next time.
[music]