Kate Rudolph: I received a ping from the on-call for the daily push who was responsible for pushing code to the servers every day. They said, hey, we think something that you wrote might have broken log out on mobile devices. We were like, OK, the rollback is out, let's all go to All-Hands.
We were at a stage where people still liked to do that. I think maybe you're never too big to do demos at all hands. There was a completely different team from mine, I probably knew some of the people on the team, but I wasn't aware of what they were working on. They had decided that they were going to do a demo of this because it had just landed.
Oh, in that push, they had landed some crucial code, and so they had decided that they were going to do one of these campaigns that would only show up to the person whose laptop was presenting at all hands. It would only show up for John Yang.
They were going to give him a pop up on his Dropbox home page that says, "Get your free watermelons. Click here to get your free watermelons now."
Kevin Riggle: Hi there, folks. This is the "War Stories" podcast on Critical Point. My name's Kevin Riggle. I'm here today with Kate Rudolph, who has a story about that time she broke production. That's what we do here, is tell those stories.
Just before we get started, a tiny little bit of podcast business. We're now also available in audio as well as video on YouTube. If you go to warstories.criticalpoint.tv, you can subscribe to us wherever you get your podcasts. There are also full edited transcripts of these, because that's often how I prefer to consume podcasts.
I'll have more on that at the end of the show, but without further ado, we'll roll the titles and get started.
Kevin: We're back. Once again, this is War Stories podcast on Critical Point. We're here with Kate Rudolph. Kate, can you tell us a little bit about yourself?
Kate: Yeah. Thanks so much for having me on. My name's Kate. I'm a software engineer, and I've been in the software industry for about eight years as an IC and a manager. The story that I have to tell you today is from when I was an intern at Dropbox, so a little baby engineer starting out in the first summer of my career.
Kevin: What year was this?
Kate: This was summer 2014.
Kevin: Summer 2014, a while ago now.
Kate: Yes, a while back.
Kevin: Some water under the bridge since then.
Kate: Long enough that I can look back on it as a funny memory and not an incredibly embarrassing moment.
Kevin: Exactly, yes. You're working at Dropbox. You're an intern. You're taking the summer to go experience what software engineering is like inside a...I guess Dropbox was in 2014...they'd been around for a minute. They were not a new company, although not as large as they are today probably.
Kate: There were about 700 people at the company at the time that the story takes place.
Kevin: OK, that's a good size. That's like, yeah, still small enough that you can recognize everybody when you see them in the hall, kind of, even if you don't know everybody by name.
What was the first thing that you noticed? What was the thing that, whether you knew it or not at the time, was like a signal that something was wrong?
Kate: At the time, I was working on my intern project which was related to mobile contacts. You have contacts on your phone. You want to be able to search through them when you're sharing within Dropbox. That was what my project was related to.
I received a ping from the on-call for the daily push, who was responsible for pushing code to the servers every day. They said, hey, we think something that you wrote might have broken log out on mobile devices. I was like, oh, well, I mean, my work was on contacts, not really the log in, log out experience. I'm not sure how you got my name, but OK.
My response to this was, well, that doesn't sound good, but I have a one-on-one with my manager right now, so I'm going to go to that. I have a meeting. I went to my one-on-one with my manager, and I led with this. I was like, oh, the daily push on-call was pinging me about this.
Bless this manager. He was like, oh, OK, so this one-on-one is over. He walked me back to my desk and sat with me and helped me respond and figure out what was going on with this incident.
Kevin: An important lesson in priorities.
Kate: Yeah. Again, I was a baby intern. I didn't know.
Kevin: Right, absolutely. If you don't show up for your one-on-one, your manager is going to be mad, right? That's no good.
Kate: Right. It felt like a much more immediate concern at the time, but my manager set me straight.
Kevin: OK, so the feature you were working on is I have a file in Dropbox. It's a form that I've scanned that I need to share with my admin or something. I pop open the share widget in Dropbox, and there's some way for the Dropbox app to talk to the contacts on my phone and be able to offer me those contacts as my potential people to share to. That was the thing you were working on?
Kate: Right, yeah. Photos in particular, but it was about ingesting those contacts with your permission from your phone where you have all your contacts, and then making them available to you wherever you are on Dropbox.
There were a lot of layers to it, both on the phone side. We had some cool cross-platform Android and iOS code, and I was working at that layer to be able to only do it once. Then actually store them on the server so that maybe later you're sharing through the website. You could still use those same people to share with.
Kevin: Nice. OK, cool. An important feature for a platform whose whole value proposition is sharing.
You were working on...were you doing this full stack? You were doing the mobile app side as well as the back end side, or...?
Kate: Yes. I was not the first one to touch contacts at Dropbox. There was something here that was working. The idea for the intern project had come from the fact that we were not really doing a differential sync of your contacts. If you had a ton of contacts, we were ending up using up a bunch of data, re-uploading them.
The idea of my project was to perhaps see what had changed, before going on, but, as projects tend to do, especially because, again, I was not that experienced, the scope had increased to respecting data limits about how large your contacts would be, how many of them we could store, making sure that we were not going to be exposed to any lawsuits from storing your contacts for too long or after you didn't want them stored.
I didn't yet have the skill of being, "OK, let's get the main thing out and save the data thing, and then we'll fix up all these little things." I was like, "Oh, that seems important. We better do that and that and that," and so there was all of these details piling up around the main core of the project as well.
Kevin: OK, sure. Yeah, it touches a lot of pieces.
The person who's on call, someone who's running the daily push, I think what I'm taking from this is that every day all of the changes which have been committed to Dropbox's main repository or merged in.
Was Dropbox running like GitOps development where you've got a feature branch and then you merge into main? Or was everything pretty much happening on main? What was the deployment strategy like at this time?
Kate: I think you basically got it. My workflow was to develop on my local branch. Once everything got approved by my mentor or my manager or whoever was reviewing my code, to merge it into main. I'm so glad we call it that now. It was definitely master at the time. [laughs]
To merge it in, and then at some point in the next 24 hours when the daily push on-call got around to pushing it, they would push it and then it would be live on the servers. Mobile, of course, had a much slower release cycle.
Kevin: You have to go through the app store approval process and all that.
Kate: Yeah. They didn't do that nearly as often.
Kevin: Interesting. It's not quite what we would these days call continuous delivery. It's releases at a pretty fast tempo where everything is getting batched up by the day and then, yeah, getting released. That's interesting.
Kate: It was a predictable cadence. Joining the daily push Slack channel was a way to...for me, it felt more like keeping my ear to the pulse of whether people are going to be stressed about issues or outages, figuring out how people are feeling.
There was an on-call rotation. Some number of maybe 10 or 12 people would take a three-day-in-a-row shift of being the one to actually do the push.
Kevin: Then being around in case any problems came up to kick off a response quickly.
Kate: Exactly. Monitoring for exceptions and being responsible, not for fixing them, but for triaging them to, say, me.
Kevin: OK, which it sounds like they did. All right. So your manager walks you back to your desk. What do you find?
Kate: It turns out that what had happened was that indeed, nobody on the mobile apps could sign out of Dropbox on the mobile app, which you wouldn't think would be that common, but when you want to do it, you really want to do it.
The issue was that I, as one of the many like little pieces of this project, we had realized that we did want to remove your contacts that had come from a phone once you logged out of Dropbox on that phone.
We decided that it would mean that you didn't want to have that. There was some...Some other company had recently had some legal issues with storing mobile contacts for too long.
Kevin: Was that the LinkedIn...?
Kate: It was in part a conservative move. I forget whether it was LinkedIn. It might have been some like Foursquare or Quora or somebody was getting sued, for sure.
We decided to do the conservative choice and clear out your contacts when you log out of the mobile app. This was one of the first pathways where all this new contacts code that I had written had sort of gotten exposed because I had mostly been operating under the assumption that everything that I did was feature flagged.
On the very first day, I had put in a feature flag and turned it off and said, "You're only using my new incremental sync contact situation if I put you in this feature flag." It's only me and my manager and my team and stuff like that. Nobody's going to see this code.
Kevin: At worst it will break for us.
Kate: Yeah. However, I protected using your contacts and uploading your contacts, but I didn't put this logging out of the thing, logging out of the app behind the feature flag, and so, it ended up finding a way into my code.
The issue is that I had some bug around if your contact didn't have a last name, which we all have like a couple contacts in there that don't have a last name. If the last name was blank, it was choking. My test case certainly didn't have a contact without a last name because I was naming all of my test cases after like pop stars like Taylor Swift and Katy Perry.
They all had last names. [laughs] Of course, everyone in the real world has at least one contact on their phone that doesn't have a last name.
Kevin: You didn't have any test case for Prince, so there we go.
Kate: [laughs] Exactly. I didn't go with the one-named pop stars.
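To make the blank-last-name bug concrete, here is a minimal sketch in Python. The transcript only says the code "choked" on a blank last name, so the `Contact` type, both function names, and the exact failure mode are assumptions made up for illustration, not the real contacts code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    first_name: str
    last_name: Optional[str] = None  # plenty of real contacts have no last name


def sort_key_fragile(contact: Contact) -> str:
    # Hypothetical buggy version: assumes last_name is always a string,
    # so a contact like "Prince" raises AttributeError here.
    return contact.last_name.lower() + " " + contact.first_name.lower()


def sort_key_defensive(contact: Contact) -> str:
    # Treat a missing or blank field as an empty string instead of failing.
    last = (contact.last_name or "").strip().lower()
    first = (contact.first_name or "").strip().lower()
    return f"{last} {first}".strip()


# Test data that includes a one-named pop star this time.
contacts = [Contact("Taylor", "Swift"), Contact("Katy", "Perry"), Contact("Prince")]
print(sorted(sort_key_defensive(c) for c in contacts))
```

The point of the sketch is the test data as much as the fix: a single fixture without a last name would have exercised the failing path before it reached production.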
Kevin: There we go. Yes. Because this is theoretically a general audience podcast, what's a feature flag?
Kate: A feature flag, this is a way that already existed. There was a whole framework at Dropbox and many other places that I've worked already have this set up.
It's this way to control whether or not your new code or your new feature is active by going into an admin portal and flipping a switch, essentially, so that you can decouple "the code got pushed live to production" from "my thing got launched."
Kevin: This is user by user.
Kate: Yeah. Dropbox had a very complicated system and then eventually replaced it. You could do it by users or by groups, by certain conditions about whether the user was paying for Dropbox or not. There's all sorts of different conditions that you could do.
The typical way that this went, especially for intern projects, is you put the feature flag, you turn it off for everyone, and you turn it on for you and maybe your team. It was -- from everyone I was talking to -- pretty safe.
There was pretty little that any intern could do behind their feature flag, or at least that was my assumption. As it turns out, like feature flag engineering is, actually there's a lot to it.
I have gone on in my career to manage entire spreadsheets of different feature flags we have, controlling the different aspects of the feature, which ones need to be turned on, in which order, and which ones could then be turned off without causing problems.
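A minimal sketch of the idea, assuming a tiny in-process flag object rather than Dropbox's actual framework (which the conversation does not describe in detail). The flag name, user ids, and function names are all hypothetical. It also shows how one entry point can slip past a flag: the upload path is gated, while the first logout function runs the new cleanup for everybody.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FeatureFlag:
    name: str
    enabled_for_everyone: bool = False
    enabled_user_ids: Set[int] = field(default_factory=set)

    def is_enabled(self, user_id: int) -> bool:
        return self.enabled_for_everyone or user_id in self.enabled_user_ids


# Off for everyone except the intern and her manager (hypothetical ids 1 and 2).
NEW_CONTACT_SYNC = FeatureFlag("new_contact_sync", enabled_user_ids={1, 2})


def upload_contacts(user_id: int) -> str:
    # Gated: only flagged users reach the new code.
    if NEW_CONTACT_SYNC.is_enabled(user_id):
        return "new incremental sync"
    return "old full upload"


def log_out_ungated(user_id: int) -> List[str]:
    # Not gated: every user runs the new cleanup step on logout.
    return ["clear contacts (new code)", "end session"]


def log_out_gated(user_id: int) -> List[str]:
    # Gated: unflagged users skip the new cleanup and just log out.
    steps = []
    if NEW_CONTACT_SYNC.is_enabled(user_id):
        steps.append("clear contacts (new code)")
    steps.append("end session")
    return steps


print(upload_contacts(99), log_out_ungated(99), log_out_gated(99))
```

Deploying this code changes nothing for user 99 on the gated paths; "launching" is a later edit to `NEW_CONTACT_SYNC`, not another push.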
Kevin: Got it. The thought here is to decouple, to the extent possible, like getting the code out on the servers from actually launching the feature where we realized at some point that getting the code out on the servers is an engineering problem. Launching a new feature is a marketing and sales and business-like problem.
Nobody is happy when you go to do the big launch and you push out the new code that has the new feature in. That push fails for reasons entirely unrelated to the thing that it is that you're launching. Better to get the code out and running and make sure...
Better to decouple those as much as possible because everybody is happier. Then you don't have the biz people yelling at the engineers while the engineers are trying to debug some failing test case or...
Kate: Totally. Yeah. Decoupling the business and engineering concerns is one aspect of it. It's also a lot more immediate, like running that push and making sure that it works, it can take an hour.
What if you told the media that we were going to launch this big new thing at 8:00 AM? You don't want it to be out at 8:45.
Another huge reason is testing. Obviously, you want to write all kinds of tests and test on staging or pre-production environments. But it also just makes you feel really secure. If it's already out on production, it works for me before I flip this flag open to 100 percent.
I'm sure that I didn't mess up any of the configuration in my staging environment, or whatever it was.
Kevin: At the scale that we're operating at these days, you know that the real world is going to throw up edge cases that are maybe an edge case of 1 or an edge case of 10 or an edge case of 100 relative to a population of millions or tens of millions of users.
The ability to do staged roll outs, and rather than go from 0 to 100 [percent] in the span of a minute or an hour, to be able to roll it out over days or weeks so that you can manage the process of finding those edge cases and responding to them. It helps a lot with the stability. It helps a lot with the user experience.
Kate: Absolutely. The number of projects that I've been on where it wasn't a big marketing launch, it didn't really matter that everyone get it at the same time. It's like, yeah, 10 percent of the people have the new contacts experience. We'll just keep that true for a week and monitor for crashes, exceptions, everything like that.
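One common way to hold a feature at "10 percent of the people" for a week is to hash each user into a stable bucket. The sketch below is an assumption about how such a rollout could work, not a description of what Dropbox used; the flag name and both function names are hypothetical.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: int) -> int:
    # Hash the flag name together with the user id so each flag gets its own
    # stable, roughly uniform assignment of users to buckets 0..99.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_in_rollout(flag_name: str, user_id: int, percent: int) -> bool:
    # A user is included when their bucket falls below the current percentage.
    return rollout_bucket(flag_name, user_id) < percent


# Raising the percentage only ever adds users: anyone included at 10 percent
# stays included at 50 percent, because their bucket number never changes.
in_at_10 = [u for u in range(1000) if is_in_rollout("new_contacts", u, 10)]
print(len(in_at_10), all(is_in_rollout("new_contacts", u, 50) for u in in_at_10))
```

Because the bucket depends only on the flag name and the user id, a user sees the same experience on every request while the team watches crashes and exceptions.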
Kevin: Exactly. So you get back to your desk, you've rolled this out, and knowing now that the issue was around the feature flag and your lack thereof of this logout code, what was the first thing that you really...What was the initial response experience?
The on-call comes to you and says, "We think it's related to this commit that you pushed. We're not really sure, but..."
Kate: Well, all it took was me reading the commit to be like, oh, yeah, that is my code. I read the exception. The stack trace ended in contacts code that I had been writing. I recognized the code.
I didn't need to open the link. I recognized it. Their main question for me was, is it an easy fix or do we need to roll back? I needed to learn what that would mean. Was it an easy fix?
Ultimately, yes, but it was also a couple of steps removed from the thing that was breaking. It was well, ultimately, yeah, I do just need to handle the case where this field isn't populated, but I probably want to do that for every single other field.
Maybe this is masking some other bug deeper into my contacts code. The real issue is that none of this is production tested. Everything that we're talking about, turn the feature flag on for 10 percent, that had not happened.
Kevin: No. Exactly.
Kate: We were way earlier. I wasn't expecting any of this code to be live. I believe a day or two later, I went through and checked that the feature flag would actually be applied on that case, and that people wouldn't be falling into my potentially buggy, untested code when they were trying to log out.
The answer was -- I got to this pretty quickly in discussion with my manager who started listening to me describe the things that would need to be fixed -- like, no, no, no, it's not a quick fix, it does need to roll back.
Kevin: There it's literally just, you put together a branch which removed the code from the pull request that you had submitted, and they walked that through the release process, and they pushed that out to live?
Kate: That was one possibility, was to essentially revert my changes from the most recent time, and then go forward, but that would...
Kevin: Hoping that nothing else had been touching similar pieces of the code, but yeah.
Kate: Assuming that there are no merge conflicts or more complicated issues that would happen. Instead of doing that, we decided to do an even faster thing with the way that the push was set up, and just roll back that day's push.
Kevin: The whole thing.
Kate: The whole thing. I think possibly that had had to happen on the previous day or the day before or something, because I remember the code that I had merged was...It had felt like it was from a while ago. When I first got this thing, I was like, "Oh, I don't think I have anything in today's."
I don't remember specifically, so I shouldn't cite this precisely, but I do think that it was a thing where sometimes it's broken, and then the answer is, "OK, the daily push didn't go out today. The code that you thought was going to go out today, it's going out tomorrow."
Kevin: Would new code get added to the push, or was there a buildup of pushes if there were enough of these failures?
Kate: This is why I'm thinking the code that I had landed on main was from the previous week. I think we were already in this buildup of pushes situation.
Kevin: We don't like work in progress of that kind.
Kate: [laughs] The issue was deemed like -- it's like a claustrophobia thing -- people can't get out of Dropbox. We're locking them into the app, they want to log out of the app. We have to fix that, that's pretty locked in.
Kevin: There are some security reasons also why you might need to log people out. For example, if somebody's password gets compromised and they need to change it, you want to invalidate all the old session tokens to get people logged out, so that those old sessions that maybe an adversary has access to aren't active anymore.
It's pretty important. Log out turns out to be pretty important.
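A minimal sketch of the session-invalidation case Kevin describes, assuming a simple in-memory token store. `SessionStore` and its methods are hypothetical names for illustration; a real service would keep sessions in a shared database.

```python
import secrets
from typing import Dict, Optional


class SessionStore:
    """In-memory map of session token to user id (illustration only)."""

    def __init__(self) -> None:
        self._user_by_token: Dict[str, int] = {}

    def log_in(self, user_id: int) -> str:
        token = secrets.token_hex(16)
        self._user_by_token[token] = user_id
        return token

    def user_for(self, token: str) -> Optional[int]:
        return self._user_by_token.get(token)

    def log_out(self, token: str) -> None:
        # Ordinary logout: drop a single session.
        self._user_by_token.pop(token, None)

    def revoke_all(self, user_id: int) -> int:
        # Password change or compromise: drop every session for this user.
        stale = [t for t, u in self._user_by_token.items() if u == user_id]
        for token in stale:
            del self._user_by_token[token]
        return len(stale)


store = SessionStore()
phone = store.log_in(7)
laptop = store.log_in(7)
other = store.log_in(8)
revoked = store.revoke_all(7)
print(revoked, store.user_for(phone), store.user_for(other))
```

If the logout path itself throws an exception, neither of these operations completes, which is why a broken logout is more than a cosmetic bug.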
Kate: In a way that I didn't really realize when I...It just felt like crossing a T and dotting an I to make sure that they got removed at log out.
Kevin: Somebody also looked at this code. Somebody reviewed this code and they're like, "Oh yeah, seems good," and missed the implications. At least that's my guess, knowing who was involved in Dropbox's engineering culture at the time. Code review was a big part of it.
Kate: I worked with awesome people who just sometimes made mistakes, but regardless, maybe it was a Friday afternoon. It was a Friday afternoon because we had All-Hands that day.
Kevin: Only the best incidents happen on Fridays.
Kate: They decided, you know what? Fixing forward is not that straightforward. We are going to do a rollback. I do think that it was a revert specifically my change. Merge that in and then try again to go forward so that we didn't, like you're saying, have days and days of work in progress. The immediate fix was to roll back, not to revert and go forwards.
Kevin: Got it. OK. There's a button they can press in whatever the deployment interface is to roll back the day's push?
Kate: Yes, I believe so. I'm not sure whether they were doing some sort of blue-green [deployment strategy] and they still had the old stuff.
Kevin: Still on blue.
Kate: Again, a lot of this felt like a black box to me as an intern who had been there for maybe six weeks. I gave the information to the on-call for the daily push who was able to use that to decide that the thing we needed to do to fix logout was to do a rollback to the previous version.
We're like, "OK, the rollback is out. Let's all go to All-Hands," because there's a chapter two of this story.
Kevin: Successfully roll the change back. Everything looks good on the...
Kate: Everyone can log out on their mobile apps. It's fine.
Kevin: The spike in...Was it customer cases or was it some kind of alerting and monitoring that flagged the logout issue? Do you know?
Kate: Yeah, there was monitoring for the exception.
Kevin: I guess it's like mobile app logout goes to zero where it's been trucking along at 10 or 20 an hour.
Kate: Exactly. Somebody saw a line on the graph.
Kevin: It's Friday. You successfully rolled back. Disaster averted. Let's roll into All-Hands and then hit the bars or go home, or whatever it is we do on Fridays.
Kate: [laughs] Well, first, we have to go to All-Hands. We were at a stage where people still like to do...I think maybe you're never too big to do demos at All-Hands.
Kevin: No. Yes.
Kate: There was a completely different team.
Kevin: 700 people, it's a great size.
Kate: Of course. There's a completely different team from mine. I probably knew some of the people on the team, but I wasn't aware of what they were working on at all. They were building a new way to basically show ads for Dropbox in Dropbox.
We weren't opening advertising on Dropbox.com. If you were there, we wanted to be able to show you with quickly updatable ways to show you different ads or different pictures in different locations on the site, maybe a pop up or on the sidebar or on the top, just like, "Hey, consider upgrading to pro."
Then that way, an engineer doesn't have to say, like, "Oh, hey, I want to build a pop up," and build a whole new pop up for the new campaign of, "Oh, this month we've decided to offer you 30 percent off," or whatever.
The idea was that someone could go into some admin portal and put a campaign in there and then it would show up. You basically store the campaign in the database. Then this new feature shows the campaign based on whether you're one of the lucky few that gets the 30 percent off, whatever.
They had decided that they were going to do a demo of this because it had just landed that day. They were going to do a demo of it at All-Hands. They decided that they were going to...
Kevin: In that push?
Kate: Oh, in that push, they had landed some crucial code. They had decided that they were going to do one of these campaigns that would only show up to the person whose laptop was presenting at all hands. It would only show up for John Yang. They were going to give him a pop up on his Dropbox home page that says, "Get your free watermelons. Click here to get your free watermelons now."
Watermelons were an inside joke at Dropbox. One of the co-founders absolutely loved watermelons. It was commonly used for joking around like this. They were going to restrict it using one of the elements of the feature that allowed you to restrict it to a certain audience. They were going to restrict it only to John Yang.
Kevin: Sort of like the feature flag stuff we were just talking about, except I'm guessing a different mechanism.
Kate: Right, but it was applied specifically to these campaigns.
Kevin: Creatives. These campaigns. Yeah.
Kate: Yes, exactly. All of the code had been going in over the previous month to display the campaigns, to make it look perfect and centered and have all the assets that you might want and everything. Then they had put this watermelon campaign in the database.
Unfortunately, the code that controlled only show it to John Yang [laughs] had only been in that code that, because of me, had been so recently rolled back. The campaign worked great. It just showed up for everyone, like everyone on Dropbox.
Kevin: Like on the Internet, not just at Dropbox?
Kate: Yeah, everyone who went to Dropbox.com had a button that said, "Get your free watermelons here. Click here to get your watermelons." Then you click and it goes away. That was it.
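The mechanics of this are easy to sketch: the campaign lived in the database, which a code rollback does not touch, while the audience check lived only in the code that was rolled back. The example below is a simplified model under that assumption; `Campaign`, the two functions, and the presenter's user id are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Campaign:
    message: str
    audience_user_ids: Optional[FrozenSet[int]] = None  # None means everyone


def visible_campaigns_old(campaigns: List[Campaign], user_id: int) -> List[str]:
    # Older release: can display campaigns but predates audience targeting,
    # so it ignores audience_user_ids completely.
    return [c.message for c in campaigns]


def visible_campaigns_new(campaigns: List[Campaign], user_id: int) -> List[str]:
    # Newer release: only shows a campaign to its intended audience.
    return [
        c.message
        for c in campaigns
        if c.audience_user_ids is None or user_id in c.audience_user_ids
    ]


PRESENTER_ID = 42  # hypothetical id for the person presenting at All-Hands
database = [Campaign("Get your free watermelons", frozenset({PRESENTER_ID}))]

# The database row survives the rollback; only the code reading it changes.
print(visible_campaigns_new(database, 1000))  # after the push
print(visible_campaigns_old(database, 1000))  # after the rollback
```

The same row is harmless under the new code and public under the old code, so taking the campaign out of the database fixes it without waiting for a push.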
Kevin: That was it. OK. Whoops.
It's not like April 1st or anything.
Kate: No, no. It was the middle of summer. It was like July.
Kevin: Well, that's when you want free watermelon, right?
Kate: Yeah, exactly. There were definitely people posting screenshots on Twitter. [laughs] "Where do I get my watermelon?"
Kevin: It sounds like the demo worked in All-Hands.
Kate: Absolutely. [laughs]
Kevin: Great. When was it noticed that this demo had maybe escaped its confinement?
Kate: When people started posting on Twitter. [laughs]
Kevin: OK, great. Social media team is sitting in All-Hands, maybe, and going, "Wait a minute." [laughs]
Kate: "Wait a minute." This is not what we meant to use this feature for.
Kevin: No, no. Did you then get pulled into another incident hot on the heels of the previous one?
Kate: I was not involved in remediating this one. I wish I remembered more clearly, but I do think we did a revert of my buggy change and were in the hour-long process of getting that one out. It was forward minus me, essentially, and minus the thing that I had changed to break things.
There is a period in there where everyone got offered watermelon.
Kevin: OK, great.
Kate: A quicker fix for that, of course, would be just take the campaign down. Once you add in the campaign to the database, you can also remove it.
Kevin: Just remove it from the database. Well, that's good. So the day's push minus your change was already in progress. Once that landed, you just had to wait for that to land. Then the code to gate the campaigns to particular people also landed, and then no one on the Internet was being offered free watermelon anymore.
Kate: "Oh, too bad." I don't think we gave out a single free watermelon as a result of that incident.
Kate: I know, right? [laughs]
Kevin: It's a marketing opportunity. Come on. Organic marketing even. [Pun unintended. -KR] Well, that's unfortunate but yeah, it sounds like it was relatively little customer blowback or harm.
Kate: Of inaccurate promises to put on your website, that's a very silly one.
Kevin: Incredibly silly, yes.
Kate: I think even with the audience of Dropbox, which has millions and millions of users, most people have a sense of humor. It was a...
Kevin: Have a sense of humor about it. Yeah, exactly.
Kate: As far as embarrassing mistakes go, it's embarrassing silly and not mortifying.
Kevin: [Not] embarrassing and shameful. Yeah. Cool. What happened when you came back in next week on Monday?
Kate: [laughs] Certainly, didn't hear the end of it. There was, I think, a number of lessons learned and things that were changed. Being an intern, I wasn't leading the charge to make the changes overall that needed to happen.
One thing, for example, I don't think we demoed in All-Hands on production for the rest of the years that I was there. If you need to do a demo at All-Hands and it's not already out to 100 percent, you can demo on staging. It's fine. You can demo on your local machine. Nobody's looking at the URL.
Kevin: Got it.
Kate: You could just not project the URL. Nobody needs to know that this is not live. Do demo magic. Don't mess with our actual users.
Kevin: Good. Probably for the best.
Kate: Another thing that I certainly learned, and I don't want to say that Dropbox learned it because many of the smart people there probably were already aware.
Kevin: It's an ongoing process.
Kate: When you have to do a rollback, it's not without consequences. Any time you change the code, you can break things. Even if you go back to a known-good thing, any data could have been added under the new code that maybe isn't compatible with or is not going to behave the way that you expect on old things or on old code.
The act of doing a rollback is like, is it safe? It depends a lot on how every single engineer who's been putting code into this machine and expecting it to go out is expecting it to work.
As long as we have the expectations that your code can get rolled back once, but not twice, then you know that the push after your push, then you're safe and it's not going to get rolled back.
Maybe you do want that guarantee, or maybe your company doesn't have that guarantee and so you need to always write code with the assumption that it's not going to get rolled back until, I don't know, the month's safe? Probably. Yeah, no one would do that.
Maybe the daily push wants to have a guarantee about that. I think we were at a place that summer where pretty often you would pass in the hallway the stressed daily push on-call, who's like, "Oh, we haven't had a daily push in three days. Will people stop breaking the build?" or whatever.
You're like, "Yep, we should stop breaking the build." It turns out that that's dangerous for more reasons than just people's code isn't going out. You need to be able to reason about that when you're thinking about how your code's going to make it to production.
Kevin: The mental models of everybody at the company about the state of the production systems and the state of their changes relative to everybody else's changes and those production systems matter a great deal.
The team that had pushed out this marketing, this advertising campaign feature, had they realized that the daily push had been rolled back or...?
Kate: I don't think they had because I think they would have noticed that this plan to demo on production was not quite going to work. I said at the beginning, joining the daily push Slack channel felt to me mostly like a way to keep my ear to the ground. Know whether these things were happening and be around for...
People would post memes. It was fun to feel like you were in the know. But actually, there's a couple more way more important reasons to be in that. One is to have awareness of whether the code that I had written had gone out. Two, for the team that worked on this campaigns feature, to have awareness of whether the code that you're counting on going out actually did.
Kevin: Actually went out. Keeping your mental model in sync with reality to the extent that you can, yeah, super important. Interesting.
Being an intern, I'm sure you didn't see all of this, but what kind of changes were put in place that you observed as a result of this, besides just not demoing in production at All-Hands?
Kate: [laughs] I do believe that there was a doc about just exactly what we're talking about, how to reason about the daily push circulated for people. I believe that after this incident, it was not necessarily two days, because I think there were still situations where the daily push could fail to go out, but two pushes.
Afterwards, you were sure that your code would not be rolled back. They said we won't roll back two.
Kevin: That's a lot. At the kind of velocity of 700 people, so about 30 percent of those are engineers, meaning you have 200, 250 engineers. Some fraction of those are committing code every day.
Two days' worth of code. That is a lot of work to go try to back out. Not all of that is even as easy to back out. If it's a schema migration in the database or something, that database is not getting un-migrated, probably, quickly.
Kate: Yeah, totally. Again, I can't tell whether this was an organizational clarification of the way things work and expectations, or this was the point in my career where I was learning that some changes you really do need to stage into multiple branches, multiple commits, and then multiple pushes.
If you're trying to change the name of something, you really do need to make sure that the backend will accept both of them and do the same thing. Then the frontend will start changing one thing, and then the backend will stop accepting the other thing.
You need to be aware of, like, OK, I have parts A, B, and C of my thing, but what if I have part B out and they roll back to where I only have part A? Let me code in a defensive enough way that my thing won't be the issue that compounds the issue that caused the rollback.
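The staged rename Kate describes can be sketched roughly as follows. This is a minimal illustration, not Dropbox's actual code: the field names `user_name` and `display_name` and the helper `read_display_name` are hypothetical. The idea is that the backend first ships a reader that accepts both names, the frontend switches to the new name in a later push, and support for the old name is removed only after that push is past the rollback window.

```python
from typing import Any, Mapping

# Hypothetical field names for illustration; neither comes from the episode.
OLD_FIELD = "user_name"
NEW_FIELD = "display_name"


def read_display_name(payload: Mapping[str, Any]) -> str:
    """Accept either field name so any mix of client and server versions works.

    Stage 1: ship this tolerant reader on the backend.
    Stage 2: switch clients to send NEW_FIELD.
    Stage 3: only once stage 2 can no longer be rolled back, drop OLD_FIELD.
    """
    if NEW_FIELD in payload:
        # Prefer the new name when a client sends it.
        return str(payload[NEW_FIELD])
    if OLD_FIELD in payload:
        # Older clients, or a client that was rolled back, still send the old name.
        return str(payload[OLD_FIELD])
    raise KeyError(f"expected {NEW_FIELD!r} or {OLD_FIELD!r} in payload")
```

Because the reader tolerates both names, rolling back the client change in stage 2 leaves the backend working, which is the "part B is out but they roll back to part A" case.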
Kevin: There's a certain amount of education and a certain amount of standardization or process. We wrote down our process a little bit more formally to be like, "No, here's how long you should wait before putting any weight on this." Other changes that you saw?
Kate: I don't know particularly. I think I certainly learned a lot and had a change of mindset of what counted as done for my work. I was working on multiple branches and multiple code bases, because I had the mobile code base and the server monorepo. Up until that point, done for me really felt like merged.
Once I had merged it to main, it was off my plate. It wasn't sitting over here in the very top-of-mind list of branches that I'm working on. It wasn't in the code review tool, it was done.
Kevin: Now it's somebody else's problem. [laughs]
Kate: Right. Then at the end of the summer, I would turn on the feature flag, or go through the process of carefully turning that on, but there was this whole step that, because it hadn't been my responsibility, it didn't feel like it was my problem. But actually, yeah, my code going on to servers for the first time always had the capability to break stuff, and it did.
Kevin: That's a transition we've gone through at the organizational level. When I started in the industry, sometimes we were shipping code to customers, almost shrink wrap style, but this idea, "Oh, well, dev throws the flaming bag of code over the wall to ops, and then it's ops's problem to make it run in production," is something that we have moved away from.
It's very, very healthy that we have done so, but that idea that it's not done until it's running and out there in the world is...It's been something we've gone through as an industry, and it's been very, very important and healthy for us, because there was no way we were going to build anything nearly at the scale that we're at if we kept on in the old world.
Kate: Correct. Absolutely.
Kevin: Besides the sort of, like, "OK, now you've got to wait two pushes before it's really live," were there other changes in Dropbox's release process that you saw around either testing or alerting on things like this or...?
Kate: Broadly, no, and the daily push rotation with a person who would be on call three working days in a row and then rotate to the next person. They monitored the same homegrown exception service and triaged to the person that they thought broke things I think, essentially throughout my time there...
Kevin: That just continued.
Kate: ...for the server monorepo.
Kevin: Interesting. I'm getting the sense that you came on after your internship full time at Dropbox?
Kate: I did, yeah. Essentially immediately. I had intended to go back for a one-year master's in college, but I stayed on at the end of the summer and never went back.
Kevin: Brilliant. This was the summer after your senior year? You'd graduated and were like, maybe I'm going to do...
Kate: This was the summer after my senior year. I'd graduated. You know, at the end of the year, they'll tell each of the interns whether they get a return offer. Through the summer, I was like, "Well, I think I might accept it immediately, so is there any way you could tell me whether I got the return offer a little before the very last week of my internship?"
They were like, "We'll see, we'll see" and we were in the we'll-see phase when I offered watermelon to everyone in the world. I was like, "Oh, no," but ultimately I did get that offer. [laughs]
Kevin: It sounds like it worked out. It turns out that having free time and disposable income by being in industry is a lot more fun than... [laughs]
Kate: I was learning so much more at Dropbox than I could imagine learning taking another semester of classes.
Kevin: Cool. Obviously, there were not that many personal repercussions on you as a result of this, the organization...
Kate: No, everyone was absolutely more mature than to blame the poor hapless intern. Thank goodness. [laughs]
Kevin: Good. And you never made the same mistake again.
Kate: I never did.
Kevin: Funny how that works.
Kate: I learned a whole lot about the responsibilities that I have to the whole system of engineering and other people that goes into turning code that I wrote on my machine against my dev environment into code that runs for millions of people.
Kevin: How so? What kind of takeaways did you have there?
Kate: What I was talking about -- changing my definition of done and learning at what points in time things become available to different people. We had complicated staging and canary things going on before this production push.
Then, again, it varied for me based on what specific repo I was working in, whether it was the servers or the cross-platform mobile or one of the individual platforms and just connecting those dots. When I was new to being an engineer, it felt like a lot to have the code work on my machine.
Even getting to that and not having bugs and having my tests pass and figuring out all of the Git wizardry I need to do to get the right code into the right branch or whatever. All of that, to ship that, it felt like enough.
Kevin: Already an enormous lift, but there's this whole other piece and kind of work.
Kate: Right. It's like being a writer versus a publisher. The code that I write needs to also run on all of the servers. It needs to be distributed via the newspaper or Substack or whatever. That's a completely different type of work that is also engineering, and this hadn't been on my radar screen.
Kevin: Nice. How long did you stay at Dropbox then? You were an intern in 2014 and stayed on...
Kate: I stayed through late 2017, and I worked on contacts. We were the contacts team, we were team people, identity and contacts, for, I would say, two and a half years at that point.
Kevin: You stayed on the same project even. That's impressive.
Kate: I moved from mobile contacts to more server-side contacts. It was a full-stack team relating to everything, who you are and who you know on Dropbox, and it was an awesome team.
Kevin: Cool. How did the other team react to that, the team that had built the internal advertising platform?
Kate: You know, I think, as much as I like to take credit for this one, because the rollback was indeed my fault, I think a little bit more of it fell on them in terms of the repercussions and having been the ones to cause this "we're not going to demo in production."
They were seen as a little bit more of the reckless ones, which, yeah, I don't know, putting in a campaign to offer free watermelons that relied on code that was landing that day, is it reckless or is it startup culture?
Kevin: Move fast.
Kate: It doesn't seem that bad to me, but it was certainly a shared responsibility thing.
The feature, I think, was pretty successful, watermelons aside, and was used for quite some time until the entire home page was redesigned and none of their surface area looked correct.
Kevin: Exactly. Yeah. I definitely know that I saw internal cross-selling, like upgrade to an annual plan, or we've just launched a new photo carousel or whatever, like all this stuff. I'm pretty sure that I interacted with their feature.
There's a thing that you've heard me say a bunch, but I say it every time, that most really bad incidents are not the result of what we might call a point failure. An individual push, a sign error in one line of code, a bolt shearing, a belt snapping, but they're the intersection of multiple processes, which are both operating correctly, but that interact in a way that the people involved in them don't foresee or understand.
Their mental models get out of sync with the world, and then their mental models get reset very quickly when things happen.
Kate: Often the consequence of the multiple interacting systems, each of which are working as intended, is everything completely grinds to a halt, or you get outages or data loss. I just think this example is particularly funny because you didn't get anything like that. You just got mildly embarrassing advertisements.
Kevin: Exactly. On some level, we might call this a near miss. We might call this...Yeah. It's good that it sounds like the Dropbox engineering and business organizations really took it as an opportunity to learn and improve.
Rather than either being like, "Oh, ha ha, that was funny, that doesn't really matter. That couldn't possibly happen in a way that hurts people," or also, like, coming down on you hard.
Kate: Any time you have something showing up on the home page of every single user that none of us expected to be there, that's a wake-up call, for sure.
Kevin: That's a wake-up call. Yeah, exactly, and good to treat it as such.
Kate: Not just going to brush that one off.
Kevin: No. Yeah. Good. Did you know, was there any press coverage? Or did it really just hit social media and then...?
Kate: No. Oh, man. If there was, I really want to link to that article, but I think it was just social media. It would be amazing.
Kevin: I'll have to see if I can dig up some posts, some Twitter posts for... put them up as B-roll or something.
Kate: [laughs] Yeah. There must be a screenshot of it out there somewhere. Someone must have...It's the Internet, right? Nothing ever truly goes away.
Kevin: Yes. Well, this has been really lovely chatting, Kate.
I'm so glad we could do this. When we were talking and you're like, "Are you sure this is a podcast episode?" I'm like, no. I know this is...Also, I could see the YouTube thumbnail in my head, which is critical.
Yeah, I'm glad we could have this conversation. Any parting thoughts? Sorry. First, let's say, where can people find you online? If there is a place that people can find you online?
Kate: I prefer that they not. I'm not a very online person. [laughs]
Kevin: Love it.
Kate: Not trying to plug any of my accounts. I don't need people to find my dog through your podcast.
Kevin: Great. Perfect. Are you still doing coaching things?
Kate: Yeah, I am. Yeah. My friend Andy Scheff started a coaching company called Practica, where people can use their learning and development budgets that their company has towards coaching.
I'm one of the coaches on their platform. I can help you talk about that one time you broke production and see if there's anything we can learn from that.
Kevin: There we go. Yes. Because this was your first incident, right? This was your first time through any kind of incident process?
Kate: Yeah. It may have been the first time Dropbox users touched my code, I think. Six weeks into the internship, beyond adding my name to the about page or something.
Kevin: First you have to find the bathrooms and the coffee machine, and there's a certain amount of onboarding, getting you spun up...
Kate: I was trying to learn C++ at the time. It was a whole thing.
Kevin: That's a whole thing. I don't know about you, but that wasn't something that they taught us in school. They were just like, "Well, it's a programming language. We've taught you three other programming languages, so you'll figure it out."
Kate: Yeah, exactly.
Kevin: It's not a forgiving language to figure out. About six weeks in is when things would start to cook.
Kate: Things are coming to a head. We have had enough time to realize the edge case about the logout situation. We've had enough time to figure out what we want to do about it. It's finally landed.
As far as introductions go, I'm thankful that it was this and not a days-long outage. Could have been a lot worse.
Kevin: Yes. Very much so. And glad that you were in a supportive organization that could respond well to it and provide everyone involved with the tools they needed to get back to normal operation quickly and metabolize the learnings in a productive way.
Kate: Dropbox was an awesome place to be, just surrounded by so many people who were really, really good at this already. Just an awesome place to learn.
Kevin: Kate, this has been lovely. Folks, this has been Critical Point with Kate Rudolph about that time she gave the Internet free watermelon, or at least offered it. Thank you so much for watching, and we'll see you next time.
Kevin: Thank you so much for watching. If you enjoyed that, please like and subscribe down below. The algorithm likes it and also, I am still just getting the channel going and it makes me very personally gratified to know that people are engaging with what we're making and want to hear me.
If you've ever had a commit you made rolled back unexpectedly, leave a comment below. Like I said at the top of the show, we are now also available in audio on all the major podcast platforms. Check them out at warstories.criticalpoint.tv.
There are also full edited transcripts up there. If you prefer to read your podcasts, check them out too.
If you have an incident story you'd like to tell here, please email us at email@example.com. Also, I'm always looking for people who aren't cis White dudes to feature because obviously, everyone else has great incident stories too.
I'm sourcing people through my networks, but there's lots of folks who I don't know who I would love to talk with. If this is you, please consider yourself especially encouraged to reach out. That email address is hello.complexsystems.group.
Intro and outro music is "Senpai Funk" by Paul T. Starr.
You can find me on Twitter as @kevinriggle and on Mastodon @firstname.lastname@example.org. My consulting company, Complex Systems Group, is on the web at complexsystems.group.
With that folks, till next time.