Twelve Lines of Code Took Down the Whole Company - Patrick O'Doherty


Kevin: Hi there, folks. This is the "War Stories" podcast on Critical Point. My name is Kevin Riggle. I run my own little cybersecurity consultancy called Complex Systems Group.

We are here today with Patrick O'Doherty, who is a San Francisco friend of mine and who has worked in software for many years, to talk about that time he broke production, which is the theme of the War Stories podcast, trying to tell incident stories in public.

Without further ado, we'll roll the titles and get started.

[music]

Patrick: All righty. Cool. Great to be here. I'm super excited to tell the time that I broke production, one of many.

Kevin: Yes. I feel like most of us have a number of these stories.

Patrick: This one was memorable...

Kevin: The ones that stick in the mind.

Patrick: It induced nausea.

Kevin: OK. [laughs]

Patrick: Like Kevin said, my name is Patrick. I'm a San Francisco-based software engineer. I work at Oso.
Today, we're a hosted authorization solution. The story I'm telling dates from a previous life of mine, when I worked at Intercom.

I had an eight-and-a-half-year tenure at Intercom. Within that, almost three different lives. The longest of them was on security. This story dates to an early event of mine working on the security team on what should have been a very, very standard Friday morning.

Actually, as I reminisce on my time at Intercom, all of the best Intercom incidents happened on a Friday. Just universally. I'm not afraid of deploying on Friday, but it is an observed phenomenon that all of the best scrambles, all of the most hair-raising events, seem to come about when the final precipitating kickoff event happens early on a Friday morning, San Francisco time.

Kevin: Yes. I feel like when I was at Akamai, it was always like I would be finally catching up on email at 6:00 PM on a Friday. I would read something in email and I'd be like, "Oh, shit, this is an incident, isn't it?"

Patrick: It's just when all of the day's context has fully come to you and you're like, "Oh," it dawns on you that actually, this is a lot more serious than you first gave it credit.

Kevin: Yes, exactly.

Patrick: A little bit of context. Intercom was then an organization that had engineers split in two geographies, primarily Dublin and San Francisco. We were all hosted in AWS, primarily in US-East.

At the time, what I was working on was a project to provision cloud development environments that were hosted in AWS regions that were geographically proximate to where people were working, but which otherwise worked as a seamless single network that allowed people to, for example, share a development environment in progress with each other or to share access to common resources as if they were on the same network.

Doing this required making what should have been a very straightforward change to the existing VPC CIDR block allocations that we had in US-East.

Kevin: The best kind of change.

Patrick: I think all of maybe a dozen lines. Not too much, but just enough to get the job done. This was about a year into Intercom's usage of Terraform. We had migrated progressively more application-specific parts of the stack into Terraform, but we had started by putting all of the skeletal VPC config, common security groups, all of that sort of stuff, into Terraform.

The change that I was making was to take a piece of the IP space that we had already allocated to this VPC and then otherwise give it out to different regions where we could run workspaces. That would be all nice.

Kevin: Terraform, since this is at least theoretically a general computer audience podcast, for folks who don't work in the cloud so much, is an infrastructure-as-code tool, where you describe the AWS resources that you want to exist in the cloud in a text file on your local device, check it into GitHub, and then run a command that tries to make your AWS look as much like the description you've put in the text file as possible.

Patrick: Exactly that. It's HashiCorp's declarative infrastructure as code tool. You provide Terraform a desired state of what you want your infrastructure to look like. It has a state file which it uses to record pointers to whatever cloud resources it's managing under the various name aliases you've given them. Then when you try and run a change, it uses the state, does the evaluation, compares the diff, and then it goes off and it tries to reconcile that. There's a lot of magic in there.
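The plan/apply model Patrick describes can be sketched as a toy reconcile loop: compare the desired config against the recorded state and emit the operations needed to close the gap. This is an illustration only, not Terraform's actual algorithm, and the resource names in it are invented.

```python
# Toy sketch of a declarative reconcile loop: compare recorded state
# against desired config and emit a "plan" of create/delete operations.
def plan(state: dict, desired: dict) -> list:
    """Return the operations needed to turn `state` into `desired`.

    Both dicts map a resource address (the name alias in config) to
    the attributes of the cloud resource it points at.
    """
    ops = []
    for addr in state:
        if addr not in desired:
            ops.append(("delete", addr))  # in state, gone from config
    for addr, attrs in desired.items():
        if addr not in state:
            ops.append(("create", addr))  # in config, not yet in state
        elif state[addr] != attrs:
            # Many attribute changes can't be made in place, so they
            # become a destroy-and-recreate pair.
            ops.append(("delete", addr))
            ops.append(("create", addr))
    return ops

state = {"aws_subnet.a": {"cidr": "10.0.0.0/24"}}
desired = {"aws_subnet.a": {"cidr": "10.0.1.0/24"}}
print(plan(state, desired))
# A CIDR change is not updatable in place, so it plans a delete plus a create.
```

The "magic" lives in the comparison step: the tool only knows what the state file says exists and what the config says should exist.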

Kevin: Oh, my God. So much magic, yes.

Patrick: A lot of magic. In particular, what was unknown to us at the time was what specific behavior...We were using loops of a very primitive form. Iteration was a very difficult thing, or not difficult, but it was cumbersome in Terraform at the time.

Kevin: Because it's declarative, loops are metaprogramming there.

Patrick: It didn't really fit well into HCL. It also translated in a very primitive way into operations. What I mean by that is, our CIDR block allocations were contained in an array. The change that I made reshuffled the contents of the array, and that reshuffle was performed destructively. To achieve it, Terraform destroyed all of the items in the first array and then tried to recreate the second array piece by piece.
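The destructive reshuffle Patrick describes falls out of addressing looped resources by list index. A toy diff keyed by position shows why a pure reorder plans as destroy-plus-create for every element; the CIDR values below are invented.

```python
# Resources created in an old-style Terraform `count` loop were addressed
# by list index, so the diff compares position by position rather than
# tracking values. A pure reorder therefore looks like "every slot changed."
def index_keyed_diff(old: list, new: list) -> list:
    ops = []
    for i in range(max(len(old), len(new))):
        before = old[i] if i < len(old) else None
        after = new[i] if i < len(new) else None
        if before != after:
            if before is not None:
                ops.append(f"destroy [{i}] {before}")
            if after is not None:
                ops.append(f"create  [{i}] {after}")
    return ops

old = ["10.0.0.0/18", "10.0.64.0/18", "10.0.128.0/18"]
new = ["10.0.64.0/18", "10.0.128.0/18", "10.0.0.0/18"]  # same blocks, new order
for op in index_keyed_diff(old, new):
    print(op)
# Every index differs, so every subnet is destroyed and recreated,
# even though the set of CIDR blocks is identical.
```

Later Terraform versions added `for_each`, which keys resources by a stable name instead of position, precisely to avoid this failure mode.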

Kevin: I see. This is the array which contains the mapping between which machines should...

Patrick: Basically, subnet allocations, I think. It was CIDR blocks to...Something to do with subnet allocations. I can't remember exactly where this loop was, but suffice to say my CIDR block allocations were conflicting. It was not possible for this loop to complete. About partway through, it failed to create the rest of the array, which meant that all of the remaining regional...Sorry, availability zones in this region were left without...

Kevin: Subnet IPs assigned.

Patrick: Exactly. Every deployed host in production suddenly went into a netsplit where it has just n equals one.

Kevin: This was for AWS networking, then. Terraform, like "Sorcerer's Apprentice" style, had helpfully removed all of...

Patrick: It did exactly what it was asked.

Kevin: Exactly, removed all of the networking configuration in AWS. Then got halfway through setting up the new one before it was like, "Oops, can't do that."

Patrick: It's the perfect mix of factors, because it also...Another pet favorite in here is that Terraform's validation is very surface-level, because it cannot apply all of the API-level validation that happens when you make a request to AWS.

Kevin: Sure.

Patrick: Backing up, Terraform allows you to create what's called a plan, which is the exact set of operations that it thinks it's going to have to perform to reconcile your existing state to your desired state. In fact, I believe this plan successfully completed for the change that I proposed, because from Terraform's perspective, all it had to do was delete this array and then create this new array. It had all of the inputs that it needed for the new array.

It had no knowledge of what the input arguments were, what their relationship was to each other, and whether there was any logical validation that needed to be applied over that. That was something for the API to do. When runtime came, sure enough, the API did that.
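The gap Patrick is describing, a plan that looks complete but fails API-level validation at apply time, can be illustrated with a pre-apply overlap check. This is a hypothetical guard written for illustration, not a Terraform feature, and the CIDR values are made up.

```python
# Terraform's plan checked shapes, not semantics: overlapping CIDR blocks
# only fail when the AWS API rejects them mid-apply. A pre-apply check
# like this would surface the conflict before anything was destroyed.
import ipaddress
from itertools import combinations

def find_overlaps(cidrs: list) -> list:
    """Return every pair of CIDR blocks that overlap each other."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(nets, 2)
        if a.overlaps(b)
    ]

proposed = ["10.0.0.0/18", "10.0.32.0/19", "10.0.128.0/18"]
print(find_overlaps(proposed))
# 10.0.0.0/18 spans 10.0.0.0-10.0.63.255, so it contains 10.0.32.0/19:
# this is the kind of conflict the VPC API rejects partway through an apply.
```

A check like this, run in CI before the apply, is cheap insurance against exactly this class of partial failure.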

Kevin: It compiled, ship it. [laughs]

Patrick: I think this was a change that I proposed at half nine or 10 o'clock on a Friday morning. The engineer that approved it was a Dublin-based engineer on a sibling infrastructure team who was going out to celebrate his birthday that evening. He was going to review it on the bus.

Kevin: And Dublin is...

Patrick: ...ship it.

Kevin: Eight hours ahead.

Patrick: Eight hours ahead. It's, like, 9:00 here, so 5:00, 6:00 there.

Kevin: It's 6:00 PM for him.

Patrick: Exactly. The laptop, he does not have it. It is on his desk. Suddenly, the entire Intercom world goes dark, because...

Kevin: None of these machines can talk to
each other.

Patrick: Nothing in the VPC can talk to each other. Nothing.

Kevin: What's in this VPC, which is a structure inside AWS? Is this all of Intercom's...?

Patrick: It does.

Kevin: ...everything?

Patrick: All of Intercom's everything is in this VPC, and suddenly, load balancers can't talk to their registered targets. All of the AWS monitoring tools can't reach all of the things that they're supposed to be in charge of. We'll come back to that. That's [inaudible]. Basically, all traffic disappears. There is no response from Intercom for a while.

Kevin: Intercom is hard down, the whole thing.

Patrick: There's nothing going on.

Kevin: Even the marketing site?

Patrick: Possibly not, even. It depends. You might have hit a cached version of it, but if you'd gone to the origin at this stage, you would have still hit the common hosted infrastructure for all of Intercom's apps.

Kevin: Got it. How many people was Intercom at this point?

Patrick: In the [400], 500 maybe range. It's a pretty sizable operation to suddenly go poof.

Kevin: Hard down.

Patrick: Thankfully, mercifully, we had created ahead of time a separate AWS account where a full Terraform environment was already provisioned. This was ready-made for you to perform what we call Terraform surgery. It was immediately obvious to us what had happened, like, "Oh, everything is now goofed. This change has been misapplied." It's partially applied, so it needs to be selectively edited and then applied forward as quickly as possible.

Kevin: To restore the networking.

Patrick: Thankfully, we had a practiced break-glass mechanism for issuing an ephemeral SSH certificate to reach this unique environment that had the specific cross-account AWS assume-role privileges to be able to reach in. Because you can't administer the account from the inside now, pretty much everything network-wise is...This has to be done externally.

Kevin: You were running this from inside AWS, inside your VPC.

Patrick: We had three AWS accounts in play here. There's the prod account, where everything is and everything's going on. Then, to keep the snake from eating itself, there was a separate Terraform account for manipulating production.

Then there was a separate Terraform emergency account, which had a break-glass mechanism where there was an EC2 auto scaling group of n equals one that had Terraform checked out and ready to go. That instance profile had the necessary credentials to be able to perform whatever manipulations would be necessary for you to rectify whatever was going on.

Kevin: Great. My experience with Terraform was running it on my local laptop, and so then it's a little different world, but...

Patrick: We really wanted to move away from people running it locally early on. It's the only way to keep it in any way sensible or under wraps. This environment had the IAM credentials necessary. It had access to the S3 state bucket, and it had Vim and other things installed, so you could do literal state file surgery or whatever was necessary.

I don't believe that this was necessary in this incident, but definitely, there were times where this place was used to do some real...There's a GIF from Indiana Jones where he's trying to swap out a skull for a dummy weight. I empathize with that moment a great deal. Experiences with Terraform.

Kevin: When you're in a text editor, editing the internal representation of your tool's understanding of the world in order to convince it the world isn't the way that it is, or to trick it in a useful way, that's...

Patrick: You're trying to get past safety mechanisms as quickly as possible to get to the thing that you want to do because you have a very sure understanding of what's wrong, and safety things couldn't prevent you from getting there, and they're sure as hell not going to rectify it. You want direct action to get it done. Thankfully, this was already at hand.

Kevin: That's good.

Patrick: It took us about maybe 10, 15 minutes to go from, "Oh, what has happened here?" Because the Terraform application that did this was automated. It had been triggered by a merge once this pull request had gone in.

Kevin: I see. There's CI/CD happening. Continuous integration, continuous deployment, so you make the pull request, you get it reviewed, you get it approved. It passed local tests.

Patrick: The Terraform plan was posted as a GitHub comment. You approve that, and then you're signing off on the exact changes that are going to happen.

Kevin: Oh that's nice.

Patrick: If you tried to approve something before the comment was there, it would reject your approval. It was very protective.

Kevin: You built some nice automation around this.

Patrick: There was a fair amount of homegrown tooling around it. A couple of different CLIs that performed all of the application of this. The CI/CD workers were orchestrated using Buildkite. They were hosted in this separate AWS account.

Kevin: Got it. Hence why we need this.

[crosstalk]

Patrick: This is a way for us to have separation in the CI/CD flows, but still have all of the access that we wanted in terms of assume-role, being able to provision things, and then having this third account where we can duck in when needs be and do the changes. It was highly productive. I was always very happy to show it off and proud to speak of it to people, because it was certainly a very productive way of working. About 10 or 15 minutes to go from delayed reaction of what's going on because...

Kevin: What was the first thing you noticed? What was the first, like, "wait, what?" kind of signal?

Patrick: The more unusual response of zero: your application has become a black hole. You're not getting 500s, you're getting nothing. You're met with silence. I think it was pretty ominous all at once as an event.

The more common thing is that an application change went out, like a code change, because there were hundreds of them a day, and something there met reality that it hadn't anticipated and started creating a huge amount of errors. That's the more common case, and that was almost automated. This was like, we accidentally the app, and it's no longer there. None of it.

Kevin: Right. Did you start hearing somebody pinging you in Slack being like, "Hey, I see you just did a change."

Patrick: "Did we accidentally the app?"

Kevin: "Did we accidentally the entire app?"

Patrick: Yes. I was present in the office. This infrastructure change hit everything. Internal back-of-house applications that were being used by people were also hit by this. This is real down-tools.

Kevin: You're like, "I can't log into the employee portal" or whatever.

Patrick: The back of house, like CS tools, would have been down simultaneously. The only other time that I've caused a down-tool situation like this was when I was previously responsible for the San Francisco office Internet. Actually, a routing table. Never again.

Kevin: Networking is hard. Networking, as a friend of mine put it in his email signature, is one letter away from not working. [laughs]

Patrick: It's doing networking updates up a ladder, one-handed in front of a room of people while sweating profusely, just wanting to disappear from everybody's observation.

Kevin: Yes. Oh, man.

Patrick: Terrifying stuff. This again was like, "Hey, everything is broken. There's nothing going on here." Also, all of the usual diagnostics were met with complete silence. There wasn't anything coming back. It was pretty obvious that we had made the change, it had been published in Slack. It was like, "Hey, this is probably relevant."

About 15 minutes later, we got back to signs of life again. Load balancers are beginning to register targets, and auto scaling groups. Things are spinning up and we're beginning to see normalization. Very shortly after that, people log into the application.

Some context. Intercom is a communications platform that businesses use to interact with customers. One of the primary interfaces that people spend their day in is what's known as the inbox. It's a multiplayer real-time inbox that takes in all of the conversations that you receive from multiple streams, and then you interact with them.

Kevin: Where people may have most interacted with this is that little chat support widget in the bottom right of a lot of websites now.

Patrick: Exactly.

Kevin: On a lot of websites now is powered...a lot of them are Intercom-powered.

Patrick: That's the single-player version of Intercom, as you will. You're speaking to one business, but on the far side of that, there is somebody at the business that you're speaking to. They have a one-to-many interface, much like your Gmail inbox or something else, where they can have multiple conversations going
in real-time.

Kevin: Zendesk. What the customer support rep experiences, for example.

Patrick: Exactly. The business side of that is what's known as the inbox, which is this big, information-rich place where a lot of real-time updates are happening. That inbox makes very heavy use of an ElasticSearch cluster that is fed real-time updates that are tailed as an operation log from Intercom's API.

Changes to state, such as a user event being tracked or a new attribute being added to a profile, were caught in an operation log from Dynamo, I believe at the time, and replicated into the ElasticSearch cluster.
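The oplog replication Patrick describes can be sketched as tailing a change stream and applying each event to a search index. The event shape and field names here are invented for illustration.

```python
# Sketch of oplog-to-search replication: state changes land in the primary
# store, a change stream is tailed, and each event is applied to the
# search index in order.
def apply_oplog(index: dict, events: list) -> dict:
    """Apply a batch of change events to an in-memory search index."""
    for ev in events:
        if ev["op"] == "set_attribute":
            index.setdefault(ev["user_id"], {})[ev["attr"]] = ev["value"]
        elif ev["op"] == "delete_user":
            index.pop(ev["user_id"], None)
    return index

search_index = {}
apply_oplog(search_index, [
    {"op": "set_attribute", "user_id": "u1", "attr": "n_widgets", "value": 3},
    {"op": "set_attribute", "user_id": "u1", "attr": "plan", "value": "pro"},
])
print(search_index)
```

The key property of this design, relevant later in the story, is that the index is derived data: it can always be rebuilt from the primary store, but the rebuild takes time proportional to the document count.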

Kevin: Interesting. These are API-level changes that are happening on the platform and you generate...

Patrick: You might imagine that if you're widgets.com and you have user profiles for Kevin and Patrick, an attribute on those might be the number of widgets they've bought. When either of us makes a purchase, widgets.com will update the "n widgets purchased" to three or whatever. That is the thing that you might have a congratulations auto-message triggered on within Intercom.

A big focus of Intercom's messaging is that you could create profiles based on data attributes. When people have crossed through a particular threshold of actions taken, or when their profile attributes match specific key values or other conditions, you could use those as a trigger for some other behavior, namely sending a message of some form, maybe a tutorial, or a feedback request, or an email.

Kevin: An email being like, "Thank you for being a loyal customer for..."

Patrick: Exactly, or, "We see that you've used this five times without success. What are we doing wrong?" It's like the core mechanic of Intercom, is this behavior-based messaging in the messenger and all of this.

All of that is centered around this ElasticSearch cluster, both for you being able to search through the real-time contents of conversations coming in. That's maybe 50 percent of it. The other half of it was the profile data that I spoke about.

The very, very last thing that happened for any outbound automated message was a check against the real-time search indices to see that you still matched all of the conditions because data is far from perfect. There might be cases where you just marginally dipped into a condition or you might be flapping like this.

It was very, very important to have this final check because also, you could have an expansive fan-out process that might have taken a while. People's data might have gone stale. If you did a fan out to all of your active customers in North America five hours ago and then somebody isn't active anymore or whatever, then you don't want them to get that email.

Effectively, there's some breakage in the fan out because people might not match a particular attribute anymore. Maybe a region has changed or a state has changed so they don't match the filter. You don't want them to get that message.

Kevin: What's doing this final check? Is that ElasticSearch?

Patrick: Yeah. ElasticSearch was used as one component in the basis of that last check. There was both a front of house and a back of house, two very, very heavy components that used ElasticSearch. One that drove all of the interactions that people saw when they were working on conversations.

The other was effectively a gating mechanism on any outbound automated messaging. You could still do one-to-one transactional stuff, but all of your daily automated heavy email, as it was going out, did a last-minute validation that the triggers the message was queued up on still matched the user, so that we didn't send nonsense.
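That last-minute gate can be sketched as re-evaluating the targeting conditions against the live profile just before send, because the fan-out may have started hours earlier. The attribute names and predicate shape here are invented.

```python
# Sketch of the final pre-send check: re-evaluate the campaign's targeting
# conditions against the live profile data right before the message goes out.
def still_matches(profile: dict, conditions: dict) -> bool:
    """Re-check that a profile still satisfies every targeting condition."""
    return all(profile.get(attr) == want for attr, want in conditions.items())

def send_if_fresh(profile: dict, conditions: dict, send) -> bool:
    if still_matches(profile, conditions):
        send(profile)
        return True
    return False  # data went stale between fan-out and send: suppress it

campaign = {"region": "NA", "active": True}
fresh = {"region": "NA", "active": True}
stale = {"region": "NA", "active": False}  # went inactive since fan-out

sent = []
print(send_if_fresh(fresh, campaign, sent.append))   # True, message goes out
print(send_if_fresh(stale, campaign, sent.append))   # False, suppressed
```

The catch, as the story shows, is that this check is only as good as the index it queries: if the index is empty, nothing matches, and everything is suppressed.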

Kevin: This is the ElasticSearch cluster, which is dedicated basically to this cron system that is used for...

Patrick: There's two clusters. They're both used one front of house and one back of house.

Shortly after things begin to reappear online, we start to get user reports that there's no search results in the inbox. There's just nothing there, potentially.

Kevin: What is the change that you've made which is causing things to come back online? This is where you went into the Terraform and...

Patrick: I reverted my prospective CIDR block allocations and said, "Not today. We'll do that some other day. We'll go back to the known good state." I just put everything back as it was previously. Hosts that were in the same region, sub-region, availability zone, terminology soup in AWS, all of those availability zones were given back their old allocations. We went back to as it was.

Kevin: This was a case where you could at least kinda roll back to a known good state with Terraform. You give it back the old config. It looks at the current state of the world. It's like, "Oh, my God." It goes and deletes all of the networking again and recreates it back the way that it originally was. Traffic starts flowing, finally.

Patrick: Every server that previously couldn't see any other server can now see everybody again.

Kevin: Problem solved, right? [laughs]

Patrick: Ostensibly, yes. For a sweet, blissful moment there, it looks like we've escaped from what should be a pretty gnarly incident without much in the way of issue.

Kevin: Only 15 minutes of downtime. It's not great but...

Patrick: Exactly. 15 minutes.

Kevin: If you weren't trying to use the app for those 15 minutes, you'd never notice. Except now you're getting user reports. They're like, "There's no search results here. What's...?"

Patrick: Other signs of life, I believe...outbound activity alarms are beginning to show. Queues are getting backed up because messages that should be fanning out and sending on a regular basis are now no longer. Simultaneously, you have these two pressures beginning to build.

They all center around this ElasticSearch.

Kevin: Is it both ElasticSearches that are having problems, the front of house and the back of house one?

Patrick: Yep.

Kevin: That's weird.

Patrick: We get these reports that things are broken and then when you look into them, it appears that ElasticSearch is the culprit.

Kevin: The common mode failure.

Patrick: Yeah.

Kevin: What I'm getting from this is that these two ElasticSearches are both like, they're separate because you don't want search going down to affect the sending of these automated messages, but in this case, there's something in common. Maybe we...

Patrick: There's some common component to them. Like Intercom was such a heavy consumer of ElasticSearch at various points that I...It was like a staffed role within the infrastructure team...

Kevin: ElasticSearch...

Patrick: ...to run it as an internal service.

Kevin: ...-keeper?

Patrick: In this case, all of the infrastructure that was used for this stuff was communal, front and back.

Kevin: Interesting. So somebody's responsible for both of these ElasticSearch clusters?

Patrick: Yeah.

Kevin: Cool.

Patrick: We look into it. What's odd is, ElasticSearch is there. When you address it, it's there. It's the same host that was there, is there. Then we look a little bit closer. We get network proximity to them and start making some diagnostic requests. It begins to show us that, while it is green, the cluster status is green, all of those alarms are OK, it is entirely empty. There is not a drop of data in it.

Kevin: There was data in it before this happened?

Patrick: There was, yeah. There was quite a lot.

Kevin: Where did it go? [laughs]

Patrick: Mystery. The uptime for these hosts is not very high. In fact, it's equal to the event horizon or
the event time rather. What's going on there? We look and look a little bit further, and
so the team that was managing ElasticSearch was using AWS OpsWorks to do so.

Kevin: I've never heard of this, which is not at all surprising. AWS world, it has so many products, but, what
is OpsWorks?

Patrick: It's a server orchestration product. I am myself not hugely familiar with it because its use was very limited at Intercom. Intercom had its own infrastructure orchestration and application deployment system that was called Muster.

I can only liken Muster to the family dog that ate a bunch of batteries and never looked at you the same again, but was loved all the same, because, faults and everything, Muster was your dog. Everybody loved Muster.

Kevin: That's, infrastructure stuff like that is, once you start with it, you're never moving away from it. You're just...

Patrick: It is as old as Intercom because it was like a boss-created code base. It was created originally by the VP of engineering.

Kevin: Oh even better.

Patrick: It predates so many different other orchestration tools that came subsequently.

Kevin: For sure.

Patrick: Muster is the thing that Intercom uses exclusively, but the thing that deploys Muster itself was OpsWorks. You spot the turtles all the way down. There needed to be something that was in charge
of keeping some... Because Muster ultimately had a fleet of workers that took a job queue and did the things, so something needed to make sure that those were always running because that was the basis by which other manipulations would happen, like code would get deployed or a rollback would happen or X, Y, and Z.

Kevin: Muster was responsible for the production services, whereas Terraform was responsible for...? Infrastructure?

Patrick: Right. These auto scaling groups that hosted those services. Muster was more responsible for taking your merges to main and bundling that up and putting it on the hosts that were orchestrated and
otherwise organized by Terraform.

Kevin: Got it. CI/CD for the production service and...?

Patrick: Yeah.

Kevin: I'm assuming that Muster also does things like checks to see if the service pings and then reboots if it doesn't.

Patrick: Yeah, exactly. Muster was effectively like a very thick layer on top of auto scaling groups, manipulating auto scaling groups, and using the healthy host count and other triggers from various things pointing at them.

Like a cluster, a Web cluster, would just translate to an auto scaling group that maybe had a load balancer and other things associated with it, and then Muster would just manage that. There were hundreds of auto scaling groups, logically grouped into applications. Then they would be deployed communally. The Intercom Rails code base would go out to the various 150 or so clusters that were doing various Web fleets, background jobs, cron, or whatever.

As a separate concern, OpsWorks was the thing that was like the most core infrastructure orchestration tool beneath it all, where it was used by the infrastructure team to deploy the things that they made available as a service to other parts of Intercom. Because they didn't want to... There had been prior Muster-deploys-itself adventures and we didn't want to revisit them. OpsWorks was the thing that did it instead.

OpsWorks was running these ElasticSearch clusters, because this is prior to AWS offering managed anything in the ElasticSearch world. This is, we are running our own ElasticSearch...

Kevin: On some EC2 boxes with...

Patrick: Yeah. We are responsible for it all year round, not just for Christmas. We are doing this. OpsWorks is the thing that's responsible for those hosts.

OpsWorks in its wisdom, so its perspective of the event, when we go back to t=0, is that the systems that it was managing disappeared.

Kevin: Right. Yes. Can't ping 'em.

Patrick: It couldn't make any direct connection to them.

Kevin: Health check fails.

Patrick: It could see them in the AWS host...

Kevin: Yes, API.

Patrick: The host enumeration, because everything is still there, but it can't see them. Its next best idea is to reboot the host.

Kevin: Kick them. [laughs]

Patrick: Its first thing is a good old...

Kevin: Kill it and...

Patrick: Kick them.

Kevin: [laughs]

Patrick: What we don't know, and we only subsequently discover, is that the disks that are used as the basis for the index storage are ephemeral, which means that this rebooting empties them. They...

Kevin: They're gone.

Patrick: Yeah, they're gone.

Kevin: You're not using EBS volumes, the Elastic Block Store volumes attached to these EC2 instances for storing this data, which is good because EBS still hasn't fixed the p99 latency problem, which is like an hour. You go to make a file system request, and it's like, "Come back after lunch." You're like...

Patrick: I need it now.

Kevin: "I'm running a database." I assume that that's why, or something similar is why you're doing this. You're doing this on instance storage, which goes away when...

Patrick: Exactly. We have the same number of instances that we had previously. They're just carrying a lot less data. This has happened everywhere. This is a communal configuration issue for how this template of ElasticSearch instance is running and being managed by OpsWorks. Every ElasticSearch cluster across the company right now is pristine, unsullied by any data. Just completely clean.

Kevin: Poof.

Patrick: Poof.

Kevin: Was this the first time that this had happened?

Patrick: Yeah. I think it was the first time that we had an ElasticSearch server blight. It was at this point when I saw empty indices that I was very nauseous because it was unclear what the recovery time would be to...We're talking a billion or more user documents to put into this index.

There are systems that become...I think the term is semi-stable, where they're technically operating and responding to queries, but they're past a point of a sustainable operation because once they exit that envelope, there's no recovery. We had entered that moment.

Kevin: The ElasticSearch clusters aren't the source of truth for this data. The source of truth is the Dynamo, I think?

Patrick: Mm-hmm.

Kevin: The data still exists somewhere, and it's just not in a useful form.

Patrick: Yes.

Kevin: [laughs] Better than the alternative.

Patrick: Thankfully, the backing story of the Dynamo use prior to that...I joined Intercom in 2013. When I joined, Intercom's primary document store was Mongo.

Kevin: Love it. [laughs]

Patrick: Just oodles and bundles of fun.

Kevin: Yes.

Patrick: Basically, with respect to the contents of user documents, Intercom doesn't quite care all that much. Actually, it would prefer to keep each customer's user documents in its own domain. What is useful and necessary to keep is a unique identity index of different identifiers that have been used in different workspaces across Intercom, and then index those into specific documents. It's not necessary to query across documents so much. Certainly, not globally. What is necessary is unique identity index management within a workspace.

That plus rich document manipulation is the thing that's hard to square. Intercom ended up making use of both Aurora RDS plus Dynamo to square the circle. The literal identity index, what unique identities exist in a workspace shard, like what customer shard, and what specific unique document identifier a given user ID or email translates to, was all maintained in RDS. Dynamo got to do all of its great thing with respect to keeping less structured data.

Kevin: Got it. You're using a relational database to do relational database things so that you can do things both across customers, I think I'm getting...

[crosstalk]

Patrick: Not across customers. Basically, Intercom wanted a large expansive document store that was amenable to the access patterns of many different customers, but that also we didn't have to run. That was found in Dynamo.

What was not found in Dynamo was the ability to also maintain the unique identity index that was necessary to perform certain application-level features. It was the combination of the two as one API that from a developer's perspective gave people both.

This became abstracted as a document store for developers to work on. They just got to write things into it, and identities were taken care of.

Kevin: A document store, where you can pull up a user by their email address easily.

Patrick: In a shard.

Kevin: In a shard. A regular document store, what is it if you haven't...It doesn't really have indexes. You've got the key and then a huge JSON blob. The ability to go from something human-readable, human-understandable like an email address to that giant blob requires some additional machinery on top of it.

Patrick: This is the best of both worlds for us because...

Kevin: Absolutely. We must have had a similar thing at Stripe because there's no way we were going to sit there, iterating through all bazillion customers to find the one with the right email address.

Patrick: It leaves RDS to do a very relational-database job in a very highly performant way. It's like, "You actually rock at this. I'm very happy to use you." It'll scale to whatever capacity I care about.

Dynamo for its part was upholding the operational simplicity and less toil, which was a big, big, big thing. Previously, operating Mongo in any capacity was...It was very, very difficult, particularly at the scale that Intercom wanted to do it. It was very challenging.

Then, the final piece of that puzzle was connecting it into some good search infrastructure. This was the oplog bit that I spoke about earlier.

Kevin: Where ElasticSearch comes in.

Patrick: The changes that were being fed into user documents and other things going on in Dynamo ultimately made their way as changes into ElasticSearch.

Kevin: Got it.

Patrick: This ElasticSearch was the basis for like, search for all users that are in Canada but that have purchased five widgets and use Chrome and that last logged in a week ago or more. This is ElasticSearch's game, right?

Kevin: Absolutely. Oh, my God. Yes. It loves that.

Patrick: You get to, in many ways, use the right tool for the right job, slice and dice. Then you're providing a unified interface to it via the APIs and via the...

Kevin: Yeah.

Patrick: This was only really possible because Intercom was a large monolithic Rails code base, where it was possible to do this by convention. It certainly wasn't going to be possible by convention in a multi-repo setup unless you had other tooling.

Kevin: Oh, sure.

Patrick: We barely managed this with bash scripts, just grep and other things that forbade you from using anti-patterns and directed you to use something else instead. We would grep for any direct document store sublayer usage, except in these blessed files, and forbid you from checking your code in if you hit it.

We're like, "Hey, please go through the one door that we have prepared for you. Otherwise, we're going to lose track."
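The real enforcement was bash scripts built on grep, but the idea can be sketched in a few lines of Python. Everything here — the blessed file name and the `DocumentStoreClient` pattern — is invented for illustration.

```python
# Sketch: fail CI if any file outside an allowlist touches the raw
# document-store layer directly, forcing everyone through "one door".
BLESSED = {"lib/user_store.rb"}       # hypothetical blessed wrapper file
FORBIDDEN = "DocumentStoreClient"     # hypothetical low-level API name

def check(files):
    """files: mapping of path -> source text; returns offending paths."""
    return [
        path for path, text in files.items()
        if path not in BLESSED and FORBIDDEN in text
    ]

repo = {
    "lib/user_store.rb": "client = DocumentStoreClient.new",  # allowed
    "app/models/user.rb": "DocumentStoreClient.get(id)",      # violation
    "app/models/tag.rb": "UserStore.find(id)",                # fine
}
print(check(repo))  # ['app/models/user.rb']
```

A CI job would run this over the checkout and exit nonzero on any violation, which is the "allowing you to check your code in" gate Patrick mentions.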

Kevin: Oh the spaghetti that would result.

Patrick: By great fortune, a lot of this work had come to fruition about six months prior. One of the semi-stable states that Intercom had before was recovering from an ElasticSearch outage, because in the Mongo world it was really slow. It was thought that the setup we could create on Dynamo would allow a way quicker recovery. That work had only come to be done a couple of months prior.

Kevin: Somebody was like, "I'm really concerned that if we lose these ElasticSearches...." Because what I think I'm hearing is that these ElasticSearches have a searchable version of the oplog back to the beginning of time pretty much.

Patrick: They have a searchable history of the current document for... not over each individual operation but just what is the current form of the document, which is the baked representation of the oplog.

Kevin: Got it. The oplog is what's keeping that state up to date, and then...You need some way to get all of the documents from Dynamo into ElasticSearch so that they can be searched.

Patrick: Plus some changes that are going on right now and other things.

Kevin: Plus the changes. This process will not be instantaneous, or even if it was, you don't want to...Or you could stop Dynamo, stop all changes in Dynamo until you finish this process, but no, that's terrible. You send it over and then you start streaming the oplog immediately and eventually it will catch up to real time.
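The snapshot-then-replay recovery Kevin is describing could be sketched like this — a hypothetical illustration, with in-memory dicts standing in for the snapshot, the change log, and the ElasticSearch index:

```python
# Sketch: bulk-load a point-in-time snapshot into the search index,
# then replay the change log from that point until caught up.
def rebuild_index(snapshot, oplog, snapshot_seq):
    index = {}
    # Phase 1: bulk load the snapshot. Writes keep flowing to the
    # source of truth (Dynamo) in the meantime.
    for doc_id, doc in snapshot.items():
        index[doc_id] = doc
    # Phase 2: replay only the changes after the snapshot point.
    # Last-writer-wins on whole documents, so replay is idempotent.
    for seq, doc_id, doc in oplog:
        if seq > snapshot_seq:
            index[doc_id] = doc
    return index

snapshot = {"u1": {"email": "a@x.com", "country": "CA"}}
oplog = [
    (1, "u1", {"email": "a@x.com", "country": "CA"}),  # already in snapshot
    (2, "u2", {"email": "b@x.com", "country": "IE"}),  # arrived during rebuild
]
print(rebuild_index(snapshot, oplog, snapshot_seq=1))
```

The key property is that you never have to stop writes: the index is stale during phase 1 but converges once the replay catches up to real time.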

Patrick: Going back to your question. This was something that we had identified as an organizational risk that ElasticSearch was now growing too big to be managed or that a recovery would be painful if and when it happened.

Kevin: Here you are. [laughs]

Patrick: Like I said, about six months after this work had finally come to a close, we had drilled some...I say we, the other team had drilled...

Kevin: The ElasticSearch team.

Patrick: Yeah, had drilled a recovery for this, and on that day it worked. Not to say that it was quick, but it was something like...I think it was 10 hours.

Kevin: [laughs] Not what you wanted to spend the Friday doing, and into the evening.

Patrick: No. Not what we wanted to spend the Friday on, but otherwise it would have been days, or possibly...I don't know that it was even bounded. Multiplying it out like that is naive.

It's clear that something is going to wreck me about halfway through doing this job if I tell you a projection of this thing multiplied out to four days or something. There's some other bound here that is not immediately visible that's going to completely wreck us.

We were able to get back to partial states of recovery a little bit quicker, but getting back to our full reality, where the documents matched the tallied reality of everything that had happened, was around the 10-hour mark.

Kevin: Was this basically taking a dump from the Dynamo and...

Patrick: Now this is where my details get a little bit more hazy, but I believe this was based on pretty decent snapshotting that we had. We had some ability to recover from a more recent point in time and play a shorter form of the log because...

Kevin: There was some backups of the ElasticSearch, thank goodness.

Patrick: This was honestly the last bit of Intercom's data puzzle, as it were, clicking into place, because there had always been parts of the platform that it felt like we were building like Wallace and Gromit, laying the track out ahead of us. Not necessarily...

Kevin: Classic hypergrowth.

Patrick: A success in startups is getting to replace the old jank that you built five years ago.

Kevin: Or six months ago.

Patrick: Exactly. Or having earned the opportunity to come back and replace it, because oftentimes we'll strive for perfection the first time out, but perfection isn't...

Kevin: Always a mistake.

Patrick: ...going to get you always to the next thing. Pragmatism is required to optimize your time utilization and do the thing that's worthwhile and then re-evaluate.

Kevin: In startup land, you're always six months away from going out of business or at least not being able to raise the next round, and so it's like, "Yeah." [laughs]

Patrick: The one thing that you have that nobody else has is focus on your idea, the thing that only you can do, but it requires you to do that rather than getting distracted by everything else that anybody else can do. Anybody can build the more generic, perfect CI/CD thing to ship an app. That's rubbish. [laughs]

Kevin: Are you a CI/CD company? No? Then don't. Stop that. [laughs]

Patrick: There was definitely a strong sense of that pragmatism practiced at Intercom. It's one of the engineering values that I was happiest to be around for so long, and one that rubbed off on me.

Kevin: Nice. Was that the end of the incident? You recovered the ElasticSearch and then...

Patrick: We recovered the ElasticSearch and then people's messages that were queued, because...Message fan-out was also based on search material. There was a two-phase break here, where new messages couldn't be fanned out because we couldn't do the search-to-ID expansion.

Then the check, the final stage of sending a message, couldn't complete, because even the jobs that had already been queued wouldn't be able to do that final validation to see, "Hey, do you still match these starting conditions?"

That was just a complete stasis, and messages built up over time. People were still queuing their day's worth of stuff to happen. Unfortunately, we can't replay everything. People might have gone in and out of conditions, and that's one of the...

Operationally, Intercom was quite a complicated product, or maybe it was quite rich in terms of the functionality it provided, and that translated into a lot of requirements to be pretty sensitive to data changes and to schedule things.

If the moment has passed, with respect to a pop-up message that needs to happen on a page or other things, it's either a hit in a very small window of time or a miss entirely. The loss of this search capacity was a big hit for those hours.
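The two-phase send Patrick describes could be sketched like this — a hypothetical illustration, with a dict standing in for the searchable user index and invented function names:

```python
# Sketch: phase one expands a search condition into recipient IDs and
# queues send jobs; phase two re-validates the condition at delivery
# time, since users may have drifted out of it while jobs sat queued.
def fan_out(users, condition):
    # Phase one: search-to-ID expansion (ElasticSearch's job in reality).
    return [uid for uid, attrs in users.items() if condition(attrs)]

def deliver_queued(users, queued, condition):
    # Phase two: final validation right before actually sending.
    return [uid for uid in queued if uid in users and condition(users[uid])]

users = {
    "u1": {"country": "CA", "widgets": 6},
    "u2": {"country": "CA", "widgets": 9},
}
in_canada = lambda attrs: attrs["country"] == "CA"

queued = fan_out(users, in_canada)               # ['u1', 'u2']
users["u2"]["country"] = "US"                    # u2 drifts out of the condition
print(deliver_queued(users, queued, in_canada))  # ['u1']
```

With search down, both phases stall: fan-out can't expand conditions into IDs, and already-queued jobs can't run their final check, which is the "complete stasis" described above.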

Then also the inbox, the inbound side: the people working there need to look up other conversations related to broken widgets, and they can't do that because search is down. This also gets in the way of them being measured on conversations closed per hour, or other measures of efficiency that are very strained.

This tool not working gets directly in their way. It was like a...

Kevin: It wasn't just down tools for Intercom, it was down tools for all of Intercom's customers, [laughs] which is...

Patrick: Pretty sizable. The conversation index came back much quicker, so people were able to resume the business side of getting through their conversation backlog and interacting with people sooner. It was the outbound messaging that took longer, because the document store for...People have way more users than they have conversations with their users.

Kevin: Yeah, because support requests are relatively rare.

Patrick: This large billion-plus document store had to be all just pushed back into ElasticSearch and drilled into shape.

Kevin: Once ElasticSearch gets back up with all the data in there, now there's another queue that has to process with all of the messages that are all of the jobs, which are running to be like, "OK, do I..."

Patrick: "Can I send?"

Kevin: "Can I send," queuing up all the messages to send. Then, there's a fourth queue. I think, we're on third queue, fourth queue, of like, "OK," actually going and sending those messages. OK. Woof.

Patrick: It's like a complex system. Just comes back to life.

Kevin: Then presumably, a bunch of people got in to work on Monday at your customers to a pile of responses to these messages or things that hadn't processed and got to go through that.

Patrick: I'm sure that there were ramifications that happened on Monday. I remember being pretty well frazzled by the Friday's activities. It was a good one.

Kevin: You closed up, what, maybe 8:00 PM.

Patrick: It was a late...We had three different offices. We had Dublin, San Francisco, and London all participating in this incident response in multiple hangouts.

It was one of the better incident coordinations that I had seen with three locations going, because it was possible to divide the problem up per cluster or per... The clusters were logically broken up into different application areas, which oftentimes had different product teams as their consumers.

It was easy for us to say, "Hey, this cluster, you go do that," or the instructions are the same for everybody. Everybody's a local responder with respect to their team, which allows us to fan out the work.

Kevin: That's nice. Being able to swarm the problem with everybody restoring their own local setup, that's nice. I don't know. I kind of enjoy those. It's like everybody pitching in on a thing. It can be really fun once you've figured it out what needs to be done and are just in the process of doing it.

Patrick: Ownership as a term is maybe getting watered down a bit, but when practiced, you give people the tools, and you don't hide them away. It was a team's responsibility to run this infrastructure and to make it excellent. It wasn't their responsibility to keep it in an ivory tower, away from people.

In fact, the internal career growth for people who maybe wanted to grow into ops or grow into something or vice versa was quite strong. The more you get to show people, the more likely it is that they might take an interest.

Kevin: Were you managing this incident? Did you have an incident manager?

Patrick: Thankfully, we had an incident manager, the director of engineering at the time.

Kevin: Nice.

Patrick: A seasoned SRE from Facebook, Amazon, and some Irish outfits back in the day. The Intercom office in Dublin in particular had quite a lot of SRE talent hired in from the FAANGs.

For people who don't know, there's a geopolitical history to how the Internet spread into the world. The euro currency, and the fact that Ireland was the only English-speaking country in the euro zone, meant that starting in the late 1980s and early 1990s there was a huge amount of foreign investment in Ireland, particularly from American multinationals using it as the expansion base for their European operations.

Quite a lot of American tech companies have their European headquarters in Dublin. For example, Google started there in the early 2000s. A lot of the initial offices that they had out there were not product development offices. I can still remember the rare old times of the 2010s where there was a, "Oh, products designed in California, operated in Dublin." [laughs] Maybe, but designed in California.

A lot of the offices that were built out there by Facebook, Google, and such were very SRE-focused for all of their European and onward data center operations. This was talent that was very, very heavily represented in Dublin and was possible for us to hire in.

Kevin: Patrick, this has been really lovely.

Patrick: Super fun.

Kevin: Where can people find you on the Internet?

Patrick: I can be found at patrickod.computer, which is a little home page that I'm building at the moment.

I am on Mastodon at infosec.exchange/patrickod or @patrickod. I don't know. How does one verbally describe your Mastodon address?

Kevin: I think it's an email address like patrickod@infosec.exchange...

Patrick: There you go.

Kevin: But not actually an email address, which is confusing.

Patrick: It's linked from patrickod.computer.

Kevin: Brilliant.

Patrick: That's a good place to go.

Kevin: Any parting thoughts?

Patrick: I don't know. Stay humble with computers. They'll trick you. [laughs] Stay humble with computers. They will trick you, but have fun. Do have fun. As serious as computers can be, it's people at the end of the day that matter.

Kevin: Yes, indeed. This has been the War Stories podcast on "Critical Point." I'm Kevin Riggle. This is Patrick O'Doherty. We'll see you next time.

Patrick: Thank you so much for having me.

Kevin (outro): Thanks so much for listening. If you liked that, please like and subscribe
down below. We're just getting the channel and the podcast series going, and it helps a lot to know that people want to hear more.

If you have an incident story you'd be willing to tell here, please email us at hello@complexsystems.group.

That goes double if you aren't a cis White dude. We're great and have great incident stories, and other people are also great and also have great incident stories. It often helps to say out loud that we're
looking for as many voices as we can. If that describes you, please shoot us an email.

Also, since Twitter—I mean X—but... OK, who am I really kidding? Since Twitter are in the news again, if you worked at Twitter and you have an incident story you'd like to tell from your time there, we'd especially like to hear from you.

You can find me on Twitter as @kevinriggle and on Mastodon at kevinriggle@ioc.exchange.

My consulting company Complex Systems Group is on the Web at complexsystems.group.

With that folks, til next time.
