Was Akamai the Velvet Underground of Site Reliability Engineering (SRE)? - Abhijit Mehta - Bonus
(Kevin) But yeah, I kinda, I often describe Akamai as kind of the Velvet Underground of, like, SRE because --
(Abhijit) Yeah.
(Kevin) -- it's -- the story about the Velvet Underground's first album was that only, like, 250 people bought it, but each one of them went and started a band.
And like, Akamai is kind of the same way where still most people haven't heard of it, but either the people or the thought, the -- the techniques, like…
(Abhijit) Yeah.
(Kevin) ... are a big part of the foundation of the modern Internet and modern Internet companies.
(Abhijit) Um, I remember my first incident at Akamai I actually learned about consistent hashing because it was -- it was one of those things, there was a product…
(Kevin) And -- sorry.
For our -- for our listeners, you know what consistent hashing is. I have the 20-minute explanation that they gave us when we joined the InfoSec department, but for our listeners who are coming into this cold, and actually many of our listeners who have probably not heard about Akamai at all…
First of all, what is Akamai?
(Abhijit) Sure. [laughs]
So, Akamai is a company that runs a network of servers at the edge, right? At ISPs, close to where you live. They made their money through caching. So, for example, if you type in the website of an Akamai customer, which --
(Kevin) iTunes.com was the example I always used at career fairs... That was the one we were allowed to use. We were not allowed to use the fruit -- fruit company in Cupertino.
(Abhijit) Exactly. Yeah. You notice what I did there? Where I just said -- 'cause I always forget --
(Kevin) Yeah.
(Abhijit) -- what is the list we're allowed to use and what aren't we.
But many Fortune 500 companies are Akamai customers --
(Kevin) Yes.
(Abhijit) -- right? So you type in one of these websites, and through some magic, uh, Akamai has the Internet mapped out. They figure out what server, Akamai server is closest to you on the Internet.
1:59
Closest to you in a topological sense.
The -- it's a little more complicated than that, but essentially that.
Um, DNS is the interface they use to hand out the IP of that Akamai server. So you make a request for iTunes.com, Akamai figures out where you are, hands you back the IP of an Akamai server that is close to you, you connect to that, and that server proxies your connection back to iTunes.com.
That server... Akamai's a content delivery network. That server probably has the song that you want to get in cache, at the edge --
(Kevin) Yes. Yes. Taylor Swift's 1989 (Taylor's Version) -- all the songs are there already. Like, someone else has requested them before you, yes.
(Abhijit) Yep. And so, it's all there at the edge so you get it super fast. Um, you know, there are other optimizations, right? Uh, TCP works better if you connect to something close by than something far away, so you're less -- you -- so you have better performance.
(Kevin) Right.
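The "mapping" step described here -- pick the edge server closest to the user and answer the DNS query with its IP -- might be sketched like this toy example. To be clear, every server name and coordinate below is invented, and real mapping uses measured network topology, not straight-line distance:

```python
import math

# Hypothetical edge servers with rough (lat, lon) locations.
# 192.0.2.0/24 is a reserved documentation range, not real Akamai IPs.
EDGE_SERVERS = {
    "192.0.2.10": (42.36, -71.06),   # Boston
    "192.0.2.20": (35.68, 139.69),   # Tokyo
    "192.0.2.30": (51.51, -0.13),    # London
}

def resolve(client_location):
    """Toy 'mapping': answer the DNS query with the IP of the closest edge."""
    lat, lon = client_location
    def distance(server_ip):
        slat, slon = EDGE_SERVERS[server_ip]
        return math.hypot(slat - lat, slon - lon)
    return min(EDGE_SERVERS, key=distance)

# A client near New York gets the Boston edge, not Tokyo or London.
assert resolve((40.71, -74.01)) == "192.0.2.10"
```

The real system answers this question inside DNS resolution itself, so the client just connects to whatever IP comes back.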
(Abhijit) Most of Akamai's business now is security related, right? And so there's all sorts of interesting and useful things you can do if you -- if you sort of sit as a buffer between the user, and --
(Kevin) Yes.
(Abhijit) -- the home server, or the company's server.
(Kevin) It's much easier to observe a DDoS attack when you can do that at most of the ISPs whose compromised customers are originating the traffic, rather than if you have to do it all in Cupertino.
(Abhijit) Exactly. Exactly.
And Akamai had a scale and sophistication -- both the intelligence of the platform and just how distributed it was -- such that it could absorb a lot of those things, right?
(Kevin) Yeah.
(Abhijit) I think there was another incident where, actually, you called me up at, like, three o'clock in the morning for a set of servers that I watched. And you were, like, hey, there's an attack, and I looked at it, and I'm, like, yup, there's an attack. And you were, like, should we do anything? I'm, like, well... Actually, the changes to do something... It's fine, right?
(Kevin) Seems fine. Yeah, yeah. Yeah. Watchful waiting.
(Abhijit) So, it's really fun, right?
3:59
Like, right out of school, to get to play with systems like that that -- that run the Internet. Um, so we were talking about consistent hashing.
(Kevin) Yes.
(Abhijit) Well, one of the cool things about it... So you think about, oh, you have these hundreds of thousands of servers spread across the world -- how do you get them to work together? And I think one of the important lessons I learned at Akamai was, like, the first rule of distributed systems problems is try to avoid solving distributed --
(Kevin) Yes. Yes.
(Abhijit) -- systems problems, right?
(Kevin) If at all possible, don't. [laughs]
(Abhijit) Yeah. And so, like, consistent hashing is a way of thinking about how to get -- how to deal with having a lot of machines without needing to worry about them all talking to each other.
So in particular, the idea is, hey, I want to connect to a server... The server should be up, hopefully. Is there a way to do that without the system having knowledge of which of the 200,000 servers are up at any given moment and doing a calculation each time?
And so the idea behind consistent hashing is you kind of, you imagine there's a ring. And you imagine each server is a place on the ring.
(Kevin) Yeah.
(Abhijit) And then you as a user, you come in and you have a hash that puts you somewhere else on the ring. And then, what you do -- and -- and that hash is consistent. You always end up on the same place on the -- on that circle. So then --
(Kevin) It's picking some features about me like likely my IP address, and it's using those features to pick a place on that ring --
(Abhijit) Exactly.
(Kevin) -- that corresponds to me. Yeah.
(Abhijit) Right.
(Kevin) And it's always gonna be the same place as long as I'm coming in from the same IP every time.
(Abhijit) Exactly. And so then, the idea is you always end up at the same place on that circle. And the way you pick what server you go to is you move along the circle. And the first -- similarly, the servers were hashed on the circle.
(Kevin) Right.
(Abhijit) The first server that you hit is the one you try to connect to. And if it doesn't work out, no big deal. You know, you keep going along the circle and go to the next one.
(Kevin) Okay.
(Abhijit) Um... And then that -- that's super powerful because it means, you know, the example you gave about iTunes…
6:01
It means that you are generally going to go back to the same server each time.
(Kevin) Right.
(Abhijit) People close to you will go to the same server each time.
And so... You know, it's likely the same Taylor Swift song, or something will be in cache right there, and that's -- that's how you can kinda make sure you go back to the right server without them all globally talking to each other.
(Kevin) Yeah.
My first guest on the podcast was Willie Williams who was actually an intern on the mapping team at Akamai in, like, 2004, and had a story about some changes that they made not to the consistent hashing, but to the geolocation, which wound up actually routing all of Yahoo's traffic to Japan.
(Abhijit) [laughs]
(Kevin) Needless to say, Yahoo was not terribly happy about this. Uh... But they got it fixed. And so, yeah -- same company, for any viewers who caught that episode. And yeah, some of the technology underlying that, with the goal, right, of keeping your accesses, uh, local.
'Cause also, like, clients will often request the same thing multiple times, so, you know, there's a certain locality to that. There's, yeah, locality -- like, certain geographic regions will get super interested in Taylor Swift at the same time, possibly because there's an Eras Tour concert happening there, you know, that day. And so, yeah.
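The ring scheme described above can be sketched in a few lines of Python. This is a toy illustration of the technique, not Akamai's implementation; the server names are made up, and a real system would typically hash each server to many virtual points to even out the load:

```python
import hashlib

def ring_position(key: str) -> int:
    """Hash a string to a point on a ring of size 2**32."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server hashes to a fixed point on the ring.
        self.points = sorted((ring_position(s), s) for s in servers)

    def server_for(self, client_key: str) -> str:
        """Start at the client's point and walk clockwise to the first server."""
        pos = ring_position(client_key)
        for point, server in self.points:
            if point >= pos:
                return server
        return self.points[0][1]  # wrapped past the top of the ring

ring = ConsistentHashRing([f"edge-{i}.example.net" for i in range(8)])
# The same client key always lands on the same server.
assert ring.server_for("203.0.113.7") == ring.server_for("203.0.113.7")
```

Because each client's position on the ring is fixed, repeat requests keep returning to the same cache -- the locality property discussed above -- without any server having to know the global state of the network.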
(Abhijit) So, yeah, and it was super fun, right? Because most, uh... Many people who've learned about consistent hashing, say, in a class in college…
I wasn't a CS major. I was learning quantum field theory, or, you know, whatever. I was there doing physics. Um, so, like, the way I learned about consistent hashing was there was an incident in the mapper org where a bunch of servers that served a particular type of traffic went down, and um...
You know, the root cause had to do with how consistent hashing had been implemented for that particular product in a way that worked great in testing,
7:59
worked great for small customers, but when you put a big customer on it, you ended up with a memory leak, or something like that.
(Kevin) Okay, yeah. Sure.
(Abhijit) But it was -- it was cool because, like, I got to learn about this really interesting mathematical concept in distributed systems because something broke at scale and took down a bunch of services.
(Kevin) And suddenly the math was relevant to something in the real world, and also we had to fix it, like, yesterday.
(Abhijit) Exactly.
Another cool thing about that, though, was when that happened, there was no customer impact because the way Akamai's systems were designed was to... There were all sorts of tricks to --
(Kevin) Right.
(Abhijit) -- mitigate.
(Kevin) You kinda just move -- you kind of -- if the clients can't connect, they kind of just keep moving around that ring, and there are -- there are a lot more servers. And eventually, they hit one that works, hopefully.
(Abhijit) Yeah, yeah. And other tricks, too, right? There's just a lot of really beautiful -- beautiful mitigations, you know.
(Kevin) All the -- the towering pile of hacks that makes the Internet work as well as it does. Yeah.
(Abhijit) If one set of config seem to make everything crash, go to a machine that was running the config from an hour ago, right? Like, so many --
(Kevin) Exactly. Yes.
(Abhijit) -- so many things like that, right?
(Kevin) Yes.
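The point about clients just continuing around the ring is also why a failure remaps so little traffic. A toy sketch, with the same caveats as before -- invented names, and `is_up` standing in for whatever real health signal the system has:

```python
import hashlib

def ring_position(key: str) -> int:
    """Hash a string to a point on a ring of size 2**32."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

def server_for(client_key, servers, is_up):
    """Walk clockwise from the client's point, skipping any down servers."""
    points = sorted((ring_position(s), s) for s in servers)
    pos = ring_position(client_key)
    # Rotate the ring so the walk starts at the client's position.
    ordered = [s for p, s in points if p >= pos] + [s for p, s in points if p < pos]
    for server in ordered:
        if is_up(server):
            return server
    raise RuntimeError("no live servers")

servers = [f"edge-{i}.example.net" for i in range(8)]
primary = server_for("203.0.113.7", servers, is_up=lambda s: True)
# If the primary dies, this client slides to the next point on the ring,
# while clients mapped to other servers are unaffected.
backup = server_for("203.0.113.7", servers, is_up=lambda s: s != primary)
assert backup != primary
```

Only the keys that were on the dead server move; everyone else's assignment is untouched, which is exactly the "consistent" part of consistent hashing.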
(Abhijit) So it was cool to see a system that resilient. Which, like, the jobs I used to run on the Open Science Grid were a little resilient, but not that resilient.
(Kevin) Yeah.
(Abhijit) Um... The other thing that I thought was really cool, though, was just the culture, right? How people would handle an incident.
I think Erik Nygren used to say, when people would ask him, hey, what's the thing you're most proud of at Akamai, he'd say the incident process, right? Which, um…
(Kevin) Because then that was his baby. Or at least... When I was --
(Abhijit) Yeah.
(Kevin) When we received it, it had gone through a couple of hands since him, I think, but his fingerprints, his authorship, were very, very evident on, like, you know, everything there. And a lot of the innovations that we received, like the business incident lead, he had pioneered almost by necessity, because the CEO walked into his office while he was running an incident, and as the tech lead, he was, like,
9:59
I cannot both keep the CEO happy and keep -- keep the engineers moving forward. Congratulations to the person next to him, you have the con, and he became the first business incident lead by walking Tom out of his office. [laughs]
(Abhijit) Yep, yep.
(Kevin) Or whoever was the CEO then.
(Abhijit) And the cool thing about that is, like, that wasn't just Akamai, right? Like --
(Kevin) No.
(Abhijit) So many folks at Akamai went to run critical bits of platform at Facebook and Google and other places, and uh, you know, like, you read the Google SRE book, and you see the incident process at Google, and it's, like, many of the same attributes.
(Kevin) I think some of that is convergent evolution? I've been trying to trace that. I think some of it is convergent evolution, but some of it absolutely is.
Like the guy who was the architect of Query at Akamai, which was a way of running SQL-like queries over the entire network to answer questions ranging from how many regions are up to, like, what software is running on these servers.
Uh, he took that to Facebook and was the lead architect on osquery -- which, folks, you may have used -- the open source version of that. It doesn't have all the nice systemic features. Long story there. But the Uptycs folks, also some former Akamai folks, were working on bringing that to market. But yeah.
(Abhijit) Yeah. And it's -- I think some of it, too, is, like, Akamai had a very East coast buttoned-up culture, and --
(Kevin) Well, a very academic culture, too.
(Abhijit) Yeah! And -- and Silicon Valley companies like Google,
12:01
I think, have done a better job evangelizing so, like --
(Kevin) Yes. Oh, my god.
(Abhijit) So that's -- that's definitely a thing, right? But --
(Kevin) Akamai was afraid to talk about some of this stuff. Sometimes because it was, like, oh, this is proprietary to us. Sometimes it was, like, this feels janky, and we're embarrassed by it.
No, actually, it's, like, the best thing in the industry. But we were very embarrassed by it.
And sometimes, you know, also because, like, we built it with Perl and makefiles in 1998 and just kept it going, and so the Googles of the world got to come back and, like, build it with modern software.
You know.
(Abhijit) Well, yeah. [laughs]
(Kevin) Thanks so much for watching and listening. If you enjoyed this outtake, check out the full episode with Abhijit Mehta. Link in the episode description below.
[experimental Japanese pop music outro plays]