Was Akamai the Velvet Underground of Site Reliability Engineering (SRE)? - Abhijit Mehta - Bonus

Abhijit:

But,

Kevin:

yeah, I kind of I often describe Akamai as kind of the Velvet Underground of, like, SRE because

Abhijit:

Yeah.

Kevin:

It's just, the story about the Velvet Underground's first album was that, like, only, like, 250 people bought it, but each one of them went and started a band. And, like, Akamai is kind of the same way where, like, you know, still, you know, most people haven't heard of it, but either the people or the thought, you know, the techniques, like

Abhijit:

Yeah.

Kevin:

Are a big part of the foundation of the modern, you know, Internet and modern Internet companies.

Abhijit:

I remember my first incident at Akamai, I actually learned about consistent hashing

Kevin:

because

Abhijit:

Okay. It was it was one of those things, there was a product

Kevin:

And sorry. For our listeners, you know what consistent hashing is? I have the 20-minute explanation that they gave us when we joined the infosec department. But for our listeners who are coming into this cold, and actually, many of our listeners who have probably not heard about Akamai at all: first of all, what is Akamai?

Abhijit:

Sure. So Akamai is a company that runs a network of servers at the edge. Right? So, in an ISP close to where you live. They made their money through caching.

Abhijit:

So, for example, if you type in a website for an Akamai customer, which

Kevin:

itunes.com was the example I always used at career fairs. That was the one we were allowed to use. We were not allowed to use the fruit company in Cupertino.

Abhijit:

Exactly. Yeah. You notice what I did there, where I just said, because I always forget, like, what is the list we're allowed to use and what aren't we? Exactly. Many Fortune 500 companies are Akamai customers.

Abhijit:

Right? So you type in one of these websites and, through some magic, Akamai has the Internet mapped out. They figure out what Akamai server is closest to you in the Internet, closest to you in, like, a topological sense. It's a little more complicated than that, but essentially that. DNS is the interface they use to hand out the IP of that Akamai server.

Abhijit:

So you make a request for itunes.com, Akamai figures out where you are, Akamai hands you back the IP of an Akamai server that is close to you. You connect to that, and there's a good chance that server, so, that server proxies your connection back to itunes.com. That server occupies the content delivery network. That server probably has the song that you want to get in cache at the edge. Yes.

Abhijit:

So, you know, the organizer Taylor is

Kevin:

optimized server. The 1989 (Taylor's Version), all the songs are there already. Like, someone else has requested them before you. Yes.

Abhijit:

Yep. And so it's all there at the edge, so you get it super fast. You know, there are other optimizations. Right? TCP works better if you connect to something close by than something far away, so you have better performance.

Abhijit:

Right. Most of Akamai's business now is security related. Right? And so there's all sorts of interesting and useful things you can do if you sort of sit as a buffer between the user and... Yes. The home server or the company server.

Kevin:

It's much easier to absorb a DDoS attack when you can do that at, you know, most of the ISPs whose broken customers are originating the traffic, rather than if you have to do that all in Cupertino. Exactly.

Abhijit:

Exactly. And Akamai had a scale and sophistication that, you know, by both intelligence of the platform and just how distributed it was, it could absorb a lot of those things, right? I think there was another incident where, actually, you called me up at, like, 3 o'clock in the morning for a set of servers that I watched, and you're like, hey, there's an attack. And I looked at it. I'm like, yep, there's an attack.

Abhijit:

And we're like, should we do anything? Well, actually, the changes to do something, it's fine. Right? Like, watch it. Yeah.

Kevin:

Yeah. Yeah. Yeah. Yeah. Watchful waiting.

Abhijit:

So it's really fun. Right? Like, right out of school, to get to play with systems like that, that run the Internet. You know, so we were talking about consistent hashing. Yes.

Abhijit:

One of the cool things about it: so you think about, oh, you have these hundreds of thousands of servers spread across the world. How do you get them to work together? And I think one of the important lessons I learned at Akamai was, like, the first rule of distributed systems problems is try to avoid solving distributed systems problems. Right?

Kevin:

If at all possible, don't.

Abhijit:

Yeah. And so, like, consistent hashing is a way of thinking about how to deal with having a lot of machines without needing to worry about them all talking to each other. So, in particular, the idea is, hey. You know, I want to connect to a server. The server should be up, hopefully.

Abhijit:

Is there a way to do that without the system having knowledge of which of 200,000 servers are up at any given moment? Right. And so the idea behind consistent hashing is, you imagine there's a ring and you imagine each server is a place on the ring. Yep. And then you as a user, you come in and you have a hash that puts you somewhere else on the ring. And that hash is consistent.

Abhijit:

You always end up at the same place on that circle.

Kevin:

So it's picking some features about me, like, likely my IP address, and it's using those features to, like, pick a place on that ring that corresponds to me. Yeah. Right.

Abhijit:

And

Kevin:

it's always gonna be the same place as long as I'm coming in from the same IP every time.

Abhijit:

Exactly. And so then the idea is you always end up at the same place on that circle, and the way you pick what server you go to is you move along the circle. And, similarly, the servers were hashed onto the circle. Right. The first server that you hit is the one you try to connect to.

Abhijit:

And if it doesn't work out, no big deal. You know? You keep moving along the circle and go to the next one.

Kevin:

Okay.

Abhijit:

And then that's super powerful because it means, you know, in the example we gave with iTunes, it means that you are generally going to go back to the same server each time. People close to you will go to the same server each time. And so, you know, it's likely the same Taylor Swift song or something will be in cache right there, and that's... Right. That's how you can kind of make sure you go back to the right server without them all globally talking to each other.
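
The ring described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Akamai's implementation: the server names, the choice of SHA-256, the 32-bit ring size, and the use of the client IP as the hash key (Kevin's guess above) are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map any string to a stable position on a ring of size 2**32."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    """Toy consistent-hash ring: servers and clients share one hash space."""

    def __init__(self, servers):
        # Each server gets a fixed place on the ring, sorted by position.
        self._points = sorted((ring_position(s), s) for s in servers)
        self._positions = [pos for pos, _ in self._points]

    def candidates(self, client_key):
        """Yield servers in the order met when walking the ring from the client."""
        if not self._points:
            return
        start = bisect.bisect_right(self._positions, ring_position(client_key))
        count = len(self._points)
        for step in range(count):
            # Wrap around the end of the sorted list to close the circle.
            yield self._points[(start + step) % count][1]

    def pick(self, client_key, is_up=lambda server: True):
        """Return the first reachable server on the walk, or None if all are down."""
        for server in self.candidates(client_key):
            if is_up(server):
                return server
        return None


# Hypothetical server names and a documentation-range client IP.
ring = HashRing(["edge-a", "edge-b", "edge-c", "edge-d"])
first_choice = ring.pick("203.0.113.7")
fallback = ring.pick("203.0.113.7", is_up=lambda s: s != first_choice)
print(first_choice, fallback)
```

The same client key always lands on the same spot, so it keeps reaching the same server while that server is up. If that server is down, the client just keeps walking to the next one, and nothing in the sketch needs a global view of which machines are healthy.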

Kevin:

Yeah. My first guest on the podcast was Willie Williams, who was actually an intern on the mapping team at Akamai in, like, 2004 and had a story about some changes that they made, not to the consistent hashing, but to the geolocation, which wound up actually routing all of Yahoo's traffic to Japan. Needless to say, Yahoo was not terribly happy about this, but they got it fixed. And so, yeah, same company, for any viewers who caught that episode, and, you know, some of the technology underlying that, with the goal, right, of keeping your accesses local. Because also, like, clients will often, like, request the same thing multiple times.

Kevin:

So, you know, there's a certain locality to that. There's, yeah, locality, like, certain geographic regions will get super interested in Taylor Swift at the same time, possibly. Yep. Because there's an Eras Tour concert happening there, you know, that day. And so

Abhijit:

yeah. So yeah. And it was super fun. Right? Because many people would learn about consistent hashing, say, in a class in college.

Abhijit:

I wasn't a CS major. I was learning quantum field theory or, you know, whatever. Right.

Kevin:

I was

Abhijit:

doing physics. So, like, the way I learned about consistent hashing was there was an incident, in the map reward where a bunch of servers that served a particular type of traffic went down. And, you know, the the root cause had to do with how consistent hashing had been implemented for that particular product in a way that worked great in testing, worked great for small customers, but when you put a big customer on it, you ended up with a memory leak or something like that.

Kevin:

Okay. Yeah.

Abhijit:

Sure. But it was it was cool because it's like, oh, I got to learn about this really interesting mathematical concept in distributed systems because something broke at scale that took down Right. A bunch of, you know, a bunch of services.

Kevin:

And suddenly, the math was relevant to something in the real world. And... Yeah. Also we had to fix it, like, yesterday.

Abhijit:

Exactly. Another cool thing about that, though, was, like, when that happened, there was no customer impact, because the way Akamai systems were designed, there are all sorts of tricks to... Right. Mitigate.

Kevin:

You kind of just, if the clients can't connect, they kind of just keep moving around that ring, and there are a lot more servers. And eventually, they hit one that works, hopefully.

Abhijit:

Yeah. Yeah. And other tricks too. Right? There's just a lot of really beautiful, beautiful mitigations,

Kevin:

you know. The towering pile of hacks that makes the Internet work as well as it does. Yeah.

Abhijit:

If one set of config seemed to make everything crash, go to a machine that was running the config from an hour ago. Right?

Kevin:

Like, so many

Abhijit:

so many things like that. Right? Yes. So it was cool to see a system that resilient Yeah. Which, like, the jobs I used to run to run stuff on Open Science Grid were a little resilient, but not that resilient.

Abhijit:

Yeah. The other thing that I thought was really cool, though, was just the culture. Right? How people would handle an incident. I think, like, Eric Nygren used to say, like, people would ask him, like, hey.

Abhijit:

What's the thing you're most proud of at Akamai? And he'd say the incident process.

Kevin:

Yeah.

Abhijit:

Yeah. Which,

Kevin:

Because that was his baby. Or at least, when we received it, it had gone through a couple of hands, I think, since him. But his fingerprints, his authorship was very, very evident on, like, everything there and a lot of the innovations that we received. Like the business incident lead, like, he had pioneered almost by necessity because, like, the CEO walked into his office while he was running an incident as the tech lead. He was like, I cannot both keep the CEO happy and keep the engineers moving forward. Congratulations to the person next to him.

Kevin:

You have the con, and he became the first business incident lead by walking Tom out of his office.

Abhijit:

Yep. Yep.

Kevin:

Or whoever was the CEO then.

Abhijit:

And the cool thing about that is, like, that wasn't just Akamai. Right? Like, so many folks at Akamai went to run critical bits of platform at Facebook and Google and other places. Yep. And, you know, like, you read the Google SRE book and Yep.

Abhijit:

You see the incident process at Google and it's, like, many of the same attributes. Yep.

Kevin:

I think some of that is convergent evolution. I've been trying to trace that. I think some of it is convergent evolution, but some of it absolutely is. Like, the guy who was the architect on query at Akamai, which was a way of running SQL-like queries over the entire network to answer questions ranging from, you know, how many regions are up to, like, what software is running on these servers, took that to Facebook and was the lead architect on osquery, which folks here may have used, which is the open source version of that.

Kevin:

It doesn't have all of the next systemic features. Long story there. But the Uptycs folks, also some former Akamai folks, are working on bringing that to market. But, yeah, I kind of I often describe Akamai as kind of the Velvet Underground of, like, SRE because

Abhijit:

Yeah.

Kevin:

It's the story about the Velvet Underground's first album was that, like, only, like, 250 people bought it, but each one of them went and started a band. And, like, Akamai is kind of the same way where, like, you know, still, you know, most people haven't heard of it, but either the people or the thought, you know, the techniques, like

Abhijit:

Yeah.

Kevin:

Are a big part of the foundation of the modern, you know, Internet and modern Internet companies.

Abhijit:

Yeah. And it's funny. I think some of it too is, like, Akamai had a very, like, East Coast buttoned-up culture. And

Kevin:

Well, very academic culture too.

Abhijit:

Yeah. And Yeah. And Silicon Valley companies like Google, I think, have done a better job evangelizing. So, like, Yes.

Kevin:

Oh my god.

Abhijit:

So that's definitely a thing. Right? But Akamai was afraid to talk about some

Kevin:

of this stuff. Sometimes because it was, like, oh, this is proprietary. This is to us. Sometimes it was because, like, we don't... this feels janky, and we're embarrassed by it. No.

Kevin:

Actually, it's, like, the best thing in the industry, but we were very embarrassed by it. And sometimes, you know, also because, like, we built it with Perl and Makefiles in, like, 1998 and just kept it going. And so the Googles of the world got to come back and, like, build it with, like, modern software, you know.

Abhijit:

Well, yeah.

Kevin:

Yeah. Thanks so much for watching and listening. If you enjoyed this outtake, check out the full episode with Abhijit Mehta. Link in the episode description below.

Creators and Guests

Kevin Riggle
Host
Kevin Riggle
Cybersecurity consultant. Principal at Complex Systems Group, LLC.