She Missed the Overnight Push! - Julia Lunetta

Kevin Riggle: So you've been given this system which is designed around a very different sort of, a fairly different process, and need to go like cut out the stuff that doesn't apply.

Julia Lunetta: So I have all these scripts and I'm pretty, you know, I feel pretty confident that, you know, none of the, I don't know, let's say 20 of them are used in the configuration.

Kevin: Should be safe to prune.

Julia: Yes, 'should' being, of course, the operative word.

Kevin: Okay, yeah, yeah. But also, this is on the running system. This is like, there's only one system.

Julia: There's no other way to do it.

Kevin: So it's 4pm. The nightly batch job is going to start running in a few hours, and you're sitting at the console logged in, you're like, all right, I can do this.

Julia: Essentially, the order of operations, as I understood it, is move the files, wait to see if anything explodes, and then go home. So, I move the files, and everything's quiet. Like, okay, cool. This is, this seems good. I mean I was expecting precisely nothing to happen anyway, and it looks like nothing happened. So great, you know, I go home, I, you know, whatever, I go to bed. I get up the next morning. And I take a look at my phone and notice that basically everything's on fire.

Kevin: Okay!

[fade to black]

Kevin: Howdy, folks. Welcome back to the War Stories podcast on Critical Point. I'm Kevin Riggle. We are here with Julia Lunetta to talk about that time she broke production. Because that is what we do here is tell incident stories in public.

There's not really a lot of preamble since the last time, since last episode. We are still available everywhere you get your podcasts at warstories.criticalpoint.tv as well as on YouTube. But with all that said, we will roll the titles and kick off.

[music]

Kevin: And we're back. Once again, the Critical Point podcast, uh, the War Stories podcast from Critical Point with Julia Lunetta. Julia, can you tell us a little bit about yourself and how you found yourself in a place to break production?

Julia: Sure thing. So I'm currently a Scrum Master for a large financial company. But I have bounced around between like HR tech, you know, briefly worked for a company that did software development for online poker and casino. And all the way back to the investment side of the financial industry, which is my, which was my first real job out of college, which is when, which is where I broke production.

Okay. So this was a, this was a 529 company. I don't know how—

Kevin: What does that mean?

Julia: 529s are tax free or tax advantaged investment plans specifically for like education expenses, usually for colleges.

Kevin: College savings plans.

Julia: Yeah, exactly.

Kevin: I thought that was what that was. But I yeah, okay...

Julia: No it's fine, between the numbers and the jargon and the financial industry. I should know better. But I clearly don't.

Kevin: 401k, 501(c)3, blah blah blah. Like it's just, it's just alphabet soup.

Julia: Yeah.

Kevin: Okay, so company whose product was selling these tax-advantaged college savings, education savings plans and doing the investment behind them?

Julia: Yes. And so the part that where I came in was there was a whole automation engine that had been pieced together. As best I could, like, I don't want to quite say by hand, but it was, it was a heavily hacked version of some system monitoring software called 'Mon', which I had never heard of before and have never heard of since. So you know, I apologize if you know, the creator of Mon is listening.

Kevin: M-O-N ?

Julia: M-O-N as in 'mon-itor', yeah.

Kevin: Okay, but not Monit, which is what I've worked with. So okay, it's yeah, this is why we don't let engineers name things.

Julia: The config syntax is in m4, which is one of the things that made it memorable.

Kevin: I, why, why do I know...

Julia: Sendmail?

Kevin: Why have I worked... Sendmail. Oh, God. Yes, that might be why I know it.

Julia: That's usually where m4 comes in.

Kevin: 'Make' wasn't also using m4... something, some like old school Unix tool, besides Sendmail, I thought also used m4... I just, I have this memory of editing a file once that was in this format and... all that is left is just a sort of like, like ashen taste in my mouth. So... [laughs]

Julia: [laughs] Sounds about right.

Kevin: Okay, great. So Mon is this configur— is this monitoring tool, you configure it using m4 just like Sendmail.

Julia: Yeah. And...

Kevin: I'm getting the impression that this is the best software that like 1996 could produce. Because what year is this?

Julia: Maybe? This would have been 2004-2005. But it still could have been the best that '96 had to offer. Because it was, I believe actually, I'm not sure, if it was open source or not. But like I said, I know we had hacked it to hell and back so that it started out as monitoring software. And we turned it into basically this automation engine, with you know, any number of different dependencies to make sure that the different financial processes happened on time.

So each dependency would be something like, you know, is it happening in the proper time? So you know, if you say, you know, feed x is only supposed to happen between 4pm and 8pm on Tuesdays, you know, it checks that and then if it's within that period, it'll return a zero, which means 'go', or it'll return a one which means 'don't'.

And basically, you can line up as many of these as you want. And once everything across the board is zero, has returned zero in the last, I think we generally set it to like everything would run through every five minutes. That if everything was green, then the actual action would kick off.
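As an illustration of the gating Julia describes, here is a minimal Python sketch. It is not Mon's code or configuration: the function names, the 4pm to 8pm Tuesday window, and the manual-hold check (which comes up later in the conversation) are assumptions chosen to mirror the description. The only convention taken from the episode is that a dependency returns zero for 'go' and one for 'don't', and that the action fires only when everything is zero.

```python
from datetime import datetime

GO, DONT = 0, 1  # convention from the episode: 0 means 'go', 1 means 'don't'


def in_window(now, weekday, start_hour, end_hour):
    """Hypothetical time-window dependency: GO only inside the allowed period."""
    if now.weekday() == weekday and start_hour <= now.hour < end_hour:
        return GO
    return DONT


def not_held(held):
    """Hypothetical manual-hold dependency: DONT whenever the feed is held."""
    return DONT if held else GO


def should_fire(results):
    """The action kicks off only when every dependency returned GO."""
    return all(result == GO for result in results)


# 1 March 2005 was a Tuesday; Monday is 0 in datetime.weekday(), so Tuesday is 1.
tuesday_5pm = datetime(2005, 3, 1, 17, 0)
tuesday_9pm = datetime(2005, 3, 1, 21, 0)

inside = [in_window(tuesday_5pm, 1, 16, 20), not_held(False)]
outside = [in_window(tuesday_9pm, 1, 16, 20), not_held(False)]
held = [in_window(tuesday_5pm, 1, 16, 20), not_held(True)]
```

In this sketch a scheduler would re-evaluate the list on each pass (every five minutes, in Julia's recollection) and fire the action the first time `should_fire` is true.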

Kevin: Okay. Oh interesting. So, sorry, just backing up a sec because I'm trying to make this a general audience podcast.

Julia: Yes. Yeah.

Kevin: And so okay, so also, can you say the name of the company? Or is that not something you want to—

Julia: Sure. Yeah, I'm not sure what the, what the, this was Upromise Investments.

Kevin: Oh, okay.

Julia: Which was, which, I guess, if we're giving the full context, was being spun off from the Upromise that, if people know the name from seeing it in like supermarket aisles and things like that, you know, on grocery carts and everywhere.

So this engine had been put together in the early days of Upromise to handle all their transactions around, you know, tracking customer loyalty stuff, and they basically just took it, you know, intact, copied it over to a different stack of boxes, and said, here's your automation engine for running investments.

I was the second engineer they hired to do this.

Kevin: Got it. Okay. On the Upromise Investments side.

Julia: Yes. And I was just out of college. This was, like I said, this is my first real job out of college. My degree is in women's studies.

Kevin: Okay. Great.

Julia: And I haven't, I never actually in my college career took a CS class. But like ever since I was, I don't know, eight, nine, 10, I was like messing around with computers and, you know, hacked autoexec.bat and config.sys to bits to get my little like 386 to play Wing Commander or whatever it was.

Kevin: MS-DOS. Yeah. Okay. Oh, yeah. Yeah. I remember those days.

Julia: Remember when Windows was just an app that you ran on DOS?

Kevin: Yes, I do.

Julia: Those were the days.

Kevin: I remember running out of IRQs. Don't miss them.

Julia: Yeah, having to manually configure the sound cards and everything else, IRQ, DMA, and oh, God, why do I still remember this stuff?

Kevin: Right? So, okay. So you were self-taught. I mean, one of my first bosses and one of, a great engineer I know was like a Russian studies major. And like in that era, it was like so many of us were self-taught. CS was a piece of the puzzle, but you weren't going to necessarily learn software engineering from that either. And so the only way you really learned that was apprenticeship and trial and error.

Julia: Basically.

Kevin: So, yeah. Okay. So you're fresh out of school with all of this practical hands-on experience. You're like, I can, I can, I got this.

Julia: Well it was great because when I was hired the person who was my manager initially, in the interview process... And I... I remember this clear as day. He was like, talk me through a troubleshooting example you had.

Because I had mentioned that, I mean, basically my most recent job at that point was, I had been a user assistant in college, so you know, basically tech support for fellow students. And so dealing with everything from like I can't print, to anything and everything else.

I mean, the one, the example I gave was a person who came in and her desktop was, it was basically complaining it wasn't finding the OS. I used to remember the specific—

Kevin: "No boot disk found."

Julia: Yes, that was it.

Kevin: Yes. Still the same error message. I got it like six months ago and I-

Julia: Some things never change.

Kevin: No.

Julia: Fortunately or unfortunately. And anyway, to skip ahead, this is not the story I'm trying to tell. Basically she'd had this machine forever and the CMOS battery had died.

Kevin: Oh, okay.

Julia: So it was a process of like through, I think three or four visits of like, oh, this happened again? That's weird.

Kevin: Oh yeah, okay.

Julia: You know, the first time, all right, well, all right, go into CMOS and then set the drive-

Kevin: Set the time.

Julia: And then go.

Kevin: Oh, there we go, yeah.

Julia: But it took a couple of times to sort of figure out what was actually going on.

Kevin: Yeah, and so for folks who have never opened up their computers before, when your computer is off and unplugged, it still needs to keep the time.

Julia: Even if it doesn't have a battery in it, like a laptop or whatever.

Kevin: Exactly, yeah. And so even in your desktop, there is a tiny little coin cell whose job is just to keep the clock running. So yeah, if the battery in your desktop dies sufficiently hard, then the BIOS will lose all its settings. So should you find yourself in that situation, go... And you have an older desktop, now you have a guess as to what's going on and could go in. And it's like a $3 part from the hardware store.

Julia: It's essentially the same kind of coin cell you think of as a watch battery or-

Kevin: Exactly. So that was the story that you told in your interview at Upromise Investments.

Julia: Yes, and the manager was impressed and said, okay, and the line that I remember that he said was basically, I don't care if you know this stuff, I can teach you this stuff. I wanna know you know how to think.

Kevin: Yes, yes.

Julia: And I've never forgotten that. And I've, it's stuck with me, from 2000, whatever, all the way through.

Kevin: Yeah, that was a lot of my early interviews too. That was kind of, I mean, that was kind of the philosophy. I was getting hired for this internship with the Russian studies guy around the same time. And that was kind of his same philosophy was not so much that you've ever done anything like that before. Cause at that point I had not had a full-time dev job.

This was my, I was a freshman and I was looking for an internship. And, but for him it was like, are you adaptable? Can you, yeah, can you learn? Can you be taught?

So he brought you on kind of as an apprentice then.

Julia: Basically, yeah.

Kevin: Cool.

Julia: And I was apprenticing under the first engineer that had been hired.

Kevin: Okay, there we go, great. Okay, he's like, I need a minion.

Julia: Basically.

Kevin: Cool, yeah. It sounds like you were racking and stacking servers at this point, I think that was what I...

Julia: Not quite. We had, so I was a feed engineer and we also had sysadmins and network admins and between those two, they generally would handle like the actual physical racking and cabling of boxes. But actually as part of that job, I did occasionally have to go into a data center and like pull backup tapes.

Kevin: Okay, yeah.

Julia: I'm speaking of stuff that...

(both laughing)

Kevin: Yes.

Julia: It was a while ago.

Kevin: So, okay, so there was an ops organization and then you were on this sort of dev side. You said, and a feed engineer, I'm guessing that that's like the financial feeds.

Julia: Yeah, so essentially, so we had all the different dependencies and everything else that lined up and then the action that would actually get kicked by Mon was just a shell script. And the shell script would have some basic environment setting stuff in it and then would run the relevant bit of Java code that Java developers had provided to us.

Kevin: So what I think I've taken from this is that Mon is basically, it's like a web service?

Julia: No, it was, web service came later. This was, I mean, I guess it had a web interface, but...

Kevin: But did you control it via the web interface or was the web interface more like about...

Julia: So web interface was mostly check on things and you could mute alerts and things like that. But actually changing the configuration, you had to do at the command line, digging through m4.

Kevin: Digging through m4, digging through the configuration files. Okay, Mon is a service then, it's got a web front end, but it's running effectively, I don't know, there's a process running on the server, which is almost like a sort of like a fancier version of Cron, which is the scheduler program, which lets you run tasks at scheduled times or scheduled intervals. So it sounds like it's a slightly fancier version of that.

Julia: Yeah, so, and like I said, we have the basic dependencies, like, is it during the stated period of the feed? Is the feed manually held? If it's been manually held, then it shouldn't run no matter what.

Kevin: So what are these feeds?

Julia: So the feeds would be, so each of the actions had a verb and the feeds all had specific verbs that were send, load, and run. So what send would do is generate a file on disk from data in the database. Load would basically do the opposite of that, take a file on disk and read it into the database and run was just some kind of daemon that would move data around within the database.

Kevin: Got it, okay, and what kinds of data are you moving around?

Julia: Basically anything and everything you would need to run a 529, so, you know, I--

Kevin: What's that?

Julia: This was, like, Sarbanes-Oxley was pretty new if I remember correctly at this point.

Kevin: Okay, yep.

Julia: Bank information, you know, names, addresses, phone numbers, the kids' names and ages.

Kevin: Oh, so this is information related to like your operations?

Julia: It's related to the customers of the 529 plan.

Kevin: So I was imagining that this was like market data feeds or something, you know, like, here's all the trades coming through on NYSE, the New York Stock Exchange, but instead it sounds like this is more sort of your backend related stuff.

Were you moving money around using the system? Like, for example, like, you know, here are all the transactions, you know, here are all the payouts that we need to make via ACH by the Automated Clearing House system. Oh, so that was one of the things.

Julia: Yeah, it was that kind of thing, you know, deposits, withdrawals, you know, rollovers from, say the Upromise loyalty accounts to the Upromise investment accounts. Tons and tons of reporting, as you might imagine.

Kevin: Yeah, yeah. Like the reporting and the compliance work is far more complicated and onerous in my experience than the actual, you know, moving money around bits.

It's like moving money is easy, convincing yourself that you haven't moved it to the wrong place and that, if you have debited it from one place, you have credited to the other place and aren't going to, you know, haven't double, you know, double counted something somewhere.

Like all of that is, that's all the work of finance. Moving money is the easy part. That's literally just like, you know, one line in a database table somewhere.

So yeah. So this is the brains of the operation. Like this thing running in Mon is like all of the business operations backend.

Julia: The business operations backend, the reporting backend in a lot of cases, and also we had a sort of sister image of Mon that was basically doing all the alerting.

So all, you know, all its, you know, is a thing running late? Is it, has it not run for whatever reason? Has something failed?

Kevin: If the nightly ACH batch doesn't go out or if it goes out, but you don't get the confirmation back in a timely fashion, like that is something that somebody needs to be paged for.

Julia: How it generally worked is if I was on call, I was getting paged and then I would end up calling, you know, Mellon Bank or SunGard or Vanguard or whoever and be like, you know, you owe us this file, you know, or, you know, hey, I know we are supposed to send you this file, but it's running late because of XYZ. You could expect to receive it, whenever.

Kevin: So these are like the banks who you're working with, who you have your accounts with. So you have their like operations people on speed dial.

Julia: Yeah. I became, I wouldn't say friends but certainly like first name colleagues with a number of people, especially, when I started I was working from 3p to 11p. So, like the overnight person, I got to know pretty well.

Kevin: Oh fascinating. Okay.

Julia: The three to 11 thing was basically because all of the day-end transactions had to be in by I think, the nightly cycle, as it was called, started at like eight, eight-thirty something like that. And that was basically to do all of the processing to then have trades go into the market first thing the next morning.

Kevin: This explains to me a little bit, you know, so I am kind of a night owl, maybe very a night owl, and I will often like, realize that I need to move some money around or whatever, log into a financial account at like 11p or maybe 1a or maybe later.

And I will, less of these days, but still on a somewhat regular basis, encounter a "we're sorry, our systems are down between, you know, 10 p.m. and 6 a.m. Please come back at a more reasonable hour." And I'm always annoyed by that.

Not least because I, you know, my background is at places like Akamai where like we can't just be like, "We're very sorry, the internet is off between..."

Julia: We're just gonna turn the web off for a few hours.

Kevin: Turn the web off for a few hours, yes. Akamai, big content distribution network. We were responsible when I worked there for about a quarter of the traffic on the web. Okay, so you've got this backend system which is responsible for really all of the operations. You're on the phone on nights when the files don't come through, which are, I assume, like being FTP'd from point A to point B.

Julia: FTP, SFTP, FTPS. SCP was still like...

Kevin: A new thing.

Julia: I don't want to say new and scary, but it was definitely like it was just starting to be a thing.

Kevin: Yep, yep, yep. These are all methods for moving files around computers. FTP is maybe the oldest one. It's not very secure. In fact, it's not secure at all. It's also kind of a bad design in a lot of ways. And so we've been trying to push people to things that have security baked in like SCP and SFTP, but it's been a long road. I think the financial industry has finally all moved over to at least SFTP.

Julia: Working in finance, which I am now again, feels decidedly different than it did lo those many years ago. Largely along those kinds of lines where regulation and oversight and not only needing to be like, "Yes, we've checked XYZ," but proving that it would be at worst, highly improbable, if not impossible to do XYZ bad thing.

Kevin: All right, so we've got a back-end operations system that's responsible for moving hundreds of thousands of dollars, millions of dollars around.

Julia: Yeah, it became millions pretty quickly. Actually, what am I saying? It became billions pretty quickly with millions of customers during the time I was there.

Kevin: Great. Okay. And it's running on some servers that the ops folks have put together. And when you're making changes to how stuff runs, are you like... How much of a change management process is there?

Actually, I'm getting ahead of myself. I'm getting ahead of you.

So what was the first thing you noticed?

We've got this context. We know where we are. What was the first thing you noticed?

Julia: So one of the first sort of projects or tasks, as you'd say, I was given as a new hire, was basically... So yeah, we copied everything over from the Upromise Loyalty stuff. There's a bunch of stuff in here that got copied over that we don't need. I think there's various things about checking validity of grocery shopper cards or whatever, and like, "Okay, the investment side doesn't need to deal with any of that."

Kevin: Okay. Because I've forgotten now, what was Upromise Loyalty? That was the original Upromise, and I forget what the business model was.

Julia: So the business model was you sign up for a Upromise account and register like your credit cards, debit card, etc. and then when you buy Coca-Cola, or Huggies, or some name brand item, some percentage, you know, two to five, or ten, or whatever, depending on whatever the customer wanted to pay some percentage of that money would go into a savings account which was not a 529, but was just, and then eventually you could roll that over into the 529 when that became a thing.

Kevin: Oh, that's right. So Upromise was the first sort of like, save the change program?

Julia: Not save the change in terms of like, you know, we're gonna like, you know, if you've spent $3.68, we're gonna add the extra 32 cents. This was, you know, just a fixed percentage of however much you'd spend on a grocery bill.

Kevin: Oh, and so there, you needed like, people's bank feeds to come in so you could figure out how much...

Julia: That was the, that's why the card registration and such had to happen.

Kevin: Got it. I remember seeing that, I remember being like, under no circumstances am I giving you all like my entire financial information. And like, how are you all making money on the back end?

Julia: Some of it was information, I believe, I remember hearing like full cart data or full cart something being talked about a lot. So basically, like not only are you buying Coke, but you're buying, you know, all these other products.

Kevin: Good data to tell [marketers that] people who shop at Target also shop at, I don't know... Then the 529 savings plan stuff was spinning out of that, were these separate legal entities or

Julia: It became a separate- It became a separate legal entity, but eventually it got spun off and bought by Sallie Mae.

Kevin: Oh, interesting.

Julia: Yeah, in the tail, the last couple of my years there... Not a fan. But anyway,

Kevin: So you've been given this system, which is designed around a very different sort of, or a fairly different process, and need to go like cut out the stuff that doesn't apply to the 529s.

Julia: So I have all these scripts and I'm pretty, you know, I feel pretty confident that, you know, none of the, I don't know, let's say 20 of them are used in the configuration. So they, they should be good to just get moved.

Kevin: They should be safe to, should be safe to prune. Okay, should be safe to move out. Okay.

Julia: Yes. Yes. 'Should' being of course the operative word.

Kevin: Right. Yes.

Julia: So like I was trying to be, you know, you talk about the change control. I mean, there was no formal change control by any means. But I was trying to be very, very conscious about it. Okay, here's the list I've got, you know, I've run the test a couple times and it's still giving the same values. You know, so here's the list, you know, can... and I'm putting this in a ticket that everybody can read and okay, here's the list, please.

Can somebody who, you know, has more familiarity with the inner workings of the system than I do, double check this to make sure I'm not doing anything colossally stupid. I phrased it a little better at the time, but you get the point. That was certainly the intent.

Kevin: The sentiment. Yes.

Julia: So you know, a couple days go by, and I don't hear anything, and I, at one point I poke the first engineer who was hired, who became my boss, like, hey, you know, I think we're good to move these things out. But, you know, I just want to confirm you looked at the list, right? This is all good?

It's like, yeah, it's fine. Don't worry about it. Go ahead.

And so again, I'm trying to be like super, super careful. Like, okay, I'm gonna, you know, I think I was like, oh, it's four o'clock today. I'm going to move these things out. And you know, and then this was like two hours in advance or however long. Like, I was trying to be super careful.

Kevin: Okay, yeah, yeah. But also this is on the running system. This is like, there's only one system.

Julia: There's no other way to do it. I mean, because the files are all there on, I mean, we had a staging box and a production box. And I mean, this was all, you know, we were basically running this whole thing on I think a grand total of four Solaris boxes when I started. Eventually got upgraded to Linux.

Kevin: Okay, ooh, yes. Well, and you're interacting with, you know, third party systems and a lot of them, it sounds like and you're, you know, orchestrating them and you can write a test suite for that up to a point. But ultimately, like, the only fully complete and correct model of, you know, the BNY Mellon's servers is BNY Mellon's servers.

So like, you are gonna find some things out face first, whether you want to or not. And there's only so much you can test ahead of time.

So okay, so it's 4pm. The daily— the nightly batch job is going to start running in a few hours, and you're sitting at the console logged in, you're like, all right, I can do this, I'm gonna...

Julia: Yeah, so this was, yeah, I was still working days at that point, to get up to speed, because I was going to do it at four and then the idea was, I, I'd be leaving, you know, an hour or so after that.

Essentially, the order of operations, as I understood it, is, move the files, wait to see if anything explodes. And then go home.

Kevin: Okay.

Julia: So, I moved the files. I put in the, I put in the ticket, like, I'm moving the files right now, I moved the files, I put in the ticket, I just moved the files. I'm watching my inbox to see if, you know, alerts start going off or anything else. And everything's quiet, like, okay, cool.

This is, this.. seemed good. I mean, I was expecting precisely nothing to happen anyway, and it looks like nothing happened, so, great.

Kevin: Okay, yeah, great.

Julia: So, you know, I go home, whatever, I go to bed, I get up the next morning, and I take a look at my phone, or, I don't remember, whatever device I was using at the time to access work mail. And I notice that basically everything's on fire.

Kevin: Okay. Ah.

Julia: So I log on from home and I'm like, "Oh shit, what happened?" You know, "What did I do? What possibly is going on?"

Kevin: And when you say that everything's on fire, it's like your inbox is full of people being like, "Hey, this thing didn't run last night. Why didn't this run last night?"

Julia: Yes. Some of that and tons of messages from my fellow ops people basically like, "It looks like none of the overnight stuff ran. What happened? We need to run all of it now manually, step by step, because it's all way out of the normal period."

So by "on fire" I mean, I remember, I think, just about the entire ops team was actively working this, like, running, like, okay, "I've run this, okay, wait for that to finish, okay now you run this," it was like a Tiger Team, all-hands-on-deck kind of thing.

And so I log on and think, you know, and think, "Oh my god, what did I do?"

Kevin: "What did I do?" Yeah. Yeah. "Fuck. This is my fault."

Julia: Yeah exactly. And I was, I talked with one of the senior people who was doing most of the firefighting, and was like, "I don't understand what happened, it looked like everything was fine, I gave it time, I looked, nothing was happening," and basically there was one tiny shell script that if I remember correctly was called 'task'.

Kevin: Okay. Oof.

Julia: 'Task' consisted, if I remember correctly, of a single line that was, I think it was "echo $*", so it just would repeat out whatever command line arguments you'd passed it and just spit that out to standard out or the command line or what have you.

And so, you know, I remember looking at this at the time because I went through and looked at like the ones I'm moving and like, you know, this, okay, this is a grocery thing that does... we obviously don't need that, it looks like we're not using this anymore. Looks like, okay, this, I don't... this obviously isn't anything.

Kevin: What is this doing? Somebody left in some debug code maybe, or like...

Julia: Yeah, you know, I figured it was, yeah, something like that or like a test action or something that was, you know, from ages ago, because it was, I'm trying to remember if it was dated, however old, but I don't actually remember that for sure.

But anyway, it turns out, that that file is basically what makes Mon actually do anything, in the way we'd hacked it to hell and back. Because like I said it's not normally supposed to do a thing, it normally would just like send an email, you know, or like display on the web thing, you know, this thing is now, you know, red as opposed to green or whatever.

So what was happening is, the dependencies were going along the merry way. But whenever it tried to fire an action, it would try to fire it, and it would fail. And it failed silently.

And so it looked like everything was fine. But nothing was actually happening.
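The failure mode here, an action that fails without anyone noticing, can be sketched in a few lines of Python. This is an illustration only: the episode does not describe how Mon actually invoked 'task', so the wrapper call, the `task.py` stand-in, and the function names below are all assumptions. The sketch contrasts a runner that ignores the outcome with one that checks the exit status.

```python
import os
import subprocess
import sys
import tempfile


def fire_silently(wrapper, args):
    """Run the wrapper and ignore the outcome: the failure mode from the story."""
    subprocess.run([sys.executable, wrapper, *args], capture_output=True)
    return "kicked"  # reported whether or not anything actually ran


def fire_checked(wrapper, args):
    """Same call, but a missing or failing wrapper raises instead of passing quietly."""
    result = subprocess.run(
        [sys.executable, wrapper, *args], capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(f"action failed with exit status {result.returncode}")
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the 'task' script: a rough analogue of `echo $*`.
    wrapper = os.path.join(tmp, "task.py")
    with open(wrapper, "w") as handle:
        handle.write("import sys\nprint(' '.join(sys.argv[1:]))\n")

    before = fire_checked(wrapper, ["load", "feed_x"])

    os.remove(wrapper)  # the cleanup: the wrapper is moved out of the way

    after_silent = fire_silently(wrapper, ["load", "feed_x"])
    try:
        fire_checked(wrapper, ["load", "feed_x"])
        after_checked = "ok"
    except RuntimeError:
        after_checked = "failed loudly"
```

With the wrapper gone, the unchecked runner still reports that it kicked the action, which matches what Julia saw at 4pm: quiet dashboards and no alerts. The checked variant would have surfaced the problem on the first attempt, provided the alerting path did not depend on the same missing file.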

Kevin: Nothing was actually happening. Okay.

Julia: And, remember I said this was, Mon was using it?

Kevin: Yeah.

Julia: Yeah, I basically took out production and our production monitoring in one fell swoop.

Because those were each Mon instances using a shared filesystem so that we could monitor like where files are and things like that. Also because like one box had the ability to touch the database, the other one was the one that would touch external customer systems, so we kept them separate, but, you know, there was still means for accessing stuff.

Kevin: So these two Mon instances are running on two separate physical servers. One of them is responsible for doing the work and the other is responsible for sending out notifications, like paging people if the files don't come through or whatever, but they're both using a shared file system on the backend for all of the tasks.

And because this 'task' shell script is critical path in some way to everything that Mon is doing, by moving this out of the way, neither of these Mon instances can now do anything.

Julia: Yep, they're basically just spinning their wheels.

Kevin: Dead in the water. Okay, okay. They're running through the whole dependency graph, like just fine, to get to the end of it, like, "yep, did it all, did it all, all completed successfully, boss."

Julia: "Okay, I'm gonna kick this thing now. Wiff. Okay, I kicked it!"

Kevin: Great, okay. Everything, from its perspective, everything looks fine.

But, okay. And I'm sure the engineer who came up with this, who you may never have met and who might not have worked at the company at the time felt very satisfied about this clever hack they had come up with.

Julia: Yeah, I'm sure it was so, so clever.

Anyway, so we ended up cleaning everything up. I'm sure there were, I'm sure there were as-of trades that needed to happen.

So 'as-of' being trades that need to happen after the fact and be dated "as of" they'd happened on the original date. And it's generally a bad idea if you're an investment company to make too many of those.

Sometimes you can make a little money if, depending on how the market goes, sort of like a short situation. But you don't want to be relying on that. And in general, doing too many of those is a really bad look.

Kevin: It suggests that you don't have your computer systems working properly, which—

Julia: In this case we did not!

Kevin: And so I assume that the step two, maybe, was to move that 'task' file back in place once you've, like.

Julia: Yes.

Kevin: Also was the 'task' file a thing that you figured out? Or was that something that one of the ops people, like, how did stuff get traced back down to this point?

Julia: So if I remember correctly, I was talking with one of the more senior engineers who was in the process of doing all the firefighting and somehow, you know, credit to him, he somehow managed to simultaneously continue firefighting and also talk me down from like, "Holy shit, I'm obviously going to get fired."

Kevin: Yeah. Okay, good. Good for him.

Julia: Give up my lease now, because I'm obviously not going to be able to, you know, whatever.

Luke Henkins. If you're, if you're, if you're out there, you're a good dude. But yeah, he's also the one who told me basically, you know, everybody gets one.

Kevin: Right. Yes. Yeah. Like you're, it's gonna happen at some point. So just, yeah, it's about how you respond to it.

Julia: Yeah. And I, and I ended up, in the course of that job, doing a lot of that, basically, that all-hands-on-deck duty, I ended up doing that many, many more times, you know, in smaller or larger parts, because, you know, various bits would be delayed by whatever.

Kevin: You became Luke Henkins.

Julia: Yes. Yeah, definitely. I was, I became, like I said, I was there for six, six and a half years. By the end, I was absolutely, I knew that thing, that system inside and out, I had personally redone, like I did the holiday stuff.

It was a horrific experience. But I, you know, between what I learned in the moment, and certainly, you know, learned from seeing how other people reacted to it. Definitely served me well in that job and since.

Kevin: Yeah, absolutely. So, like all of the stuff had to be run by hand. Did that mean that like ops engineers were going in and like running the scripts by hand?

Julia: Yup. Yeah, basically going in and running individual actions to kick things through faster than Mon would have, if it was going to run it at all.

Kevin: Okay, because I don't know about your systems, but I'm imagining some of the like, big data systems I've worked with where we're, you know, pulling every day, you know, the day's snapshot of, you know, like downloading a CSV from somewhere, you know, for that day and, you know, loading it into a database.

But you have to like, the automated system has, you know, more or less ability for you to tell it to actually go rerun that day with that day's argument. And sometimes you just have to go in, you know, log into the box and like run what the monitoring system would have run with the arguments that you want it to have.

Because that is just faster and easier than trying to convince the monitoring system that let's just pretend that it's yesterday, and then...

Julia: Yeah, a lot of that.
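
[Editor's note: the manual rerun Kevin describes is easier when each job takes its business date as an explicit argument instead of reading the clock. What follows is a minimal Python sketch of that idea, not the system from the story. The `--run-date` flag, the `snapshot-YYYY-MM-DD.csv` naming, and both function names are hypothetical.]

```python
import argparse
from datetime import date
from typing import Optional, Sequence


def parse_run_date(
    argv: Optional[Sequence[str]] = None,
    today: Optional[date] = None,
) -> date:
    """Return the date the job should process.

    --run-date (a hypothetical flag) lets an operator replay a previous day
    by hand; when it is omitted, the job falls back to today's date.
    """
    parser = argparse.ArgumentParser(description="Load one day's snapshot.")
    parser.add_argument(
        "--run-date",
        type=date.fromisoformat,  # expects YYYY-MM-DD
        default=None,
        help="date to process, e.g. 2024-01-15 (defaults to today)",
    )
    args = parser.parse_args(argv)
    if args.run_date is not None:
        return args.run_date
    return today if today is not None else date.today()


def snapshot_name(run_date: date) -> str:
    """Build the (hypothetical) input file name for a given run date."""
    return f"snapshot-{run_date.isoformat()}.csv"
```

[With a job shaped like this, "pretend it's yesterday" becomes passing `--run-date` with yesterday's date on the command line, and the scheduler and the human at the console run exactly the same code path.]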

Kevin: Okay, yeah. So you go run a bunch of stuff by hand. So you're kind of just like acting as the Mon system, but you know, puppeteering the scripts that the Mon system would have run. And then in parallel, you're sort of tracing down like, okay, why was nothing running? Eventually, somebody tracks it back to this 'task' file.

Were you able to get everything going for the next day before that day's stuff had to go out?

Julia: Yeah, I think if I remember correctly, I think it was done, we were sort of caught up by, I think it was mid, late morning. If I remember correctly, we pushed through and, yeah, because I eventually drove in for like the second half of the day or whatever. But I remember like, I basically jumped on from home like, "Ahh, I killed everything."

Kevin: Got it. Okay. Are you on a phone bridge? Or are you on a...?

Julia: Yeah, yeah, phone bridge. And there was— no, it wasn't... Oh, yeah, we actually had an internal IRC server.

Kevin: Oh, great, okay.

Julia: That was nice. That was being used for a lot of the team communication.

Kevin: It was what everybody used before Zoom, or before Slack rather.

Julia: Slack, yeah.

Kevin: I don't know, like, an incident that is resolved by midday, especially an incident of that magnitude, like where you're at least into the sort of cleanup phase rather than still sort of like actively firefighting.

I mean, it sounds like it was a big deal because it was the company's entire operations. But still, that says good things about the ops team and the incident process and the, like, culture there that like, it was all-hands-on-deck, everybody got stuff back, and then you were back to like, more or less normal operations by the end of the day. That's, that's impressive.

Julia: Yeah, I was, I've, I remember being very impressed with the ops team when I first joined. And then, you know, I, I feel like I eventually rose to the occasion.

Kevin: Did the company do any kind of review? Like, what was sort of the sort of follow on to that?

Julia: So we, the most I remember was basically getting, you know, the folks who had been on the phone bridge and whatever, like in a room to just sort of like, let's make sure we have, you know, dotted our t's, crossed our i's, etc.

You know, so I mean, because obviously the 'task' script had been moved back into place, like as soon as people realized that was the issue.

And then it was just a matter of like, okay, how did we get here? What could have been done differently? Maybe what should have been done differently? How could we have caught this earlier, etc.

So, you know, I remember one of my after-action items was, I put a like, five or six line comment at the top of this 'task' file basically saying, do not touch this. Do not move it. It works. It does exactly what it's supposed to do. If you don't touch it, everything is fine. Or something to that effect.

Kevin: Documenting some of the intent behind it, like, the work that it does. Yeah. Yeah.

Julia: We did something to sort of further split out production and alerting so that it would be a lot harder to take both of them out at once. And it might, it might have just been like separating the file system or something like that. I don't remember offhand.

Kevin: Yeah, definitely. Redundant systems are only as redundant as their most common shared state.

Julia: Single point of failure.

Kevin: Yeah, yeah, exactly. So, okay. So splitting out the backend so that that was, yeah, less fate sharing there.

Cool. Okay. Any other sort of changes that were made as a result?

Julia: Oh, actually, well, I mean, this had kind of been an ongoing process, but we definitely made sure we had, I mean, not just general documentation on like, how does everything fit together, but like, very specifically, like, here's what has to happen in the nightly cycle, step by step, here are the specific actions that need to happen, you know, with notes about like, it needs to happen within this time or what have you.

So making sure that it was as clear as possible. That's, you know, when I talk about, I, my colleagues will know this, I always talk about like, you know, write something like you would have, like, you would, would want it to be written if you had to debug it at three in the morning, six months from now.

Kevin: Yep, yep. Yeah, exactly.

Julia: Because I've done that. So many times in this job.

Kevin: Because you will. Because you will. Yes. And you will thank you of the past, or hate you of the past, in direct proportion to, you know...

Julia: Eh, six of one...

Kevin: [laughs] Yeah. Cool.

Julia: Yeah. So that was definitely, that was another skill I learned was basically like, make sure you have, you know, enough of the, the overall system flow in your head that you sort of understand where like your piece fits in and how it impacts other stuff or vice versa.

So that, you know, you can sort of let, you know, you can kind of line the dominoes up again, as it were, after everything's gotten blown down.

Kevin: Right. We're always like, evolving our model of the systems we operate and interact with. And those systems are also themselves always evolving. And so keeping ourselves in sync with them is a key part of the work.

And when things break, that is the opportu-, you know, that is a sign that some, you know, our mental model of the system and the system itself are out of sync.

And... the, it sounds like folks were pretty like, I don't know, what we would these days call blameless in the retrospective. Like, there wasn't a lot of finger-pointing or like, yeah, it sounds like they, the organization handled that pretty well.

Julia: Yeah, it was, it was very much the, you know, okay, we know what happened. It's, it's documented basically exactly what had happened. You know, and it was essentially, you know, well, Julia, you know, you, you're obviously trying to be extremely, you obviously were being extremely conscientious about this whole thing. But, you know, stuff happens anyway. And it was, and again, it was very much the like, you know, everybody gets one.

So, you know, I know that I was never going to touch that file. I mean, apart from adding a comment, I was never going to touch it again.

Kevin: Right. And so, in fact, you didn't make that mistake again. You know.

Julia: Yeah. I didn't, and, I mean, nobody else did at least while I was there.

Kevin: Sounds like the organization also took its own steps. It wasn't just you who were responsible for the cleanup, but some of the rest of the organization took on work too, to, like, split out the file system backends.

Julia: It was a good group of people to work with. And many of them I've kept in touch with in the intervening years. And actually, some of them have— we've followed from one job to another.

Kevin: Oh, yeah, for sure. Yeah, cool.

Was there anything else that stood out to you about the experience or that you feel like you learned, you took away to later work?

Julia: I'd say probably the big one is it felt like a very clear object lesson of the importance of both positive and negative testing.

Kevin: Mmm, okay. How so?

Julia: So not just testing that nothing explodes, but making sure that everything is still happening as it's supposed to.

Kevin: Actually working, okay.

Julia: Yeah.

Kevin: Maybe sticking around a little, maybe staying a little bit late to like watch the nightly batch job starting.

Julia: Yeah. I mean, or even just double checking like, okay, this action has run, did it actually do anything?

Kevin: Oh, yeah.

Julia: You know, if I had, it would have been possible to detect that a lot earlier than, "Holy crap, the nightly cycle didn't run."
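
[Editor's note: one way to sketch the "positive" half of that lesson in Python is to check that a job left fresh output behind, instead of trusting a clean exit alone. This is an illustration under assumptions, not what the team ran. The function names, the "suspicious" status, and the premise that every action writes a file are all hypothetical.]

```python
from pathlib import Path


def output_is_fresh(path: Path, started_at: float, min_bytes: int = 1) -> bool:
    """Positive check: the expected output exists, is at least min_bytes
    long, and was modified at or after the time the job started."""
    if not path.is_file():
        return False
    stat = path.stat()
    return stat.st_size >= min_bytes and stat.st_mtime >= started_at


def verify_run(exit_code: int, output: Path, started_at: float) -> str:
    """Combine the negative check (did it blow up?) with the positive
    check (did it actually do the work?)."""
    if exit_code != 0:
        return "failed"      # something exploded
    if not output_is_fresh(output, started_at):
        return "suspicious"  # exited cleanly but left no fresh output
    return "ok"
```

[A job that "succeeds" without producing anything lands in the "suspicious" bucket rather than showing up green, which is the gap between "nothing exploded" and "everything is still happening as it's supposed to."]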

Kevin: Okay, yeah.

Julia: But, oh well.

Kevin: Because I guess if you're watching the web interface, it's just like, "Well, green across the board."

Julia: Yeah, that's green. Occasionally stuff is going green and then turning, you know, whatever.

Kevin: I feel like that's a danger of dashboards, is that we believe them, and they are only ever, you know, still only ever a representation of the world, and not the world itself.

Julia: That's surprisingly relevant to a lot of what I do, a lot of what my teams currently do, is we provide internal visibility with dashboards to the various other teams to check on. A lot of it is job status and, you know, and things like that, but also...

Kevin: Well, yes, a Kanban board is also only a representation of the world and not the world itself. So, yes.

Julia: Yeah. Or a Scrum board or... Yeah, all of that.

Kevin: Yeah. It's green across the board, boss.

Julia: Yeah. Like, okay, you know, we... Our velocity went up by 10 points. It's like, okay, did you actually... Is that good? Did you give us more code? Did you give us better code? Is there something I can show the stakeholders?

Kevin: Any parting thoughts?

Julia: Computers are deterministic except when they're not.

Kevin: Except when they're not. Yes. Yes. Goodness. That is becoming an emerging theme of this podcast.

Julia: Well, yeah.

Kevin: Where can people find you online?

Julia: Most... L-E-D-I-V-A. Most places I am that or something close to that. I'm mostly on, like... I split my time kind of mostly between Bluesky and Mastodon.

Kevin: I feel like that's what we're all... Or at least that's definitely what I'm doing too. And I like have different conversations in each place, but I'm starting to have good conversations in both places, which...

Julia: Yeah. It's interesting. There's definitely different audiences and different sort of...

Kevin: Different communities.

Julia, this has been really lovely. Thank you so much for sharing your story with us.

And, yeah, this has been the War Stories Podcast on Critical Point.

I am Kevin Riggle. This is Julia Lunetta.

Thank you so much for listening, folks, watching, and we will see you next time.

Julia: Bye!

Kevin: Thank you all so much for watching and listening.

You hear it a lot if you're watching this on YouTube, but if you are, please like the episode and subscribe down below. It really does help more people like you who enjoy content like this to find out about it.

And also, there's another episode I'm really excited about. A good friend of mine from way back who has an incident to tell us about from his time at Twitter.

I spent months trying to get him on the podcast. We finally made it happen. I'm super excited to bring him and his story to you.

There's not a lot of hot gossip in the episode, but there's a little hot gossip in the episode. So get subscribed and you won't miss it.

If you've encountered a truly incredible hack in the wild, like the one that Julia tripped over, tell me about it in the comments.

And if you're watching this on YouTube, but you'd also like to listen to the audio version of the podcast, we're now available on all the major podcast platforms. You can find links as well as full edited transcripts at warstories.criticalpoint.tv.

And if you aren't watching this on YouTube, you're missing out because I'm starting to post outtakes as well as exclusive YouTube only videos on the Critical Point channel. So maybe check out the channel and get subscribed there too.

If you have an incident story you'd like to tell, please email us at hello at complexsystems.group. And I know I say it every time, but every time it really is true, that goes especially if you aren't a cis white dude like me because obviously, Julia here, people who aren't cis white dudes have great stories to tell. And I know I'm asking for a favor when I say that I particularly ask you to reach out, but do please reach out. I want to hear from you.

Intro and outro music is Senpai Funk by Paul T. Starr.

You can find me on Twitter as @kevinriggle, on Mastodon at @kevinriggle@ioc.exchange, and I've added BlueSky at kevinriggle.bsky.social.

My consulting company Complex Systems Group is on the web at complexsystems.group.

And with that folks, til next time.
