Inside Infra: Chris Thistlethwaite
"Inside Infra" is a new interview series with members of the ASF Infrastructure team. The series opens with an interview with Chris Thistlethwaite, who shares his experience with Sally Khudairi, ASF VP Marketing & Publicity.
Let’s start with you telling us your name --how is it pronounced?
It’s “Chris Thistle-wait” --I don’t correct people who say “thistle-th-wait”-- that’s also correct, but our branch of the family doesn’t pronounce the second “th”.
What’s your handle if people are trying to find you? I know you’re "christ" (pronounced "Chris T") on the internal ASF Slack channel.
Yeah --anything ASF-related is all under "christ".
Do people call you "Christ"?
They do! I first started in IT around Christmastime and was doing desktop support and office-type IT. When people started putting in tickets, and my username was "christ" there, they were asking "why is Christ logging into my computer right now?" and it became a thing. When I was hired at the ASF I told Greg (Stein; ASF Infrastructure Administrator) that story, and he said "you gotta go with that for your Apache username."
When and how did you get involved with the ASF?
A long time ago I started getting into Linux and Open Source, and naturally progressed to httpd (Apache HTTP Server). Truth be told, that's where it started and stopped, but I've always been interested in Open Source and working with projects and within communities. Three years ago I was looking for a new job and stumbled across the Infra blog post about a job opening. I fired off an email to VP Infrastructure and that's how everything started. The ramp-up of the job was diving deep into everything there is with the ASF and Open Source --which I'm still doing. I don't think I've found the bottom of the ASF yet.
How long have you been a member of the Infrastructure team?
This November will be my fourth year.
What are you responsible for in ASF Infrastructure?
Infrastructure has a whole bunch of different services that are used by both Apache projects and the Foundation itself: the Infrastructure team builds, monitors, supports, and keeps all of those things running. Anything from Jenkins to mailing lists to Git and SVN repositories; on the back end, we keep everything working for the Foundation itself within, say, SVN or mailing lists, keeping archives of those things and keeping the standard security and permissions set up and split out. Anyone you ask on the Infra team will say: "I do everything!" It's hard to explain --it's quite possibly a little bit of everything that has anything to do with technology, as broad as it can possibly be.
So you really have to be a jack-of-all-trades. Do you have a specialty, or does everybody literally do everything?
Everyone on the team generally does everything --for the most part any one of us can jump into the role of anyone else on the team. Everyone has deep knowledge of a particular service or a handful of services that they'll take care of --like, Gavin (McDonald; ASF Infrastructure team member) knows more about Jenkins and the buildbot and build services than most people on the team. At any given point we're on call and need to be able to fix something or take a look at something, so everyone needs to be versed enough in how to troubleshoot Jenkins. That can also be said not just for the services that we offer, but also for parts of technology, like MySQL or Postgres or our mail system or DNS: we have actual physical hardware in some places, and we have VMs everywhere too, so sometimes we're troubleshooting a bad backplane on a server or why a VM is acting the way it is. There's a very broad knowledge base that all of us have, but there are specifics that some people know more about than others.
How does ASF Infrastructure differ from other organizations?
There are a lot of similarities but a ton of differences. A big part of how Infra is different is, to use a "Sally-ism": if you look at it on paper, it wouldn't work --I've heard you describe the ASF that way. If you explained the way things work at the Foundation to somebody, they would literally think that you're making it up and there's no way that it could possibly work the way that it does. There's a lot of that with the Infrastructure team too. I keep in contact with many people I've worked with over the years, going back to my first job where we would buy servers, unbox them, rack them, wire them up, set them up, and run them from the office next door to us. I'd be impressed whenever I had 25 servers running in our little "data center" at that job, and now I talk to these guys about what we do at the ASF: we have 200 servers in 10+ different data centers, we're vendor-agnostic, and we make it all work. They ask: "how the heck do you do that?!" We just do --it's an interesting thing as to how it all works together, because we solve problems that others have as well, but their problems are often centralized to one thing, or a data center that they control and own, or one cloud provider that they control and own, where they deal with a single vendor and possibly at most have to talk with the same vendor in two different geographical areas. We're dealing with stuff at one cloud vendor that's a VM, and other stuff on the other side of the world that's actual hardware running in a co-location facility or data center, and the only thing that makes them the same is that they're on the Internet.
It's a good summation of the team, too: we're all based in locations around the world, we're not all in one spot doing something.
Describe your typical workday. Since you're all working on different things on such a huge scale, what's it like to be you?
"It's amazing" [laughs]. Everyone on the team generally has some project or projects that they are working on --long-running things for Infra.
I'm currently working on rewriting a script for Apache ID creations. The process is: you put your ICLA in and send it off to the Secretary, the Secretary says, "OK good," puts in all your data, and that gets put into a file in SVN ...currently, we have a script that we manually run that does a bunch of checks on the account and whatnot, then creates it, sends off a welcome email, whatever. I'm rewriting that because it's an old script in several different languages --it's actually six scripts that all run off of one script. I'm consolidating that into one massive script in a language we support, and then moving it forward into something that we could potentially automate, versus me having to run a script manually a couple of times a day.
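To make the shape of that consolidation concrete, here is a minimal, purely illustrative Python sketch --not the actual ASF tooling-- of what a single-entry-point account script might look like: validate the requested ID, provision the account, and send the welcome mail from one place instead of six chained scripts. Every name, check, and address here is hypothetical.

```python
# Illustrative sketch only -- not the real Apache ID creation script.
import re
import smtplib
from email.message import EmailMessage

RESERVED_IDS = {"root", "admin", "apache"}   # hypothetical deny-list

def validate_request(uid, existing_ids):
    """Run the sanity checks the old scripts did one at a time."""
    if not re.fullmatch(r"[a-z][a-z0-9]{2,}", uid):
        raise ValueError(f"{uid!r} is not a valid Apache ID")
    if uid in RESERVED_IDS or uid in existing_ids:
        raise ValueError(f"{uid!r} is reserved or already taken")

def provision_account(uid, full_name):
    # Placeholder for the real account-creation step (e.g. an LDAP entry).
    print(f"would create account for {uid} ({full_name})")

def send_welcome_mail(uid, email):
    msg = EmailMessage()
    msg["Subject"] = f"Welcome to the ASF, {uid}"
    msg["From"] = "accounts@example.org"     # placeholder address
    msg["To"] = email
    msg.set_content("Your Apache account has been created.")
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)

def create_account(uid, full_name, email, existing_ids):
    """One entry point that replaces a chain of separate scripts."""
    validate_request(uid, existing_ids)
    provision_account(uid, full_name)
    send_welcome_mail(uid, email)
```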
Fluxo (the ID/handle for Apache Infra team member Chris Lambertus) was working on some mail archive stuff in our mail servers. Gavin (Apache Infra team member Gavin McDonald) is working on some actual build stuff. Everyone has kind of "one-two punch" tasks that they work on during the day, and then the rest of the time is (Jira) tickets or staying on top of Slack, if people are asking questions in the Infra channel or in our team channel or something like that. The rest of it is bouncing around inside the ASF and checking things out, or finding out new projects to work on, or ways to improve such-and-such process.
How many requests does Infra usually receive a day, in general?
Over the past three years, we've resolved an average of 6 Jira tickets a day, year-round. We've had 213 commits to puppet repositories in the last 30 days. We handle thousands of messages on our #asfinfra Slack channel, and have had 659 different email topics in the last year.
Dovetailing that, how do you keep your workload organized?
Everyone on the team does it their own personal way. I have a whiteboard and a Todoist list. We also have Jira to keep our actual tickets prioritized and running. We have a weekly team meeting/call where we talk about things that are going on, and that's the more social aspect of what we do week-to-week.
How do you get things done? You're juggling a lot of requests --what's the structure of the team? How do you prioritize when things are coming in? Is there a go-to person for certain things? If you're sharing everything, how do you balance it and who structures it? How does that work?
On one end, the funnel to us starts with Greg and David (ASF Infrastructure Administrator Greg Stein and VP Infrastructure David Nalley). It's different from other places that I've worked, where I'm on a team of other systems administrator/engineering people and we have a singular, customer-facing site. Someone says, "Hey, this should be blue instead of red," there's a ticket, we make the change, and then it goes to production.
There are many different ways to get hold of the Infrastructure team. Everyone gets emails about Jira tickets and gets updated as soon as one of those comes in. If it's something that you know about --say, the Windows nodes that we handle-- those all fall into my wheelhouse because I'm the last one to have worked with Windows extensively. Everyone else knows how to work with them, but it makes more sense for me to pick it up in some cases.
Most of the stuff in Jira is very "break-fix" kinds of things. A lot of the requests on Slack are too, for example: "DNS is busted," and we fix DNS. It's a very quick, conversational, "Let me go change that," or, "I'm going to go fix that real quick." Of course, some of the Jira tickets are very long-running, but the end result is they're fixing something that used to work.
We were originally running git.apache.org, and Git WIP, so we hosted our own internal Git servers and we would mirror those out read-only to GitHub. Somewhere along the line, Humbedooh (the ID/handle for Apache Infrastructure team member Daniel Gruno) started writing Gitbox, or building Gitbox, based on the need to have writable GitHub repositories. He built Gitbox, set it up with the help of some other people on the team, got it going, and that became our replacement for git.apache.org. While we still host our own Git repositories, people are free to either write to ours or write to GitHub, and the changes are instantaneously mirrored between the two.
We had Git hosted at the ASF, and had GitHub as a read-only resource. The need arose to have write access on both sides: Humbedooh went and built out MATT (Merge All The Things), which does all of the sync between GitHub and our Git instance.
MATT started a while ago, and Humbedooh added on to that to do the read/write to GitHub. Basically what all that does is: once your Apache ID is created, or if you have one already, you go on ID.apache.org and add your GitHub username in there, and then MATT --there's another part of that called Grouper-- MATT and/or Grouper will run periodically, pull data from our LDAP system and say, "Oh, ChrisT at apache.org has ChrisT as his GitHub ID. I'll pull those down." It says, "ChrisT is in the Infrastructure group. Hey look, there's an Infrastructure group in GitHub. I'll give ChrisT write access to the GitHub project." In a nutshell, that's what it does.
There's a ton of other housekeeping too: if you get removed from an LDAP group ...we run LDAP and keep all this stuff straight. If you get removed from the Infrastructure group in LDAP, then MATT/Grouper will go and say, "Oh, this person's not in this LDAP group but they do have access in GitHub. Let me pull that so that they don't have access to that anymore." It does housekeeping of everything as well as additions to groups and that kind of thing. There's a ton of technical backend to that, and that's what Humbedooh's doing.
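As a rough illustration of the sync described above --a simplified sketch, not the real MATT/Grouper code-- the core of such a tool is a set comparison between LDAP group membership and GitHub team membership, keyed by the GitHub usernames people register at ID.apache.org. The group names and IDs below are hypothetical.

```python
# Illustrative sketch only -- a simplified take on the LDAP-to-GitHub sync
# described above, not the real MATT/Grouper implementation.

def plan_team_sync(ldap_members, github_members, github_id_map):
    """Return (GitHub users to add, GitHub users to remove) for one group."""
    # Only people who registered a GitHub username can be mapped at all.
    desired = {github_id_map[uid] for uid in ldap_members if uid in github_id_map}
    to_add = desired - github_members
    to_remove = github_members - desired   # housekeeping: revoke stale access
    return to_add, to_remove

if __name__ == "__main__":
    ldap_infra = {"christ", "humbedooh", "gmcdonald"}        # hypothetical LDAP group
    github_team = {"Humbedooh", "former-member"}             # hypothetical GitHub team
    gh_ids = {"christ": "ChrisT", "humbedooh": "Humbedooh"}  # registered GitHub IDs
    add, remove = plan_team_sync(ldap_infra, github_team, gh_ids)
    print("grant write access to:", add)     # {'ChrisT'}
    print("revoke access from:", remove)     # {'former-member'}
```

Running the diff both ways is what gives the housekeeping behavior Chris describes: people dropped from the LDAP group lose GitHub access on the next periodic run.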
At first, when Git and GitHub were set up, it was fine: the ASF has to keep a canonical record of everything that goes into each project, so you could only write to our Git repos. Then it was conveniently mirrored out to GitHub, because there are a lot of tools that GitHub has that we didn't have or weren't prepared to set up, and GitHub has a very familiar way of doing things for a lot of developers. Once writable GitHub came along with Gitbox and the changes to MATT, that opened up a whole other world of tools for people on projects to use. If they wanted to use pull requests on GitHub, they could start using pull requests on GitHub to manage code. They could wire up their build systems to GitHub with Jenkins so that whenever a PR was submitted and approved, it would kick off a build in Jenkins and go through unit tests and do all the lovely things that Jenkins does.
It was really an evolution of, "Here's the service that we have. Someone, somewhere, be it Infrastructure or otherwise, once they have writable GitHub access, here we go." And here's the swath of things that now opens up to projects inside the ASF: a project could set up with us and never, ever actually commit code directly to the ASF --it would always go to GitHub but still be safe and saved on our own Git servers for ASF purposes.
At the same point, we saw a need and said, "Let's build this out and go." Another funnel that comes into us is when we're on-call, something breaks and we ask, "Why do we do it this way? We should be doing it a different way." We then come up with a project to fix that or build it. It's a very interesting process of how work gets into the Infrastructure team.
It's been an interesting ride with that one.
There's always stuff that we're working on and fixing and making better. For the most part, Gitbox as it is now is kind of in a state of "it's getting worked on". If there are bugs that need fixing, they get fixed, but I don't know what the next feature request is on Gitbox. There's talk of other services ...like GitLab. If someone wanted to write code and put it in GitLab as opposed to GitHub, then someone would need to come in and write the connector from Gitbox to GitLab. So it's possible. I don't know if that's necessarily an Infrastructure need as much as it is a volunteer need for Infra. But it's a system that can be hooked up to any other Git service as long as someone goes in and writes that.
You brought up an interesting point here, which is volunteers. Do volunteers contribute to Infra also?
We sometimes have volunteers, yes. We have a lot of people on the infra mailing lists that will bounce ideas back to us or they'll work on a ticket or put in a pull request.
Well, the need is not as critical because you have a paid team, versus Apache projects.
Right. That's exactly true. There's a bit of a wall that we have to have because we work with Foundation data, which not everyone has access to. Granted, we're a non-profit, Open Source company and everything's out there to begin with, but usernames and passwords for databases and things that we have encrypted that the team has access to aren't necessarily something that you would want any volunteer to have access to.
How do you stay ahead of demand? This is a really interesting thing because part of it is you're saying, "Necessity is the mother of invention." You guys are doing stuff because you've got those binary, "break-fix" types of scenarios. In an ideal situation, do you even have enough runway to be able to optimize your processes? How do you have the opportunity to fix things and improve things as you're going along if you're firefighting pretty much all day long?
That's a really good question about how our workflow works. In other companies that I've been in, there are the operations people doing the "break-fix" and the development people doing "the next big thing". The break-fix folks are spinning the plates and keeping them spinning without breaking, and that's a lot of firefighting. That's literally all that job is. Even when you're not firefighting, you're sitting around thinking about firefighting, in the sense of, "when is this going to fall over again? If it does fall over, what can we do to fix it so it doesn't do that anymore?" And in the past, the break-fix guys, the firefighters, would end up saying, "Hey, there's this thing that needs fixing." And it would go over the wall to the developers. They would develop the fix for it, then it would go back into production, and the cycle continues.
To some extent, that's kind of where DevOps came from: if you merge the two of those together, then while you're firefighting you can also write the fix for the problem, and you don't have to wait for the lag between the two. We don't have that split here. Everyone on the team is firefighting with one hand and typing out the solution with the other. And a lot of the time our project work --like getting a new mail server spun up, or my task to rewrite the workflow for new Apache ID creations-- keeps getting pushed back. I've been working on that for a very long time because it keeps falling off ...it gets put on the back burner while we're like, "Hey, we found out that our TLP servers are getting hammered with downloads from apps and people trying to use them instead of the mirror servers." So, let's set up downloads.apache.org and funnel stuff over to that, so that that server can get hammered and do whatever it needs to do, and our www. site and all the Apache project websites stay up and running in a more reliable way.
What's the size of the teams that you were dealing with before that had a firefighting team and a dev team versus ASF infra?
The last "big" corporate job I had was ...six ops people that kept the site going, four database people, another eight technical operations-type people ...all told, it was about thirty.
There were technically thirty firefighting people, and we had a NOC (network operations center) that was literally people who only watched dashboards and watched for alerts. Whenever those went off, they'd call the firefighting people. The NOC was another 20 people. And then the development teams were ...twenty to fifty people.
What kind of consumer base were they accommodating? Does it match the volume that ASF has? Was it more of a direct, enterprise type of, "We have a customer that's paying, we have to respond to them" situation? Or is it different?
This was at a financial services company that transacted on their Website: completely different from the type of stuff we're dealing with here at the ASF. Volume-wise, they were much smaller, but it was much more ...visible, as their big times were at the start of market and end of market. After end-of-market came all the processing for the day to get done before markets started the next day. The site had to be up 100% of the time. We had SLAs of five minutes. If you got paged or something broke, you had to get the page and respond to it in a way of, "Hey, this is what's going on and these are the people that I need involved with it," all within five minutes of it going off. That was the way the management structure was. It was intense.
In scale, Apache probably does way more: we do way more traffic across all of our services on any given day. If someone doesn't get mail for a little bit, then they come and tell us, or we get alerted of it by our systems, and we go and fix it and take care of it. But with the financial services group, people were losing money: dealing with people and money is just a very stressful situation for anyone working in technology, because you have to get it right and it has to be done as fast as possible, before someone's kids can't go to college anymore. It was a completely different minefield to navigate.
The type of stress that's involved or the type of demand or the pressure is different, but you also have the responsibility with ASF that systems have to be up and running. I understand it's not mission critical if something goes down for more than five minutes, which is different in the financial sector, but do you feel that same type of pressure? Is it there or is it completely different for you?
No. I think I do because we also have SLAs here: they're just not five minutes. We have structure around that and the way that we handle uptime and that kind of thing. I get very attached to the technology that I'm working with and the communities that I'm working with, so if a server goes down or a site's acting wonky, I take that very personally. That reflects on how I do my job. If a server's not working or if something's broken either because of me or something externally that's going on, I want to get that up and running as fast as possible because that's how I would expect anyone to work in a field that has ...any technology field, for that matter. And generally, that's the same attitude the rest of the team has as well.
How has ASF Infra changed over the years?
It's matured quite a bit. When I first started, it was Gavin, Fluxo, Humbedooh, Pono (former ASF Infrastructure team member Daniel Takamori), and me. There were five of us. The amount of stuff that we got done, I'm like, "Man, there's no way that five people can do this."
That's kind of what I'm pointing at. If you're a team of eight or five or twelve or whatever, compared to the other thing that you did with the other job that had maybe a core team of twenty, thirty --that in itself is insane.
We were five people, everything was very, "Here's the shiny thing we're working on," and then something else would come up and we'd have to jump on that. Then something else would come up and we'd have to jump on that. We were very ...I don't want to say we were stretched thin, but there wasn't necessarily ...time for improvement.
There was a lot of stuff we still had on physical hardware, and a couple of vendors that we no longer use. But things were moving more towards a configuration-based infrastructure with Puppet, instead of one person building a machine, setting up all the configs themselves, installing everything and then letting it go off into the ether to run and do its job. We were moving everything towards Puppet, to where you configure Puppet to configure the server. Then if the server breaks, or goes down or goes away, or we need to move vendors or whatever, all you need to do is spin up a new server somewhere else and give it the Puppet config; it configures itself and then goes off into the ether to run and do whatever it needs to do.
That's great. More automation.
Right. We were automating a lot more stuff right when I first started. Over the course of the next year, the team kind of ebbed and flowed a little bit until we got to eight in the last year. We started to get to the point of "where can we point the gun next? What can we target next to get it taken care of and done?" That's where we started taking on more specific Infra projects, for instance, mail. Our mail server has been around since the dawn of time, and it's virtualized so it moves servers every now and then, but the same base of it is quite old by technology standards.
Fluxo started moving this on to newer stuff and he got that going. We started taking care of projects that were not broken, but needed to be worked on. Instead of waiting for it to break, we're fixing and upgrading and moving down that path versus firefighting, break-fix, that kind of thing. We were moving more towards, "Hey, I see a problem. I have time. I'm going to take care of that and make that into a more serviceable system."
Automation has helped quite a bit with that. I also think that as the team grew, it got to a point where tickets, emails, and chat were getting responded to quicker. And then we also could focus more on the tools that we use for the Foundation. Like, HipChat was going away. We needed a new chat platform, so we chose Slack. And then we updated and moved everything over to Slack, and that's where we are with that. It started coming into its own, with workflows of, "Oh, okay. How do we get this done? Let's go do that."
What areas are you experiencing your biggest growth? Is it a technical area? Like, "Hey, all of a sudden mail's out of control"? Or, "Hey, we need to satiate the demand for more virtual machines," or is it a geographic influence that's coming in in terms of draw? Where are you guys pointing all your guns to?
Currently we're trying to get out to the projects more and talk to people more often. Not that we didn't do that before: at ApacheCons and any Meetups that we had, Infra would always have a table. We were always accessible, but we were always passively accessible. We weren't really going out and talking to projects proactively to say, "Hey. What do you guys need from us? What are we doing with this?" So that's one part of it that I think we're moving towards a little bit. It's not at all technical, but more of a foundation-broadening, community-broadening thing that we're doing.
That's one part of it. The other thing that we're doing, from a more technical or infrastructure standpoint, is really trying to get our arms around all of the services we provide, and then really take a look at those and ask: how is this used inside the ASF? How is it used in the industry as a whole? Do we need to put more time and energy towards those things in order to make the offerings of the Infrastructure team a more solid platform? On top of any other automation and that kind of stuff, those are really the two spots where I see Infra growing a lot in the next year-ish: really boiling down our services to, "Hey, we've seen a lot of people using this. And a lot more projects are using this. It's not just a flash in the pan. We need to build out more infra around blah service, so let's really do that and make that a solid platform to use."
What do you think people would be surprised to know about ASF Infra? When you tell someone something about your job and they go, "Whoa, I had no idea" or, "That's crazy." What would people be surprised to know?
That Apache has an infrastructure team. [laughs]
Why are you saying that?
Because honestly, I don't think a lot of people know about the Infrastructure team. Those that do have used us for something or other, talked to us about something, or worked with us on something. Those that don't are like, "Oh, I didn't know the ASF paid people to be here" --that kind of thing. Those are kind of the two reactions I've gotten from people. It's like, "Oh, that's cool. You work for the Infrastructure team." Shrug. And then the other people are like, "Oh, sweet. Yeah, that's great. I know Gav. I've worked with him on blah, blah, blah." But that's not necessarily surprising. I mean, it is in a sort of way.
When people ask, "What are you doing for work?" and you say you work for ASF, do people even know what that is? Do they know what you're doing? Do they care? Are they like, "Oh, okay. Whatever"?
There are literally three types of people that I've run into who ask, "Oh, what are you doing for work?" One is the person that has no idea what the ASF is, not even the vaguest hint of Apache, and they're like, "Oh, okay. That's cool." There's the next person that does, and may or may not know about the ASF but knows of Apache, the Web server, or some other lineage of that. They're like, "Oh, whoa. That's super cool. It's impressive." That's wild. Then the third type of person asks, "Why are 'Indians'/Native Americans running software? That doesn't make any sense to me" and "Are you on a reserve?" I swear to God I've gotten that question before. I don't even know how to answer that. I'm like, "No, buddy."
Are these technologists or are these just guys off the street? Are they in the industry?
Guys off the street. I say Apache Software Foundation, and to them, "Apache" and "software" don't make sense together. Actually, I've gotten mean tweets too whenever I've been tweeting about being at ApacheCon. Things like I'm "taking away" from Native Americans and whatever...
We also get that on Twitter, on the Foundation side: we get included in tweets about some kind of violation, along the lines of "Stand up for the ..." I get it. From time to time we also get sent these "How dare you?" letters, that sort of thing. It's an interesting challenge, the whole "why do Native Americans run this thing?" misinterpretation.
Let’s move on. What's your favorite part of your job?
The whole job is my favorite part of the job.
That's funny because everyone at Infra ... You know how people have bad days or may be grumpy or whatever, in general you guys seem to all like each other. You all have a great camaraderie. You all get along. You work really closely together. It's a very interesting thing to see from the outside. Is that true? Or are you just playing it up? Does it really work that way?
That's absolutely true. I've found that generally speaking, when you get a bunch of nerds together, they either really like each other and everything works or they really don't like each other and nothing gets done. The team is great, and it's like no other team I've ever worked with before. But it's very odd because you go through the interview process, and the interviews are interviews. I mean, you get to know people in interviews, but not really. Then you start working with people, and at some point you start getting below the surface. And at some point you get deep enough to where you find out whether or not ...how you gel with all these people.
It's very odd that all of us have the same general sense of humor. We'll talk about food non-stop in the channel, and recipes and cooking, and different beers or different whatevers. It's nice to get to the point with a team where you're comfortable enough with everybody to ...like I said, I've been here three years and there is still so much that I don't know, both technical and non-technical, about the ASF. I ask very dumb questions in channel and say, "I have no idea why this is doing this this way," or, "Can someone else take a look?" or, "I don't know what I'm doing here." And never in the entire time I've been here, from day one until now, has anyone ever chastised me for not knowing something or said anything about the way that I work. Well, at least not in channel. At least not publicly.
Everyone's very supportive. It doesn't matter if you know everything there possibly is to know about one singular product or thing you're working on, or don't know anything about it. You can ask questions and really learn about why it was done the way it was done, or figure out how to fix a problem --no problem on the team. It's just like, "Okay, yeah. This is what you have to do." Or, "Here's a document. Read up on it." Or, "I don't know either." And then out of that comes an hour of conversation, and then a document pops out, and the next person that asks, we can say, "Here, go read the doc." Yeah. I mean, we're all very happy. Very happy.
Which is really good. Looking back when you first started, what was your biggest challenge when you came onto the team?
Oh man. I look back at that and I feel like the learning curve was ...it wasn't a curve. It was a wall. I'd used Linux --I'd used Ubuntu for a while and various other flavors of Debian and whatnot-- so getting spun up on all of that, expanding my Linux knowledge, was a big deal, as was learning everything about the ASF and how it works. Which I'm still trying to figure out. If you know, send me something to read to figure out how that all works. I mean, I don't want to sound like I was completely out of my depth and had no idea what I was doing, but I feel like I was completely out of my depth and had no idea what I was doing.
There's a lot about the ASF that is just tribal knowledge, and there's a lot about Infra that's tribal knowledge. It's just that no one has anything written down --"the server's been running under Jim's desk for the last 15 years in a basement that has battery backups and redundant Internet, so it's never gone down. But don't ever touch that server, because if it goes down, then all of our mail goes down" or whatever. There was a lot of figuring all that out for myself and digging around. Which is, frankly, one of the parts that I really enjoy: "Hey, this thing broke. I've no idea what that thing is. I've no idea where it lives," and just diving in and trying to figure out what's going on with it and how it's built, and the hair trigger that sets it off to crash and never work again. Yeah. That's an interesting question too.
What are you most proud of in your Infra career to date? You're talking about overcoming these challenges, I'm always curious just to see what people are like, "Yeah, I'm patting myself on the back for that one" or, "Ta-da. That's my ta-da moment."
I did lightning talks at ApacheCon Las Vegas and didn't get a phone call from you when I was done. [laughs]
I wasn't at lightning talks --what did you say? What would make me call you?
I didn't say it. We were on stage --John (former ASF Infrastructure team member John Andrunas), Drew (ASF Infrastructure team member Drew Foulks), and I-- and we figured we'd do lightning talks: "Hey, we're the new guys: ask us infrastructure questions." A week or two before ApacheCon, there was a massive outage at a particular vendor. It wasn't "Oh, our server's down for a while" --the server went down and then it was *gone*. It got erased from the vendor side. I can't remember what service it was. There was something that disappeared two weeks before Vegas and never came back.
It wasn't just us, though: tons of companies had this issue. So we're on stage answering questions, and someone asks where this service went: "What happened to XYZ?" And John has the mic and he goes, "You should probably go ask [vendor name]." At that point it was very widely published that the vendor's response was, "Whoops, someone tripped over the cord that powered the data center. And when it came back up, it deleted all of your VMs." They totally acknowledged it, and they didn't give refunds for it, so it was a little bit of a PR kerfuffle for them. The vendor is in the other room handing out buttons and stickers, and John was like, "Oh yeah, go ask the [vendor] guys what happened to your server. That's their fault." He said it jokingly, but my jaw dropped.
[laughs] No one told me this story. No one said anything. Someone's trying to protect you. I had no idea this happened ...oh my gosh.
Well, David Nalley was in the back of the room, and he's screaming with his hands cupped around his mouth, "Don't badmouth the vendor and the sponsors." I deflected and quickly moved onto something else. [laughs]
But yes, that's another good question that I haven't actually reflected on. Looking back and seeing where Infra was when I first started and where it is now: it was a very well-run, very good team then, and it's a very well-run, very good team now. I feel like a lot of the work that I've done, and a lot of the work that the team has done over the last three years, has been getting from a spot of "everything's on fire, who's holding up what this weekend?" to things being stable and us nitpicking over whether or not something needs to be updated. That's huge. That's a big step, like going from starting a company and treading water to being profitable and having resources to do other things versus just keeping your employees paid. It's a big step for a company and it's a big step for Infrastructure.
I love your talking about how you guys are tightly-knit and all that. How would your co-workers describe you?
The other odd part about that too is being completely remote and not having day-to-day, face-to-face interactions with people. You get a very odd sense of people through text for a 24-hour period that you're online reading stuff. It's a different perspective than if I was in the office every day, working on something and interacting with people. Even though every day, except for the weekends, I'm online talking to these guys and doing stuff. How would they describe me? Dashingly good looking and ... I don't know. [laughs]
I know that Infra's "just Infra," right --you guys are all under the Infra umbrella. Do you have a title? When you got hired, what do they call you?
We're all systems administrators. The only person that actually has a title is Greg, and he's Infrastructure Administrator.
What are the biggest threats you face? For infra folks or systems administrators or infrastructure administrators even, what do you need to watch out for these days? What's big in the industry? Is everyone saying, "Oh, XYZ's coming"? In terms of your role in the job: is there something that you need to keep your eye on? Is there something that you would advise other people, "If you're in this job keep an eye out for blah, this is a new threat" or anything along those lines?
General scope stuff. 16 years ago, everything was hardware: you bought hardware and you had to physically put it somewhere. And virtual machines came along about the same time. People were starting to do virtual stuff to where you could have a physical machine and then multiple machines running on that, sharing resources. Then cloud and infrastructure as a service, and everything's been moving more and more towards that over the years.
Of course, there are still people that work in office IT, doing desk support stuff or office infrastructure-type things. Those are still a majority of how things run at companies. But as everything has moved more towards the cloud or hosted services, systems administrators are becoming more like software engineers, and software engineers are becoming more like systems administrators. They're kind of melding into one big group of people. Now of course, there are still people that only write software. But gone are the days where someone would write some code and say, "I need to deploy it and get it out to all these computers." They would write the code and hand it off to a systems person. Systems would go and configure it on whatever server to get it out to however many machines, hit the button, and go. The software developer never really needed to know the hardware specifics of the systems it was going to run on. And the systems people never really needed to know what software packages it was put together from. There are exceptions to that, but for the most part ...
Over the years, it's gotten to a point where the software developer knows exactly what systems this is going to run on and how it's going to run there, so it's more efficient, things work better, and they're releasing less buggy code because they're closer to the hardware. And the systems people want to troubleshoot it more and work with it and fix problems because they're closer to the software and know more about its internal workings and how it's going to run on systems. Everything is getting chunked down more and more: first it was VMs, then cloud, then containers with Docker and things like that, and it's going to keep getting virtualized down like that. So it's worth knowing about Docker orchestration and things like Kubernetes and Apache Mesos. The reality is people run Kubernetes, people run Docker, people run everything. That's the interesting thing in terms of how we do it at the ASF: we don't require folks to do just one thing.
In terms of where the industry's going ...everything's getting pushed down to "a developer can work in a container on a set of systems, write software for that and then deploy that to a machine themselves, never involving a systems engineer at all, and build a product using that." It's getting stuff out the door faster, and it's also chasing what's been the unicorn of the industry for a while: even today, I developed this thing and it works on my machine, but if I move it over to another computer, it stops working. Why? What's the problem with that? Containers fix that problem. The container you run on my system runs the same way as it does on every system everywhere. It takes the "works on my machine" thing out of the equation.
What's your greatest piece of advice? What would you tell aspiring sysadmins?
Part of the ASF is the community behind it, and a giant part of that is what makes it work. I mean, you could say all of it: that's what makes everything work with this. When I first started out as a sysadmin, I didn't get into Meetups and Linux Users Groups and any of that stuff. I didn't get into the network. I didn't go into the community that I had around me. And honestly, I don't know if that's because it didn't exist or because I didn't know about it or what, but now that I'm older and wiser, the community part of it is really ...there's a massive benefit to that. Aside from socialization, or networking and how to get a better job through networking, getting together with like-minded people and talking through your problems is an amazing tool to use. And I didn't do that enough when I was a sysadmin starting out. Looking back, something I sort of regret not doing was really sharing knowledge with other people in the community and building a group of people that I could bounce ideas off of, or help with other ideas, or share in the knowledge of, "Hey, this is what's going on in the industry" or, "Hey, I saw this at work the other day. How do we work around that?" --that kind of thing. It's much easier these days with social media: the never-ending amounts of social media. But it's a big, important part of my day-to-day now, that I wish I'd had 16 years ago.
That's powerful. OK, If you had a magic wand, what would you see happen with ASF infra?
If I had a magic wand, I'd update our mail server instantly or maybe magic wand a few other projects.
Wait. I know you're joking, but what is the problem with the mail server?
It's running on an older version of FreeBSD that doesn't play well with our current tools. Some form of that server has been upgraded, patched, moved, migrated, etc. for the last 20 years. We want to bring it up to more modern standards. Mail runs fine for the most part, but it's probably the most critical service we have at the ASF, and we want to make sure everything continues to hum along. Because of that, it's a huge project that touches a ton of different parts of our infrastructure.
How big is it?
It's all of our email. Every email that goes through an apache.org address.
This is a huge project and Chris (Lambertus) has been working on it for a while --it's not a simple thing to fix. It's very, very complicated. We couldn’t do it without him.
Back to the magic wand thing: I'd wish for more wands.
Chris is based in Pennsylvania on UTC -4. His favorite thing to eat during the workday is chicken ramen.
# # #