Inside Infra: Greg Stein – Part II

The "Inside Infra" interview continues with ASF Infrastructure Administrator Greg Stein, who shares his experience with Sally Khudairi, ASF VP Marketing & Publicity.

"Who are these crazy guys spread around the world that are keeping 200 machines up and running for all these different projects and committers and contributors?"

PART TWO.

How or what would you describe the Infra "brand" to be?

I don't really know. I've never really thought about branding or marketing ourselves, so ...

Well, you guys have a certain persona, you have those funky t-shirts you wear at ApacheCon ... there's definitely some kind of street cred that's different from everybody else. I was curious to see if that's part of your natural sense of hip, or something that you guys deliberately planned for.

The t-shirts and other things go back to the team bonding kind of thing. We'll give ourselves an identity, but haven't tried to create or market ourselves. I think it is something that we do need to take some control over. We hired a part-time writer in December and he's been organizing our content to provide a better and more useful front to Infrastructure.

There were a lot of pages on www.apache.org that have now moved over to infra.apache.org. That creates a more coherent Web space, if you will. We can really talk about those different channels. "How do you reach Infrastructure? Do I go to the Slack channel or do I file a JIRA ticket: how do I decide?" So he's helping to, while I wouldn't say "market a new face", he's certainly helping people figure out who we are, what we do, what we can help with and getting that information organized.

Which is good. That's new. Even to have you guys featured in a project like this, it's unusual and it's refreshing. I'm personally curious, and I'm sure other people are also curious about what's behind Infra.

Right, right. Who are these crazy guys spread around the world that are keeping 200 machines up and running for all these different projects and committers and contributors?

So Andrew (technical writer Andrew Wetmore) is primarily going to work on the infrastructure docs until those are whipped into shape because a lot of the material that we have, a lot of the Webpages, is really infrastructure related. He has been working with the team on those pages. What's going to be harder though is when he's kind of at a stopping point for that, what to turn his focus to, and that would be www.apache. But then it gets a lot more difficult because when he wants to update the How It Works page, who does he talk to? Who's authoritative? He can do some edits for flow and word consistency, punctuation, clarity, right, but he can't really update the process.

Right. Right. That's the Foundation thing.

Yeah. But the problem is we don't really even have a concept of who's in charge of that How It Works page, who is, you know, it's just there's nobody that the foundation is willing to say, "That person controls that process." You know what I mean?

I totally do -- I come across the same pages and people go, "Are they yours?" It's hard to determine not only evolving processes, but who signs off on this or who gets it. I hear you.

I've recommended for the past year, or three, that Marketing is the owner of DubDubDub (www.), but you know, that's the "face" of Apache. You know? But the raw content, as you point out, who approves the raw content?

One thing that I asked Drew and Chris, and I'm always curious with people who are super busy and juggling 50 things, is to describe a typical workday for you.

I wake up, I look for email first, generally, sometimes I'll hop onto Slack because sometimes people ask me directly for something. Then I go look at email and sort through a number of different categories between direct team stuff, operations, the Apache Board, and then Apache in general. And then of course, if there's any vendor email to deal with. So there's a bunch of different categories in priority order. After I get through that initial work, then it's go and read all the back scroll in the team channel, which is anywhere from 200 to 400 lines of back scroll ...

Can you get any work done? Beyond just catching up on the communications?

Yes. But it does take like 30 minutes to read that back scroll. For me there's a lot in there about what the guys are doing and what they're working on, how to solve a particular problem when they're asking somebody else, "Hey, can you look at this? Can you help me with this?" But I don't, for the most part, "serve", you know ... they are the technical staff ... I can do it: I have technical chops, but I let them do their jobs as they know best. I do like reading the back scroll because I'm also looking at it from the angle of "how is the team working together? Is that going well? Is there something that I need to poke and prod to improve how they're working? Are they getting jammed up on something that I can unblock for them so that they can get their work done?"

Stuff like that. That's what I look for when I go through that back scrolling, so it's important to me to read that back scroll. Most of the guys do tend to, when they first sign in in the morning, go back and scan for stuff where they might be needed. I've never really asked them how detailed they get, but I think pretty much everybody reads all of it to catch up, but they're going to be looking at it with a different lens than how I look at it. Mostly I'm looking at unblocking -- are they running into problems that I can ease for them?

How do you keep your workload organized?

I don't.

Fair enough. Again, there's a lot, so it's curious to me, like everything at Apache, with the exception of a handful of things, everything could be a priority, if you're always on fire and always running around, putting out fires, you know? It's funny when I've talked to the Infra guys and you also, you all have the same reaction to that question, which is the laugh. I think that's the nature of the beast with the ASF.

Yes. That really is the nature of system administration work. My career has been product development, and you can reasonably plot that out. You can say, "We're going to develop these five new features, which is going to take us between two and four months." We'll see ... we might cut a feature to try and limit our development time. The features are going to change, so we'll plan in time for change. But system administration is very reactive, so it's a very different beast. This is where, like I said, we were kind of treading water with four people, but we could see as Apache was growing we were not going to be able to keep up. And we certainly weren't going to be able to move ahead of the curve and do things like selfserve.apache.org where, you know, before we would get a dozen tickets to create repositories and that took time. Now we don't have to do anything.

It's all selfserve.apache.org, but we had to write the tool first and have enough air time to get that tool written. So I think we're ahead of the curve. We're getting some of our longer-term initiatives done, but it is still a very reactive thing. For myself, my back office work is pretty straightforward and it's a lot of email and Website work, you know, going in, paying an invoice, putting in the infrastructure credit card, sending out a purchase order, stuff like verifying and approving payroll, that doesn't require me sitting down and writing Python scripts.

The other half of my job is being present on that channel because I also help to set priorities. When something comes up, I ask, "Is this a thing that we want to do? Do we want to take on this new task? Do we want to provide this new tool to the projects?" You know, like a project is going to say, "Well, we want to integrate this thing into our GitHub repository," and we go and review it. It may require permissions that we simply don't want to allow. So there are those kinds of policy things that I also help with. And there's always being present to help set policies and priorities.

OK... so how do you work with (VP Infrastructure) David Nalley? Are you making the decisions? Infra is an unusual type of group as opposed to other areas of activity operationally at the ASF. How do you work together?

Correct: I'm the day-to-day, so I look at it like he's the brains and I'm the hands. That said, he's like the strategic brain and I do all the tactical decisions.

I make all the tactical decisions. I am an officer of the corporation. I can make any decision that I need to, related to Infrastructure. If I feel it's a little bit weird, then I'll bounce that off David, but for most of the stuff, he doesn't feel a need to inject himself in. He feels comfortable letting me go ahead and run with the things, and rely upon me asking when it seems a little sketchy.

That's good: that process suits both of your personalities, both your sensibilities. It sounds like a good fit.

I report to the VP of Infrastructure, and that is still David, even though he became Executive VP and is now (ASF) President. He still holds that title. He's asked me, "Well, Greg, maybe you should just be VP Infra," and I said, "No way." Because we're paid people, but the Foundation is all volunteers. I told him I do not want to be a VP, because I want to report to a volunteer. I think that I (and the team) should report to a volunteer that always has a volunteer eye on the Foundation's long-term goals.

Because I manage all the day-to-day, it's a very lightweight hat for him. That VP hat is a tiny aspect compared to his President hat. One day, he'll find somebody to take over that VP Infra hat, but I've essentially mandated to him that it has to be a volunteer position.

It's not that I see we're going to go all out of control and we need a check from a volunteer; I just want a volunteer to always be able to say, "Okay, you guys are a little bit crazy, let's redirect our long-term thinking more in line with what the Foundation wants," and have a volunteer interpret what the Foundation wants.

That perfectly dovetails into what folks referred to in our ("Trillions and Trillions Served") documentary, where they were talking about Greg Stein's famous "plan for the ASF for 50 years..." This super long-term vision, which again, everyone goes back and says, "Greg Stein said..." What does that mean exactly, and how does that translate to Infra, considering that you can't really plan that far out? How does that work?

Well, actually we can plan that far out. I wrote that "50 years" in one of my Director's statements, I think it was 2014 or 2012 ... maybe earlier. Where I was going in that Director's statement was that the Board doesn't deal with the communities. The Board is there to support the communities. So we want the Foundation to exist for 50 years so that these communities can continue to run and see through evolution.

Some communities are going to move to the Attic, new ones are going to come along, but we want the Foundation to be viable. To say "forever" is okay. Nobody can really put that in their brain. So I just said, "OK, we can think what 50 years means." That is long enough out, but still within people's brain capacity to think, "Okay, what _does_ 50 years mean?"

And so that's where I came up with that. What does the Board need to think about to ensure that we are here 50 years from now and our projects are successful and can run through their lifecycles? Apache HTTP and Tomcat, I don't think they are ever going to go away, but you could see maybe in 30 years they might. There might be some other mechanism in computing that would obsolete them, but the model of Apache does need to exist for at least that long.

Now, within Infra, I think we actually can plan that far out because we have growth curves. We see what kinds of computing resources people need. So we can plan for project growth, for machine growth. We can do long-term planning on how we allocate machines among our various cloud resources that we have, and start to schedule those further out. None of that really affects our day to day, but it is something that we can project out a ways and think about what kinds of resources we are going to need two, three, five years from now.

There isn't anything really that we can do for 50 years, but we can keep it in mind. Okay, that is going to be a larger team. That is going to need a larger staff, a full time manager, a full time HR person, a full time... There's different things that will change over that time, but we can actually do some of that projection, although we haven't bothered.

I do the five year plans for the Board, but mostly that is a simple cost growth as opposed to actually changing the structure of the team or the role assignments, because like I said, I think within 10 years we'll probably need to add one or two more staff on top of the head count of six that we have right now. And I think supporting that would still be fine for a part-time person like myself. But once it grows to 10 or 12, then I think it's going to need a real change, where we need to have a full-time person managing, and so we'll need to adjust the budget considerably to make that happen.

But if we ever get there, the Foundation is likely going to be in a very different position. We're talking 10 years from now. And so, who knows.

So with more than 350 projects and initiatives as we've discussed before, how do you guys stay ahead of the demand? And again, if you're trying to plan for five, ten years out, you mentioned earlier cloud computing. Not so long ago, cloud computing was a novelty. How do you plan for this?

And that is where we try and move more things to selfserve.apache.org, where we look at the kinds of requests that we're getting and the kinds of tasks that we're performing, and find a way to automate that workflow and create more self-serve options for the kinds of tasks that we regularly get tickets on.

Where we used to get tickets on creating Git repositories, we get zero now. And if we can see that over the past six months we've had 20 tickets to do X, is there a way that we can automate that, so we don't have to get our hands on that ourselves and can save our hands for doing things like machine upgrades, or for rebalancing some of our computer resources, where things are running on an old operating system and we need to get that onto a newer version? Right now, all of our machines are managed by a system called Puppet, which does the basic configuration work for us. But today, we're on two different versions of Puppet, a really old one and a reasonably new one.

And we're trying to get everything migrated off the old stuff onto the new, but once we finish that migration, we're going to have to start all over again, or maybe switch to a different tool. We're looking at a tool called Ansible to use instead of Puppet.

And so there's this never-ending ongoing set of tasks, but each time we do it, it reduces our workload by that much more. So when we upgrade from Puppet 3 to Puppet 6, we get an improvement in the maintainability of that server. And that means that we spend less time with that server going forward and have more time to do other things or to deal with project growth.

Regarding a scale of efficiency, how do you close your skills gaps? When I spoke to Chris and Drew, they both said, "We do everything." How do you do that? How do you know all of this? Do you look at this big picture and say, "Okay, we need a person to specialize in X and Y and Z," and then you send them out to learn about it? How do you cope with that?

The team definitely specializes. The guys have specializations around different areas. We do a little bit of cross training, but not a lot, because as I mentioned, we've got like 200 machines, each individually doing their own thing. If we cross trained everybody in everything, we'd get nothing done. So, there's a little bit of cross training, but mostly some specialties. It does create a little bit of bus factor...

Which is very scary. I was just going to say, your bus factor is very scary. Talk about that.

The thing is that Puppet allows us to create configurations and that's in version control. If all of a sudden somebody leaves, another person can backfill them because if somebody leaves, it's not like they take their work with them: all the work is in version control. And so that work doesn't go with them, but we may need to backfill some education on that particular specialized area. For example, Chris (ASF Infra team member Chris Thistlethwaite) does a lot of our monitoring work. If he left, now we need somebody to get a little more familiar with NodePing and a little more familiar with Datadog, but that'll be like a week for somebody to pick that up.

It wouldn't be, "Oh my God, this is three years of expertise that we need to go backfill" ... we don't have anything that is that highly specialized.

Is that because the team is more well rounded or because you guys are more efficient or what about it? Because of technology evolution, or...

We don't deal with systems of that level of complexity. We've got 200 machines, like I said, each doing their thing, but it's not like we've got a cluster of 200 machines all trying to coordinate to create one particular outcome. It's, here's a MySQL server, here's a JIRA server, here's a Puppet server. Things like that, where the amount of technology is pretty small in each little pocket ... but we just have a hundred pockets on our pants.

[END OF PART II]