Inside Infra: Greg Stein – Part III

The close of the "Inside Infra" interview with ASF Infrastructure Administrator Greg Stein, who shares his experience with Sally Khudairi, ASF VP Marketing & Publicity.

"Apache is growing: we're just seeing the demand explode and it's a hard problem for us to solve."

PART THREE.

We were talking about ensuring that the team is up to speed with everything required of them...

So there certainly are skill gaps; this is one of the things I want to help motivate the team with, where if somebody says, "Hey, I want to go and investigate Ansible as a potential Puppet replacement," I say, "Go forward."

This would be similar to Google having their 20% projects. I'm sure you've heard of that.

Oh, yeah.

It's almost the same where it's not 20%, maybe 5%, but it's the same as Google, no matter what they want to tell you, because everybody's got their job and you have to be really rigorous to carve out 20% of your time. And strictly speaking, it does actually make your Google manager a little upset if you carve out the entire 20%. But anyways, the concept is similar.

So for us it’s like, "Well, go in and investigate Ansible, see if it'll work for us and put your notes into the Wiki." That's how we make forward progress, up our game, and learn new skills. If someone says, "I want to go and figure this out," the response is almost always, "Okay. You go do it." There's certainly an allowance for people to learn new skills. But most of the time we simply rely on, say, Gavin (ASF Infrastructure team member Gavin McDonald), knowing more about JIRA configuration than the other guys.

That added component of sharing what you know, and adding it to the JIRA or to the Wiki actually is great because then everyone's learning. This is like the rising tide: everybody's learning about this, whether they're doing it perfectly or not. I think this is a very interesting process.

Yes, and that's also where Andrew (technical writer Andrew Wetmore) is helping us out. He’s organizing that information that we have learned, that we have documented, that we memorialized into the Wiki.

Because our (ASF’s) legacy is quite Medusa-like over all these years, it's interesting to see how everyone can get caught up and also contribute...you have to go back and deal with the legacy, but you also have to be able to move forward. To be able to bring others with you is brilliant. That's really cool.

The infrastructure has grown organically over 25 years from when Brian Behlendorf first said, "Hey, I have this server called hyperreal.org: you can run a CVS repository on it for the Web server."

That computer was under his desk at the Wired offices way back when, wasn’t it...

Yes it was. And it's just grown organically over those 25 years. Then we had Minotaur and it did six different things ... now it only does half of one and we've moved the stuff out onto newer machines and newer processes and this and that. But the organic growth means that we've got some really hairy stuff. With our move to Puppet (first Puppet 3, and now Puppet 6), at each step we're improving it and making it less hairy and more manageable and something that somebody can come along, look at, pick up, and run with from there. That makes it a lot easier, so that we don't have to spend 100% of our time cross-training.

What are your thoughts on products and the hype cycle, where everyone's demanding Kubernetes, to use that as an example? Do you decide which products to provide support for, or is that up to Apache projects in the communities? You mentioned Ansible not too long ago; that was your internal decision to move. But I remember not long ago, GitHub entered into the landscape. How did that happen? How did you decide to make a move like that? That's a significant thing. Can you tell me a little bit about that?

It's a lot based on community input. So if we see a lot of people asking for a particular tool, we'll be like, "Oh, hey, David, can you go and take a look at that and see if that's something…" Not David (ASF VP Infrastructure David Nalley), but Chris (Infrastructure team member Chris Lambertus) or somebody else. "Can you go take a look? Is that something that we can support? Because we're getting some queries about it."

And there's a little chicken and egg problem there that if the communities don't know to ask for the egg, we don't know whether to prep the chicken. It's like, “okay, wait, they don't even know to ask for a tool because we haven't said we will make this tool available, because we're not going to make the tool available until somebody asks”. But sometimes people file tickets like, "Can I get this set up?" and we'll go, "No."

Then six months later, somebody else will file a ticket: "Can I get this set up?" and we'll go, "No." But after enough of those, we're like, "Maybe that's something that we really want to do." For GitHub specifically, that’s what happened. Well, even before that there was Git, where we ran our own Git server; it was a volunteer that made that happen. That was six years ago or so.

Well...the volunteer came along and said, "Well, I'll do this. I'm not going to take any time from Infra." There's been a couple things for the past few years where I've told people, "No, Infra will not work on that. But if you want to volunteer or find a volunteer, then we'll stand it up for testing." You know what I mean? Why not? So there's a couple things where people have stood up for test examples and there hasn't really been a lot of usage.

So, we're not going to support that. But something like Ansible is part of our own internal workflow, a tool we’ll experiment with to see if it'll improve our stuff. But from the community, they pretty much have to ask and it has to be a sustained ask. That's how we ended up with Travis CI: we actually pay for capacity in Travis CI, and that's based on community input.

So many people wanted to do their continuous integration through Travis that eventually we decided to pay for it. But it's tricky because some of these systems like Travis CI and others require certain permissions that we don't want to provide to the community. So we will want to hold those only within Infra. And so it gets hard to integrate certain tools. We've had to say no, but then again, we've found other ways to improve that so that we can lock down the permissions or use a proxy or other ways that we can route around some of these issues and then integrate the requested tool.

So further to that, have you been in a situation where a project or a community has made unreasonable demands of Infra, or had expectations so over the top or so out of scope that it totally surprised you? Have you had something like this?

Nothing surprises me.

Nothing surprises you? Okay. Have you been in this situation? Like “was never going to happen”...

Yes, yes. There's been several times where one of the guys on the team is like, "Oh man, I got this ticket. I don't know what we want to do with this. Greg, go take a look." And I go and look at it and that's where I make that call: "Okay, is the Infra team going to take this on, or do I just say ‘no’ right now?"

So, yeah, there's been a number of times where I've said no and probably two or three times where I've gotten a little bit of pushback on that no. I say, "My answer is no, but here's how you escalate." I've had escalation a few times and I'm actually mid-process: I'm dealing with one right now. So, I've said, "No. If you don't like my no, you can go to VP Infra, and VP Infra is probably going to tell you the same thing. And then you can go to the President. Right now those are actually the same person."

The same person is a double "no".

That really is the true escalation path. I have to describe that to people and say, "I don't think you're going to get what you want." If I'm the one that says no, you probably are not going to get it, because next is VP Infra and the President, and after that is the Board. They're probably not going to say, "Greg is wrong. Yes, we'll give that to you." But it's there. There's been a couple of times where I said, "No, you have to ask the Board for the budget for those additional virtual machines." They went to the Board and said, "Can we have budget for three machines?" and the Board said, "Yes."

So Infra went ahead and gave them the three VMs that they had initially requested. Strictly speaking, we would track those machines against their budget, but tracking that level of detail would cost more than the actual budget was. So we don't spend that time doing that, but I have had to say no. I have had to... There was Apache Maven: they were keeping a copy of Maven Central, and Maven Central is run by Sonatype...

Which is a commercial product...

Yes. They're using the trademark “Maven”, essentially under a licensing agreement from us, an MOU. So with Maven Central, you could imagine if someone decides to just turn it off one day... we wanted a copy. Apache Maven was making a copy of it, and it just started consuming so much disk space. We were like, "We can't support that growth rate. We can't support that even for the next six months. If you want to keep doing it, go ask the Board for money to keep doing it." They never did. We turned it off.

I wouldn't call that a ridiculous request; it was something where we didn't have to just say, "No, not going to do it. Bye." A lot of the requests are mostly just, "We aren't going to run that extra software. If you want, ask for a VM and you can run it, but we're not going to take responsibility for it."

Over the years, obviously ASF Infra has changed. Was this all reactive or was it also proactive? Do you plan for those changes as you go or has it all been in response to Project X or in response to X emergency?

The growth of Infrastructure and its movement from volunteer-only to paid staff was just part of the growth of Apache. The volunteers could no longer keep up, and things like account creation sometimes took four weeks. You’d put in a request for an account and, four weeks later, it would finally get created.

My gosh, that queue was crazy, huh?

Well, it wasn't even a long queue, it was simply that we didn't have volunteers making sure the queue stayed empty. Today it's down to one, two, maybe three days, and the account is created, because every day a staff member goes and creates the accounts first thing in the morning.

It's like how I said that my day starts with looking at messages on Slack and then reading emails to see if there's stuff to handle. Well, one of the guys on staff, the first thing he does in the morning is go and look at account creation. He's been pondering, off and on, a tool to make that easier for himself; he hasn't finished the tool, so he still has to do it manually. That's his incentive.

“Work quickly”...

This is Chris Thistlethwaite. I say, "Chris, we can do something about that." And he says, "No, no, this is still my project. And every day when I run the script, it just makes me remember, I need to finish this."

So when the volunteers could not keep up with the amount of work, that's when we hired Joe Schaefer, then we hired another person, and hired another person. And so it was just trying to keep up with the rate of requests.

That's how we ended up with hiring six people. And then I'm half a person, like I said, I'm part-time. So, it's just the growth of Apache. I think we're in much better shape than when I started. We're ahead of the curve. We can stay ahead of the curve because one of the things that I can do because I don't fight the fires every day ... that's for all the guys who know their stuff. They fight the fires and I can look at if I need to go and ask for another head count. And that's how we ended up with Andrew (technical writer Andrew Wetmore): “Well, you know, what we really need is somebody to manage all this documentation.” This was part of Sam's (former ASF President Sam Ruby), “If you had some money, what would you do with it?” That's how the technical writer/editor came around, because we've got 20 years of organic growth. We had...let's just call it “organic documentation”. That revamping project is going really well, I think.

So, in what areas are you guys experiencing your biggest growth? As I was asking Chris and Drew, is there like a geographic influence on the demand? We’ve had a huge influx of users in China. Does any of that change the way or what you guys are doing? Or is it just more of everything?

Our biggest pain point, I would say, is continuous integration/continuous delivery: CI/CD. Jenkins, Travis, CircleCI, and things like this, where people make a change and they want that change built and tested. The more projects we get and the larger the communities get, the more changes and the more testing and the more building and the more this, more, more, more. It's kind of one of those things where it's “expand-to-fit”. So if we gave people 100 machines, they'd use 100 machines. If we doubled it to 200, they'd use all 200. It's just this rapacious need for CI machines. It's very hard to figure out how to plan around that other than just telling the communities, “No: we just don't have that much capacity: if you want to build it, do it on your own machine. You just can't use Apache hardware to do it.”

That's an unsatisfactory answer. That's been one of our hard problems and it's also kind of a newer problem: the development workflow that uses CI probably is just maybe five years old. Before that, certainly, automated building and testing was a thing, but it's really kind of grown into community workflow much, much more over the past five years, and more and more people are wanting to do it. The communities are growing. Apache is growing: we're just seeing the demand explode and it's a hard problem for us to solve.

China is the one case where we see regional issues, and that's because of the Great Firewall of China. Not because we're getting more Chinese developers, but because they have problems accessing our servers, which are located outside of China, and so we're looking at CDNs, content distribution networks, to essentially make our content available closer to China. We've found that even with one of those CDN drop points in Hong Kong, they still have problems just reaching it there in Hong Kong, and so ... and we don't want to buy or lease or rent a server in China because doing business in China is too high of a hurdle for the Foundation.

Oh?

You know, Microsoft and Google have to do business in China and they've got a pack of lawyers and a giant vault of money to deal with all the barriers. The Foundation does not, so it's also a hard problem to solve. We think we might be able to do it through Microsoft Azure, since they have a CDN that resides in China and Microsoft has done all that paperwork, so we're looking at that. But as far as regional things, it's not so much that we run into issues. We see Open Source communities in Europe and Brazil and Australia and Sri Lanka: none of them really have any problems because they don't have that firewall. It's not really about the Chinese people, but about the China firewall.

That's bigger than us. And that’s not something we can fire hose.

We do see little engagement from Japan and Brazil, and that is partly for language reasons and partly because the Brazil community is more about Free Software than Open Source software.

Yeah. They're very pro-FOSS.

Not OSS. But pro-free. And so, they're going to deal with the Free Software Foundation rather than the Apache Software Foundation.

I see. That’s an important distinction.

And then you also have the Portuguese language barrier. People contributing from Europe and India, Sri Lanka, etc., they pretty much know English and that's fine. A lot of the Brazilian developers do not know English...this is the same with the Japanese Open Source developers. Japanese and Brazilian, they tend to not know English, and so that kind of isolates them from the larger Open Source world, or Free Software world, in the case of Brazil.

Would we consider localizing anything that we do, or are we going to continue as-is, as the ASF is all English?

The Infrastructure team will not translate our documents to serve those other languages. That's just too high of a bar.

There are a couple groups that have user mailing lists that are not English and that's totally fine, and Infrastructure will... well, you don't have to file a ticket anymore. It's, again, back to selfserve.apache.org: “self-serve” on Apache will create a mailing list for users communicating in Brazilian Portuguese, for example, or communicating in Japanese. But Infra doesn't do anything about that, that's just the self-serve tools. We certainly can't support non-English, and I don't think that the Foundation itself is going to make any moves towards that.

Fair enough. So a lot of companies are really struggling to accommodate their teams working from home in response to the Coronavirus and all that. These stay-at-home orders are kind of shaking companies, but from day one, the ASF has always been a virtual organization. Has anything changed with your operation on that front? Has anything impacted the ASF's day-to-day, from this pandemic?

(chuckling) Not at all. I shouldn't laugh, but no. It really hasn't changed. We've been on our team channel for all three years, three and a half years that I've been here, and the world is burning down around us, but we still sit on the team channel.

Now, that said, (Infra team member) Daniel Gruno got stranded in Canada.

Right! He’s still there?

He's still doing work from Canada. This is why when he travels to Canada for two months at a time, I don't care, you know? Because if his butt is in a chair in Denmark or in a chair in Canada, it's the same butt, so, you know...

As long as you have connectivity and a computer, you can do it.

Right. But if he has to be offline for two months, I'd say no. Or if you want unpaid time off, well, I'm not going to pay you, of course. Certainly the discussions have changed, you know? I mean, going shopping. You know, some members are immuno-compromised and that had an effect on our team meeting that we were planning in Nashville: they were the first to say, “No way. I'm not going,” so, there’s that, but our day to day hasn't changed.

That's more of a social thing versus an operational thing. Safety first.

So the notion of, “Oh, I got to run out to the grocery store. I need to strap on a mask,” changes, but not the operation.

Right. Right. So...what do you think people would be surprised to know about ASF Infra?

I don't know if it'd be surprising, but we are global. We've got four people in the United States, one in Canada, one in Denmark, one used to be in Australia, but is now in the UK, which actually kind of hurt a little bit, because in Australia, that meant that we always had somebody in that time zone, but now we have kind of this gap of Australia/Asia time zones when...

A “Gavin” gap.

Yeah, well, I might be awake at that time, but I can't go and fix a MySQL server, so it does mean that we don't have that straight-up 24-hour coverage.

The notion that we are worldwide is kind of a neat thing about our team, and is what makes us pretty unique relative to other IT departments. I don't like being called an IT department, but that is essentially what we are.

Surprise.

What's the name of that TV show? The one that's about IT...

“The IT Crowd”, is that what you’re referring to? The British show?

Yeah. So, you know, that's a funny show, but mostly when you think “IT department”, you think of some corporate people with button-up shirts, but... most of us, we're in our pajamas.

Good one. What's your favorite part of the job?

I definitely like the team and that's why, nominally I'm part-time, but I'm pretty much constantly on the team channel and interacting, and so I think I just put that down as volunteer hours, where before I might work on Apache Subversion, but now I hang out with the team or I write some little tool or something like that. That's definitely been one of the more rewarding changes. Up until I started with this, I'd been a director for 15-and-a-half years, and that was kind of how I contributed to Apache. Now my work for Infrastructure is a new way to contribute to the Foundation. I'm also part of a new community, where before I would hang out with the httpd community, APR community, the Subversion people... now it's the Infra people and my hobby time is kind of blended in with my work time, and vice versa. I mean, when your work time can also be seen as a hobby time, that's pretty cool.

I do think it's the team that makes it interesting. That's what I like the most, and that I'm working with a new, interesting community to contribute to the Foundation.

Not only did you switch roles, you switched communities. What was your biggest challenge going into this new role?

I would say probably trying to delineate what I was going to handle for the guys and that I wasn't going to tell them what to do or how to do it. It's like, “OK, I'm here to assist, to unblock things, to enable you guys, rather than to block you or micromanage you.”

To earn that trust, that I wasn't going to be some pointy-haired boss telling them how to do their work. Now, I don't know if that was ever a problem for them, but that was certainly one of my initial concerns: how to properly create my role. This was the first time Apache's even had somebody fill this role, so I also had to find the role, which is, again, why I came up with “Infrastructure Administrator”: I wanted to define it as an enabler role, as an administrator, so they could get their work done but I would not be their manager. I would not be their boss: I was simply there to enable them.

So, what are you most proud of in your Infra career to date?

Ooh. I don't know. I would say by being hands-on, being the “hands” of Infra, it means that VP Infra didn't run away screaming.

David said in January 2016, maybe earlier, he was like, “No way. I'm out.” And after I was on the job for about two months, he said, “Huh. All right.”

“I'm in!”

And so I get that feedback from him, “You know, you make the VP Infra hat quite easy for me.” I think that's probably what I really like about taking on the role, is that one of our volunteers got to stay rather than drop it because it was just causing so much anxiety and pain and time and frustration. Otherwise, most of the stuff I do is really boring. Not to me, but I don't have “accomplishments”. I push paperwork, basically, so the other guys can do accomplishments.

Speaking of the other guys, how would your co-workers describe you?

I have no idea. I don't know. I really don't know. (laughing)

Where I just got done talking about what I saw as an issue, trying to frame what my role would be, it might have been fine with them and I was overly worried about it, but it’s hard for me to know. We don't do 360 reviews in Infra, so I don't get any feedback, really, from the team on what they think about me or how I'm doing my job, so you'd have to ask them.

I have. Just kidding. So...what are the biggest “threats” that infrastructure managers or infrastructure administrators need to watch out for? What do you think is a “big thing” that people should be aware of, or is ASF so unique that you don’t feel like anyone really experiences what you experience?

There's our capacity issue with things like Travis, but I think you're asking a different question.

I am, but that's fine. What's your greatest piece of advice? What would you tell aspiring infra administrators?

Actually, one of my greatest fears is really, as a small charitable foundation, it's hard for us to compete with well-funded corporations and some well-funded start-ups.

Related to that, and I touched on it earlier, is career development... you go into Google or Microsoft and there's a career ladder; we simply don't have a career ladder. There's salary growth. There's bonuses. If you want to have a resume or a LinkedIn profile that shows changes in growth and titles and career ladder, we can't offer that, and that's going to cut out some people. It's a very hard problem for me to solve. You know, there's things I can maybe do, but I also want to keep the team egalitarian and sort of level, rather than, “Oh, well, this guy is now the team lead.”

Given what I talked about, our social aspects, because we are all equal peers, keeping everybody with the same title, same position on the ladder means that we are peers and it's a little easier to interact that way. It's a real, real difficult problem. You ask what's scary: that's scary.

But there's a counterpoint to that. You may not have a traditional career ladder path, but to say that you've worked in Infra for Apache carries weight. That's significant.

I believe it does, especially when you can demonstrate the hundred different types of tasks...

Well, that's exactly it. The breadth of work and the scale of what you guys do and the skill sets that you have to have and the fact that you have to play nice in the sandbox, all of it. The demand is immense, so to be able to be there and thrive and develop something for yourself in terms of a career is tremendous. Our team is exceptional. I mean, they're not expecting a linear ladder or something that others have.

You know, in other jobs, somebody might say, “I was a MySQL administrator.” Here, you're a MySQL administrator, PostgreSQL administrator… They had one role; here you've got dozens.

If you had a magic wand, what would you see happen with ASF Infra?

I'd like to solve that CI problem. The other magic wand would be upgrading our mail server from 10-year-old technology to modern technology.

Is that happening or is that literally a wish list issue?

It's happening, but it's been happening for three years. The thing is that email is so central to the Foundation that we can't really experiment with that. There are certain things we can do, but most of it, not so much, and so it means that we're being super-careful. There's about 10-12 different moving parts to it, and we're upgrading each of those a little bit by a little bit, until we can finally pull that big, scary, Young Frankenstein lever to hit the lightning bolt, you know?

Yeah: I see the visual of that.

The magic wand would be to just make that all happen and make it work. Without the wand, it's going to take another 6-12 months.

Right. What else do we need to know that I haven't asked? What should I be aware of or what should I be sharing?

Oh, I don't know. This is where my creativity ends. Ask me a coding question.

Oh, no coding questions. All right. Our time has also ended. Before we go, who should I be interviewing next?

I would say Daniel (Gruno), because his role ... he's 20-30% system administration. The rest is tool development, so that makes his role rather unique in the team.

Perfect. Thanks so much, Greg. I really appreciate it.

= = =

Greg is based in Austin on UTC-5. His favorite thing to drink during the workday is a big 32 oz cup of Diet Mountain Dew.