SVN Service Outage - PostMortem

Summary

On Wednesday December 3rd the main US host for the ASF subversion service fails resulting in loss of service. This loss of subversion service prevent committers from submitting any changes, and whilst we have an EU mirror it is read-only and does not allow for any changes to be submitted whilst the master is offline.

The cause of the outage was a failed disk. This failed disk was part of a mirrored OS pair. Some time prior to this the alternate disk had also been replaced due to a failed state.

Timeline

0401 UTC 2014-10-26 - eris daily run output notes the degraded state of root disk gmirror
1212 UTC 2014-10-30 - INFRA-8551 created to deal with gmirror degradation.
2243 UTC 2014-12-02 - OSUOSL replaced disk in eris
0208 UTC 2013-12-03 - Subversion begins to crawl to a halt
0756 UTC 2013-12-03 - First contractor discovers something awry with subversion service
0834 UTC 2013-12-03 - Infrastructure sends out a notice about the svn issue
0916 UTC 2013-12-03 - Response to issue begins
1010 UTC 2013-12-03 - First complaints about mail being slow/down
1025 UTC 2013-12-03 - Discovery that email queue alerts had been silenced.
1225 UTC 2013-12-03 - Discovery that Eris outage affecting LDAP-based services including Jenkins and mail
1613 UTC 2013-12-03 - First attempt at power cycling eris
1717 UTC 2013-12-03 - Concern emerges that the 'good' disk in the mirror isn't.
1744 UTC 2013-12-03 - OSUOSL staff shows up in the office
1752 UTC 2013-12-03 - Blog post went up.
1906 UTC 2014-12-03 - New hermes/baldr (hades) being set up for replacement of eris
1911 UTC 2014-12-03 - #svnoutage clean room in hipchat began
2040 UTC 2014-12-03 - machine finally comes up and is usable.
2050 UTC 2014-12-03 - confusion arises between which switch is in which rack. Impedance mismatch between what OSUOSL calls racks, and what we called racks.
[Dec-3 5:50 PM] Tony Stevenson: whcih rack is this
[Dec-3 5:50 PM] Tony Stevenson: 1, 2 or 3
[Dec-3 5:50 PM] Justin Dugger (pwnguin): 19
[Dec-3 5:50 PM] David Nalley: what switch?
[Dec-3 5:50 PM] Justin Dugger (pwnguin): HW type: HP ProCurve 2530-48G OEM S/N 1: CN2BFPG1F5
[Dec-3 5:50 PM] David Nalley: ^^^^^^^^^ points to this impedance mismatch for the postmortem
[Dec-3 5:50 PM] David Nalley: no label on the switch?
2054 UTC 2014-12-03 - Data copy begins
0441 UTC 2014-12-04 - data migration finished
1457 UTC 2014-12-04 - SVN starts working again - testing begins
0647 UTC 2014-12-05 - svn-master is operational again with viewvc

Problems

It took us far too long to spin up replacement machine. This in fact took a few hours due to having to manually build the host from source media and encountering several BIOS/RaidController issues. Our endeavour to have automated provisioning of tin (bare metal) would certainly have improved this time considerably had it been available at the time of the event.

Many machines pointing to eris.a.o for LDAP - not to a service name (such as ldap1-us-west for example) which meant we couldn’t easily restore LDAP services for some US hosts without making them also think SVN services had also moved.

Assigning of issues in JIRA - It has perhaps been a long held understanding that if an issue is assigned to someone in JIRA then they are actively managing that issue. This event clearly shows how fragile that belief is.

DNS (geo) updates were problematic - Daniel will be posting a proposal on Thursday, which will outline our concerns around DNS and a viable way forward that meets our needs and is not reliant on us storing all the data in SVN to be able to effect changes to zones. (This proposal was not created as a tiger of this event, it has been worked on for a number of weeks now).

architectural problems for availablility

We couldn't promote svn-eu to master - data differences/corruption https://issues.apache.org/jira/browse/INFRA-6236
Current monitoring setup was not sufficient in catching disk errors and correctly alerting infra.

To Do

Daniel to investigate and evaluate multimaster service availability.

Implement an extended SSL check that not only ensures the service is up, but also checks cert validity (expire, revocation status etc), and the certificate chain is valid.

De-couple DNS from SVN

De-couple the SVN authz file from SVN directly. Also breser@ has suggested we use the authz validation tool available from the svn install we have on hades, as part of the template->active file generation process.

Move the ASF status page (http://status.apache.org) outside of our main colos so folks can continue to see it in the event of an outage.

Vendor provided hardware monitoring tools mandatory on all new hardware deployments.

Broader audience for incidents and status reports

More aggressive host replacement before these issues arise

Things being considered

Mandatory use of SNMP for enhanced data gathering.
Issue ‘nagging’ - develop some thoughts and ideas around the concept of auto-transitioning un-modified JIRA issues after N hours of in activity and actively nag the group until an update is made. This for example is how Atlasssian (and so many others) handle their issues. For example if an end-user doesn’t update the issue within 5 days, it is automatically closed, if we don’t update an open issue within 6 hours for a critical issue then we get nagged about it.
Automatically create new JIRA issues (utilising above mentioned auto-transition) to notify of hardware issues (not just relying on hundreds of cron emails a day).
Again as part of a wider thinking of how we use issue tracking consider the concept that you only assign an issue to yourself if you are explicitly working on it at that moment, i.e it should not sit in the queue assigned to someone for > N hours and not receive any updates.

Things that went well

The people working on the issue worked extremely well as a team. Communicating with one another via hipchat and helping each other along where required. There was a real sense of camaraderie for the first time in a very long time and this see of team helped greatly.

The team put in a bloody hard shift.

There is now a very solid understanding of the SVN service across at least 4 members of the team, as opposed to 2 x0.5 understandings before.

A much broader insight into the current design of our infrastructure was gained by the newer members of the team.