Overview:
During the afternoon of May 6th we began experiencing mail delivery delays of 1-2 hours. Initial remediation efforts seemed to clear this up, but on the morning of May 7th the problem worsened, and we proactively disabled mail service to deal with the failure. This outage affected all ASF mailing lists and mail forwarding. The service remained unavailable until May 10th, and it took almost 5 additional days to fully flush the backlog of messages.

A timeline kept during the incident is available here: https://blogs.apache.org/infra/entry/mail_outage

This was a catastrophic failure for the Apache Software Foundation, as email is core to virtually every operation and is our primary communication medium.

What happened:

The mail service at the ASF is composed of three physical servers. Two of these are external-facing mail exchangers that receive mail. The third server handles mailing list expansion, alias forwarding, and mail delivery in general. That latter server had two volumes, each of which lost a drive. This degraded performance substantially and led to the mail delays seen on May 6th and 7th. The service was proactively disabled on May 7th in an attempt to let the arrays rebuild without the significant disk I/O overhead caused by processing the large mail backlog.

Ultimately, multiple attempts to rebuild the underlying arrays failed, and on May 8th additional drives in the array holding the data volume failed, rendering recovery hopeless. We began working to restore backups from our offsite backup location to our primary US datacenter. When this began to take longer than expected, we started additional concurrent efforts to restore service in one of our secondary datacenters as well as on a public cloud instance. The restoration to our primary US datacenter completed first, and we were able to bring the service back online.

When the service resumed, we had an estimated 10 million message backlog on top of our normal 1.7-2 million daily message flow. The volume of backlogged mail taxed the existing infrastructure and architecture of the mail service, and it took almost 5 days to clear completely.
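
For a sense of the scale involved, here is a back-of-the-envelope sketch in Python. The ~10 million message backlog and the 1.7-2 million daily flow are the figures above; the sustained delivery rate is a hypothetical number chosen only to illustrate why the drain took roughly five days, not a measured property of the service.

    # Rough estimate of how long a mail backlog takes to drain, assuming
    # constant rates. Backlog and inflow figures come from this report;
    # the throughput figure is an illustrative assumption.

    def days_to_drain(backlog, daily_inflow, daily_throughput):
        """Days until the queue is empty at a constant drain rate."""
        surplus = daily_throughput - daily_inflow  # messages actually removed per day
        if surplus <= 0:
            raise ValueError("throughput must exceed inflow or the backlog never drains")
        return backlog / surplus

    if __name__ == "__main__":
        backlog = 10_000_000     # estimated backlog when service resumed
        inflow = 2_000_000       # upper end of normal daily message flow
        throughput = 4_000_000   # assumed sustained delivery rate (hypothetical)
        print(f"~{days_to_drain(backlog, inflow, throughput):.1f} days to clear")  # ~5.0 days

Even a delivery rate double the normal daily load only drains the backlog at about 2 million messages per day, which is consistent with the roughly five days observed.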

What worked:

  • Our backups were sufficient to restore the service in good working order.
  • Early precautions taken when we discovered the problem, combined with our backups, meant no data was lost in the incident.
  • Our mail exchangers continued to work during the outage and held incoming mail until the service was restored.

What didn't work:

  • Our monitoring was not sufficient to identify the problem or alert us to the symptoms.
  • No spare hard drives for this class of machine were on hand in our primary datacenter.
  • Restoring from our remote backups took an excessively long time, partly because of the large size of the restore data and partly because of the transport method used for the data.
  • After the service was restored, we had a backlog of approximately 10 million messages that took days to clear.
  • The primary administrator of the service was on vacation, and the remaining infrastructure contractors were not intimately familiar with the service.
  • Our documentation was insufficient for folks without intimate knowledge of the service to restore it rapidly.

Remediation plan:

Our immediate action items:

  • Bring the documentation up to date and diagram the mail flow.
  • Improve monitoring of both the mail service itself and the underlying hardware (a minimal check sketch follows this list).
  • Ensure we have adequate spares on hand for the majority of our core services.
  • Place our mail server under configuration management to reduce our mean time to recovery (MTTR).
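
As an illustration of the monitoring item above, here is a minimal check sketch in Python. It assumes a sendmail-compatible MTA (for the mailq command) and Linux software RAID (for /proc/mdstat); the queue threshold is a hypothetical value, and the actual hosts may use different hardware and tooling.

    # Minimal health check sketch: alert on a growing mail queue or a degraded
    # RAID array. Assumes a sendmail-compatible `mailq` and Linux md RAID;
    # the threshold below is illustrative only.

    import re
    import subprocess
    import sys

    QUEUE_WARN = 50_000  # hypothetical queue-depth threshold

    def queue_depth():
        """Count queued messages by parsing `mailq` output (format varies by MTA)."""
        out = subprocess.run(["mailq"], capture_output=True, text=True, check=False).stdout
        # Postfix-style summary line: "-- 123 Kbytes in 45 Requests."
        match = re.search(r"in (\d+) Request", out)
        return int(match.group(1)) if match else 0

    def degraded_arrays():
        """Return /proc/mdstat status lines that show a missing member (e.g. [U_])."""
        with open("/proc/mdstat") as f:
            return [line.strip() for line in f if "blocks" in line and "_" in line]

    if __name__ == "__main__":
        problems = []
        depth = queue_depth()
        if depth > QUEUE_WARN:
            problems.append(f"mail queue depth {depth} exceeds {QUEUE_WARN}")
        problems += [f"degraded RAID array: {line}" for line in degraded_arrays()]
        if problems:
            print("\n".join(problems))
            sys.exit(2)  # non-zero exit so a monitoring system can raise an alert
        print("OK")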


Medium-to-long term initiatives:

  • Cross-train contractors on all critical services.
  • Move to a more fault-tolerant, redundant architecture.
  • More fully deploy our configuration management and automated provisioning across our infrastructure to reduce MTTR.