For the past few months the Infrastructure team has been working extremely hard to redesign, implement and manage changes to the email service architecture. Today we are proud to announce that phase 1 of this work is complete and has been running in production for several days now.

Phase 1 covers all components of the service except the listserv service and the mail archives; these will be included in phase 2, which we will come to later. When we started this project to review, update and manage our email infrastructure, we had several guiding principles that either the old system had to be made to conform to, or any new service would need to meet before being accepted. These principles are really acceptance criteria:

  • The service must be entirely managed (operationally) from our Puppet service.
  • All software must be packaged - i.e. as .debs, either from upstream or packaged locally and held in our own repo. Deploying from source is no longer acceptable.
  • All the work carried out by Puppet et al. must be idempotent (see the sketch following this list).
  • The service design must not restrict our ability to adapt or grow the service at will and on demand.
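
To make the idempotence criterion concrete, here is a minimal sketch of the package/config/service pattern we mean. The class and template names are illustrative rather than lifted from our manifests; the point is that every resource declares an end state, so a repeat Puppet run on a correct host changes nothing.

    class mail::mx {
      # Installed as a .deb (upstream or from our own repo), never from source.
      package { 'postfix':
        ensure => installed,
      }

      # Declares an end state: the file either already matches the template
      # or is corrected; a no-op run restarts nothing.
      file { '/etc/postfix/main.cf':
        ensure  => file,
        content => template('mail/main.cf.erb'),
        require => Package['postfix'],
        notify  => Service['postfix'],
      }

      service { 'postfix':
        ensure => running,
        enable => true,
      }
    }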

Very early on in the design and testing work it became clear that we needed a clean separation of each of the roles in the email service infrastructure. This allows us, in the future, to add more capacity of any given type should it be needed. Say, for example, we need more SpamAssassin capacity: that role can be scaled sideways to absorb the load, without the new host also having to be an MX host, listserv host and so on.
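
In Puppet terms that separation is just one role class per host type. The node and class names below are hypothetical, but they show why scaling a role sideways is cheap: it amounts to booting another host whose name matches the pattern.

    # One role per host; more SpamAssassin capacity means another scanNN
    # instance, with no MX or listserv duties attached to it.
    node /^mx\d+/    { include mail::mx }        # edge MX only
    node /^scan\d+/  { include mail::scanner }   # SpamAssassin/ClamAV only
    node /^lists\d+/ { include mail::listserv }  # listserv only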

The design we have settled upon, with phase 1 complete, can be seen in this diagram: http://www.apache.org/dev/mailflow.jpg - It shows that we have deployed several MX hosts (each of which is more than capable of handling our entire inbound mail load comfortably) in different AWS regions globally. We don't need three hosts to cope with capacity; we want three to provide network resilience should any of these instances suffer network degradation or an outage.

These MX hosts are simple Postfix instances running postscreen, RBL checks, and amavisd-new. Performing only RBL checks at the edge means clients listed on a blocklist are rejected before their mail is ever scanned, which frees the internal scanning hosts from needless work. Amavis on the MX hosts is used purely to pass email on internally for scanning.
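
As a rough illustration, the edge policy comes down to a handful of Postfix parameters in the template Puppet manages. The DNSBL choices, weights and the amavis address below are assumptions for the sketch, not our production values.

    class mail::mx::postscreen {
      # postscreen rejects blocklisted clients before an smtpd process is
      # ever tied up; anything that survives is handed to amavis, which
      # only relays it onwards to the internal scanning cluster.
      file { '/etc/postfix/main.cf':
        ensure  => file,
        content => template('mail/main.cf.erb'),  # sets, among others:
        # postscreen_dnsbl_sites     = zen.spamhaus.org*2 bl.spamcop.net*1
        # postscreen_dnsbl_threshold = 2
        # postscreen_dnsbl_action    = enforce
        # postscreen_greet_action    = enforce
        # content_filter             = smtp-amavis:[127.0.0.1]:10024
        notify  => Service['postfix'],
      }
    }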

Once mail has been passed on by the MX hosts (there is an interesting detail about exactly how Amavis handles it that may become a blog post in the near future), it is handled by our scanning cluster. This group of hosts uses SpamAssassin, ClamAV and, again, Postfix. These may not be new technologies, but having a dedicated host, or hosts in our case, lets us tune the services specifically for the resources dedicated to scanning, without worrying about choking other local services. It also means that, should we see a marked increase in mail volume, we can easily deploy a new node in a matter of minutes and have it join the rotation and start scanning email.
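
That "matter of minutes" claim rests on the whole scanning role being described in Puppet. A stripped-down sketch, using Debian-style package and service names that are assumed rather than taken from our manifests:

    class mail::scanner {
      # Every component arrives as a .deb, per the packaging criterion above.
      $pkgs = ['postfix', 'amavisd-new', 'spamassassin', 'clamav-daemon']

      package { $pkgs:
        ensure => installed,
      }

      # Note: on Debian the amavisd-new package ships its daemon as 'amavis'.
      service { ['postfix', 'amavis', 'spamassassin', 'clamav-daemon']:
        ensure  => running,
        enable  => true,
        require => Package[$pkgs],
      }
    }

Bringing a new scanner online is then a case of booting an instance, letting Puppet apply the class, and adding the host to the load balancer described next.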

All of the scanning nodes are fronted by an HAProxy instance. This lets us load balance across the nodes without having to reconfigure the MX hosts whenever the number of scanning hosts changes. It also means we can take a node out of rotation for maintenance and none of the MX hosts needs to be reconfigured or modified in any way.
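
The balancer itself is just more Puppet-managed configuration. The listen section sketched below uses made-up addresses and ports, and a real haproxy.cfg needs global and defaults sections too; the point is that adding or draining a scanner touches only this one file, never the MX hosts.

    class mail::balancer {
      # smtpchk health checks mean a scanner that is down, or deliberately
      # stopped for maintenance, silently drops out of the rotation.
      file { '/etc/haproxy/haproxy.cfg':
        ensure  => file,
        content => template('mail/haproxy.cfg.erb'),  # yields, in essence:
        # listen smtp-scanners 0.0.0.0:10024
        #     mode    tcp
        #     balance roundrobin
        #     option  smtpchk
        #     server  scan1 10.0.1.10:10024 check
        #     server  scan2 10.0.1.11:10024 check
        #     server  scan3 10.0.1.12:10024 check
        notify  => Service['haproxy'],
      }

      service { 'haproxy':
        ensure => running,
        enable => true,
      }
    }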

As we said earlier, this is only phase 1. You will see in the diagram that we are still running our old ezmlm/qmail stack. That stack now becomes the focus of phase 2, in which we will determine what changes, if any, best suit our projects and the foundation as a whole. One failing of the current system is that if the listserv host goes down, mail essentially stops flowing, as it is the authoritative host for all apache.org addresses. We will also be looking very hard at how we can run multiple listserv hosts to remove that single point of failure.

The foundation relies on email as its official internal communication mechanism; this is nowhere more evident than in the saying "If it didn't happen on the list, it didn't happen". Moving this service forward will be a significant challenge, one we hope to deliver on as soon as we can.

As always, if you have any questions, please email infrastructure@apache.org and we will do what we can to help.

On behalf of the Infrastructure Team
--pctony