Dragon Pastoral

凯龘牧歌

Chapter24 - Distributed Email Service

Posted at # SystemDesign

Distributed Email Service

We’ll design a distributed email service, similar to gmail in this chapter.

In 2020, gmail had 1.8bil active users, while Outlook had 400mil users worldwide.

Step 1 - Understand the Problem and Establish Design Scope

Non-functional requirements

Back-of-the-envelope estimation

Step 2 - Propose High-Level Design and Get Buy-In

Email knowledge 101

There are various protocols used for sending and receiving emails:

Apart from the mailing protocol, there are some DNS records we need to configure for our email server - the MX records: dns-lookup

The priority numbers indicate preferences, where the mail server with a lower priority number is more preferred. (Just like queue in SIP) If the connection fails, the sending mail server will attempt to connect to the mail server with the next lowest priority

Email attachments are sent base64-encoded and there is usually a size limit of 25mb on most mail services. This is configurable and varies from individual to corporate accounts.

Traditional mail servers

Traditional mail servers work well when there are a limited number of users, connected to a single server. traditional-mail-server

In traditional mail servers, emails were stored on the local file system. Every email was a separate file. local-dir-storage

As the scale grew, disk I/O became a bottleneck. Also, it doesn’t satisfy our high availability and reliability requirements. Disks can be damaged and server can go down.

Distributed mail servers

Distributed mail servers are designed to support modern use-cases and solve modern scalability issues.

These servers can still support IMAP/POP for native email clients and SMTP for mail exchange across servers.

But for rich web-based mail clients, a RESTful API over HTTP is typically used.

Example APIs:

[{id: string        Unique folder identifier.
  name: string      Name of the folder.
                    According to RFC6154 [9], the default folders can be one of
                    the following: All, Archive, Drafts, Flagged, Junk, Sent,
                    and Trash.
  user_id: string   Reference to the account owner
}]
{
  user_id: string                      // Reference to the account owner.
  from: {name: string, email: string}  // <name, email> pair of the sender.
  to: [{name: string, email: string}]  // A list of <name, email> paris
  subject: string                      // Subject of an email
  body: string                         //  Message body
  is_read: boolean                     //  Indicate if a message is read or not.
}

Here’s the high-level design of the distributed mail server: high-level-architecture

Here’s what the email sending flow looks like: email-sending-flow

We need to also monitor size of outgoing message queue. Growing too large might indicate a problem:

Here’s the email receiving flow: email-receiving-flow

Step 3 - Design Deep Dive

Let’s now go deeper into some of the components.

Metadata database

Here are some of the characteristics of email metadata:

At gmail/outlook scale, the database is typically custom made to reduce input/output operations per second (IOPS).

Let’s consider what database options we have:

Based on above analysis, very few existing solutions seems to fit our needs perfectly. In an interview setting, it’s infeasible to design a new distributed database solution, but important to mention characteristics:

In order to partition the data, we can use the user_id as a partition key, so that one user’s data is stored on a single shard. This prohibits us from sharing an email with multiple users, but this is not a requirement for this interview.

Let’s define the tables:

Legend for tables to follow: legend

Here is the folders table: folders-table

emails table: emails-table

Attachments are stored in a separate table, identified by filename: attachments

Supporting fetchin read/unread emails is easy in a traditional relational database, but not in Cassandra, since filtering on non-partition/clustering key is prohibited. One workaround to fetch all emails in a folder and filter in-memory, but that doesn’t work well for a big-enough application.

What we can do is denormalize the emails table into read/unread emails tables: read-unread-emails

In order to support conversation threads, we can include some headers, which mail clients interpret and use to reconstruct a conversation thread:

{
  "headers" {
     "Message-Id": "<7BA04B2A-430C-4D12-8B57-862103C34501@gmail.com>",
     "In-Reply-To": "<CAEWTXuPfN=LzECjDJtgY9Vu03kgFvJnJUSHTt6TW@gmail.com>",
     "References": ["<7BA04B2A-430C-4D12-8B57-862103C34501@gmail.com>"]
  }
}

Finally, we’ll trade availability for consistency for our distributed database, since it is a hard requirement for this problem.

Hence, in the event of a failover or network parititon, sync/update actions will be briefly unavailable to impacted users.

Email deliverability

It is easy to setup a server to send emails, but getting the email to a receiver’s inbox is hard, due to spam-protection algorithms.

If we just setup a new mail server and start sending mails through it, our emails will probably end up in the spam folder.

Here’s what we can do to prevent that:

You don’t need to remember all of this. Just know that building a good mail server requires a lot of domain knowledge.

Searching includes doing a full-text search based on email contents or more advanced queries based on from, to, subject, unread, etc filters.

One characteristic of email search is that it is local to the user and it has more writes than reads, because we need to re-index it on each operation, but users rarely use the search tab.

Let’s compare google search with email search:

ScopeSortingAccuracy
Google searchThe whole internetSort by relevanceIndexing takes some time, so not instant results.
Email searchUser’s own email boxSort by attributes eg time, date, etcIndexing should be quick and results accurate.

To achieve this search functionality, one option is to use an Elasticsearch cluster. We can use user_id as the partition key to group data under the same node: elasticsearch

Mutating operations are async via Kafka in order to decouple services from the reindexing flow. Actually searching for data happens synchronously.

Elasticsearch is one of the most popular search-engine databases and supports full-text search for emails very well.

Alternatively, we can attempt to develop our own custom search solution to meet our specific requirements.

Designing such a system is out of scope. One of the core challenges when building it is to optimize it for write-heavy workloads.

To achieve that, we can use Log-Structured Merge-Trees (LSM) to structure the index data on disk. Write path is optimized for sequential writes only. This technique is used in Cassandra, BigTable and RocksDB.

Its core idea is to store data in-memory until a predefined threshold is reached, after which it is merged in the next layer (disk): lsm-tree

Main trade-offs between the two approaches:

Scalability and availability

Since individual user operations don’t collide with other users, most components can be independently scaled.

To ensure high availability, we can also use a multi-DC setup with leader-folower failover in case of failures: multi-dc-example

Step 4 - Wrap up

Additional talking points: