Chapter 18 - Nearby Friends

This chapter focuses on designing a scalable backend for an application which enables users to share their location and discover friends who are nearby.

The major difference from the proximity chapter is that in this problem locations constantly change, whereas in that one, business addresses more or less stay the same.

Step 1 - Understand the Problem and Establish Design Scope

Some questions to drive the interview:

Functional requirements

Non-functional requirements

Back-of-the-envelope

Some estimations to determine potential scale:

Step 2 - Propose High-Level Design and Get Buy-In

Before exploring the API and data model design, we’ll study the communication protocol we’ll use, as it’s less ubiquitous than the traditional request-response model.

High-level design

At a high level, we want to establish effective message passing between peers. This could be done via a peer-to-peer protocol, but that’s not practical for a mobile app with a flaky connection and tight power-consumption constraints.

A more practical approach is to use a shared backend as a fan-out mechanism towards the friends you want to reach (see figure: fan-out-backend).

What does the backend do?

This sounds simple, but the challenge is designing the system for the scale we’re operating at.

We’ll start with a simpler design first and discuss a more advanced approach in the deep dive (see figure: simple-high-level-design).

In the above example, WebSocket servers subscribe to the channels of the users connected to them and forward location updates to the appropriate users whenever they receive them.
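
As a rough illustration of the subscriber side, here’s a minimal sketch in Python, assuming Redis pub/sub via the redis-py client; the location_updates:{user_id} channel naming and the send_to_client helper are illustrative assumptions, not prescribed by the chapter:

```python
import json
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)

def send_to_client(user_id: int, payload: dict) -> None:
    # Stand-in for pushing a message down the user's WebSocket connection.
    print(f"-> user {user_id}: {payload}")

def on_user_connected(pubsub, friend_ids) -> None:
    # When a user connects to this WebSocket server, subscribe to the pub/sub
    # channel of every friend whose location updates they should receive.
    pubsub.subscribe(*(f"location_updates:{fid}" for fid in friend_ids))

def forward_loop(pubsub, channel_to_listeners) -> None:
    # Forward every location update received on a subscribed channel to the
    # locally connected users who follow that channel.
    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        channel = message["channel"].decode()
        update = json.loads(message["data"])
        for user_id in channel_to_listeners.get(channel, []):
            send_to_client(user_id, update)
```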

Periodic location update

Here’s how the periodic location update flow works (see figure: periodic-location-update).

Here’s a more detailed version of the same flow (see figure: detailed-periodic-location-update).

On average, there will be around 40 location updates to forward per received update, as a user has 400 friends on average and 10% of them are online at any given time.
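
The publishing side of this flow could look roughly like the sketch below, again assuming redis-py; the channel naming and payload fields are assumptions for illustration:

```python
import json
import time
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)

def on_location_update(user_id: int, latitude: float, longitude: float) -> None:
    # Called when a client pushes a periodic location update over its WebSocket
    # connection. Publishing to the user's own channel lets Redis fan the update
    # out to every subscribed WebSocket server, which in turn forwards it to the
    # ~40 online friends on average.
    payload = json.dumps({
        "user_id": user_id,
        "latitude": latitude,
        "longitude": longitude,
        "timestamp": int(time.time()),
    })
    r.publish(f"location_updates:{user_id}", payload)

# Example: user 42 reports a new position.
on_location_update(42, 37.7749, -122.4194)
```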

API Design

WebSocket routines we’ll need to support:

HTTP API - traditional request/response payloads for auxiliary responsibilities.

Data model

Step 3 - Design Deep Dive

Let’s discuss how we scale the high-level design so that it works at the scale we’re targeting.

How well does each component scale?

Scaling deep-dive on redis pub/sub component

We will need around 200 GB of memory to maintain all pub/sub channels. This can be achieved by using 2 Redis servers with 100 GB each.

However, given that we need to push ~14 million location updates per second and a single Redis server can handle ~100k pushes per second, we will need at least 140 Redis servers to handle that load.

Hence, we’ll need a distributed Redis cluster to handle the intense CPU load.
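
As a quick sanity check on these numbers, using only the figures quoted above:

```python
import math

total_channel_memory_gb = 200      # memory needed for all pub/sub channels
memory_per_server_gb = 100         # memory available per Redis server
pushes_per_second = 14_000_000     # ~14 million location updates pushed per second
pushes_per_server = 100_000        # ~100k pushes per second per Redis server

servers_for_memory = math.ceil(total_channel_memory_gb / memory_per_server_gb)  # 2
servers_for_cpu = math.ceil(pushes_per_second / pushes_per_server)              # 140

# CPU throughput, not memory, is the bottleneck, hence at least 140 servers.
print(servers_for_memory, servers_for_cpu)
```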

To support a distributed Redis cluster, we’ll need to utilize a service discovery component, such as ZooKeeper or etcd, to keep track of which servers are alive.

What we need to encode in the service discovery component is the channel distribution data (see figure: channel-distribution-data).

WebSocket servers use that encoded data, fetched from ZooKeeper, to determine where a particular channel lives. For efficiency, the hash ring data can be cached in memory on each WebSocket server.
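
Here’s a minimal sketch of what that in-memory lookup could look like, assuming a simple consistent-hash ring; the server addresses, channel names, and number of virtual nodes are illustrative:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring mapping pub/sub channels to Redis servers."""

    def __init__(self, servers, virtual_nodes=100):
        # Place several virtual nodes per server on the ring to smooth the distribution.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(virtual_nodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, channel):
        # Walk clockwise to the first virtual node at or after the channel's
        # hash, wrapping around the end of the ring.
        idx = bisect.bisect(self._keys, self._hash(channel)) % len(self._keys)
        return self._ring[idx][1]

# Built from the data fetched from ZooKeeper/etcd and cached on each WebSocket server.
ring = HashRing(["redis-1:6379", "redis-2:6379"])
print(ring.server_for("location_updates:42"))
```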

In terms of scaling the server cluster up or down, we can set up a daily job to scale the cluster as needed based on historical traffic data. We can also overprovision the cluster to handle spikes in load.

The Redis cluster should be treated as stateful storage, as there is state maintained for the channels and subscribers need to be coordinated so that they hand off to newly provisioned nodes in the cluster.

We have to be mindful of some potential issues during scaling operations:

Adding/removing friends

Whenever a friend is added or removed, the WebSocket server responsible for the affected user needs to subscribe to or unsubscribe from the friend’s channel.

Since the “nearby friends” feature is part of a larger app, we can assume that a callback can be registered on the mobile client for these events, and that the client will send a message to the WebSocket server to perform the appropriate action.
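
A hedged sketch of how the WebSocket server might react to such a message, assuming redis-py and an illustrative event shape:

```python
import redis  # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()

def handle_friend_event(event: dict) -> None:
    # Called when the client forwards a friend added/removed callback over its
    # WebSocket connection; the event fields here are illustrative.
    channel = f"location_updates:{event['friend_id']}"
    if event["type"] == "friend_added":
        pubsub.subscribe(channel)    # start receiving the new friend's updates
    elif event["type"] == "friend_removed":
        pubsub.unsubscribe(channel)  # stop receiving the removed friend's updates

handle_friend_event({"type": "friend_added", "user_id": 42, "friend_id": 7})
```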

Users with many friends

We can put a cap on the total number of friends one can have; e.g., Facebook caps friends at 5,000.

The WebSocket server handling a “whale” user might have a higher load on its end, but as long as we have enough WebSocket servers, we should be okay.

Nearby random person

What if the interviewer wants to update the design to include a feature where we can occasionally see a random person pop up on our nearby friends map?

One way to handle this is to define a pool of pub/sub channels based on geohash (see figure: geohash-pubsub).

Anyone within the geohash subscribes to the appropriate channel to receive location updates from random users (see figure: location-updates-geohash).

We could also subscribe to several geohashes to handle cases where someone is close by but in a bordering geohash (see figure: geohash-borders).
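
A minimal sketch of both sides of this idea, assuming the python-geohash library (imported as geohash) for encoding and neighbor lookup and redis-py for pub/sub; the channel naming and precision level are assumptions:

```python
import json
import geohash  # assumes the python-geohash library
import redis    # assumes the redis-py client

r = redis.Redis(host="localhost", port=6379)

def publish_to_geohash_channel(user_id: int, lat: float, lng: float) -> None:
    # Publish the update to the channel of the geohash cell the user is in.
    cell = geohash.encode(lat, lng, precision=6)
    payload = json.dumps({"user_id": user_id, "latitude": lat, "longitude": lng})
    r.publish(f"geohash:{cell}", payload)

def subscribe_around(pubsub, lat: float, lng: float) -> None:
    # Subscribe to the user's own cell plus the bordering cells, so a nearby
    # person just across a geohash boundary is still picked up.
    cell = geohash.encode(lat, lng, precision=6)
    pubsub.subscribe(f"geohash:{cell}", *(f"geohash:{n}" for n in geohash.neighbors(cell)))

subscriber = r.pubsub()
subscribe_around(subscriber, 37.7749, -122.4194)
publish_to_geohash_channel(42, 37.7750, -122.4194)
```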

Alternative to Redis pub/sub

An alternative to using Redis for pub/sub is to leverage Erlang, a general-purpose programming language optimized for distributed computing applications.

With it, we can spawn millions of small, lightweight Erlang processes which communicate with each other. We can handle both WebSocket connections and pub/sub channels within the distributed Erlang application.

A challenge with using Erlang, though, is that it’s a niche programming language and it could be hard to source strong Erlang developers.

Step 4 - Wrap Up

We successfully designed a system supporting the “nearby friends” feature.

Core components:

We explored how to scale the RESTful API servers, WebSocket servers, the data layer, and the Redis pub/sub servers, discussed an alternative to Redis pub/sub, and covered a “random nearby person” feature.