Chapter 17 - Proximity Service

Proximity Service

A proximity service enables you to discover nearby places such as restaurants, hotels, theatres, etc.

Step 1 - Understand the problem and establish design scope

Sample questions to understand the problem better:

Functional requirements

Non-functional requirements

Back-of-the-envelope calculation

Step 2 - Propose High-Level Design and get Buy-In

API Design

We’ll use a RESTful API convention to design a simplified version of the APIs.

GET /v1/search/nearby

This endpoint returns businesses based on search criteria, paginated.

Request parameters - latitude, longitude, radius
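For illustration, a request with these parameters might look like this (the values are made up):

GET /v1/search/nearby?latitude=37.4224&longitude=-122.0842&radius=5000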

Example response:

{
  "total": 10,
  "businesses":[{business object}]
}

The endpoint returns everything required to render a search results page, but a user might require additional details about a particular business, fetched via other endpoints.

Here are some other business APIs we'll need:
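For instance, CRUD-style endpoints following the same convention (illustrative, not an exhaustive list):

GET /v1/businesses/{:id} - return detailed information about a business
POST /v1/businesses - add a new business
PUT /v1/businesses/{:id} - update details of a business
DELETE /v1/businesses/{:id} - delete a business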

Data model

In this problem, the read volume is high because features like searching for nearby businesses and viewing business details are used frequently.

On the other hand, write volume is low because we rarely change business information. Hence, for this read-heavy workload, a relational database such as MySQL is a good fit.

In terms of schema, we'll need one main business table which holds information about a business: [figure: business-table]

We'll also need a geo-index table so that we can efficiently process spatial operations. This table will be discussed later when we introduce the concept of geohashes.

High-level design

Here's a high-level overview of the system: [figure: high-level-design]

Algorithms to fetch nearby businesses

In real life, one might use a geospatial database, such as geohash in Redis or Postgres with the PostGIS extension.

Let’s explore how these databases work and what other alternative algorithms there are for this type of problem.

The most intuitive and naive approach to solving this problem is to draw a circle around the person and fetch all businesses within the circle's radius: [figure: 2d-search]

This can easily be translated to a SQL query:

SELECT business_id, latitude, longitude
FROM business
WHERE (latitude BETWEEN {:my_lat} - {:radius} AND {:my_lat} + {:radius}) AND
      (longitude BETWEEN {:my_long} - {:radius} AND {:my_long} + {:radius})

This query is not efficient because we need to query the whole table. An alternative is to build an index on the longitude and latitude columns but that won’t improve performance by much.

This is because we still need to subsequently filter a lot of data regardless of whether we index by longitude or latitude: [figure: 2d-query-problem]

We can, however, build 2D indexes, and there are different approaches to that: [figure: 2d-index-options]

We'll discuss the ones highlighted in purple - geohash, quadtree, and Google S2 - as they are the most popular approaches.

Evenly divided grid

Another option is to divide the world into small, evenly sized grids: [figure: evenly-divided-grid]

The major flaw with this approach is that business distribution is uneven - there are a lot of businesses concentrated in New York and close to zero in the Sahara Desert.

Geohash

Geohash works similarly to the previous approach, but it recursively divides the world into smaller and smaller grids, where every additional two bits identify one of the four sub-quadrants: [figure: geohash-example]

Geohashes are typically represented in base32. Here's the example geohash of the Google headquarters:

01001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)

It supports 12 levels of precision, but we only need up to 6 levels for our use case: [figure: geohash-precision]
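To make the encoding above concrete, here is a minimal geohash encoder sketch in Java that produces a hash at a given precision level (our own illustration, assuming the standard geohash base32 alphabet; not code from the book):

public class GeohashEncoder {
    // Standard geohash base32 alphabet (digits plus letters, excluding a, i, l, o)
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lng, int precision) {
        double minLat = -90, maxLat = 90, minLng = -180, maxLng = 180;
        StringBuilder hash = new StringBuilder();
        boolean useLng = true; // bits alternate: longitude, latitude, longitude, ...
        int bit = 0, charIndex = 0;
        while (hash.length() < precision) {
            if (useLng) {
                double mid = (minLng + maxLng) / 2;
                if (lng >= mid) { charIndex = (charIndex << 1) | 1; minLng = mid; }
                else            { charIndex = charIndex << 1;       maxLng = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { charIndex = (charIndex << 1) | 1; minLat = mid; }
                else            { charIndex = charIndex << 1;       maxLat = mid; }
            }
            useLng = !useLng;
            if (++bit == 5) { // every 5 bits become one base32 character
                hash.append(BASE32.charAt(charIndex));
                bit = 0;
                charIndex = 0;
            }
        }
        return hash.toString();
    }
}

Calling encode(37.422, -122.084, 6) should yield a geohash starting with 9q9, consistent with the Google headquarters example above.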

Geohashes enable us to quickly locate neighboring regions based on a shared prefix of the geohash: [figure: geohash-substring]

However, one issue with geohashes is that there can be places which are very close to each other but don't share any prefix, because they're on different sides of the equator or a meridian: [figure: boundary-issue-geohash]

Another issue is that two businesses can be very close but not share a common prefix because they're in different quadrants: [figure: geohash-boundary-issue-2]

This can be mitigated by fetching neighboring geohashes, not just the geohash of the user.

A benefit of using geohashes is that they make it easy to implement the bonus problem of increasing the search radius when a query doesn't return enough businesses: [figure: geohash-expansion]

This can be done by removing the last character of the target geohash, which widens the covered area.
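Here's a small sketch of that expansion loop (our own illustration; fetchByGeohashPrefix is a hypothetical helper that runs a prefix query like the one shown later in the caching section):

public List<String> searchWithExpandedRadius(String geohash, int minResults) {
    String prefix = geohash;
    List<String> businessIds = fetchByGeohashPrefix(prefix);
    // Dropping the last character moves to a coarser precision level,
    // which covers a larger area, until we have enough results.
    while (businessIds.size() < minResults && prefix.length() > 1) {
        prefix = prefix.substring(0, prefix.length() - 1);
        businessIds = fetchByGeohashPrefix(prefix);
    }
    return businessIds;
}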

Quadtree

A quadtree is a data structure which recursively subdivides quadrants as deep as needed, based on the number of businesses in each grid: [figure: quadtree-example]

This is an in-memory solution which can’t easily be implemented in a database.

Here's how it might look conceptually: [figure: quadtree-concept]

Example pseudocode to build a quadtree:

public void buildQuadtree(TreeNode node) {
    // Keep subdividing until each grid contains at most 100 businesses
    if (countNumberOfBusinessesInCurrentGrid(node) > 100) {
        node.subdivide();
        for (TreeNode child : node.getChildren()) {
            buildQuadtree(child);
        }
    }
}

In a leaf node, we store the boundary of the grid and the list of business IDs inside it; in an internal node, we store the grid boundary and pointers to its four children (see the sketch below).
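A rough sketch of such a node (our own illustration, not the book's exact layout):

public class TreeNode {
    // Boundary of the grid this node covers (top-left and bottom-right corners)
    double topLeftLat, topLeftLng, bottomRightLat, bottomRightLng;

    // Internal nodes: pointers to the four children (null for leaf nodes)
    TreeNode[] children;

    // Leaf nodes: IDs of the businesses located inside this grid
    List<Long> businessIds;
}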

The total memory to represent the quadtree is calculated as ~1.7GB in the book, assuming we operate with 200 million businesses.
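As a rough sanity check of that figure (our own approximation, not the book's exact breakdown): storing 200 million 8-byte business IDs in the leaf nodes alone takes 200,000,000 × 8 bytes ≈ 1.6GB, so ~1.7GB including node overhead is plausible.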

Hence, a quadtree can be stored in a single server, in-memory, although we can of course replicate it for redundancy and load balancing purposes.

One consideration if this approach is adopted - server startup time can be a couple of minutes while the quadtree is being built.

Hence, this should be taken into account during the deployment process. For example, a health check endpoint can be exposed and queried to signal when the quadtree build has finished.

Another consideration is how to update the quadtree. Given our commitment to reflect changes by the start of the next day, a good option is to rebuild it with a nightly job.

It is nevertheless possible to update the quadtree on the fly, but that would complicate the implementation significantly.

Example quadtree of Denver: [figure: denver-quadtree]

Google S2

Google S2 is a geometry library which supports mapping 2D points to a 1D index using Hilbert curves. Objects close to each other in 2D space are close on the Hilbert curve as well: [figure: hilbert-curve]

This library is great for geofencing, because it supports covering arbitrary areas instead of confining you to fixed quadrants: [figure: geofence-example]

This functionality can be used to support more advanced use-cases than nearby businesses.

Another benefit of Google S2 is its region cover algorithm, which enables us to define more granular precision levels than those provided by geohashes.

Recommendation

There is no perfect solution - different companies adopt different approaches: [figure: company-adoptions]

The author suggests choosing geohash or quadtree in an interview, as those are easier to explain than Google S2.

Here’s a quick summary of geohashes:

Here’s a quick summary of quadtrees:

Step 3 - Design Deep Dive

Let’s dive deeper into some areas of the design.

Scale the database

The business table can be scaled by sharding it in case it doesn’t fit in a single server instance.

The geohash table can be represented by two columns - geohash and business_id: [figure: geohash-table-example]

We don't need to shard the geohash table as we don't have that much data. We calculated that it takes ~1.7GB to build a quadtree, and geohash space usage is similar.

We can, however, replicate the table to scale the read load.

Caching

Before adding a cache, we should ask ourselves if it is really necessary. In our case, the workload is read-heavy and the data can fit on a single server, so this kind of data is ripe for caching.

We should be careful when choosing the cache key. Location coordinates are not a good cache key as they often change and can be inaccurate.

Using the geohash is a more suitable key candidate.

Here’s how we could query all businesses in a geohash:

SELECT business_id FROM geohash_index WHERE geohash LIKE '{:geohash}%'

Here's example code to cache the data in Redis:

public List<String> getNearbyBusinessIds(String geohash) {
    String cacheKey = hash(geohash);
    List<String> listOfBusinessIds = redis.get(cacheKey);
    if (listOfBusinessIds == null) {
        // Cache miss - run the SELECT query above and populate the cache
        listOfBusinessIds = runNearbySearchQuery(geohash);
        redis.set(cacheKey, listOfBusinessIds, "1d"); // expire after one day
    }
    return listOfBusinessIds;
}

We can cache the data at all precision levels we support, which are not many, i.e. geohash_4, geohash_5 and geohash_6.
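Because precision levels nest by prefix, the cache keys for a given location at those precisions are just prefixes of the same geohash; a tiny illustration (values reuse the earlier example):

String geohash6 = "9q9hvu";                  // precision 6
String geohash5 = geohash6.substring(0, 5);  // "9q9hv", precision 5
String geohash4 = geohash6.substring(0, 4);  // "9q9h", precision 4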

As we already discussed, the storage requirements are not high and the data can fit into a single Redis server, but we could replicate it for redundancy purposes as well as to scale reads.

We can even deploy multiple Redis replicas across different data centers.

We could also cache business_id -> business_data as users could often query the details of the same popular restaurant.

Region and availability zones

We can deploy multiple LBS (location-based service) instances across the globe so that users query the instance closest to them. This leads to reduced latency: [figure: cross-dc-deployment]

It also enables us to spread traffic evenly across the globe. This could also be required in order to comply with certain data privacy laws.

Follow-up question - filter businesses by type or time

Once nearby businesses are fetched via the geohash query, the result set is going to be small, so it is acceptable to filter it by type or opening hours in memory.
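For example, a minimal in-memory filter over the already-fetched candidate list (Business and getType are hypothetical names used for illustration):

List<Business> restaurants = candidates.stream()
        .filter(b -> "restaurant".equals(b.getType()))
        .collect(Collectors.toList());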

Final design diagram

[figure: final-design]

Step 4 - Wrap Up

Summary of some of the more interesting topics we covered: