Mokum architecture: rivers, pt. 1

Published by @squadette on 2016-11-20

In this chapter we discuss the question of how you actually see the posts from people you’re subscribed to.

As we said in a previous chapter, timelines are never accessed directly. Timelines are visible through so-called “rivers”.

Rivers (and river entries) are “secondary” objects: they live in a separate database. This database is actually transient and could be lost at any time. It contains posts together with comments, likes and everything else in a heavily denormalized JSON format.

Rivers also allow very fast access to “hot” pages, that is, pages you visited recently. If you visit Mokum every day, the river corresponding to your newsfeed will be maintained indefinitely, until you stop visiting it for 24 hours.

Realtime page updates also use rivers as their foundation. Only posts which somebody visited recently get realtime updates. This allows the system to scale in proportion to the number of active users, independently of the number of posts stored in the database.

The river-based architecture also allows for horizontal scalability: rivers are completely independent of each other and of the primary database, so more than one web server can create and maintain rivers.

There are three tables containing this secondary data: rivers, entries, and rivers_timelines. (NB: the timeline_entries table from the previous chapter has a somewhat unfortunate name. It is important not to confuse timeline entries with river entries, although they are somewhat related.)

The rivers table has the following schema:

user_id INTEGER
their_id INTEGER
finished BOOLEAN
constructed BOOLEAN
visited_at, created_at, updated_at TIMESTAMP
version INTEGER

The user_id field contains the ID of the user this river belongs to. This field can be zero for anonymous users (all anonymous users see the same river, updated at the same time for all of them).

Rivers specify the object they correspond to in the same way as timelines do: with a tuple of (what, their_id). The list of possible river types looks very similar to the list of timeline types, but the two lists differ in a number of ways. Possible river types are:

  • “Homefeed” (everything you’re subscribed to);

  • user feed, private sub-feed, “for your eyes only”;

  • user likes, user comments, posts faved by user;

  • “My Discussions”;

  • direct messages for user (both sent and received);

  • group feeds;

  • single post river;

  • everything, “best of”, “most faved” (for a certain language);

  • search results river;

  • there are also a few more river types which we omit for brevity.

Rivers are constructed from one or more timelines. The structure of each river type is defined in code.

Some rivers are constructed from a single timeline: for example, the /bob/comments river corresponds to the timeline ("user_comment", bob_id). Some rivers are constructed from several timelines: for example, the user page (/alice) is constructed from the timelines for Alice’s primary feed and private sub-feed (if she has one). Some rivers can be constructed from many timelines: e.g., if you’re subscribed to 100 users, your homefeed river will be constructed from more than 300 timelines (3 on average for each user * 100 + timelines for “your discussions”).
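To make this concrete, here is a minimal sketch of how a river's structure could be defined in code as a list of (what, their_id) timeline tuples. The function names and all timeline type names other than "user_comment" are hypothetical, and the choice of exactly three timelines per subscribed user is an assumption taken from the "3 on average" figure above; this is not Mokum's actual code.

```python
from typing import List, Tuple

Timeline = Tuple[str, int]  # (what, their_id), the same tuple used for timelines

# Hypothetical timeline kinds; only "user_comment" is named in the text.
PER_USER_KINDS = ["user_feed", "user_comment", "user_like"]


def comments_river(user_id: int) -> List[Timeline]:
    """A river like /bob/comments: built from a single timeline."""
    return [("user_comment", user_id)]


def homefeed_river(viewer_id: int, subscribed_ids: List[int]) -> List[Timeline]:
    """A homefeed: several timelines per subscribed user plus the viewer's discussions."""
    timelines = [(kind, uid) for uid in subscribed_ids for kind in PER_USER_KINDS]
    timelines.append(("my_discussions", viewer_id))  # hypothetical kind name
    return timelines
```

Under these assumptions, a viewer subscribed to 100 users gets a homefeed built from 3 * 100 + 1 = 301 timelines, which matches the "more than 300" estimate.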

When you look at a post page, you actually look at a “single post” river, which will always contain only this post. This river enjoys special handling in three or four places in the codebase, and it has a counterintuitive timeline structure. However, the benefits of unified handling with respect to real-time updates, access control and caching greatly outweigh the cost of this special handling.

Access to rivers is checked in several different ways:

  • First, you won’t be able to access some rivers at all, for example if you try to look at the feed of a private user who you are not subscribed to. In that case you will get an “access denied” error even before the river is created.

  • Second, there is no way to request access to some rivers, such as the “Direct messages” of another user. The URL for that page is /filter/directs; it does not contain the name of a user, so only rivers for the current user will be constructed.

  • Third, before a post is added to the river, Mokum checks whether you are allowed to access it by looking at its decisive timelines, as described in the previous chapter. So, if some user comments on private posts, you will not see those posts on the comments page of that user.

  • Fourth, when creating a river, Mokum checks each timeline to see if you have access to it, and ignores those which you don’t have access to. This is done mostly for ease of coding and for performance, but also as an additional precaution.
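The fourth check can be sketched as a simple filter over the candidate timelines. Everything here is illustrative: the can_access rule, the PRIVATE_USERS and SUBSCRIPTIONS sets, and the "user_feed" type name are hypothetical stand-ins, and the real access rules are certainly richer than this.

```python
from typing import Callable, Iterable, List, Tuple

Timeline = Tuple[str, int]  # (what, their_id)

PRIVATE_USERS = {42}       # hypothetical: users whose feeds are private
SUBSCRIPTIONS = {(1, 42)}  # hypothetical: (subscriber_id, target_id) pairs


def can_access(viewer_id: int, timeline: Timeline) -> bool:
    """Toy rule: private timelines are visible to the owner and to subscribers."""
    _what, their_id = timeline
    if their_id not in PRIVATE_USERS:
        return True
    return viewer_id == their_id or (viewer_id, their_id) in SUBSCRIPTIONS


def accessible_timelines(
    viewer_id: int,
    timelines: Iterable[Timeline],
    check: Callable[[int, Timeline], bool] = can_access,
) -> List[Timeline]:
    """Keep only the timelines the viewer may read; the rest are silently skipped."""
    return [t for t in timelines if check(viewer_id, t)]
```

Dropping inaccessible timelines up front means that fewer posts reach the per-post check from the third bullet, which is where the performance benefit would come from.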

When a river is created, its initial list of timelines is kept in the rivers_timeline table, which has only two fields: river_id and timeline_id. This list is used to update the river incrementally and to distribute real-time updates.
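One plausible use of this mapping is to answer the question "which rivers need an update when a given timeline changes?". The sketch below uses an in-memory SQLite database purely for illustration; the function name rivers_to_update and the sample rows are hypothetical, and the text does not say which database engine Mokum uses for rivers.

```python
import sqlite3
from typing import Iterable, List


def rivers_to_update(conn: sqlite3.Connection, timeline_ids: Iterable[int]) -> List[int]:
    """Return IDs of rivers built from at least one of the changed timelines."""
    ids = list(timeline_ids)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT DISTINCT river_id FROM rivers_timeline "
        f"WHERE timeline_id IN ({placeholders}) ORDER BY river_id",
        ids,
    ).fetchall()
    return [row[0] for row in rows]


# Illustrative data: rivers 1 and 2 both include timeline 101.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rivers_timeline (river_id INTEGER, timeline_id INTEGER)")
conn.executemany(
    "INSERT INTO rivers_timeline VALUES (?, ?)",
    [(1, 100), (1, 101), (2, 101), (3, 102)],
)
```

With this data, a change in timeline 101 touches rivers 1 and 2 only; rivers that nobody has created yet cost nothing.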

A river contains entries, which hold heavily denormalized post data in JSON format. The entries table has the following schema:

CREATE TABLE entries (
river_id INTEGER,
post_id INTEGER,
version INTEGER,
text_folded TEXT,
likes_folded TEXT,
comments_folded TEXT,
attachments_folded TEXT,
per_river_folded TEXT,
updated_at TIMESTAMP,
fresh_at TIMESTAMP,
deleted BOOLEAN,
river_version INTEGER
);

Each entry contains the data for a post, formatted as JSON strings in the five *_folded fields. We will discuss the version field in a chapter on river consistency. The deleted and river_version fields will be discussed in a chapter on river reconstruction. The fresh_at field is the one the river is sorted on.
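As an illustration of what "folded" data could look like on the reading side, here is a minimal sketch that decodes the five *_folded JSON strings of one entry into a single dictionary. The helper name unfold_entry and the shape of the JSON inside each field are assumptions made for the example; the text only says that these fields hold JSON strings.

```python
import json
from typing import Any, Dict, Optional

FOLDED_FIELDS = [
    "text_folded",
    "likes_folded",
    "comments_folded",
    "attachments_folded",
    "per_river_folded",
]


def unfold_entry(row: Dict[str, Optional[str]]) -> Dict[str, Any]:
    """Decode each non-empty *_folded JSON string, keyed by the name without the suffix."""
    return {
        name[: -len("_folded")]: json.loads(row[name])
        for name in FOLDED_FIELDS
        if row.get(name) is not None
    }


# Hypothetical entry row; the JSON payloads are invented for illustration.
example_row = {
    "text_folded": '{"body": "hello"}',
    "likes_folded": "[1, 2]",
    "comments_folded": "[]",
    "attachments_folded": "[]",
    "per_river_folded": None,
}
```

Keeping the parts in separate fields would let a like or a comment rewrite only its own JSON string instead of the whole entry, though whether Mokum relies on that is not stated here.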

River entries are created on demand when you visit a new page or when you click the “older entries” link. Rivers are maintained while you’re actively using them, and after a timeout of a few hours their entries are deleted to reclaim space.

If the river is already created and you visit the page again, in the best case the only SQL query the server needs to run is "SELECT * FROM entries WHERE river_id = ? ORDER BY fresh_at DESC LIMIT 25". This makes Mokum very fast for active users. Many different optimizations are involved in delivering the fresh contents of a river, with multiple layers of caching, so that page rendering stays quite fast even in less favorable cases.
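The best-case query can be tried out against a toy table. The sketch below runs the exact query quoted above on an in-memory SQLite database; the cut-down three-column table, the sample rows and the river_page helper are hypothetical and exist only to show the ordering and the per-river filtering.

```python
import sqlite3

PAGE_QUERY = "SELECT * FROM entries WHERE river_id = ? ORDER BY fresh_at DESC LIMIT 25"


def river_page(conn: sqlite3.Connection, river_id: int) -> list:
    """Fetch the first page of a river, freshest entries first."""
    return conn.execute(PAGE_QUERY, (river_id,)).fetchall()


# A cut-down entries table with only the columns the query depends on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (river_id INTEGER, post_id INTEGER, fresh_at TIMESTAMP)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        (1, 10, "2016-11-20 10:00:00"),
        (1, 11, "2016-11-20 12:00:00"),
        (2, 12, "2016-11-20 11:00:00"),
    ],
)
```

For river 1 this returns post 11 before post 10, and post 12 is not returned at all because it belongs to another river.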

The lifecycle of a river goes like this:

  • a river is created on the first visit of a certain page by a certain user;

  • the river is updated by a high-priority background worker if the user has that page open in their browser (this is called “Tier 1 updates”), and updates are sent to the user’s browser over WebSocket. “Updates” are new likes, comments, edits, favs, etc.;

  • if the user closes the browser page, the river continues to be updated by a lower-priority background worker until a grace period of a few hours expires. This is called “Tier 2 updates”;

  • river entries are updated immediately for changes caused by the user. If you like a post, the river entry you’re looking at will be updated by the web server itself and sent back to you in the HTTP response to minimize latency. This is called “Tier 0 updates”;

  • sometimes rivers have to be reconstructed: for example, if you subscribe to someone, or someone changes their feed visibility to private, and in many other cases. River reconstruction will be discussed in a separate chapter;

  • if you do not visit a page again for a few hours, its river will be cleaned up by a cron job.
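The three update tiers above can be summarized as a small decision function. This is only a sketch of the rules as described: the function name update_tier, its parameters, and the three-hour GRACE_PERIOD are assumptions (the text says only "a few hours").

```python
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(hours=3)  # assumption: the text only says "a few hours"


def update_tier(
    caused_by_viewer: bool,
    page_open: bool,
    visited_at: datetime,
    now: datetime,
) -> Optional[int]:
    """Pick the update tier for a river, or None if it is due for cleanup."""
    if caused_by_viewer:
        return 0  # updated by the web server itself, inside the HTTP response
    if page_open:
        return 1  # high-priority background worker, pushed over WebSocket
    if now - visited_at <= GRACE_PERIOD:
        return 2  # lower-priority background worker during the grace period
    return None  # not visited recently: the cron job may delete the river
```

For example, a river whose page was closed an hour ago falls into Tier 2, while one last visited five hours ago is a candidate for cleanup.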

In the following chapters we will discuss all those phases of a river’s lifetime: creating, updating, distributing real-time updates, and reconstructing.