Joins

Joins are a perenial source of grief among web programmers, especially those who don’t interact that much directly with the database. However, they’re an indispensible and quite powerful tool.

In a standard setting, joins are used to do something pretty simple, which can best be described as “fetch other data related to the data I already have”. I believe this is the most common use case as well, and it’s a pity it’s not put more often at the forefront.

Suppose you hava a bookshop database which looks roughly something like this:

create table Book (
  id int,
  name varchar
);

create table Review (
  id int,
  book_id int foreing key (Book.id),
  text varchar
);

If you want to retrieve a particular book, you might write something like this:

book = db.query('select * from Book where id=:id', id=10)

If you want to retrieve the book and it’s comments, a natural way would be to write something like this:

book = db.query('select * from Book where id=:id', id=10)
book.reviews = db.query('select * from Review where book_id=:book_id', book_id=book.id)

This is OKish code to write when performance isn’t an issue. In fact, it’s the kind of code you’d write for databases without foreign key relationships, such as many of the NoSQL ones. However, the code is quite inneficient. You need two queries to retrieve the data, which means two connections and all the assorted extra bandwith and increase in latency for the end-user. Furthermore, the database doesn’t have a chance to optimize the queries as a whole. There’s not much it could do for this one, but for more sophisticated queries it does matter.

Perhaps the most important issue, however, is that the complexity of our querying is limited. Suppose we’d want all books written in 2016 and their reviews. The code might look like:

books = db.query('select * from Book where year_published=:year', year=2016)
reviews = db.query('select * from Review where book_id in :book_ids', book_ids=[b.id for b in books])
# Match reviews to books
for r in reviews:
  books[r.book_id] = r

We have a large number of books being transmitted from the database. Then a second and very large query is sent. Finally the reviews are retrieved and must be matched by hand to the books.

If we ever want to retrieve Authors, Critics or anything like that, things become more complicated to code, and the performance profile turns to rubbish.

The answer, is, of course, joins. The first query can be written as:

book_with_reviews = db.query('''
    select b.*, r.*
    from Book as b
    join Reviews as r
    on b.id = r.book_id
    where b.id = :id
    ''', id=10)

Conceptually, this query does the same thing as the one above, only the matching up of the book with its reviews happens on the database side, with filtering taken into account. With proper indices, the same entities are touched, but a single request is needed. Due to the limitations of relational representation, the result returned is not quite the most intuitive, however. It is a table with all the columns in Books and all the columns in Reviews concatenated, and the rows for the book repeated for each reviews.

However, when it comes to to retrieve all books and all reviews, the situation is much simpler. The query looks like:

book_with_reviews = db.query('''
    select b.*, r.*
    from Book as b
    join Reviews as r
    on b.id = r.book_id
    ''', id=10)

It’s very easy to extend this to more and much more complicated queries. Again, a single request gets made, and the database is free to optimize the query as it sees fit:

super_books = db.query('''
    select b.*, r.*, a.*, e.*
    from Book as b
    join Reviews as r
    on b.id = r.book_id
    join Author as a
    on b.author_id = a.id
    join Edition as e
    on e.book_id = b.id
    where e.year > 2000
    ''', id=10)

The joins we’ve seen so far are examples of inner joins. Here, in order for a row to be emmited, both a book and a review which match must exist. This is not exactly what we want, because it’s quite possible for books to have no reviews. A left join includes the entity coming from the first table always, and emits nulls for the second entity wherever there isn’t a match. It is the one which produces the correct result for the example. A right join includes the entity coming from the second table always, and emits nulls for the first entity wherever there isn’t a match. Finally, full joins include all entities from the first and second tables, and match whatever they can.

One of the reasons joins are harder to grok than select and where is that programmers don’t usually have a mental model for them, when first approaching SQL. While filtering and mapping are ingrained patterns for data processing for most ot them, joins are rarely encountered. Granted, hierarchies of objects uniformly accessible in memory are a more expressive model than the relational one, so there isn’t any need for it.

However, in recent years, joins have made an appearence in the standard libraries of programming languages, as first class citizens aslong side map, filter and reduce. C#’s LINQ and Parallel LINQ are one example of this. Especially for compute intensive taks, it’s many times more elegant to express everything in LINQ, as a relational-ish transformation, than break out the for loops. Other interesting examples are Google’s FlumeJava system and Apache Spark, where distributed data processing pipelines are described using a similar syntax.