Designing Data-Intensive Applications Review

This is a book review for Designing Data-Intensive Applications, by Martin Kleppmann. The tl;dr is that this is a very good book and you should go out and read it. It’s the kind of book where you genuinely feel like a better developer at the end of it than at the start.

The target audience is builders of Internet applications. Specifically, folks who need to deal with the backend of things. Thus it deals with databases, services, message queues and the like. The core issue is, of course, handling data in single-machine and distributed systems - and handling lots of it. And that’s rather fitting, because in such systems organizing and properly accessing data is the hard part. I’d argue, though, that the book doesn’t only deal with, or only provide advice for, “data-intensive” applications. That is, applications dealing with TBs of data per day and processing PBs of data. Even people who work with small data need to know these things. And perhaps it’s even more important in this context, in order to avoid hard-to-detect bugs. A badly written transaction will be quickly discovered at 1000 QPS, but less so at 10 QPS.

There are three parts to the book, all roughly equal in length, but not in the difficulty of the material.

I’d characterize the first part as dealing with the interface between data systems and regular users. The first chapter, “Reliable, Scalable, and Maintainable Applications”, covers engineering goals for building applications, such as performance metrics, operations desiderata etc. The second chapter, “Data Models and Query Languages”, speaks about the various incarnations of these across history. So SQL, NoSQL, relational vs document stores, graph models etc. The third chapter, “Storage and Retrieval”, goes more into the inner workings of storage engines. But again, from the point of view of the user who needs to break the abstraction barrier of the tool they’re using and make some choices about data layout themselves. The last chapter, “Encoding and Evolution”, deals with non-database storage and transport formats such as Thrift, Protocol Buffers and JSON, and their use in distributed systems. Again, an interface issue.

The second part deals with distributed data storage proper. As such, this is the most technical part, and each chapter is quite dense. However, this is the part I felt provided the most bang for the buck, as it synthesizes a lot of results into something like a framework and ties it all to how real systems use these ideas in practice. Furthermore, many of the bits in the other parts can be picked up in a lot of places - blog articles, product documentation, tutorials etc. Whereas the topics of this part are usually reserved for research papers, distributed systems researchers’ blogs etc. The first chapter, “Replication”, describes the various ways data systems replicate data among the participating nodes in order to achieve redundancy and better performance. The second chapter, “Partitioning”, deals with the issues of partitioning data sets among the nodes of a data system, for better performance. There’s more of an engineering bent here, and fewer grand theories. The third chapter, “Transactions”, is one of the best overviews of relational datastore transaction processing I’ve read. The book would be worth it for this chapter alone, I’d say. The next chapter, “The Trouble with Distributed Systems”, looks at the problems which occur in such systems. These are basically what differentiates a distributed system from a single-machine one. Faults in components, bad clocks etc. all make an appearance here. And hot on its heels comes “Consistency and Consensus”, which deals with providing tools for solving some of the issues in the previous chapter, and the tradeoffs they require. There are a lot of references to how actual systems make use of “consensus” implementers like ZooKeeper or etcd, issues with 2PC etc.

The third part deals with “derived data”. That is, data computed from whatever you’re storing in your data system as the source of truth. This is not as intense as the previous part, as many of the theoretical issues which plague a distributed system can be overcome when you’re looking at just transforming data from one form to another. The first chapter, “Batch Processing”, deals with systems like Hadoop and MapReduce and the menagerie of tools around them. The second chapter, “Stream Processing”, is a natural evolution towards processing infinite streams of data. There’s some talk about the futuristic architectures some companies are starting to employ, such as basing everything off event streams. The book goes into more detail on this in the last chapter, “The Future of Data Systems”. The final pages are a warning about the dangers of collecting too much data without any safeguards in place for user privacy and against tracking. Which is quite poignant these days.

Besides the material itself, there is a wealth of references. Many of them are to well-known papers in the distributed systems literature. But some are to well-known blog articles. And I find this quite a good thing. In mass-audience books of this sort you don’t usually see references of any kind. Then there are technical books where you see a lot of references, but they are mostly to research papers. This book strikes a middle ground - there are enough references to lead you on to more knowledge in distributed systems, but they aren’t to niche or obscure papers. Rather, they are to the big ones, which are a good read. Similarly, a lot of my education in this topic comes from reading blog articles, and it’s good to find some of my sources in here as well. And it’s certainly easier to approach a topic through a short article than through a 20-page research paper.

Anywho, give the book a try and you won’t regret it.