Friday Blast #5
A lighter week for this issue.
The rise of test impact analysis (2017) - talks about advanced ways of running tests. Test Impact Analysis tries to figure out which tests would be affected by a certain set of changes, and only runs those. The idea is to speed up the testing phase, for cases where there’s a large burden of tests.
The infrastructure behind Twitter (2017) - very high-level overview of the infrastructure systems used at Twitter. I found the graphs with the percentages of machines dedicated to certain tasks interesting. Otherwise there’s a lot of pointers to the interesting tools Twitter has put out.
Jeff Dean’s lecture for YC AI (2017) - interesting mostly because Jeff Dean. Otherwise it’s an overview of what the Google Brain team has been up to these last few years. It’s quite impressive, but there isn’t that much to take away other than more machines help.
Damn cool algorithms: log structured storage (2009) - an alternative to storing a B-Tree and write ahead log for storage engines. It’s also a close relative of Log structured merge trees, of BigTable and HBase fame. The idea is similar to that of an in-memory persistent data structure. Instead of modifying data in-place, you create new versions of it, and update the indexing structure to reflect that. In the case of disks, you need to organize the whole process through a log structure, so you have to worry about copies and memory consumption a bit more.
Log structured merge trees (2015) - a presentation on how log structured merge trees work. These is an alternative data structure to B-Trees & their friends for usage in database systems. BigTable, HBase, MongoDB etc all use or can use them. Whereas B-Trees organize data as a tree, which requires a bunch of writes for every mutation, thereby providing a limited write throughput, LSM trees organize data as a series of files, chronologically arranged. Inside each file, the data is sorted by primary key. Writes first go to a in-memory sorted tree and a secondary WAL. But once this gets too big, it is flushed to disk as another of these files. Write throughput is large, as each mutation corresponds to a single write (in the WAL). A read first checks the in-memory tree, then each file in reverse order. The check for a file is quite fast, as there’s just a binary search to do. But since many files could be checked, it is heavier than a read. Periodic compaction needs to happen in order to keep the file set small. Read throughput is not that great even with compaction, in principle, but large memory caches for reads largely alleviate the problem. The article is well worth a read, and contains extensions and pointers to relevant literature.