Notes On Capturing The Data Exhaust Of A Web Application
The “classical” view of the server-side part of a web application is that it consists of a bunch of web servers, database servers, caches etc. Basically, a set of components which run the application code, and a set of data storage components that code makes use of. This holds regardless of where you are on the monolith - services - microservices - serverless spectrum of application architectures.
The interesting thing about these data storage components is that they are, for lack of a better word, active: the application is the one responsible for managing the data in them. Think of the classical example of a bookstore. A book inventory service will mark books as sold out, record when a book is sold, etc.
But there’s another very important category of data components an application relies on. Applications generate action logs, exception logs and metrics. Together they form a part of the data exhaust of a system.
Action logs are commonly derived from web logs, but are usually more structured. They don’t have to come from web logs, since not all services are HTTP servers: services speaking gRPC, Thrift, or even something custom-built can generate them just as well. There also doesn’t have to be a 1-to-1 correspondence between a request and an action log entry, though that usually makes the most sense. The biggest difference from a plain web log is that you can add application-level information to the entries.
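To make that concrete, here’s a minimal sketch of what emitting a structured action log entry might look like in Python. The `log_action` helper and the field names are purely illustrative, not any standard schema:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
action_logger = logging.getLogger("actions")

def log_action(action, user_id, **extra):
    """Emit one structured action log entry as a JSON line.

    Unlike a raw web log, this can carry application-level fields
    (which user acted, which book was sold, etc.), and one request
    may emit zero, one, or several such entries.
    """
    entry = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "action": action,          # e.g. "book.sold"
        "user_id": user_id,
        **extra,                   # application-level details
    }
    action_logger.info(json.dumps(entry))

# Hypothetical usage inside the bookstore example:
log_action("book.sold", user_id=42, book_id="978-0134494166", price_cents=3999)
```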
These logs serve as a record of all the actions the application has performed, and feed analytics, data mining and machine learning, and of course monitoring & alerting. So they need to be durable and must not be lost. There are specialized systems for them. You can get by with SQL databases or the like for a while, but in production people mostly use something like the ELK stack, Loggly, Splunk, or the native GCP/AWS offerings.
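As a rough sketch of the ingestion side, here is what pushing such an entry into the “E” of ELK could look like. This assumes the 8.x Elasticsearch Python client and a made-up index name; real deployments typically ship logs via an agent like Filebeat or Logstash rather than straight from application code:

```python
from elasticsearch import Elasticsearch

# Assumes a local Elasticsearch node and the 8.x Python client.
es = Elasticsearch("http://localhost:9200")

def ship_action_log(entry: dict) -> None:
    # Index the structured entry into a hypothetical "action-logs" index,
    # where Kibana can later search and aggregate it.
    es.index(index="action-logs", document=entry)

ship_action_log({"action": "book.sold", "user_id": 42, "book_id": "978-0134494166"})
```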
Moving on to the next type of log: exception logs. When applications run, they’re going to encounter exceptions. Some are unexpected, and some sort of last-ditch handler catches them; these are more of a design error than anything else. Think of somebody not accounting for the fact that an RPC might fail, or that a call might return null. Others are expected, but the application doesn’t know what to do next, so it decides to bail and just record them. For example, an API call needs to be authenticated, but the auth service can’t be contacted. Finally, some special circumstances warrant a record like this even though the application can recover.
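A small Python sketch of those three cases, with the service, request and logger names made up for illustration:

```python
import logging

exception_logger = logging.getLogger("exceptions")

def handle_purchase(request, auth_service, inventory):
    try:
        user = auth_service.authenticate(request["token"])
    except ConnectionError:
        # Expected but unrecoverable: the auth service can't be contacted,
        # so record the failure and bail out of the request.
        exception_logger.exception("auth service unreachable")
        return {"status": 503}

    if user.get("flagged"):
        # Recoverable special circumstance that still warrants a record.
        exception_logger.warning("flagged user %s made a purchase", user["id"])

    return inventory.sell(request["book_id"], user["id"])

def serve(request, auth_service, inventory):
    try:
        return handle_purchase(request, auth_service, inventory)
    except Exception:
        # Unexpected case: a last-ditch handler at the request boundary
        # records whatever nobody accounted for (a design gap, really).
        exception_logger.exception("unhandled exception")
        return {"status": 500}
```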
These logs are used for debugging and keeping track of what’s wrong with the application. They don’t really need to live longer than the exceptions keep happening, but in practice they’re usually durable too. There are specialized systems for them, and they usually integrate with project management tools, bug trackers, Slack etc. Custom-built systems are common, but Rollbar or the native GCP/AWS offerings are also widely used.
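Wiring an application into such a service is usually only a few lines. A hedged example with the Rollbar Python SDK, where the access token and the failing call are placeholders (check the current pyrollbar docs for the exact API):

```python
import rollbar

# Placeholder access token; init() is normally called once at startup.
rollbar.init("POST_SERVER_ITEM_ACCESS_TOKEN", environment="production")

def sell_book(isbn):
    # Stand-in for real application code that can fail.
    raise RuntimeError(f"inventory service timed out for {isbn}")

try:
    sell_book("978-0134494166")
except Exception:
    # Ships the current exception, stack trace included, to the tracker,
    # which can then open a ticket, post to Slack, and so on.
    rollbar.report_exc_info()
```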
Finally, there are metrics. These are time-series data generated by applications as they run, usually measuring internal, performance-related properties: QPS, latencies etc. are common examples.
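For example, a request handler might push a counter and a timing per request to a StatsD-style aggregator. A sketch using the `statsd` Python package, with the host, port and metric names assumed:

```python
import time
from statsd import StatsClient

# Assumes a StatsD agent listening locally; metric names are illustrative.
statsd = StatsClient("localhost", 8125, prefix="bookstore.api")

def handle_request():
    start = time.monotonic()
    statsd.incr("requests")                      # feeds a QPS counter
    try:
        ...                                      # actual request handling
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        statsd.timing("latency", elapsed_ms)     # feeds a latency time series
```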
These metrics are commonly used as the basis for monitoring and alerting, as well as for per-service overview dashboards. There are again specialized systems for them: ELK is common here, as are time-series databases (TSDBs) built specifically for this kind of data, and the native GCP/AWS offerings.
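On the consuming side, monitoring largely boils down to evaluating rules over those time series. A toy sketch of one such rule, with the window and the 500 ms threshold picked arbitrarily:

```python
def p99(samples):
    # Crude 99th percentile over a window of latency samples (in ms).
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

def should_alert(latency_window_ms, threshold_ms=500):
    # Fire an alert if p99 latency over the last window crosses the
    # (hypothetical) threshold.
    return bool(latency_window_ms) and p99(latency_window_ms) > threshold_ms

print(should_alert([120, 80, 950, 110, 105]))  # True: the slow request drags p99 up
```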
There are some common patterns here. Usually there’s a custom non-SQL datastore at the heart of these systems, meant for relatively unstructured data which is written far more often than it is read. On top of this sit a bunch of analysis tools, integrations with other systems etc. These systems are more production-oriented than development-oriented: to the extent they’re included in dev setups at all, the integrations are iffy. Finally, more as a business issue, companies seem quite OK with using a third-party managed service for these tasks, much more so than for something like a SQL database.