DS I - That's no app. It's a platform.
A blog series on data science and engineering
Maybe you heard it was the sexiest discipline of the 21st century? I tried to warn you in the introductory post, but it didn’t scare you off?
Welcome to the first post in a series on data engineering, data science, and how to make these things actually work. We won’t be writing any code in this edition; we’ll just be outlining the structure of what we’re going to build over the next few posts, and why. We’ll start by talking about this idea of a ‘platform’, and what that might entail, then we’ll outline what components we might want on our platform.
We’ll then code it up (using Scala, Python, JS, whatever comes to hand really) over the following posts. I won’t expect familiarity with the nuances of every language; that’s part of the learning experience I’m aiming for. If I haven’t covered something sufficiently, get me on Twitter and let me know.
Now, most blogs like this would start off by telling you to download Python, install Jupyter… then we’d go through a variety of motions, culminating in the building of a decision tree in scikit-learn, at which point your salary would increase two-fold, your beard would become thick and lustrous (ladies, this applies to you too), and you would develop a passion for obscure areas of mathematics. Here, have some wow.
Sorry…
I’m looking to do things a bit more rigorously here. This blog is about doing data science and engineering in the real world, and how to solve the issues that arise. If you obtain the advantage of a beard from reading this blog, it will simply be because you haven’t left your home or showered in a week while you try to debug some mistake I inadvertently included. While I’m sure that you want to hear about the latest TensorFlow model (probably so you can go and use the pre-trained version amirite? 😏), there are good reasons to talk about the platform first.
It often comes about that we build a platform without realising it. Most of the code we write in relation to a data science project actually has nothing to do with the specific task at hand.
An example application
For an initial example, let’s think about building a system which stores Tweets, while making them available (via an API) to other applications. This is the first bit of functionality we’ll build as a part of this series.
You: goes and downloads the Twitter app from the app store and shuts the browser tab.
Hang on… My domain is data science and engineering (unsubstantiated rumours suggest I write a blog about it), so let’s add three non-functional requirements (NFRs) to ensure I can contribute something at least slightly novel. Let’s demand that the system be scalable, near real time (which I think is kind of implicit in talking about a real time source anyway, but some may disagree), and offer high availability (HA).
So our above requirement sounds simple, but there are a few things that should tip us off to the fact that it isn’t. Firstly, we’ve outlined a need for horizontal scalability. That means we need to be able to add and remove instances of the application without interruption of service. Secondly, we’ve outlined HA as a requirement; this means we always need sufficient instances to serve requests, and in turn monitoring, triggers and autoscaling to figure out how many instances that is. Finally, we’ve asked for storage, which necessitates a measure of persistence. In other words, whatever we implement needs to be a horizontally autoscaling, highly available distributed system - not straightforward, no matter how good you are at installing pandas and NumPy.
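To make this concrete before we abstract it away, here’s roughly the shape of the ingestion side of that system. This is only a sketch under assumptions we haven’t earned yet - a Kafka broker on localhost:9092, a topic called tweets, and the kafka-python client - but notice how little of it is actually about Tweets:

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumption: a Kafka broker is reachable locally and a 'tweets' topic
# exists - we'll stand all of this up properly in later posts.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def store_tweet(tweet: dict) -> None:
    """Publish one tweet; persistence and replication are Kafka's problem."""
    producer.send("tweets", tweet)

store_tweet({"id": 1, "text": "hello, platform"})
producer.flush()  # block until the broker has acknowledged the write
```

Everything hard in our NFRs - scaling, availability, persistence - has been pushed out of the application and into the platform underneath it.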
The need for a platform
The platform is basically going to give us two things, a generic data plane and a generic control plane, and will include:
- A Kafka cluster plus Schema Registry to provide scaling and durability for the data we write, manage all persistent storage concerns, and enable rapid failovers.
- etcd for configuration management (there’s a small taste of this just after the list).
- Kubernetes for container orchestration: scheduling, scaling and general cluster management.
- Fluentd for log collection and monitoring.
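As a small taste of the control plane side, here’s a minimal sketch of configuration management with etcd. It assumes a local etcd on its default client port and the python-etcd3 client, with a made-up key name:

```python
import etcd3  # pip install etcd3

# Assumption: etcd is running locally on its default client port (2379).
etcd = etcd3.client(host="localhost", port=2379)

# Write a piece of application configuration (hypothetical key)...
etcd.put("/tweet-api/max-batch-size", "500")

# ...and read it back; get() returns a (value, metadata) pair.
value, _meta = etcd.get("/tweet-api/max-batch-size")
print(int(value))  # 500

# Any instance can watch the key and reconfigure itself when it changes,
# which is what makes this a control plane rather than just a key-value store.
events, cancel = etcd.watch("/tweet-api/max-batch-size")
cancel()  # stop watching when done
```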
How else could we hit the same requirements?
It is instructive to consider some of the ways that the architecture we’re presenting here differs from our other options. For example, distributed applications (especially service mesh designs) will often use RESTful APIs to communicate between components. The issue is that if a RESTful transaction fails, it isn’t clear how to proceed without losing data. We might make it the sender’s responsibility and retry, but then we’ll need to consider implementing circuit breaker patterns - this becomes complicated quickly. If we use Kafka as a messaging solution, we make it the receiver’s responsibility and simply set a retention policy that will cover our maximum expected outage time.
REST specialises in synchronous unicast communication patterns, Kafka enables asynchronous multicast communication patterns.
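To see what that multicast buys us, here’s a sketch of the consuming side, again assuming a local broker, the kafka-python client and hypothetical consumer group names. Each consumer group keeps its own committed offsets, so two groups independently see every message on the topic, and a group that was down simply resumes where it left off - as long as the retention policy covers the outage:

```python
import json

from kafka import KafkaConsumer  # pip install kafka-python

def tweets_consumer(group: str) -> KafkaConsumer:
    # Each group_id has independently tracked offsets: 'tweet-indexer'
    # and 'tweet-archiver' would both receive every message (multicast).
    return KafkaConsumer(
        "tweets",
        bootstrap_servers="localhost:9092",
        group_id=group,
        auto_offset_reset="earliest",  # a new group starts from the oldest retained message
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

# In practice each of these would be a separate application or process.
indexer = tweets_consumer("tweet-indexer")

for message in indexer:  # blocks, polling the broker indefinitely
    print(message.offset, message.value["text"])
```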
It is nice to have all of our data management in a single place, rather than having different systems to manage transmission and storage. This allows us to centralise monitoring and configuration: permissions; metrics on reads and writes; latency and throughput; durability via replication factors; distribution via the number of partitions of the data; and retention via cleanup policies and retention times. The alternative is often to configure these individually, per application.
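Concretely, most of those knobs live on the topic itself. Here’s what that centralised configuration might look like using kafka-python’s admin client - the partition count, replication factor and retention period below are illustrative numbers, not recommendations:

```python
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# One declaration covers distribution (partitions), durability (replication)
# and retention (cleanup policy plus retention time) for the data.
admin.create_topics([
    NewTopic(
        name="tweets",
        num_partitions=6,        # how the data is distributed across brokers
        replication_factor=3,    # how many copies of each partition survive failures
        topic_configs={
            "cleanup.policy": "delete",
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),  # keep one week of data
        },
    )
])
```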
Throw all your components into Docker containers, deploy via Kubernetes, and you’ve probably delivered something they’ll call a Kappa Architecture deployed as microservices. I’m also happy to call it a service mesh on a persistent substrate; or otherwise, as directed by marketing.
The genericity of the solution is great because, to me, storage and messaging are the two most boring parts of an application. I’d much rather just implement a single messaging and storage substrate and focus on the interesting parts like the human factors (how do people use it) and the computation (what does it think it does). Naturally, this led me to develop skills in Kafka and, due to the exigencies of capitalism, I now spend quite a lot of time working on storage and APIs.