Artificial Unintelligence.

DS I - That's no app. It's a platform.

☕️☕️ 9 min read

A blog series on data science and engineering

Maybe you heard it was the sexiest discipline of the 21st century? I tried to warn you, but the introductory post didn’t scare you off?

Welcome to the first post in a series on data engineering, data science, and how to make these things actually work. We won’t be writing any real code in this edition (a couple of throwaway sketches at most); we’ll just be outlining the structure of what we’re going to build over the next few posts, and why. We’ll start by talking about this idea of a ‘platform’ and what that might entail, then outline what components we might want on ours.

We’ll then code it up (using Scala, Python, JS, whatever comes to hand really) over the following posts. I won’t expect familiarity with the nuances of every language; that’s part of the learning experience I’m aiming for. If I haven’t covered something sufficiently, get me on Twitter and let me know.

Now, most blogs like this would start off by telling you to download Python, install Jupyter… then we’d go through a variety of motions, culminating in the building of a decision tree in scikit-learn, at which point your salary would increase two-fold, your beard would become thick and lustrous (ladies, this applies to you too), and you would develop a passion for obscure areas of mathematics. Here, have some wow.

[Image: a man with a detachable beard. Caption: Knowledge transfer in data science.]

Sorry…

I’m looking to do things a bit more rigorously here: this blog is about doing data science and engineering in the real world, and how to solve the issues that arise. If you obtain the advantage of a beard from reading this blog, it will simply be because you haven’t left your home or showered in a week while you try to debug some mistake I inadvertently included. While I’m sure that you want to hear about the latest TensorFlow model (probably so you can go and use the pre-trained version amirite? 😏), there are good reasons to talk about the platform first.

We often end up building a platform without realising it: most of the code we write for a data science project actually has nothing to do with the specific task at hand.

An example application

For an initial example, let’s think about building a system which stores Tweets, while making them available (via an API) to other applications. This is the first bit of functionality we’ll build as a part of this series.

You: goes and downloads the Twitter app from the app store and shuts the browser tab.

Hang on… My domain is data science and engineering (unsubstantiated rumours suggest I write a blog about it), so let’s add three non-functional requirements (NFRs) to ensure I can contribute something at least slightly novel. Let’s demand that the system be scalable, near real time (which I think is implied by consuming a real-time source anyway, but some may disagree), and highly available.

So the requirement above sounds simple, but there are a few things that should tip us off to the fact that it isn’t. Firstly, we’ve outlined a need for horizontal scalability, which means we need to be able to add and remove instances of the application without interrupting service. Secondly, we’ve outlined HA as a requirement - this means we always need sufficient instances to serve requests, and in turn monitoring, triggers and autoscaling to figure out how many instances that is. Finally, we’ve asked for storage, which necessitates a measure of persistence. In other words, whatever we implement needs to be a horizontally autoscaling, highly available distributed system - not straightforward, no matter how good you are at installing Pandas and NumPy.

The need for a platform

The platform is basically going to give us two things, a generic data plane and a generic control plane, and will include the following (there’s a minimal sketch of what that split looks like from the application’s side just after the list):

  1. A Kafka cluster + Schema Registry to provide scaling and durability of written data, manage all persistent storage concerns, and enable rapid failovers.
  2. Etcd to do configuration management.
  3. Kubernetes to do general cluster things and manage scaling.
  4. Fluentd for monitoring.
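
Just so the ‘generic’ bit isn’t pure hand-waving, here’s a minimal sketch of what that split looks like from the application’s point of view. The environment variable names are placeholders of my own (the real manifests and configuration come later in the series): the control plane injects the environment-specific values at deploy time, and the application code stays blissfully generic.

```scala
// Minimal sketch: the application only asks "where is my data plane?";
// the platform (e.g. Kubernetes, with configuration held in etcd) is
// responsible for injecting the actual values at deploy time.
// The variable names below are placeholders, not final.
object PlatformConfig {
  val kafkaBootstrapServers: String =
    sys.env.getOrElse("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")

  val schemaRegistryUrl: String =
    sys.env.getOrElse("SCHEMA_REGISTRY_URL", "http://localhost:8081")
}
```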

Why are we doing it this way? Luckily, thanks to the joys of MDX and React.JS, this blog has sidebars to deal with long and slightly sarcastic digressions on matters such as this. wow.

Why are we doing it this way?

Rather than going into specifics on each component’s purpose, I could say that we’re implementing a Kappa Architecture via microservices. Because you’re a learned reader - or maybe just because you have access to Google on your phone - you’d probably understand that whatever we’re building thus addresses the requirements around scalability, availability, near-real-timeness and storage. Whether or not you understood these things and why they work, you’d probably ask no further questions - because I intoned the name of an ancient Greek letter.

But this blog isn’t about selling you a bridge, so there are a variety of reasons why I didn’t just say mumble mumble, Kappa Architecture, stakeholder value… enhanced… mumble; the key one being that I probably have one version of a Kappa Architecture Via Microservices in my head, and you have a different version in yours.

The issue with all of these “architectures” is that they don’t cover a sufficient set of application functionality to warrant the term - I’d think of them as design patterns. Having done the data science/engineering/big data/whatever thing for a while now, I’ve developed the (probably less than novel) opinion that there are basically only four things in the application world - data storage, messaging/APIs, human factors (which is everything from the front end through to enterprise culture, the project manager, the business analyst, or the project’s stakeholders), and computation. They all need to be covered for something to call itself an architecture, in my view. For example, a microservices pattern probably says that our microservices application code runs in a Docker container, but relatively little about what happens when that code needs to communicate. One default assumption is that this will happen via REST, but it isn’t an essential feature, and isn’t always a best practice.

Sometimes spending the additional verbiage on a real explanation of a design can save a tonne of effort down the track.

So that’s the platform. As for the app? That’s almost the easy part - we’ll use various Kafka libraries and twitter4s. There are a few others we’ll consider, but they very much sit on the utility side of things.

The Data Plane - about Kafka

Our data plane will rely mostly on Kafka, which often advertises itself as some sort of data hub type product, almost as an alternative to a database, data lake, or (at the other end) an ESB or messaging system. It can probably hit those requirements, but they aren’t really where the value lies. The easiest way to explain Kafka is to say that it offers an integrated data plane for distributed applications and allows them to persist, manage and share state. If another application wants to inspect that state, Kafka enables this - we can set the retention policies on the data for our use case and then write SQL against the data or interact with it in other ways. If we need bidirectional communication between the two applications, this is also covered, and we can set things up so that communication failures and temporary service unavailabilities on either side are recoverable.
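
To make the retention-policy point a bit more concrete, here’s a throwaway sketch (not this series’ actual code; the topic name, partition count, replication factor and retention value are placeholders I’ve picked for illustration) of creating a topic whose retention decides how long that state stays replayable, and then writing some state into it:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.config.TopicConfig

object TweetTopicSketch extends App {
  // Placeholder address; assumes a cluster with at least three brokers behind it.
  val bootstrap = "localhost:9092"

  // Create a topic whose retention policy (seven days here, purely as an example)
  // decides how long other applications can come back and read the state we wrote.
  val adminProps = new Properties()
  adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap)
  val admin = AdminClient.create(adminProps)
  val tweets = new NewTopic("tweets", 6, 3.toShort)
    .configs(Map(TopicConfig.RETENTION_MS_CONFIG -> (7L * 24 * 60 * 60 * 1000).toString).asJava)
  admin.createTopics(List(tweets).asJava).all().get() // throws if the topic already exists
  admin.close()

  // Any application can now persist and share state by writing to the topic.
  val producerProps = new Properties()
  producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap)
  producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](producerProps)
  producer.send(new ProducerRecord[String, String]("tweets", "tweet-id-1", """{"text":"hello"}"""))
  producer.close()
}
```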

Kafka offers guarantees around data consistency, durability, and availability, and allows us to scale and monitor applications almost infinitely and ‘for free’, in terms of the engineering effort required to add such features. NB: as with all things in the enterprise, in practice Kafka is often misconfigured and won’t provide any of these benefits.

In case it isn’t clear, being able to offload the concerns I’ve just mentioned is A Big Deal. Figuring out how to distribute messaging and storage is time consuming from an engineering perspective, adds zero perceived value to the user experience and enables zero functional requirements.

But you have to do it, because a 404 message doesn’t hit anyone’s requirements.

Moreover, the other two parts of the application (human factors and computation) don’t cost the same engineering effort. Hosting front ends in a scalable fashion is basically a solved problem (I mean, the internet works, right?), and if you were looking for advice on how to achieve consistency, durability and availability from the rest of your human factors elements - collectively, wetware, or non-silicon-based considerations - I suggest an organisational psychology blog might be more your speed, although I’ve personally given up on this. Conversely, computation is arguably so hard that even psychologists aren’t arrogant enough to think they can solve it (or they have yet to hear about the problem; it’s hard to say), so there’s no point in making it a part of the platform.

How else could we hit the same requirements?

It is instructive to consider some of the ways that the architecture we’re presenting here differs from our other options. For example, distributed applications (especially service mesh designs) will often use RESTful APIs to communicate between components. The issue is that if a RESTful transaction fails, it isn’t clear how to proceed without losing data. We might make it the sender’s responsibility and retry, but then we’ll need to consider implementing circuit breaker patterns - this becomes complicated quickly. If we use Kafka as a messaging solution, we make it the receiver’s responsibility and simply set a retention policy that will cover our maximum expected outage time.

REST specialises in synchronous unicast communication patterns; Kafka enables asynchronous multicast communication patterns.
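
Here’s a rough sketch of that receiver-takes-responsibility pattern (again, placeholder names rather than this series’ actual code): the consumer commits its offset only after it has processed a record, so if it crashes, or is simply switched off for a few hours, it picks up where it left off when it comes back - as long as the retention policy outlasts the outage.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

object TweetReaderSketch extends App {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "tweet-reader")
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringDeserializer")
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringDeserializer")
  // Commit offsets ourselves, and only once we've actually processed the records.
  props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
  // If this group has never committed an offset, start from the oldest retained
  // record rather than only seeing new ones.
  props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val consumer = new KafkaConsumer[String, String](props)
  consumer.subscribe(List("tweets").asJava)

  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach { record =>
      // "Process" the tweet. If we crash before the commit below, the broker
      // still has the record and we'll simply see it again on restart.
      println(s"${record.key}: ${record.value}")
    }
    consumer.commitSync() // commit only after successful processing
  }
}
```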

It is nice to have all of our data management in a single place, rather than having different systems to manage transmission and storage. This allows us to centralise monitoring and configuration: permissions; metrics on reads, writes, latency and throughput; durability via replication factors; distribution via the number of partitions of the data; and retention via cleanup policies and retention times. The alternative is often to configure these individually, per application.
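
For instance - and this is another throwaway sketch with placeholder names, not anything we’ll ship - asking the cluster what the ‘tweets’ topic’s retention and cleanup settings currently are is a single AdminClient call, rather than a spelunking expedition through every application’s config files:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._

import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.common.config.ConfigResource

object TopicConfigSketch extends App {
  val props = new Properties()
  props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
  val admin = AdminClient.create(props)

  // One place to read (and, via incrementalAlterConfigs, change) retention,
  // cleanup policy and friends for the topic all applications share.
  val resource = new ConfigResource(ConfigResource.Type.TOPIC, "tweets")
  val config = admin.describeConfigs(List(resource).asJava).all().get().get(resource)
  config.entries().asScala
    .filter(e => Set("retention.ms", "cleanup.policy").contains(e.name()))
    .foreach(e => println(s"${e.name()} = ${e.value()}"))

  admin.close()
}
```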

Throw all your components into Docker containers, deploy via Kubernetes, and you’ve probably delivered something they’ll call a Kappa Architecture deployed as microservices, but I’m also happy to call it a service mesh on a persistent substrate; or otherwise, as directed by marketing.

The genericity of the solution is great because, to me, storage and messaging are the two most boring parts of an application. I’d much rather just implement a single messaging and storage substrate and focus on the interesting parts like the human factors (how do people use it) and the computation (what does it think it does). Naturally, this led me to develop skills in Kafka and, due to the exigencies of capitalism, I now spend quite a lot of time working on storage and APIs.