DS II - Making a Mockery (of a Platform).
In the last post we discussed a simple requirement to store tweets and make them available to other applications. It quickly became clear that it wasn’t as simple as it looked - in fact, it was so complicated that it actually called for an entire platform to be built, replete with a full data plane and separate control plane.
Mocking up the platform side of things is the subject of this post, in which we’ll discuss getting a dev environment set up. Once you land in an organisation, the first thing you often want to do is replicate their production environment as closely as possible, ideally locally on your own machine. This allows you to test your application in an environment that mirrors the one it will ultimately run in.
“But my code is already unit tested!”
This is a common objection to this approach when you present it to traditional software engineers. But if those tests are like most I’ve seen, I suspect that they might say something very deep and subtle like assert MyClass==MyClass. This is roughly equivalent to the conversation at every bad first date you’ve ever been on. Your tests will agree with a set of statements which only a psychopathic piece of code would fail to agree with, and which tell you nothing about the inner workings of the subject under examination; much less its performance in your particular environment, how it will get along with the rest of the apps if invited to a party, whether it will try to monopolise your limited resources, or whether it knows how to clean up after itself.
Sorry, we were talking about software weren’t we…
Even if you’ve gone to the trouble of finding a proper test framework and using some sample data (because this is about data science, remember?) the scale or velocity of that data may be completely different in the real world - this is often the problem you are trying to solve. So unit testing doesn’t really cut the mustard, and actually the local testing approach we demonstrate here isn’t very thorough either, but it doesn’t require you to go and buy resources on GCP or AWS, so that’s a plus, and you can always do that later.
Kubernetes Concepts and setup
Kubernetes has a few entities you’ll need to be familiar with, notably nodes, pods, services, controllers and containers. The documentation on the main website is somewhat unsatisfactory when it comes to describing the design succinctly, and I advise you to consult the architecture doc for detail. Summarising: containers are just Docker containers (at least for our purposes); they run in pods on nodes, which are hosts; pods are exposed by services, which do network-related things like load balancing and DNS (including discovery); and controllers handle deployment and scaling. The pod is the interesting part of Kubernetes that differentiates it from Docker Compose in many ways.
Kubernetes pods
Applications are deployed in pods, and each instance of the application container should be in its own pod. Pods do not handle replication. Instead, think of them as a way of packaging the application for runtime/production. A production environment might mandate that all applications use a particular set of subsidiary apps/containers. Many of these might operate separately from the application itself, and this is called the Sidecar Pattern. Some of the common things included are proxies that inspect network traffic before forwarding it to the container it was intended for, but there are a variety of use cases around application monitoring, networking, and other stuff that ops people care about and we aren’t going to talk about.
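To make the Sidecar Pattern concrete, here is a minimal sketch of a two-container pod. The names, images and mount path are purely illustrative rather than anything from our platform; the point is simply that both containers live in one pod, share its network namespace, and can share volumes:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-sidecar          # illustrative name only
spec:
  containers:
    - name: app                   # the application container
      image: my-app:latest        # hypothetical image
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app # the app writes its logs here
    - name: log-shipper           # the sidecar, shipping logs somewhere else
      image: fluent/fluentd       # real image, though its configuration is omitted here
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app # the same files, visible to the sidecar
  volumes:
    - name: app-logs
      emptyDir: {}                # scratch volume shared by both containers
A service-mesh proxy works the same way; it just gets injected automatically rather than written out by hand.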
Side note: pods are the thing that enable service mesh frameworks like Istio to fly. Istio (for example) installs a hook into K8s which modifies its default pod deployment behaviour and injects additional containers. I may write a post about this later - one interesting implication is that we could consider taking REST traffic from our existing apps and proxying it to Kafka traffic to ameliorate the deficiencies REST suffers in persistence. There are also technologies such as Cilium, which bundles in additional security and might be worth evaluating for its Kafka interop. I only mention these to sketch some of the flexibility available via Kubernetes.
Controllers, services… and everything else
Examples of controllers include replica sets (which is what we’re using here), stateful sets, daemon sets, deployments and so on. These are all just different ways of getting some set of pods out onto some set of nodes in a cluster and keeping some number there under different conditions (the desired number being something that might change according to various scaling behaviours that we’ll discuss much later). The documentation is sufficient to explain what these do, so I will not cover them in depth beyond mentioning their existence.
As mentioned above, services do networking things, and in this series we are going to talk purely about ClusterIP services, which do not expose external IPs. If you have the privilege of having customers who might be interested in your app, you would be looking at load balancing, creating static public IP addresses, and other things that would require care and thought. You should not do any of these things unless you understand the security implications. One more time: do not do any of those things unless you understand the security implications.
If you want to add to the K8s API, you’ll be pleased to hear that its entities are customisable. Custom resources that don’t fit within the node/service/pod/container/controller paradigm are quite possible. This is the approach pursued by tools such as Kubeflow, which adds various machine learning tooling to K8s. This is quite useful, and worth keeping in mind if you plan to run some such service at scale.
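As a rough sketch of what a custom resource definition looks like - the group and kind below are invented purely for illustration, and newer clusters use apiextensions.k8s.io/v1 with a mandatory schema:
apiVersion: apiextensions.k8s.io/v1beta1   # older beta API group for CRDs
kind: CustomResourceDefinition
metadata:
  name: notebooks.example.com              # must be <plural>.<group>
spec:
  group: example.com                       # invented group, for illustration only
  version: v1
  scope: Namespaced
  names:
    plural: notebooks
    singular: notebook
    kind: Notebook
Once applied, kubectl can list and manage Notebook objects like any built-in kind, which is the trick tools like Kubeflow rely on.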
Defining Kubernetes entities
apiVersion: v1
kind: Service
metadata:
  name: kafka-service
  namespace: dev
  labels:
    app: kafka
    phase: dev
spec:
  selector:
    app: kafka
  ports:
    - port: 9092
      name: kafka
    - port: 8081
      name: schema-registry
    - port: 3030
      name: kafka-gui
    - port: 2181
      name: zookeeper
  clusterIP: None
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  namespace: dev
  name: kafka
  labels:
    app: kafka
    phase: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: landoop/fast-data-dev
          env:
            - name: CONNECT_HEAP
              value: "1G"
            - name: ADV_HOST
              value: "kafka-service.dev"
          resources:
            limits:
              memory: "3000Mi"
              cpu: "1"
As you can see to your right (or above for those poor unfortunates trying to read this long thin piece of text on mobile), Kubernetes stuff is defined in YAML (which stands - in the great recursive acronym tradition pioneered by the likes of GNU - for YAML Ain’t Markup Language).
Note that we’ve created:
- A kafka-service to handle ingress and egress to the Kafka cluster. It is a ClusterIP service, which means no access from outside the cluster, and we’ve gone with an approach which uses selectors to route traffic to the pod via K8s DNS (see the snippet just after this list).
- A replica set with a single replica.
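On the DNS point: any pod in the cluster can reach the brokers through the service name rather than a pod IP. The environment variable below is purely illustrative - whatever config your client actually reads is fine - but the address is the standard form K8s DNS provides:
env:
  - name: KAFKA_BOOTSTRAP_SERVERS                      # hypothetical variable name
    value: "kafka-service.dev.svc.cluster.local:9092"  # the short form kafka-service.dev also resolves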
YAML is an irritating language to define things in; it is highly sensitive to spaces, and has an awkward way of defining lists or arrays (basically a - followed by a space). Tabs are a no-go in YAML and will break everything in hard-to-detect ways. If that’s unclear and generally unpleasant to wrap your head around… then good, I’m pleased you’re getting a feel for the format.
While YAML looks fine visually, and the intention is quite clear, it is painful to type and prone to errors. If something does go wrong, look for the use of tabs where spaces were intended, or missing spaces after a - or a :. The only element that bears commenting on is the labels element - these are just a way to subset your various K8s entities for subsequent selection and manipulation. You can do things like applying services to pods based on labels, which is what is happening above.
You’ll also note the almost complete inability to abstract anything away here - if you want some common feature (e.g. a set of labels or something) across several services, you’re going to need to use an additional tool (Helm might help), write it out explicitly in each YAML file, or do something with kubectl to add it to your request (you can do this for namespaces).
Pay attention to which particular K8s API you’re referring to in the apiVersion, as this matters, and dictates which part of the K8s REST API your commands are directed at.
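For reference, these are the API groups already in play in this post - not a manifest you would apply, just the values and the kinds that live under them:
apiVersion: v1                   # core group - Service, Pod, Namespace
---
apiVersion: apps/v1              # apps group - ReplicaSet, Deployment, StatefulSet, DaemonSet
---
apiVersion: extensions/v1beta1   # older beta group - the PodSecurityPolicy further down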
I was originally going to discuss this in the next post, but decided that presenting broken K8s configs as correct is fairly cruel to anyone just trying to find a template for Kafka. Many systems (Kafka in this case) require an environment variable added to the Kubernetes manifest along the lines of ADV_HOST=kafka-service.dev. This lets Kafka know which address to advertise to clients - Confluent explain it better than I can here. If you hit connectivity issues with a system running in Docker containers, checking that the app in the container knows which address to advertise itself at is a good first step in resolving them.
Kubernetes cluster security
We should first take steps to make sure our new cluster is secure. Sadly the only secure computer is one that is turned off, at the bottom of the ocean. This being incompatible with, well… basically every other requirement we have - we will have to make do with the clusterSecurity.yaml file in the repo, which ensures that:
- No privileged-mode containers run.
- No containers can run as root.
- Containers can only access NFS volumes and persistent volume claims (for stateful set deployments). In other words, they shouldn’t be touching your local storage.
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: minikubesecurity
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - 'nfs'
    - 'persistentVolumeClaim'
This is important because one of the major issues people hit when first using Docker in anger is around the use of the root user in the containers. This is bad practice and often a material security risk in Docker (one of the few), so we want to ensure that if we’re doing it inadvertently in dev, we get an error immediately rather than encountering mysterious issues at deployment (where we hope that the cluster admin has disabled the capability, as we have!)
Apply your security policies using kubectl create -f clusterSecurity.yaml.
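One caveat I’ll flag rather than gloss over: a PodSecurityPolicy only actually bites if the PodSecurityPolicy admission controller is enabled on the cluster and something is authorised to use the policy. The sketch below - names invented, and your cluster may be configured differently - is the sort of RBAC that typically accompanies it, letting every service account in the dev namespace use our policy:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psp-minikubesecurity-user        # illustrative name
rules:
  - apiGroups: ['policy']
    resources: ['podsecuritypolicies']
    verbs: ['use']
    resourceNames: ['minikubesecurity']  # the policy defined above
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: psp-minikubesecurity-dev         # illustrative name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: psp-minikubesecurity-user
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: system:serviceaccounts:dev     # all service accounts in the dev namespace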
The final point to be made on securing this cluster is to take note that none of the services expose public IP addresses. To access (for example) the Landoop UIs, we’d run something like kubectl --namespace=dev port-forward service/kafka-service :3030, which tunnels the port to our local machine over our existing connection to the Kubernetes API server. This is desirable, because we have not configured security on these GUIs, and by keeping them internal to the cluster we can piggyback off kubectl’s authentication mechanisms and simplify our setup. Clearly, not appropriate for production usage.
Kubernetes Services
We now have a functioning cluster, so let’s get visibility of it using the Kubernetes Dashboard (we’ll do better than this for monitoring, but we’re still bootstrapping and need some visibility while we do so) by running minikube dashboard. On a real cluster we’d deploy pods and a service to handle this, but minikube makes things a bit simpler and lets us avoid creating service accounts and worrying about authentication and so on.
Without any further stuffing about, let’s get it hooked up.
- Create a dev user and namespace with kubectl apply -f dev-cluster.yaml (a sketch of what that file might contain follows this list).
- Run kubectl apply -f ${FILE} for each of the components we need (etcd, Fluentd, Kafka) for our dev platform and watch them appear on the dashboard.
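I won’t walk through dev-cluster.yaml line by line, but the gist of a namespace plus a user for it is roughly the following - treat it as a sketch, since the file in the repo may differ:
apiVersion: v1
kind: Namespace
metadata:
  name: dev
  labels:
    phase: dev
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dev-user          # illustrative name
  namespace: dev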
Done. There are many improvements that could be made to this setup, but for writing an application it works as a minimum viable platform. If we were deploying this in the real world, we would have a vastly expanded range of things to think about:
- How does our logging actually work? Fluentd needs somewhere to land its data, and Kafka isn’t a good choice (for reasons that we’ll cover later in the series). We could use Elasticsearch and Kibana (or something similar like Prometheus and Grafana) to visualise our logs from Fluentd.
- Authentication for all these front ends we’re creating, and then Authorization to determine what permissions people should have.
- Where are we going for level two support? If we can’t fix something ourselves who do we bring in and how much money do we have to pay them?
- What network connectivity do we need? Load balancing? CDNs?
- While we’re talking about load, what about autoscaling?
6 - 100. Everything else.