Welcome, please don't spill crumbs on the carpet.
I’ve given in. This is the result.
So people have urged me to do more self-promotion for years. They assure me this will wow employers and build my ‘personal brand’, so I’ve finally given in and started this dingy little monospaced beauty you see before you. Actually, the benefit from my end is mostly brushing up on my React.js navigation and especially CSS skills, which, as you can see, are a bit short on wow.
Oh yeah, this might also help some of you who are looking to learn about data engineering, data science, DevOps, machine learning and whatever buzzword they think up next. But obviously I can’t really guarantee anything there, and if a data centre catches fire because you took some of my advice, I will not be held liable. (But have you heard about how to parallelise mgcv models in R? Seriously, try dialling up 96 cores of non-linear modelling capacity, it’s sick.)
So why write a blog, genius?
Once I got started thinking about content, I realised that there aren’t many blogs out there getting into the nitty-gritty of this field. There’s plenty of vendor marketing content explaining how easy it is (thanks guys, you’ve convinced management that every two-month task can be achieved in a week, an expectation which means that the project usually fails), and there’s plenty of introductory material on how to download sklearn from GitHub.
But there’s a whole process that isn’t really talked about amidst the hype: building platforms to get the data to and from systems, putting metadata around them, sourcing that data in real time (yes, not every process should be run in batch) and updating models and predictions to suit a real-time system. Most glaringly, there isn’t much written about what to do when things inevitably go wrong: how to debug a distributed system, trace an error across a cluster, or generally think through how to tackle a problem. This is notable, because most of us only have jobs due to things going wrong - there’s a reason the machines haven’t taken over yet.
I think that these are worthwhile things to write about. One thing that hit me in the JS community is that most of the blogs and content focus on problems and on failed attempts to implement functionality. We don’t get this very often in the data science/engineering world. Now, if that were because the whole JS language is one big problem made bearable only via successive layers of questionably performant shims - cough, React - it wouldn’t be me who pointed that out…
So you should expect to see quite a volume of cynicism - hopefully delivered in variety and at high velocity. This blog is about the warts and all of distributed systems engineering and making them fly in the real world.
In my personal experience, reading about solving problems teaches something more useful than simply seeing the Happy Path reproduced in blog form - you may as well just git clone. And if this blog scares the bejeezus out of you, so much the better. If I can thin the herd, that means more work for me, right?! (Also, I make a great senior data scientist due to my supportive mentoring of junior staff - ask anyone.)