Your Data Science Problems are Engineering Problems
Why 87% of data science projects fail
Why should you care?
In 2011, Marc Andreessen famously wrote “Software is eating the world”—it’s true. In 2022, data science and machine learning are eating software, so you should care because it matters to any modern digital business.
If you work in fintech and want to use machine learning in your product, or you work at a startup and want to invest in an ML team, then read this post!
And if you’re an executive thinking about using DS & ML to drive impact in your business, this is especially for you.
But if you don’t care about DS & ML, don’t plan to use it, or are happy with how your teams are executing, then you can skip this and I’ll see you next time!
I started my career in 2012 as a Data Scientist at AIG, building machine learning models for insurance. That was rather novel at the time: data science and applied machine learning were mostly called predictive analytics, and the field wasn’t as ubiquitous as it has become in 2022.
Back then the technology to build models and deploy them was very different.
Models were almost exclusively run on some scheduled batch job
Code wasn’t version controlled and was often shared via email (lol)
Unit tests weren’t even an afterthought
Hardware was usually on-prem and probably was your computer
As a funny aside, I once took RAM out of a Director’s computer (and borked it in the process) to finish estimating a very large regression model (thanks Steve!)
It was the wild west—pure chaos—but we got things done.
As my career began to focus on the engineering around deploying data and machine learning pipelines I realized the significant operational flaws with what we were doing (thus giving me the hindsight to call them out above).
My experience at the companies I worked for after AIG and conversations with former colleagues and industry leaders showed me that these issues weren’t isolated.
We were lucky, though. As I said, we added value and got things done, but that was not true for the majority of the projects we did, and project failures were quite widespread among ML practitioners.
87% of Data Science Projects Fail
According to VentureBeat, 87% of data science projects never make it to production. Nearly every data scientist, data engineer, or machine learning engineer that I’ve spoken with has experienced a project failure.
When executives and startups think about data science and machine learning many underestimate the complexity because the process seems simple.
It’s reduced to: Data + Business Insights = Value…but that’s far from true.
A non-trivial difference.
Great Engineering enables Great Data Science
Garbage in, garbage out.
-The Gospel of Machine Learning
No amount of artificial intelligence and algorithms can fix broken data and systems—I promise you I have tried. But the investment and business decisions executives often make do not reflect this fact.
It’s often cited that 80% of a data scientist’s time is spent cleaning data, which reflects the complexity of the code and systems that surround building a model.
This makes sense when you think about it.
Typically, the end result of a machine learning engineer’s labor is a single table or a CSV file that they use to build a model, which I believe is what gives the illusion of simplicity[1].
This is why so many projects fail: every data source has its own design and complexity.
That means a lot of work has to be done (i.e., code and tests have to be written) to unify it into that precious clean data Gollum’s always after.
It is very likely that engineers built each system your data scientist is looking at. So if they are pulling from, for example, 7 data sources, they are likely trying to understand 7 different systems. Even a single system could have thousands of lines of code and business logic defining how it behaves and what it does.
Expecting someone or even a small group of people to reverse engineer a system reliably and quickly is irrational if they don’t have context or access to the code of the system.
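To make the "code and tests have to be written" point concrete, here is a minimal sketch of unifying just two sources into one clean table. Everything in it is an assumption made up for illustration: the billing and CRM systems, their field names, their date formats, and the `Customer` record. Real systems will differ, which is exactly the problem.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Customer:
    """The unified record we actually want to model on (hypothetical schema)."""
    customer_id: str
    signup_date: date
    country: str


def from_billing(row: dict) -> Customer:
    # Hypothetical billing system: integer ids, ISO dates, lowercase country codes.
    return Customer(
        customer_id=str(row["cust_id"]),
        signup_date=date.fromisoformat(row["created"]),
        country=row["country"].upper(),
    )


def from_crm(row: dict) -> Customer:
    # Hypothetical CRM: prefixed ids, US-style dates, full country names.
    month, day, year = (int(part) for part in row["SignupDate"].split("/"))
    country_codes = {"United States": "US", "Canada": "CA"}
    return Customer(
        customer_id=row["Id"].removeprefix("CRM-"),  # requires Python 3.9+
        signup_date=date(year, month, day),
        country=country_codes[row["Country"]],
    )


def unify(billing_rows, crm_rows) -> dict:
    """Merge both sources into one table keyed by customer_id.

    Raises ValueError when the two systems disagree about a customer,
    instead of silently picking a winner.
    """
    customers = [from_billing(r) for r in billing_rows]
    customers += [from_crm(r) for r in crm_rows]
    table = {}
    for customer in customers:
        existing = table.get(customer.customer_id)
        if existing is not None and existing != customer:
            raise ValueError(f"conflicting records for {customer.customer_id}")
        table[customer.customer_id] = customer
    return table
```

Each `from_*` function encodes knowledge about one system's quirks. With 7 sources you need 7 of these, plus the rules for what to do when they disagree, and none of that knowledge is visible in the final table.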
So what can you do about it?
1. Don’t hire data scientists if you have bad systems
Both parties will end up wildly disappointed so this will save everyone time and career regret. And while this sounds like a punchline, it’s mostly not.
I’ve had many colleagues and friends join companies and abruptly leave after finding out the state of things. Others stayed and felt like they wasted years of their career, neither accomplishing anything nor growing their skillset, because they had executive sponsorship but were set up for failure. I’ve seen companies shut down entire teams because executives were frustrated by the timelines or lack of impact. I call this situation the Spiral of Data Doom.
It’s a hard reality to accept but this is surprisingly common, so save everyone a lot of pain and don’t hire a team that can’t possibly be effective.
2. Audit the state of your software and systems
Before you go and hire people to build models and pipelines you need to find out what’s even feasible.
I’d recommend hiring someone to consult with you on your systems, preferably someone with a technical background. This is not an executive; it is someone who will understand your infrastructure and tell you whether you can realistically accomplish any of the projects you have in mind.
Your goal should be to understand how much it will cost and how long it will take to get your systems to where you want them. Double the estimates.
3. Hire engineers to fix your bad systems
If you have a great foundation this step is not necessary but for those with legacy infrastructure this needs to happen.
You will need to find talented people to help you, which is extremely hard. If you have lots of legacy software and struggle with migration issues, odds are it will be challenging to find (and retain) talent, and that will leave you in a vicious cycle of bad systems and bad execution.
I can assure you though that anything other than fixing your software will lead you to failure.
4. Start simple and with a small team of data analysts
Before you start building models or sophisticated pipelines, it’s much more valuable to understand your data.
Simple solutions to most of the things you want to do will take you quite far, so hold off on the advanced work until your infrastructure can realistically support it.
I’m always surprised at how many teams start with a bazooka instead of a hammer.
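As one illustration of the hammer, here is a small sketch that profiles a table before anyone reaches for a model. It uses only the Python standard library, and the sample data and column names are made up for the example.

```python
import csv
import io
from collections import Counter


def profile(rows):
    """Return per-column row count, missing count, distinct count, and top value."""
    stats = {}
    total = 0
    for row in rows:
        total += 1
        for column, value in row.items():
            if column is None:
                # csv.DictReader puts surplus fields under the key None; skip them.
                continue
            col = stats.setdefault(column, {"missing": 0, "values": Counter()})
            if value is None or value.strip() == "":
                col["missing"] += 1
            else:
                col["values"][value.strip()] += 1
    return {
        column: {
            "rows": total,
            "missing": col["missing"],
            "distinct": len(col["values"]),
            "most_common": col["values"].most_common(1),
        }
        for column, col in stats.items()
    }


# Made-up sample standing in for an export from a real system.
SAMPLE = """policy_id,state,premium
1,NY,100
2,,250
3,NY,
4,CA,100
"""

report = profile(csv.DictReader(io.StringIO(SAMPLE)))
for column, summary in report.items():
    print(column, summary)
```

Counts of missing and distinct values per column are often enough to show whether a data source is ready for anything more sophisticated.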
5. Scale your Data Science team proportional to your impact
Not everything is parallelizable and some things just take time, so it is actually beneficial to grow your team at a moderate pace.
Data work is unique in that you’re trying to understand and build systems that centralize data, so parallelization is unlikely to have the impact you’re looking for because there’s a higher yield from accumulated knowledge, which takes time. So as you see good outcomes, grow the team modestly.
Some Closing Thoughts
TL;DR: invest in your engineering before investing in machine learning.
If you have an existing team and think you’re heading to the Spiral of Data Doom™ there’s still hope for you.
I recommend pivoting your entire team to focus on fixing systems—two things will likely happen:
People will quit because they’re not interested in the work (FYI: it’s not fun)
If you have significant attrition, you may have to rebuild the team or find a new job. Either outcome is still better than the Spiral of Data Doom™, since you will waste fewer years of your life
You will have delayed impact, so communicate this to your stakeholders immediately
These two things will frustrate your team and business stakeholders but you will actually have a chance at success and over a longer time horizon everyone will be better off because you will have saved them years of disappointment.
In my experience, people working in this area want to have an impact and get excited to see their work out in the ether so these hard decisions will help everyone get there faster.
In this new digital era, machine learning and data science can add incredible value to your business, product, and customers but it won’t if the people doing the work can’t execute effectively.
Happy data mining. 🤠
[1] In general, I do not recommend treating an outcome as a measure of the effort behind it.