Data Science and Machine Learning Predictions
This is the time of year when everyone looks to the year ahead. Here are four things in data science and machine learning that are utterly and completely predictable in 2018.
Data Science Matures
In the Pleistocene Era of Data Science, there were Heroes and Hackers: lone souls working on ad hoc projects with Pig, Hive, Mahout, Java, and a few prayers. For asset management, organizations used thumb drives and email. Collaboration was a non-issue because there were few others, if any, to collaborate with.
In time, organizations hired more data scientists. Heroes and Hackers evolved into Data Science Guerrillas armed with laptops and notebooks. IT didn’t want anything to do with data science; it’s messy and complicated, so it was easier to simply pretend it didn’t exist. Responsible team leaders asked contributors to store assets on Git; some complied, some didn’t, but it hardly mattered because the Git library was a disorganized mess. For tooling, data scientists used a quodlibet of languages, packages, and notebooks, which made cross-checking and peer review problematic. Nobody could agree on a common set of tools, so collaboration was rare.
Today, data science has matured to the point that organizations expect a return on their investment. They want to see faster turnaround and more value. Nobody cares if you won Kaggle; we want to see a minimum viable data product while we’re young.
Smart organizations adopt a collaborative model of data science. The collaborative model recognizes that the data scientist is one member of a larger team that may include business analysts, data engineers, developers, machine learning engineers, DevOps specialists, compliance specialists, security professionals, and many others all pulling together to deliver a working application.
The rise of collaborative data science leads organizations to adopt open data science platforms that do the following:
- Provide a shared platform for all data science contributors
- Facilitate the use of open data science tools (such as Python and R) at scale
- Provide self-service access to data, storage, and compute
- Support a complete pipeline from data to deployment
- Include collaborative development tools
- Ensure asset management and reproducibility
There are now multiple offerings in the market from vendors including Amazon Web Services, Anaconda, Cloudera, DataScience, Domino, Google, IBM, and Microsoft. In 2017, venture capitalists funded several startups in the category, which suggests that there is strong growth potential.
In 2018, look for more organizations to adopt a collaborative model of data science, and invest in an open data science platform.
Automated Machine Learning Gets Real
Forget the hype. No, automated machine learning does not mean you can fire your data scientists. Automated machine learning makes your data scientists more productive.
Several months ago, a data scientist explained to me why it takes him weeks to build a predictive model. “I have to run a hundred experiments to find the best model,” he complained, as he showed me his Jupyter notebooks. “That takes time. Every experiment takes a lot of programming, because there are so many different parameters. We cross-check everything manually to make sure there are no mistakes.”
After listening to this for an hour, I was ready to kill myself.
Automated machine learning does not eliminate the hard parts of a data scientist’s job, such as listening to clients, understanding the business problem, and figuring out how to craft a solution. It automates the stupid parts of the job. Like repetitive programming. The kind of stuff researchers delegate to interns and new hires.
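To make "repetitive programming" concrete, here is a minimal sketch of the loop an automated learner takes off your hands: try every combination of settings, score each one on held-out data, and rank the results. Everything in it is an assumption for illustration: the toy data, the tiny nearest-neighbor model, the parameter grid, and the function names. It is not how any of the commercial or open source tools work internally; they search far larger spaces with smarter strategies.

```python
import itertools
import math
import random


def make_data(n, seed):
    """Toy 1-D regression data, y = sin(x) + noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.2)) for x in xs]


def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (abs(px - x) + 1e-9) for px, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def holdout_mse(train, valid, k, weighted):
    """Mean squared error on the validation split for one configuration."""
    errors = [(knn_predict(train, x, k, weighted) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)


def run_experiments(train, valid, grid):
    """Try every parameter combination; return (mse, params) sorted best-first."""
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((holdout_mse(train, valid, **params), params))
    return sorted(results, key=lambda r: r[0])


data = make_data(n=300, seed=42)
train, valid = data[:200], data[200:]
grid = {"k": [1, 3, 5, 10, 25, 50], "weighted": [False, True]}  # hypothetical grid
leaderboard = run_experiments(train, valid, grid)
best_mse, best_params = leaderboard[0]
print(f"ran {len(leaderboard)} experiments; best: {best_params} (MSE {best_mse:.4f})")
```

Twelve experiments, zero hand-written notebooks. Deciding whether the winning model answers the business question is still a job for a human.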
Think of it like this. We’ve had robotic heart surgery for 20 years, but you don’t see cardiac surgeons standing by freeway exits holding signs that say Will Work for Food. If I have a heart problem, I’m not calling Watson — I’m going to see Dr. Angina down at University Hospital.
It’s the same with data science. When the CEO needs answers to really important questions, she’s not calling Watson. She’s calling the CAO or the Chief Data Scientist or whatever. Someone with skin in the game. Because when real executives delegate a task, they delegate it to someone they trust.
Organizations that want to invest in automated machine learning have plenty of commercial and open source options. Amazon Web Services, DataRobot, Google, H2O.ai, IBM, and SAS all offer automated learners; some of these are much better than others (but I’d rather hold a detailed discussion of the differences for a later post). In the open source ecosystem, we have auto-sklearn, Auto Tune Models, Auto-Weka, machine-JS, and TPOT.
Prediction: in 2018 we’re going to see many more offerings, and more organizations will adopt the tools.
Data Scientists Discover GDPR Applies to Data Scientists
On May 25, 2018, the European Union’s General Data Protection Regulation (GDPR) takes effect. The reaction in the data science community will be something like this:
- February: nothing
- March: nothing
- April: WTF is GDPR?
- May: hair on fire
As I’ve written elsewhere, much of the commentary about GDPR misstates the likely impact on data science. There’s a lot of talk about the “right to an explanation,” which is actually a “right to human-in-the-loop decision-making.” But this provision applies to a narrow set of transactions, and affects front-office customer interactions more than data scientists.
GDPR’s greatest impact on data science practice is the obligation it imposes to avoid bias in predictive models used in decisions about consumers. In practice, this means that data science teams must survive an audit of their methods and procedures. Reproducibility and data lineage will be de rigueur.
That’s one more reason to put Heroes, Hackers, and Guerrillas behind you, and adopt a mature model of data science.
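What might "reproducibility and data lineage" look like at the keyboard? One possible sketch, under loud assumptions: fingerprint the training data, then store that fingerprint alongside the parameters, random seed, and code version for every model you build. GDPR does not prescribe this format or any other; the field names, the sample rows, and the `code_version` string below are hypothetical.

```python
import hashlib
import json


def fingerprint(obj):
    """SHA-256 of a canonical JSON encoding, so equal content gives equal hashes."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def lineage_record(dataset_rows, params, seed, code_version):
    """Bundle some of what a reviewer would need to re-run a training job."""
    return {
        "data_sha256": fingerprint(dataset_rows),
        "params": params,
        "seed": seed,
        "code_version": code_version,
    }


# Hypothetical training rows and settings, for illustration only.
rows = [{"age": 41, "income": 52000}, {"age": 29, "income": 61000}]
record = lineage_record(rows, {"model": "logit", "C": 1.0}, seed=7, code_version="abc1234")

# Hash the record itself so later tampering is detectable.
record["record_sha256"] = fingerprint(record)
print(json.dumps(record, indent=2))
```

If a single row changes, the data hash changes with it, which is the property an audit trail needs. A real platform would also track where the rows came from and who touched them.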
While GDPR sets out general principles, it leaves many details to the European Data Protection Board (EDPB). This secretariat will issue detailed guidance for controllers and processors – for example, on the data portability right, Data Protection Impact Assessments, certifications, and the role of Data Protection Officers. Like any regulator, the EDPB will issue guidance over time, and the rules may be complex. Thus, compliance won’t be a matter of learning a few principles once; it will be an ongoing effort to understand requirements as they evolve.
Meet the new boss, your GDPR Compliance Officer. She’s up on all the latest rulings, as well as legal requirements imposed by the separate states in which your organization operates. She’s going to engage in all of your data science projects, and she’ll tell you what you need to do to comply with the regulations. You’re going to do whatever she tells you to do, or your work will never see the light of day.
(*) Yeah, I know — it’s Natalia Poklonskaya. No hidden political message there, I just like the picture.
Cloud, Blah, Blah, Blah, Blah…
Cloud is neither a great platform for data science nor a good platform. It’s the only logical platform.
Think of it like this. It makes sense for organizations to invest in IT infrastructure for workloads that are persistent, predictable, and mission-critical. Everything else should go to the cloud.
If you live in Manhattan and want to visit Grandma in Shrewsbury twice a year, you don’t buy a Tesla unless you’re filthy rich. You rent a ZipCar, or take an Uber.
Are data science workloads persistent, predictable and/or mission-critical? If you answered “none of the above,” go to the head of the class. Data science projects are time-boxed and short-term. They require brief, massive bursts of computing power. And they are rarely mission-critical.
I’m tempted to “predict” that data science will move to the cloud in 2018. Except that data science moved to the cloud a long time ago. I don’t have statistics, but here are some anecdotes:
- 2010: Razorfish, the digital marketing agency, pulls the plug on its server and moves everything to AWS
- 2014: Data scientists at a leading US bank say they’ve moved 100% of model development to the cloud
- 2015: A leading strategy consultancy uses a Virtual Private Cloud for 100% of its data science workloads
Analytic service providers and consultants led the way into the cloud. As variable-cost organizations, they had a huge incentive to stop investing in IT infrastructure. And they had the skills to use the cloud back when it was hard.
It’s getting easier to use the cloud, so economic logic prevails.
Yes, there are some holdouts: organizations that prohibit use of cloud, or take a go-slow approach. But they are increasingly rare.
Predicting that data science will continue moving to the cloud is like predicting that the Mississippi River will continue flowing into the Gulf of Mexico.
IBM: Four More Quarters of Decline. Oh, Wait…
I was going to predict four more quarters of declining revenue for IBM. But then the company threw a monkey wrench into the works and reported increased sales in Q4. So, let’s offer a round of golf applause for the folks at Armonk.
But remember: the U.S.S. Arizona stopped sinking when it settled into the mud.
Does this mean IBM’s big investment in Watson is finally paying off? Well, no. Take a squint at the numbers. The big jump in revenue comes from the Systems business, where IBM reports a big jump in…wait for it…System Z boxes, aka mainframes. And, in the Cognitive Solutions segment, IBM says that security and transaction processing software drove the revenue increase. You know, stuff like CICS that runs on mainframes.
So, the handful of organizations that account for most of IBM’s revenue decided that it’s easier to upgrade some of their old boxes than it is to replace them wholesale with modern architecture.
Not that there’s anything wrong with that.
Why, you ask, does IBM include software for mainframe transaction processing in its “Cognitive Solutions” business unit? Good question. One theory: when IBM reorganized, the most important consideration was to make sure that each of CEO Ginni Rometty’s one-downs had a big enough fief to justify super-sized compensation. IBM had to throw the kitchen sink into “Cognitive Solutions” to make it a suitable prince-bishopric.
Which explains why “Cognitive Solutions” has a 4% growth rate. 4% isn’t a growth story. It’s a “we’re just keeping our heads above water” story. It’s tough to grow when your business is sandbagged with the dogs IBM has collected over the years. Yes, Virginia, there is still a Red Brick Warehouse.
Each quarter, IBM breathlessly announces “wins” for Watson. Scan through the 10-Qs, however, and you know what you don’t see? The words “Watson” and “revenue” in close conjunction. That’s because auditors actually care about such things as “revenue recognition” and “materiality” and keeping BS out of the financial statements. Lest they get sued. Wall Street pleads with IBM to show some results from Watson. So you figure that if IBM actually had material Watson revenue, you’d see it in the financials.
IBM reports revenue of ~$20 billion annually for the “Cognitive Solutions” business. But industry analyst IDC estimates IBM’s actual revenue from cognitive and AI software at about $160 million. Which means that the IBM cognitive story is one part reality and 125 parts window dressing.
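For anyone who wants to check the arithmetic, here is a quick sketch using only the two figures quoted above. Both are rounded estimates, so treat the result as back-of-envelope.

```python
reported_segment_revenue = 20e9  # "~$20 billion" for Cognitive Solutions, as quoted
estimated_ai_revenue = 160e6     # the IDC estimate, as quoted

ratio = reported_segment_revenue / estimated_ai_revenue
share = estimated_ai_revenue / reported_segment_revenue
print(f"ratio: {ratio:.0f} to 1; AI share of segment: {share:.1%}")
# prints "ratio: 125 to 1; AI share of segment: 0.8%"
```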
Keep that in mind the next time an IBM executive wants to talk to you about the power of cognitive computing.
No prediction. I just enjoy snarking at IBM.