Big Analytics Roundup (August 22, 2016)

MIT Technology Review reports that Chicago’s experiment in predictive policing isn’t working. Data scientists developed a list of a few hundred people likely to commit a shooting; police, however, ignored the predictions, primarily because nobody told them what to do with individuals on the list. The report illustrates a fundamental truth about data science: no amount of insight matters unless your organization has the will and the skill to do something with it.

In an exceptionally well-written article, Databricks’ Kavitha Mariappan surveys open source tools for data scientists.

Mary Jo Foley reports on Microsoft’s Open Mind Studio, an unannounced product that will bundle CNTK, other deep learning frameworks, open source computing frameworks and other bits on a heterogeneous computing framework.

Federico Castanedo explains scalable data science with R and gets it mostly right. With R, you have three options: scale up, scale out or use R as an abstraction layer. Scaling up means hosting R on a bigger machine; scaling out means distributing R on many computers, which only works for embarrassingly parallel operations. Your third option is to use an R interface to distributed platforms, such as SparkR, Teradata Aster R, Oracle R Enterprise and Microsoft R Server.
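To make "embarrassingly parallel" concrete, here is a minimal sketch in Python (not R, and not taken from Castanedo's article). It computes a mean by handing each worker its own chunk of the data and combining the partial results at the end. The helper names `chunked`, `partial_sum_and_count` and `parallel_mean` are hypothetical, and a local thread pool stands in for a cluster of machines. The point is only that no worker needs to talk to any other worker.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_and_count(chunk):
    """Work done by one worker: it sees only its own chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_workers=4):
    """Mean of a non-empty list, computed from independent per-chunk partials."""
    chunks = chunked(values, n_workers)
    # Assumption: a thread pool is a stand-in for separate machines.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_and_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    print(parallel_mean(list(range(1, 101))))
```

A mean splits cleanly because sums and counts combine trivially. Operations that need every worker to see the whole data set on each step, such as many iterative model fits, do not split this way, which is where the third option (an interface to a distributed platform) comes in.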

Gartner rates Talend as a leader in its 2016 Magic Quadrant for Data Integration Tools. Talend crows; Dave Ramel reports. You can pay Gartner $1,995 for your copy, get a free copy here or look at the picture below.

[Image: Gartner's 2016 Magic Quadrant for Data Integration Tools]

Rohit Jain summarizes the quest for database nirvana. Nirvana? Please. It’s *&^%$ software.

Summer Reading

— O’Reilly Media offers a free e-book on The Big Data Market.

— Ajit Jaokar, who is a Director at the Universidad Politécnica de Madrid’s AI for Smart Cities Lab, publishes a snippet from his book, Data Science for the Internet of Things.

— Speaking of books


— Brian Wang explains Wave Computing’s Dataflow Processing Unit, a chip that Wave claims offers 10X faster training and 100X faster inference than any existing system based on CPU accelerators.

— Libby Clark interviews IBM’s Diana Arroyo and Alek Slominski and gleans tips for serverless computing with Apache Mesos.

— On the Hortonworks blog, Sanjay Radia and Saumitra Buragohain explain the ins and outs of Hadoop in the cloud. Curiously, they don’t mention EMR.


— Gary Cokins summarizes arguments for and against the use of Big Data in Analytics.

— Kalev Leetaru wonders if corporate data centers are obsolete.

— Wayne Eckerson asks what can go wrong in self-service analytics.

— Cloudera’s Sean Anderson discusses the impact of Spark 2.0.

— Venture capitalist Tomasz Tunguz likes how ServiceNow measures customer churn.

— On the Hortonworks blog, Louise Matthews interviews a couple of HDP executives about HDP+MSFT. Looking at HDP’s latest financials, MSFT should just buy HDP and put it out of its misery.

— Ronald van Loon surveys machine learning from an executive perspective.

— On the BigML blog, an author who blogs as atakancetinsoy touts BigML for Google Sheets. I’ve been testing BigML lately and like what I see.

Open Source News

— ASF announces version 0.8.1 of Apache Gearpump (incubating). Gearpump is Yet Another Streaming Engine.

— SQL-on-HBase engine Apache Phoenix announces the availability of Release 4.8.0, with a few enhancements and lots of bug fixes.

— Rob High, one of 369 Chief Technology Officers at IBM, announces the availability of CognizeR, a package enabling R users to invoke IBM’s Cognitive-branded APIs. (h/t Oliver Vagner)
