Big Analytics Roundup (August 22, 2016)
MIT Technology Review reports that Chicago’s experiment in predictive policing isn’t working. Data scientists developed a list of a few hundred people likely to commit a shooting; police, however, ignored the predictions, primarily because nobody told them what to do with individuals on the list. The report illustrates a fundamental truth about data science: no amount of insight matters unless your organization has the will and the skill to do something with it.
In an exceptionally well-written article, Databricks’ Kavitha Mariappan surveys open source tools for data scientists.
Mary Jo Foley reports on Microsoft’s Open Mind Studio, an unannounced product that will bundle CNTK, other deep learning frameworks, open source computing frameworks and other bits on a heterogeneous computing framework.
Federico Castanedo explains scalable data science with R and gets it mostly right. With R, you have three options: scale up, scale out or use R as an abstraction layer. Scaling up means hosting R on a bigger machine; scaling out means distributing R on many computers, which only works for embarrassingly parallel operations. Your third option is to use an R interface to distributed platforms, such as SparkR, Teradata Aster R, Oracle R Enterprise and Microsoft R Server.
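To make "embarrassingly parallel" concrete, here is a minimal sketch, in Python rather than R and purely for illustration. It assumes a thread pool on one machine standing in for separate computers, and the function names (`chunked`, `partial_stats`, `parallel_mean`) are hypothetical, not taken from Castanedo's article. Each worker needs only its own slice of the data, so the work scales out cleanly; an operation that needs every worker to see all the data at once would not split this way.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(values, n_chunks):
    """Split values into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """Work done by one worker: it touches only its own chunk."""
    return len(chunk), sum(chunk)


def parallel_mean(values, n_workers=4):
    """Compute a mean by combining independent per-chunk results."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = chunked(values, n_workers)
    # Assumption: threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(partial_stats, chunks))
    count = sum(n for n, _ in results)
    total = sum(s for _, s in results)
    return total / count
```

Calling `parallel_mean(list(range(1, 101)))` splits the hundred values into four chunks, sums each one independently, and combines the partial results in a single cheap final step.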
Gartner rates Talend as a leader in its 2016 Magic Quadrant for Data Integration Tools. Talend crows; Dave Ramel reports. You can pay Gartner $1,995 for your copy, get a free copy here or look at the picture below.
Rohit Jain summarizes the quest for database nirvana. Nirvana? Please. It’s *&^%$ software.
— Brian Wang explains Wave Computing’s Dataflow Processing Unit, a chip that Wave claims offers 10X faster training and 100X faster inference than any existing system based on CPU accelerators.
— On Linux.com, Libby Clark interviews IBM’s Diana Arroyo and Alek Slominski and gleans tips for serverless computing with Apache Mesos.
— On the Hortonworks blog, Sanjay Radia and Saumitra Buragohain explain the ins and outs of Hadoop in the cloud. Curiously, they don’t mention EMR.
— Gary Cokins summarizes arguments for and against the use of Big Data in Analytics.
— Kalev Leetaru wonders if corporate data centers are obsolete.
— Wayne Eckerson asks what can go wrong in self-service analytics.
— Cloudera’s Sean Anderson discusses the impact of Spark 2.0.
— Venture capitalist Tomasz Tunguz likes how ServiceNow measures customer churn.
— On the Hortonworks blog, Louise Matthews interviews a couple of HDP executives about HDP+MSFT. Looking at HDP’s latest financials, MSFT should just buy HDP and put it out of its misery.
— Ronald van Loon surveys machine learning from an executive perspective.
— On the BigML blog, an author who blogs as atakancetinsoy touts BigML for Google Sheets. I’ve been testing BigML lately and like what I see.
Open Source News
— ASF announces version 0.8.1 of Apache Gearpump (incubating). Gearpump is Yet Another Streaming Engine.
— SQL-on-HBase engine Apache Phoenix announces the availability of Release 4.8.0, with a few enhancements and lots of bug fixes.