Big Analytics Roundup (April 11, 2016)
Top story of the week is NVIDIA’s new DGX-1 deep learning system; scroll down for more on that.
We have three roundups from Strata + Hadoop World, Rashomon style:
- Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
- Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
- Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.
— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.
— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. It appears that Context Relevant isn’t the next unicorn.
— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:
— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.
— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.
NVIDIA Unveils Deep Learning Chip
— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer in a box. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.
— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.
— Selected media reports:
- ExtremeTech: “Is there anything a computer can’t do?”
- Gizmodo: “Stupidly powerful.”
- Engadget: “Insane.”
- TechReport: “Holy mother of GPUs.”
— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.
— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.
— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.
— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.
— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.
— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.
— Katrin Leinweber et al. explain how to analyze an assay of bacteria-induced biofilm formation by the freshwater diatom Achnanthidium minutissimum with KNIME. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.
— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point-by-point assessment.
— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.
— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.
— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,” where the merchant is asked to request identification or speak with the call center.
— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.
— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.
— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.
— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.
— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.
Open Source Announcements
— Flink releases version 1.0.1, a maintenance release.
— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.
— Airbnb open sources Caravel, a data exploration package.
— Apache Tajo announces Release 0.11.2, which should please its user.
— LinkedIn releases Dr. Elephant to open source.
— Databricks announces the agenda for Spark Summit 2016 in SFO.
— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.
— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.