Big Analytics Roundup (August 15, 2016)
In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, it lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.
On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.
AWS Launches Kinesis Analytics
The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, which can and will add analytics to move up the value chain.
Intel Freaks Out
Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)
Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.
Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.
Cloud Computing Drivers
Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.
On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.
Machine Learning for Social Good
Microsoft offers a platform to predict scores in weather-interrupted cricket matches.
Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.
— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.
— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.
— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.
— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.
— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.
— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.
— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.
— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.
— Frances Perry explains what Apache Beam has accomplished in the first six months of incubation.
— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.
— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.
— Mellanox’s Gilad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches. (h/t Bob Muenchen)
— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.
— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.
Open Source News
— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.
— Apache Kafka announces Release 0.10.0.1, with bug fixes.
— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.
— Apache Storm delivers version 1.0.2, with bug fixes.
— AWS releases EMR 5.0, with Spark 2.0 and Hive 2.1, and with Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.
— Fractal Analytics partners with KNIME.