Happy Leap Day. Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week. The Alluxio announcement has inspired big thinkers to share big thoughts. And, we have a nice crop of explainers. Scroll down to the bottom for another SQL on Hadoop benchmark.
— In SearchDataManagement, Jack Vaughan explains Spark 2.0.
— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.
— MapR’s Jim Scott explains Spark accumulators. Jim also explains Spark Streaming.
— DataArtisans’ Fabian Hueske introduces Flink.
— In SlideShare, Julian Hyde explains streaming SQL.
— Wes McKinney explains why pandas users should be excited about Apache Arrow.
— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.
— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).
— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark.
— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.
— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop. She means Hadoop in a general way, i.e. the Hadoop ecosystem.
— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group, and asks him questions about data science. Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”
— Alpine Data Labs’ Steven Hillion ruminates on success. He’d be better off ruminating on “how to raise your next round of venture capital.”
— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.
— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder. He’s just trolling.
— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense. He further opines that Flink will displace Spark for streaming, which doesn’t make sense. Shore interviews IBM’s Nagui Halim about streaming here.
Open Source Announcements
— Alluxio (née Tachyon) announces Release 1.0.0. Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project. Yet. Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin. Daniel Gutierrez reports.
— Yahoo releases CaffeOnSpark, a distributed deep learning package. Caffe is one of the better-known deep learning packages, with a track record in image recognition. Software is available on Git. For more information, see the Wiki. Alex Handy reports; Charlie Osborne reports.
— RapidMiner China announces availability of an extension for deep learning engine DL4J. The extension is open source, and works with the open source version of RapidMiner. DL4J sponsor Skymind collaborated.
— Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.
— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.
Health analytics vendor Health Catalyst lands a $70M Series E round.
AtScale Benchmarks SQL-on-Hadoop Engines
On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6). The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC. For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.
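For readers who want to try the same style of test, the "run each query several times and average" approach can be sketched roughly as follows. This is only an illustration, not AtScale's harness: the function name `average_runtimes`, the `run_query` callable and the `timer` parameter are all hypothetical, and a real test would also need to control for caching and warm-up.

```python
import statistics
import time
from typing import Callable, Dict


def average_runtimes(
    run_query: Callable[[str], object],
    queries: Dict[str, str],
    repeats: int = 3,
    timer: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Run each query `repeats` times and return the mean elapsed time per query.

    `run_query` is a hypothetical callable that submits one SQL string to
    whichever engine is under test and blocks until the result is back.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    averages = {}
    for name, sql in queries.items():
        samples = []
        for _ in range(repeats):
            start = timer()
            run_query(sql)  # result is discarded; only the elapsed time matters
            samples.append(timer() - start)
        averages[name] = statistics.mean(samples)
    return averages
```

In practice, `run_query` would wrap the client for Hive, Impala or Spark SQL, and the same `queries` dictionary would be passed to each engine so the averages are comparable.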
While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet. Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well. AtScale did not respond to queries on this point.
- All three engines performed about the same on single-table queries, and on queries joining three small tables.
- Spark and Impala ran faster than Hive on queries joining three large tables.
- Spark ran faster than Impala on queries joining four or more tables.
The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.
In concurrency testing, Impala performed much better than Hive or Spark.
Details of the test are available in a white paper here (registration required).