Big Analytics Roundup (May 31, 2016)
Google’s TPU announcement on May 18 continues to reverberate in the tech press. In Forbes, HPC expert Karl Freund dissects Google’s announcement, suggesting that Google is indulging in a bit of hocus-pocus to promote its managed services. Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs. Read the whole thing; it’s an excellent article.
Meanwhile, there’s this:
Cray announces the launch of Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for resource management. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high-performance network interconnect. Cray will ship 16-, 32- or 48-node racks in the third quarter, larger configurations later in the year.
And, Cringely vivisects IBM’s analytics strategy. My take: IBM’s analytics strategy is slick at the PowerPoint level, lame at the product level. IBM has acquired a lot of assets and partially wired them together, but has developed very little, and much of what it has developed organically is junk. Anyone remember Intelligent Miner? I rest my case.
— On the Databricks blog, Sameer Agarwal, Davies Liu and Reynold Xin deliver a deep dive into Spark 2.0’s second-generation Tungsten execution engine. Tungsten analyzes a query and generates optimized bytecode at runtime, collapsing a chain of operators into a single fused function. The article includes performance comparisons to Spark 1.6 across the full range of TPC-DS queries.
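To see why runtime code generation matters, here is a minimal sketch in plain Python (illustrative only, not Spark's actual code): an interpreted plan pays a function call per operator per row, while a generated, fused loop does the same work in one compiled function.

```python
# Illustrative sketch of whole-stage code generation, the idea behind
# Spark 2.0's Tungsten engine. Not Spark source code.

def interpret(rows, ops):
    """Interpreted plan: one function call per operator per row.
    An op returns None to drop the row (a filter), or a value (a map)."""
    out = []
    for row in rows:
        for op in ops:
            row = op(row)
            if row is None:
                break
        else:
            out.append(row)
    return out

def codegen(predicate_src, project_src):
    """'Code generation': build the source of one fused loop and compile it,
    as Tungsten generates specialized bytecode for a whole query stage."""
    src = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for x in rows:\n"
        f"        if {predicate_src}:\n"
        f"            out.append({project_src})\n"
        "    return out\n"
    )
    namespace = {}
    exec(src, namespace)
    return namespace["fused"]

# Query: keep even values, then square them.
ops = [lambda x: x if x % 2 == 0 else None, lambda x: x * x]
fused = codegen("x % 2 == 0", "x * x")
assert interpret(range(10), ops) == fused(range(10)) == [0, 4, 16, 36, 64]
```

The fused version avoids the per-row virtual dispatch of the interpreted plan, which is where much of Tungsten's reported speedup comes from.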
— Via Adrian Colyer, a couple of Stanford profs explain optimal strategies for cloud provisioning. A must-read for anyone who uses the public cloud.
— Alex Robbins explains resilient distributed datasets (RDDs), a core concept in Spark.
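For readers new to the concept, here is a toy sketch of the RDD idea in plain Python (a hypothetical class, not Spark's API): transformations are lazy and only record lineage, an action replays that lineage, and a lost partition can be rebuilt the same way.

```python
# Toy illustration of the RDD concept: lazy transformations recorded as
# lineage, computed only when an action runs. Not Spark's actual API.

class ToyRDD:
    def __init__(self, source, lineage=()):
        self.source = source      # base data
        self.lineage = lineage    # recorded transformations, applied lazily

    def map(self, f):
        # No computation happens here; we only extend the lineage.
        return ToyRDD(self.source, self.lineage + (("map", f),))

    def filter(self, p):
        return ToyRDD(self.source, self.lineage + (("filter", p),))

    def collect(self):
        """Action: replay the lineage over the source data."""
        data = list(self.source)
        for kind, f in self.lineage:
            if kind == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

rdd = ToyRDD(range(6)).map(lambda x: x * 10).filter(lambda x: x > 20)
assert rdd.collect() == [30, 40, 50]
```

In real Spark the data is partitioned across a cluster, but the fault-tolerance story is the same: a failed partition is recomputed from its lineage rather than restored from a replica.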
— On the AWS Big Data blog, Ben Snively explains how to use Spark SQL for ETL.
— Deborah Siegel and Danny Lee explain Genome Variant Analysis with ADAM and Spark in a three-part series.
— Alex Woodie explains two health sciences research projects running on Spark.
— Fabian Hueske explains Flink’s roadmap for streaming SQL.
— Hannah Augur explains Big Data terminology.
— In a sponsored piece, an anonymous blogger interviews Mesosphere’s Keith Chambers, who explains DC/OS.
— Matt Asay asks if Concord.io could topple Spark from its big data throne. The answer is no. Concord.io is exclusively for streaming, which makes it Yet Another Streaming Engine. For the three people out there who aren’t happy with Spark’s half-second latency, Concord is one option among several to check out.
— IBM’s got a fever, and the prescription is Spark.
Open Source Announcements
— The Apache Software Foundation promotes two projects from the Incubator to top-level status.
— Twitter donates Yet Another Streaming Engine to open source.
— Apache Kafka announces release 0.10.0.0.
— Apache Kylin announces release 1.5.2, a major release with new features, improvements and 76 bug fixes.
— Amazon Web Services announces a 2X speedup for Redshift, so your data can die faster.
— Alpine Data Labs announces Chorus 6. If Alpine’s branding confuses you, you’re not alone. Chorus is a collaboration framework originally developed by Greenplum back when Alpine and Greenplum were part of one happy family. Then EMC acquired Greenplum but not Alpine and spun off Greenplum to Pivotal; Dell acquired EMC; and Pivotal open sourced Greenplum. Meanwhile Alpine merged Chorus with its Alpine machine learning workbench and rebranded the whole thing as Chorus.