Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughan explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators.   Jim also explains Spark Streaming.

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark.

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group, and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0.  Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project.  Yet.  Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin.  Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package.  Caffe is one of the better-known deep learning packages, with a track record in image recognition.  Software is available on Git.  For more information, see the Wiki.  Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

— Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

— Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6).  The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC.  For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.
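The run-several-times-and-average approach described above can be sketched in a few lines of Python.  This is only an illustration of the general method, not AtScale's actual harness: the function names (`average_runtime`, `benchmark`), the `run_query` callables and the default of three repeats are all assumptions made for the example.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical type: anything that accepts a SQL string and executes it.
QueryRunner = Callable[[str], object]


def average_runtime(run_query: QueryRunner, query: str, repeats: int = 3,
                    clock: Callable[[], float] = time.perf_counter) -> float:
    """Run one query `repeats` times and return the mean elapsed seconds."""
    timings = []
    for _ in range(repeats):
        start = clock()
        run_query(query)              # result is discarded; only time matters
        timings.append(clock() - start)
    return mean(timings)


def benchmark(engines: Dict[str, QueryRunner], queries: List[str],
              repeats: int = 3,
              clock: Callable[[], float] = time.perf_counter
              ) -> Dict[str, Dict[str, float]]:
    """Return {engine name: {query: mean seconds}} for every engine/query pair."""
    return {
        name: {q: average_runtime(run, q, repeats, clock) for q in queries}
        for name, run in engines.items()
    }


if __name__ == "__main__":
    # Stand-in engines that just sleep; a real test would submit SQL instead.
    engines = {
        "engine_a": lambda q: time.sleep(0.01),
        "engine_b": lambda q: time.sleep(0.02),
    }
    for engine, results in benchmark(engines, ["q1", "q2"]).items():
        for query, seconds in results.items():
            print(f"{engine} {query}: {seconds:.3f}s")
```

Averaging smooths out run-to-run noise, but it also blends cold-cache and warm-cache runs, so it is worth knowing whether a published average includes the first run.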

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test are available in a white paper here (registration required).

Smart Money: More Funding for Analytics

Funding for analytic ventures remained robust in January, with 17 significant funding transactions and three acquisitions.   Key themes:

  • Outcomes-based medicine and health care
  • Vertical solutions for the energy industry
  • Solutions for risk management
  • Mobile analytics, including location-based targeting and app metrics
  • Social media sentiment analysis
  • Graph engines (and solutions based on graph engines)
  • In-memory SQL engines

All funding news via Crunchbase.

Funding

Health Catalyst led the way with $41 million in Series C funding.   Health Catalyst offers a solution stack consisting of a proprietary data warehouse optimized for electronic medical records, plus analytic applications designed to support outcomes-based health care.

Other transactions greater than $1 million include:

— MemSQL, provider of a high performance in-memory distributed database, raised $35 million in a Series B round.

— Still in stealth mode, marketing analytics provider OrigamiLogic closed on $15 million in Series B funding.

— Kreditech scored $15 million in debt financing.  Kreditech uses machine learning and Big Data to offer credit scoring for microlending.

— Radius closed on $13 million in Series B funding.  Radius supports B2B targeted marketing and lead generation for small businesses.

— Smart grid analytics provider AutoGrid landed $12.8 million in Series C funding.

— GNS Healthcare leverages Bayesian Networks and Monte Carlo Simulation to deliver solutions for outcomes-based medicine to hospitals, health insurance plans, pharmaceutical companies and other entities in the health care delivery chain.  GNS completed $10 million in Series B financing.

— Simple Energy raised $6 million in Series B funding.  Simple Energy offers utilities services to improve customer interactions through microtargeting and social gaming.

— Binary Fountain, provider of software integrating social sentiment analysis with BPM, raised $5.7 million.

— 4C Insights integrates social media sentiment analysis with public data to support media planning and targeting.   The firm raised $5 million in Series B funding.

— Kontagent secured $4.8 million in venture funding.  Kontagent offers mobile analytic solutions to mobile app developers and marketers.

— Offshore analytic services provider Axtria received $4.8 million in venture funding.

— Enigma Technologies raised $4.5 million in Series A funding.  Enigma provides a platform for the analysis of public data that includes a repository and directory to sources, plus tools for search, export and simple analytics.

— Lumiata raised $4 million in Series A funding.  Lumiata leverages graph engine technology to deliver evidence-based predictions to medical practitioners.

— BI vendor Chartio received $2.2 million in venture funding.

— Bottlenose, purveyor of dashboard and insight tools for social sentiment analysis, raised $1.1 million in debt financing.

— Geofeedia, a provider of open source location-based social media mining tools, received $1.25 million in Series A funding.

Acquisitions

There were three acquisitions of note; purchase prices were not disclosed.

— YP, the corporate successor to AT&T Interactive and AT&T Advertising Solutions, acquired Sense Networks on January 6.   Sense Networks uses predictive analytics to drive location-based behavioral targeting for mobile ad platforms.

— Pinterest acquired VisualGraph on January 6.  VisualGraph, a two-man operation, has developed a distributed in-memory visual search engine.

— Apigee, an API management company, acquired InsightsOne on January 8.   InsightsOne offers cloud-based infrastructure for predictive analytics based on Hadoop, plus an in-memory graph engine.