Big Analytics Roundup (August 1, 2016)

There are two big stories this week: Apache Spark 2.0 and Apache Mesos 1.0. There’s also a new release from Kylin, and a nice crop of explainers.

IEEE Spectrum publishes its third annual ranking of top programming languages, based on twelve metrics drawn from Google Search, Google Trends, Twitter, GitHub, Stack Overflow, Reddit, Hacker News, CareerBuilder, Dice, and the IEEE Xplore Digital Library. Among analytic languages, Python ranks third; R ranks fifth; Matlab, fourteenth; Scala, fifteenth; Julia thirty-third. SAS ranks thirty-ninth, good enough to qualify at the tail end of a NASCAR race.

Spark 2.0 General Availability

The Spark team announces general availability for Spark 2.0. My full report here.  Key new bits:

  • Improved memory management and performance.
  • Unified DataFrames and Datasets APIs.
  • SQL 2003 support.
  • Pipeline persistence for machine learning.
  • Structured Streaming, a declarative streaming API (in experimental release.)

Databricks immediately announces support for the release.

Matei Zaharia explains continuous applications, noting that real-world use cases combine streaming and static data. For example, real-time fraud detection applications leverage information about the individual transaction together with information about the customer, the merchant and the item purchased.

Matei, Tathagata Das, Michael Armbrust and Reynold Xin explain Structured Streaming.

More stories herehereherehereherehereherehere, and here.

Apache Mesos Release 1.0

The Apache Mesos team announces the availability of Mesos 1.0.

— Maria Deutscher reports.

— Timothy Prickett Morgan details Mesos vs. Kubernetes.

— Serdar Yegualp notes that Mesos is not a clone of Kubernetes, which is certainly true.

— Gabriela Motroc says Mesos 1.0 is full of surprises, which sounds ominous.

Explainers

— Kaggle Grandmaster Abhishek Thakur details best practices for predictive modeling.

— H2O.ai’s Arno Candel explains new developments in H2O.

— Kypriani Sinaris interviews Databricks’ Xiangrui Meng, who explains Spark MLlib.

— TIBCO’s Hayden Schultz explains TIBCO’s Accelerator for Apache Spark.

— Bob Grossman of the University of Chicago and the Open Data Group explains best practices for predictive model deployment.

— Allstate’s Rob Nendorf explains DevOps for Data Science.

Perspectives

— Doug Henschen blogs on Workday’s plans for Platfora.

— Andrew Psaltis argues for a unified stream processing model, touts Apache Beam.

— Martin Heller reviews Google Cloud Machine Learning and likes what he sees.

— Janakiram MSV touts Microsoft’s machine learning initiatives.

Open Source News

— Apache Kylin announces release 1.5.3, with bug fixes, improvements, and a few new features.

Commercial Announcements

— MapR announces a third place ranking in a Gartner report. Ask yourself this: who came in third at Daytona?

Spark 2.0 Released

The Apache Spark team announces the production release of Spark 2.0.0.  Release notes are here. Read below for details of the new features, together with explanations culled from Spark Summit and elsewhere.

Measured by the number of contributors, Apache Spark remains the most active open source project in the Big Data ecosystem.

The Spark team guarantees API stability for all production releases in the Spark 2.X line.

Highlights

Spark Summit: Matei Zaharia summarizes highlights of the release. Slides here.

— Webinar: Reynold Xin and Jules S. Damji introduce you to Spark 2.0.

— Reynold Xin explains technical details of Spark 2.0.

SQL Processing

Key Changes

New and updated APIs:

  • In Scala and Java, the DataFrame and DataSet APIs are unified.
  • In Python and R, DataFrame is the main programming interface (due to lack of type safety).
  • For the DataFrame API, SparkSession replaces SQLContext and HiveContext.
  • Enhancements to the Accumulator and Aggregator APIs.

Spark 2.0 supports SQL2003, and runs all 99 TPC-DS queries:

  • Native SQL parser supports ANSI SQL and HiveQL.
  • Native DDL command implementations.
  • Subquery support.
  • View canonicalization support.

Additional new features:

  • Native CSV support
  • Off-heap memory management for caching and runtime.
  • Hive-style bucketing.
  • Approximate summary statistics.

Performance enhancements:

  • Speedups of 2X-10X for common SQL and DataFrame operators.
  • Improved performance with Parquet and ORC.
  • Improvements to Catalyst query optimizer for common workloads.
  • Improved performance for window functions.
  • Automatic file coalescing for native data sources.

Explainers

Spark Summit: Andrew Or explains memory management in Spark 2.0+. Slides here.

Spark Summit: Databrick’s Michael Armbrust explains structured analysis in Spark: DataFrames, Datasets, and Streaming. Slides here.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— On KDnuggets, Paige Roberts explains Project Tungsten.

 Sameer Agarwal, Davies Liu, and Reynold Xin dive deeply into Spark 2.0’s second generation Tungsten engine. This paper inspired Tungsten’s design.

Spark Summit: Yin Huai dives deeply into Catalyst, the Spark optimizer. Slides here.

— On the Databricks blog, Davies Liu and Herman van Hövell explain SQL subqueries in Spark 2.0.

Spark Summit: AMPLab’s Ankur Dave explains GraphFrames for graph queries in Spark SQL. Slides here.

Spark Streaming

Key Changes

Spark 2.0 includes an experimental release of Structured Streaming.

Explainers

Spark Summit: Tathagata Das explains Structured Streaming. Slides here.

— In an O’Reilly podcast, Ben Lorica asks Michael Armbrust about Structured Streaming.

— In InfoWorld, Ian Pointer explains Structured Streaming’s significance.

Machine Learning

Key Changes

The DataFrame-based API (previously named Spark ML) is now the primary API for machine learning in Spark; the RDD-based API remains in maintenance.

ML persistence is a key new feature, enabling the user to save and load ML models and pipelines in Scala, Java, Python, and R.

Additional techniques supported vary by API:

  • DataFrames-based API: Bisecting k-means clustering, Gaussian Mixture Model (GMM), MaxAbsScaler feature transformer.
  • PySpark: LDA, GMM, Generalized linear regression
  • SparkR: Naïve Bayes, k-means clustering, and survival regression, plus new families and link functions for GLM.

Explainers

Spark Summit: Joseph Bradley previews machine learning in Spark 2.0. Slides here.

— On the Databricks blog, Joseph Bradley explains model persistence in Spark 2.0.

— Tim Hunter, Hossein Falaki, and Joseph Bradley explain approximate algorithms.

SparkR

Key Changes

SparkR now includes three user-defined functions: dapply, gapply and lapply. The first two support partition-based functions, the latter supports hyper-parameter tuning.

As noted above, the SparkR API supports additional machine learning techniques and pipeline persistence. The API also supports more DataFrame functionality, including SparkSession, window functions, plus read/write support for JDBC and CSV.

Explainers

Spark Summit: Xiangrui Meng explains the latest developments in SparkR. Slides here.

— Live webinar: Hossein Falaki and Denny Lee demonstrate exploratory analysis with Spark and R.

— UseR 2016: Hossein Falaki and Shivaram Venkataraman deliver a tutorial on SparkR.

Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. Conference is November 14-16; CFP closes September 9

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

Screen Shot 2016-07-18 at 8.26.11 AM

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But the current spate of streaming engines will not be resolved by performance tests. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

GraphLab Dato Turi announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Big Analytics Roundup (June 20, 2016)

Light news this week — everyone is catching up from Spark Summit, it seems. We have a nice crop of explainers, and some thoughts on IBM’s “Data Science Experience” announcement.

On his personal blog, Michael Malak recaps the Spark Summit.

Teradata releases a Spark connector for Aster, so Teradata is ready for 2014.

On KDnuggets, Gregory Piatetsky publishes a follow-up to results of his software poll, this time analyzing which tools tend to be used together.

In Datanami, Alex Woodie asks if Spark is overhyped, quoting extensively from some old guy. Woodie notes that it’s difficult to track the number of commercial vendors who have incorporated Spark into their products. Actually, it isn’t:

Screen Shot 2016-06-20 at 12.24.07 PM

And yes, there are a few holdouts in the lower left quadrants.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

IBM Data Science Experience

Unless you attended the recent Spark Summit with a bag over your head, you’re aware that IBM announced something. An IBM executive wants to know if I heard the announcement.  The answer is yes, I saw the press release and the planted stories, but IBM’s announcements are — shall we say — aspirational: IBM is announcing a concept. The service isn’t in limited release, and IBM has not revealed a date when the service will be available.

Screen Shot 2016-06-20 at 11.17.54 AM

It’s hard to evaluate a service that IBM hasn’t defined. Media reports and the press release are inconsistent — all stories mention Spark, Jupyter, RStudio and R; some stories mention H2O, others mention Cplex and other products. Insiders at IBM are in the dark about what components will be included in the first release.

Evaluating the release conceptually:

  • IBM already offers a managed service for Spark, it’s less flexible than Databricks or Qubole, and not as rich as Altiscale or Domino Data.
  • Unlike Qubole and Databricks, IBM plans to use Jupyter notebooks and RStudio rather than creating an integrated development environment of its own.
  • R and RStudio in the cloud are already available in AWS, Azure and Domino. If IBM plans to use a vanilla R distribution, it will be less capable than Microsoft’s enhanced R distribution available in Azure.
  • A managed service for H2O is a good thing, if it happens. There is no formal partnership between IBM and H2O.ai, and insiders at H2O seem surprised by IBM’s announcement. Of course, it’s already possible to implement H2O in any IaaS cloud environment, and H2O has users on AWS, Azure and Google Cloud platforms already.

Bottom line: IBM’s “Data Science Experience” is a marketing wrapper around an existing service, with the possibility of adding new services that may or may not be as good as offerings already in the marketplace. We’ll take another look when IBM actually releases something.

Explainers

— Davies Liu and Herman van Hovell explain SQL subqueries in Spark 2.0.

— On the MapR blog, Ellen Friedman explains SQL queries on mixed schema data with Apache Drill.

— Bill Chambers publishes the first of three parts on writing Spark applications in Databricks.

— In TechRepublic, Hope Reese explains machine learning to smart people. For everyone else, there’s this.

— Carla Schroder explains how Verizon Labs built a 600-node bare metal Mesos cluster in two weeks.

— On YouTube, H2O.ai’s Arno Candel demonstrates TensorFlow deep learning on an H2O cluster.

— Jessica Davis compiles a listicle of Tech Giants who embrace open source.

— Microsoft’s Dmitry Pechyoni reports results from an analysis of 600 million taxi rides using Microsoft R Server on a single instance of the Data Science Virtual Machine in Azure.

Perspectives

— InformationWeek’s Jessica Davis wonders if Microsoft will keep LinkedIn’s commitment to open source. LinkedIn’s donations to open source have less to do with its “commitment”, and more to do with its understanding that software is not its core business.

— Arthur Cole wonders if open source software will come to rule the enterprise data center as a matter of course. The answer is: it’s already happening.

Open Source Announcements

— Apache Beam (incubating) announces version 0.1.0. Key bits: SDK for Java and runners for Apache Flink, Apache Spark and Google Cloud Dataflow.

— Apache Mahout announces version 0.12.2, a maintenance release.

— Apache SystemML (incubating) announces release 0.10.0.

Commercial Announcements

— Altiscale announces the Real-Time Edition of Altiscale Insight Cloud, which includes Apache HBase and Spark Streaming.

— Databricks announces availability of its managed Spark service on AWS GovCloud (US).

— Qubole announces QDS HBase-as-a-Service on AWS.

Big Analytics Roundup (May 31, 2016)

Google’s TPU announcement on May 18 continues to reverberate in the tech press. In Forbes, HPC expert Karl Freund dissects Google’s announcement, suggesting that Google is indulging in a bit of hocus-pocus to promote its managed services.  Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs. Read the whole thing, it’s an excellent article.

Meanwhile, there’s this:

Cray-Urika-GX-system-software

Cray announces launch of Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high performance network interconnect. Cray will ship 16, 32 or 48 nodes in a rack in the third quarter, larger configurations later in the year.

And, Cringely vivisects IBM’s analytics strategy. My take: IBM’s analytics strategy is slick at the PowerPoint level, lame at the product level. IBM has acquired a lot of assets and partially wired them together, but has developed very little, and much of what it has developed organically is junk. Anyone remember Intelligent Miner? I rest my case.

Top Reads

— On the Databricks blog, Sameer Agarwal, Davies Liu and Reynold Xin deliver a deep dive into Spark 2.0’s second-generation Tungsten execution engine. Tungsten analyzes a query and generates optimized bytecode at runtime. The article includes performance comparisons to Spark 1.6 across the full range of TPC-DS.

— Via Adrian Colyer, a couple of Stanford profs explain optimal strategies for cloud provisioning. Must read for anyone who uses public cloud.

Explainers

— Alex Robbins explains resilient distributed datasets (RDDs), a core concept in Spark.

— On the AWS Big Data blog, Ben Snively explains how to use Spark SQL for ETL.

— Deborah Siegel and Danny Lee explain Genome Variant Analysis with ADAM and Spark in a three-part series.

— Alex Woodie explains two health sciences research projects running on Spark.

— Fabian Hueske explains Flink’s roadmap for streaming SQL.

— Hannah Augur explains Big Data terminology.

— In a sponsored piece, an anonymous blogger interviews Mesosphere’s Keith Chambers, who explains DC/OS.

Perspectives

— Matt Asay asks if Concord.io could topple Spark from its big data throne. The answer is no. Concord.io is exclusively for streaming, which makes it Yet Another Streaming Engine. For the three people out there who aren’t happy with Spark’s half-second latency, Concord is one option among several to check out.

— Accel Partners’ Jake Flomenberg trolls the Open Adoption concept, WSJ’s Steve Rosenbush swallows it hook, line and sinker. Someone should pay Andrew Lampitt a royalty.

— IBM’s got a fever, and the prescription is Spark.

Open Source Announcements

— The Apache Software Foundation promotes two projects from incubator status:

  • Apache TinkerPop, a graph computing framework.
  • Apache Zeppelin, a data science notebook.

— Twitter donates Yet Another Streaming Engine to open source.

— Apache Kafka announces Release 0.10.0.0

— Apache Kylin announces release 1.5.2, a major release with new features, improvements and 76 bug fixes.

Commercial Announcements

— Confluent announces Confluent Platform 3.0 (concurrently with Kafka 0.10). Andrew Brust reports.

— Amazon Web Services announces a 2X speedup for Redshift, so your data can die faster.

— Alpine Data Labs announces Chorus 6. If Alpine’s branding confuses you, you’re not alone. Chorus is a collaboration framework originally developed by Greenplum back when Alpine and Greenplum were part of one happy family. Then EMC acquired Greenplum but not Alpine and spun off Greenplum to Pivotal; Dell acquired EMC; and Pivotal open sourced Greenplum. Meanwhile Alpine merged Chorus with its Alpine machine learning workbench and rebranded the whole thing as Chorus.

Big Analytics Roundup (May 16, 2016)

This week we have more insight into Spark 2.0, scheduled for release just before Spark Summit 2016. (Yes, I’m going.) Also, kudos to BI-on-Hadoop startup AtScale for a new round of funding; Amazon releases YADLF (Yet Another Deep Learning Framework); and there are a number of new faces at H2O.ai.

Plus, we have an extended review of the Palantir story.

Buzzfeed on Palantir

Last week, I deemed Buzzfeed’s story on Palantir too dumb to link. (“Forget it, Jake. It’s Buzzfeed.”) Buzzfeed “news” reporter William Alden, who was all over a story about maggots in Facebook lunches, breathlessly mines a cache of “secret internal documents” and discovers:

  • Palantir expects employee turnover of around 20% for 2016.
  • Palantir lost some clients.
  • Palantir books more work than it bills.

Does Palantir have an employee turnover problem?  No. A 20% turnover rate is slightly above the 17% reported for all industries in 2015, and about on track for Silicon Valley. (There are companies in SV with 100% turnover rates.) On Glassdoor, employees give Palantir high marks.

Does Palantir have a client retention problem? Not exactly. The story cites four clients — American Express, Coca-Cola, Kimberley-Clark and Nasdaq — who engaged Palantir to conduct a pilot, then decided not to proceed with a long-term contract. In other words, lost sales and not cancelled contracts. The document Buzzfeed obtained is Palantir’s won/lost analysis, which shows that the company is attempting to learn from its lost sales.

Does Palantir have a revenue problem? No. Palantir’s 2015 revenue was up 50% from the previous year. Buzzfeed obsesses over the difference between Palantir’s bookings of $1.7 billion and its revenue of $420 million. A high book-to-bill ratio  is typical for consultancies that pursue large multi-year projects; it is a sign of strong demand for the company’s services. Under GAAP accounting, companies can accrue revenue only as work is performed, even if they bill the work in advance. Note that consulting giant Accenture’s bookings exceed its revenue for its most recent quarter.

Does Palantir have a profitability problem? Possibly. Buzzfeed reports that the company lost $80 million last year on revenue of $420 million. Consulting margins tend to be fairly high, so a loss means that Palantir is “investing” in a lot of unbillable work. It’s hard to say if these “investments” will pay off. Palantir closed another round of funding in December, 2015, so people with more and better information than Buzzfeed obviously think they will, and are backing up their belief with cash.

By the way, you know who has an actual revenue problem? Buzzfeed.

Roger Peng attempts to draw lessons for data scientists from the Buzzfeed story, without questioning its premises. He should stick to Biostatistics.

Spark 2.0

— Databricks announces preview of Apache Spark 2.0 on Databricks Community Edition.

— From last week: Reynold Xin explains what’s new in Spark 2.0.

— Dave Ramel summarizes the new features, including faster SQL; consolidation of the Dataset and DataFrame APIs; support for ANSI (2003) SQL; and Structured Streaming, an integrated view of tables and streams.

— Now that Spark 2.0 is in preview, MapR offers Spark 1.6.1.

Explainers

— Four from Adrian Colyer:

— Richard Williamson explains how to build a streaming prediction engine with Spark, MADlib, Kudu and Impala.

— On the Cloudera Vision blog, Santosh Kumar explains Hive-on-Spark.

— DataStax’ Dani Traphagen explains data processing with Spark and Cassandra.

— In ZDNet, Andrew Brust explains Microsoft’s R strategy, and gets it right.

Perspectives

— For a planted article in Linux.com, Pam Baker interviews IBM’s Mike Breslin, who answer questions nobody is asking about using Spark and Cloudant.

— Joyce Wells recaps a presentation by Booz Allen’s Jair Aguirre, who touts Apache Drill.

— Alex Woodie attends the Apache: Big Data 2016 conference and discovers open source projects.

— In Business Insider, Sam Shead describes FBLearnerFlow, a workbench for machine learning and AI.

— Leslie D’Monte describes some ways companies use machine learning in their operations.

Open Source Announcements

— Google announces release to open source of SyntaxNet, a framework for natural language understanding. Included in the release: an English parser dubbed Parsey McParseface. Journalists respond to the latter like dogs to a squirrel.

— Amazon releases yet another deep learning framework, this one branded as “Deep Scalable Sparse Tensor Network Engine (DSSTNE)” or “Destiny”. Stephanie Condon reports.

— Salesforce donates PredictionIO to Apache.

— Apache Storm announces two new maintenance releases:

  • Storm 0.10.1 has bug fixes.
  • Storm 1.0.1 has performance improvements and bug fixes.

— Apache Flink announces Release 1.0.3, with bug fixes and improved documentation.

— Apache Apex pushes a release to resolve a security issue.

Commercial Announcements

— BI-on-Hadoop startup AtScale announces an $11 million “B” round. Media coverage here.

— H2O.ai announces new hires with a strong orientation towards visualization, suggesting the company plans to add a more robust user interface to its best-in-class machine learning engine.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow, discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchin’s blog on the popularity of analytics software. Search interest is one data point, Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • … explains social engineering attacks and potential defenses.
  • explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scaleability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.

— MapR’s Tugduall Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et. al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.

Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughn explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators.   Jim also explains Spark Streaming.

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0.  Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project.  Yet.  Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin.  Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package.  Caffe is one of the better-known deep learning packages, with a track record in image recognition.  Software is available on Git.  For more information, see the Wiki.  Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

–Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6).  The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC.  For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test available in a white paper here (registration required).

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynoter, Matei Zaharia recaps findings from Databricks’ Spark user survey, notes growth in summit attendance, meetup membership and contributor headcount.  (Video here). Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes jaws, a restful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.    Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLLib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLLib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths R, scikit-learn and MLlib.  Noting the strengths of R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?   Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLLib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.

— Bitly’s Sarah Guido explains topic modeling, using Spark MLLib’s Latent Dirchlet Allocation.  Video here.

— Casey Stella describes using word2vec in MLLib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies, distills to two generic Spark use cases:  (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.

— Sujit Pal explains how Elsevier uses Spark together with Solr, OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including pebble, Android, IOS, play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.   With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes kafka, Secor, Swift, Parquet and elasticsearch for data collection; Spark SQL and MLLib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (October 5, 2015)

Announcements timed to coincide with Strata NYC 2015 drive the news this week.  The single most interesting item for Big Analytics is the O’Reilly 2015 Data Science Survey, which warrants a post of its own.  Two key points:

  • Data scientists still use SQL, Excel, Python and R. (Doh!)
  • Data scientists who spend time in meetings and presenting results of analysis earn more than the grunts who muck around with data.

The lesson is clear: if you want to earn the big bucks, stop messing around with Zeppelin and learn PowerPoint.

There are a few interesting presentations from Strata embedded in this post, plus two that stand out:

  • Ron Kasabian of Intel and Michael Draugelis of Penn Medicine explain how they improve medical decision-making with predictive analytics.
  • Iulia Pasov and Calin-Andrei Burloiu show how they use data science to measure and prevent churn at Avira.

Paul Kent has the toughest job at SAS — promoting an initiative his boss thinks is hype.  In this sponsored presentation, Paul does a professional job presenting SAS’ Big Data story, which seems compelling.  However, the challenge for SAS in Big Data remains: name a reference customer.

Spark

Commentary

MapR’s Jim Scott takes another shot at the “will Spark replace Hadoop?” meme.  All together now:

  • Spark, like MapReduce, is a compute engine
  • Hadoop = MapReduce + HDFS + YARN + (an ecosystem of other bits)
  • Spark lacks a native file system, so it can never replace Hadoop
  • Coming in 2016:  Hadoop = Spark + HDFS + YARN + (an ecosystem…)
  • Also possible: Spark + Cassandra, Spark + MongoDB, Spark + Druid, Spark + (your database here)

On the Intersog blog, Jenny Richards gets it right by focusing on the differences between Spark and MapReduce.

Spark Maintenance Release

The Spark team announces Spark 1.5.1, a maintenance release with about 80 bug fixes.  On the Databricks blog, Reynold Xin explains Spark version numbers work.  Short version: top level numbers correspond to API compatibility, dot releases include features and enhancements, double-dot releases have bug fixes.

Spark Use Cases and Success Stories

At Strata NYC 2015, Databricks’ Reynold Xin describes  “sketching” with Spark (aka exploratory analysis and feature engineering).

Also at Strata, Edd Dumbill presents the business case for Spark, Kafka “and friends.”

On the MongoDB blog, Mat Keep interviews Thiago Cardoso, co-founder and CTO of Hekima, a social media analytics startup.  Hekima uses Spark, Hadoop and (you guessed it) MongoDB.

On SmartDataCollective, the ubiquitous Jim Scott describes what he calls use cases for Spark: exploratory analytics; machine learning; real-time dashboards and ETL.

At Big Data Dat LA 2015, ESRI’s Adam Mollenkops explains how to apply GeoSpatial Analytics with Spark. Video here.

Spark Integration

A slew of software vendors announce integration with Spark.

–Talend announces Release 6, which leverages Spark and Spark Streaming.  Stories here, herehere, here, here, here, and here.

Syncsort announces integration with Spark and Kafka.

–TIBCO announces Spotfire integration with Spark through the Spark SQL and SparkR APIs.  More here, here and here.

–Dataiku announces integration of its Data Science Studio (DSS) software with Spark.  DSS offers a commercially licensed visual workbench enabling the user to build pipelines integrating a number of data sources and formats.  Analytic functionality is modest.

–SnapLogic announces pending release of what it calls SnapLogic Elastic Integration Platform, which includes components branded as “Sparkplex” and Spark “Snaps”.  The former includes a code generator that translates user requests from the SnapLogic visual pipeline designer into Spark code.  The latter is SnapLogic branding for “prebuilt connectors”, of which there are many.  SnapLogic claims that Snaps are plug and play, so it’s a “snap” to convert your pipeline from MapReduce to Spark.  Stories here, here, here, here, and here.

Spark as a Service

In case you’re not happy with offerings from Databricks, Qubole, Amazon Web Services, Google, BlueData and MemSQL, Altiscale announces you-know-what.

SQL/OLAP

Apache Drill

Dremio, a startup led by MapR’s Drill gurus, lands a $10 million A round.

At the Hadoop Meeting NYC, the folks from Dremio present Drill use cases and a roadmap.

Polymath Abhishek Tiwari reflects on Drill.

On the MapR blog, Joseph Blue explains how to identify a data breach with Drill.

AtScale

adds support for Microsoft PowerBI to its BI on Hadoop story.

Presto

On ZDNet, Natalie Gagliordi touts Teradata’s embrace of Presto.  (Note to ZDNet’s headline editor: Presto is not an Apache project).  She correctly notes that Presto is faster than Hive-on-MapReduce, but that’s a low bar; just about everything is faster than Hive releases prior to 0.13, but Hive-on-Tez competes well with Drill, Impala, Presto and Spark SQL.  That’s a problem for Presto, because a challenger has to be outstanding at something.  Once again, it appears that Teradata is betting on the wrong horse.

Machine Learning

SparkR

Databricks’ Hossein Falaki delivers a presentation at Strata on supercharging R with Spark; slides here.  Spark’s R API is incomplete; as of Spark 1.5.1 it supports DataFrames operations (including SQL queries) and generalized linear models.  That’s better than nothing, but R users who need to do serious machine learning now need to look elsewhere.

Streaming Analytics

Apache Flink

On the MapR blog, Ellen Friedman introduces you to Flink.

Data Artisans’ Robert Metzger delivers a presentation about the architecture of Flink’s streaming runtime at ApacheCon Europe.

Spark Streaming

At Strata NY 2015, Databricks’ Tathagata Das describes the new bits in Spark Streaming.