The Year in SQL Engines

As an addendum to my year-end review of machine learning and deep learning, I offer this survey of SQL engines. SQL is the most widely used language for data science according to O’Reilly’s 2016 Data Science Salary Survey. Most projects require at least some SQL operations, and many need nothing but SQL.

This review covers six open source leaders: Hive, Impala, Spark SQL, Drill, HAWQ, and Presto; plus, for completeness, Calcite, Kylin, Phoenix, Tajo, and Trafodion. Omitted: two commercial options, Oracle Big Data SQL and IBM Big SQL, which IBM has not yet rebranded as “Watson SQL.”

(A reader asks: What about Druid? My response: erm. On inspection, I agree that Druid belongs in this category, so check it out.)

I use the term ‘SQL Engine’ loosely. Hive, for example, is not an engine; it’s a framework that uses the MapReduce, Tez, or Spark engines to run queries. And it doesn’t run SQL; it runs HiveQL, an SQL-like language that closely approximates SQL. ‘SQL-in-Hadoop’ is also inapt; while Hive and Impala work primarily with Hadoop, the others (Spark, Drill, HAWQ, and Presto) also work with a wide variety of other data storage systems.

Unlike relational databases, SQL engines operate independently of the data storage system. Relational databases bundle the query engine and storage into a single tightly coupled system, which permits certain types of optimization. Uncoupling them provides greater flexibility, though at a potential cost in performance.
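To make the decoupling concrete, here is a minimal sketch in Python. It is a toy, not how any of the engines below is built: the "engine" is a single group-and-count function, and the two "storage systems" are hypothetical adapters (`scan_csv` and `scan_json_lines`) that I made up for illustration. The point is only that the same query logic runs unchanged over either store.

```python
import csv
import io
import json
from collections import Counter
from typing import Dict, Iterable, Iterator

Row = Dict[str, str]

def scan_csv(text: str) -> Iterator[Row]:
    """Hypothetical storage adapter 1: rows held as CSV text."""
    yield from csv.DictReader(io.StringIO(text))

def scan_json_lines(text: str) -> Iterator[Row]:
    """Hypothetical storage adapter 2: the same rows held as JSON Lines."""
    for line in text.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}

def count_by(rows: Iterable[Row], column: str) -> Dict[str, int]:
    """The toy 'engine': roughly SELECT column, COUNT(*) ... GROUP BY column."""
    return dict(Counter(row[column] for row in rows))

CSV_DATA = "item,store\nwings,Duxbury\nburger,Duxbury\nwings,Boston\n"
JSON_DATA = (
    '{"item": "wings", "store": "Duxbury"}\n'
    '{"item": "burger", "store": "Duxbury"}\n'
    '{"item": "wings", "store": "Boston"}\n'
)

# Same query, two different storage formats.
csv_result = count_by(scan_csv(CSV_DATA), "item")
json_result = count_by(scan_json_lines(JSON_DATA), "item")
print(csv_result)
print(json_result)
```

The flexibility is visible here, and so is the cost: `count_by` knows nothing about how either store lays out its data, so it cannot exploit indexes or layout tricks the way a tightly coupled database can.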

Figure 1, below, shows the relative popularity of the leading SQL engines according to DB-Engines, a website maintained by the Austrian consultancy Solid IT. DB-Engines computes a monthly popularity score for more than 200 database systems. The score reflects search engine queries; mentions in online discussions; job offers; mentions in professional profiles; and tweets.

Figure 1

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

Although Impala, Spark SQL, Drill, HAWQ, and Presto consistently beat Hive on measures such as runtime performance, concurrency, and throughput, Hive remains the most popular (at least by the DB-Engines metric). There are three reasons why that is so:

— Hive is the default option for SQL in Hadoop, supported in every distribution. The others align with specific vendors and cater to niche users.

— Hive has closed the performance gap to the other engines. Most of the Hive alternatives launched in 2012 when analysts would rather kill themselves than wait for a Hive query to finish. But while Impala, Spark, Drill, et al. ran away like rabbits back then, Hive just kept chugging along, tortoise-like, with incremental improvements. Today, while Hive is not the fastest choice, it’s a lot better than it was five years ago.

— While bleeding-edge speed is cool, most organizations know that the world does not end if a junior marketing manager has to wait ten seconds to find out if the chicken wings outperformed the buffalo burgers in the Duxbury restaurant last Tuesday.

As you can see in Figure 2, below, the top SQL engines compete well for user interest compared to leading commercial data warehouse appliances.

Figure 2

Source: DB-Engines, January 2017 http://db-engines.com/en/ranking

The best measure of health for an open source project is the size of its active developer community. Hive and Presto have the largest base of contributors, as shown in Figure 3, below. (Data for Spark SQL is unavailable.)

Figure 3

Source: Open Hub https://www.openhub.net/

In 2016, Cloudera, Hortonworks, Kognitio, and Teradata waded into the Battle of the Benchmarks that Tony Baer summarizes. I’m sure that you will be shocked to learn that each vendor’s preferred SQL engine outperformed the others in its own study, which raises the question: are benchmarks bullshit?

AtScale’s biannual benchmark is not BS. AtScale, a BI startup, markets software that brokers between BI front ends and SQL backends. The company’s software is engine-neutral (it seeks to run on as many engines as possible), and its broad experience in BI gives the testing a real-world flavor.

AtScale’s key findings from its most recent round, which included Hive, Impala, Spark SQL, and Presto:

— All four engines successfully ran AtScale’s BI benchmark queries.

— Each engine has its own performance “sweet spot” depending on data volume, query complexity, and concurrent users.

– Impala and Spark SQL outperform the others in queries against small data sets

– On large data sets, Impala and Spark SQL handle complex joins better than the others

– Impala and Presto demonstrate the best results in concurrency tests

— All engines showed 2X-4X performance gains in the six months since AtScale’s previous benchmark.

Alex Woodie reports on the test results; Andrew Oliver analyzes.

Let’s dive into the individual projects.

Apache Hive

Apache Hive was the first SQL framework in the Hadoop ecosystem. Engineers at Facebook introduced Hive in 2007 and donated the code to the Apache Software Foundation in 2008; in September 2010, Hive graduated to top-level Apache project status. Every major player in the Hadoop ecosystem distributes and supports Hive, including Cloudera, MapR, Hortonworks, and IBM. Amazon Web Services offers a modified version of Hive as a cloud service in Elastic MapReduce (EMR).

Early releases of Hive used MapReduce to run queries. Complex queries required multiple passes through the data, which impaired performance. As a result, Hive was not suitable for interactive analysis. Led by Hortonworks, the Stinger initiative markedly enhanced Hive’s performance, notably through the use of Apache Tez, an application framework that delivers streamlined MapReduce code. Tez and ORCfile, a new storage format, produced a significant speedup for Hive queries.

Cloudera Labs spearheaded a parallel project to re-engineer Hive’s back end to run on Apache Spark. After an extended beta, Cloudera released Hive-on-Spark to general availability in early 2016.

More than 100 individuals contributed to Hive in 2016. The team announced Hive 2.0 in February and Hive 2.1 in June. Hive 2.0 includes several improvements to Hive-on-Spark, plus performance, usability, supportability and stability enhancements. Hive 2.1 includes Hive LLAP (“Live Long and Process”), which combines persistent query servers and optimized in-memory caching for high performance. The team claims a 25X speedup.

In September, the Hivemall project entered the Apache Incubator, as I noted in Part Two of my machine learning year-end roundup. Originally developed by Treasure Data and donated to the Apache Software Foundation, Hivemall is a scalable machine learning library implemented as a collection of Hive UDFs designed to run in Hive, Pig or Spark SQL with MapReduce, Tez or Spark. The team plans an initial release in Q1 2017.

Apache Impala

Cloudera launched Impala, an open source MPP SQL engine, in 2012, as a high-performance alternative to Hive. Impala works with HDFS and HBase, and it leverages Hive metadata; however, it bypasses MapReduce to run queries.

Mike Olson, Cloudera’s Chief Strategy Officer, argued in late 2013 that Hive’s architecture was fundamentally flawed. In Olson’s view, developers could only deliver high-performance SQL with a whole new approach, exemplified by Impala. In 2014 Cloudera released a series of benchmarks in January, May, and September. In these tests, Impala showed progressive improvement in query runtime, and significantly outperformed Hive on Tez, Spark SQL, and Presto. In addition to running fast, Impala performed particularly well in concurrency, throughput, and scalability.

In 2015, Cloudera donated Impala to the Apache Software Foundation, where it entered the Apache Incubator program. Cloudera, MapR, Oracle and Amazon Web Services distribute Impala; Cloudera, MapR, and Oracle provide commercial build and installation support.

Impala made steady progress in the Apache Incubator in 2016. The team cleaned up the code, ported it to Apache infrastructure and, in October, delivered Release 2.7.0, its first Apache release. The new version includes performance and scalability improvements, as well as some other minor enhancements.

In September, Cloudera published results of a study that compared Impala to Amazon Web Services’ Redshift columnar database. The report is interesting reading, though subject to the usual caveats about vendor benchmarks.

Spark SQL

Spark SQL is a Spark component for structured data processing. The Apache Spark team launched Spark SQL in 2014 and absorbed Shark, an early Hive-on-Spark project. It quickly became the most widely used Spark module.

Spark SQL users can run SQL queries, read data from Hive, or use it as a means to create Spark Datasets and DataFrames. (Datasets are distributed collections of data; DataFrames are Datasets organized into named columns.) The Spark SQL interface provides Spark with information about the structure of the data and operations to be performed; Spark’s Catalyst optimizer uses this information to construct an efficient query.

In 2015, Spark’s machine learning developers introduced the ML API, a package that leveraged Spark DataFrames instead of the lower-level Spark RDD API. This approach proved to be attractive and fruitful; in 2016, with Release 2.0, the Spark team placed the RDD-based API in maintenance mode. The DataFrames API is now the primary interface for Spark machine learning.

Also in 2016, the team released Structured Streaming, in an Alpha release as of Spark 2.1.0. Structured Streaming is a stream processing engine built on Spark SQL. Users can query streaming data sources in the same manner as static sources, and they can combine streaming and static sources in a single query. Spark SQL runs the query continuously and updates results as streaming data arrives. Structured Streaming delivers exactly-once fault-tolerance guarantees through checkpointing and Write Ahead Logs.
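For readers who want a feel for what "runs the query continuously and updates results" means, here is a toy sketch in plain Python. It does not use Spark or its API; the function `run_continuous_query`, the lookup table, and the sample events are all hypothetical, and it ignores checkpointing, Write Ahead Logs, and exactly-once guarantees entirely. It only illustrates the shape of the idea: one query definition, a static table joined to a stream, and a result table that is refreshed as each batch of events arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Static source: a small lookup table (hypothetical data).
STORE_REGION: Dict[str, str] = {"Duxbury": "South Shore", "Boston": "Metro"}

def run_continuous_query(
    batches: Iterable[List[Tuple[str, float]]],
) -> Iterator[Dict[str, float]]:
    """Yield the updated result of, roughly,
    SELECT region, SUM(amount) FROM sales JOIN stores USING (store) GROUP BY region
    after each batch of (store, amount) events arrives."""
    totals: Dict[str, float] = defaultdict(float)
    for batch in batches:
        for store, amount in batch:
            region = STORE_REGION.get(store)
            if region is None:
                continue  # inner join: drop events with no match in the static table
            totals[region] += amount
        yield dict(totals)  # snapshot of the result table so far

# Two batches of streaming events (hypothetical).
stream = [
    [("Duxbury", 10.0), ("Boston", 5.0)],
    [("Duxbury", 2.5), ("Nowhere", 99.0)],
]
snapshots = list(run_continuous_query(stream))
for snapshot in snapshots:
    print(snapshot)
```

Each printed snapshot is the full aggregate as of that batch, which is the behavior the paragraph above describes: the user writes the query once and the engine keeps the answer current.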

Apache Drill

In 2012, a group led by MapR, one of the leading Hadoop distributors, proposed to build an open-source version of Google’s Dremel, a distributed system for interactive ad-hoc analysis. They named the project Apache Drill. Drill languished in the Apache Incubator for more than two years, finally graduating in late 2014. The team delivered its 1.0 release in 2015.

MapR distributes and supports Apache Drill.

More than 50 individuals contributed to Drill in 2016. The team delivered five dot releases in 2016. Key enhancements include:

  • Web authentication
  • Support for the Apache Kudu columnar database
  • Support for HBase 1.x
  • Dynamic UDF support

Two key Drill contributors left MapR to start Dremio in 2015; the startup remains in stealth mode.

Apache HAWQ

Pivotal Software introduced HAWQ as a commercially licensed high-performance SQL engine in 2012 and attempted to market it with minimal success. Changing strategy, Pivotal donated the project to Apache in June 2015, and it entered the Apache Incubator program in September 2015.

Fifteen months later, HAWQ remains in the Incubator. The team released HAWQ 2.0.0.0 in December, with a load of bug fixes. I suspect the project will graduate in 2017.

One small point in HAWQ’s favor is its support for Apache MADlib, the machine-learning-in-SQL project that is also still in the Incubator. The combination of HAWQ and MADlib should be a nice consolation to the folks who bought Greenplum and wonder what the hell happened.

Presto

Facebook engineers initiated the Presto project in 2012 as a fast interactive alternative to Hive. Rolled out in 2013, the software successfully supported more than a thousand Facebook users and more than 30,000 queries per day on petabytes of data. Facebook released Presto to open source in 2013.

Presto supports ANSI SQL queries across a range of data sources, including Hive, Cassandra, relational databases and proprietary file systems (such as Amazon Web Services’ S3). Presto queries can federate data from multiple sources. Users can submit queries from C, Java, Node.js, PHP, Python, R and Ruby.
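A federated query is simply one SQL statement that joins tables living in different systems. The sketch below is a stand-in, not Presto: it uses Python's built-in sqlite3 module and SQLite's ATTACH DATABASE to put two separate in-memory databases behind one connection, then joins across them. The table names, the `crm` alias, and the data are hypothetical. In Presto the equivalent query would name each table by catalog, schema, and table rather than by an attached database alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                  # first source, known as "main"
conn.execute("ATTACH DATABASE ':memory:' AS crm")   # second, separate source

# Each table lives in a different database.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)", [(1, 20.0), (1, 5.0), (2, 7.5)]
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)

# One query, two sources.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
conn.close()
```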

Airpal, a web-based query tool developed by Airbnb, offers users the ability to submit queries to Presto through a browser. Qubole provides a managed service for Presto. AWS delivers a Presto service on EMR.

In June 2015, Teradata announced plans to develop and support the project. Under an announced three-phase program, Teradata proposed to integrate Presto into the Hadoop ecosystem, enable operation under YARN and enhance connectivity through ODBC and JDBC. Teradata offers its own distribution of Presto, complete with a data sheet. In June, Teradata announced the certification of Information Builders, Looker, Qlik, Tableau, and ZoomData, with MicroStrategy and Microsoft Power BI on the way.

Presto is a very active project, with a vast and vibrant contributor community. The team cranks out releases faster than Miki Sudo eats hot dogs — I count 42 releases in 2016. Teradata hasn’t bothered to summarize what’s new, and I don’t plan to sift through 42 sets of release notes, so let’s just say it’s better.

Other Apache Projects

There are five other SQL-ish projects in the Apache ecosystem.

Apache Calcite

Apache Calcite is an open source framework for building databases. It includes:

— A SQL parser, validator and JDBC driver

— Query optimization tools, including a relational algebra API, rule-based planner, and a cost-based query optimizer.

Apache Hive uses Calcite for cost-based query optimization, while Apache Drill and Apache Kylin use the SQL parser.

The Calcite team pushed out five releases in 2016, with bug fixes and new adapters for Cassandra, Druid, and Elasticsearch.

Apache Kylin

Apache Kylin is an OLAP engine with a SQL interface. Developed by eBay and donated to Apache, Kylin graduated to top-level status in 2015.

A startup named Kyligence launched in 2016; it offers commercial support and a data warehousing product called KAP, FWIW. While the company has no funding listed in Crunchbase, a source tells me that it has strong backing and a large office in Shanghai.

Apache Phoenix

Apache Phoenix is a SQL framework that runs on HBase and bypasses MapReduce. Salesforce developed the software and donated it to Apache in 2013. The project graduated to top-level status in May 2014. Hortonworks includes Phoenix in the Hortonworks Data Platform. Since the leading SQL engines all work with HBase, it’s not clear why we need Phoenix.

Apache Tajo

Apache Tajo is a fast SQL data warehousing framework introduced in 2011 by Gruter, a Big Data infrastructure company, and donated to Apache in 2013. Tajo graduated to top-level status in 2014. The project has attracted little interest from prospective users and contributors outside of Gruter’s primary market in South Korea. Other than a brief mention by Gartner’s Nick Heudecker, the project isn’t on anyone’s dashboard.

Apache Trafodion

Apache Trafodion is another SQL-on-HBase project, conceived by HP Labs, which tells you pretty much all you need to know. HP launched Trafodion in June 2014, a month after Apache Phoenix graduated to top-level status. Six months later, it dawned on HP executives that there might be limited commercial potential for another SQL-on-HBase engine (I can see the facepalms), so they donated the project to Apache, where it entered the Incubator in May 2015.

Trafodion promises to be a transactional database if it ever gets out of incubation. Unfortunately, there are lots of options in that space, and the only competitive benefit the development team can articulate seems to be “it’s open source, so it’s cheap.”

Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, which can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.


Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to Open Hub, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what the Apache Beam project has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gilad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches. (h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (August 8, 2016)

So, Apple acquires Turi for $200 million. Hopefully, Apple did not pay for brand equity.

Bridget Botelho argues that businesses must either disrupt or be disrupted, and outlines the role of machine learning. Someone should write a book about that.

Conference Announcements

— Flink Forward announces the schedule for its second annual event, to be held September 12-14 in Berlin.

— Databricks announces the agenda for Spark Summit Europe 2016 in Brussels (October 25-27).

Apple Buys GraphLab Dato Turi

Geekwire breaks the story, reporting a purchase price of $200 million. According to TechCrunch, Turi notified customers that its products would no longer be available. Apple adds Turi to the portfolio of machine learning startups it has acquired in the past year, including Emotient, Perceptio, and VocalIQ. More reporting here.

GraphLab started in 2009 as an open source project led by Carlos Guestrin of Carnegie Mellon. (According to Open Hub, Guestrin never contributed any code.) In May 2013, Guestrin raised $6.75M to start an eponymous venture to provide commercial support for GraphLab. In October 2014, GraphLab announced the availability of GraphLab Create, a commercially licensed software product. Contributions to the open source project actually ended in 2013; while the code remains on GitHub, the project is dead.

GraphLab changed its name to Dato in January 2015. They should have googled the name; at the time, the top links in a search included Dato Foland, a gay porn star, and Datto Inc, a data backup and recovery company in Connecticut. The latter proved problematic; Datto sued, forcing Dato to rebrand as Turi earlier this month.

Turi’s open source SFrame project remains for those who think introducing another file system into the mix is a smart thing to do.

Teradata: 9 Straight Quarters of Declining Product Revenue

For the second quarter of 2016, declining data warehouse giant Teradata reports an 11% decline in product revenue compared to Q2 2015. (Product revenue includes revenue from licensing software and hardware — boxes with the Teradata brand.) Maintenance revenue increased slightly, which means that customers aren’t pulling the plug on Teradata databases as fast as they did last year. Consulting revenue declined by 1%, which casts doubt on TDC’s stated strategy to become a services powerhouse.


Count me as skeptical about the merits of that plan. Teradata’s consulting revenue remains highly correlated with product revenue; in other words, if Teradata can’t sell its boxes, it’s not going to sell billable hours for consultants to implement those boxes. Teradata is not a credible competitor in the market for consulting-led solutions; companies like Oracle, IBM and SAS have a twenty-year head start.

Since Teradata performed better than “expectations”, Wall Street rewarded the stock with a bounce above $30. It’s a dead-cat bounce. As the Wall Street Journal notes, companies routinely game analyst expectations. TDC currently trades at 32 times trailing earnings, well above its peers; moreover, its peers are growing rather than declining.

Explainers

— Kaarthik Sivashanmugam explains how to develop Apache Spark applications in .NET with Mobius.

— On the Cloudera Engineering Blog, Devadutta Ghat et al. explain the latest performance improvements in Impala 2.6.

— Parsey McParseface now has 40 cousins. On the Google Research Blog, Chris Alberti et al. explain.

— Ujjwal Ratan explains how to use Amazon Machine Learning to predict patient readmission.

Perspectives

— Curt Monash offers his assessment of Spark. Highlights:

  • Spark replaces MapReduce, in particular for data transformation.
  • Spark is becoming the default platform for machine learning.
  • Spark SQL is OK as an adjunct for other analysis.
  • Spark Streaming is doing well, but there are challengers. (See below.)
  • Databricks’ managed service for Spark has more than 200 subscribers.

— Serdar Yegulalp deploys the tired old “pure streaming versus microbatch” argument to claim that Apache Apex, Heron, Apache Flink and Onyx are “contenders” versus Spark. Someone should show him this graph:

[Graph not reproduced]

— In Datanami, Alex Woodie profiles Flink.

— Vance McCarthy touts MapR’s Spyglass Initiative for analytics on the MapR Converged Data Platform.

— Trevor Jones describes Microsoft Azure’s big data tools.

— Sam Dean champions Sparkling Water, H2O’s interface to Spark.

Commercial Announcements

— Dataiku announces the release of Data Science Studio 3.1, with five machine learning back ends and a visual coding interface (which it labels “code-free”). Dave Ramel reports.

— John Snow Labs announces it will deliver curated data in Parquet format.

— Lexalytics announces the availability of its Semantria text analytics software on Azure.

Big Analytics Roundup (May 16, 2016)

This week we have more insight into Spark 2.0, scheduled for release just before Spark Summit 2016. (Yes, I’m going.) Also, kudos to BI-on-Hadoop startup AtScale for a new round of funding; Amazon releases YADLF (Yet Another Deep Learning Framework); and there are a number of new faces at H2O.ai.

Plus, we have an extended review of the Palantir story.

Buzzfeed on Palantir

Last week, I deemed Buzzfeed’s story on Palantir too dumb to link. (“Forget it, Jake. It’s Buzzfeed.”) Buzzfeed “news” reporter William Alden, who was all over a story about maggots in Facebook lunches, breathlessly mines a cache of “secret internal documents” and discovers:

  • Palantir expects employee turnover of around 20% for 2016.
  • Palantir lost some clients.
  • Palantir books more work than it bills.

Does Palantir have an employee turnover problem? No. A 20% turnover rate is slightly above the 17% reported for all industries in 2015, and about on track for Silicon Valley. (There are companies in SV with 100% turnover rates.) On Glassdoor, employees give Palantir high marks.

Does Palantir have a client retention problem? Not exactly. The story cites four clients (American Express, Coca-Cola, Kimberly-Clark and Nasdaq) who engaged Palantir to conduct a pilot, then decided not to proceed with a long-term contract. In other words, lost sales and not cancelled contracts. The document Buzzfeed obtained is Palantir’s won/lost analysis, which shows that the company is attempting to learn from its lost sales.

Does Palantir have a revenue problem? No. Palantir’s 2015 revenue was up 50% from the previous year. Buzzfeed obsesses over the difference between Palantir’s bookings of $1.7 billion and its revenue of $420 million. A high book-to-bill ratio is typical for consultancies that pursue large multi-year projects; it is a sign of strong demand for the company’s services. Under GAAP accounting, companies can recognize revenue only as work is performed, even if they bill the work in advance. Note that consulting giant Accenture’s bookings exceed its revenue for its most recent quarter.
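For anyone who wants the arithmetic spelled out, here is a small Python sketch. The bookings and revenue figures are the ones Buzzfeed reported; the three-year, $30 million contract and the straight-line schedule in `recognized_per_year` are hypothetical, chosen only to show why a multi-year deal shows up in bookings long before it shows up in revenue.

```python
bookings = 1_700_000_000   # bookings as reported by Buzzfeed, USD
revenue = 420_000_000      # 2015 revenue as reported by Buzzfeed, USD

book_to_bill = bookings / revenue
print(f"book-to-bill ratio: {book_to_bill:.2f}")

def recognized_per_year(contract_value: float, years: int) -> float:
    """Hypothetical straight-line schedule: revenue is recognized evenly
    as the work is performed, regardless of when the contract was booked."""
    return contract_value / years

# Hypothetical contract: booked in full today, recognized over three years.
contract_value = 30_000_000.0
first_year_revenue = recognized_per_year(contract_value, 3)
print(f"booked now: {contract_value:,.0f}; recognized in year one: {first_year_revenue:,.0f}")
```

With a ratio of roughly four to one, the gap Buzzfeed found is about what you would expect from a firm selling multi-year engagements.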

Does Palantir have a profitability problem? Possibly. Buzzfeed reports that the company lost $80 million last year on revenue of $420 million. Consulting margins tend to be fairly high, so a loss means that Palantir is “investing” in a lot of unbillable work. It’s hard to say if these “investments” will pay off. Palantir closed another round of funding in December 2015, so people with more and better information than Buzzfeed obviously think they will, and are backing up their belief with cash.

By the way, you know who has an actual revenue problem? Buzzfeed.

Roger Peng attempts to draw lessons for data scientists from the Buzzfeed story, without questioning its premises. He should stick to Biostatistics.

Spark 2.0

— Databricks announces preview of Apache Spark 2.0 on Databricks Community Edition.

— From last week: Reynold Xin explains what’s new in Spark 2.0.

— Dave Ramel summarizes the new features, including faster SQL; consolidation of the Dataset and DataFrame APIs; support for ANSI (2003) SQL; and Structured Streaming, an integrated view of tables and streams.

— Now that Spark 2.0 is in preview, MapR offers Spark 1.6.1.

Explainers

— Four from Adrian Colyer:

— Richard Williamson explains how to build a streaming prediction engine with Spark, MADlib, Kudu and Impala.

— On the Cloudera Vision blog, Santosh Kumar explains Hive-on-Spark.

— DataStax’ Dani Traphagen explains data processing with Spark and Cassandra.

— In ZDNet, Andrew Brust explains Microsoft’s R strategy, and gets it right.

Perspectives

— For a planted article in Linux.com, Pam Baker interviews IBM’s Mike Breslin, who answers questions nobody is asking about using Spark and Cloudant.

— Joyce Wells recaps a presentation by Booz Allen’s Jair Aguirre, who touts Apache Drill.

— Alex Woodie attends the Apache: Big Data 2016 conference and discovers open source projects.

— In Business Insider, Sam Shead describes FBLearnerFlow, a workbench for machine learning and AI.

— Leslie D’Monte describes some ways companies use machine learning in their operations.

Open Source Announcements

— Google announces release to open source of SyntaxNet, a framework for natural language understanding. Included in the release: an English parser dubbed Parsey McParseface. Journalists respond to the latter like dogs to a squirrel.

— Amazon releases yet another deep learning framework, this one branded as “Deep Scalable Sparse Tensor Network Engine (DSSTNE)” or “Destiny”. Stephanie Condon reports.

— Salesforce donates PredictionIO to Apache.

— Apache Storm announces two new maintenance releases:

  • Storm 0.10.1 has bug fixes.
  • Storm 1.0.1 has performance improvements and bug fixes.

— Apache Flink announces Release 1.0.3, with bug fixes and improved documentation.

— Apache Apex pushes a release to resolve a security issue.

Commercial Announcements

— BI-on-Hadoop startup AtScale announces an $11 million “B” round. Media coverage here.

— H2O.ai announces new hires with a strong orientation towards visualization, suggesting the company plans to add a more robust user interface to its best-in-class machine learning engine.

Big Analytics Roundup (May 2, 2016)

Movidius ups the ante for trade show trinkets by releasing what journos describe as supercomputing, neural computing power, vision processing, deep learning, and artificial intelligence on a USB drive.  Roundup here.


Last November, IBM’s Paul Zikopoulos snarked at Cloudera for not supporting SparkR. Cloudera’s Sean Owen, responding to a query in the Cloudera Community, notes that SparkR “does not work with other resource managers,” and does not work unless R is installed on the data nodes. Sean also notes that Cloudera cannot redistribute R because it is under GPL license. Data scientist Iraklis Tsatsoulis explains how to make SparkR work in Cloudera. Cloudera’s response isn’t completely satisfactory — the GPL license does not prohibit Cloudera from redistributing R, for example — but it is based on actual working experience with the product, which IBM clearly does not have.

Turning to important matters, a group at the Technical University of Munich has a machine learning engine that predicts who will die in Game of Thrones. Not very well, it seems; they blew it on Roose Bolton. Oops, spoiler.


Explainers

— Adrian Colyer explains GeePS, a Deep Learning framework for clusters of GPUs. Put that on a thumb drive and we can talk.

— On the Altiscale blog, Professor Jimmy Lin compares local installations, virtual machines, IaaS providers and Altiscale’s Hadoop-as-a-Service offering for teaching students about Big Data. Spoiler: he likes Altiscale.

— Two benchmarks from the Cloudera Engineering Blog:

  • Devadutta Ghat et al. explain results from benchmarking Impala 2.5 with TPC queries. They claim an average speedup of 4.35X over Impala 2.3 for TPC-DS.
  • Allstate’s Don Drake explains results of a test comparing Spark 1.6 performance with Avro and Parquet, with CSV as a baseline. Drake ran a multi-step benchmark with a narrow table and a wide table. Results: the Spark job ran faster with Parquet than Avro, markedly so for the wide data set, which makes sense since it’s columnar. Also, performance with CSV sucked.
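Why would a columnar format pull ahead on the wide table in particular? A toy sketch, not a model of how Spark, Avro or Parquet actually read data: the table shape, the function names and the "values touched" counter are all hypothetical, made up for illustration. It sums one column from the same data stored row-wise (as Avro does) and column-wise (as Parquet does), and counts how many values each layout has to read along the way.

```python
def make_table(num_rows, num_cols):
    """Build the same hypothetical table in row-oriented and column-oriented form."""
    rows = [[r * num_cols + c for c in range(num_cols)] for r in range(num_rows)]
    columns = [[row[c] for row in rows] for c in range(num_cols)]
    return rows, columns


def sum_column_row_layout(rows, col):
    """Row layout: each whole record is read before one field is picked out."""
    touched = 0
    total = 0
    for row in rows:
        touched += len(row)  # the entire record counts as read
        total += row[col]
    return total, touched


def sum_column_columnar_layout(columns, col):
    """Columnar layout: only the requested column is read."""
    values = columns[col]
    return sum(values), len(values)


for width in (5, 500):  # a narrow table and a wide table (illustrative sizes)
    rows, columns = make_table(num_rows=1000, num_cols=width)
    row_total, row_touched = sum_column_row_layout(rows, col=0)
    col_total, col_touched = sum_column_columnar_layout(columns, col=0)
    assert row_total == col_total  # both layouts give the same answer
    print(f"{width} columns: row layout touched {row_touched} values, "
          f"columnar touched {col_touched}")
```

In this sketch the row layout's work grows with the table's width while the columnar layout's does not, which is at least consistent with the gap Drake saw on the wide data set. Real formats add compression, encoding and I/O effects that a counter like this ignores.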

— Three items from MapR’s Converge blog:

  • Nick Amato explains how to predict Airbnb listing prices with scikit-learn and Spark.
  • Mathieu Dumoulin explains Deep Learning with the CaffeOnSpark package.
  • Nicolas A Perez explains how to do Twitter sentiment analysis with Spark Streaming.

— Corentin Kerisit explains RDD partitioning in Spark.

Perspectives

— An anonymous blogger at CBInsights notes that big tech companies are paying big bucks for AI companies, so if you’re running a startup make sure you put AI in the name.

— Alexander Wissner-Gross weighs in on the “datasets versus algorithms” debate. My take: data trumps algorithms.

— Google streams engineer Tyler Akidau discusses streaming systems versus batch processing, which is like asking Mr. Fox for his perspective on chickens.

— David Weldon continues his series of interviews with people at Strata + Hadoop: Ravi Dharnikota of SnapLogic, who heard a lot of talk about streaming, Spark and data lakes.

— Alan Earls touts Amazon Machine Learning without understanding it.

— Jack Vaughan interviews eBay’s Debashis Saha, who discusses Kylin and other stuff.

Open Source Announcements

— The Apache Software Foundation announces that Apache Apex has graduated to top level status. Apex, for streaming analytics, is the open source version of DataTorrent. Jessica Davis reports.

— North Bridge and Black Duck release their tenth annual survey of people who like open source.

— Apache Flink 1.0.2 ships with bug fixes and a new capability to integrate with RocksDB. So now, you can have Flink on Rocks.

Commercial Announcements

— Google’s DeepMind AI unit announces that they will use TensorFlow instead of Torch for all future work.

— Three guys exit Pivotal, start a company named SnappyData, land a tiny “A” round from Pivotal and GE Digital and propose to build something like GemFire, but on Spark. More here.

— Levyx announces a small “A” round. Levyx offers a version of Spark optimized to run on solid state/Flash memory.

— Tiny consulting firm Xentaurs announces a partnership with Mesosphere. And not just any partnership; it’s a strategic partnership. Actually, they just joined the DC/OS community.

Big Analytics Roundup (April 25, 2016)

Mesosphere wins the internet this week with its announcement that it has open sourced DC/OS, its datacenter virtualization project built around Apache Mesos. While not an “analytics” project per se, DC/OS has the potential to transform how organizations provision and deploy their analytics platforms.

In a nutshell, Apache Mesos distributes workloads across physical IT resources. DC/OS adds a container orchestration platform; installation, management and monitoring tools; and improvements to networking, security, load balancing and other areas. For more details about DC/OS and why it matters, read this white paper by Benjamin Hindman and Edward Hsu of Mesosphere.

Mesosphere has assembled an alliance of 61 launch partners, including tech vendors, systems integrators and potential users. Big brands include Accenture, Capgemini, Cisco, EMC, HPE, MapR, Microsoft and Verizon. Notable startups include Alluxio, Canonical, Confluent, Lightbend and MemSQL.

Analysts chime in:

  • Gavin Clarke thinks Google forced Mesosphere’s hand by open sourcing Kubernetes.
  • Mike Wheatley notes that many of the components were already open source.
  • On TechCrunch, Frederic Lardinois reports and comments.
  • In Computerworld, John Ribeiro reports.
  • Janakiram MSV wonders if DC/OS will emerge as an alternative to Kubernetes.
  • Sam Dean surveys the project and interviews Ben Hindman.
  • George Leopold notes the scope of the DC/OS ecosystem.
  • Joao Lima reports.

DC/OS ships with more than 30 open source packages ready to install as DC/OS services. Notable among them: Cassandra, Elasticsearch, Kafka, MemSQL, Spark, Storm and Zeppelin.

Explainers

— Andrie de Vries explains how he scraped CRAN to trace the growth in R packages.

— On the Cloudera Engineering blog, David Alves explains how to use Impala and Kudu for analytic workloads.

— Michael Hunger and William Lyon explain how they analyzed the Panama Papers with Neo4j.

— On the Microsoft Azure blog, Liam Cavanagh explains how to optimize document search in Azure.

— Adrian Colyer of the morning paper summarizes five papers on word vectors, reviews Global Vectors for Word Representation, delivers an overview of Deep Learning and covers ImageNet classification with deep convolutional neural networks.

— Mario Inchiosa and Roni Burd explain how Microsoft R Server delivers an R interface to Spark in HDInsight.

Perspectives

— In MIT Technology Review, Tom Simonite interviews Google’s Jeff Dean, contributor to Spanner, Translate, BigTable, MapReduce, Google Brain, LevelDB and TensorFlow. They discuss the future of machine learning.

— David Weldon went to Strata and interviewed some people:

  • Ali Hodroj of GigaSpaces, a cloud enabling company. Hodroj is bullish on cloud.
  • H2O.ai’s Arno Candel, who is surprised that so many people are talking about Spark.
  • Nikita Ivanov of GridGain, who says that people are excited about in-memory computing.
  • DataArtisans’ Kostas Tzoumas, who thinks that more people would use Flink if they were better educated.

— Alex Woodie touts Apache Beam, the open source implementation of Google’s Cloud Dataflow, which aspires to unify everything.

— James Nunns surveys ten Big Analytics startups: Confluent, H2O.ai, AtScale, Interana, Tamr, Wavefront, BlueTalon, Cazena, DataTorrent and Databricks.

— In Silicon Angle, Wikibon’s Paul Gillin interviews Wikibon’s George Gilbert, who is bullish on Spark.

— John Leonard ruminates on Hadoop, noting the proliferation of cute animal logos, and the challenges of the open source business model.

— Sam Dean notices that there are quite a few new open source tools for machine learning.

— Jack Vaughan summarizes the educational challenges posed by machine learning.

Commercial Announcements

— Dataiku announces availability of Data Science Studio on Microsoft Azure.

— GridGain announces availability of a support package for Apache Ignite that includes its Professional Edition — essentially the same as Apache Ignite, with more frequent maintenance releases and some LGPL libraries.

— MemSQL announces closing on a $36 million “C” round. All existing investors participated, plus two new investors.

Big Analytics Roundup (February 29, 2016)

Happy Leap Day.  Tachyon’s rebranding as Alluxio, release of CaffeOnSpark and GA for Google Cloud Dataproc lead the hard news this week.  The Alluxio announcement has inspired big thinkers to share big thoughts.  And, we have a nice crop of explainers.  Scroll down to the bottom for another SQL on Hadoop benchmark.

Explainers

— In SearchDataManagement, Jack Vaughan explains Spark 2.0.

— In Datanami, Alex Woodie explains Structured Streaming in Spark 2.0.

— MapR’s Jim Scott explains Spark accumulators.   Jim also explains Spark Streaming.

— DataArtisans’ Fabian Hueske introduces Flink.

— In SlideShare, Julian Hyde explains streaming SQL.

— Wes McKinney explains why pandas users should be excited about Apache Arrow.

— On her blog, Paige Roberts explains Project Tungsten, complete with pictures.

— Someone from Dremio explains Drillix, which is what you get when you combine Apache Phoenix and Apache Drill. (h/t Hadoop Weekly).

Perspectives

— In TheNextPlatform, Timothy Prickett Morgan argues that Tachyon Caching (Alluxio) is bigger than Spark.

— In SiliconAngle, Maria Deutscher opines that Alluxio (née Tachyon) could replace HDFS for Spark users.

— In The New Stack, Susan Hall speculates that Apache Arrow’s columnar data layer could accelerate Spark and Hadoop.  She means Hadoop in a general way, e.g. the Hadoop ecosystem.

— On the Dataiku blog, “Caroline” interviews John Kelly, Managing Director of Berkeley Research Group and asks him questions about data science.  Left unanswered: is it “Data-ikoo” or “Day-tie-koo?”

— Alpine Data Labs’ Steven Hillion ruminates on success.  He’d be better off ruminating on “how to raise your next round of venture capital.”

— Max Slater-Robins opines that Microsoft is inventing the future, which is even better than winning the internet.

— In ZDNet, Andrew Brust wonders if Databricks is vying for a full analytics stack, citing the new Dashboard feature as cause for wonder.  He’s just trolling.

— In Search Cloud Applications, Joel Shore opines that streaming analytics is replacing complex event processing, which makes sense.   He further opines that Flink will displace Spark for streaming, which doesn’t make sense.   Shore interviews IBM’s Nagui Halim about streaming here.

Open Source Announcements

— Alluxio (née Tachyon) announces Release 1.0.0.  Alluxio is open source software distributed through Git under an Apache license, but is not an Apache project.  Yet.  Release 1.0 includes frameworks for MapReduce, Spark, Flink and Zeppelin.  Daniel Gutierrez reports.

— Yahoo releases CaffeOnSpark, a distributed deep learning package.  Caffe is one of the better-known deep learning packages, with a track record in image recognition.  Software is available on Git.  For more information, see the Wiki.  Alex Handy reports; Charlie Osborne reports.

— RapidMiner China announces availability of an extension for deep learning engine DL4J.  The extension is open source, and works with the open source version of RapidMiner.  DL4J sponsor Skymind collaborated.

Commercial Announcements

— Tachyon Nexus, the commercial venture founded to support Tachyon, the memory-centric virtual distributed storage system, announces that it has rebranded as Alluxio.

— Google announces general availability for its Cloud Dataproc managed service for Spark and Hadoop.

Funding Announcements

— Health analytics vendor Health Catalyst lands a $70M Series E round.

AtScale Benchmarks SQL-on-Hadoop Engines

On the AtScale blog, Trystan Leftwich summarizes results from a benchmark test of Hive on Tez (1.2/0.7), Cloudera Apache Impala (2.3) and Spark SQL (1.6).  The AtScale team tested Impala and Spark with Parquet and Hive on Tez with ORC.  For test cases, the team used TPC-H data arranged in a star schema, and ran 13 queries in each SQL engine multiple times, averaging the results.
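For readers who want to try the "run each query several times and average" approach on their own data, here is a minimal sketch. It is not AtScale's harness: an in-memory SQLite database stands in for the SQL engine, and the two-table star schema, the query names and the run count are all hypothetical.

```python
import sqlite3
import statistics
import time


def build_star_schema(conn):
    """Create a tiny fact table and one dimension table (hypothetical data)."""
    conn.execute("CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE fact_sales (region_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dim_region VALUES (?, ?)", [(1, "east"), (2, "west")])
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?)",
        [(1 + i % 2, float(i)) for i in range(1000)],
    )


def average_runtime(conn, sql, runs=5):
    """Run one query `runs` times and return the mean wall-clock seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()  # fetch all rows so the query runs to completion
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Two illustrative query shapes: a single-table scan and a fact-to-dimension join.
QUERIES = {
    "single_table": "SELECT SUM(amount) FROM fact_sales",
    "star_join": (
        "SELECT d.name, SUM(f.amount) FROM fact_sales f "
        "JOIN dim_region d ON d.region_id = f.region_id GROUP BY d.name"
    ),
}

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
results = {name: average_runtime(conn, sql) for name, sql in QUERIES.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds:.6f}s average over 5 runs")
```

Swapping in a real engine means replacing the SQLite connection with that engine's own client. Averaging repeated runs only smooths out noise; it says nothing about concurrency, which needs a separate test.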

While Hortonworks recommends ORC with Hive/Tez, there are published cases where users achieved good results with Hive/Tez on Parquet.  Since the storage format has a big impact on SQL performance, I would have tested Hive/Tez on Parquet as well.  AtScale did not respond to queries on this point.

Key findings:

  • All three engines performed about the same on single-table queries, and on queries joining three small tables.
  • Spark and Impala ran faster than Hive on queries joining three large tables.
  • Spark ran faster than Impala on queries joining four or more tables.

The team ran the same tests with AtScale’s commercial caching technology, with significant performance improvements for all three engines.

In concurrency testing, Impala performed much better than Hive or Spark.

Details of the test available in a white paper here (registration required).

2015 in Big Analytics

Looking back at 2015, a few stories stand out:

  • Steady progress for Spark, punctuated by two big announcements.
  • Solid growth in cloud-based machine learning, led by Microsoft.
  • Expanding options for SQL and OLAP on Hadoop.

In 2015, the most widely read post on this blog was Spark is Too Big to Fail, published in April.  I wrote this post in response to a growing chorus of snark about Spark written by folks who seemed to know little about the project and its goals.

IBM Embraces Spark

IBM’s commitment to Spark, announced on June 15, lit up the crowds gathered in San Francisco for the Spark Summit. IBM brings a number of things to Spark: deep pockets to build a community, extensive technical resources and a large customer base. It also brings a clutter of aging and partially integrated products, an army of suits and no less than 164 Vice Presidents whose titles include the words “Big Data.”

When IBM announced its Spark initiative I joked that somewhere in the bowels of IBM, someone will want to put Spark on a mainframe.  Color me prophetic.

It’s too early to tell what substantive contributions IBM will make to Spark.  Unlike Mesosphere, Typesafe, Tencent, Palantir, Cloudera, Hortonworks, Huawei, Shopify, Netflix, Intel, Yahoo, Kixer, UC Berkeley and Databricks, IBM did not help test Release 1.5 in September.  This is a clear miss, given the scope of IBM’s resources and the volume of hype it puts out about its commitment to the project.

All that said, IBM brings respectability, and the assurance that Spark is ready for prime time.  This is priceless.  Since IBM’s announcement, we haven’t heard a peep from the folks who were snarking at Spark earlier this year.

Cloudera Announces “One Platform” Initiative

In September, Cloudera announced its One Platform initiative to unify Spark and Hadoop, an announcement that surprised everyone who thought Spark and Hadoop were already pretty well integrated.  As with the IBM announcement, the symbolism matters.  Some analysts took this announcement to mean that Cloudera is replacing MapReduce with Spark, which isn’t exactly true.  It’s fairer to say that in Cloudera’s vision, Hadoop users will rely more on Spark in the future than they do today, but MapReduce is not dead.

The “One Platform” positioning has more to do with Cloudera moving to stem the tide of folks who use Spark outside of Hadoop.  According to Databricks’ recent Spark user survey, only 40% use Spark under YARN, with the rest running in a freestanding cluster or on Mesos.  It’s an understandable concern for Cloudera; I’ve never heard a fish seller suggest that we should eat less fish.  But if Cloudera thinks “One Platform” will stem that tide, it is mistaken.  It all boils down to use cases, and there are many use cases for Spark that don’t need Hadoop’s baggage.

Microsoft Builds Credibility in Analytics

In 2015, Microsoft took some big steps to demonstrate that it offers serious solutions for analytics.  The acquisition of Revolution Analytics, announced in January, was the first step; in one move, Microsoft acquired a highly skilled team and valuable software assets.  Since the acquisition, Microsoft has rolled Revolution’s enhanced R distribution into SQL Server and Azure, opening both platforms to the large and growing R community.

Microsoft’s other big move, in February, was the official launch of Azure Machine Learning (AML).   First released in beta in June 2014, AML is both easy to use and powerful.  The UI is simple to understand, and documentation is excellent; built-in analytic functionality is very rich, and the tool is extensible with custom R or Python scripts.  Microsoft’s trial user program is generous, and clearly designed to encourage adoption and use.

Azure Machine Learning contrasts markedly with Amazon Machine Learning.  Amazon’s offering remains a skeleton, with minimal functionality and an API only a developer could love.  Microsoft is clearly making a play for the data science market as a way to leapfrog Amazon.  If analytic capabilities are driving your choice of cloud platform, Azure is by far your best option.

SQL Engines Proliferate

At the beginning of 2015, there were two main options for SQL on Hadoop: Hive for batch SQL and Impala for interactive SQL.  Spark SQL was still in Alpha; Drill was a curiosity; and Presto was something used at Facebook.

Several things happened during the year:

  • Hive on Tez established rough performance parity with the fast SQL engines.
  • Spark SQL went to general release, stabilized, and rolled out the DataFrames API.
  • MapR promoted Drill, and invested in improvements to the software.  Also, MapR’s Drill team spun off and started Dremio to provide commercial support.
  • Cloudera donated Impala to open source, and Pivotal donated HAWQ.
  • Teradata placed its chips on Presto.

While it’s great to see so many options emerge, Hive continues to win actual evaluations.  Given Hive’s large user and contributor base and existing stock of programs, it’s unclear how much traction Hive alternatives have now that Hive on Tez offers competitive performance.  Obviously, Cloudera doesn’t think Impala offers a competitive advantage anymore, or they would not have donated the assets to Apache.

The other big news in SQL is TPC’s release of a benchmarking standard for decision support with Big Data.

OLAP on Hadoop Gets Real

For folks seeking to perform dimensional analysis in Hadoop, 2015 delivered not one but two options.  The open source option, Apache Kylin, originally an eBay project, just recently graduated to Apache top level status.  Adoption is limited at present, but any project used by eBay and Baidu is worth a look.

The commercial option is AtScale, a company that emerged from stealth in April.  Unlike BI-on-Hadoop vendors like Datameer and Pentaho, AtScale provides a dimensional layer designed to work with existing BI tools.  It’s a nice value proposition for companies that have already invested big time in BI tools, and don’t want to add another UI to the mix.

Funding for Machine Learning

H2O.ai’s recently announced B round is significant for a couple of reasons.  First, it validates H2O.ai’s true open source business model; second, it confirms the continued growth and expansion of the user base for H2O as well as H2O.ai’s paid subscription base.

Like Sherlock Holmes’ dog that did not bark, two companies are significant because they did not procure funding in 2015:

  • Skytree, whose last funding round closed in April 2013, churned its executive team and rebranded a couple of times.  It finally listed some new customers; interestingly, some are investors and others are affiliated with members of Skytree’s Board.
  • Alpine Data Labs, last funded in November 2013, struggled to distance itself from the Pivotal ecosystem.  Designed to run on Greenplum, Alpine offers limited functionality on Hadoop, which makes it unclear how this company survives.

Palantir continued to suck up capital like a whale feeding on krill.

Google TensorFlow

Google open sourced TensorFlow, so now we have sixteen open source Deep Learning frameworks instead of just fifteen.

Big Analytics Roundup (August 31, 2015)

Top stories for the penultimate week of summer: an excellent SQL-on-Hadoop benchmark; a couple of stories about Gelly, Flink’s graph engine; Apache Ignite goes top-level; a preview of Spark 1.5; and new stuff from RStudio.

Also, on Slideshare, evil mad scientist Paco Nathan presents on “Uber for Education.”

SQL on Hadoop

I missed this story in June, but better late than never.  The folks at Allegro.tech, a Warsaw-based collaborative, published results of an excellent benchmark of SQL-on-Hadoop technologies.  Scope of the analysis included Hive on MapReduce (the “control”), Hive on Tez, Presto, Impala, Drill and Spark SQL.  (The authors note that they wanted to evaluate Hive on Spark, but could not make it work.)

The Allegro team first evaluated Kerberos support, YARN deployment and query fault tolerance, the available UI, JDBC support, UDF and view support as well as support for each of CSV, JSON, Avro and Parquet formats. For benchmarking, they used 11 HiveQL queries testing a mix of typical analytic tasks.

Some key findings:

  • Hive on Tez: ran all queries with stable and satisfactory performance
  • Spark SQL: better than average performance overall, but could not run two queries
  • Presto: convenient to use, but performance was disappointing
  • Impala: fastest overall, but could not run one of the queries
  • Drill: very fast, but could not run three queries

Apache Flink/Data Artisans

On Slideshare, Vasia Kalavri presents an overview of Gelly, Flink’s graph engine. More about Gelly here.

Apache Ignite/GridGain

The Apache Software Foundation promotes Ignite to top-level project status. SD Times reports. Ignite is a high-performance integrated and distributed in-memory platform. Ignite is the open source version of GridGain’s commercial product.

Apache Lens

ASF also promotes Lens to top-level status.  Apache Lens is a “Unified Analytics Platform”, whatever that is.  (h/t Hadoop Weekly)

Apache Spark/Databricks

Patrick Wendell of Databricks presented a preview of Spark 1.5 last Thursday.    Spark 1.5 will be available in mid-September (exact timing depends on Apache voting process).  Developers from more than 50 companies contributed to the build.  A preview is available in Databricks now.  Key enhancements:

  • Execution concepts will be exposed: tracking memory usage, visualizing DataFrame execution tree
  • Project Tungsten will be on by default: binary processing for memory management, code generation for CPU efficiency
  • Performance optimizations in SQL/DataFrames: Metadata discovery, predicate pushdown in Parquet, outer joins and window functions
  • First class UDAF support
  • Improved interoperability with Hive
  • Read Parquet files encoded by Hive, Impala, Pig, Avro, Thrift, Spark SQL
  • Additional Python interfaces for Spark Streaming
  • R bindings for linear models
  • Python bindings for Power Iteration Clustering
  • New algorithms and transforms for ML Pipelines

There will also be some new packages available concurrently with the 1.5 release, including support for AWS Redshift, Magellan support for spatial analytics and a convex solver package.

On Datanami, George Leopold covers the story.

Alex Woodie interviews some Spark users and discovers that they often use it together with Hadoop.

Jessica Twentyman notes that Spark looks set to replace MapReduce, and inquires into the pace, scope and scale of replacement. She finds a lot of smart people who are optimistic and a few who urge caution, citing Spark’s immaturity.

Darryl Taft explains how Spark transforms Big Data processing and development.  Spoiler: it’s faster.

In readwrite, Peter Schlampp provides six reasons that Apache Spark isn’t flickering out, thereby answering a question nobody is asking.  For the record, his reasons are: advanced analytics, simplification, support for multiple languages, faster results, Hadoop distribution agnosticism and high-growth adoption.

On the Cloudera blog, Jeff Palmucci of TripAdvisor describes how his team uses Spark.

Google Cloud

Google announces a new release of BigQuery with UDF support.

H2O.ai

On HomeAI, Arno Candel presents a Deep Learning Webinar.

RStudio

RStudio adds a new starter plan for shinyapps.io, a cloud service for Shiny apps.  Roger Oberg reports.