Big Analytics Roundup (July 18, 2016)

We have lots of fresh material to read on the beach this week — most notably, the “read of the week” below, which might be better labeled as the “read of the year.”  We have another streaming engine to kick around, a slew of earnings releases in the coming week, and some new releases from GraphLab Dato Turi.

If you haven’t already completed Databricks’ Spark survey, stop reading this and go do the survey.

On Wednesday, July 20, Teradata presents results of an “independent” benchmark of SQL on Hadoop engines, including Hive, Impala, Presto, and SparkSQL. Missing from the mix: Teradata Aster.

Call for Papers

CFP is open for Apache: Big Data Europe in Seville. The conference runs November 14-16; the CFP closes September 9.

Read of the Week

Stop building data cathedrals; instead, build data bazaars. Adrian Colyer explains.

Yet Another Streaming Engine

The folks at Concord.io benchmark their product against Spark 1.6; not surprisingly, the results favor Concord.io. In Datanami, Alex Woodie touts the results. He should read his own summary of the recent OpsClarity survey, which contained this nugget:

[Screenshot: excerpt from the OpsClarity survey summary]

In other words, the whole debate about “true streaming” versus micro-batching is irrelevant to most organizations because they don’t need subsecond performance. It’s like arguing that a Ferrari is better than a Toyota Camry because the sports car can go 180 mph. Here in Mudville, you’ll be arrested if you go that fast, so the Camry’s big trunk and rear seat leg room look pretty good.

Performance is cool. But performance tests will not settle the contest among the current spate of streaming engines. Commercial support, integration, depth of features, security and stability will determine which engines survive the shakeout.

Second Quarter Earnings Roundup

Five of the top six Business Analytics software vendors tracked by IDC are public companies, with quarterly earnings reports. (SAS is privately held). Here is the outlook for earnings releases:

— Oracle’s fiscal year ends May 31. Oracle does not report analytics revenue separately. For the fiscal quarter ended May 31, 2016, Oracle reports that growth in revenue from SaaS and PaaS cloud services barely offset a 12% decline in software license revenue, for overall flat software and services revenue.

— SAP expects to release Q2 financial results on Wednesday, July 20.

— Declining giant IBM will announce another quarter of fail on Monday, July 18.

— Microsoft will announce quarterly and fiscal year-end results on Tuesday, July 19.

— Teradata, like SAP, IBM, and Microsoft, closed the second quarter on June 30, but can’t crunch the numbers until Tuesday, August 2. Keep that in mind the next time TDC tries to sell you on their fast number crunching capabilities.

Explainers

— Ravelin’s Stephen Whitworth explains how to do real-time fraud detection with Google BigQuery.

— Carol McDonald explains how to use Spark’s Random Forests capability, demonstrating with a loan credit risk dataset.

— Three more papers from Adrian Colyer:

  • Ambry: LinkedIn’s scalable geo-distributed object store.
  • Spheres of influence for viral marketing.
  • Progressive skyline computation.

— On the Hortonworks blog, Roshan Naik and Sapin Amin explain how they benchmarked performance improvements in Apache Storm 1.0.

— Jules Damji explains Spark APIs: RDDs, DataFrames, and Datasets.

— Lewis Gavin offers five tips to improve the performance of Spark apps.

— Qubole’s Rajat Venkatesh explains how to optimize queries with materialized views and Quark, Qubole’s SQL abstraction layer.

— In a recorded webinar, Hossein Falaki and Denny Lee explain how to perform exploratory analysis on large datasets with Spark and R.

— On the Revolutions blog, Joe Rickert explains the capabilities of several new R packages in CRAN.

— Barath Ravichander explains how to use R with SQL.

— Microsoft’s Sheri Gilley explains the ins and outs of SQL Server, PowerBI, and R.

— Roel M. Hogervorst explains how to submit an R package to CRAN. Bob Rudis elaborates.

— The Rcpp package enables R packages to leverage C or C++ code.  Dirk Eddelbuettel reveals that more than 700 CRAN packages now use Rcpp.

Perspectives

— On KDnuggets, deep learning mavens offer predictions about deep learning.

— Daniel Gutierrez interviews MapR’s Jack Norris, who is very excited about MapR.

— Alex Woodie describes Prama, TransUnion’s open source analytics platform built on MapR and Apache Drill.

Open Source Announcements

— Basho donates Riak TS for time series analysis to open source.

— Microsoft announces Microsoft R Client, a free development tool for use with Microsoft R Open.

— Apache Atlas announces version 0.7.0 – incubating.

Commercial Announcements

— GridGain, the company behind Apache Ignite, reports a 300X sales increase in the first half of 2016, which is not too surprising since the company was in stealth mode until last January.

— Microsoft announces GA for Azure SQL Data Warehouse, which may surprise those who thought it was already GA.

— GraphLab Dato Turi announces the release of GraphLab Create 2.0, Turi Distributed and Turi Predictive Services. Marketing staff works feverishly to change brand names on all documents.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.

Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report. (Fixed broken link).

Top Read of the Week

Guoqiang Jerry Chen, et al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2014. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Big Analytics Roundup (June 27, 2016)

We have announcements from BlueData, Databricks, and DataStax this week, plus a nice crop of explainers. Also, a bit of catch-up, something from May that I missed: Bob Hayes publishes an interesting summary of his recent survey of data scientists. Includes an infographic and slides.

Thiemo Fetzer asks: did the weather affect the Brexit vote? Spoiler: he says no.

Presented without comment: Medical Information Records, Inc, says it uses Microsoft Azure Cloud to reduce postoperative nausea and vomiting.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Spark Summit Europe, Brussels, October 25-27 (closing date July 1)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Explainers

— On the Databricks blog, Denny Lee and Jules Damji explain key Spark terms.

— Adrian Colyer explains chatbots.

— Aaron Schumacher explains how to get started with TensorFlow.

— Allan Engelhardt explains Microsoft’s R-based analytics capabilities.

— ThinkReactive’s Deenar Torasker explains visualization using HTML5, SVG, CSS, D3 and Javascript InfoVis Toolkit.

— Brandon Butler explains what’s inside Cisco’s Tetration analytics platform.

— On the BlueData blog, Anant Chintamaneni explains BlueData in the public cloud.

— Manjeet Chayel explains how to analyze streaming data from Kinesis with Spark Streaming and Zeppelin.

— More Spark Streaming: on the Cloudera blog, Jan Kunigk explains how to detect web traffic anomalies with Flume, Spark Streaming, and Impala.

Perspectives

— Robert Hof interviews Hortonworks’ CEO Rob Bearden but does not ask him about the company’s market value, currently about a third of what it was at IPO.

— GridGain plants a piece by CEO Dmitriy Setrakyan suggesting that Apache Spark and Apache Ignite work well together.

— Dave Ramel touts something called Koverse, which does everything.

— In The Register, Billy MacInnes assesses the new IBM. He doesn’t approve.

— In a surprisingly ill-informed piece, Serdar Yegulalp argues that four languages pose a challenge to Python: Swift, Go, Julia, and R.

— In Forbes, Bernard Marr summarizes a study which proposes to explain why analytics investments have yet to pay off. The study does not live up to its premise, as it fails to show that analytics investments have not paid off.

— Srini Penchikala reviews Big Data Analytics with Spark and interviews the author.

Commercial Announcements

— Databricks announces a strategic partnership agreement and investment from In-Q-Tel, a not-for-profit organization that supports the U.S. Intelligence Community.

— DataStax announces DataStax Enterprise 5.0, with new stuff. I don’t see anything really exciting not previously announced.

— BlueData announces availability of its EPIC Big-Data-as-a-Service on public cloud — AWS, Azure, Google and “other”.

Big Analytics Roundup (April 25, 2016)

Mesosphere wins the internet this week with its announcement that it has open sourced DC/OS, its datacenter virtualization project built around Apache Mesos. While not an “analytics” project per se, DC/OS has the potential to transform how organizations provision and deploy their analytics platforms.

In a nutshell, Apache Mesos distributes workloads across physical IT resources. DC/OS adds a container orchestration platform; installation, management and monitoring tools; and improvements to networking, security, load balancing and other areas. For more details about DC/OS and why it matters, read this white paper by Benjamin Hindman and Edward Hsu of Mesosphere.

Mesosphere has assembled an alliance of 61 launch partners, including tech vendors, systems integrators and potential users. Big brands include Accenture, Capgemini, Cisco, EMC, HPE, MapR, Microsoft and Verizon. Notable startups include Alluxio, Canonical, Confluent, Lightbend and MemSQL.

Analysts chime in:

  • Gavin Clarke thinks Google forced Mesosphere’s hand by open sourcing Kubernetes.
  • Mike Wheatley notes that many of the components were already open source.
  • On TechCrunch, Frederic Lardinois reports and comments.
  • In Computerworld, John Ribeiro reports.
  • Janakiram MSV wonders if DC/OS will emerge as an alternative to Kubernetes.
  • Sam Dean surveys the project and interviews Ben Hindman.
  • George Leopold notes the scope of the DC/OS ecosystem.
  • Joao Lima reports.

DC/OS ships with more than 30 open source packages ready to install as DC/OS services. Notable among them: Cassandra, Elasticsearch, Kafka, MemSQL, Spark, Storm and Zeppelin.

Explainers

— Andrie de Vries explains how he scraped CRAN to trace the growth in R packages.

— On the Cloudera Engineering blog, David Alves explains how to use Impala and Kudu for analytic workloads.

— Michael Hunger and William Lyon explain how they analyzed the Panama Papers with Neo4j.

— On the Microsoft Azure blog, Liam Cavanagh explains how to optimize document search in Azure.

— Adrian Colyer of the morning paper summarizes five papers on word vectors, reviews Global Vectors for Word Representation, delivers an overview of Deep Learning and covers ImageNet classification with deep convolutional neural networks.

— Mario Inchiosa and Roni Burd explain how Microsoft R Server delivers an R interface to Spark in HDInsight.

Perspectives

— In MIT Technology Review, Tom Simonite interviews Google’s Jeff Dean, contributor to Spanner, Translate, BigTable, MapReduce, Google Brain, LevelDB and TensorFlow. They discuss the future of machine learning.

— David Weldon went to Strata and interviewed some people:

  • Ali Hodroj of GigaSpaces, a cloud enabling company. Hodroj is bullish on cloud.
  • H2O.ai’s Arno Candel, who is surprised that so many people are talking about Spark.
  • Nikita Ivanov of GridGain, who says that people are excited about in-memory computing.
  • DataArtisans’ Kostas Tzoumas, who thinks that more people would use Flink if they were better educated.

— Alex Woodie touts Apache Beam, the open source implementation of Google’s Cloud Dataflow, which aspires to unify everything.

— James Nunns surveys ten Big Analytics startups: Confluent, H2O.ai, AtScale, Interana, Tamr, Wavefront, BlueTalon, Cazena, DataTorrent and Databricks.

— In Silicon Angle, Wikibon’s Paul Gillin interviews Wikibon’s George Gilbert, who is bullish on Spark.

— John Leonard ruminates on Hadoop, noting the proliferation of cute animal logos, and the challenges of the open source business model.

— Sam Dean notices that there are quite a few new open source tools for machine learning.

— Jack Vaughan summarizes the educational challenges posed by machine learning.

Commercial Announcements

— Dataiku announces availability of Data Science Studio on Microsoft Azure.

— GridGain announces availability of a support package for Apache Ignite that includes its Professional Edition — essentially the same as Apache Ignite, with more frequent maintenance releases and some LGPL libraries.

— MemSQL announces closing on a $36 million “C” round. All existing investors participated, plus two new investors.

Big Analytics Roundup (March 21, 2016)

Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.

— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.

— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:

  • Most popular technologies for math and data: Python and SQL.
  • Top paying technologies: Spark and Scala.
  • Top paying tech for data scientists: Scala, Spark and Hadoop.
  • Top tech stack for data scientists: Python + R + SQL.
  • Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
  • Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
  • Biggest challenge at work (all respondents): Unrealistic expectations.
  • Purchasing power of developers in South Africa: 25,713 Big Macs per year.

— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.

— Daniel Chalef of Domino Data analyzes data from Google Trends and StackOverflow and discovers that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchen’s blog on the popularity of analytics software. Search interest is one data point; Bob’s work with job postings offers a better picture of the actual state of the market.

— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.

Explainers

— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.

— This week from the morning paper:

  • Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
  • … explains social engineering attacks and potential defenses.
  • … explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scalability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.
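
To put the TensorFlow scaling figure in perspective, it helps to restate it as parallel efficiency: speedup divided by node count. Here is a minimal sketch in Python. It treats 12X as an upper bound, since the quoted figure is "a little less than 12X", and the function name is illustrative rather than anything from the paper:

```python
def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup / nodes."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


# Figures quoted above: roughly 12X on 32 nodes (treated as an upper bound).
efficiency = parallel_efficiency(12.0, 32)
print(f"Parallel efficiency: {efficiency:.1%}")  # prints "Parallel efficiency: 37.5%"
```

In other words, each of the 32 nodes contributes a bit more than a third of what it would under perfect linear scaling.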

— MapR’s Tugdual Grall explains what Spark is, what it does, and what sets it apart.

— In SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.

— On the Databricks blog, Denny Lee et al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.

— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.

— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.

Perspectives

— Jack Vaughan interviews some old guy who thinks Spark is a thing.

— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.

— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.

— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design.  Pure Hadoop is simple: MapReduce and HDFS.  Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.

— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python.  He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.

— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.

— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.

— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”

— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.

— In Martech Advisor, Ankush Gupta Mar interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.

— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.

— Srini Penchikala interviews the authors of Spark in Action, an upcoming book from Manning.

Teradata Watch

— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.

— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”

Open Source Announcements

— Three announcements from Apache projects:

  • Apex announces release 3.3.1 of the Malhar library, a maintenance release.
  • Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
  • Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.

Commercial Announcements

— SAP announces general availability for SAP HANA Vora, a tool that enables HANA users to query data in Hadoop and other distributed storage platforms through Spark. In CIO, Thor Olavsrud reports.

— Dataiku announces that it has hired two new Veeps to drive expansion in North America.

— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.

— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.

Big Analytics Roundup (October 12, 2015)

Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics.  Dell acquired StatSoft in 2014, but nothing before or since suggests that Dell knows how to position and sell analytics.  StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.

EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica.  It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances.  Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that isn’t really an appliance.

EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf.  Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.

What’s left of Pivotal Software is a consulting business, which is fine — all of the big tech companies have consulting arms.  But I doubt that the software assets — Greenplum, Hawq and MADLib — have legs.

In other news, the Apache Software Foundation announces three interesting software releases:

  • Apache Accumulo: Release 1.6.4, a maintenance release.
  • Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
  • Apache Kafka: Release 0.8.2.2, a maintenance release.

Spark

On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.

Spark Performance

Dave Ramel summarizes a paper he thinks is too long for you to read.  That paper, here, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads.  As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.

Spark Appliances

Ordinarily I don’t link sponsored content, but this article from Numascale is interesting.  Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.

Spark on Amazon EMR

On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR.  The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.

Use Cases

In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.

Stitch Fix offers personalized style recommendations to its customers.  Jas Khela describes how the startup uses Spark.   (h/t Hadoop Weekly)

SQL/OLAP/BI

Apache Drill

MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data.  Without a trace of irony, she explains how to bring SQL to NoSQL datastores.

Apache Hawq

On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib.  Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library.   MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried to sell but failed to do so.  In Datanami, George Leopold reports.

In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.

At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive.  The endorsement from GE’s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.

Apache Phoenix

At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase.  Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver.  SQL support is broad enough to run TPC benchmark queries.  Dimiduk also introduces Apache Calcite, a Query parser, compiler and planner framework currently in incubation.

Data Blending

On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.

Presto

On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR.  Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.

Machine Learning

Apache MADLib

MADLib is an open source project for machine learning in SQL.  Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community.  Machine learning functionality is quite rich.  Currently, MADLib supports PostgreSQL, Greenplum database and Apache Hawq.  In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)
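
To make the "any SQL engine that supports UDFs" point concrete, here is a minimal sketch of in-database model fitting with a user-defined aggregate. It uses Python's built-in sqlite3 module purely as a stand-in. SQLite is not one of MADLib's supported engines, and the aggregate name `linregr_slope`, the table `obs` and the toy data are hypothetical, not MADLib's API:

```python
import sqlite3


class LinregrSlope:
    """User-defined aggregate: least-squares slope of y on x.

    Hypothetical illustration only; this is not a MADLib function.
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def step(self, y, x):
        # Called once per row; accumulate the sufficient statistics.
        if x is None or y is None:
            return
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def finalize(self):
        # Closed-form slope from the accumulated sums.
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            return None
        return (self.n * self.sxy - self.sx * self.sy) / denom


conn = sqlite3.connect(":memory:")
conn.create_aggregate("linregr_slope", 2, LinregrSlope)
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
# Toy data lying exactly on y = 2x + 1.
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, 3), (2, 5), (3, 7), (4, 9)])

slope = conn.execute("SELECT linregr_slope(y, x) FROM obs").fetchone()[0]
print(slope)  # 2.0
```

The model is fit by the query itself, so the data never leaves the database; that is the appeal of the approach.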

Apache Spark (SparkR)

On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.

Apache Spark (MLLib)

On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.

Apache Zeppelin

At Apache Big Data Europe, DataStax’s Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.

H2O and Spark (Sparkling Water)

In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.

How Yahoo Does Deep Learning on Spark

Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark.  Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node.  The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.

Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC).  In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)

Streaming Analytics

Apache Flink

On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.

Apache Kafka

Cloudera’s Gwen Shapira presents Kafka “worst practices”: true stories that happened to Kafka clusters. (h/t Hadoop Weekly)

Apache Spark Streaming

On the MapR blog, Jim Scott offers a guide to Spark Streaming.

Big Analytics Roundup (August 31, 2015)

Top stories for the penultimate week of summer: an excellent SQL-on-Hadoop benchmark; a couple of stories about Gelly, Flink’s graph engine; Apache Ignite goes top-level; a preview of Spark 1.5; and new stuff from RStudio.

Also, on Slideshare, evil mad scientist Paco Nathan presents on “Uber for Education.”

SQL on Hadoop

I missed this story in June, but better late than never.  The folks at Allegro.tech, a Warsaw-based collaborative, published results of an excellent benchmark of SQL-on-Hadoop technologies.  Scope of the analysis included Hive on MapReduce (the “control”), Hive on Tez, Presto, Impala, Drill and Spark SQL.  (The authors note that they wanted to evaluate Hive on Spark, but could not make it work.)

The Allegro team first evaluated Kerberos support, YARN deployment and query fault tolerance, the available UI, JDBC support, UDF and view support as well as support for each of CSV, JSON, AVRO and Parquet formats.  For benchmarking, they used 11 HiveQL queries testing a mix of typical analytic tasks.

Some key findings:

  • Hive on Tez: ran all queries with stable and satisfactory performance
  • Spark SQL: better than average performance overall, but could not run two queries
  • Presto: convenient to use, but performance was disappointing
  • Impala: fastest overall, but could not run one of the queries
  • Drill: very fast, but could not run three queries

Apache Flink/Data Artisans

On Slideshare, Vasia Kalavri presents an overview of Gelly, Flink’s graph engine.  More about Gelly here.

Apache Ignite/GridGain

The Apache Software Foundation promotes Ignite to top-level project status.  SD Times reports.  Ignite is a high-performance integrated and distributed in-memory platform.  Ignite is the open source version of GridGain’s commercial product.

Apache Lens

ASF also promotes Lens to top-level status.  Apache Lens is a “Unified Analytics Platform”, whatever that is.  (h/t Hadoop Weekly)

Apache Spark/Databricks

Patrick Wendell of Databricks presented a preview of Spark 1.5 last Thursday.  Spark 1.5 will be available in mid-September (exact timing depends on the Apache voting process).  Developers from more than 50 companies contributed to the build.  A preview is available in Databricks now.  Key enhancements:

  • Execution concepts will be exposed: tracking memory usage, visualizing DataFrame execution tree
  • Project Tungsten will be on by default: binary processing for memory management, code generation for CPU efficiency
  • Performance optimizations in SQL/DataFrames: Metadata discovery, predicate pushdown in Parquet, outer joins and window functions
  • First class UDAF support
  • Improved interoperability with Hive
  • Read Parquet files encoded by Hive, Impala, Pig, Avro, Thrift, Spark SQL
  • Additional Python interfaces for Spark Streaming
  • R bindings for linear models
  • Python bindings for Power Iteration Clustering
  • New algorithms and transforms for ML Pipelines

There will also be some new packages available concurrently with the 1.5 release, including support for AWS Redshift, Magellan support for spatial analytics and a convex solver package.

On Datanami, George Leopold covers the story.

Alex Woodie interviews some Spark users and discovers that they often use it together with Hadoop.

Jessica Twentyman notes that Spark looks set to replace MapReduce, and inquires into the pace, scope and scale of replacement.  She finds a lot of smart people who are optimistic and a few who urge caution, citing Spark’s immaturity.

Darryl Taft explains how Spark transforms Big Data processing and development.  Spoiler: it’s faster.

In readwrite, Peter Schlampp provides six reasons that Apache Spark isn’t flickering out, thereby answering a question nobody is asking.  For the record, his reasons are: advanced analytics, simplification, support for multiple languages, faster results, Hadoop distribution agnosticism and high-growth adoption.

On the Cloudera blog, Jeff Palmucci of TripAdvisor describes how his team uses Spark.

Google Cloud

Google announces a new release of BigQuery with UDF support.

H2O.ai

On HomeAI, Arno Candel presents a Deep Learning Webinar.

RStudio

RStudio adds a new starter plan for shinyapps.io, a cloud service for Shiny apps.  Roger Oberg reports.