Big Analytics Roundup (May 16, 2016)

This week we have more insight into Spark 2.0, scheduled for release just before Spark Summit 2016. (Yes, I’m going.) Also, kudos to BI-on-Hadoop startup AtScale for a new round of funding; Amazon releases YADLF (Yet Another Deep Learning Framework); and there are a number of new faces at H2O.ai.

Plus, we have an extended review of the Palantir story.

Buzzfeed on Palantir

Last week, I deemed Buzzfeed’s story on Palantir too dumb to link. (“Forget it, Jake. It’s Buzzfeed.”) Buzzfeed “news” reporter William Alden, who was all over a story about maggots in Facebook lunches, breathlessly mines a cache of “secret internal documents” and discovers:

  • Palantir expects employee turnover of around 20% for 2016.
  • Palantir lost some clients.
  • Palantir books more work than it bills.

Does Palantir have an employee turnover problem?  No. A 20% turnover rate is slightly above the 17% reported for all industries in 2015, and about on track for Silicon Valley. (There are companies in SV with 100% turnover rates.) On Glassdoor, employees give Palantir high marks.

Does Palantir have a client retention problem? Not exactly. The story cites four clients (American Express, Coca-Cola, Kimberly-Clark and Nasdaq) who engaged Palantir to conduct a pilot, then decided not to proceed with a long-term contract. In other words, these are lost sales, not cancelled contracts. The document Buzzfeed obtained is Palantir’s won/lost analysis, which shows that the company is attempting to learn from its lost sales.

Does Palantir have a revenue problem? No. Palantir’s 2015 revenue was up 50% from the previous year. Buzzfeed obsesses over the difference between Palantir’s bookings of $1.7 billion and its revenue of $420 million. A high book-to-bill ratio  is typical for consultancies that pursue large multi-year projects; it is a sign of strong demand for the company’s services. Under GAAP accounting, companies can accrue revenue only as work is performed, even if they bill the work in advance. Note that consulting giant Accenture’s bookings exceed its revenue for its most recent quarter.
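For anyone who wants to see the two numbers side by side, here is a back-of-the-envelope sketch in Python. The inputs are the bookings and revenue figures Buzzfeed reported; the function name is my own, and dividing full-year bookings by full-year revenue is a simplification, since a book-to-bill ratio is normally computed for a single reporting period.

```python
def book_to_bill(bookings: float, revenue: float) -> float:
    """Return bookings divided by recognized revenue for the same period."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return bookings / revenue


# Figures as reported in the Buzzfeed story (US dollars).
palantir_bookings = 1.7e9
palantir_revenue = 420e6

ratio = book_to_bill(palantir_bookings, palantir_revenue)
print(f"Book-to-bill: {ratio:.2f}")
```

Any value above 1.0 means the firm is signing work faster than it recognizes revenue, which is the pattern described above for consultancies with large multi-year projects.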

Does Palantir have a profitability problem? Possibly. Buzzfeed reports that the company lost $80 million last year on revenue of $420 million. Consulting margins tend to be fairly high, so a loss means that Palantir is “investing” in a lot of unbillable work. It’s hard to say if these “investments” will pay off. Palantir closed another round of funding in December, 2015, so people with more and better information than Buzzfeed obviously think they will, and are backing up their belief with cash.

By the way, you know who has an actual revenue problem? Buzzfeed.

Roger Peng attempts to draw lessons for data scientists from the Buzzfeed story, without questioning its premises. He should stick to Biostatistics.

Spark 2.0

— Databricks announces preview of Apache Spark 2.0 on Databricks Community Edition.

— From last week: Reynold Xin explains what’s new in Spark 2.0.

— Dave Ramel summarizes the new features, including faster SQL; consolidation of the Dataset and DataFrame APIs; support for ANSI (2003) SQL; and Structured Streaming, an integrated view of tables and streams.

— Now that Spark 2.0 is in preview, MapR offers Spark 1.6.1.

Explainers

— Four from Adrian Colyer:

— Richard Williamson explains how to build a streaming prediction engine with Spark, MADlib, Kudu and Impala.

— On the Cloudera Vision blog, Santosh Kumar explains Hive-on-Spark.

— DataStax’ Dani Traphagen explains data processing with Spark and Cassandra.

— In ZDNet, Andrew Brust explains Microsoft’s R strategy, and gets it right.

Perspectives

— For a planted article in Linux.com, Pam Baker interviews IBM’s Mike Breslin, who answers questions nobody is asking about using Spark and Cloudant.

— Joyce Wells recaps a presentation by Booz Allen’s Jair Aguirre, who touts Apache Drill.

— Alex Woodie attends the Apache: Big Data 2016 conference and discovers open source projects.

— In Business Insider, Sam Shead describes FBLearnerFlow, a workbench for machine learning and AI.

— Leslie D’Monte describes some ways companies use machine learning in their operations.

Open Source Announcements

— Google announces release to open source of SyntaxNet, a framework for natural language understanding. Included in the release: an English parser dubbed Parsey McParseface. Journalists respond to the latter like dogs to a squirrel.

— Amazon releases yet another deep learning framework, this one branded as “Deep Scalable Sparse Tensor Network Engine (DSSTNE)” or “Destiny”. Stephanie Condon reports.

— Salesforce donates PredictionIO to Apache.

— Apache Storm announces two new maintenance releases:

  • Storm 0.10.1 has bug fixes.
  • Storm 1.0.1 has performance improvements and bug fixes.

— Apache Flink announces Release 1.0.3, with bug fixes and improved documentation.

— Apache Apex pushes a release to resolve a security issue.

Commercial Announcements

— BI-on-Hadoop startup AtScale announces an $11 million “B” round. Media coverage here.

— H2O.ai announces new hires with a strong orientation towards visualization, suggesting the company plans to add a more robust user interface to its best-in-class machine learning engine.

Big Analytics Roundup (April 11, 2016)

Top story of the week is NVIDIA’s new DGX-1 deep learning chip; scroll down for more on that.

We have three roundups from Strata + Hadoop World, Rashomon style:

  • Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
  • Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
  • Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.

— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.

— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July, 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. Appears that Context Relevant isn’t the next unicorn.

— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:

Screen Shot 2016-04-10 at 7.52.54 AM

— Forrester publishes its 2016 “Wave” for Big Data Streaming Analytics. You can go here and buy it for $2,495, get a free copy here, or just look at the picture below.

Screen Shot 2016-04-10 at 3.52.54 PM

— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.
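To see how small those counts are relative to the stated universe, here is a quick sketch in Python. The three inputs come straight from the figures above; the helper name is mine, and the calculation says nothing about how well the document search actually captures Hadoop usage.

```python
def share_pct(count: int, universe: int) -> float:
    """Return count as a percentage of the universe."""
    if universe <= 0:
        raise ValueError("universe must be positive")
    return 100.0 * count / universe


# Counts as reported by Spiderbook.
using_hadoop = 2_680
learning_hadoop = 3_500
all_companies = 500_000

using_pct = share_pct(using_hadoop, all_companies)
learning_pct = share_pct(learning_hadoop, all_companies)
combined_pct = share_pct(using_hadoop + learning_hadoop, all_companies)

print(f"Using Hadoop:    {using_pct:.2f}%")
print(f"Learning Hadoop: {learning_pct:.2f}%")
print(f"Combined:        {combined_pct:.2f}%")
```

Taken at face value, the combined figure is a little over one percent of the companies examined.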

— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.

NVIDIA Unveils Deep Learning Chip

— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer in a box. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.

— Selected media reports:

— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.

Explainers

— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.

— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.

— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.

— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.

— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.

— Katrin Leinweber et al. explain how to use KNIME to analyze an assay of bacteria-induced biofilm formation in the freshwater diatom Achnanthidium minutissimum. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.

Perspectives

— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point by point assessment.

— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.

— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.

— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,” where the merchant is asked to request identification or speak with the call center.

— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.

— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.

— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.

— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.

— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.

Open Source Announcements

— Qubole releases SQL optimizer Quark to open source.

— Flink releases version 1.0.1, a maintenance release.

— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.

— Airbnb open sources Caravel, a data exploration package.

— Apache Tajo announces Release 0.11.2, which should please its user.

— LinkedIn releases Dr. Elephant to open source.

Commercial Announcements

— Databricks announces the agenda for Spark Summit 2016 in SFO.

— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.

— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.

Spark Summit Europe Roundup

The 2015 Spark Summit Europe met in Amsterdam October 27-29.  Here is a roundup of the presentations, organized by subject areas.   I’ve omitted a few less interesting presentations, including some advertorials from sponsors.

State of Spark

— In his keynote, Matei Zaharia recaps findings from Databricks’ Spark user survey and notes growth in summit attendance, meetup membership and contributor headcount.  (Video here). Enhancements expected for Spark 1.6:

  • Dataset API
  • DataFrame integration for GraphX, Streaming
  • Project Tungsten: faster in-memory caching, SSD storage, improved code generation
  • Additional data sources for Streaming

— Databricks co-founder Reynold Xin recaps the last twelve months of Spark development.  New user-facing developments in the past twelve months include:

  • DataFrames
  • Data source API
  • R binding and machine learning pipelines

Back-end developments include:

  • Project Tungsten
  • Sort-based shuffle
  • Netty-based network

Of these, Xin covers DataFrames and Project Tungsten in some detail.  Looking ahead, Xin discusses the Dataset API, Streaming DataFrames and additional Project Tungsten work.  Video here.

Getting Into Production

— Databricks engineer and Spark committer Aaron Davidson summarizes common issues in production and offers tips to avoid them.  Key issues: moving beyond Python performance; using Spark with R; network and CPU-bound workloads.  Video here.

— Tuplejump’s Evan Chan summarizes Spark deployment options and explains how to productionize Spark, with special attention to the Spark Job Server.  Video here.

— Spark committer and Databricks engineer Andrew Or explains how to use the Spark UI to visualize and debug performance issues.  Video here.

— Kostas Sakellis and Marcelo Vanzin of Cloudera provide a comprehensive overview of Spark security, covering encryption, authentication, delegation and authorization.  They tout Sentry, Cloudera’s preferred security platform.  Video here.

Spark for the Enterprise

— Revisiting Matthew Glickman’s presentation at Spark Summit East earlier this year, Vinny Saulys reviews Spark’s impact at Goldman Sachs, noting the attractiveness of Spark’s APIs, in-memory processing and broad functionality.  He recaps Spark’s viral adoption within GS, and its broad use within the company’s data science toolkit.  His wish list for Spark: continued development of the DataFrame API; more built-in formulae; and a better IDE for Spark.  Video here.

— Alan Saldich summarizes Cloudera’s two years of experience working with Spark: a host of engineering contributions and 200+ customers (including Equifax, Barclays and a slide full of others).  Video here.  Key insights:

  • Prediction is the most popular use case
  • Hive is most frequently co-installed, followed by HBase, Impala and Solr.
  • Customers want security and performance comparable to leading relational databases combined with simplicity.

Data Sources and File Systems

— Stephan Kessler of SAP and Santiago Mola of Stratio explain Spark integration with SAP HANA Vora through the Data Sources API.  (Video unavailable).

— Tachyon Nexus’ Gene Pang offers an excellent overview of Tachyon’s memory-centric storage architecture and how to use Spark with Tachyon.  Video here.

Spark SQL and DataFrames

— Michael Armbrust, lead developer for Spark SQL, explains DataFrames.  Good intro for those unfamiliar with the feature.  Video here.

— For those who think you can’t do fast SQL without a Teradata box, Gianmario Spacagna showcases the Insight Engine, an application built on Spark.  More detail about the use case and solution here.  The application, which requires many very complex queries, runs 500 times faster on Spark than on Hive, and likely would not run at all on Teradata.  Video here.

— Informatica’s Kiran Lonikar summarizes a proposal to use GPUs to support columnar data frames.  Video here.

— Ema Orhian of Atigeo describes Jaws, a RESTful data warehousing framework built on Spark SQL with Mesos and Tachyon support.  Video here.

Spark Streaming

— Helena Edelson, VP of Product Engineering at Tuplejump, offers a comprehensive overview of streaming analytics with Spark, Kafka, Cassandra and Akka.  Video here.

— Francois Garillot of Typesafe and Gerard Maas of virdata explain and demo Spark Streaming.    Video here.

— Iulian Dragos and Luc Bourlier explain how to leverage Mesos for Spark Streaming applications.  Video here.

Data Science and Machine Learning

— Apache Zeppelin creator and NFLabs co-founder Moon Soo Lee reviews the Data Science lifecycle, then demonstrates how Zeppelin supports development and collaboration through all phases of a project.  Video here.

— Alexander Ulanov, Senior Research Scientist at Hewlett-Packard Labs, describes his work with Deep Learning, building on MLlib’s multilayer perceptron capability.  Video here.

— Databricks’ Hossein Falaki offers an introduction to R’s strengths and weaknesses, then dives into SparkR.  He provides an overview of SparkR architecture and functionality, plus some pointers on mixing languages.  The SparkR roadmap, he notes, includes expanded MLlib functionality; UDF support; and a complete DataFrame API.  Finally, he demos SparkR and explains how to get started.  Video here.

— MLlib committer Joseph Bradley explains how to combine the strengths of R, scikit-learn and MLlib.  Noting the strengths of R and scikit-learn libraries, he addresses the key question: how do you leverage software built to support single-machine workloads in a distributed computing environment?   Bradley demonstrates how to do this with Spark, using sentiment analysis as an example.  Video here.

— Natalino Busa of ING offers an introduction to real-time anomaly detection with Spark MLlib, Akka and Cassandra.  He describes different methods for anomaly detection, including distance-based and density-based techniques. Video here.

— Bitly’s Sarah Guido explains topic modeling, using Spark MLlib’s Latent Dirichlet Allocation.  Video here.

— Casey Stella describes using word2vec in MLlib to extract features from medical records for a Kaggle competition.  Video here.

— Piotr Dendek and Mateusz Fedoryszak of the University of Warsaw explain Random Ferns, a bagged form of Naive Bayes, for which they have developed a Spark package. Video here.

GeoSpatial Analytics

— Ram Sriharsha touts Magellan, an open source geospatial library that uses Spark as an engine.  Magellan, a Spark package, supports ESRI format files and GeoJSON; the developers aim to support the full suite of OpenGIS Simple Features for SQL.  Video here.

Use Cases and Applications

— Ion Stoica summarizes Databricks’ experience working with hundreds of companies, distills to two generic Spark use cases:  (1) the “Just-in-Time Data Warehouse”, bypassing IT bottlenecks inherent in conventional DW; (2) the unified compute engine, combining multiple frameworks in a single platform.  Video here.

— Apache committer and SKT engineer Yousun Jeong delivers a presentation documenting SKT’s Big Data architecture and a use case for real-time analytics.  SKT needs to perform real-time analysis of the radio access network to improve utilization, as well as timely network quality assurance and fault analysis; the solution is a multi-layered appliance that combines Spark and other components with FPGA and Flash-based hardware acceleration.  Video here.

— Yahoo’s Ayman Farahat describes a collaborative filtering application built on Spark that generates 26 trillion recommendations.  Training time: 52 minutes; prediction time: 8 minutes.  Video here.

— Sujit Pal explains how Elsevier uses Spark together with Solr and OpenNLP to annotate documents at scale.  Elsevier has donated the application, called SoDA, back to open source.  Video here.

— Parkinson’s Disease affects one out of every 100 people over 60, and there is no cure.  Ido Karavany of Intel describes a project to use wearables to track the progression of the illness, using a complex stack including Pebble, Android, iOS, Play, Phoenix, HBase, Akka, Kafka, HDFS, MySQL and Spark, all running in AWS.   With Spark, the team runs complex computations daily on large data sets, and implements a rules engine to identify changes in patient behavior.  Video here.

— Paula Ta-Shma of IBM introduces a real-time routing use case from the Madrid bus system, then describes a solution that includes Kafka, Secor, Swift, Parquet and Elasticsearch for data collection; Spark SQL and MLlib for pattern learning; and a complex event processing engine for application in real time.  Video here.

Big Analytics Roundup (April 27, 2015)

In the news this week: ODP, Spark Summit and a culinary FAIL from IBM Watson.

MapR to ODP: Get Lost

On the MapR blog, CEO John Schroeder describes ODP as “a Hortonworks marketing vehicle that provides a graceful market exit for Greenplum Pivotal,”  thus voicing thoughts shared by everyone not employed by Hortonworks and Pivotal.  (Additional coverage here.)  Schroeder notes that ODP adds a redundant layer of opaque pay-to-play governance, solves problems that don’t need solving and misdefines the Hadoop core in ways that serve the interests of Hortonworks.

Other than that, he’s for it.

In Datanami, Alex Woodie covers the “debate”, writing that ODP’s launch “effectively split the Hadoop community down the middle.”  Eighteen paragraphs later, he notes that Cloudera and MapR support 75% of the Hadoop implementations.  In other words, on one side we have Hadoop’s leaders and, on the other we have ODP.

Spark Summit 2015 Posts Agenda

The organizers of Spark Summit 2015, to be held in San Francisco June 15-17, have posted the agenda.   Keynotes are still TBD.  On the first two days there will be three tracks, one each targeting developers, data scientists and people like me who care mostly about applications.  Among the presenters: NBC Universal, Netflix, Capital One, Beth Israel Deaconess, Edmunds.com, Shopify, OpenTable, AutoTrader, Uber, UnderArmour, Thomson Reuters, Salesforce.com and Duke University, thus demonstrating that Spark really is enterprise-ready.

Predixion Lands Cash?

Predixion Software announces a “D” Round, does not disclose amount.  In other words, they’re still negotiating.

The “C” round 22 months ago drew $21 million.

Applications of Note

Bots that report on other bots.

Apache Spark Updates

At ComputerWeekly.com, Lindsay Clarke profiles Spark, gets it right.

Arush Kharbanda delivers an excellent guide to Spark Streaming for opensource.com.

The bloggers at Sematext say they see Spark Streaming displacing Storm.  Hortonworks, are you listening?

On the Databricks blog:

  • Reynold Xin summarizes recent Spark performance improvements.
  • Ion Stoica and Vida Ha demonstrate analysis of Apache Access logs with Databricks Cloud.
  • Daniel Darabos of Lynx Analytics touts LynxKite, a graph analytics solution that leverages Spark.

Kay Ewbank writes a positive review of Learning Spark, the recently released book by Holden Karau et al.

Kay Ousterhout et al. test three workloads in Spark and conclude that performance is CPU-bound, not disk- or network-bound.  (Republished in The Morning Paper).

Other Updates

The R Core Team has announced availability of R 3.2.0.

For those so inclined, the Mahout team has posted a guide to building an app in Mahout.

Google adds stream processing capabilities to BigQuery.

MapR releases on-demand training for Apache Drill.

Microsoft releases a free ebook on Azure Machine Learning.  It’s nicely written.

Big Analytics Roundup (April 13, 2015)

This week:  Microsoft closes on the acquisition of Revolution Analytics, plus lots of cloud news driven by the AWS Summit in San Francisco.

But the top item for the week is this History of Hadoop, from Marko Bonaci.

Update:  OK, the top item is actually this piece from Dave McClure on unicorns and dinosaurs.

Amazon Web Services

If you thought Amazon would let Microsoft own the cloud-based machine learning space, think again.  Amazon introduces Amazon Machine Learning. (h/t Oliver Vagner)

Apache Drill

In Big Data Quarterly, Jim Scott offers an excellent summary of Apache Drill and its significance for the Hadoop ecosystem.

Apache Mahout

The Mahout team announces Release 0.10, which includes a distributed algebraic optimizer, a Scala API and the Spark interface.  The team has optimistically re-branded these capabilities as Samsara, which suggests that we can escape from Mahout by following the Buddhist path.

Apache Spark

Advanced Analytics with Spark, the new book by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills, is now available.

Writing in insideBIGDATA, MemSQL CEO Eric Frenkiel champions Spark working together with MemSQL.

Cloud

Writing in ITBusinessEdge, Arthur Cole says analytics is heading toward the cloud.  Newsflash: analytics is already in the cloud, big time.  There are organizations today that run most or all of their advanced analytics in the cloud, and the most sophisticated have done so for years.

Cloud is eating the analytics world because predictive modeling requires large-scale computing power in short bursts; organizations that scale up on-premises computing power to meet peak requirements will own a lot of unused server capacity.  Moreover, cloud enables analysts to radically reduce cycle time and build better models with massively parallel test-and-learn operations.

In an InfoWorld piece headlined “Big Data is All About the Cloud,” Matt Asay argues that Big Data is about other things, too, like streaming and dedicated task clusters.  He interviews Matt Wood of Amazon Web Services, who thinks cloud is a good thing.

Databricks

Databricks announces that it is now an Amazon Web Services Advanced Technology Partner.

On the Databricks blog, Andy Konwinski recaps Spark Summit East.

Informatica

News of the company’s plan to go private produces a slew of overwrought articles about “generational shifts” in data integration like this one from Alex Woodie in Datanami.   Venture capitalists pay for potential and Wall Street pays for growth, but private owners want recurring revenue and profit margins; hence, private ownership is the best model for firms that are well along in the hype cycle, past the “Trough of Disillusionment” and well into the “Slope of Enlightenment”.  It shouldn’t surprise anyone that SnapLogic, Alteryx, ClearstoryData, Trifacta and Paxata all have higher growth rates than Informatica; after all, 1+1 equals 100% growth.  Nevertheless, the total revenue of those companies amounts to rounding error on Informatica’s 10-K, so grave-dancing seems premature.
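The “1+1” quip is easy to make concrete. Here is a minimal sketch in Python; the revenue figures are purely hypothetical round numbers I picked for illustration, not actual results for Informatica or any of the startups named above.

```python
def growth_rate_pct(previous: float, current: float) -> float:
    """Return period-over-period growth as a percentage."""
    if previous <= 0:
        raise ValueError("previous must be positive")
    return 100.0 * (current - previous) / previous


# Hypothetical revenue in $ millions; illustration only.
startup_prev, startup_now = 1.0, 2.0
incumbent_prev, incumbent_now = 1_000.0, 1_100.0

startup_growth = growth_rate_pct(startup_prev, startup_now)
incumbent_growth = growth_rate_pct(incumbent_prev, incumbent_now)

print(f"Startup:   {startup_growth:.0f}% growth, +${startup_now - startup_prev:.0f}M")
print(f"Incumbent: {incumbent_growth:.0f}% growth, +${incumbent_now - incumbent_prev:.0f}M")
```

In this made-up case the startup grows ten times faster in percentage terms while adding one hundredth of the incumbent’s new revenue, which is why growth rates alone say little about who is winning.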


Microsoft

Microsoft closes on its acquisition of Revolution Analytics (previously discussed here, here and here.)   Financial terms are undisclosed, so we will just have to troll through MSFT’s next 10-Q to confirm rumors about the price.  Additional coverage here and here.  Dave Rich, CEO of Revolution Analytics, assumes the role of General Manager, Advanced Analytics for Microsoft.

Big Analytics Roundup (March 23, 2015)

This week, Spark Summit East produced a deluge of news and analysis on Apache Spark and Databricks.  Also in the news: a couple of ventures landed funding, SAP released software and SAS soft-launched something new for SAS Visual Analytics.

Analytic Startups

Venture Capital Dispatch on WSJ.D reports that Andreessen Horowitz has invested $7.5 million in AMPLab spinout Tachyon Nexus.  Tachyon Nexus supports the eponymous Tachyon project, a memory-centric storage layer that runs underneath Apache Spark or independently.

Social media mining venture Dataminr pulls $130 million in “D” round financing, demonstrating that the real money in analytics is in applications, not algorithms.

Apache Flink

On the Flink project blog, Fabian Hueske posts an excellent article that describes how joins work in Flink.

Apache Spark

ADTMag rehashes the tired debate about whether Spark and Hadoop are “friends” or “foes”.  Sounds like teens whispering in the hallways of Silicon Valley High.  Spark works with HDFS, and it works with other datastores; it all depends on your use case.  If that means a little less buzz for Hadoop purists, get over it.

To that point, Matt Kalan explains how to use Spark with MongoDB on the Databricks blog.

A paper published by a team at Berkeley summarizes results from Spark benchmark testing, draws surprising conclusions.

In other commentary about Spark:

  • TechCrunch reports on the growth of Spark.
  • TechRepublic wonders if anything can dim Spark.
  • InfoWorld lists five reasons to use Spark for Big Data.

In VentureBeat, Sharmila Mulligan relates how ClearStory Data’s big bet on Spark paid off without explaining the nature of the payoff.  ClearStory has a nice product, but it seems a bit too early for a victory lap.

On the Spark blog, Justin Kestelyn describes exactly-once Spark Streaming with Apache Kafka, a new feature in Spark 1.3.

Databricks

Doug Henschen chides Ion Stoica for plugging Databricks Cloud at Spark Summit East, hinting darkly that some Big Data vendors are threatened by Spark and trying to plant FUD about it.  Vendors planting FUD about competitors that threaten them: who knew that people did such things?  It’s not clear what revenue model Henschen thinks Databricks should pursue; as Hortonworks’ numbers show, “contributing to open source” alone is not a viable business model.  If those Big Data vendors are unhappy that Databricks Cloud competes with what they offer, there is nothing to stop them from embracing Spark and standing up their own cloud service.

In other news:

  • On the Databricks blog, the folks from Uncharted Software describe PanTera, cool visualization software that runs in Databricks Cloud.
  • Rob Marvin of SD Times rounds up new product announcements from Spark Summit East.
  • In PCWorld, Joab Jackson touts the benefits of Databricks Cloud.
  • ConsumerElectronicsNet recaps Databricks’ announcement of the Jobs feature for Databricks Cloud, plus other news from Spark Summit East.
  • On ZDNet, Toby Wolpe reviews the new Jobs feature for production workloads in Databricks Cloud.
  • On the Databricks blog, Abi Mehta announces that Tresata’s TEAK application for AML will be implemented on Databricks Cloud.  Media coverage here, here and here.

Geospatial

MemSQL announced geospatial capabilities for its distributed in-memory NewSQL database.

J. Andrew Rogers asks why geospatial databases are hard to build, then answers his own question.

RapidMiner

Butler Analytics publishes a favorable review of RapidMiner.

SAP

SAP released a new on-premises version of Lumira Edge for visualization, adding to the list of software that is not as good as Tableau.  SAP also released Predictive Analytics 2.0, a product that marries the toylike SAP Predictive Analytics with KXEN InfiniteInsight, a product acquired in 2013.  According to SAP, Predictive Analytics 2.0 is a “single, unified analytics product” with two work environments, which sounds like SAP has bundled two different code bases into a marketing bundle with a common datastore.  Going for a “three-fer”, SAP also adds Lumira Edge to the bundle.

SAS

American Banker reports that SAS has “launched” SAS Transaction Monitoring Optimization for AML scenario testing; in this case, “launch” means marketing collateral is available.  The product is said to run on top of SAS Visual Analytics, which itself runs on top of SAS LASR Server, SAS’ “other” distributed in-memory platform.

Spark Summit East: A Report (Updated)

Updated with links to slides where available.  Some links are broken; conference organizers have been notified.

Spark Summit East 2015 met on March 18 and 19 at the Sheraton Times Square in New York City.  Conference organizers announced another sellout (like the last two Spark Summits on the West Coast).

Competition for speaking slots at Spark events is heating up.  There were 170 submissions for 30 speaking slots at this event, compared to 85 submissions for 50 slots at Spark Summit 2014.  Compared to the last Spark Summit, presentations in the Applications Track, which I attended, were more polished, and demonstrated real progress in putting Spark to work.
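Expressed as acceptance rates, the submission figures above look like this. The sketch is in Python, the function name is mine, and it assumes each slot goes to exactly one submission, which may not match how the organizers actually filled the program.

```python
def acceptance_rate_pct(slots: int, submissions: int) -> float:
    """Percentage of submissions that could receive a speaking slot."""
    if submissions <= 0:
        raise ValueError("submissions must be positive")
    return 100.0 * slots / submissions


# Slot and submission counts as reported by the conference organizers.
east_2015 = acceptance_rate_pct(slots=30, submissions=170)
summit_2014 = acceptance_rate_pct(slots=50, submissions=85)

print(f"Spark Summit East 2015: {east_2015:.1f}% accepted")
print(f"Spark Summit 2014:      {summit_2014:.1f}% accepted")
```

On these numbers, the odds of landing a slot dropped from better than one in two to worse than one in five.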

The “father” of Spark, Matei Zaharia, kicked off the conference with a review of Spark progress in 2014 and planned enhancements for 2015.  Highlights of 2014 include:

  • Growth in contributors, from 150 to 500
  • Growth in the code base, from 190K lines to 370K lines
  • More than 500 known production instances at the close of 2014

Spark remains the most active project in the Hadoop ecosystem.

Also, in 2014, a team at Databricks smashed the Daytona GraySort record for petabyte-scale sorting.  The previous record, set in 2013, used MapReduce running on 2,100 machines to complete the task in 72 minutes.  The new record, set by Databricks with Spark running in the cloud, used 207 machines to complete the task in 23 minutes.
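
To put those figures in perspective, here is a back-of-the-envelope calculation in Python.  It uses only the machine counts and runtimes quoted above, and it assumes that both runs sorted the same volume of data on broadly comparable machines, which this post does not establish; treat the ratios as rough indicators, not a benchmark.

```python
# Figures quoted in the paragraph above.
mapreduce_machines, mapreduce_minutes = 2100, 72
spark_machines, spark_minutes = 207, 23

# Machine-minutes: a crude measure of total cluster time consumed.
# Assumption: same data volume and comparable hardware in both runs.
mapreduce_cost = mapreduce_machines * mapreduce_minutes
spark_cost = spark_machines * spark_minutes

speedup = mapreduce_minutes / spark_minutes          # wall-clock ratio
machine_ratio = mapreduce_machines / spark_machines  # cluster size ratio
resource_ratio = mapreduce_cost / spark_cost         # machine-minutes ratio

print(f"Wall-clock speedup:    {speedup:.1f}x")
print(f"Fewer machines:        {machine_ratio:.1f}x")
print(f"Fewer machine-minutes: {resource_ratio:.1f}x")
```

In other words, the Spark run finished about three times faster on about one tenth of the machines, for roughly a thirtyfold reduction in machine-minutes.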

Key enhancements projected for 2015 include:

  • DataFrames, which are similar to frames in R, already released in Spark 1.3
  • R interface, which currently exists as SparkR, an independent project, targeted to be merged into Spark 1.4 in June
  • Enhancements to machine learning pipelines, which are sequences of tasks linked together into a process
  • Continued expansion of smart interfaces to external data sources, pushing logic into the sources
  • Spark packages — a repository for third-party packages (comparable to CRAN)

Databricks CEO Ion Stoica followed with a pitch for Databricks Cloud, which included brief testimonials from myfitnesspal, Automatic, Zoomdata, Uncharted Software and Tresata.

Additional keynoters included Brian Schimpf of Palantir, Matthew Glickman of Goldman Sachs and Peter Wang of Continuum Analytics.

Spark contributors presented detailed views on the current state of Spark:

  • Michael Armbrust, Spark SQL lead developer, presented on the new DataFrames API and other enhancements to Spark SQL.
  • Tathagata Das delivered a talk on the current state and future of Spark Streaming.
  • Joseph Bradley covered MLLib, focusing on the Pipelines capability added in Spark 1.2.
  • Ankur Dave offered an overview of GraphX, Spark’s graph engine.

Several observations from the Applications track:

(1) Geospatial applications had a strong presence.

  • Automatic, Tresata and Uncharted all showed live demonstrations of marketable products with geospatial components running on Spark
  • Mansour Raad of ESRI followed his boffo performance at Strata/Hadoop World last October with a virtuoso demonstration of Spark with massive spatial and temporal datasets and the ESRI open source GIS stack

(2) Spark provides a great platform for recommendation engines.

  • Comcast uses Spark to serve personalized recommendations based on analysis of billions of machine-generated events
  • Gilt Groupe uses Spark for a similar real-time application supporting flash sale events, where products are available for a limited time and in limited quantities
  • Leah McGuire of Salesforce described her work building a recommendation system using Spark

(3) Spark is gaining credibility in retail banking.

  • Sandy Ryza of Cloudera presented on Value at Risk (VaR) computations in Spark, a critical element in Basel reporting and stress testing
  • Startup Tresata demonstrated its application for Anti Money Laundering, which is built on a social graph built in Spark

(4) Spark has traction in the life sciences.

  • Jeremy Freeman of HHMI Janelia Research Center, a regular presenter at Spark Summits, covered Spark’s unique capability for streaming machine learning.
  • David Tester of Novartis presented plans to build a trillion-edge graph for genomic integration
  • Timothy Danford of Berkeley’s AMPLab delivered a presentation on next-generation genomics with Spark and ADAM
  • Kevin Mader of ETH Zurich spoke about turning big hairy 3D images into simple, robust, reproducible numbers without resorting to black boxes or magic

Also in the applications track: presenters from Baidu, myfitnesspal and Shopify.

Big Analytics Roundup (March 16, 2015)

Big Analytics news and analysis from around the web.  Featured this week: a new Spark release, Spark Summit East, H2O, FPGA chips, Machine Learning, RapidMiner, SQL on Hadoop and Chemistry Cat.

A reminder to readers that Spark Summit East is coming up March 18-19.

Alteryx

  • On the Alteryx Blog, Michael Snow plugs Alteryx and Qlik for predictive analytics.
  • And again, the same combo for spatial analytics.
  • Adam Riley blogs on testing Alteryx macros.

Apache Spark

For an overview, see the Apache Spark Page.

  • The Spark team announces availability of Spark 1.3.0.  Release notes here.  Highlights of the new release include the DataFrames API, Spark SQL graduates from Alpha, new algorithms in MLLib and Spark Streaming, a direct Kafka API for Spark Streaming, plus additional enhancements and bug fixes.  More on this release separately.
  • On Slideshare, Matei Zaharia outlines the 2015 roadmap for Apache Spark.
  • Also on Slideshare, Reynold Xin and Matei review lessons learned from running large Spark clusters.
  • In advance of Spark Summit, O’Reilly offers discounts on Spark video training and books.
  • Sandy Ryza, co-author of Advanced Analytics With Spark, writes on tuning Spark jobs on the Cloudera Engineering blog.
  • Databricks announces that advertising automation vendor Sharethrough has selected Spark and Databricks Cloud to process terabyte-scale clickstream data.  Case study published here.
  • Holden Karau publishes a Spark testing procedure on Git.
  • On RedMonk, Donnie Berkholz summarizes growing awareness and interest in Spark.

Buzzwords

  • In Wired, Patrick McFadin hits the trifecta with Apache Spark, NoSQL databases and IoT.

H2O

High Performance Computing

  • Datanami reports that a Ryft One FPGA chip (with limited functionality) offers throughput equivalent to 100-200 Spark nodes.  More coverage here.   Ryft’s Christian Shrauder blogs about FPGA.

Machine Learning

  • Ching and Daniel propose using Random Matrix Theory to analyze high-dimensional social media data.
  • Cheng-Tao Chu offers seven ways to mess up your next machine learning project.
  • AMPLab’s Jiannan Wang blogs on human-in-the-loop machine learning.  Someone should write a book about that.

RapidMiner

SQL on Hadoop

  • On the Pivotal blog, a podcast about Hawq.
  • The Apache Software Foundation announces release 0.10 of Apache Tajo; Silicon Angle reports with a backgrounder.
  • TechWorld reports that Airbnb has open-sourced Airpal, an application that runs on Facebook’s PrestoDB.  According to the story, Airpal is an application that “allows…non-technical employees to work like data scientists”, which suggests that TechWorld thinks data scientists do nothing but SQL.
  • Splice Machine has updated FAQs for its RDBMS-on-Hadoop.

Zementis

Apache Spark for Big Analytics (Updated for Spark Summit and Release 1.0.1)

Updated and bumped July 10, 2014.

For a PowerPoint version on Slideshare, go here.

Introduction

Apache Spark is an open source distributed computing framework for advanced analytics in Hadoop.  Originally developed as a research project at UC Berkeley’s AMPLab, the project achieved incubator status in Apache in June 2013 and top-level status in February 2014.  According to one analyst, Apache Spark is among the five key Big Data technologies, together with cloud, sensors, AI and quantum computing.

Organizations seeking to implement advanced analytics in Hadoop face two key challenges.  First, MapReduce 1.0 must persist intermediate results to disk after each pass through the data; since most advanced analytics tasks require multiple passes through the data, this requirement adds latency to the process.

A second key challenge is the plethora of analytic point solutions in Hadoop.  These include, among others, Mahout for machine learning; Giraph and GraphLab for graph analytics; Storm and S4 for streaming; and Hive, Impala and Stinger for interactive queries.  Multiple independently developed analytics projects add complexity to the solution; they pose support and integration challenges.

Spark directly addresses these challenges.  It supports distributed in-memory processing, so developers can write iterative algorithms without writing out a result set after each pass through the data.  This enables true high performance advanced analytics; for techniques like logistic regression, project sponsors report runtimes in Spark 100X faster than what they are able to achieve with MapReduce.

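Second, Spark offers an integrated framework for analytics, including:

  • Machine learning (MLLib)
  • Interactive queries (Spark SQL)
  • Graph analytics (GraphX)
  • Streaming analytics (Spark Streaming)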

A closely related project, Shark, supports fast queries in Hadoop.  Shark runs on Spark and the two projects share a common heritage, but Shark is not currently included in the Apache Spark project.  The Spark project expects to absorb Shark into Spark SQL as of Release 1.1 in August 2014.

Spark’s core is an abstraction layer called Resilient Distributed Datasets, or RDDs.  RDDs are read-only partitioned collections of records created through deterministic operations on stable data or other RDDs.  RDDs include information about data lineage together with instructions for data transformation and (optional) instructions for persistence.  They are designed to be fault tolerant: if a partition is lost, Spark can rebuild it by replaying the recorded lineage.
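
To make the lineage idea concrete, here is a toy sketch in plain Python.  It is not Spark’s API: the class ToyRDD and its methods (from_stable_data, map, filter, persist, lose_partition, collect) are hypothetical names invented for this illustration, and everything runs in a single process.  The point is only to show how a read-only, partitioned collection that remembers how it was derived can rebuild lost data instead of relying on a stored copy.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, num_partitions, compute, lineage):
        self.num_partitions = num_partitions
        self._compute = compute   # partition index -> list of records
        self.lineage = lineage    # names of the steps that produced this dataset
        self._persist = False     # optional persistence, off by default
        self._cache = {}          # partition index -> materialized records

    @classmethod
    def from_stable_data(cls, data, num_partitions):
        stable = tuple(data)      # "stable" input: never mutated afterwards
        return cls(num_partitions,
                   lambda i: list(stable[i::num_partitions]),
                   ["from_stable_data"])

    def map(self, fn):
        # Deterministic transformation of the parent, partition by partition.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      self.lineage + ["map"])

    def filter(self, predicate):
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.partition(i) if predicate(x)],
                      self.lineage + ["filter"])

    def persist(self):
        self._persist = True
        return self

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        records = self._compute(i)  # recompute from lineage when not cached
        if self._persist:
            self._cache[i] = records
        return records

    def lose_partition(self, i):
        # Simulate a failure that wipes out one cached partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_stable_data(range(10), num_partitions=2)
squares = base.map(lambda x: x * x).persist()
evens = squares.filter(lambda x: x % 2 == 0)

before = evens.collect()
squares.lose_partition(0)   # the cached copy of partition 0 disappears
after = evens.collect()     # ...and is rebuilt from the stable input
print(evens.lineage, before == after)
```

Because every transformation is deterministic and the input never changes, recomputing a lost partition gives the same records as the copy that was lost.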

For data sources, Spark works with any file stored in HDFS, or any other storage system supported by Hadoop (including local file systems, Amazon S3, Hypertable and HBase).  Spark supports text files, SequenceFiles and any other Hadoop InputFormat.  Through Spark SQL, the Spark user can import relational data from Hive tables and Parquet files.

Analytic Features

Spark’s machine learning library, MLLib, is rapidly growing.   In Release 1.0.0 (the latest release) it includes:

  • Linear regression
  • Logistic regression
  • k-means clustering
  • Support vector machines
  • Alternating least squares (for collaborative filtering)
  • Decision trees for classification and regression
  • Naive Bayes classifier
  • Distributed matrix algorithms (including Singular Value Decomposition and Principal Components Analysis)
  • Model evaluation functions
  • L-BFGS optimization primitive

Linear regression, logistic regression and support vector machines all use a gradient descent optimization algorithm, with options for L1 and L2 regularization.  MLLib is part of a larger machine learning project (MLBase), which includes an API for feature extraction and an optimizer (currently in development with planned release in 2014).
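
For readers who want to see what gradient descent with a regularization option looks like, here is a minimal single-machine sketch in plain Python.  It is not MLLib code: the function train_logistic and its parameters (step, iterations, reg, reg_param) are hypothetical names chosen for this example, it uses full-batch gradient descent on a tiny made-up dataset, and it handles L1 with a simple subgradient, which may differ from what MLLib does internally.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sign(v):
    return (v > 0) - (v < 0)


def train_logistic(rows, labels, step=0.5, iterations=200, reg=None, reg_param=0.0):
    """Batch gradient descent for logistic regression (illustrative only)."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(iterations):
        # Gradient of the average log loss over all rows.
        grad = [0.0] * d
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(d):
                grad[j] += err * x[j] / n
        # Add the penalty term: L2 shrinks weights, L1 pushes them toward zero.
        for j in range(d):
            if reg == "l2":
                grad[j] += reg_param * w[j]
            elif reg == "l1":
                grad[j] += reg_param * sign(w[j])  # subgradient of |w|
        w = [wi - step * g for wi, g in zip(w, grad)]
    return w


# Made-up data: first feature is a constant bias term, second is the signal.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
rows = [[1.0, x] for x in xs]
labels = [0, 0, 0, 1, 1, 1]

w_plain = train_logistic(rows, labels)
w_l2 = train_logistic(rows, labels, reg="l2", reg_param=0.1)

predictions = [int(sigmoid(sum(wi * xi for wi, xi in zip(w_l2, x))) > 0.5) for x in rows]
print(w_plain, w_l2, predictions)
```

Each iteration makes a full pass through the data, which is exactly the access pattern that benefits from keeping the dataset in memory between passes.  The L2 penalty yields a smaller weight on the signal feature than the unregularized fit.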

In March, the Apache Mahout project announced that it will shift development from MapReduce to Spark.  Mahout no longer accepts projects built on MapReduce; future projects leverage a DSL for linear algebra implemented on Spark.  The Mahout team will maintain existing MapReduce projects.  There is as yet no announced roadmap to migrate existing projects from MapReduce to Spark.

Spark SQL, currently in Alpha release, supports SQL, HiveQL, and Scala. The foundation of Spark SQL is a type of RDD, SchemaRDD, an object similar to a table in a relational database. SchemaRDDs can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.

GraphX, Spark’s graph engine, combines the advantages of data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark framework.  It enables users to interactively load, transform, and compute on massive graphs.  Project sponsors report performance comparable to Apache Giraph, but in a fault tolerant environment that is readily integrated with other advanced analytics.

Spark Streaming offers an additional abstraction called discretized streams, or DStreams.  DStreams are a continuous sequence of RDDs representing a stream of data.  The user creates DStreams from live incoming data or by transforming other DStreams.  Spark receives data, divides it into batches, then replicates the batches for fault tolerance and persists them in memory where they are available for mathematical operations.
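
The batching step can be illustrated with a few lines of plain Python.  This is a toy sketch, not the Spark Streaming API: the functions discretize and transform are hypothetical, the events are made up, timestamps are assumed to start at zero, and replication for fault tolerance is not modelled.  Each inner list stands in for the RDD that holds one batch.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_interval)].append(value)
    if not batches:
        return []
    # Emit every interval up to the last one, including empty batches.
    return [batches.get(i, []) for i in range(max(batches) + 1)]


def transform(dstream, fn):
    """Derive a new stream by applying fn to each batch of an existing one."""
    return [fn(batch) for batch in dstream]


# Made-up events: (seconds since start, payload).
events = [(0.2, "a"), (0.7, "b"), (1.1, "a"), (2.5, "c"), (2.9, "a")]

stream = discretize(events, batch_interval=1.0)
counts = transform(stream, len)                              # records per batch
a_only = transform(stream, lambda b: [v for v in b if v == "a"])

print(stream)
print(counts)
print(a_only)
```

The design choice worth noting is that once the stream is cut into batches, ordinary batch operations apply to each piece, which is why streaming code in Spark can reuse the same transformations as batch code.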

Currently, Spark supports programming interfaces for Scala, Java and Python;  MLLib algorithms support sparse feature vectors in all three languages.  For R users, Berkeley’s AMPLab released a developer preview of SparkR in January 2014.

There is an active and growing developer community for Spark: 83 developers contributed to Release 0.9, and 117 developers contributed to Release 1.0.0.  In the past six months, developers contributed more commits to Spark than to all of the other Apache analytics projects combined.   In 2013, the Spark project published seven double-dot releases, including Spark 0.8.1 published on December 19; this release included YARN 2.2 support, high availability mode for cluster management, performance optimizations and improvements to the machine learning library and Python interface.  So far in 2014, the Spark team has released 0.9.0 in February; 0.9.1, a maintenance release, in April; and 1.0.0 in May.

Release 0.9 includes Scala 2.10 support, a configuration library, improvements to Spark Streaming, the Alpha release for GraphX, enhancements to MLLib and many other enhancements.  Release 1.0.0 features API stability, integration with YARN security, operational and packaging improvements, the Alpha release of Spark SQL, enhancements to MLLib, GraphX and Streaming, extended Java and Python support, improved documentation and many other enhancements.

Distribution

Spark is now available in every major Hadoop distribution.  Cloudera announced immediate support for Spark in February 2014; Cloudera partners with Databricks.  (For more on Cloudera’s support for Spark, go here).  In April, MapR announced that it will distribute Spark; Hortonworks and Pivotal followed in May.

Hortonworks’ approach to Spark focuses more narrowly on its machine learning capabilities, as the firm continues to promote Storm for streaming analytics and Hive for SQL.

IBM’s commitment to Spark is unclear.  While BigInsights is a certified Spark distribution and IBM was a Platinum sponsor of the 2014 Spark Summit, there are no references to Spark in BigInsights marketing and technical materials.

In May, NoSQL database vendor Datastax announced plans to integrate Apache Cassandra with the Spark core engine.  Datastax will partner with Databricks on this project; availability expected summer 2014.

At the 2014 Spark Summit, SAP announced its support for Spark.  SAP offers what it characterizes as a “smart integration”, which appears to represent Spark objects in HANA as virtual tables.

On June 26, Databricks announced its Certified Spark Distribution program, which recognizes vendors committed to supporting the Spark ecosystem.   The first five vendors certified under this program are Datastax, Hortonworks, IBM, Oracle and Pivotal.

At the 2014 Spark Summit, Cloudera, Dell and Intel announced plans to deliver a Spark appliance.

Ecosystem

In April, Databricks announced that it licensed the Simba ODBC engine, enabling BI platforms to interface with Spark.

Databricks offers a certification program for Spark; participants currently include:

In May, Databricks and Concurrent Inc announced a strategic partnership.  Concurrent plans to add Spark support to its Cascading development environment for Hadoop.

Community

In December, the first Spark Summit attracted more than 450 participants from more than 180 companies.  Presentations covered a range of applications such as neuroscience, audience expansion, real-time network optimization and real-time data center management, together with a range of technical topics. (To see the presentations, search YouTube for ‘Spark Summit 2013’, or go here).

The 2014 Spark Summit was held June 30 through July 2 in San Francisco.  The event sold out at more than a thousand participants.  For a summary, see this post.

There is a rapidly growing list of Spark Meetups, including:

Now available for pre-order on Amazon:

Finally, this series of videos provides some good basic knowledge about Spark.