Big Analytics Roundup (January 18, 2016)
The main news this week is Microsoft’s rebranding and packaging of the software it acquired last year when it purchased Revolution Analytics. Also in the news: MapR offers some oddly packaged software for Test Drive on AWS, and Yahoo dumps some data. Plus, we have a few quick hits and a nice crop of explainers.
Andrew Brust speculates that Databricks’ recently announced leadership changes mean the company is preparing for acquisition or IPO. Every startup is preparing for an exit all of the time; those that say they aren’t are lying. That said, the changes at Databricks — in which nobody’s out, several people are up and one new Sales VP is added from outside — look more to me like a company that wants to sharpen its sales focus.
On Datanami, Alex Woodie surveys the SQL-on-Hadoop landscape and gets it mostly right. Three quibbles: first, the argument that Hive is fine for batch but not for interactive SQL ignores the significant performance improvements of Hive-on-Tez, which competes well with fast SQL platforms. Second, the ANSI SQL versus HiveQL debate is largely theoretical, and does not reflect meaningful differences that show up in assessments. Third, vendor-specific tools like Big SQL (IBM), Big Data SQL (Oracle), Vertica SQL (HPE) and HAWQ (Pivotal) are a step backwards; rather than enabling a federated architecture, they simply promote lock-in to the vendors that offer them.
In The Morning Paper, Adrian Colyer summarizes an interesting paper from Nanavati et al. on the impact of non-volatile storage devices, or storage class memories (SCM), on datacenter architecture. The paper documents how high-speed storage devices will disrupt current thinking about the ratio of CPUs to memory and storage.
In a metaphor for Pivotal, Stacey Schneider delivers her predictions for 2016 two weeks late.
- On R-bloggers, Yuki Katoh explains how to build a simple image recognition app with TensorFlow and Shiny.
- In a guest post on the Hortonworks blog, John Kreisa of Hortonworks partner Arena explains how to build an advanced analytics platform for media management. Not surprisingly, he uses HDP.
- On the MapR blog, Balaji Mohanam explains the differences between Spark and Flink and correctly notes that the differences are more theoretical than practical.
- On IBM DeveloperWorks, Jesse Chen explains why you should use Parquet with Spark SQL, offering results from 24 TPC-DS queries, with 10X speedup for Parquet versus text.
- On the AWS Big Data Blog, Arno Abeyaratne explains how to query Amazon Kinesis streams with SQL and Spark Streaming.
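For readers wondering why a columnar format like Parquet can beat delimited text by a wide margin, here is a minimal sketch in plain Python. It is not Parquet and not Spark: the toy table, the function names and the "fields read" counter are all hypothetical, invented for illustration. It shows only one of the reasons (column pruning); real Parquet files also bring typed binary encoding, compression and statistics that let an engine skip data.

```python
import csv
import io

# Hypothetical toy table: (order_id, customer, amount). Not from the article.
ROWS = [
    (1, "alice", 10.0),
    (2, "bob", 25.5),
    (3, "carol", 4.5),
]
NAMES = ["order_id", "customer", "amount"]


def to_row_text(rows):
    """Row-oriented layout: one delimited text line per record."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def to_columns(rows, names):
    """Column-oriented layout: one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_amount_from_text(text):
    """Sum the amount column from text; every field of every line is parsed."""
    total = 0.0
    fields_read = 0
    for record in csv.reader(io.StringIO(text)):
        fields_read += len(record)
        total += float(record[2])  # values come back as strings
    return total, fields_read


def sum_amount_from_columns(columns):
    """Sum the amount column from the columnar layout; other columns are untouched."""
    values = columns["amount"]
    return sum(values), len(values)


if __name__ == "__main__":
    print(sum_amount_from_text(to_row_text(ROWS)))
    print(sum_amount_from_columns(to_columns(ROWS, NAMES)))
```

Both functions return the same total, but the text version touches every field in the table while the columnar version touches only the values it needs. On wide tables with analytic queries that read a handful of columns, that gap grows with the number of columns.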
Microsoft Rebrands Revolution R
Microsoft announces rebranding of the software acquired when it bought Revolution Analytics last January:
- Revolution R Enterprise is now Microsoft R Server.
- Revolution R Open is now Microsoft R Open.
Microsoft R Open (MRO) is an enhanced free distribution of open source R. Enhancements currently include multi-core processing, a fixed CRAN repository date and the checkpoint package for reproducibility. MRO is available on Windows, Mac OS X and Linux. For more information, check the Microsoft R Application Network. Microsoft includes MRO in the Microsoft Data Science Virtual Machine together with Anaconda, Visual Studio Community Edition, Power BI Desktop, SQL Server and Azure SDK.
Microsoft R Server is a commercially licensed and supported bundle that includes MRO, a library of distributed machine learning algorithms, a distributed computing framework and data source connectivity tools. Microsoft supports the software on Red Hat and SUSE Linux, on Cloudera, Hortonworks and MapR Hadoop distributions and in Teradata Database. The Microsoft R Server Developer Edition is available as a free download, with restricted licensing.
R Server for Windows will ship as R Services in SQL Server 2016, currently available in what Microsoft calls Community Technology Preview, and what everyone else calls beta.
On ZDNet, Mary Jo Foley inquires about pricing for the commercially licensed versions. Response from MSFT: it depends, check with your account team/reseller.
Tim Anderson in the Register notes that there are few notable enhancements from Revolution R Enterprise 7.5. Anyone familiar with the challenges of integrating acquired software will understand that.
Support on Azure appears to be a work in progress.
MapR Offers Free Test Drives
MapR and partners assemble a bundle that includes MapR’s eponymous Hadoop distribution plus Dataguise DgSecure, HPE Vertica, Apache Drill and TIBCO Spotfire. The whole package is available on Amazon Web Services’ Big Data Test Drive.
The combination doesn’t really make much sense. Looks like one of those partner things where a few Veeps of Business Development meet in a bar and concoct something after the third drink.
Yahoo Dumps Data
Remember Yahoo!? For the benefit of younger readers, Yahoo! was a cool company back in the 1990s that offered nifty tools like email and hierarchical content indexing. Since the dot-com bust, however, it has struggled to define a coherent growth strategy. The company continues to muddle through in the face of flat revenue and operating losses by selling off pieces of its stake in Alibaba, but without organic growth the company is toast.
Oh, right. That Yahoo.
Anyway, Yahoo! releases a big dataset consisting of anonymized interactions from 20 million users over four months in 2015. The dataset includes around 110 billion events, the clicks of users lost on Yahoo! sites searching for something to read.
Exit question: what are the odds that hackers can crack the anonymization through behavior profiling?