Big Analytics Roundup (October 19, 2015)
Ten stories this week. Don’t miss story #10, which recaps an analysis of collaboration and influence in the U.S. Congress using open source graph engines and a rich database of legislation.
(1) Rexer: R Continues to Lead
Several interesting changes from the previous survey:
- Reported primary and total use of R continues to increase
- SPSS/Statistics declined slightly in reported usage, remains #2
- RapidMiner is way down, from third to ninth. Also interesting to note that ~95% of RapidMiner users say they use the free version.
- SAS usage remained constant, but moved up in rank to third as RapidMiner fell
- Reported usage of Excel Data Mining and Tableau is way up from previous rounds of the survey
Like most surveys on this topic, there are issues with Rexer’s sampling methodology that mandate careful interpretation. Rexer’s methods are largely consistent from year to year, however, so changes between iterations of the survey are interesting and may reflect real-world trends.
(2) CfP for Spark Summit East Opens
Spark Summit East will meet at the New York Hilton February 16-18; I will be there, with bells on. The Call for Presentations is now open.
(3) DataTorrent Explains DAGs
On the DataTorrent blog, Thomas Weise explains directed acyclic graphs, or DAGs, a fancy name for a way to describe logical dependencies with dots and arrows. It sounds prosaic, but DAGs are fundamental to Storm, Spark, Tez and Apex, all of which play a role in bringing high-performance computing to the Hadoop ecosystem.
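To make the dots-and-arrows idea concrete, here is a minimal Python sketch of a DAG and a valid execution order for it. The stage names are hypothetical and are not taken from Weise's post or from any of the engines mentioned; the sketch only shows that an acyclic dependency graph always admits an order in which every stage runs after the stages it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each key is a stage (a dot), and each value is
# the set of stages it depends on (the arrows pointing into it).
dag = {
    "read": set(),
    "lookup": set(),
    "parse": {"read"},
    "filter": {"parse"},
    "count": {"filter"},
    "join": {"filter", "lookup"},
    "write": {"count", "join"},
}

# static_order() yields the stages so that every dependency comes first.
# It raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Several orders can be valid for the same graph, which is what gives an engine room to run independent stages (here, "read" and "lookup") in parallel.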
(4) New Apache Drill Release
SQL platform Apache Drill announces Release 1.2. Key new bits:
- Relational database support (through JDBC)
- Additional window functions
- Parquet metadata caching
- Performance improvements on HBase and Hive tables
- Drop table capability for files and directories
- Enhanced MongoDB integration
Plus many bug fixes. Nice work, Drill team, but it feels like rearranging the deck chairs. Drill lags the other SQL engines in Kerberos support, YARN integration and query fault tolerance; while Teradata is stepping in to do something with Presto, Drill is an orphan. There is no UI, and no sign that the BI vendors are looking to build on Drill, so it’s not clear where Drill goes from here.
(5) Fans Flock to Flink Forward ’15
The first Flink Forward conference met for two days in Berlin last week. Data Artisans organized the program and delivered a number of the presentations. Capital One’s Slim Baltagi has kindly shared the deck from his keynote on Flink versus Spark.
(6) Big Data Spain Meets in Madrid
The 4th Edition of Big Data Spain met last week in Madrid. On Slideshare, evil mad scientist Paco Nathan offers two decks:
- Data Science in 2016, his keynote address, covers architectural design patterns; observations on trends; example applications and use cases; and offers a glimpse ahead.
- Crash Introduction to Apache Spark, slides from a workshop, is exactly what it sounds like it is.
(7) MIT Researchers Build Data Science Machine
James Max Kanter and Kalyan Veeramachaneni of MIT develop an automated Data Science Machine (DSM), enroll it in three data science competitions, and beat 615 out of 906 teams. DSM performed “nearly” as well as the human teams; but while humans spent months developing their models, the DSM spent 2-12 hours.
In a paper that describes their work, Kanter and Veeramachaneni present an approach to feature engineering they call Deep Feature Synthesis, which generates features based on automated analysis of a relational data model. The authors note that a naive grid search for the optimal model specification would require trillions of experiments; they use Bayesian optimization to find the best model.
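As a rough illustration of what generating features from a relational data model can look like, the Python sketch below aggregates a child table up to its parent key. This is a toy under stated assumptions, not the authors' algorithm: the `orders` table, its column names, and the `synthesize_features` helper are all hypothetical, and the sketch covers only a single one-to-many relationship.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical child table: one row per order, keyed by customer_id.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]

# Aggregation functions applied across the one-to-many relationship.
AGGREGATIONS = {"count": len, "sum": sum, "mean": mean, "max": max}


def synthesize_features(rows, key, column):
    """Return {key value: {feature name: value}} for one numeric column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return {
        k: {f"{name}_{column}": fn(values) for name, fn in AGGREGATIONS.items()}
        for k, values in grouped.items()
    }


features = synthesize_features(orders, "customer_id", "amount")
print(features[1])
```

Each customer ends up with one candidate feature per aggregation function, produced mechanically from the table relationship rather than by hand.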
(8) Spark-Based Security Platform Lands Funding
DataVisor, founded in 2013, announces a $14.5 million “A” round from GSR and NEA to develop its eponymous security analysis engine, which runs on Spark. The company, based in Mountain View, claims that its software can process billions of events per hour, and boasts Yelp and Momo as customers.
(9) Dato Releases Spark-GraphLab Interface
On the Dato Blog, Emad Soroush introduces the spark-sframe package, which enables a GraphLab user to ingest Spark RDDs as GraphLab SFrames. Dato introduced SFrames a couple of weeks ago. As I noted at the time, it doesn’t really matter how cool the SFrame is, it’s YADF — Yet Another Data Format.
Rather than forcing data scientists to convert data to a new format, machine learning vendors need to figure out how to work with existing Hadoop formats. Dato isn’t going to build a complete Business Analytics stack; it’s going to have to integrate with SQL engines and other tools, and YADF makes that harder, not easier.
I also have to wonder why Dato hasn’t registered this package on Spark Packages, like everyone else who integrates with Spark.
(10) Spark Plus GraphX Equals Mazerunner
On his personal blog, William Lyon demonstrates an analysis of influence in the U.S. Congress using the Neo4j graph database, Apache Spark GraphX and Mazerunner, an open source project that merges the capabilities of Neo4j and Spark. In a previous post, Lyon showed how he loaded data from govtrack.us into Neo4j to build a rich graph of collaboration among different members of Congress.
Next, he uses Mazerunner’s PageRank tooling to calculate the influence for each Senator and Member of Congress. Mazerunner selects and extracts the relevant subgraph from Neo4j, runs a Spark GraphX job and writes the results back to Neo4j.
Mazerunner is free and open source under an Apache 2.0 license, and is distributed on GitHub. Currently, it supports algorithms for PageRank, Closeness Centrality, Betweenness Centrality, Triangle Counting, Connected Components and Strongly Connected Components.
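For readers who have not met PageRank, here is a minimal pure-Python sketch of the computation on a tiny directed graph. It is not the GraphX or Mazerunner implementation, and the edge list and its meaning are hypothetical rather than drawn from Lyon's data; it only illustrates the idea that a node is influential when influential nodes point to it.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        # Each node passes its rank, in equal shares, along its out-edges.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical edges: (a, b) means legislator a supported a bill by b.
edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
ranks = pagerank(edges)
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

In this toy graph, C collects support from three others and ranks highest, and A ranks second only because C points to it.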