Big Analytics Roundup (March 21, 2016)
Minimal hard news this week, but some interesting survey results, analysis, articles, explainers and perspectives.
— On his personal blog, Will Kurt describes Bayesian reasoning in the Twilight Zone. I tried to learn Bayesian reasoning a few years ago, but it conflicted with my prior beliefs.
— Stack Overflow shares results from its 2016 Developer Survey. (h/t Thomas Ott) Key bits:
- Most popular technologies for math and data: Python and SQL.
- Top paying technologies: Spark and Scala.
- Top paying tech for data scientists: Scala, Spark and Hadoop.
- Top tech stack for data scientists: Python + R + SQL.
- Top development environments for data scientists: (1) Vim; (2) Notepad++; (3) RStudio; (4) IPython/Jupyter.
- Job priorities for data scientists: (1) Salary; (2) Building something that’s innovative.
- Biggest challenge at work (all respondents): Unrealistic expectations.
- Purchasing power of developers in South Africa: 25,713 Big Macs per year.
— MIT Technology Review summarizes a comparative analysis of the tweeps for Hillary Clinton and Donald Trump. Study authors use facial recognition to classify followers into demographic categories, with surprising findings.
— Daniel Chalef of Domino Data analyzes data from Google Trends and Stack Overflow and finds that people search for open source data science tools more than they do for commercial data science tools. For a more comprehensive look at this question, see Bob Muenchen’s blog on the popularity of analytics software. Search interest is one data point; Bob’s work with job postings offers a better picture of the actual state of the market.
— On his Databaseline blog, Ian Hellström corrals information on Apache streaming projects, including Apex, Beam, Flink, Flume, Ignite, NiFi, Samza, Spark Streaming and Storm/Trident.
— On the Confluent blog, Jay Kreps explains Kafka Streams. Given Kafka’s dominance in the streaming data space, I suspect that we will see Confluent move upstream — no pun intended — to streaming analytics.
— This week from the morning paper:
- Adrian Colyer explains MacroBase, an open source software project for anomaly detection in streaming data.
- … explains social engineering attacks and potential defenses.
- … explains distributed TensorFlow with MPI. Distributed versions improve (runtime) performance, but scalability is sublinear; with 32 nodes, performance is a little less than 12X faster than a single node.
— MapR’s Tugdual Grall explains what Spark is, what it does, and what sets it apart.
— On SlideShare, Joe Chow explains random grid search for hyperparameter optimization in H2O.
— On the Databricks blog, Denny Lee et al. explain how to use the new GraphFrames package. They include a notebook and demonstration of GraphFrames with the airline on-time performance dataset.
— MSFT’s Jeff Stokes explains how to scale stream analytics jobs with Azure Machine Learning functions.
— On the MapR blog, Carol McDonald explains how to get started using GraphX with Scala.
— Jack Vaughan interviews some old guy who thinks Spark is a thing.
— In Forbes, Gil Press reviews the Forrester TechRadar Big Data report and opines about the top ten technologies. InformationWeek’s Jessica Davis reviews the same report and draws different conclusions. The great thing about punditry is you can say anything you like.
— Gabriela Motroc engages the tiresome old “Spark versus Hadoop” theme.
— Alex Woodie opines that Hadoop must evolve toward greater simplicity. While his complaint has merit, the problem with his argument is that organisms do not “evolve” to simplicity; simplicity itself is a product of design. Pure Hadoop is simple: MapReduce and HDFS. Hadoop has evolved to something more complex because it had to do so; every additional piece added to the ecosystem is a response to unmet needs.
— H2O.ai’s Ken Sanford, who previously worked for SAS, argues that the best data scientists run R and Python. He’s right. Money talks: according to O’Reilly’s 2015 Data Science Salary Survey, the median salary for data scientists who use SAS is less than the median salary for data scientists who use R and Python.
— On Medium, PredictionIO’s Thomas Stone celebrates ten years of open source machine learning.
— Jessica Davis profiles nine big data and analytics startups she thinks you should watch: Confluent, H2O.ai, AtScale, Algorithmia, BedrockData, Wavefront, RJMetrics, BlueTalon, and Cazena.
— In TechCrunch, Hightail’s Mike Trigg opines that Silicon Valley’s unicorn problem will solve itself. I doubt that’s true; you can’t simultaneously argue that VCs are irrational on the upside (e.g. Groupon) but rational on the downside. If VCs are too dumb to spot companies with no sustainable competitive advantage, they are also too dumb to spot “well-run, profitable companies with proven business models and healthy balance sheets.”
— On Quora, Dato’s Carlos Guestrin opines about what’s next in machine learning.
— In Martech Advisor, Ankush Gupta interviews Altiscale’s VP of Marketing, Barbara Lewis. Interesting bits about Altiscale’s Spark-as-Service offering.
— David Weldon asks if you are asking all the wrong questions about Apache Spark. He interviews Sean Suchter of Pepperdata.
— Teradata CEO Mike Koehler continues to demonstrate confidence in the company’s growth prospects by selling another 350,000 shares.
— Zacks downgrades TDC to hold. On Wall Street, “hold” is code for “dump it.”
Open Source Announcements
— Three announcements from Apache projects:
- Apex announces release 3.3.1 of the Malhar library, a maintenance release.
- Drill announces release 1.6.0, which includes a few new features and many bug fixes. Release notes here.
- Phoenix announces release 4.7, with ACID transaction support, better statistics, improved performance and 150+ bug fixes.
— Dataiku announces that it has hired two new Veeps to drive expansion in North America.
— Reltio announces GA of Reltio Cloud 2016.1, with early access to Reltio Insights. Reltio offers a master data management platform-as-a-service; Reltio Insights adds Spark to the mix.
— BlueData announces that it has joined the Dell Technology Partnership Program. BlueData offers a datacenter virtualization capability that enables enterprises to build an on-premises cloud. BlueData Veep Greg Kirchoff opines about the partnership. Spoiler: he likes it.