Big Analytics Roundup (August 29, 2016)
Python and R
Matt Asay argues that Python is a gateway language that leads data scientists to R (h/t Oliver Vagner). That’s oversimplified and mostly incorrect. The breadth of R’s analytics functionality tends to draw statisticians and scientists, while Python tends to be an entry language for software developers. While R supports more analytics than Python, Python has better tooling for Big Data; PySpark, for example, does much more than SparkR.
In KDnuggets’ 2016 poll, Python use increased markedly from 2015; this suggests that R users are adding Python to their battery of tools. More people in the poll use both Python and R than use either one alone.
While R is an excellent tool for personal use, its GPL license discourages adoption by companies that develop products or deliver services built on analytics. Thus, it is very unlikely that R will overtake Python as a development platform for machine learning applications.
Aster on Hadoop
Teradata announces the availability of Aster on Hadoop and AWS. Aster on Hadoop strikes me as a bladeless knife without a handle.
Aster was kind of interesting back in 2012; SQL/MapReduce offered analysts a way to run queries in Hadoop back when Hive was clunky and slow. Today, Aster is rendered obsolete by the likes of Impala, Spark, Presto, Drill, and HAWQ. According to DB-Engines, Aster ranks 77th in popularity, well below competitors Vertica, Netezza, and Greenplum.
Teradata’s leadership says that Aster is a great foundation for custom applications. Assuming that is true, for the sake of argument, the logical move is to donate Aster to open source, as Pivotal did with Greenplum.
SAP Acquiring Altiscale?
Late Summer Reading
In 2012, Amgen researchers disclosed that they were unable to reproduce findings in 47 out of 53 published cancer discoveries (roughly 89 percent). In Nautilus, Ahmed Alkhateeb argues that we should not accept scientific results unless the findings are reproducible.
In a thesis submitted to Sweden’s KTH Royal Institute of Technology, Ahsan Javed Awan reports the results of benchmarking Apache Spark on a single scale-up server. He ran into some scaling issues on machines with more than twelve cores, which he records in some detail.
— Felix Gessert explains the ins and outs of different NoSQL databases and offers a rubric for choosing one.
— On the Google Research Blog, Peter Liu explains text summarization with TensorFlow.
— Joe Osborne interviews Google’s Norm Jouppi, who explains the Tensor Processing Unit (TPU).
— On the Kudu blog, Dan Burkert explains new range partitioning features in Kudu.
— Marco Tulio Ribeiro et al. explain Local Interpretable Model-Agnostic Explanations (LIME), a fancy name for partial dependency analysis.
— Stephen J. Bigelow explains the tools available on AWS for BI solutions: S3, RDS, Aurora, DynamoDB, EMR, Redshift, QuickSight, and Amazon Machine Learning.
— On Slideshare, Manu Zhang and Sean Zhong explain Apache Gearpump, which is Yet Another Streaming Engine.
— Julie Bort explains why you shouldn’t depend on one cloud service provider.
— On the Confluent blog, Jay Kreps argues that multi-tenancy is the key capability of distributed systems.
— Cynthia Harvey compares AWS and Azure; she misses the big picture. AWS is a software-agnostic IaaS provider; MSFT is a software company with complementary PaaS and SaaS services. There are advantages and disadvantages to each model, but first one must recognize the difference.
— Curt Monash asks if analytic RDBMSs and data warehouse appliances are obsolete.
— SAP’s Ken Tsai opines on the role of Hadoop in digital transformation and IoT. Spoiler: he thinks Hadoop has a role.
Open Source News
— Hazelcast announces the general availability of Hazelcast 3.7, with performance improvements and a modular architecture. Hazelcast is an in-memory data grid.
— Apache Ignite completes the hat trick for in-memory bits by announcing Ignite 1.7.0.
— Apache Kudu launches Kudu 0.10.0.
— Microsoft announces the availability of Microsoft R Open (MRO) 3.3.1, with a streamlined installation process, additional packages, and bug fixes. MRO is a free and open source enhanced distribution of R.
— Big-Data-as-a-Service provider BlueData announces a $20 million “C” round led by Intel Capital. The company also announces a partnership with Intel to deliver its software on Xeon processors.
— Google offers several webinars in September for those who want to learn more about BigQuery, Cloud Dataflow, and the Google Cloud Platform.
— Syncsort announces that it has completed the acquisition of Cogito, a maker of mainframe stuff that complements Syncsort’s other mainframe stuff.