Big Analytics Roundup (August 15, 2016)

In the second quarter of 2015, Hortonworks lost $1.38 for every dollar of revenue. In the second quarter of 2016, HDP lost $1.46 for every dollar of revenue. So I guess they aren’t making it up on volume.

On the Databricks blog, Jules Damji summarizes Spark news from the past two weeks.

AWS Launches Kinesis Analytics

Amazon Web Services announces the availability of Amazon Kinesis Analytics, an SQL interface to streaming data. AWS’ Ryan Nienhuis explains how to use it in the first of a two-part series.

The biggest threat to Spark Streaming doesn’t come from the likes of Flink, Storm, Samza or Apex. It comes from popular message brokers like Apache Kafka and AWS Kinesis, who can and will add analytics to move up the value chain.

Intel Freaks Out

Intel announces an agreement to acquire Nervana Systems, a 28-month-old startup working on hardware and software solutions for deep learning. Re/code reports a price tag of $408 million. The customary tech media unicorn story storm ensues. (h/t Oliver Vagner)

Intel says it plans to use Nervana’s software to improve the Math Kernel Library and market the Nervana Engine alongside the Xeon Phi processor. Nervana neon is YADLF — Yet Another Deep Learning Framework — that ranked twelfth in usage among deep learning frameworks in KDnuggets’ recent poll. According to Nervana, neon benchmarks well against Caffe; but then, so does CNTK.

Do special-purpose chips for deep learning have legs? Obviously, Intel thinks so. The headline on that recent Wired story about Google’s deep learning chip — Time for Intel to Freak Out — looks prescient. That said, the history of computing isn’t kind to special-purpose hardware; does anyone remember Thinking Machines? If Intel has any smarts at all, it will take steps to ensure that its engine works with the deep learning frameworks people actually want to use, like TensorFlow, Theano, and Caffe.

Cloud Computing Drivers

Tony Safoian describes five trends driving the growth of cloud computing: better security, machine learning and big data, containerization, mobile and IoT. Cloud security hasn’t actually improved — your data was always safer in the cloud than it was on premises. What has changed is the perception of security, and the growing sense that IT sentiments against cloud have little to do with security and a lot to do with rent-seeking and turf.

On the other points, Safoian misses the big picture — due to the costs of data movement, the cloud is best suited to machine learning and big data when data sources are also in the cloud. As organizations host an increasing number of operational applications in the cloud, it makes sense to manage and analyze the data there as well.

Machine Learning for Social Good

Microsoft offers a platform to predict scores in weather-interrupted cricket matches.

Shameless Commerce

In a podcast, Ben Lorica interviews John Akred on the use of agile techniques in data science. Hey, someone should write a book about that.

Speaking of books, I plan to publish snippets from my new book, Disruptive Analytics, every Wednesday over the next couple of months.

DA Cover

Explainers

— Uber’s Vinoth Chandar explains why you rarely need sub-second latency for streaming analytics.

— Microsoft’s David Smith explains how to tune Apache Spark for faster analysis with Microsoft R Server.

— Databricks’ Jules Damji explains how to use SparkSession with Spark 2.0.

— On the Cloudera Engineering Blog, Devadutta Ghat et. al. explain analytics and BI on S3 with Apache Impala. Short version: you’re going to need more nodes.

— In the first of a three-part series, IBM’s Elias Abou Haydar explains how to score health data with Apache Spark.

— Basho’s Pavel Hardak explains how to use the Riak Connector for Apache Spark.

— On YouTube, Alluxio founder and CEO Haoyuan Li explains Alluxio.

— Pat Ferrel explains the roadmap for Mahout. According to OpenHUB, Mahout shows a slight uptick in developer activity, from zero to two active contributors.

— Cisco’s Saravanan Subramanian explains the features of streaming frameworks, including Spark, Flink, Storm, Samza, and Kafka Streams. A pretty good article overall, except that he omits Apache Apex, a top-level Apache project.

— Frances Perry explains what the Apache Beam has accomplished in the first six months of incubation.

Perspectives

— Curt Monash opines about Databricks and Spark. He notes that some people are unhappy that Databricks hasn’t open sourced 100% of its code, which is just plain silly.

— IBM’s Vijay Bommireddipalli touts IBM’s contributions to Spark 2.0.

— Mellanox’ Gillad Shainer touts the performance advantage of EDR InfiniBand versus Intel Omni-Path. Mellanox sells InfiniBand host bus adapters and network switches.(h/t Bob Muenchen)

— Kan Nishida runs a cluster analysis on R packages in Google BigQuery and produces something incomprehensible.

— Pivotal’s Jagdish Mirani argues that network-attached storage (NAS) may be a good alternative to direct-attached storage (DAS). Coincidentally, Pivotal’s parent company EMC sells NAS devices.

Open Source News

— Apache Flink announces two releases. Release 1.1.0 includes new connectors, the Table API for SQL operations, enhancements to the DataStream API, a Scala API for Complex Event Processing and a new metrics system. Release 1.1.1 fixes a dependency issue.

— Apache Kafka announces Release 0.10.0.1, with bug fixes.

— Apache Samza releases Samza 0.10.1 with new features, performance improvements, and bug fixes.

— Apache Storm delivers version 1.0.2, with bug fixes.

Commercial Announcements

— AWS releases EMR 5.0, with Spark 2.0, Hive 2.1 and Tez as the default execution engine for Hive and Pig. EMR is the first Hadoop distribution to support Spark 2.0.

— Fractal Analytics partners with KNIME.

— MapR announces a $50 million venture round led by the Australian Government Future Fund.

Big Analytics Roundup (July 11, 2016)

Light news this week. We have results from an interesting survey on fast data, an excellent paper from Facebook and a nice crop of explainers.

From one dumb name to another.  Dato loses trademark dispute, rebrands as Turi. They should have googled it first.

Screen Shot 2016-07-07 at 6.25.48 AM

Wikibon’s George Gilbert opines on the state of Big Data performance benchmarks. Spoiler: he thinks that most of the benchmarks published to date are BS.

Databricks releases the third eBook in their technical series: Lessons for Large-Scale Machine Learning Deployments in Apache Spark.

The State of Fast Data

OpsClarity, a startup in the applications monitoring space, publishes a survey of 4,000 respondents conducted among a convenience sample of IT folk attending trade shows and the like. Most respondents self-identify as developers, data architects or DevOps professionals. For a copy of the report, go here.

As with any survey based on a convenience sample, results should be interpreted with a grain of salt. There are some interesting findings, however.  Key bits:

  • In the real world, real time is slow. Only 27% define “real-time” as “less than 30 seconds.”  The rest chose definitions in the minutes and even hours.
  • Batch rules today. 89% report using batch processing. However, 68% say they plan to reduce batch and increase stream.
  • Apache Kafka is the most popular message broker, which is not too surprising since Kafka Summit was one of the survey venues.
  • Apache Spark is the most popular data processing platform, chosen by 70% of respondents.
  • HDFS, Cassandra, and Elasticsearch are the most popular data sinks.
  • A few diehards (9%) do not use open source software. 47% exclusively use open source.
  • 40% host data pipelines in the cloud; 32% on-premises; the rest use a hybrid architecture.

It should surprise nobody that people who attend Kafka Summit and the like plan to increase investments in stream processing. What I find interesting is the way respondents define “real-time”.

Alex Woodie summarizes the report. (Fixed broken link).

Top Read of the Week

Guoqiang Jerry Chen, et. al. explain real-time data processing at Facebook. Adrian Colyer summarizes.

Explainers

— Jake Vanderplas explains why Python is slow.

— On Wikibon, Ralph Finos explains key terms in cloud computing. Good intro.

— A blogger named Janakiram MSV describes all of the Apache streaming projects. Two corrections: Kafka Streams is a product of Confluent (corrected) and not part of Apache Kafka, and Apache Beam is an abstraction layer that runs on top of either batch or stream processing engines.

— Srini Penchikala explains how Netflix orchestrates its machine learning workflow with Spark, Python, R, and Docker.

— Kiuk Chung explains how to generate recommendations at scale with Spark and DSSTNE, the open source deep learning engine developed by Amazon.

— Madison J. Myers explains how to get started with Apache SystemML.

— Hossein Falaki and Shivaram Venkataraman explain how to use SparkR.

— Philippe de Cuzey explains how to migrate from Pig to Spark. For Pig diehards, there is also Spork.

— In a video, Evan Sparks explains what KeystoneML does.

— John Russell explains what pbdR is, and why you should care (if you use R).

— In a two-part post, Pavel Tupitsyn explains how to get started with Apache Ignite.NET. Part two is here.

— Manny Puentes of Altitude Digital explains how to invest in a big data platform.

Perspectives

— Beau Cronin summarizes four forces shaping AI: data, compute resources, software, and talent. My take: with the cost of data, computing and software collapsing, talent is the key bottleneck.

— Greg Borenstein argues for interactive machine learning. It’s an interesting argument, but not a new argument.

— Ben Taylor, Chief Data Scientist at HireVue, really does not care for Azure ML.

— Raj Kosaraju opines on the impact of machine learning on everyday life.

— An anonymous blogger at CBInsights lists ten well-funded startups developing AI tech.

— The folks at icrunchdata summarize results from the International Symposium on Biomedical Imaging, where an AI system proved nearly as accurate as human pathologists in diagnosing cancer cells.

Open Source Announcements

— Yahoo Research announces the release of Spark ADMM, a framework for solving arbitrary separable convex optimization problems with Alternating Direction Method of Multipliers. Not surprisingly given the name, it runs on Spark.

Commercial Announcements

— Talend announces plans for an IPO. The filing discloses that last year Talend lost 28 cents for every dollar in revenue, which is slightly better than the 35 cents lost in 2015. At that rate, Talend may break even in 2020, if nothing else happens in the interim.

Big Analytics Roundup (June 13, 2016)

Spark Summit 2016 met last week in SFO. There were many cool things; I will publish a separate report when presentations and videos are available.

KDnuggets releases results of its annual poll on data science software. Key findings:

  • Python use is up 51%, almost catches up to R, the #1 choice.
  • Excel and Tableau usage are up 47% and 49%, respectively.
  • Spark usage is up 91%, overtakes Hadoop.
  • SAS is down big time, drops from the top ten.

Meanwhile, Alex Woodie wraps statistics on Spark adoption, and Qubole’s Ari Amster reports on Spark usage among Qubole users.

Tim Spann recaps the week in Hadoop.

Spark Summit: Roundup of Roundups

— On the Databricks blog, Wayne Chan, Dave Wang, Jules Damji and Denny Lee recap highlights from the Summit.

— Jessica Davis rounds up the highlights.

— Jack Vaughan surrounds the story, quotes some old guy.

— Sam Dean summarizes what you need to know.

— Alex Handy collects the key bits.

— Andrew Brust separately corrals Day One and Day Two.

CFPs and Competitions

Flink Forward 2016, Berlin, September 12-14 (due June 30)

Parkinson’s Progression Markers Institute (PPMI) 2016 Challenge (due September 7)

Spark Summit Europe, Brussels, October 25-27 (closing date TBA)

Top Read

Adrian Colyer summarizes a paper on identifying architectural debt in software.

Explainers

— Deenar Torasker explains the new capabilities of HDFS.

— Ron Bodkin explains key considerations when designing continuous apps, in the second of a three part series. Part one is here.

— On his eponymous blog, Jesse Steinweg-Woods explains Gradient Boosted Trees with XGBoost in Python.

— Adam Warski explains how Kafka Streams fits into the stream processing landscape.

Perspectives

— H2O.ai’s Vinod Iyengar objects to what he calls the fragmentation of Spark support, correctly noting that Cloudera and Hortonworks support different versions of Spark in their distributions. Of course, nobody is obligated to use Spark with Cloudera and Hortonworks.

— From the Spark Summit on YouTube: Ben Lorica leads a panel discussion of incredibly smart and distinguished people, plus some old guy.

— Altiscale’s Barbara Lewis presents ten use cases for Big Data.

— Tim Wallis believes that AI will relieve boredom.

— Sam Dean touts Grappa, Drill and Kafka as successors to Spark. Grappa is going nowhere. Drill is great if all you want to do is SQL, and Kafka is great if all you want to do is streaming. Pro tip: there are no real-world analytic applications where all you want to do is streaming.

— Allen Downey opines that statistical tests are inflexible and opaque. Funny, my college roommate said the same thing when he flunked his Stat 101 mid-term.

Open Source Announcements

— LinkedIn announces release of PhotonML, a machine learning library for Spark. Feature detail here.

— Google releases TensorFlow 0.9.0, with iOS support. Speculation about deep learning on your phone ensues.

— Twitter donates DistributedLog to Apache.

Commercial Announcements

— Databricks announces general availability for the Databricks Community Edition, and completion of the first phase of Databricks Enterprise Security framework.

— Microsoft announces general availability for its managed Spark service in HDInsight, and summer availability for the Spark pushdown capability in R Server. The company also announced PowerBI support for Spark Streaming, which is confusing for those who thought PowerBI already supported Spark Streaming.

— IBM announces limited preview of a managed service branded as the Data Science Experience. IBM is coy about the details; the service definitely includes Spark, Jupyter and RStudio, H2O and “curated data sets”, and may include other bits. The service itself looks promising, but IBM’s claim to offer the “first development environment for Apache Spark” is BS.

— In an oddly opaque press release, H2O announces that it is “working with” IBM. H2O is open source software, and IBM requires no permission from H2O.ai for use or distribution; presumably, H2O will offer support contracts to users. H2O.ai did not respond to request for comment.

— Splice Machine announces plans to go open source; a company insider says they plan to donate the software to Apache. Dave Ramel reports.

Big Analytics Roundup (May 31, 2016)

Google’s TPU announcement on May 18 continues to reverberate in the tech press. In Forbes, HPC expert Karl Freund dissects Google’s announcement, suggesting that Google is indulging in a bit of hocus-pocus to promote its managed services.  Freund believes that TPUs are actually used for inference and not for model training; in other words, they replace CPUs rather than GPUs. Read the whole thing, it’s an excellent article.

Meanwhile, there’s this:

Cray-Urika-GX-system-software

Cray announces launch of Urika-GX, a supercomputing appliance that comes pre-loaded with Hortonworks Data Platform, the Cray Graph Engine, OpenStack management tools and Apache Mesos for configuration. Inside the box: Intel Xeon Broadwell cores, 22 terabytes of memory, 35 terabytes of local SSD storage and Cray’s high performance network interconnect. Cray will ship 16, 32 or 48 nodes in a rack in the third quarter, larger configurations later in the year.

And, Cringely vivisects IBM’s analytics strategy. My take: IBM’s analytics strategy is slick at the PowerPoint level, lame at the product level. IBM has acquired a lot of assets and partially wired them together, but has developed very little, and much of what it has developed organically is junk. Anyone remember Intelligent Miner? I rest my case.

Top Reads

— On the Databricks blog, Sameer Agarwal, Davies Liu and Reynold Xin deliver a deep dive into Spark 2.0’s second-generation Tungsten execution engine. Tungsten analyzes a query and generates optimized bytecode at runtime. The article includes performance comparisons to Spark 1.6 across the full range of TPC-DS.

— Via Adrian Colyer, a couple of Stanford profs explain optimal strategies for cloud provisioning. Must read for anyone who uses public cloud.

Explainers

— Alex Robbins explains resilient distributed datasets (RDDs), a core concept in Spark.

— On the AWS Big Data blog, Ben Snively explains how to use Spark SQL for ETL.

— Deborah Siegel and Danny Lee explain Genome Variant Analysis with ADAM and Spark in a three-part series.

— Alex Woodie explains two health sciences research projects running on Spark.

— Fabian Hueske explains Flink’s roadmap for streaming SQL.

— Hannah Augur explains Big Data terminology.

— In a sponsored piece, an anonymous blogger interviews Mesosphere’s Keith Chambers, who explains DC/OS.

Perspectives

— Matt Asay asks if Concord.io could topple Spark from its big data throne. The answer is no. Concord.io is exclusively for streaming, which makes it Yet Another Streaming Engine. For the three people out there who aren’t happy with Spark’s half-second latency, Concord is one option among several to check out.

— Accel Partners’ Jake Flomenberg trolls the Open Adoption concept, WSJ’s Steve Rosenbush swallows it hook, line and sinker. Someone should pay Andrew Lampitt a royalty.

— IBM’s got a fever, and the prescription is Spark.

Open Source Announcements

— The Apache Software Foundation promotes two projects from incubator status:

  • Apache TinkerPop, a graph computing framework.
  • Apache Zeppelin, a data science notebook.

— Twitter donates Yet Another Streaming Engine to open source.

— Apache Kafka announces Release 0.10.0.0

— Apache Kylin announces release 1.5.2, a major release with new features, improvements and 76 bug fixes.

Commercial Announcements

— Confluent announces Confluent Platform 3.0 (concurrently with Kafka 0.10). Andrew Brust reports.

— Amazon Web Services announces a 2X speedup for Redshift, so your data can die faster.

— Alpine Data Labs announces Chorus 6. If Alpine’s branding confuses you, you’re not alone. Chorus is a collaboration framework originally developed by Greenplum back when Alpine and Greenplum were part of one happy family. Then EMC acquired Greenplum but not Alpine and spun off Greenplum to Pivotal; Dell acquired EMC; and Pivotal open sourced Greenplum. Meanwhile Alpine merged Chorus with its Alpine machine learning workbench and rebranded the whole thing as Chorus.

Big Analytics Roundup (April 11, 2016)

Top story of the week is NVIDIA’s new DGX-1 deep learning chip; scroll down for more on that.

We have three roundups from Strata + Hadoop World, Rashomon style:

  • Alex Woodie reports six takeaways: Kafka, Spark, Hadoop, Cloud, machine learning, mainframes.
  • Jessica Davis recalls four things: comedian Paula Poundstone, MapR, public data sets, AI.
  • Nik Rouda recaps five things: Spark, machine learning, data warehousing, user interfaces, cloud.

— H2O.ai CTO and co-founder Cliff Click departs H2O, joins Neurensic, a firm that specializes in compliance analytics. Neurensic has a team of surname-eschewing executives that is surprisingly large considering it has no visible funding.

— Machine learning startup Context Relevant announces the appointment of Joseph Polverari as CEO, replacing board member Chris Kelley, who replaced founder Stephen Purpura in July, 2015, a month after the latter wrote a meditation on failure. Kelley’s major accomplishment: firing people. Appears that Context Relevant isn’t the next unicorn.

— One of the 76 IBM executives with the title of “CTO” touts cognitive computing. My take:

Screen Shot 2016-04-10 at 7.52.54 AM

— Forrester publishes its 2016 “Wave” for Big Data Streaming Analytics. You can go here and buy it for $2,495, get a free copy here, or just look at the picture below.

Screen Shot 2016-04-10 at 3.52.54 PM

— Spiderbook’s Aman Naimat examines data gleaned by trolling through billions of publicly available documents, identifies 2,680 companies that are using Hadoop at any level of maturity, and another 3,500 that are just learning. That’s out of a total universe of 500,000 companies worldwide. I’m thinking that trolling through billions of public documents may understate the actual incidence of Hadoop usage.

— Crowdflower, a data enrichment platform, surveys data scientists and publishes the results. The report does not disclose how data scientists were identified and sampled, which is key to interpreting surveys like this. Respondents report that they spend a lot of time mucking around with data, which won’t surprise anyone, since Crowdflower sells a service that helps data scientists spend less time mucking with data.

NVIDIA Unveils Deep Learning Chip

— NVIDIA announces June availability for the DGX-1, a deep learning supercomputer on a chip. The DGX-1 includes eight Tesla P100 GPUs, each of which is 12X faster than NVIDIA’s previous benchmark. For $129K you get the throughput of 250 CPU-based servers.

— NVIDIA also reveals a Deep Learning SDK with Deep Learning primitives, math libraries, tools for multi-GPU communication, a CUDA toolkit and DIGITS, a model training system. The system works with popular Deep Learning frameworks like Caffe, CNTK, TensorFlow and Theano.

— Selected media reports:

— MIT Technology Review interviews NVIDIA CEO Jen-Hsun Huang.

Explainers

— Ian Pointer explains Structured Streaming, coming up in Spark 2.0.

— Till Rohrmann introduces Complex Event Processing (CEP) with Flink.

— Maxime Beauchemin explains Caravel, Airbnb’s data exploration platform.

— LinkedIn’s Akshay Rai explains Dr. Elephant, a newly open-sourced self-service performance tuning package for Hadoop and Spark.

— In a guest post on the Cloudera Engineering Blog, engineers from Wargaming.net explain how they built their real-time recommendation engine with Spark, Kafka, HBase and Drools.

— Katrin Leinweber et. al. explain how to analyze an assay of bacteria-induced biofilm formation the freshwater diatom Achnanthidium minutissimum with KNIME. In case you’re wondering, Achnanthidium minutissimum is a kind of algae.

Perspectives

— On LinkedIn, George Hill of The Cyclist nicely critiques the 2011 McKinsey Big Data report, offering a point by point assessment.

— Mauricio Prinzlau of Cloudwards.net opines, without data, that the five languages paving the future of machine learning are MATLAB/Octave, R, Python, “Java-family/C-family” and Extreme Learning Machines (ELM). What was that last one again? Personally, I’ve never seen anyone lump Java and C into a single category, but whatever.

— In InfoWorld, “internationally recognized industry expert and thought leader” David Linthicum ventures into the machine learning discussion by arguing that it’s mostly BS.

— John Dunn demonstrates his ignorance of fraud by asking if machine learning can help banks detect it. As if they haven’t been doing that for years. Also, the “hard decline” he describes at the beginning of the article is rare; most false positives produce “soft declines,”, where the merchant is asked to request identification or speak with the call center.

— In IBT, Ian Allison wonders if financial analysts will lose their jobs to intelligent trading machines. If he watched Billions, he would know that financial analysts spend their time procuring inside information.

— Timo Elliott argues that BI is dead. I have to wonder if it was ever alive.

— Confluent CTO Neha Narkhede opines on stream processing. She’s in favor of it.

— Brandon Butler interviews AWS’ Matt Wood, who chats about competing with Google and Microsoft.

— On Forbes, Robert Hof interviews Cloudera CEO Tom Reilly.

Open Source Announcements

— Qubole releases SQL optimizer Quark to open source.

— Flink releases version 1.0.1, a maintenance release.

— Apache Lens, a “unified analytics interface,” releases version 2.5.0 to beta.

— Airbnb open sources Caravel, a data exploration package.

— Apache Tajo announces Release 0.11.2, which should please its user.

— LinkedIn releases Dr. Elephant to open source.

Commercial Announcements

— Databricks announces the agenda for Spark Summit 2016 in SFO.

— Cloudera announces Cloudera Enterprise 5.7. New analytic bits include Hive-on-Spark GA, support for the HBase-Spark module, support for Spark 1.6 and support for Impala 2.5.

— MapR announces availability of Apache Drill 1.6 as the unified SQL layer for the MapR Converged Data Platform.

2016 Big Analytics Predictions Roundup

Before publishing my own predictions for 2016 later this week, I thought it would be fun to round up published predictions on analytics and Big Data.  Looking through this list, I see a few patterns:

— Streaming is hot.  Analysts do not seem to understand distinctions between streaming data, streaming analytics and real-time decisioning.

— “Data Science” continues to be a term that means whatever you like.

— Security and anti-fraud analytics will be a thing in 2016.  (They were also a thing in 2015.)

— Industry analysts are divided about whether or not the analytics talent crunch will persist.

— IoT is a great concept for selling data management tools, but few know how to make sense of it.

On ZDNet, Andrew Brust summarizes 60 predictions from 17 executives and sees the following:

  1. Increased adoption of streaming analytics
  2. Maturation of IoT technologies
  3. Value and maturity in Big Data products
  4. Increased deployment of artificial intelligence and machine learning

On KDnuggets, Gregory Piatetsky reports on five predictions for 2016 from Tom Davenport of the International Institute of Analytics.  (Webinar replay here.)

  1. Cognitive technology will be the next thing after automated analytics.
  2. Analytical microservices will facilitate embedded analytics.
  3. Data Science and predictive analytics will merge.
  4. The analytics talent crunch will ease due to increased enrollment in graduate programs.
  5. Analytics will focus on data curation and management.

Davenport is smoking something if he thinks cognitive computing will be a thing in 2016.

In Forbes, Gil Press synthesizes the IIA’s predictions (above) with predictions from Forrester, IDC and Gartner to get six predictions:

  1. Analytics will be embedded everywhere.
  2. Machine learning will replace manual data wrangling.
  3. The shortage of analytics talent will persist.
  4. Analytics projects will be riskier than typical IT projects.
  5. Cognitive computing will be the next buzzword.  (Press clearly does not agree with Davenport).
  6. Data monetization will take off.

Predictions (2) and (3) conflict with one another; since analysts spend 80% of their time data wrangling, tooling that automates this step will relieve the talent shortage.

On Datanami, Alex Woodie wades through “dozens” of predictions and publishes the 33 most interesting.  Many of these are self-serving, obvious or nonsensical, so I will do the work Woodie’s editor did not do and distill the list to five:

  1. Streaming analytics will mature and prove its worth.
  2. Apache Kafka will be an essential integration point in enterprise infrastructure.
  3. Business user access to Hadoop data will improve.
  4. Spark will significantly displace MapReduce for Hadoop workloads.
  5. Spark processing outside of Hadoop will also increase significantly.

Teryn O’Brien of Silicon Angle reports on a webinar hosted by Alteryx that included Bob Laurent of Alteryx, Clarke Patterson of Cloudera and Francois Ajenstat of Tableau.  The panel offered three predictions:

  1. Analyst jobs will be hot and analysts will be everyday heroes.
  2. Spark, the cloud and IoT will be big in 2016.
  3. Advanced analytics will play a key role in the Presidential election.

On ITPortal, Dell’s Todd O’Brien predicts three things for 2016:

  1. The role of Citizen Data Scientists will expand and evolve.  (Me: WTF?)
  2. Analytics will significantly affect vertical markets, especially manufacturing.
  3. All innovation will trace back to analytics

On the first point, I think that O’Brien is trying to say that companies should buy analytics software that is easy to use, like what Dell offers.

On the FICO blog, FICO’s chief analytics officer Scott Zoldi offers five predictions for 2016:

  1. Streaming analytics will come of age in 2016.
  2. “Prescriptive analytics” (his term for anomaly detection) will be a must-have security technology.
  3. “Lifestyle analytics” (predictions embedded in consumer interactions) will integrate prescriptive analytics into daily life.
  4. Businesses will rethink Big Data governance.
  5. Fake data scientists will emerge.

On a SAS blog, Polly Mitchell-Guthrie predicts five things:

  1. Machine learning (will be) established in the enterprise.
  2. IOT hype hits reality.
  3. Big Data moves beyond hype.
  4. Analytics improve cybersecurity.
  5. Analytics drives increased industry-academic interaction.

It’s standard practice at SAS to call any new IT trend “hype.”

In a press release, the health analytics vendor SCIO Health Analytics makes four predictions for 2016:

  1. Greater focus on educating health consumers.
  2. Demand for more precision in health analytics.
  3. More time will be spent on reimbursement strategies.
  4. The need for data and transparency across domains will increase.

Prediction #1 may be true, but it’s not really about health analytics.

On the Talend blog, CMO Ashley Stirrup predicts four things:

  1. Real-time analytics will take center stage
  2. New business threats will emerge
  3. CIO turnover will accelerate
  4. Businesses will retool

#2 and #4 aren’t really predictions, they simply state the obvious.

Big Analytics Roundup (October 12, 2015)

Dell and Silver Lake Partners announce plans to buy EMC for $67 billion, a transaction that is a big deal in the tech world, and mildly interesting for analytics.  Dell acquired StatSoft in 2014,but nothing before or since suggests that Dell knows how to position and sell analytics.  StatSoft is lost inside Dell, and will be even more lost inside Dell/EMC.

EMC acquired Greenplum in 2010; at the time, GP was a credible competitor to Netezza, Aster and Vertica.  It turns out, however, that EMC’s superstar sales reps, accustomed to pushing storage boxes, struggled to sell analytic appliances.  Moreover, with the leading data warehouse appliances vertically integrated with hardware vendors, Greenplum was out there in the middle of nowhere peddling an appliance that isn’t really an appliance.

EMC shifted the Greenplum assets to its Pivotal Software unit, which subsequently open sourced the software it could not sell and exited the Hadoop distribution business under the ODP fig leaf.  Alpine Data Labs, which used to be tied to Greenplum like bears to honey, figured out a year ago that it could not depend on Greenplum for growth, and has diversified its platform support.

What’s left of Pivotal Software is a consulting business, which is fine — all of the big tech companies have consulting arms.  But I doubt that the software assets — Greenplum, Hawq and MADLib — have legs.

In other news, the Apache Software Foundation announces three interesting software releases:

  • Apache AccumuloRelease 1.6.4, a maintenance release.
  • Apache Ignite: Release 1.4.0, a feature release with SSL and log4j2 support, faster JDBC driver implementation and more.
  • Apache Kafka: Release 0.8.2.2, a maintenance release.

Spark

On the MapR blog, Jim Scott takes a “Spark is a fighter jet” metaphor and flies it until the wings fall off.

Spark Performance

Dave Ramel summarizes a paper he thinks is too long for you to read.  That paper, here, written by scientists affiliated with IBM and several universities, reports on detailed performance tests for MapReduce and Spark across four different workloads.  As I noted in a separate blog post, Ramel’s comment that the paper “calls into question” Spark’s record-setting performance on GraySort is wrong.

Spark Appliances

Ordinarily I don’t link sponsored content, but this article from Numascale is interesting.  Numascale, a Norwegian company, offers analytic appliances with lots of memory; there’s an R appliance, a Spark appliance and a database appliance with MonetDB.

Spark on Amazon EMR

On Slideshare, Amazon’s Jonathan Fritz and Manjeet Chayel summarize best practices for data science with Spark on EMR.  The presentation includes an overview of Spark DataFrames, a guide to running Spark on Amazon EMR, customer use cases, tips for optimizing performance and a plug for Zeppelin notebooks.

Use Cases

In Datanami, Alex Woodie describes how Uber uses Spark and Hadoop.

Stitch Fix offers personalized style recommendations to its customers.  Jas Khela describes how the startup uses Spark.   (h/t Hadoop Weekly)

SQL/OLAP/BI

Apache Drill

MapR’s Neeraja Rentachintala, Director of Product Management, rethinks SQL for Big Data.  Without a trace of irony, he explains how to bring SQL to NoSQL datastores.

Apache Hawq

On the Pivotal Big Data blog, Gavin Sherry touts Apache Hawq and Apache MADLib.  Hawq is a SQL engine that federates queries across Greenplum Database and Hadoop; MADLib is a machine learning library.   MADLib was always open source; Hawq, on the other hand, is a product Pivotal tried to sell but failed to do so.  In Datanami, George Leopold reports.

In CIO Today, Jennifer LeClaire speculates that Pivotal is “taking on” Oracle’s traditional database business with this move, which is a colossal pile of horse manure.

At Apache Big Data Europe, Caleb Welton explains Hawq’s architecture in a deep dive.  The endorsement from GE;s Jeffrey Immelt is a bit rich considering GE’s ownership stake in Pivotal, but the rest of the deck is solid.

Apache Phoenix

At Apache Big Data Europe, Nick Dimiduk delivers an overview of Phoenix, a relational database layer for HBase.  Phoenix includes a query engine that transforms SQL into native HBase API calls, a metadata repository and a JDBC driver.  SQL support is broad enough to run TPC benchmark queries.  Dimiduk also introduces Apache Calcite, a Query parser, compiler and planner framework currently in incubation.

Data Blending

On Forbes, Adrian Bridgewater touts the data blending capabilities of ClearStory Data and Alteryx without explaining why data blending is a thing.

Presto

On the AWS Big Data Blog, Songzhi Liu explains how to use Presto and Airpal on EMR.  Airpal is a web-based query tool developed by Airbnb that runs on top of Presto.

Machine Learning

Apache MADLib

MADLib is an open source project for machine learning in SQL.  Developed by people affiliated with Greenplum, MADLib has always been an open source project, but is now part of the Apache community.  Machine learning functionality is quite rich.  Currently, MADLib supports PostgreSQL, Greenplum database and Apache Hawq.  In theory, the software should be able to run in any SQL engine that supports UDFs; since Oracle, IBM and Teradata all have their own machine learning stories, I doubt that we will see MADLib running on those platforms. (h/t Hadoop Weekly)

Apache Spark (SparkR)

On the Databricks blog, Eric Liang and Xiangrui Meng review additions to the R interface in Spark 1.5, including support for Generalized Linear Models.

Apache Spark (MLLib)

On the Cloudera blog, Jose Cambronero explains what he did this summer, which included running K-S tests in Spark.

Apache Zeppelin

At Apache Big Data Europe, Datastax’ Duy Hai Doan explains why you should care about Zeppelin’s web-based notebook for interactive analytics.

H2O and Spark (Sparkling Water)

In a guest post on the Cloudera blog, Michal Malohlava, Amy Wang, and Avni Wadhwa of H2O.ai explain how to create an integrated machine learning pipeline using Spark MLLib, H2O and Sparkling Water, H2O’s interface with Spark.

How Yahoo Does Deep Learning on Spark

Cyprien Noel, Jun Shi, Andy Feng and the Yahoo Big ML Team explain how Yahoo does Deep Learning with Caffe on Spark.  Yahoo adds GPU nodes to its Hadoop clusters; each GPU node has 10X the processing power of a commodity Hadoop node.  The GPU nodes connect to the rest of the cluster through Ethernet, while Infiniband provides high-speed connectivity among the GPUs.

Screen Shot 2015-10-06 at 10.32.39 AM

Caffe is an open source Deep Learning framework developed by the Berkeley Vision and Learning Center (BVLC).  In Yahoo’s implementation, Spark assembles the data from HDFS, launches multiple Caffe learners running on the GPU nodes, then saves the resulting model back to HDFS. (h/t Hadoop Weekly)

Streaming Analytics

Apache Flink

On the MapR blog, Henry Saputra recaps an overview of Flink’s stream and graph processing from a recent Meetup.

Apache Kafka

Cloudera’s Gwen Shapira presents Kafka “worst practices”: true stories that happened to Kafka clusters. (h/t Hadoop Weekly)

Apache Spark Streaming

On the MapR blog, Jim Scott offers a guide to Spark Streaming.

Big Analytics Roundup (October 5, 2015)

Announcements timed to coincide with Strata NYC 2015 drive the news this week.  The single most interesting item for Big Analytics is the O’Reilly 2015 Data Science Survey, which warrants a post of its own.  Two key points:

  • Data scientists still use SQL, Excel, Python and R. (Doh!)
  • Data scientists who spend time in meetings and presenting results of analysis earn more than the grunts who muck around with data.

The lesson is clear: if you want to earn the big bucks, stop messing around with Zeppelin and learn PowerPoint.

There are a few interesting presentations from Strata embedded in this post, plus two that stand out:

  • Ron Kasabian of Intel and Michael Draugelis of Penn Medicine explain how they improve medical decision-making with predictive analytics.
  • Iulia Pasov and Calin-Andrei Burloiu show how they use data science to measure and prevent churn at Avira.

Paul Kent has the toughest job at SAS — promoting an initiative his boss thinks is hype.  In this sponsored presentation, Paul does a professional job presenting SAS’ Big Data story, which seems compelling.  However, the challenge for SAS in Big Data remains: name a reference customer.

Spark

Commentary

MapR’s Jim Scott takes another shot at the “will Spark replace Hadoop?” meme.  All together now:

  • Spark, like MapReduce, is a compute engine
  • Hadoop = MapReduce + HDFS + YARN + (an ecosystem of other bits)
  • Spark lacks a native file system, so it can never replace Hadoop
  • Coming in 2016:  Hadoop = Spark + HDFS + YARN + (an ecosystem…)
  • Also possible: Spark + Cassandra, Spark + MongoDB, Spark + Druid, Spark + (your database here)

On the Intersog blog, Jenny Richards gets it right by focusing on the differences between Spark and MapReduce.

Spark Maintenance Release

The Spark team announces Spark 1.5.1, a maintenance release with about 80 bug fixes.  On the Databricks blog, Reynold Xin explains Spark version numbers work.  Short version: top level numbers correspond to API compatibility, dot releases include features and enhancements, double-dot releases have bug fixes.

Spark Use Cases and Success Stories

At Strata NYC 2015, Databricks’ Reynold Xin describes  “sketching” with Spark (aka exploratory analysis and feature engineering).

Also at Strata, Edd Dumbill presents the business case for Spark, Kafka “and friends.”

On the MongoDB blog, Mat Keep interviews Thiago Cardoso, co-founder and CTO of Hekima, a social media analytics startup.  Hekima uses Spark, Hadoop and (you guessed it) MongoDB.

On SmartDataCollective, the ubiquitous Jim Scott describes what he calls use cases for Spark: exploratory analytics; machine learning; real-time dashboards and ETL.

At Big Data Dat LA 2015, ESRI’s Adam Mollenkops explains how to apply GeoSpatial Analytics with Spark. Video here.

Spark Integration

A slew of software vendors announce integration with Spark.

–Talend announces Release 6, which leverages Spark and Spark Streaming.  Stories here, herehere, here, here, here, and here.

Syncsort announces integration with Spark and Kafka.

–TIBCO announces Spotfire integration with Spark through the Spark SQL and SparkR APIs.  More here, here and here.

–Dataiku announces integration of its Data Science Studio (DSS) software with Spark.  DSS offers a commercially licensed visual workbench enabling the user to build pipelines integrating a number of data sources and formats.  Analytic functionality is modest.

–SnapLogic announces pending release of what it calls SnapLogic Elastic Integration Platform, which includes components branded as “Sparkplex” and Spark “Snaps”.  The former includes a code generator that translates user requests from the SnapLogic visual pipeline designer into Spark code.  The latter is SnapLogic branding for “prebuilt connectors”, of which there are many.  SnapLogic claims that Snaps are plug and play, so it’s a “snap” to convert your pipeline from MapReduce to Spark.  Stories here, here, here, here, and here.

Spark as a Service

In case you’re not happy with offerings from Databricks, Qubole, Amazon Web Services, Google, BlueData and MemSQL, Altiscale announces you-know-what.

SQL/OLAP

Apache Drill

Dremio, a startup led by MapR’s Drill gurus, lands a $10 million A round.

At the Hadoop Meeting NYC, the folks from Dremio present Drill use cases and a roadmap.

Polymath Abhishek Tiwari reflects on Drill.

On the MapR blog, Joseph Blue explains how to identify a data breach with Drill.

AtScale

adds support for Microsoft PowerBI to its BI on Hadoop story.

Presto

On ZDNet, Natalie Gagliordi touts Teradata’s embrace of Presto.  (Note to ZDNet’s headline editor: Presto is not an Apache project).  She correctly notes that Presto is faster than Hive-on-MapReduce, but that’s a low bar; just about everything is faster than Hive releases prior to 0.13, but Hive-on-Tez competes well with Drill, Impala, Presto and Spark SQL.  That’s a problem for Presto, because a challenger has to be outstanding at something.  Once again, it appears that Teradata is betting on the wrong horse.

Machine Learning

SparkR

Databricks’ Hossein Falaki delivers a presentation at Strata on supercharging R with Spark; slides here.  Spark’s R API is incomplete; as of Spark 1.5.1 it supports DataFrames operations (including SQL queries) and generalized linear models.  That’s better than nothing, but R users who need to do serious machine learning now need to look elsewhere.

Streaming Analytics

Apache Flink

On the MapR blog, Ellen Friedman introduces you to Flink.

Data Artisans’ Robert Metzger delivers a presentation about the architecture of Flink’s streaming runtime at ApacheCon Europe.

Spark Streaming

At Strata NY 2015, Databricks’ Tathagata Das describes the new bits in Spark Streaming.

Big Analytics Roundup (September 7, 2015)

Top news for this week: SAP and Syncsort release Spark connectors, and HP says it wants to develop one; Pivotal abandons Hawq; Flink, Spark and H2O publish agendas for the upcoming events.

Postdoc Opportunity

Peter Rose, Site Head of the RCSB Protein Data Bank West at SDSC just landed a $1.4 million grant from the National Institutes of Health under its Big Data to Knowledge (BD2K) initiative.   Peter writes to say that the team plans to use Apache Spark, and seeks postdocs to join the group.   See the flyer here: Postdoc Big Data UCSD flyer

SQL on Hadoop

Pivotal gives up trying to sell Hawq, donates it to Apache where it is now an incubator project.  Hawq is a SQL tool that federates queries across Hadoop and Greenplum database, introduced with much hoopla two years ago.

Apache Drill

On the MapR blog, Carol McDonald offers the ultimate guide to Drill’s architecture.  (h/t Hadoop Weekly)

Apache Flink/ Data Artisans

announces Flink 0.9.1, a maintenance release.

Flink Forward’s organizers publish the conference program, which will be held October 12-13 at Berlin’s KulturBrauerei in the Prenzler Berg district.  Conference organizers have arranged discounted rooms at the nearby Hotel4Youth, where presumably old folks like me are unwelcome.

On Slideplayer, Data Artisans co-founder Stephan Ewan presents an overview of Flink.

On the Data Artisans blog, Robert Metzger and Kostas Tzoumas present a practical guide to integrating Flink and Kafka.  Go here for video.

Dataconomy interviews Romeo Kienzler of IBM, who plans to present at the Flink Forward conference in October.

Apache Spark/ Databricks

Databricks publishes agenda for the Spark Summit Europe.  I’ve attended every Spark Summit to date, but will skip this one.

HP announces its future “commitment” to integrate Vertica with Spark, featuring accelerated data transfer and a model scoring capability.  Translation:  HP wants some Spark buzz to counter IBM, but they don’t have a working release or even a definite plan.

MapR’s Carol McDonald publishes a guide to integrating Spark Streaming with HBase.  (h/t Hadoop Weekly)

SAP announces SAP HANA Vora, an in-memory query engine that runs on Spark and supports in-memory queries, OLAP and drill-down analysis.  SAP’s PR engine kicks into overdrive.  Timothy Prickett Morgan touts it; additional coverage here, here, and here.

Syncsort contributes an IBM z-system mainframe connector to Spark Packages.   Analysis ensues: here, here, here, and here.  Some analysts confuse this announcement with IBM’s plan to support Spark on its mainframes.  The two things are completely different: the point of the connector is to extract data from the mainframe and move it to Spark for processing.

Bernard Marr asks whether the rise of Spark spells the end of Hadoop, explains why the answer is no.

H2O/ H2O.ai

publishes schedule and speaker lineup for H2O World 2015 (November 9-11).   Featured speakers include Hilary Mason, Monica Rogati, Stephen Boyd and Rob Tibshirani.

R

Hadley Wickham announces dplyr 0.4.3, with 30 minor improvements and bug fixes.  Release notes here.

Revolution R/ Microsoft

…announces Revolution R Open 3.2.2, an enhanced R distribution.

Big Analytics Roundup (April 6, 2015)

Late posting today due to holiday travel.

In the week following Spark Summit East, a number of Spark skeptics surfaced, a sign that people take Spark seriously.

The top item of the week, though, is Tiernan Ray’s interview with Michael Stonebraker in Barrons, a must-read.

Analytic Software

Forrester published its latest “wave” for Big Data Predictive Analytics Solutions, an inaptly named report that lumps together solutions that can work with Big Data and those that cannot.  I’ll write a more detailed summary later this week.  Quick takes:  Alteryx, Oracle and RapidMiner did well, but Alpine and Microsoft clearly need to shift some of their analyst relations spending from Gartner to Forrester.

Apache Drill

Apache Drill announces Release 0.8.

Apache Spark

Analysis

In opensource.com, Jen Wike Hugar interviews key Spark contributor Reynold Xin.

Mike Vizard, in the aptly named Talkin’ Cloud, describes the high potential for Spark in the cloud.  (Though he does not mention it, more than half of respondents to a recent Typesafe survey of Spark users said they deploy it in the cloud.)

Matei Zaharia, creator of Spark and CTO of Databricks, held an Ask Me Anything last week on Reddit.  Key takeaways: no, Matei is not a musician, and yes, he likes Nutella. 

Spark has clearly reached a point of inflection when skeptical analysis emerges.  Criticism is healthy, of course, but what the skeptics all seem to share is an ignorance of machine learning and streaming applications, and the challenge of making those applications work well in MapReduce.  In other words, they all seem to misunderstand the purpose of Spark, and would do well to learn more about the platform before quibbling on the margins.

  • Professional cat herder Andrew Oliver compares Spark to Tableau and, shockingly, finds it wanting.  Also, Andrew heard people say unflattering things about Hadoop at Spark Summit East.  Who knew that Hadoop devotees are so sensitive?
  • In DataMill, Nicole Leskowski asks if Apache Spark is the next big thing in Big Data Analytics, a question that would have been timely last year.
  • In TechTarget, Jack Vaughan wonders whether Spark is just a shiny new object, while ruminating about Digital Equipment and the PDP-11.  His point will be lost on most readers.
  • Returning to ZDNet from GigaOm, Andrew Brust asks if Spark is overhyped, citing unnamed second-hand sources that tell him Spark is “not ready for prime time.”   Note to Andrew: you can download the software here.

Spark Core

Matei Zaharia celebrates Spark’s fifth birthday with a brief history.

On the Cloudera blog, Sandy Ryza concludes his series on tuning Spark jobs.

Spark Streaming

On the Databricks blog. Cody Koeninger, Davies Liu and Tathagata Das describe the new direct Kakfa API available in Spark 1.3

Databricks

Databricks announced that Timeful, a startup specializing in intelligent time management, has deployed its recommendation engine in Databricks Cloud.  Case study available here.

Hadoop Ecosystem

In Datanami, Hadoop skeptic Alex Woodie asks if Hadoop needs a reality check, observing that the leading Hadoop distributors do not make money, a trait shared by most industries at comparable points of maturity.  Woodie cites Wikibon’s Big Data revenue summary as evidence that there is little money in Hadoop, without considering the validity of Wikibon’s data (which is self-reported by the vendors and lacks consistent definitions).  Even if we accept the Wikibon data at face value, Woodie also fails to note that startup Palantir (which is totally into Hadoop) now reports more Big Data revenue than industry leader SAS.  Another unanswered question: if Hadoop is so inconsequential, why has Teradata lost half its market value since 2012?

IBM

IBM announces BigInsights 4.0 just nine months after releasing BigInsights 3.0.  BigInsights includes the usual Hadoop bits, plus:

  • BigSQL, a federation engine for SQL across relational databases and Hadoop
  • Big Sheets, a Datameer-like spreadsheet-on-Hadoop tool
  • SystemML, a home-grown machine learning library that runs in MapReduce
  • Text analytics capability
  • Big R, an interface that can push embarrassingly parallel R processing into Hadoop

Streaming and Real-Time Processing

On the O’Reilly Radar blog, Ben Lorica describes platforms and applications for processing data streams.