Machine Learning in Hadoop: Part One
Much has changed since I last blogged on this subject a year ago (here and here). This is the first of a three-part series covering the current state of play for machine learning in Hadoop. I use the term “machine learning” deliberately, to refer to tools that can learn from data in an automated or semi-automated manner; this includes traditional statistical modeling as well as supervised and unsupervised machine learning. To keep the scope manageable, I will not cover fast query tools, BI applications, graph engines or streaming analytics; all of those are important and deserve separate treatment.
Every analytics vendor claims the ability to work with Hadoop. In Part One, we cover five things to consider when evaluating how well a particular machine learning tool integrates with Hadoop: deployment topology, hardware requirements, workload integration, data integration, and the user interface. Of course, these are not the only things an organization should consider when evaluating software; other features, such as support for specific analytic methods, required authentication protocols and other needs specific to the organization may be decisive.
Deployment Topology

Where does the machine learning software reside relative to the Hadoop worker nodes (the servers running the DataNode and TaskTracker or NodeManager daemons)? Is it (a) distributed among the Hadoop worker nodes; (b) deployed on special-purpose “analytic” nodes; or (c) deployed outside the Hadoop cluster entirely?
Distribution among the worker nodes offers the best performance; every other topology incurs data movement that degrades it. If end users tend to work with relatively small samples drawn from the data store, “beside” architectures may be acceptable, but fully distributed deployment is essential for very large datasets.
Deployment on special-purpose “analytic” nodes is a compromise architecture, usually motivated either by a desire to reduce software licensing fees or by a need to avoid hardware upgrades for the worker node servers. There is nothing wrong with saving money, but clients should not be surprised if performance suffers under anything other than a fully distributed architecture.
Hardware Requirements

If the machine learning software supports distributed deployment on the Hadoop worker nodes, can it run effectively on standard Hadoop node servers? The definition of a “standard” node server is a moving target; Cloudera, for example, recognizes that the appropriate hardware specification depends on the planned workload. Machine learning, as a rule, benefits from generous memory, but some machine learning tools use memory far more efficiently than others.
Clients are sometimes reluctant to implement a fully distributed machine learning architecture in Hadoop because they do not want to replace or upgrade a large number of node servers. This reluctance is natural, but the problem is attributable partly to gaps in capacity planning and partly to rapidly changing technology. Trading off performance for cost reduction may be the right thing to do, but it should be a deliberate decision.
Workload Integration

If the machine learning software can be distributed among the worker nodes, how well does it co-exist with other MapReduce and non-MapReduce applications? The gold standard is the ability to run natively under Apache YARN, which manages resources across MapReduce and non-MapReduce applications. Machine learning software that pushes commands down to MapReduce is also acceptable, since the generated MapReduce jobs run under the cluster's existing workload management.
Software that effectively takes over the Hadoop cluster and prevents other jobs from running is only acceptable if the cluster will be dedicated to the machine learning application. This is not completely unreasonable if the Hadoop cluster replaces a conventional standalone analytic server and file system; the TCO for a Hadoop cluster is very favorable relative to a dedicated high-end analytic server. Obviously, clients should know how they plan to use the cluster when considering this.
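To make the push-down pattern concrete, here is a minimal sketch (illustrative only, not any vendor's actual code) of the kind of Hadoop Streaming job a machine learning tool might generate behind the scenes; counting feature occurrences is a plausible first step in building a model. Because the generated job is ordinary MapReduce, it schedules and queues like any other workload on the cluster.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (feature, 1) for each token in the input.
    Under Hadoop Streaming this would read sys.stdin and print
    tab-separated key/value pairs to sys.stdout."""
    for line in lines:
        for token in line.split():
            yield token, 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped and sorted by key; sum the
    counts. On a real cluster, Hadoop's shuffle-and-sort does the
    grouping between the two phases."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

def run_locally(lines):
    """Local stand-in for the cluster: sorted() replaces the shuffle."""
    return dict(reducer(sorted(mapper(lines))))
```

On a cluster, the same mapper and reducer scripts would be submitted via the Hadoop Streaming jar (`-mapper`/`-reducer` options), which is exactly why push-down tools inherit the cluster's existing workload management for free.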
Data Integration

Ideally, machine learning software should be able to work with every data format supported in Hadoop; in practice, most machine learning tools are more limited in what they can read and write. The ability to work with uncompressed text in HDFS is table stakes; more sophisticated tools can work with sequence files as well, and support popular compression codecs such as Snappy, gzip and bzip2. There is also growing interest in the use of Apache Avro. Users may also want to work with data in HBase, Hive or Impala.
There is wide variation in the data formats supported by machine learning software; clients are well advised to tailor assessments to the actual formats they plan to use.
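The variation is easy to underestimate because many Hadoop file formats announce themselves only through "magic" bytes at the start of the file. The toy sniffer below is an illustration, not part of any Hadoop API, but the magic values shown for gzip, bzip2, SequenceFile and Avro object container files are the standard ones; notably, raw Snappy streams carry no magic number at all, a reminder that format handling is not trivial.

```python
# Each entry pairs a file's leading "magic" bytes with a format name.
MAGIC = [
    (b"\x1f\x8b", "gzip"),       # gzip-compressed text
    (b"BZh", "bzip2"),           # bzip2-compressed text
    (b"SEQ", "sequencefile"),    # Hadoop SequenceFile header
    (b"Obj\x01", "avro"),        # Avro object container file
]

def sniff(first_bytes: bytes) -> str:
    """Guess a file's container format from its opening bytes."""
    for magic, name in MAGIC:
        if first_bytes.startswith(magic):
            return name
    return "plain-text (assumed)"
```

An evaluation checklist can be built the same way: enumerate the formats actually present in the cluster, then verify the candidate tool reads and writes each one.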
User Interface

There are many aspects of the user interface that matter when evaluating software, but here we consider just one: does the machine learning software require the user to specify native MapReduce commands, or does it translate user requests to run in Hadoop behind the scenes?
If the user must write MapReduce, Hive or Pig commands, it raises an obvious question: why not just perform the task directly in MapReduce, Hive or Pig?
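The "translation" approach can be sketched in a few lines: the tool accepts a high-level request and generates the Hadoop invocation itself. The task and script names below are hypothetical, but `-input`, `-output`, `-mapper` and `-reducer` are the real Hadoop Streaming options the user would otherwise have to compose by hand.

```python
import shlex

# Hypothetical catalog mapping high-level tasks to scripts the tool ships.
TASKS = {"count-features": ("feature_mapper.py", "sum_reducer.py")}

def translate(task: str, input_path: str, output_path: str) -> str:
    """Turn a high-level user request into the equivalent Hadoop
    Streaming command line, shell-quoting each argument."""
    mapper, reducer = TASKS[task]
    args = ["hadoop", "jar", "hadoop-streaming.jar",
            "-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer]
    return " ".join(shlex.quote(a) for a in args)
```

A tool that performs this translation behind the scenes adds value precisely because the user never sees the generated command; one that merely asks the user to type MapReduce, Hive or Pig themselves adds very little.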
In Part Two, we will examine current open source alternatives for machine learning in Hadoop.