Thursday, June 6, 2013

MapReduce 2.0: Next Generation MapReduce using YARN

Apache Hadoop, the most popular framework for running MapReduce over big data, has seen a lot of long-overdue changes.
As Hadoop users may recall, one of the weaknesses of MapReduce was that a single node, the JobTracker, handled job scheduling as well as the management and error handling of the map and reduce tasks run by those jobs. Until now, there have been workarounds that addressed the failure of this job management node, but the concentration of all this work in the JobTracker was not addressed. In particular, several concerns remained open: scalability, cluster utilization, ease of change and upgrade, and support for workloads other than MapReduce itself.

YARN (Yet Another Resource Negotiator) addresses this problem by splitting resource management and job scheduling into separate daemons, which then serve applications, where an application is either a single job or a graph of interrelated jobs. Resource management is done by a global ResourceManager (RM) that resembles the old JobTracker but exists independently of any job or application. To run the jobs of each application, a per-application ApplicationMaster (AM) acts as an intermediary between the RM and the nodes: it negotiates the resources it needs with the RM. The nodes themselves are managed through a NodeManager (NM) on each node, which relays node health and resource usage to the RM as needed.
The YARN architecture
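
To make the division of roles concrete, here is a minimal toy model in Python. It is only a sketch and not the Hadoop API: the class names mirror the YARN daemons, but the methods (`heartbeat`, `register`, `allocate`, `run`), the memory-only resource model, and the first-fit placement are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free memory and reports status to the RM."""
    name: str
    free_mb: int
    healthy: bool = True

    def heartbeat(self):
        # In this toy model a heartbeat is just a status snapshot.
        return {"node": self.name, "free_mb": self.free_mb, "healthy": self.healthy}


@dataclass
class ResourceManager:
    """Global arbiter: knows the nodes and hands out containers; runs no tasks."""
    nodes: dict = field(default_factory=dict)

    def register(self, nm):
        self.nodes[nm.name] = nm

    def allocate(self, mem_mb):
        # First fit (an assumption): grant a container on the first healthy
        # node that reports enough free memory.
        for nm in self.nodes.values():
            status = nm.heartbeat()
            if status["healthy"] and status["free_mb"] >= mem_mb:
                nm.free_mb -= mem_mb
                return (nm.name, mem_mb)
        return None  # nothing available right now


@dataclass
class ApplicationMaster:
    """Per-application coordinator: requests containers and tracks its own tasks."""
    app_name: str
    rm: ResourceManager
    placements: dict = field(default_factory=dict)

    def run(self, tasks, mem_mb):
        for task in tasks:
            self.placements[task] = self.rm.allocate(mem_mb)
        return self.placements


# Hypothetical three-node cluster; node3 is marked unhealthy.
rm = ResourceManager()
rm.register(NodeManager("node1", free_mb=2048))
rm.register(NodeManager("node2", free_mb=1024))
rm.register(NodeManager("node3", free_mb=4096, healthy=False))

am = ApplicationMaster("wordcount", rm)
placements = am.run(["map-0", "map-1", "map-2", "reduce-0"], mem_mb=1024)
print(placements)
```

The point of the sketch is that the ResourceManager never looks at what a task does. It only matches resource requests against what the NodeManagers report, and everything job-specific lives in the ApplicationMaster.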

A key advantage of YARN is that it supports processing models other than MapReduce. Some problems, such as graph processing, need intermediate or real-time results during processing and have traditionally been a poor fit for MapReduce. Simply put, in the context of Hadoop: if one node's results are required by another node during or after the processing of one or both of their records, then batch processing is not the answer.
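
As a rough illustration of why graph problems strain a batch model, the sketch below computes hop counts from a source vertex with breadth-first search written as repeated map-and-reduce style rounds. The graph and the function names are hypothetical and this is plain Python, not Hadoop code. In classic MapReduce each round would typically be a separate job, with the intermediate distances written out between rounds.

```python
from collections import defaultdict

# Hypothetical toy graph as an adjacency list.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}


def bfs_round(graph, dist, frontier):
    """One map+reduce style pass: expand the frontier by one hop."""
    # "Map": every frontier vertex emits (neighbor, candidate distance).
    emitted = defaultdict(list)
    for v in frontier:
        for n in graph[v]:
            emitted[n].append(dist[v] + 1)
    # "Reduce": keep the smallest candidate for vertices not seen before.
    new_frontier = []
    for n, candidates in emitted.items():
        if n not in dist:
            dist[n] = min(candidates)
            new_frontier.append(n)
    return new_frontier


def shortest_hops(graph, source):
    dist = {source: 0}
    frontier = [source]
    rounds = 0
    while frontier:  # each round depends on the previous round's output
        frontier = bfs_round(graph, dist, frontier)
        rounds += 1
    return dist, rounds


dist, rounds = shortest_hops(graph, "a")
print(dist, rounds)
```

Even this five-vertex graph needs several dependent rounds, and no round can start before the previous one has finished.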

People often confuse the newer version of Hadoop (the top-level project) with MRv2, since Hadoop supports both MR1 and MRv2. The change is simply in the architecture, which can affect the performance of the application in use, especially after the map stage, where earlier MapReduce implementations often left nodes underutilized during the reduce phase. One key thing to note is that YARN is a framework in itself, and Hadoop runs MapReduce as an application on top of YARN. Hadoop thus preserves its existing API while reaping the benefits of the dynamic resource allocation done internally by YARN.
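
The underutilization during the reduce phase can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that MR1 nodes expose fixed map slots and fixed reduce slots, while YARN hands out generic containers that can host either kind of task. The cluster sizes and task counts are made up, and treating all containers as equal in size is a simplification.

```python
def slot_utilization(map_slots, reduce_slots, runnable_maps, runnable_reduces):
    """MR1-style: map tasks only fit map slots, reduce tasks only reduce slots."""
    used = min(runnable_maps, map_slots) + min(runnable_reduces, reduce_slots)
    return used / (map_slots + reduce_slots)


def container_utilization(containers, runnable_maps, runnable_reduces):
    """YARN-style (simplified): a container can host either kind of task."""
    used = min(runnable_maps + runnable_reduces, containers)
    return used / containers


# Hypothetical cluster: 8 map slots + 4 reduce slots, or 12 generic containers.
# Reduce phase: no map tasks left, 10 reduce tasks ready to run.
mr1 = slot_utilization(map_slots=8, reduce_slots=4, runnable_maps=0, runnable_reduces=10)
yarn = container_utilization(containers=12, runnable_maps=0, runnable_reduces=10)
print(f"MR1 slots: {mr1:.0%}, YARN containers: {yarn:.0%}")
```

With these made-up numbers the eight map slots sit idle once the map stage is over, whereas generic containers let the same capacity run reduce tasks.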
