Tuesday, July 30, 2013

Easier Hadoop job creation through mrjob

Today I came across mrjob, a Python package that lets you write MapReduce jobs for a Hadoop cluster in Python. This is genuinely useful: other Python-based MapReduce frameworks like octopy are easier to set up but do not scale well. mrjob was created and open-sourced by Yelp, the business review website, and you can find it on GitHub.
In a nutshell, mrjob lets you:
  • Write multi-step MapReduce jobs in pure Python
  • Test on your local machine
  • Run on a Hadoop cluster
  • Run in the cloud using Amazon Elastic MapReduce (EMR)
Thus, it allows a developer to maintain separate development, staging, and production environments.
mrjob supports Python 2.5+ and installed without a glitch on my Python 2.7 setup.
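For reference, installation is a one-liner (assuming you have pip; easy_install also works):

```shell
pip install mrjob
```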
The same Python script can be run standalone or as a regular job on a Hadoop cluster. It can define any combination of mapper, combiner, and reducer functions. For example, the sample word-count program consists of the following:

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit a (word, 1) pair for every word in the input line.
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Pre-sum counts on each mapper node to cut shuffle traffic.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Sum the per-word counts across all mappers.
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordFreqCount.run()
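To make the data flow concrete, here is a rough pure-Python simulation of the map/shuffle/reduce cycle that the job above goes through (no Hadoop or mrjob involved; the function names here are mine, not part of mrjob's API):

```python
import re
from itertools import groupby

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Same logic as MRWordFreqCount.mapper: one (word, 1) pair per word.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    # Same logic as the combiner/reducer: sum the counts for one word.
    yield word, sum(counts)


def run(lines):
    # Simulated shuffle-and-sort: sort mapper output by key,
    # group each key's values together, then reduce each group.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        out
        for word, group in groupby(pairs, key=lambda kv: kv[0])
        for out in reducer(word, (count for _, count in group))
    )


print(run(["the quick brown fox", "the lazy dog"]))
# 'the' appears twice; every other word once
```

On a real cluster the sort and grouping happen in Hadoop's shuffle phase, spread across machines; the per-word summing is exactly what the combiner and reducer above do.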

The raw input is fed line by line to the mapper() function, which emits a (word, 1) pair for each word it finds. These pairs are then passed on to the combiner and reducer, serialized as JSON between steps (other protocols, including binary ones such as pickle, are also available and can be better suited to larger jobs). These protocols are simply the serialization mechanism between the program and the Hadoop streaming back end.
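The JSON wire format is easy to picture: each key/value pair travels between steps as two JSON documents separated by a tab. A minimal sketch of that encoding using only the standard library (mrjob's real protocol classes live in mrjob.protocol; encode/decode here are my names):

```python
import json


def encode(key, value):
    # One line per pair: JSON-encoded key, a tab, JSON-encoded value.
    return "%s\t%s" % (json.dumps(key), json.dumps(value))


def decode(line):
    # Split on the first tab and parse both halves back from JSON.
    raw_key, raw_value = line.split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


line = encode("hadoop", 3)
print(line)          # "hadoop"	3
print(decode(line))  # ('hadoop', 3)
```

Swapping in a binary protocol like pickle changes only this encode/decode step, which is why mrjob can switch protocols without touching the job logic.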
The configuration file does the actual work of deciding whether the job runs as plain Python on the same machine, on a local or remote Hadoop cluster, or even on the Amazon Elastic MapReduce environment. To set this up, I simply put a .mrjob.conf file in my home folder with the following:

runners:
  hadoop:
    base_tmp_dir: /tmp/hjobs
    python_archives: &python_archives
    setup_cmds: &setup_cmds
    upload_files: &upload_files
  local:
    base_tmp_dir: /home/sumit/mrjob/tmp
    python_archives: *python_archives
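With that file in place, the runner is selected on the command line with the -r switch (the script and input names below are mine, for illustration):

```shell
# Run as plain Python in a single process (the default):
python mr_word_freq_count.py input.txt

# Run through the local runner, which mimics some Hadoop behaviour:
python mr_word_freq_count.py -r local input.txt

# Run on the configured Hadoop cluster, or in the cloud on EMR:
python mr_word_freq_count.py -r hadoop hdfs:///user/sumit/input.txt
python mr_word_freq_count.py -r emr s3://my-bucket/input.txt
```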

I ran the supplied test both against my local Python install and on my local Hadoop cluster without any problems, and I am suitably impressed with its compactness, which might come in handy for me, or you, someday!
