In short, mrjob lets you:
- Write multi-step MapReduce jobs in pure Python
- Test on your local machine
- Run on a Hadoop cluster
- Run in the cloud using Amazon Elastic MapReduce (EMR)
mrjob is available for Python 2.5+ and installed without a glitch on my Python 2.7 setup.
The Python script can be run either standalone or as a regular job on a Hadoop cluster. It can consist of any combination of mapper, combiner, and reducer functions. For example, the sample program consisted of the following:
```python
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCount.run()
```
The raw input is fed line by line to the mapper() function, which yields a (word, 1) pair for each word it finds. This data is then sent on to whichever processor is in use, serialized as JSON by default (there are other protocols as well, including pickle-based and raw ones, which can be a better fit for larger jobs). These protocols are essentially the serialization mechanism between the program and its backend.
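As a rough sketch of how that serialization choice can be made explicit: recent-enough versions of mrjob (0.3+) let a job override protocol class attributes. The INTERNAL_PROTOCOL attribute and the PickleProtocol import below are part of mrjob's API; the class name is just an illustrative variant of the sample job:

```python
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCountPickled(MRJob):

    # Ship intermediate (word, count) pairs between mapper,
    # combiner, and reducer as pickles instead of the default
    # JSON; the final job output remains JSON.
    INTERNAL_PROTOCOL = PickleProtocol

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordFreqCountPickled.run()
```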
The configuration file holds the settings for each way of running the job: pure-Python MapReduce on the local machine itself, a local/remote Hadoop cluster, or even the Amazon Elastic MapReduce environment (the runner itself is picked with the -r switch when you launch the job). To set this up, I simply put a .mrjob.conf file in my home folder.
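A minimal sketch of such a file, assuming mrjob's YAML format with per-runner sections; the Hadoop path and AWS credentials below are illustrative placeholders:

```yaml
runners:
  local:
    # pure-Python runner on this machine; no extra setup needed
  hadoop:
    # placeholder; point at the local Hadoop install
    hadoop_home: /usr/local/hadoop
  emr:
    # placeholders; real AWS credentials go here
    aws_access_key_id: YOUR_KEY_ID
    aws_secret_access_key: YOUR_SECRET_KEY
```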
I ran the supplied test both on a local install of Python and on my local Hadoop cluster without any problems, and I am suitably impressed with mrjob's compactness, which might come in handy to me or you someday!
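For reference, launching in each mode looks roughly like this (the script name is an assumption for where the sample job is saved; the -r flag is mrjob's runner switch):

```sh
# run locally in plain Python (the default; no Hadoop involved)
python word_freq_count.py input.txt

# run on the Hadoop cluster configured in ~/.mrjob.conf
python word_freq_count.py -r hadoop input.txt

# run on Amazon Elastic MapReduce
python word_freq_count.py -r emr input.txt
```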