How To Run A Java MapReduce Version 2 (MRv2) Job Using Hadoop

This tutorial covers how to compile and run a MapReduce Version 2 (MRv2) job written in Java using Hadoop. All the code and some of the commands used in this tutorial are derived from the Map-Reduce Part 2 — Developing First MapReduce Job section from the Hadoop Tutorial: Developing Big-Data Applications with Apache Hadoop by coreservlets.com.

Note: I’m running Hadoop 2.3.0 in the following Cloudera QuickStart VM: a VMware virtual machine of the Cloudera Distribution Including Apache Hadoop (CDH) version 5.0 last updated on 02 Apr 2014.  I’m running this virtual machine in the following VMware Player: VMware Player 6.0.2 build 1744117 released on 2014-04-17.  My host machine is the first version of the Microsoft Surface Pro tablet running Windows 8.1 Pro (64-bit).

Prerequisites:

  • A properly configured installation of Hadoop 2.3.0 (including all dependencies) … good luck!   ;-)

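or

  • The CDH 5.0 Cloudera QuickStart VM described in the note above, which comes with Hadoop already installed.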

Outline:

  1. (Optional) Format the Hadoop Distributed File System (HDFS).
  2. Create the folder structure for this MapReduce project on the local file system.
  3. Create the folder structure for this MapReduce project on the HDFS.
  4. Download the data to the local file system.
  5. Copy the data from the local file system to the HDFS.
  6. Write the following *.java files for this MapReduce project on the local file system: StartsWithCountMapper.java, StartsWithCountReducer.java, and StartsWithCountJob.java.
  7. Compile *.java files to *.class files on the local file system.
  8. Archive the *.class files to a *.jar on the local file system.
  9. Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
  10. Copy the results of this MapReduce project from the HDFS to the local file system.

Note: In order for you … and me (when I come back to this procedure) … to figure out where we are in the local file system, each step in the procedure starts with a call to cd.

Procedure:

  1. (Optional) Format the Hadoop Distributed File System (HDFS).

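    This step isn’t needed if we’re using the CDH 5.0 Cloudera QuickStart VM; I’ve included it in case we’re using a fresh Hadoop installation for the first time. Be careful: formatting the NameNode erases the metadata of any existing HDFS, so don’t run it on an installation that already holds data.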

    Launch the Terminal and execute the following:
    cd ~
    hdfs namenode -format
    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/our-username

    where our-username is the username that we are currently logged in under. If we’re using the CDH 5.0 Cloudera QuickStart VM, our username is cloudera and we would execute the following:

    cd ~
    hdfs namenode -format
    hdfs dfs -mkdir /user
    hdfs dfs -mkdir /user/cloudera

    If any of these folders already exist, we’ll receive the following type of error:

    mkdir: `/user': File exists
    mkdir: `/user/cloudera': File exists
  2. Create the folder structure for this MapReduce project on the local file system.

    We’ll need somewhere to put our input data files, our Java MapReduce code, and the eventual results (i.e. our output data files), so let’s create a folder named StartsWithCount for this project under the current user’s home directory:
    cd ~
    mkdir ~/StartsWithCount
  3. Create the folder structure for this MapReduce project on the HDFS.

    Similar to the step above, we’ll need a place to put our input data and the eventual results (i.e. our output data files) in the HDFS, so let’s create a folder named StartsWithCount and nest a folder inside called input … the output folder will be created by the MapReduce job itself in a later step. Since these paths are relative, the HDFS resolves them against our HDFS home directory (e.g. /user/cloudera):
    cd ~/StartsWithCount
    hdfs dfs -mkdir StartsWithCount
    hdfs dfs -mkdir StartsWithCount/input
    hdfs dfs -rm -r StartsWithCount/output

    Note: Upon the successful completion of a MapReduce job, results are written to the output folder in a file named part-r-00000. If that output folder is still around the next time we run our MapReduce job, Hadoop won’t overwrite it … instead, our MapReduce job will fail right away with a FileAlreadyExistsException … and we will be sad that we wasted our time.   :-(   Therefore, the last command above (hdfs dfs -rm -r StartsWithCount/output) recursively removes the StartsWithCount/output folder in the HDFS and everything in it … including any files that might be named part-r-00000.   ;-)   If the output folder doesn’t exist yet (e.g. on our first run), that command just reports that there is no such file or directory, which we can ignore.

  4. Download the data to the local file system.

    Use wget to download the text file pg1524.txt from Project Gutenberg. pg1524.txt is Shakespeare’s Hamlet … and will be our input data file for the rest of this Hadoop tutorial. Since pg1524.txt isn’t very descriptive, let’s rename pg1524.txt to hamlet.txt:
    cd ~/StartsWithCount
    wget http://www.gutenberg.org/cache/epub/1524/pg1524.txt
    mv pg1524.txt hamlet.txt
  5. Copy the data from the local file system to the HDFS.

    Upload (put) hamlet.txt to the HDFS. Since hamlet.txt is our input data file, we’ll put that text file in the StartsWithCount/input folder:
    cd ~/StartsWithCount
    hdfs dfs -put hamlet.txt StartsWithCount/input

Under Construction – Begin

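  6. Write the following *.java files for this MapReduce project on the local file system: StartsWithCountMapper.java, StartsWithCountReducer.java, and StartsWithCountJob.java.

    Rather than typing the three files by hand, we can download them into a folder structure that matches their Java package (mr.wordcount):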

    cd ~/StartsWithCount
    mkdir ~/StartsWithCount/mr
    mkdir ~/StartsWithCount/mr/wordcount
    cd ~/StartsWithCount/mr/wordcount
    wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountMapper.java
    wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountReducer.java
    wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountJob.java

    Mapper:

    package mr.wordcount;
    
    import java.io.IOException;
    import java.util.StringTokenizer;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class StartsWithCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    	private final static IntWritable countOne = new IntWritable(1);
    	private final Text reusableText = new Text();
    	@Override	
    	protected void map(LongWritable key, Text value, Context context)
    			throws IOException, InterruptedException {
    		
    		StringTokenizer tokenizer = new StringTokenizer(value.toString());
    		while (tokenizer.hasMoreTokens()) {
    			reusableText.set(tokenizer.nextToken().substring(0, 1));
    			context.write(reusableText, countOne);
    		}
    	}
    }
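
To see what the mapper emits without starting a Hadoop job, here is a minimal plain-Java sketch of the same tokenizing logic. The class FirstLetterMapSketch and its method emittedKeys are hypothetical names made up for this illustration; they are not part of the downloaded code, and the sketch leaves out every Hadoop type (Text, IntWritable, Context).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper (not part of the tutorial's code): mirrors the
// tokenizing logic of StartsWithCountMapper.map() without any Hadoop types.
class FirstLetterMapSketch {

    // Returns the keys the mapper would emit for one input line.
    // In the real mapper each key is written together with the constant count 1.
    static List<String> emittedKeys(String line) {
        List<String> keys = new ArrayList<>();
        // StringTokenizer splits on whitespace by default and never returns
        // an empty token, so substring(0, 1) cannot fail here.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            keys.add(tokenizer.nextToken().substring(0, 1));
        }
        return keys;
    }

    public static void main(String[] args) {
        // prints [T, b, o, n, t, b, t, i, t, q]
        System.out.println(emittedKeys("To be, or not to be: that is the question"));
    }
}
```

Two things stand out in this sketch: the keys are case-sensitive (T and t are counted separately), and a token that starts with punctuation (an opening quote or bracket, say) contributes that punctuation character as its key.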

    Reducer:

    package mr.wordcount;
    
    import java.io.IOException;
    
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.log4j.Logger;
    
    public class StartsWithCountReducer extends
    		Reducer<Text, IntWritable, Text, IntWritable> {
    	private static final Logger log = Logger.getLogger(StartsWithCountReducer.class);
    	@Override
    	protected void reduce(Text token, Iterable<IntWritable> counts,
    			Context context) throws IOException, InterruptedException {
    		int sum = 0;
    		
    		for (IntWritable count : counts) {
    			sum+= count.get();
    		}
    		context.write(token, new IntWritable(sum));
    	}
    }
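
The reducer only makes sense together with the grouping that Hadoop performs between the map and reduce phases, so here is a rough plain-Java sketch of both. Everything in it is an assumption for illustration: the class ShuffleAndReduceSketch and its methods are hypothetical, a TreeMap stands in for the shuffle, and the input list is the set of keys a mapper would emit for the line "To be, or not to be: that is the question".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration only; it is not part of the tutorial's code.
class ShuffleAndReduceSketch {

    // Stand-in for the shuffle: store a 1 under its key for every emitted key.
    // The TreeMap keeps keys sorted, similar to how a reducer receives sorted keys.
    static Map<String, List<Integer>> groupByKey(List<String> emittedKeys) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : emittedKeys) {
            List<Integer> counts = grouped.get(key);
            if (counts == null) {
                counts = new ArrayList<>();
                grouped.put(key, counts);
            }
            counts.add(1);
        }
        return grouped;
    }

    // Same arithmetic as StartsWithCountReducer.reduce(): add up the counts of one key.
    static int reduce(Iterable<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Runs the stand-in shuffle, then calls reduce() once per key.
    static Map<String, Integer> countByFirstLetter(List<String> emittedKeys) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : groupByKey(emittedKeys).entrySet()) {
            totals.put(entry.getKey(), reduce(entry.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("T", "b", "o", "n", "t", "b", "t", "i", "t", "q");
        // prints {T=1, b=2, i=1, n=1, o=1, q=1, t=3}
        System.out.println(countByFirstLetter(keys));
    }
}
```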

    MapReduce Job:

    package mr.wordcount;
    
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    public class StartsWithCountJob extends Configured implements Tool{
    
    	@Override
    	public int run(String[] args) throws Exception {
    		Job job = Job.getInstance(getConf(), "StartsWithCount");		
    		job.setJarByClass(getClass());
    
    		// configure input source
    		TextInputFormat.addInputPath(job, new Path(args[0]));
    		job.setInputFormatClass(TextInputFormat.class);
    		
    		// configure mapper and reducer
    		job.setMapperClass(StartsWithCountMapper.class);
    		job.setCombinerClass(StartsWithCountReducer.class);
    		job.setReducerClass(StartsWithCountReducer.class);
    
    		// configure output
    		TextOutputFormat.setOutputPath(job, new Path(args[1]));
    		job.setOutputFormatClass(TextOutputFormat.class);
    		job.setOutputKeyClass(Text.class);
    		job.setOutputValueClass(IntWritable.class);
    
    		return job.waitForCompletion(true) ? 0 : 1;
    	}
    	public static void main(String[] args) throws Exception {
    		int exitCode = ToolRunner.run(new StartsWithCountJob(), args);
    		System.exit(exitCode);
    	}
    }
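
The job above registers StartsWithCountReducer as the combiner as well as the reducer. That works here because addition gives the same total whether the 1s are summed all at once or first summed per map task and then summed again; Hadoop may run a combiner zero, one, or several times, so the result must not depend on it. A minimal plain-Java sketch of that reasoning follows; the class CombinerSketch, its methods, and the sample counts are hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class for illustration only; it is not part of the tutorial's code.
class CombinerSketch {

    // Same arithmetic as StartsWithCountReducer.reduce().
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int count : counts) {
            total += count;
        }
        return total;
    }

    // No combiner: the reducer receives every single 1 from every map task.
    static int withoutCombiner(List<List<Integer>> countsPerMapTask) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> counts : countsPerMapTask) {
            all.addAll(counts);
        }
        return sum(all);
    }

    // With combiner: each map task's counts are pre-summed, then the
    // reducer sums the partial sums.
    static int withCombiner(List<List<Integer>> countsPerMapTask) {
        List<Integer> partialSums = new ArrayList<>();
        for (List<Integer> counts : countsPerMapTask) {
            partialSums.add(sum(counts));
        }
        return sum(partialSums);
    }

    public static void main(String[] args) {
        // Made-up counts for one key, as emitted by two separate map tasks.
        List<List<Integer>> perTask = Arrays.asList(
                Arrays.asList(1, 1, 1),
                Arrays.asList(1, 1));
        // prints 5 == 5
        System.out.println(withoutCombiner(perTask) + " == " + withCombiner(perTask));
    }
}
```

A reducer that computed, say, an average instead of a sum could not be reused as a combiner this way, because an average of partial averages is generally not the overall average.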
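  7. Compile *.java files to *.class files on the local file system.

    Compile the three *.java files against the Hadoop libraries. The yarn classpath command prints the classpath of the Hadoop installation, and the backticks substitute its output into the javac command: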
    cd ~/StartsWithCount
    javac -classpath `yarn classpath` mr/wordcount/*.java
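  8. Archive the *.class files to a *.jar on the local file system.

    Package the compiled classes into a jar named HadoopSamples.jar. Running jar from ~/StartsWithCount keeps the mr/wordcount path inside the archive, which must match the mr.wordcount package name: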
    cd ~/StartsWithCount
    jar cvf HadoopSamples.jar mr/wordcount/*.class
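  9. Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.

    Submit the job with yarn jar, passing the jar, the fully qualified name of the job class, and the HDFS input and output folders (these become args[0] and args[1] in StartsWithCountJob):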
    cd ~/StartsWithCount
    yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob StartsWithCount/input StartsWithCount/output

Under Construction – End

  10. Copy the results of this MapReduce project from the HDFS to the local file system.

    Download (get) the resulting output file (part-r-00000) of our successful MapReduce job from the HDFS and display that downloaded file using cat:
    cd ~/StartsWithCount
    hdfs dfs -get StartsWithCount/output/part-r-00000
    cat part-r-00000

    Alternatively, we can leave the resulting output file where it is on the HDFS and display it from there using hdfs dfs -cat:

    cd ~/StartsWithCount
    hdfs dfs -cat StartsWithCount/output/part-r-00000
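
Since the job uses TextOutputFormat, each line of part-r-00000 holds a key and its count separated by a tab. If we want to work with the downloaded results in Java, a minimal sketch could look like the following. The class PartFileSketch and its methods are hypothetical names, and the sample contents in main are made up for illustration; they are not real results from hamlet.txt.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class for illustration only; it is not part of the tutorial's code.
class PartFileSketch {

    // Parses "key<TAB>count" lines into a sorted map; lines without a tab are skipped.
    static Map<String, Integer> parse(String contents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : contents.split("\n")) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // blank or malformed line
            }
            String key = line.substring(0, tab);
            // trim() also removes a trailing carriage return, if any
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(key, count);
        }
        return counts;
    }

    // Returns the key with the highest count, or null for an empty map.
    // On a tie, the key that sorts first wins.
    static String mostFrequent(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample in the same layout as part-r-00000 (not real Hamlet counts).
        String sample = "T\t3\nb\t2\nt\t7\n";
        Map<String, Integer> counts = parse(sample);
        System.out.println(counts);               // prints {T=3, b=2, t=7}
        System.out.println(mostFrequent(counts)); // prints t
    }
}
```

To run this against the real file, we could read part-r-00000 into a String first (for example with java.nio.file.Files.readAllBytes) and pass it to parse.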
