How To Run A Java MapReduce Version 1 (MRv1) Job Using Hadoop On Windows

This tutorial covers how to compile and run the MaxTemperature example covered in Chapter 2 (MapReduce) of Hadoop: The Definitive Guide, 3rd Edition, using the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).

Prerequisites:

  • A supported Windows operating system:
    • Windows 8,
    • Windows 7,
    • Windows Vista SP2,
    • Windows XP SP3+,
    • Windows Server 2003 SP2+,
    • Windows Server 2008,
    • Windows Server 2008 R2, or
    • Windows Server 2012
  • An Internet connection.
  • Administrator privileges.
  • 7-Zip or gzip -d … or some other way of decompressing *.gz files.

Outline:

  1. Install the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).
  2. (Optional) Format the Hadoop Distributed File System (HDFS).
  3. Create the folder structure for this MapReduce project on the local file system.
  4. Create the folder structure for this MapReduce project on the HDFS.
  5. Download the data to the local file system.
  6. Copy the data from the local file system to the HDFS.
  7. Write the *.java files for this MapReduce project on the local file system: MaxTemperatureMapper.java, MaxTemperatureReducer.java, and MaxTemperature.java.
  8. Compile *.java files to *.class files on the local file system.
  9. Archive the *.class files to a *.jar on the local file system.
  10. Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
  11. Copy the results of this MapReduce project from the HDFS to the local file system.

Procedure:

  1. Install the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).

    Follow the instructions at http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-emulator/#install to install Apache Hadoop in a single-node cluster deployment using the Hortonworks Data Platform (HDP) for Windows.

    Note: The Microsoft HDInsight Emulator will be installed using the Microsoft Web Platform Installer (Web PI) launched from http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT.

    After installing, you will have the following:
    • Apache Hadoop 1.0.3
    • Apache Pig 0.9.3
    • Apache HCatalog 0.4.1
    • Apache Templeton 0.1.4
    • Apache Hive 0.9.0
    • Apache Sqoop 1.4.2
    • Apache Oozie 3.2.0


    These versions are a few years old; however, they’re good enough to get Hadoop up and running on Windows with minimal effort.

    The Microsoft HDInsight Emulator for Windows Azure makes the following system modifications:

    • Installs the following:
      • Python 2.7.3 (32-bit)
      • Hortonworks Data Platform for Windows
      • Microsoft HDInsight Emulator for Windows Azure
    • Creates a new Local Group named “HadoopUsers”
    • Creates a new Local User named “hadoop” that is a member of the following Local Groups: HadoopUsers and HomeUsers
    • Creates the following services and automatically starts these services under the context of the new Local User hadoop:
      • Apache Hadoop datanode
      • Apache Hadoop derbyserver
      • Apache Hadoop historyserver
      • Apache Hadoop hiveserver
      • Apache Hadoop hiveserver2
      • Apache Hadoop hwi
      • Apache Hadoop jobtracker
      • Apache Hadoop metastore
      • Apache Hadoop namenode
      • Apache Hadoop oozieservice
      • Apache Hadoop secondarynamenode
      • Apache Hadoop tasktracker
      • Apache Hadoop templeton



  2. (Optional) Format the Hadoop Distributed File System (HDFS).

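    This step isn’t needed if you have installed the Microsoft HDInsight Emulator for Windows Azure; however, I’ve included it in case you’re reusing these instructions on Linux … as I do. Be aware that formatting the NameNode erases any existing HDFS metadata, so don’t run it against a file system that already holds data you want to keep.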

    Launch the Hadoop Command Line shortcut and execute the following:
    hadoop namenode -format
    hadoop fs -mkdir /user
    hadoop fs -mkdir /user/your-username

    If any of these folders already exist, then you will receive the following type of error:

    mkdir: cannot create directory /user: File exists
  3. Create the folder structure for this MapReduce project on the local file system.

    Create a folder on the C drive called “Temp” ( i.e. C:\Temp\ ).

    Inside this folder, create another folder called “MaxTemp” ( i.e. C:\Temp\MaxTemp\ ).

  4. Create the folder structure for this MapReduce project on the HDFS.

    Using the Hadoop Command Line, execute the following:
    hadoop fs -mkdir MaxTemp
    hadoop fs -mkdir MaxTemp/input

    In HDFS, this will make a folder named “MaxTemp” under the /user/your-username/ folder ( i.e. /user/your-username/MaxTemp/ ). Then, it will make a folder named “input” under the /user/your-username/MaxTemp/ folder ( i.e. /user/your-username/MaxTemp/input/ ).
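    The way a relative HDFS path ends up under your home folder can be pictured with a few lines of plain Java. This is only an illustrative sketch, not Hadoop's own code: the class HdfsPathSketch and its resolve method are hypothetical names, and it assumes the home folder is /user/your-username as described above.

```java
// Illustrative sketch only: mimics how a relative HDFS path is interpreted
// relative to the user's home folder. Not part of Hadoop.
class HdfsPathSketch {

    // Assumption: the HDFS home folder is /user/<username>, as in the text above.
    static String resolve(String username, String path) {
        if (path.startsWith("/")) {
            return path; // absolute paths are used as given
        }
        return "/user/" + username + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("your-username", "MaxTemp"));
        System.out.println(resolve("your-username", "MaxTemp/input"));
        System.out.println(resolve("your-username", "/user"));
    }
}
```

    The same rule applies to the relative paths used in the later steps, such as MaxTemp/input and MaxTemp/output.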

  5. Download the data to the local file system.

    In C:\Temp\MaxTemp\, download the compressed (*.gz) weather data files for 1901 and 1902.


    Extract each *.gz file to C:\Temp\MaxTemp\ so that you have the following:

    • C:\Temp\MaxTemp\1901
    • C:\Temp\MaxTemp\1902


    If you’re using 7-Zip, choose “Extract Here” from the 7-Zip contextual menu on each *.gz file.

  6. Copy the data from the local file system to the HDFS.

    Using the Hadoop Command Line, execute the following:
    hadoop fs -copyFromLocal C:\Temp\MaxTemp\1901 MaxTemp/input
    hadoop fs -copyFromLocal C:\Temp\MaxTemp\1902 MaxTemp/input
    
  7. Write the following *.java files for this MapReduce project on the local file system:


    In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperatureMapper.java ( i.e. C:\Temp\MaxTemp\MaxTemperatureMapper.java ):

    import java.io.IOException;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    
    public class MaxTemperatureMapper extends Mapper<
    						/* input key type:    */	LongWritable,
    						/* input value type:  */	Text,
    						/* output key type:   */	Text,
    						/* output value type: */	IntWritable
    						> {
    	
    	private static final int MISSING = 9999;
    	
    	@Override
    	public void map ( /* input key: */ LongWritable key, /* input value: */ Text value, Context context ) throws IOException, InterruptedException {
    		String line = value.toString();
    		String year = line.substring( 15, 19 );
    		int airTemperature;
    		if ( line.charAt( 87 ) == '+' ) { // parseInt doesn't like leading plus signs
    			airTemperature = Integer.parseInt( line.substring( 88, 92 ) );
    		} else {
    			airTemperature = Integer.parseInt( line.substring( 87, 92 ) );
    		}
    		String quality = line.substring( 92, 93 );
    		if ( airTemperature != MISSING && quality.matches( "[01459]" ) ) {
    			context.write(
    				/* output key:   */	new Text( year ),
    				/* output value: */	new IntWritable( airTemperature )
    			);
    		}
    	}
    	
    }
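    If you want to check the fixed-width offsets without starting a Hadoop job, the parsing can be tried in plain Java. The sketch below is an assumption-laden illustration: NcdcRecordSketch and its methods are hypothetical names, and syntheticRecord builds a made-up line that contains only the three fields the mapper reads (year at zero-based offsets 15-18, signed temperature at 87-91, quality code at 92), not a real NCDC record.

```java
// Plain-Java sketch of the mapper's fixed-width parsing; no Hadoop classes needed.
class NcdcRecordSketch {

    static final int MISSING = 9999;

    // Offsets 15-18 hold the four-digit year.
    static String parseYear(String line) {
        return line.substring(15, 19);
    }

    // Offsets 87-91 hold a sign followed by four digits.
    static int parseTemperature(String line) {
        // Skip a leading '+', as the mapper does.
        int start = (line.charAt(87) == '+') ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    // Offset 92 is the quality code; the mapper accepts 0, 1, 4, 5, and 9.
    static boolean isValid(String line) {
        return parseTemperature(line) != MISSING
                && line.substring(92, 93).matches("[01459]");
    }

    // Builds a made-up 93-character record: '0' filler everywhere except
    // the three fields read above. This is NOT real NCDC data.
    static String syntheticRecord(String year, String signedTemp, char quality) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, year);
        sb.replace(87, 92, signedTemp);
        sb.setCharAt(92, quality);
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = syntheticRecord("1901", "-0078", '1');
        System.out.println(parseYear(line) + "\t" + parseTemperature(line) + "\t" + isValid(line));
    }
}
```

    A record with a temperature of +9999 or a quality code outside 0, 1, 4, 5, and 9 is reported as invalid, which is the case the mapper skips.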

    In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperatureReducer.java ( i.e. C:\Temp\MaxTemp\MaxTemperatureReducer.java ):

    import java.io.IOException;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    
    public class MaxTemperatureReducer extends Reducer<
    						/* input key type:    */	Text,
    						/* input value type:  */	IntWritable,
    						/* output key type:   */	Text,
    						/* output value type: */	IntWritable
    						> {
    	
    	@Override
    	public void reduce ( /* input key: */ Text key, /* input values: */ Iterable< IntWritable > values, Context context ) throws IOException, InterruptedException {
    		int maxValue = Integer.MIN_VALUE;
    		for ( IntWritable value : values ) {
    			maxValue = Math.max( maxValue, value.get() );
    		}
    		context.write(
    			/* output key:   */	key,
    			/* output value: */	new IntWritable( maxValue )
    		);
    	}
    	
    }
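    The reducer's logic can also be tried outside Hadoop. The following is a minimal sketch, assuming the framework has already grouped the mapper's output by year; MaxByYearSketch is a hypothetical name, the TreeMap merely stands in for that grouping, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the reduce step; no Hadoop classes needed.
class MaxByYearSketch {

    // Mirrors the reducer body: start at Integer.MIN_VALUE and keep the larger value.
    static int max(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    // Applies max() to each group; the input map stands in for the grouping by key.
    static Map<String, Integer> maxPerYear(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), max(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        grouped.put("1901", Arrays.asList(-78, 0, 317));   // made-up sample values
        grouped.put("1902", Arrays.asList(244, -11, 100)); // made-up sample values
        System.out.println(maxPerYear(grouped));
    }
}
```

    Starting from Integer.MIN_VALUE means that any real reading, including a negative one, replaces the initial value on the first comparison.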

    In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperature.java ( i.e. C:\Temp\MaxTemp\MaxTemperature.java ):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    
    public class MaxTemperature {
    	
    	public static void main ( String[] args ) throws Exception {
    		
    		if ( args.length != 2 ) {
    			System.err.println( "Usage: MaxTemperature <input path> <output path>" );
    			System.exit( -1 );
    		}
    		
    		Job job = new Job();
    		job.setJarByClass( MaxTemperature.class );
    		job.setJobName( "Max temperature" );
    		
    		FileInputFormat.addInputPath( job, new Path( args[ 0 ] ) );
    		FileOutputFormat.setOutputPath( job, new Path( args[ 1 ] ) );
    		
    		job.setMapperClass( MaxTemperatureMapper.class );
    		job.setReducerClass( MaxTemperatureReducer.class );
    		
    		job.setOutputKeyClass( Text.class );
    		job.setOutputValueClass( IntWritable.class );
    		
    		System.exit( job.waitForCompletion( true ) ? 0 : 1 );
    		
    	}
    	
    }
  8. Compile *.java files to *.class files on the local file system.

    The Microsoft HDInsight Emulator installs an older version of the JDK (i.e. 1.6.0_31) in the folder C:\Hadoop\java\ … so we’ll use this JDK to compile the *.java files.

    Using the Hadoop Command Line, execute the following:
    cd C:\Temp\MaxTemp
    C:\Hadoop\java\bin\javac.exe -classpath C:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar;C:\Hadoop\hadoop-1.1.0-SNAPSHOT\lib\commons-cli-1.2.jar -d C:\Temp\MaxTemp MaxTemperatureMapper.java MaxTemperatureReducer.java MaxTemperature.java
    
  9. Archive the *.class files to a *.jar on the local file system.

    As mentioned above, the Microsoft HDInsight Emulator installs an older version of the JDK (i.e. 1.6.0_31) in the folder C:\Hadoop\java\ … so we’ll use this JDK to archive the *.class files.

    Using the Hadoop Command Line, execute the following:
    cd C:\Temp\MaxTemp
    C:\Hadoop\java\bin\jar.exe cvf MaxTemperature.jar -C C:\Temp\MaxTemp MaxTemperatureMapper.class MaxTemperatureReducer.class MaxTemperature.class
  10. Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.

    Using the Hadoop Command Line, execute the following:
    hadoop jar C:\Temp\MaxTemp\MaxTemperature.jar MaxTemperature MaxTemp/input MaxTemp/output
  11. Copy the results of this MapReduce project from the HDFS to the local file system.

    Using the Hadoop Command Line, execute the following:
    hadoop fs -copyToLocal MaxTemp/output/part-r-00000 C:\Temp\MaxTemp

    Then, open part-r-00000 in WordPad or Notepad to see the results.

    The results should be:

    1901	317
    1902	244
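    The temperatures in this data set are recorded in tenths of a degree Celsius, so 317 means 31.7 degrees. As a small sketch, the tab-separated output lines can be converted like this; ResultSketch and toCelsius are hypothetical names, and the two lines are hard-coded here rather than read from part-r-00000.

```java
// Plain-Java sketch: turn "year<TAB>tenths of a degree" lines into degrees Celsius.
class ResultSketch {

    // Returns the year field of one output line.
    static String year(String outputLine) {
        return outputLine.split("\t")[0];
    }

    // Returns the temperature field of one output line, scaled to degrees Celsius.
    static double toCelsius(String outputLine) {
        String tenths = outputLine.split("\t")[1].trim();
        return Integer.parseInt(tenths) / 10.0;
    }

    public static void main(String[] args) {
        // Hard-coded copies of the two result lines shown above.
        String[] lines = { "1901\t317", "1902\t244" };
        for (String line : lines) {
            System.out.println(year(line) + ": " + toCelsius(line) + " C");
        }
    }
}
```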

    Remember that WordPad treats “\n” as a line break, whereas Notepad recognizes only “\r\n”. Therefore, if you’re set on viewing the part-r-00000 file in Notepad, first open the file in WordPad, save and close it, and then open it in Notepad.
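    If you’d rather not round-trip through WordPad, the line endings can be rewritten with a few lines of Java. This is a hedged sketch rather than part of the original procedure: LineEndingSketch is a hypothetical name, the input and output paths are whatever you pass on the command line, and it assumes the file is small enough to read into memory and is encoded as UTF-8.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Plain-Java sketch: convert "\n" line endings to "\r\n".
class LineEndingSketch {

    // Normalizes existing "\r\n" to "\n" first so that pairs are not doubled,
    // then rewrites every "\n" as "\r\n".
    static String toCrLf(String text) {
        return text.replace("\r\n", "\n").replace("\n", "\r\n");
    }

    // Reads the whole input file (assumed UTF-8) and writes the converted text.
    static void convertFile(File in, File out) throws IOException {
        StringBuilder sb = new StringBuilder();
        Reader reader = new InputStreamReader(new FileInputStream(in), "UTF-8");
        try {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } finally {
            reader.close();
        }
        Writer writer = new OutputStreamWriter(new FileOutputStream(out), "UTF-8");
        try {
            writer.write(toCrLf(sb.toString()));
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 2) {
            convertFile(new File(args[0]), new File(args[1]));
        } else {
            // No paths given: show the conversion on the expected results.
            System.out.print(toCrLf("1901\t317\n1902\t244\n"));
        }
    }
}
```

    Run with two arguments, the first is the file to read and the second is the file to write; using a different output name leaves the original part-r-00000 untouched.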
