Monday 19 September 2016

WordCount Program in Hadoop MapReduce

Step 1:
Open the Eclipse IDE (download from http://www.eclipse.org/downloads/) and create a new project with 3 class files - WordCount.java, WordCountMapper.java and WordCountReducer.java.
Step 2:
Open WordCount.java and paste the following code.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;


public class WordCount extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // Create a JobConf object and assign a job name for identification purposes
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("WordCount");

        // Set the configuration object with the data types of the output key and value
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Provide the mapper and reducer class names
        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        // We will give 2 arguments at run time: one is the input path, the other is the output path
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);

        // The HDFS input and output directories are fetched from the command line
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // main calls the run method defined above, via ToolRunner
        int res = ToolRunner.run(new Configuration(), new WordCount(), args);
        System.exit(res);
    }
}
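The driver above uses the classic org.apache.hadoop.mapred API, which matches Hadoop 1.2.1. If you prefer the newer org.apache.hadoop.mapreduce API, the driver would look roughly like the sketch below (the class name WordCountNewApi is just for illustration, and the mapper and reducer would also need to extend the new-API Mapper and Reducer classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountNewApi {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "WordCount"); // on Hadoop 2.x, prefer Job.getInstance(conf, "WordCount")
        job.setJarByClass(WordCountNewApi.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setMapperClass(...) and job.setReducerClass(...) would take the
        // new-API versions of the mapper and reducer, not the ones in this post
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}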
Step 3:
Open WordCountMapper.java and paste the following code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Hadoop-supported data types
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // The map method tokenizes each input line and frames the initial key-value pairs;
    // the framework sorts and groups these pairs by key before the reducer runs
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        // Take one line at a time from the input file and tokenize it
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        // Iterate through all the words in the line, forming a key-value pair for each
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            // Send to the output collector, which in turn passes the pair on to the reducer
            output.collect(word, one);
        }
    }
}
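To see what the mapper emits, here is a small standalone sketch (the class name and sample line are made up for illustration) that mimics the tokenizing loop above and prints the pairs the mapper would produce:

import java.util.StringTokenizer;

public class WordCountMapperDemo {
    public static void main(String[] args) {
        // A hypothetical input line; each token is emitted with a count of 1,
        // duplicates included - summing is the reducer's job, not the mapper's
        String line = "to be or not to be";
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken() + "\t1");
        }
        // Prints one pair per line: to 1, be 1, or 1, not 1, to 1, be 1
    }
}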
Step 4:
Open WordCountReducer.java and paste the following code.


import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // The reduce method accepts the key-value pairs from the mappers, aggregates
    // the values by key, and produces the final output
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        int sum = 0;
        // Iterate through all the values available for a key, add them together,
        // and emit the key along with the sum of its values
        while (values.hasNext())
        {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
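After the shuffle phase, the framework groups all pairs with the same key and hands the reducer an iterator over that key's values. Here is a small standalone sketch of what the summing loop does for one key (the key "to" and its values are made up to match the mapper example above):

import java.util.Arrays;
import java.util.Iterator;

public class WordCountReducerDemo {
    public static void main(String[] args) {
        // Hypothetical grouped input: the word "to" appeared twice in the mapper
        // output, so the reducer receives the values [1, 1] for that key
        Iterator<Integer> values = Arrays.asList(1, 1).iterator();
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        System.out.println("to\t" + sum); // prints: to 2
    }
}

For the full sample line "to be or not to be", the job's final output would be: be 2, not 1, or 1, to 2 (keys come out sorted).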

Step 5:

Now you need to resolve the dependencies by adding the jar files from the Hadoop installation folder. Click on the Project tab and go to Properties. Under the Libraries tab, click Add External JARs and select all the jars in the /home/radha/radha/hadoop-1.2.1/common and /home/radha/radha/hadoop-1.2.1/share/hadoop/mapreduce folders (click on the 1st jar, press Shift, and click on the last jar to select all the jars in between), then click OK.

Step 6:

Now click on the Run tab and select Run Configurations. Click the New Configuration button on the top-left, fill in the following properties, and click Apply.

     Name - any name will do - e.g. WordCountConfig
     Project - browse and select your project
     Main Class - select WordCount.java - this is our main class

Step 7:

Now click on the File tab and select Export. Under Java, select Runnable JAR file.

     In Launch configuration, select the config file you created in Step 6 (WordCountConfig).
     Select an export destination (let's say the desktop).
     Under Library handling, select Extract required libraries into generated JAR and click Finish.

Right-click the jar file, go to Properties, and under the Permissions tab check Allow executing file as a program, then give Read and Write access to all the users.
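You can then run the exported jar on your cluster (the HDFS paths below are just placeholders - the input directory must already exist in HDFS, and the output directory must not, or the job will fail):

hadoop jar WordCount.jar /user/radha/input /user/radha/output

The two arguments become args[0] and args[1] in the run() method of WordCount.java.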
