Monday 19 September 2016

WordCount MapReduce Program




Step 1:
Open the Eclipse IDE ( download from http://www.eclipse.org/downloads/ ) and create a new Java project with 3 class files - WordCount.java , WordCountMapper.java and WordCountReducer.java
Step 2:
Open WordCount.java and paste the following code.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;


public class WordCount extends Configured implements Tool{
      public int run(String[] args) throws Exception
      {
            //creating a JobConf object and assigning a job name for identification purposes
            JobConf conf = new JobConf(getConf(), WordCount.class);
            conf.setJobName("WordCount");

            //Setting configuration object with the Data Type of output Key and Value
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            //Providing the mapper and reducer class names
            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            //We will pass 2 arguments at run time: one is the input path and the other is the output path
            Path inp = new Path(args[0]);
            Path out = new Path(args[1]);
            //the hdfs input and output directory to be fetched from the command line
            FileInputFormat.addInputPath(conf, inp);
            FileOutputFormat.setOutputPath(conf, out);

            JobClient.runJob(conf);
            return 0;
      }
   
      public static void main(String[] args) throws Exception
      {
            // this main function will call run method defined above.
        int res = ToolRunner.run(new Configuration(), new WordCount(),args);
            System.exit(res);
      }
}
Step 3:
 Open WordCountMapper.java and paste the following code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
      //hadoop supported data types
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
   
      //map method that performs the tokenizing job and frames the initial key-value pairs
      // the framework then groups these pairs by key and passes each group to the reducer
      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            //taking one line at a time from input file and tokenizing the same
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
       
          //iterating through all the words available in that line and forming the key value pair
            while (tokenizer.hasMoreTokens())
            {
               word.set(tokenizer.nextToken());
               //sending to the output collector, which in turn passes the same to the reducer
                 output.collect(word, one);
            }
       }
}
Step 4:
 Open WordCountReducer.java and paste the following code.


import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
      //reduce method accepts the key-value pairs from the mappers, does the aggregation based on keys and produces the final output
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
      {
            int sum = 0;
            /*iterates through all the values available for a key, adds them together and emits
            the key together with the sum of its values*/
          while (values.hasNext())
          {
               sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
      }
}
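Before running this on a cluster, the same map-shuffle-reduce flow can be sketched with ordinary shell tools (the two input lines below are made-up sample data, not from any real file): the map step emits one word per line, sort plays the role of the shuffle by grouping identical keys, and uniq -c performs the per-key sum that WordCountReducer does.

```shell
# map: split each line into words, one per line
# shuffle: sort brings identical words next to each other
# reduce: uniq -c counts each group, like the reducer's sum
printf 'hello hadoop\nhello world\n' |
  tr -s ' ' '\n' |
  sort |
  uniq -c
```

This is only an analogy for the data flow; on a real cluster, Hadoop performs the shuffle between the map and reduce tasks and runs many mappers and reducers in parallel.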

Step 5:
Now you need to resolve the dependencies by adding jar files from the Hadoop
source folder. Click on the Project tab and go to Properties. Under the
Libraries tab, click Add External JARs and select all the jars in the
/home/radha/radha/hadoop-1.2.1/common and /home/radha/radha/hadoop-1.2.1/share/hadoop/mapreduce folders
(click on the 1st jar, press Shift and click on the last jar
to select all jars in between, then click OK).

Step 6:


Now click on
the Run tab and click Run Configurations. Click the New Configuration button on
the left-top side and click Apply after filling in the following properties.
     Name - any name will do - Ex: WordCountConfig

     Project - browse and select your project

     Main Class - select WordCount - this is our main class

     Step 7:


 Now click on the File tab and select
Export. Under Java, select Runnable JAR file.

     In Launch configuration, select the config
     file you created in Step 6 (WordCountConfig).

     Select an export destination
     (let's say the Desktop).

     Under Library handling,
     select Extract required libraries into generated JAR and click Finish.

     Right-click the jar file, go
     to Properties, and under the Permissions tab check Allow executing file
     as a program, and give read and write access to
     all users.

Wednesday 14 September 2016

FLUME INSTALLATION


  My configuration is Apache Flume 1.6.0 on a machine with Ubuntu 14.04 and Apache Hadoop 1.2.1, and the location of Hadoop is /usr/local/Hadoop

Step 1: Download the latest Flume release from the Apache website.

Step 2: Move/Copy it to the location you want to install, in my case it is “/usr/local/”.
              1. $ cd Downloads/
              2. $ sudo cp apache-flume-1.6.0-bin.tar.gz /usr/local/

Step 3: Extract the tar file. Go to the folder you copied it to, in my case “/usr/local/”, and run the below commands
          $ cd /usr/local
          $ sudo tar -xzvf apache-flume-1.6.0-bin.tar.gz

Step 4: Rename folder from “apache-flume-1.6.0-bin” to “flume” for simplicity.
          $ sudo mv apache-flume-1.6.0-bin flume

Step 5:  Update the environment variables
           $ gedit ~/.bashrc
       Add Below Lines:
              export FLUME_HOME=/usr/local/flume
              export FLUME_CONF_DIR=$FLUME_HOME/conf
              export FLUME_CLASSPATH=$FLUME_CONF_DIR
              export PATH=$PATH:$FLUME_HOME/bin 

Step 6: Change owner to user and group, in my case it is user:hduser and group:hadoop
           $ sudo chown -R hduser:hadoop /usr/local/flume

Step 7: Rename “flume-env.sh.template” to “flume-env.sh” and write the below values.
       $ sudo mv /usr/local/flume/conf/flume-env.sh.template /usr/local/flume/conf/flume-env.sh
        $ gedit /usr/local/flume/conf/flume-env.sh

        Add the below lines (please use your installed Java version):
           JAVA_OPTS="-Xms500m -Xmx1000m -Dcom.sun.management.jmxremote"
            export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
      
  How to check your Java installation path:
     $ echo $JAVA_HOME

Step 8: Log in as hduser and run the Flume CLI: 
            $ flume-ng --help 

Saturday 10 September 2016

PIG LATIN INSTALLATION



1. Download a Pig release from the following link
http://mirror.wanxp.id/apache/pig/

2. Enter the directory where the stable version was downloaded. By default it downloads into the Downloads directory.
$ cd Downloads/

3. Extract the tar file.
$ tar -xvf pig-0.16.0.tar.gz

4. Create a directory for Pig.
$ sudo mkdir /usr/lib/pig

5. Move pig-0.16.0 to /usr/lib/pig.
$ sudo mv pig-0.16.0 /usr/lib/pig

6. Set the PIG_HOME path in the .bashrc file.
$ gedit ~/.bashrc

In the .bashrc file, append the below 2 statements:
export PIG_HOME=/usr/lib/pig/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin

Save and exit, then reload it:
          $ source ~/.bashrc

7. Now let's test the installation. At the command prompt, type:
$ pig -h

8. Start Pig in MapReduce mode:


$ pig

Thursday 8 September 2016

HIVE INSTALLATION




Installation of Hive 1.2.1 on Ubuntu 14.04 and Hadoop 2.6.0

1. Download Hive from the below path:
                                                 
2. Extract the .tar.gz file in Downloads/ and rename the extracted folder to hive/, then move it to the /usr/lib/hive path:
sudo mv Downloads/hive /usr/lib/hive

3. Provide access to the hive path by changing the owner and group.
sudo chown -R hduser:hadoop /usr/lib/hive

4. Configure environment variables in .bashrc file.
hduser(manju)$ vim ~/.bashrc
export HIVE_HOME=/usr/lib/hive/
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_USER_CLASSPATH_FIRST=true

5. Apply the changes to bashrc file.
hduser(manju)$ source ~/.bashrc

6. Create folders for Hive in HDFS
hadoop fs -mkdir -p /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

7. Configuring Hive
To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is
placed in the $HIVE_HOME/conf directory. The following commands redirect to
Hive config folder and copy the template file:
hduser$ cd $HIVE_HOME/conf
hduser$ cp hive-env.sh.template hive-env.sh

Edit the hive-env.sh file by appending the following line:
export HADOOP_HOME=/usr/local/hadoop

8. Run Hive
hive

Flume Installation and Streaming Twitter Data Using Flume





The Flume service is used for efficiently collecting, aggregating and moving large amounts of log data.
Flume helps Hadoop ingest data from live streams.


  

Below are the main components in this scenario:

  • Event – a singular unit of data that is transported by Flume (typically a single log entry). 
  • Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.

  • Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS. 

  • Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel. 

  • Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels. 

  • Client – produces and transmits the Event to a Source operating within the Agent.







      A flow in Flume starts from the Client (Web Server). The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels allow decoupling of ingestion rate from drain rate using the familiar producer-consumer model of data exchange. When spikes in client side activity cause data to be generated faster than what the provisioned capacity on the destination can handle, the channel size increases. This allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another agent. This enables the creation of complex dataflow topologies.
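These pieces map one-to-one onto the lines of an agent's properties file. As an illustration only (the agent name a1, the netcat source and port 44444 are hypothetical examples, not part of the Twitter setup below), a minimal agent wiring one source to one sink through one channel looks like this:

```
# name the agent's source, channel and sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: turns each line received on a TCP port into an Event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: in-memory buffer decoupling ingest rate from drain rate
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# sink: drains the channel and logs each event
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Started with flume-ng agent -n a1 --conf conf -f example.conf, such an agent would log whatever you type into nc localhost 44444. The Twitter configuration in Step 15 below follows exactly the same source-channel-sink pattern.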


    How to Take Input Data from Twitter to the HDFS?
    Before we discuss how to take the input data from Twitter to HDFS, let’s look at the necessary pre-requisites:
  • Twitter account
  • Install Hadoop/Start Hadoop
Now we will install Apache Flume on our machine.
       STEP 1: Download flume:

        Command: wget http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz 
       STEP 2: Extract file from flume tar file.
        Command: tar -xvf apache-flume-1.6.0-bin.tar.gz
    STEP 3: 

    Put the apache-flume-1.6.0-bin directory inside the /home/manju/apache/ 
    directory.

    Command: sudo mv apache-flume-1.6.0-bin /home/manju/apache

    STEP 4:

    We need to remove protobuf-java-2.4.1.jar and guava-10.0.1.jar from the lib directory of apache-flume-1.6.0-bin ( when using hadoop-2.x )

    Command:
    sudo rm /home/manju/apache/apache-flume-1.6.0-bin/lib/protobuf-java-2.4.1.jar

    sudo rm /home/manju/apache/apache-flume-1.6.0-bin/lib/guava-10.0.1.jar

    STEP 5:

    We need to rename twitter4j-core-3.0.3.jar, twitter4j-media-support-3.0.3.jar and twitter4j-stream-3.0.3.jar 
    Command:
    mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-core-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-core-3.0.3.org

    mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-media-support-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-media-support-3.0.3.org

    mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-stream-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-stream-3.0.3.org

    STEP 6:

    Move the flume-sources-1.0-SNAPSHOT.jar file from Downloads directory to lib directory of apache flume:

    Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/

    STEP 7
    Copy flume-env.sh.template content to flume-env.sh

    Command: cd /home/manju/apache/apache-flume-1.6.0-bin/

      sudo cp conf/flume-env.sh.template conf/flume-env.sh

    STEP 8:

    Edit flume-env.sh and set JAVA_HOME and FLUME_CLASSPATH.

    command: sudo gedit conf/flume-env.sh





    STEP 9:

    set the FLUME_HOME path in bashrc file

    Command: sudo gedit ~/.bashrc
    example :

    export FLUME_HOME=/home/manju/apache/apache-flume-1.6.0-bin
    export PATH=$FLUME_HOME/bin:$PATH
    export FLUME_CONF_DIR=$FLUME_HOME/conf
    export FLUME_CLASSPATH=$FLUME_CONF_DIR

    Save the .bashrc file and close it, then run $ source ~/.bashrc to apply the changes.

    Now we have installed Flume on our machine. Let's run Flume to stream Twitter data onto HDFS.
    We need to create an application in Twitter and use its credentials to fetch data.
    STEP 10:

    URL: https://twitter.com/
    Enter your Twitter account credentials and sign in:

    STEP 11:
    Change the URL to https://apps.twitter.com
    Click on Create New App to create a new application and enter all the details in the application,
    Check Yes, I agree and click on Create your Twitter application:



    STEP 12:
    Your Application will be created:
      
    STEP 13:

    Click on Keys and Access Tokens, you will get Consumer Key and Consumer Secret.



    STEP 14:

    Scroll down and Click on Create my access token:
    Your Access token got created:




    Consumer Key (API Key): 4AtbrP50QnfyXE2NlYwROBpTm

    Consumer Secret (API Secret): jUpeHEZr5Df4q3dzhT2C0aR2N2vBidmV6SNlEELTBnWBMGAwp3

    Access Token: 1434925639-p3Q2i3l2WLx5DvmdnFZWlYNvGdAOdf5BrErpGKk

    Access Token Secret: AghOILIp9JJEDVFiRehJ2N7dZedB1y4cHh0MvMJN5DQu7
    STEP 15:

    Use the below link to download the flume.conf file:
    https://drive.google.com/file/d/0B-Cl0IfLnRozdlRuN3pPWEJ1RHc/view?usp=sharing

    flume.conf 
    TwitterAgent.sources= Twitter
    TwitterAgent.channels= MemChannel
    TwitterAgent.sinks=HDFS
    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.channels=MemChannel

    TwitterAgent.sources.Twitter.consumerKey=zTtjoBHD8gaDeXpuJB2pZffht
    TwitterAgent.sources.Twitter.consumerSecret=zJF28oTN0oUvCnEJrCl75muxiArqArDzOEkCWevwEQtfey999B
    TwitterAgent.sources.Twitter.accessToken=141131163-ZNKnS0LKAMuB7jVJwidwWkBe4GzOXcgxjisS1TET
    TwitterAgent.sources.Twitter.accessTokenSecret=BZj0Tdg8xF125AmX95f0g0jr9UHRxqHBImWB8RyDbPaE3

    TwitterAgent.sources.Twitter.keywords= hadoop, cricket, sports

    TwitterAgent.sinks.HDFS.channel=MemChannel
    TwitterAgent.sinks.HDFS.type=hdfs
    TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/twitter_data
    TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
    TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
    TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
    TwitterAgent.sinks.HDFS.hdfs.rollSize=0
    TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
    TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
    TwitterAgent.channels.MemChannel.type=memory
    TwitterAgent.channels.MemChannel.capacity=10000
    TwitterAgent.channels.MemChannel.transactionCapacity=100
     
    STEP 16:

    Put the flume.conf in the conf directory of apache-flume-1.6.0-bin
    Command: sudo cp ~/Downloads/flume.conf /home/manju/apache/apache-flume-1.6.0-bin/conf/

    STEP 17:

    Edit flume.conf

    Command: sudo gedit conf/flume.conf

    Very carefully replace the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) in flume.conf with the ones you received after creating the application. Everything else remains the same; save the file and close it.
      
    STEP 18:

    Change permissions for flume directory.

    Command: sudo chmod -R 755 /home/manju/apache/apache-flume-1.6.0-bin/
      
    STEP 19:

    Start fetching the data from twitter:

    Command: bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console

      Now wait for 20-30 seconds and let Flume stream the data onto HDFS; after that, press Ctrl + C to break the command and stop the streaming. (Since you are stopping the process, you may get a few exceptions; ignore them.)

    STEP 20:

    Open the Mozilla browser in your VM, and go to /twitter_data in HDFS (the path configured in TwitterAgent.sinks.HDFS.hdfs.path).

    Click on FlumeData file which got created:


     
    If you can see tweet data in the file, then the unstructured data has been streamed from Twitter onto HDFS successfully. Now you can do analytics on this Twitter data using Hive.