Apache Hadoop 2/ YARN/MR2 Installation for Beginners :
Background:
Big Data spans three dimensions: Volume, Velocity and Variety. (IBM defined 4th dimension or property of Big Data i.e Veracity). Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets (Big Data) across clusters of commodity Machines(Low-cost Servers). It is designed to scale up to thousands of machines, with a high degree of fault tolerance and software has the intelligence to detect & handle the failures at the application layer.
NOTE: More details are available@http://hadoop.apache.org/docs/stable/
• The Apache Hadoop component introduced two new terms for Hadoop 1.0 users - MapReduce2 (MR2) and YARN.
• Apache Hadoop YARN is the next-generation Hadoop framework designed to take Hadoop beyond MapReduce for data-processing- resulted in better cluster utilization that permit Hadoop to scale to accommodate more and larger jobs.
• This blog provides information for users to migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x
Steps to Install Hadoop2.0 on CentOS/RHEL6 on single node Cluster setup:
Step1: Install Java from link
Set the environment variable $JAVA_HOME properly
NOTE: Java-1.6.0-openjdk OR other Hadoop Java Versions listed in a below link are more preferable.
Step2: Download Apache Hadoop2.2 to folder /opt from link : http://hadoop.apache.org/releases.html#Download
Step 3: Add all hadoop and java environment path variables to .bashrc file.
Example :
Configure $HOME/.bashrc
- HODOOP_HOME
- JAVA_PATH
- PATH
- HADOOP_HDFS_HOME
- HADOOP_YARN_HOME
- HADOOP_MAPRED_HOME
- HADOOP_CONF_DIR
- YARN_CLASS_PATH
------------------------------------------------------------------------------------------
Step 4 : Create a separate Group for Hadoop setup
# groupadd hadoop
Step 5: Add 3 user-accounts in Group "hadoop"
# useradd -g hadoop yarn
# useradd -g hadoop hdfs
# useradd -g hadoop mapred
NOTE: Its good to run daemons with a related accounts
Step 6: Create Data Directories for namenode,datanode and secondary namenode
# mkdir -p /var/data/hadoop/hdfs/nn
# mkdir -p /var/data/hadoop/hdfs/dn
# mkdir -p /var/data/hadoop/hdfs/snn
Step 7: Set permission for "hdfs" account
# chown hdfs:hadoop /var/data/hadoop/hdfs -R
Step 8: Create Log Directories
# mkdir -p /var/log/hadoop/yarn
# mkdir logs (at installation directory Example /opt/hadoop2.2.0/logs)
Step 9: Set ownership to yarn
# chown yarn:hadoop /var/log/hadoop/yarn - R
Go to Hadoop directory "/opt/hadoop2.2.0/ "
# chmod g+w logs
# chown yarn:hadoop . -R
------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
iii) hdfs-site.xml
------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------
vi) yarn-site.xml
---------------------------------------------------------------------------------------------------------------
Step 11: Create a passwordless ssh session for "hdfs" user account :
# su - hdfs
hdfs@localhost$ ssh-keygen -t rsa
hdfs@localhost$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hdfs@localhost$ chmod 0600 ~/.ssh/authorized_keys
---------------------------------------------------------------------------------
Step 12:
Now you are allowed to login without prompting for the password :
[hdfs@localhost]$ ssh localhost
Last login: Sun Dec 29 04:31:44 2013 from localhost
[hdfs@localhost ~]$
---------------------------------------------------------------------------------------------------------------
Step 13: Format Hadoop File system :
Format the NameNode directory as the HDFS superuser ( "hdfs" user account)
#su - hdfs
$ cd /opt/hadoop2.2/bin
$./hdfs namenode -format
It should show the message : /var/data/hadoop/hdfs/nn has been successfully formated as shown below:
[hdfs@localhost bin]$ ./hdfs namenode -format
13/12/29 02:36:52 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.2.0
STARTUP_MSG: classpath = /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/jetty-6.1.26.jar:/opt/hadoop-2.2.0/share/hadoop/common/lib/commons-el-1.0.jar:
STARTUP_MSG: java = 1.7.0_45
************************************************************/
13/12/29 02:36:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
13/12/29 02:36:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Formatting using clusterid: CID-d47a364a-edc6-455f-b3c8-4d2ba54458d5
13/12/29 02:36:54 INFO namenode.HostFileManager: read includes:
HostSet(
)
13/12/29 02:36:54 INFO namenode.HostFileManager: read excludes:
HostSet(
)
13/12/29 02:36:54 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map BlocksMap
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 2.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^18 = 262144 entries
13/12/29 02:36:54 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: defaultReplication = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplication = 512
13/12/29 02:36:54 INFO blockmanagement.BlockManager: minReplication = 1
13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
13/12/29 02:36:54 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false
13/12/29 02:36:54 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
13/12/29 02:36:54 INFO blockmanagement.BlockManager: encryptDataTransfer = false
13/12/29 02:36:54 INFO namenode.FSNamesystem: fsOwner = hdfs (auth:SIMPLE)
13/12/29 02:36:54 INFO namenode.FSNamesystem: supergroup = supergroup
13/12/29 02:36:54 INFO namenode.FSNamesystem: isPermissionEnabled = true
13/12/29 02:36:54 INFO namenode.FSNamesystem: HA Enabled: false
13/12/29 02:36:54 INFO namenode.FSNamesystem: Append Enabled: true
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map INodeMap
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 1.0% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^17 = 131072 entries
13/12/29 02:36:54 INFO namenode.NameNode: Caching file names occuring more than 10 times
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
13/12/29 02:36:54 INFO util.GSet: Computing capacity for map Namenode Retry Cache
13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit
13/12/29 02:36:54 INFO util.GSet: 0.029999999329447746% max memory = 96.7 MB
13/12/29 02:36:54 INFO util.GSet: capacity = 2^12 = 4096 entries
13/12/29 02:36:55 INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.
13/12/29 02:36:56 INFO namenode.FSImage: Saving image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression
13/12/29 02:36:56 INFO namenode.FSImage: Image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.
13/12/29 02:36:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
13/12/29 02:36:56 INFO util.ExitUtil: Exiting with status 0
13/12/29 02:36:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/
[hdfs@localhost bin]$
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Step 14: Start HDFS service - Namenode Daemon process
$cd ../sbin
[hdfs@localhost bin]$ cd ../sbin/
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start namenode
starting namenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-namenode localhost.localdomain.out
Step 15: Check the status of namenode daemon
[hdfs@localhost ]$ jps
4537 Jps
4300 NameNode =====> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java
hdfs 4300 1 11 02:38 pts/1 00:00:04 /usr/java/default/bin/java -Dproc_namenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-namenode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode
_______________________________________________________________________________
Step 16 : Start HDFS service - Secondary Namenode Daemon process
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 17 : Check the status of Secondarynamenode daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4913
hdfs 4913 1 7 02:46 pts/1 00:00:04 /usr/java/default/bin/java -Dproc_secondarynamenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
_____________________________________________________________________________________________________________
Step 18: Start HDFS service - DataNode Daemon process
[hdfs@localhost sbin]$ ./hadoop-daemon.sh start datanode
starting datanode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-datanode-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 19: Check the status of Datanode daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode
4949 Jps
4373 DataNode ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4373
hdfs 4373 1 34 02:39 pts/1 00:00:06 /usr/java/default/bin/java -Dproc_datanode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-datanode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
___________________________________________________________________
Step 20:Start YARN service - resourcemanager Daemon process
[hdfs@localhost sbin]$ ./yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out
Step 21 : Check the status of ResourceManager daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4913 SecondaryNameNode
4949 Jps
4373 DataNode
4500 ResourceManager ======> started successfully
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4500
hdfs 4500 1 3 02:41 pts/1 00:00:08 /usr/java/default/bin/java -Dproc_resourcemanager -Xmx200m -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir=/opt/hadoop-2.2.0 -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -classpath /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/*:/opt/hadoop-2.2.0/share/hadoop/common/*:/opt/hadoop-2.2.0/share/hadoop/hdfs:/opt/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop-2.2.0/share/hadoop/hdfs/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/etc/hadoop//rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
__________________________________________________________________________________________________________
Step 22:Start YARN service - NodeManager Daemon process
[hdfs@localhost sbin]$ ./yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-localhost.localdomain.out
[hdfs@localhost sbin]$
Step 23 : Check the status of Nodemanager daemon
[hdfs@localhost bin]$ jps
4300 NameNode
4744 NodeManager ======> started successfully
4913 SecondaryNameNode
4949 Jps
4373 DataNode
4500 ResourceManager
[root@localhost bin]#
[hdfs@localhost sbin]$ ps -ef | grep java | grep 4744
hdfs 4744 1 2 02:42 pts/1 00:00:03 /usr/java/default/bin/java -Dproc_nodemanager -Xmx200m -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir=/opt/hadoop-2.2.0 -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -classpath /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/*:/opt/hadoop-2.2.0/share/hadoop/common/*:/opt/hadoop-2.2.0/share/hadoop/hdfs:/opt/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop-2.2.0/share/hadoop/hdfs/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/etc/hadoop//nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager ________________________________________________________
Step 24: This command gives you information on hdfs system
[hdfs@localhost bin]$ ./hadoop dfsadmin -report
Configured Capacity: 16665448448 (15.52 GB)
Present Capacity: 12396371968 (11.55 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: localhost
Decommission Status : Normal
Configured Capacity: 16665448448 (15.52 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 4269076480 (3.98 GB)
DFS Remaining: 12396347392 (11.54 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.38%
Last contact: Sun Dec 29 03:11:02 PST 2013
[hdfs@localhost bin]$
________________________________________________________
Step25: Stop all the services by running " stop-all.sh "
[hdfs@localhost sbin]$ ./stop-all.sh
This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
[hdfs@localhost sbin]$
________________________________________________________
Step 26: Start all the services by running "start-all.sh "
Added the YARN architecture block diagram to locate the presence of daemons in different components .
[hdfs@localhost sbin]$ ./start-all.sh
check the status of all services :
[hdfs@localhost sbin]$ jps
6161 NameNode
6260 DataNode
6719 NodeManager
6750 Jps
6355 SecondaryNameNode
6429 ResourceManager
[root@localhost bin]#
Job Definition and control Flow between Hadoop/Yarn components:
________________________________________________________
Step 27: Run sample application program "pi" from hadoop-mapreduce-examples-2.2.0.jar
First test with hadoop to run existing hadoop program - launch the program, monitor progress, and get/put files on the HDFS. This program calculates the value of " pi " in parallel i.e 2 maps with 10 samples:
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 2 10
[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10
Number of Maps = 2
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/12/29 04:33:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:13 INFO input.FileInputFormat: Total input paths to process : 2
13/12/29 04:33:13 INFO mapreduce.JobSubmitter: number of splits:2
13/12/29 04:33:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0001
13/12/29 04:33:15 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0001 to ResourceManager at /0.0.0.0:8032
13/12/29 04:33:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0001/
13/12/29 04:33:15 INFO mapreduce.Job: Running job: job_1388320369543_0001
13/12/29 04:33:38 INFO mapreduce.Job: Job job_1388320369543_0001 running in uber mode : false
13/12/29 04:33:38 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:35:22 INFO mapreduce.Job: map 83% reduce 0%
13/12/29 04:35:23 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:36:10 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:36:16 INFO mapreduce.Job: Job job_1388320369543_0001 completed successfully
13/12/29 04:36:16 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=50
FILE: Number of bytes written=238681
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=528
HDFS: Number of bytes written=215
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=208977
Total time spent by all reduces in occupied slots (ms)=39840
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=36
Map output materialized bytes=56
Input split bytes=292
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=56
Reduce input records=4
Reduce output records=0
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1712
CPU time spent (ms)=3320
Physical memory (bytes) snapshot=454049792
Virtual memory (bytes) snapshot=3515953152
Total committed heap usage (bytes)=268247040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=236
File Output Format Counters
Bytes Written=97
Job Finished in 184.356 seconds
Estimated value of Pi is 3.80000000000000000000
[hdfs@localhost bin]$
________________________________________________________________
Step 28 : Verify the Running Services Using the Web Interface:
Web Interface for the resource Manager can be viewed by
Shows the running application on single node cluster
Application Overview -Final Status( FINISHED)
__________________________________________________________________
Step 29 : Create a Directory on HDFS
[hdfs@localhost bin]$ ./hadoop fs -mkdir test1
-------------------------------------------------------------------------
Step 30: Put local file "hellofile" into HDFS (/test1)
[hdfs@localhost bin]$ ./hadoop fs -put hellofile /test1
-------------------------------------------------------------------------
Step 31: Check the input file "hellofile" on HDFS
[hdfs@localhost bin]$ ./hadoop fs -ls /test1
Found 1 items
-rw-r--r-- 1 hdfs supergroup 113 2013-12-29 04:56 /test1/hellofile
[hdfs@localhost bin]$
___________________________________________________________
Step 32: Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar
WordCount Example:
WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab.Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
To run the example, the command syntax is
bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the command line above) are read and the counts of words in the input are written to the output directory (called out-dir above).It is assumed that both inputs and outputs are stored in HDFS.If your input is not already in HDFS, but is rather in a local file system somewhere, you need to copy the data into HDFS as shown in above steps 29-31.
NOTE: Similarly you could think of processing bigger Data Files ( Weather data , Healthcare data, Machine Log data ...etc).
[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /test1/hellofile /test1/output
13/12/29 04:57:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
13/12/29 04:57:53 INFO input.FileInputFormat: Total input paths to process : 1
13/12/29 04:57:53 INFO mapreduce.JobSubmitter: number of splits:1
13/12/29 04:57:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0002
13/12/29 04:57:55 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0002 to ResourceManager at /0.0.0.0:8032
13/12/29 04:57:55 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0002/
13/12/29 04:57:55 INFO mapreduce.Job: Running job: job_1388320369543_0002
13/12/29 04:58:06 INFO mapreduce.Job: Job job_1388320369543_0002 running in uber mode : false
13/12/29 04:58:06 INFO mapreduce.Job: map 0% reduce 0%
13/12/29 04:58:17 INFO mapreduce.Job: map 100% reduce 0%
13/12/29 04:58:41 INFO mapreduce.Job: map 100% reduce 100%
13/12/29 04:58:42 INFO mapreduce.Job: Job job_1388320369543_0002 completed successfully
13/12/29 04:58:42 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=152
FILE: Number of bytes written=158589
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=215
HDFS: Number of bytes written=94
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=9934
Total time spent by all reduces in occupied slots (ms)=19948
Map-Reduce Framework
Map input records=4
Map output records=21
Map output bytes=194
Map output materialized bytes=152
Input split bytes=102
Combine input records=21
Combine output records=13
Reduce input groups=13
Reduce shuffle bytes=152
Reduce input records=13
Reduce output records=13
Spilled Records=26
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=148
CPU time spent (ms)=1520
Physical memory (bytes) snapshot=298029056
Virtual memory (bytes) snapshot=2346151936
Total committed heap usage (bytes)=143855616
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=113
File Output Format Counters
Bytes Written=94
[hdfs@localhost bin]$
Scheduler-View on Web Interface
_______________________________________________________
Step 33: View the output file of WorrdCount application program :
[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1
Found 2 items
-rw-r--r-- 1 hdfs supergroup 0 2013-12-29 05:09 /test1/output1/_SUCCESS
[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1/part-r-00000
Found 1 items
-rw-r--r-- 1 hdfs supergroup 120 2013-12-29 05:09 /test1/output1/part-r-00000
[hdfs@localhost bin]$ ./hadoop fs -cat /test1/output1/part-r-00000
Hello 693
Others 231
all 231
and 462
are 231
dear 231
everyone 462
friends 231
here 462
my 231
there 462
to 693
who 231
________________________________________________________________
References:
1) http://hadoop.apache.org/
3) http://hortonworks.com/hadoop/
4) http://www.cloudera.com/content/cloudera/en/home.html
5) http://www.meetup.com/lspe-in/pages/7th_Event_-_Hadoop_Hands_on_Session/
----------------------------------------------------------------------------------------------------------
This is small effort to make familiar with Hadoop YARN setup to run some MapReduce applications and to execute POSIX commands in HDFS environment and also to verify the output for Data analytics.There are many other configurations that you can set for history server/checkpoint/type of Scheduler.. etc which are very much required in production environment (That will be documented separately)
------------------
Click here : Overview of Hadoop .
Click here : Multi-node Cluster setup and Implementation.
Click here: Big Data Revolution and Vision ........!!!
Click here : Big Data : Watson - Era of cognitive computing !!!
End of YARN Single node Installation . :)