BIG DATA: May 2016

Big Data: Hadoop 2.x (YARN/MRv2) - Single Node Installation

Apache Hadoop 2/ YARN/MR2 Installation for Beginners :

Background:

Big Data spans three dimensions: Volume, Velocity and Variety. (IBM defined 4th dimension or property of Big Data i.e Veracity). Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets (Big Data) across clusters of commodity Machines(Low-cost Servers). It is designed to scale up to thousands of machines, with a high degree of fault tolerance and software has the intelligence to detect & handle the failures at the application layer.

NOTE: More details are available@http://hadoop.apache.org/docs/stable/

• The Apache Hadoop component introduced two new terms for Hadoop 1.0 users - MapReduce2 (MR2) and YARN.

• Apache Hadoop YARN is the next-generation Hadoop framework designed to take Hadoop beyond MapReduce for data-processing- resulted in better cluster utilization that permit Hadoop to scale to accommodate more and larger jobs.

• This blog provides information for users to migrate their Apache Hadoop MapReduce applications from Apache Hadoop 1.x to Apache Hadoop 2.x

https://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html

Steps to Install Hadoop2.0 on CentOS/RHEL6 on single node Cluster setup:

Step1: Install Java from link

:http://www.oracle.com/technetwork/java/javase/downloads/index.html

Set the environment variable $JAVA_HOME properly

NOTE: Java-1.6.0-openjdk OR other Hadoop Java Versions listed in a below link are more preferable.

http://wiki.apache.org/hadoop/HadoopJavaVersions

Step2: Download Apache Hadoop2.2 to folder /opt from link : http://hadoop.apache.org/releases.html#Download

Step 3: Add all hadoop and java environment path variables to .bashrc file.

Example :

Configure $HOME/.bashrc

- HODOOP_HOME

- JAVA_PATH

- PATH

- HADOOP_HDFS_HOME

- HADOOP_YARN_HOME

- HADOOP_MAPRED_HOME

- HADOOP_CONF_DIR

- YARN_CLASS_PATH

------------------------------------------------------------------------------------------

Step 4 : Create a separate Group for Hadoop setup

# groupadd hadoop

Step 5: Add 3 user-accounts in Group "hadoop"

# useradd -g hadoop yarn

# useradd -g hadoop hdfs

# useradd -g hadoop mapred

NOTE: Its good to run daemons with a related accounts

Step 6: Create Data Directories for namenode,datanode and secondary namenode

# mkdir -p /var/data/hadoop/hdfs/nn

# mkdir -p /var/data/hadoop/hdfs/dn

# mkdir -p /var/data/hadoop/hdfs/snn

Step 7: Set permission for "hdfs" account

# chown hdfs:hadoop /var/data/hadoop/hdfs -R

Step 8: Create Log Directories

# mkdir -p /var/log/hadoop/yarn

# mkdir logs (at installation directory Example /opt/hadoop2.2.0/logs)

Step 9: Set ownership to yarn

# chown yarn:hadoop /var/log/hadoop/yarn - R

Go to Hadoop directory "/opt/hadoop2.2.0/ "

# chmod g+w logs

# chown yarn:hadoop . -R

Step 10: Configure below listed XML files at $HADOOP_PREFIX/etc/hadoop

------------------------------------------------------------------------------------------------------------------

i) core-site.xml

---------------------------------------------------------------------------------------------------------------------

ii) hadoop-env.sh

------------------------------------------------------------------------------------------------------------------------

iii) hdfs-site.xml

------------------------------------------------------------------------------------------------------------------------

iv) mapred-site.xml

--------------------------------------------------------------------------------------------------------------------

v) yarn-env.sh

------------------------------------------------------------------------------------------------------------------------

vi) yarn-site.xml

---------------------------------------------------------------------------------------------------------------

Step 11: Create a passwordless ssh session for "hdfs" user account :

# su - hdfs

hdfs@localhost$ ssh-keygen -t rsa

hdfs@localhost$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

hdfs@localhost$ chmod 0600 ~/.ssh/authorized_keys

---------------------------------------------------------------------------------

Step 12:

Now you are allowed to login without prompting for the password :

[hdfs@localhost]$ ssh localhost

Last login: Sun Dec 29 04:31:44 2013 from localhost

[hdfs@localhost ~]$

---------------------------------------------------------------------------------------------------------------

Step 13: Format Hadoop File system :

Format the NameNode directory as the HDFS superuser ( "hdfs" user account)

#su - hdfs

$ cd /opt/hadoop2.2/bin

$./hdfs namenode -format

It should show the message : /var/data/hadoop/hdfs/nn has been successfully formated as shown below:

[hdfs@localhost bin]$ ./hdfs namenode -format

13/12/29 02:36:52 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG: host = localhost.localdomain/127.0.0.1

STARTUP_MSG: args = [-format]

STARTUP_MSG: version = 2.2.0

STARTUP_MSG: classpath = /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/jetty-6.1.26.jar:/opt/hadoop-2.2.0/share/hadoop/common/lib/commons-el-1.0.jar:

STARTUP_MSG: java = 1.7.0_45

************************************************************/

13/12/29 02:36:52 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]

Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /opt/hadoop-2.2.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.

It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

13/12/29 02:36:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Formatting using clusterid: CID-d47a364a-edc6-455f-b3c8-4d2ba54458d5

13/12/29 02:36:54 INFO namenode.HostFileManager: read includes:

HostSet(

)

13/12/29 02:36:54 INFO namenode.HostFileManager: read excludes:

HostSet(

)

13/12/29 02:36:54 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000

13/12/29 02:36:54 INFO util.GSet: Computing capacity for map BlocksMap

13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit

13/12/29 02:36:54 INFO util.GSet: 2.0% max memory = 96.7 MB

13/12/29 02:36:54 INFO util.GSet: capacity = 2^18 = 262144 entries

13/12/29 02:36:54 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false

13/12/29 02:36:54 INFO blockmanagement.BlockManager: defaultReplication = 1

13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplication = 512

13/12/29 02:36:54 INFO blockmanagement.BlockManager: minReplication = 1

13/12/29 02:36:54 INFO blockmanagement.BlockManager: maxReplicationStreams = 2

13/12/29 02:36:54 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks = false

13/12/29 02:36:54 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000

13/12/29 02:36:54 INFO blockmanagement.BlockManager: encryptDataTransfer = false

13/12/29 02:36:54 INFO namenode.FSNamesystem: fsOwner = hdfs (auth:SIMPLE)

13/12/29 02:36:54 INFO namenode.FSNamesystem: supergroup = supergroup

13/12/29 02:36:54 INFO namenode.FSNamesystem: isPermissionEnabled = true

13/12/29 02:36:54 INFO namenode.FSNamesystem: HA Enabled: false

13/12/29 02:36:54 INFO namenode.FSNamesystem: Append Enabled: true

13/12/29 02:36:54 INFO util.GSet: Computing capacity for map INodeMap

13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit

13/12/29 02:36:54 INFO util.GSet: 1.0% max memory = 96.7 MB

13/12/29 02:36:54 INFO util.GSet: capacity = 2^17 = 131072 entries

13/12/29 02:36:54 INFO namenode.NameNode: Caching file names occuring more than 10 times

13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033

13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0

13/12/29 02:36:54 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension = 30000

13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache on namenode is enabled

13/12/29 02:36:54 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis

13/12/29 02:36:54 INFO util.GSet: Computing capacity for map Namenode Retry Cache

13/12/29 02:36:54 INFO util.GSet: VM type = 64-bit

13/12/29 02:36:54 INFO util.GSet: 0.029999999329447746% max memory = 96.7 MB

13/12/29 02:36:54 INFO util.GSet: capacity = 2^12 = 4096 entries

13/12/29 02:36:55 INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.

13/12/29 02:36:56 INFO namenode.FSImage: Saving image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 using no compression

13/12/29 02:36:56 INFO namenode.FSImage: Image file /var/data/hadoop/hdfs/nn/current/fsimage.ckpt_0000000000000000000 of size 196 bytes saved in 0 seconds.

13/12/29 02:36:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0

13/12/29 02:36:56 INFO util.ExitUtil: Exiting with status 0

13/12/29 02:36:56 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1

************************************************************/

[hdfs@localhost bin]$

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 14: Start HDFS service - Namenode Daemon process

$cd ../sbin

[hdfs@localhost bin]$ cd ../sbin/

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start namenode

starting namenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-namenode localhost.localdomain.out

Step 15: Check the status of namenode daemon

[hdfs@localhost ]$ jps

4537 Jps

4300 NameNode =====> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java

hdfs 4300 1 11 02:38 pts/1 00:00:04 /usr/java/default/bin/java -Dproc_namenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-namenode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.NameNode

_______________________________________________________________________________

Step 16 : Start HDFS service - Secondary Namenode Daemon process

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start secondarynamenode

starting secondarynamenode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-secondarynamenode-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 17 : Check the status of Secondarynamenode daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4913

hdfs 4913 1 7 02:46 pts/1 00:00:04 /usr/java/default/bin/java -Dproc_secondarynamenode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-secondarynamenode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS -Dhdfs.audit.logger=INFO,NullAppender -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode

_____________________________________________________________________________________________________________

Step 18: Start HDFS service - DataNode Daemon process

[hdfs@localhost sbin]$ ./hadoop-daemon.sh start datanode

starting datanode, logging to /opt/hadoop-2.2.0/logs/hadoop-hdfs-datanode-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 19: Check the status of Datanode daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode

4949 Jps

4373 DataNode ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4373

hdfs 4373 1 34 02:39 pts/1 00:00:06 /usr/java/default/bin/java -Dproc_datanode -Xmx100m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=hadoop-hdfs-datanode-localhost.localdomain.log -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode

___________________________________________________________________

Step 20:Start YARN service - resourcemanager Daemon process

[hdfs@localhost sbin]$ ./yarn-daemon.sh start resourcemanager

starting resourcemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-resourcemanager-localhost.localdomain.out

Step 21 : Check the status of ResourceManager daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4913 SecondaryNameNode

4949 Jps

4373 DataNode

4500 ResourceManager ======> started successfully

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4500

hdfs 4500 1 3 02:41 pts/1 00:00:08 /usr/java/default/bin/java -Dproc_resourcemanager -Xmx200m -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-resourcemanager-localhost.localdomain.log -Dyarn.home.dir=/opt/hadoop-2.2.0 -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -classpath /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/*:/opt/hadoop-2.2.0/share/hadoop/common/*:/opt/hadoop-2.2.0/share/hadoop/hdfs:/opt/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop-2.2.0/share/hadoop/hdfs/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/etc/hadoop//rm-config/log4j.properties org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

__________________________________________________________________________________________________________

Step 22:Start YARN service - NodeManager Daemon process

[hdfs@localhost sbin]$ ./yarn-daemon.sh start nodemanager

starting nodemanager, logging to /opt/hadoop-2.2.0/logs/yarn-hdfs-nodemanager-localhost.localdomain.out

[hdfs@localhost sbin]$

Step 23 : Check the status of Nodemanager daemon

[hdfs@localhost bin]$ jps

4300 NameNode

4744 NodeManager ======> started successfully

4913 SecondaryNameNode

4949 Jps

4373 DataNode

4500 ResourceManager

[root@localhost bin]#

[hdfs@localhost sbin]$ ps -ef | grep java | grep 4744

hdfs 4744 1 2 02:42 pts/1 00:00:03 /usr/java/default/bin/java -Dproc_nodemanager -Xmx200m -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir= -Dyarn.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=/opt/hadoop-2.2.0/logs -Dyarn.log.dir=/opt/hadoop-2.2.0/logs -Dhadoop.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.log.file=yarn-hdfs-nodemanager-localhost.localdomain.log -Dyarn.home.dir=/opt/hadoop-2.2.0 -Dhadoop.home.dir=/opt/hadoop-2.2.0 -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/opt/hadoop-2.2.0/lib/native -classpath /opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/etc/hadoop/:/opt/hadoop-2.2.0/share/hadoop/common/lib/*:/opt/hadoop-2.2.0/share/hadoop/common/*:/opt/hadoop-2.2.0/share/hadoop/hdfs:/opt/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/opt/hadoop-2.2.0/share/hadoop/hdfs/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/opt/hadoop-2.2.0/share/hadoop/mapreduce/*:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/opt/hadoop-2.2.0/share/hadoop/yarn/*:/opt/hadoop-2.2.0/share/hadoop/yarn/lib/*:/opt/hadoop-2.2.0/etc/hadoop//nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager ________________________________________________________

Step 24: This command gives you information on hdfs system

[hdfs@localhost bin]$ ./hadoop dfsadmin -report

Configured Capacity: 16665448448 (15.52 GB)

Present Capacity: 12396371968 (11.55 GB)

DFS Remaining: 12396347392 (11.54 GB)

DFS Used: 24576 (24 KB)

DFS Used%: 0.00%

Under replicated blocks: 0

Blocks with corrupt replicas: 0

Missing blocks: 0

-------------------------------------------------

Datanodes available: 1 (1 total, 0 dead)

Live datanodes:

Name: 127.0.0.1:50010 (localhost)

Hostname: localhost

Decommission Status : Normal

Configured Capacity: 16665448448 (15.52 GB)

DFS Used: 24576 (24 KB)

Non DFS Used: 4269076480 (3.98 GB)

DFS Remaining: 12396347392 (11.54 GB)

DFS Used%: 0.00%

DFS Remaining%: 74.38%

Last contact: Sun Dec 29 03:11:02 PST 2013

[hdfs@localhost bin]$

________________________________________________________

Step25: Stop all the services by running " stop-all.sh "

[hdfs@localhost sbin]$ ./stop-all.sh

This script is Deprecated. Instead use stop-dfs.sh and stop-yarn.sh

Stopping namenodes on [localhost]

localhost: stopping namenode

localhost: stopping datanode

Stopping secondary namenodes [0.0.0.0]

0.0.0.0: stopping secondarynamenode

stopping yarn daemons

stopping resourcemanager

localhost: stopping nodemanager

no proxyserver to stop

[hdfs@localhost sbin]$

________________________________________________________

Step 26: Start all the services by running "start-all.sh "

Added the YARN architecture block diagram to locate the presence of daemons in different components .

[hdfs@localhost sbin]$ ./start-all.sh

check the status of all services :

[hdfs@localhost sbin]$ jps

6161 NameNode

6260 DataNode

6719 NodeManager

6750 Jps

6355 SecondaryNameNode

6429 ResourceManager

[root@localhost bin]#

Job Definition and control Flow between Hadoop/Yarn components:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=31822268

________________________________________________________

Step 27: Run sample application program "pi" from hadoop-mapreduce-examples-2.2.0.jar

First test with hadoop to run existing hadoop program - launch the program, monitor progress, and get/put files on the HDFS. This program calculates the value of " pi " in parallel i.e 2 maps with 10 samples:

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar pi 2 10

[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10

Number of Maps = 2

Samples per Map = 10

Wrote input for Map #0

Wrote input for Map #1

Starting Job

13/12/29 04:33:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

13/12/29 04:33:13 INFO input.FileInputFormat: Total input paths to process : 2

13/12/29 04:33:13 INFO mapreduce.JobSubmitter: number of splits:2

13/12/29 04:33:14 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0001

13/12/29 04:33:15 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0001 to ResourceManager at /0.0.0.0:8032

13/12/29 04:33:15 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0001/

13/12/29 04:33:15 INFO mapreduce.Job: Running job: job_1388320369543_0001

13/12/29 04:33:38 INFO mapreduce.Job: Job job_1388320369543_0001 running in uber mode : false

13/12/29 04:33:38 INFO mapreduce.Job: map 0% reduce 0%

13/12/29 04:35:22 INFO mapreduce.Job: map 83% reduce 0%

13/12/29 04:35:23 INFO mapreduce.Job: map 100% reduce 0%

13/12/29 04:36:10 INFO mapreduce.Job: map 100% reduce 100%

13/12/29 04:36:16 INFO mapreduce.Job: Job job_1388320369543_0001 completed successfully

13/12/29 04:36:16 INFO mapreduce.Job: Counters: 43

File System Counters

FILE: Number of bytes read=50

FILE: Number of bytes written=238681

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=528

HDFS: Number of bytes written=215

HDFS: Number of read operations=11

HDFS: Number of large read operations=0

HDFS: Number of write operations=3

Job Counters

Launched map tasks=2

Launched reduce tasks=1

Data-local map tasks=2

Total time spent by all maps in occupied slots (ms)=208977

Total time spent by all reduces in occupied slots (ms)=39840

Map-Reduce Framework

Map input records=2

Map output records=4

Map output bytes=36

Map output materialized bytes=56

Input split bytes=292

Combine input records=0

Combine output records=0

Reduce input groups=2

Reduce shuffle bytes=56

Reduce input records=4

Reduce output records=0

Spilled Records=8

Shuffled Maps =2

Failed Shuffles=0

Merged Map outputs=2

GC time elapsed (ms)=1712

CPU time spent (ms)=3320

Physical memory (bytes) snapshot=454049792

Virtual memory (bytes) snapshot=3515953152

Total committed heap usage (bytes)=268247040

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=236

File Output Format Counters

Bytes Written=97

Job Finished in 184.356 seconds

Estimated value of Pi is 3.80000000000000000000

[hdfs@localhost bin]$

________________________________________________________________

Step 28 : Verify the Running Services Using the Web Interface:

Web Interface for the resource Manager can be viewed by

http://localhost:8088

Shows the running application on single node cluster

Application Overview -Final Status( FINISHED)

__________________________________________________________________

Step 29 : Create a Directory on HDFS

[hdfs@localhost bin]$ ./hadoop fs -mkdir test1

-------------------------------------------------------------------------

Step 30: Put local file "hellofile" into HDFS (/test1)

[hdfs@localhost bin]$ ./hadoop fs -put hellofile /test1

-------------------------------------------------------------------------

Step 31: Check the input file "hellofile" on HDFS

[hdfs@localhost bin]$ ./hadoop fs -ls /test1

Found 1 items

-rw-r--r-- 1 hdfs supergroup 113 2013-12-29 04:56 /test1/hellofile

[hdfs@localhost bin]$

___________________________________________________________

Step 32: Run application program "WordCount" from hadoop-mapreduce-examples-2.2.0.jar

WordCount Example:

WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occured, separated by a tab.Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.

To run the example, the command syntax is

bin/hadoop jar hadoop-*-examples.jar wordcount <in-dir> <out-dir>

All of the files in the input directory (called in-dir in the command line above) are read and the counts of words in the input are written to the output directory (called out-dir above).It is assumed that both inputs and outputs are stored in HDFS.If your input is not already in HDFS, but is rather in a local file system somewhere, you need to copy the data into HDFS as shown in above steps 29-31.

NOTE: Similarly you could think of processing bigger Data Files ( Weather data , Healthcare data, Machine Log data ...etc).

[hdfs@localhost bin]$ ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /test1/hellofile /test1/output

13/12/29 04:57:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

13/12/29 04:57:53 INFO input.FileInputFormat: Total input paths to process : 1

13/12/29 04:57:53 INFO mapreduce.JobSubmitter: number of splits:1

13/12/29 04:57:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1388320369543_0002

13/12/29 04:57:55 INFO impl.YarnClientImpl: Submitted application application_1388320369543_0002 to ResourceManager at /0.0.0.0:8032

13/12/29 04:57:55 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1388320369543_0002/

13/12/29 04:57:55 INFO mapreduce.Job: Running job: job_1388320369543_0002

13/12/29 04:58:06 INFO mapreduce.Job: Job job_1388320369543_0002 running in uber mode : false

13/12/29 04:58:06 INFO mapreduce.Job: map 0% reduce 0%

13/12/29 04:58:17 INFO mapreduce.Job: map 100% reduce 0%

13/12/29 04:58:41 INFO mapreduce.Job: map 100% reduce 100%

13/12/29 04:58:42 INFO mapreduce.Job: Job job_1388320369543_0002 completed successfully

13/12/29 04:58:42 INFO mapreduce.Job: Counters: 43

File System Counters

FILE: Number of bytes read=152

FILE: Number of bytes written=158589

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=215

HDFS: Number of bytes written=94

HDFS: Number of read operations=6

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=9934

Total time spent by all reduces in occupied slots (ms)=19948

Map-Reduce Framework

Map input records=4

Map output records=21

Map output bytes=194

Map output materialized bytes=152

Input split bytes=102

Combine input records=21

Combine output records=13

Reduce input groups=13

Reduce shuffle bytes=152

Reduce input records=13

Reduce output records=13

Spilled Records=26

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=148

CPU time spent (ms)=1520

Physical memory (bytes) snapshot=298029056

Virtual memory (bytes) snapshot=2346151936

Total committed heap usage (bytes)=143855616

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=113

File Output Format Counters

Bytes Written=94

[hdfs@localhost bin]$

Verify the Running Services Using the Web Interface:

All Applications

Scheduler-View on Web Interface

_______________________________________________________

Step 33: View the output file of WorrdCount application program :

[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1

Found 2 items

-rw-r--r-- 1 hdfs supergroup 0 2013-12-29 05:09 /test1/output1/_SUCCESS

-rw-r--r-- 1 hdfs supergroup 120 2013-12-29 05:09 /test1/output1/part-r-00000

[hdfs@localhost bin]$ ./hadoop fs -ls /test1/output1/part-r-00000

Found 1 items

-rw-r--r-- 1 hdfs supergroup 120 2013-12-29 05:09 /test1/output1/part-r-00000

[hdfs@localhost bin]$ ./hadoop fs -cat /test1/output1/part-r-00000

Hello 693

Others 231

all 231

and 462

are 231

dear 231

everyone 462

friends 231

here 462

my 231

there 462

to 693

who 231

[hdfs@localhost bin]$

________________________________________________________________

References:

1) http://hadoop.apache.org/

2) Hadoop: The Definitive Guide by Tom White http://it-ebooks.info/book/635/

3) http://hortonworks.com/hadoop/

4) http://www.cloudera.com/content/cloudera/en/home.html

5) http://www.meetup.com/lspe-in/pages/7th_Event_-_Hadoop_Hands_on_Session/

----------------------------------------------------------------------------------------------------------

This is small effort to make familiar with Hadoop YARN setup to run some MapReduce applications and to execute POSIX commands in HDFS environment and also to verify the output for Data analytics.There are many other configurations that you can set for history server/checkpoint/type of Scheduler.. etc which are very much required in production environment (That will be documented separately)

------------------

Click here : Overview of Hadoop .

Click here : Multi-node Cluster setup and Implementation.

Click here: Big Data Revolution and Vision ........!!!

Click here : Big Data : Watson - Era of cognitive computing !!!

End of YARN Single node Installation . :)

BIG DATA

Wednesday, 18 May 2016

HADOOP INSTALLATION