Flume Installation and Streaming Twitter Data Using Flume
The Flume service is used for efficiently collecting, aggregating, and moving large amounts of log data. Flume enables Hadoop to ingest data from live streams.
Below are the main components in this scenario:
- Event – a singular unit of data that is transported by Flume (typically a single log entry).
- Source – the entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
- Sink – the entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.
- Channel – the conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.
- Agent – any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
- Client – produces and transmits the Event to the Source operating within the Agent.
A flow in Flume starts from the Client (e.g. a web server). The Client transmits the event to a Source operating within the Agent. The Source receiving this event then delivers it to one or more Channels. These Channels are drained by one or more Sinks operating within the same Agent. Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange. When spikes in client-side activity cause data to be generated faster than the provisioned capacity at the destination can handle, the channel size increases; this allows sources to continue normal operation for the duration of the spike. Flume agents can be chained together by connecting the sink of one agent to the source of another, which enables the creation of complex dataflow topologies.
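To see the Client → Source → Channel → Sink path in action before wiring up Twitter, here is a minimal smoke test you can run once Flume is installed (STEPS 1-9 below). This is a sketch under assumptions: the agent name SmokeAgent, port 44444 and the /tmp/smoke.conf path are arbitrary choices of this example; it uses Flume's built-in netcat source and logger sink, and flume-ng is run from the Flume home directory.
Command:
# Throwaway agent config: netcat source -> memory channel -> logger sink
cat > /tmp/smoke.conf <<'EOF'
SmokeAgent.sources=Netcat
SmokeAgent.channels=MemChannel
SmokeAgent.sinks=Logger
SmokeAgent.sources.Netcat.type=netcat
SmokeAgent.sources.Netcat.bind=localhost
SmokeAgent.sources.Netcat.port=44444
SmokeAgent.sources.Netcat.channels=MemChannel
SmokeAgent.channels.MemChannel.type=memory
SmokeAgent.sinks.Logger.type=logger
SmokeAgent.sinks.Logger.channel=MemChannel
EOF
# Run the agent; every line sent to localhost:44444 becomes one Event
bin/flume-ng agent -n SmokeAgent --conf ./conf/ -f /tmp/smoke.conf -Dflume.root.logger=INFO,console
From another terminal, echo hello | nc localhost 44444 sends one event through the channel and the logger sink prints it to the console: the same producer-consumer flow the TwitterAgent below uses, but with HDFS as the destination.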
How to Take Input Data from Twitter to HDFS?
Before we discuss how to take the input data from Twitter to HDFS, let's look at the necessary prerequisites:
- A Twitter account
- Hadoop installed and running (see the commands below)
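If Hadoop is installed but not yet running, a typical way to start HDFS and verify the daemons (assuming a pseudo-distributed Hadoop 2.x setup, the sbin scripts on your PATH, and a NameNode that has already been formatted) is:
Command: start-dfs.sh
jps
jps should list NameNode and DataNode among the running JVMs before you proceed.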
STEP 1: Download flume:
Command: wget http://archive.apache.org/dist/flume/1.6.0/apache-flume-1.6.0-bin.tar.gz
STEP 2: Extract the Flume tar file.
Command: tar -xvf apache-flume-1.6.0-bin.tar.gz
STEP 3: Move the apache-flume-1.6.0-bin directory into the /home/manju/apache directory.
Command: sudo mv apache-flume-1.6.0-bin /home/manju/apache
STEP 4:
We need to remove protobuf-java-2.4.1.jar and guava-10.0.1.jar from the lib directory of apache-flume-1.6.0-bin (when using hadoop-2.x).
Command: sudo rm /home/manju/apache/apache-flume-1.6.0-bin/lib/protobuf-java-2.4.1.jar
sudo rm /home/manju/apache/apache-flume-1.6.0-bin/lib/guava-10.0.1.jar
STEP 5:
We need to rename twitter4j-core-3.0.3.jar, twitter4j-media-support-3.0.3.jar and twitter4j-stream-3.0.3.jar so that they do not conflict with the Twitter4J classes bundled inside flume-sources-1.0-SNAPSHOT.jar.
Command: mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-core-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-core-3.0.3.org
mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-media-support-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-media-support-3.0.3.org
mv /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-stream-3.0.3.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/twitter4j-stream-3.0.3.org
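To double-check that Flume will no longer load these jars, list the lib directory (the grep is just a convenience; after the renames you should see only the three .org files):
Command: ls /home/manju/apache/apache-flume-1.6.0-bin/lib/ | grep twitter4j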
STEP 6:
Move the flume-sources-1.0-SNAPSHOT.jar file from the Downloads directory to the lib directory of Apache Flume:
Command: sudo mv Downloads/flume-sources-1.0-SNAPSHOT.jar /home/manju/apache/apache-flume-1.6.0-bin/lib/
STEP 7: Copy the flume-env.sh.template content to flume-env.sh.
Command: cd /home/manju/apache/apache-flume-1.6.0-bin/
sudo cp conf/flume-env.sh.template conf/flume-env.sh
STEP 8:
Edit flume-env.sh to set JAVA_HOME and FLUME_CLASSPATH.
Command: sudo gedit conf/flume-env.sh
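As a sketch, the two lines to set in conf/flume-env.sh look like the following. Both paths are assumptions: use your own JDK location, and point FLUME_CLASSPATH at the flume-sources jar placed in the lib directory in STEP 6.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export FLUME_CLASSPATH=/home/manju/apache/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar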
STEP 9:
Set the FLUME_HOME path in the .bashrc file.
Command: sudo gedit ~/.bashrc
Example:
export FLUME_HOME=/home/manju/apache/apache-flume-1.6.0-bin
export PATH=$FLUME_HOME/bin:$PATH
export FLUME_CONF_DIR=$FLUME_HOME/conf
export FLUME_CLASS_PATH=$FLUME_CONF_DIR
Save the .bashrc file and close it.
Flume is now installed on the machine.
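To confirm the installation and the new environment variables, reload .bashrc and ask Flume for its version (a quick sanity check; the exact build details printed will differ per machine):
Command: source ~/.bashrc
flume-ng version
This should report Flume 1.6.0.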
Let's now run Flume to stream Twitter data onto HDFS. We need to create an application in Twitter and use its credentials to fetch data.
STEP 10:
URL: https://twitter.com/
Enter your Twitter account credentials and sign in.
STEP 11:
Change the URL to https://apps.twitter.com
Click on Create New App to create a new application and enter all the details for the application. Check Yes, I agree and click on Create your Twitter application.
STEP 12:
Your application will be created.
STEP 13:
Click on Keys and Access Tokens; you will see the Consumer Key and Consumer Secret.
STEP 14:
Scroll down and click on Create my access token. Your access token is now created. For example:
Consumer Key (API Key): 4AtbrP50QnfyXE2NlYwROBpTm
Consumer Secret (API Secret): jUpeHEZr5Df4q3dzhT2C0aR2N2vBidmV6SNlEELTBnWBMGAwp3
Access Token: 1434925639-p3Q2i3l2WLx5DvmdnFZWlYNvGdAOdf5BrErpGKk
Access Token Secret: AghOILIp9JJEDVFiRehJ2N7dZedB1y4cHh0MvMJN5DQu7
STEP 15:
Use the below link to download the flume.conf file:
https://drive.google.com/file/d/0B-Cl0IfLnRozdlRuN3pPWEJ1RHc/view?usp=sharing
flume.conf:
TwitterAgent.sources=Twitter
TwitterAgent.channels=MemChannel
TwitterAgent.sinks=HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels=MemChannel
TwitterAgent.sources.Twitter.consumerKey=zTtjoBHD8gaDeXpuJB2pZffht
TwitterAgent.sources.Twitter.consumerSecret=zJF28oTN0oUvCnEJrCl75muxiArqArDzOEkCWevwEQtfey999B
TwitterAgent.sources.Twitter.accessToken=141131163-ZNKnS0LKAMuB7jVJwidwWkBe4GzOXcgxjisS1TET
TwitterAgent.sources.Twitter.accessTokenSecret=BZj0Tdg8xF125AmX95f0g0jr9UHRxqHBImWB8RyDbPaE3
TwitterAgent.sources.Twitter.keywords= hadoop, cricket, sports
TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://localhost:9000/twitter_data
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000
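Two settings in this file are worth checking against your own cluster. The hdfs.path hard-codes hdfs://localhost:9000, which must match fs.defaultFS in core-site.xml; and the sink's hdfs.batchSize (1000 here) must not exceed the channel's transactionCapacity, so keep transactionCapacity at or above 1000. To confirm the NameNode address:
Command: hdfs getconf -confKey fs.defaultFS
This prints, for example, hdfs://localhost:9000.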
STEP 16:
Put the flume.conf in the conf directory of apache-flume-1.6.0-bin.
Command: sudo cp /home/manju/Downloads/flume.conf /home/manju/apache/apache-flume-1.6.0-bin/conf/
STEP 17:
Edit flume.conf.
Command: sudo gedit conf/flume.conf
Carefully replace the placeholder credentials in flume.conf with the credentials (Consumer Key, Consumer Secret, Access Token, Access Token Secret) you received after creating the application; everything else stays the same. Save the file and close it.
STEP 18:
Change permissions for the Flume directory.
Command: sudo chmod -R 755 /home/manju/apache/apache-flume-1.6.0-bin/
STEP 19:
Start fetching the data from Twitter:
Command: bin/flume-ng agent -n TwitterAgent --conf ./conf/ -f conf/flume.conf -Dflume.root.logger=DEBUG,console
Now wait for 20-30 seconds and let Flume stream the data onto HDFS. After that, press Ctrl+C to break the command and stop the streaming. (Since you are stopping the process abruptly, you may see a few exceptions; these can be ignored.)
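After stopping the agent, you can also verify from the command line that events reached HDFS (the path follows the flume.conf above; the FlumeData file names are generated by the HDFS sink, so the exact suffixes will differ):
Command: hdfs dfs -ls /twitter_data
hdfs dfs -cat /twitter_data/FlumeData.* | head -n 3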
STEP 20:
Open the Mozilla browser in your VM and browse to the /twitter_data directory in HDFS (the hdfs.path configured in flume.conf).
Click on the FlumeData file that was created.
If you can see raw tweet data in the file, the unstructured data has been streamed from Twitter onto HDFS successfully. You can now run analytics on this Twitter data using Hive.