=======================================================
Apache Spark & Scala
=======================================================
What is BigData
===============
File 201201hourly.txt ---------------> 520 MB
Notepad --------------- 30% of CPU / 1 GB RAM ------------------- failed to open the file
WordPad --------------- 30% of CPU / 1 GB RAM ------------------- opened the file successfully
BigData is a term used to describe data or a problem. It is NOT a technology, technique, software, or hardware.
BigData is all about the ability of the software to handle (store, retrieve, and process) the data.
Oracle database ---- single instance, single-machine license ---- can handle about 12 TB of data (license threshold)
MySQL database ---- 18 TB
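
One common reason a tool copes with a large file is that it reads the file in pieces instead of loading all of it into memory. The following is a minimal sketch of that idea in Python, not a description of how Notepad or WordPad work internally. The function name and the small temporary file standing in for 201201hourly.txt are assumptions made for illustration.

```python
import os
import tempfile


def count_lines_streaming(path):
    """Count lines without loading the whole file into memory."""
    total = 0
    with open(path, "r", encoding="utf-8") as handle:
        for _ in handle:  # the file object yields one line at a time
            total += 1
    return total


# Hypothetical stand-in for a large file such as 201201hourly.txt.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    for i in range(1000):
        tmp.write(f"record {i}\n")
    sample_path = tmp.name

line_count = count_lines_streaming(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(line_count)
```

Because only one line is held at a time, memory use stays roughly constant however large the file is.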
> Brief history of data analysis
> Data makes everything clearer
> Big Data ---- why all the excitement?
- Nowcasting vs. forecasting
- Google Flu Trends --- in Feb 2010, Google predicted flu outbreaks 2 weeks before the CDC
- New models predict which cities are more at risk of an Ebola outbreak
- e.g. Big Data and the 2012 election - the Obama team recruited a team of behavioural scientists
IBM definition of BigData
=========================
BigData is something that follows 3 "V"s
1) Volume - size of the data
2) Velocity - speed at which the data is changing
3) Variety - types of data
a) Structured data ---- data residing in an RDBMS such as Oracle, SQL Server, DB2, etc. (the data conforms to a schema)
b) Semi-structured data ---- XML, JSON logs, etc.
c) Unstructured data ---- text, application logs, emails, videos, audio, ...
Gartner agrees with the IBM definition of BigData if and ONLY if there exists another V:
4) Veracity / Value - if I process this data, will I get any useful outcome from it? If yes, I will consider it BigData; otherwise it is just garbage.
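
The three kinds of Variety listed above differ mainly in how much structure the reader gets for free. Here is a small sketch using only the Python standard library. All sample data (the CSV rows, the JSON record, and the log line) is invented for illustration.

```python
import csv
import io
import json
import re

# Structured: rows that conform to a fixed schema (here, CSV with a header).
structured = io.StringIO("id,name,amount\n1,alice,30\n2,bob,45\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing records whose fields may vary (JSON).
semi_structured = '{"user": "alice", "tags": ["spark", "scala"]}'
record = json.loads(semi_structured)

# Unstructured: free text; any structure has to be extracted by the reader.
unstructured = "ERROR disk full on node7 at 10:42, retrying"
node_match = re.search(r"node(\d+)", unstructured)
node_id = int(node_match.group(1)) if node_match else None

print(rows[1]["amount"], record["tags"], node_id)
```

The structured rows can be addressed by column name, the JSON record carries its own field names, and the free text needs a hand-written pattern before anything can be extracted from it.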
Where does Big Data come from?
==============================
Online
- Clicks
- Ad impressions
- Billing Events
- Fast Forward, Pause, ....
- Server Requests
- Transactions
- Network Message
- Fault
.....
User Generated Content (Web & Mobile)
- Facebook
- Twitter
- YouTube
- Instagram
- TripAdvisor
- LinkedIn
.....
Health & Scientific Studies
- Large Hadron Collider data
- Genome sequencing
- Clinical trials
....
Graph Data
- Social Networks
- Telecommunication Networks
- Road Networks
- Collaborations / Relationships
.....
Log Files
- Web Server logs
- Application Server logs
- Machine Syslog files
- Sensor data
- IoT, RFID tags
Data Science
============
Data Science is all about deriving knowledge from BigData, efficiently and intelligently.
Data Science encompasses a set of tools and methods that enable data-driven activities in science, business, medicine, and government.
Data Science approach (Jeff Hammerbacher's model, Facebook, Cloudera)
======================================================================
1. Identify the problem
2. Instrument Data Sources
3. Collect Data
4. Prepare Data
- Clean
- Filter
- Integrate
- Aggregate
- Transform
5. Build Model
6. Evaluate Model
7. Communicate Results
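
Step 4 (Prepare Data) is where most of the mechanical work happens. The sub-steps above can be sketched in plain Python as follows. The click records, the country lookup, and the `prepare` function are hypothetical and exist only to show one possible shape for each sub-step.

```python
from collections import defaultdict

# Hypothetical raw click records and a lookup table, invented for illustration.
raw_events = [
    {"user": " alice ", "page": "home", "ms": "120"},
    {"user": "bob", "page": "cart", "ms": "n/a"},   # bad duration
    {"user": "alice", "page": "cart", "ms": "300"},
    {"user": "", "page": "home", "ms": "80"},       # missing user
]
user_country = {"alice": "IN", "bob": "US"}


def prepare(events, countries):
    # Clean: strip whitespace and parse numbers, marking bad values as None.
    cleaned = []
    for e in events:
        ms = int(e["ms"]) if e["ms"].isdigit() else None
        cleaned.append({"user": e["user"].strip(), "page": e["page"], "ms": ms})

    # Filter: drop records that are unusable after cleaning.
    kept = [e for e in cleaned if e["user"] and e["ms"] is not None]

    # Integrate: join each event with the country lookup.
    joined = [dict(e, country=countries.get(e["user"], "unknown")) for e in kept]

    # Aggregate: total time per (user, country).
    totals = defaultdict(int)
    for e in joined:
        totals[(e["user"], e["country"])] += e["ms"]

    # Transform: reshape into model-ready rows, converting ms to seconds.
    return [
        {"user": u, "country": c, "seconds": ms / 1000}
        for (u, c), ms in sorted(totals.items())
    ]


prepared = prepare(raw_events, user_country)
print(prepared)
```

On real data each sub-step would usually be its own job or query. Keeping them as separate stages, as here, makes it easier to inspect what each one dropped or changed.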
Data Science Topics
===================
1. Data acquisition
2. Data Preparation
3. Analysis
4. Data Presentation
5. Data Products
6. Observation & Experimentation
Data Science Roles
==================
At Enterprises
===============
Data Source ----> Application databases, application server logs, intranet files, etc.
ETL ------------> Informatica, Ab Initio, Talend, IBM DataStage
Data Warehouse -> Teradata, Oracle, IBM DB2, SQL Server
BI & Analytics -> SAS, SAP, Business Objects, Cognos, R, MicroStrategy, ...
At Web Companies
================
Data Source ----> Application databases, logs from service tiers, web crawl data, etc.
ETL ------------> Apache Flume, Sqoop, Kafka, Pig, Oozie, etc.
Data Warehouse -> Hadoop, Hive, Spark, Spark SQL
BI & Analytics -> Custom dashboards, RazorFlow, etc.
Value of Analytics - enabling Smarter decisions
===============================================

==========================================  =====================  =========================
Metric                                      Traditional Marketing  Targeting Using Analytics
==========================================  =====================  =========================
Number of customers targeted                10,000                 1,000
Cost per customer targeted (assume $2)      $2                     $2
Response rate                               2%                     10%
Number of responses                         200                    100
Total revenue (assume $100 per response)    $20,000                $10,000
Total cost of the campaign                  $20,000                $2,000
Total profit                                $0                     $8,000
==========================================  =====================  =========================

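
The comparison above is simple arithmetic, so it can be reproduced in a few lines. This sketch uses the same assumed figures ($2 per customer targeted, $100 per response). The function name `campaign_profit` is made up for illustration.

```python
def campaign_profit(customers, cost_per_customer, response_rate, revenue_per_response):
    """Return (responses, revenue, cost, profit) for one campaign."""
    responses = round(customers * response_rate)
    revenue = responses * revenue_per_response
    cost = customers * cost_per_customer
    return responses, revenue, cost, revenue - cost


# Traditional marketing: 10,000 customers at a 2% response rate.
traditional = campaign_profit(10000, 2, 0.02, 100)

# Targeting using analytics: 1,000 customers at a 10% response rate.
with_analytics = campaign_profit(1000, 2, 0.10, 100)

print(traditional)
print(with_analytics)
```

The targeted campaign gets half as many responses but costs a tenth as much, and that difference is where the profit comes from.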
Apache Spark
============
Spark is a general-purpose, in-memory computing system
-> It is meant for fast data analysis
-> It exposes high-level APIs in Java, Scala, and Python, and provides an optimized engine that supports general execution graphs
-> Spark Core is the foundation: an API and engine that provides scheduling, monitoring, and distribution of BigData processing on distributed clusters
Why the need for third-generation distributed application frameworks?
======================================================================
-> Faster processing is needed to enable faster decision making
-> MapReduce is primarily meant for batch processing
-> Different frameworks are needed for different applications
-> It is difficult to program in MapReduce
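
To give a feel for the last point, here is a word count written in the MapReduce style. It is an in-process illustration of the programming model in plain Python, not the Hadoop API. The function names and the two sample lines are assumptions made for this sketch.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    # Emit a (word, 1) pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values for one key into a single count.
    return key, reduce(lambda a, b: a + b, values)


lines = ["big data is big", "spark is fast"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(counts)
```

Even this small job has to be split into separate map, shuffle, and reduce steps. In Spark the same computation is typically written as a short chain of transformations on a single dataset.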
Spark functional features
=========================
-> Provides powerful caching and disk persistence capabilities
-> Enables faster batch processing
-> Highly scalable and simplified architecture
-> General purpose engine that supports multiple applications
-> Real-time stream processing
-> Iterative algorithms
-> Interactive data analysis
-> Machine Learning applications
-> Graph processing