Thursday 20 April 2017

About Spark & Scala

=======================================================
Apache Spark & Scala
=======================================================

What is BigData
===============

File 201201hourly.txt ---------------> 520 MB

Notepad --------------- 30% of CPU / 1 GB RAM ------------------- failed to open the file
Wordpad --------------- 30% of CPU / 1 GB RAM ------------------- Opened the file successfully

BigData is a term used to describe a data problem. It is NOT a technology, technique, software or hardware
BigData is about whether the software is able to handle (store, retrieve, process) the data
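
The Notepad and Wordpad comparison above comes down to how a program reads its input. A minimal Scala sketch, assuming a line-oriented text file: `countLines` is a hypothetical helper (not part of any library) that streams the file one line at a time instead of loading it whole, so memory use stays roughly constant however large the file is.

```scala
import scala.io.Source

object StreamingRead {
  // Hypothetical helper: counts lines without holding the whole file in memory.
  def countLines(path: String): Long = {
    val source = Source.fromFile(path)
    try {
      // getLines() is a lazy iterator; each line is read only when requested.
      source.getLines().foldLeft(0L)((n, _) => n + 1)
    } finally source.close()
  }
}
```

Usage would look like `StreamingRead.countLines("201201hourly.txt")`, assuming the file is in the working directory.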

Oracle database ---- single instance, single machine license --- can handle about 12 TB of data (license threshold)
MySQL database ---- 18 TB


> Brief History of Data Analysis
> Data Makes everything clearer
> Big Data ---- Why all the excitement
- Nowcasting vs Forecasting
- Google Flu Trends --- Feb 2010: Google predicted flu outbreaks 2 weeks before the CDC
- New models predicting which cities are more at risk of an Ebola outbreak
- e.g. Big Data and the 2012 election - the Obama team recruited a team of behavioural scientists


IBM definition of BigData
=========================

BigData is something which follows 3 "V"s

1) Volume - size of data
2) Velocity - speed at which data is changing
3) Variety - types of data

a) Structured data ---- Data residing in an RDBMS like Oracle, SQL Server, DB2 etc... (data conforms to a schema)
b) Semi-structured data ---- XML, JSON logs etc...
c) Un-structured data ---- Text, Application logs, emails, videos, audios....

Gartner agrees with the IBM definition of BigData if and ONLY if there exists another V

4) Veracity / Value - If I process this data, will I get any outcome out of it ... If yes, then I will consider it as BigData... else ... it is just garbage !!!!

Where does Big Data Come from ?
===============================

Online
- Clicks
- Ad impressions
- Billing Events
- Fast Forward, Pause, ....
- Server Requests
- Transactions
- Network Messages
- Faults
.....

User Generated Content (Web & Mobile)
- Facebook
- Twitter
- Youtube
- Instagram
- Trip Advisor
- Linked In
.....

Health & Scientific Studies
- Large Hadron Collider Data
- Genome Sequencing
- Clinical Trials
....

Graph Data
- Social Networks
- Telecommunication Networks
- Road Networks
- Collaborations / Relationships
.....

Log Files
- Web Server logs
- Application Server logs
- Machine Syslog files
- Sensor data
- IoT, RFID tags


Data Science
============
Data Science is all about deriving knowledge from BigData, efficiently and intelligently
Data Science encompasses a set of tools and methods that enable data driven activities in science, business, medicine and government.

Data Science approach (Jeff Hammerbacher's model; Facebook, Cloudera)
=====================

1. Identify the problem
2. Instrument Data Sources
3. Collect Data
4. Prepare Data
- Clean
- Filter
- Integrate
- Aggregate
- Transform
5. Build Model
6. Evaluate Model
7. Communicate Results
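
Step 4 (Prepare Data) can be sketched with plain Scala collections. This is only an illustration: the `Reading` record, the station ids, the city names and the plausibility bounds are all made up, and a real pipeline would read from the instrumented data sources.

```scala
object PrepareData {
  // Hypothetical raw record: a station id and an optional temperature reading.
  final case class Reading(station: String, tempC: Option[Double])

  // Hypothetical lookup table used in the "integrate" step.
  val stationCity: Map[String, String] = Map("S1" -> "Pune", "S2" -> "Delhi")

  def prepare(raw: Seq[Reading]): Map[String, Double] = {
    // Clean: drop records with a missing reading.
    val cleaned = raw.collect { case Reading(s, Some(t)) => (s, t) }
    // Filter: keep only values inside an assumed plausible range.
    val filtered = cleaned.filter { case (_, t) => t > -90.0 && t < 60.0 }
    // Integrate: join with the station-to-city lookup; unknown stations are dropped.
    val integrated = filtered.flatMap { case (s, t) => stationCity.get(s).map(c => (c, t)) }
    // Aggregate: average temperature per city.
    val aggregated = integrated.groupBy(_._1).map { case (c, rows) =>
      c -> rows.map(_._2).sum / rows.size
    }
    // Transform: convert Celsius to Fahrenheit.
    aggregated.map { case (c, avg) => c -> (avg * 9.0 / 5.0 + 32.0) }
  }
}
```

Each sub-step is a separate value, so any one of them can be inspected or replaced without touching the others.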

Data Science Topics
===================

1. Data acquisition
2. Data Preparation
3. Analysis
4. Data Presentation
5. Data Products
6. Observation & Experimentation

Data Science Roles
==================

At Enterprises
===============
Data Source --> Application Databases, Application Server logs, Intranet files, etc..
ETL ----------> Informatica, Ab Initio, Talend, IBM DataStage
DataWarehouse -> Teradata, Oracle, IBM DB2, SQL Server
BI & Analytics -> SAS, SAP, Business Objects, Cognos, R, MicroStrategy ...

At WebCompanies
================
Data Source --> Application Databases, Logs from Services Tiers, Web Crawl Data etc...
ETL ----------> Apache Flume, Sqoop, Kafka, Pig, Oozie etc..
DataWarehouse -> Hadoop, Hive, Spark, Spark SQL
BI & Analytics -> Custom Dashboard, RazorFlow etc..


Value of Analytics - enabling Smarter decisions
===============================================

---------------------------------------------------------------------------
                                            Traditional     Targeting Using
                                            Marketing       Analytics
---------------------------------------------------------------------------
Number of customers targeted                10000           1000
---------------------------------------------------------------------------
Cost per customer targeted (assume $2)      $2              $2
---------------------------------------------------------------------------
Response rate                               2%              10%
---------------------------------------------------------------------------
Number of responses                         200             100
---------------------------------------------------------------------------
Total revenue (assume $100 per response)    $20000          $10000
---------------------------------------------------------------------------
Total cost of the campaign                  $20000          $2000
---------------------------------------------------------------------------
Total profit                                $0              $8000
---------------------------------------------------------------------------
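
The arithmetic behind the table can be reproduced in a few lines of Scala. This is a sketch: the `Campaign` case class and its field names are illustrative, while the numbers are the table's own (including its assumed $2 cost per customer and $100 revenue per response).

```scala
object CampaignMath {
  // All inputs come from the table; money is in whole dollars, rates in whole percent.
  final case class Campaign(customers: Int, costPerCustomer: Int,
                            responseRatePct: Int, revenuePerResponse: Int) {
    def responses: Int = customers * responseRatePct / 100
    def revenue: Int   = responses * revenuePerResponse
    def cost: Int      = customers * costPerCustomer
    def profit: Int    = revenue - cost
  }

  val traditional: Campaign = Campaign(10000, 2, 2, 100)
  val targeted: Campaign    = Campaign(1000, 2, 10, 100)

  def main(args: Array[String]): Unit = {
    println(s"Traditional profit: ${traditional.profit}") // 0
    println(s"Targeted profit:    ${targeted.profit}")    // 8000
  }
}
```

Targeting a tenth of the customers at five times the response rate halves the revenue but cuts the cost by 90%, which is where the $8000 profit comes from.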


Apache Spark
============

Spark is a general purpose in-memory computing system

-> It is meant for fast Data Analysis
-> It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs
-> Spark Core is the base engine; its API provides scheduling, monitoring and distribution of BigData processing on distributed clusters

Why Third-Generation Distributed Application Frameworks Are Needed
==================================================================

-> Need for faster processing to enable faster decision making
-> MapReduce is primarily meant for batch processing
-> Different applications need different frameworks
-> MapReduce is difficult to program
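
To make the last point concrete, here is a sketch of the MapReduce programming model for a word count. It uses plain Scala collections, not the Hadoop API, runs on a single machine, and only mirrors the three phases (map, shuffle, reduce). The object and method names are illustrative.

```scala
object MapReduceSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    // Map phase: emit a (word, 1) pair for every word.
    val mapped: Seq[(String, Int)] =
      lines.flatMap(_.toLowerCase.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))
    // Shuffle phase: group the emitted pairs by key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (w, pairs) => w -> pairs.map(_._2) }
    // Reduce phase: sum the counts for each key.
    shuffled.map { case (w, ones) => w -> ones.sum }
  }
}
```

In Hadoop MapReduce the same job is typically spread over separate mapper, reducer and driver classes, and every problem has to be forced into this map-then-reduce shape.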

Spark functional features
=========================
-> Provides powerful caching and disk persistence capabilities
-> Enables faster batch processing
-> Highly scalable and simplified architecture
-> General purpose engine that supports multiple applications
-> Real time stream processing
-> Iterative algorithms
-> Interactive Data analysis
-> Machine Learning applications
-> Graph processing
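
The caching feature listed above can be sketched with the RDD API. This assumes Spark 2.x or later is on the classpath and runs in local mode; `CachingSketch` and `evenSquareStats` are illustrative names, not part of Spark.

```scala
import org.apache.spark.sql.SparkSession

object CachingSketch {
  // Returns (count, sum) of the even squares of 1..n, computed from one cached RDD.
  // Assumes n >= 2 so that the RDD is non-empty (reduce fails on an empty RDD).
  def evenSquareStats(spark: SparkSession, n: Int): (Long, Long) = {
    val squares = spark.sparkContext
      .parallelize(1 to n)
      .map(x => x.toLong * x)
      .filter(_ % 2 == 0)
      .cache() // keep the computed partitions in memory after the first action

    val count = squares.count()       // first action: computes and caches
    val total = squares.reduce(_ + _) // second action: reuses the cached data
    (count, total)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; a cluster deployment would use a different master.
    val spark = SparkSession.builder().appName("CachingSketch").master("local[*]").getOrCreate()
    try println(evenSquareStats(spark, 10))
    finally spark.stop()
  }
}
```

Without `cache()`, the second action would recompute the `map` and `filter` steps from scratch. Iterative algorithms repeat that recomputation on every pass, which is why keeping the data in memory matters for them.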