Wednesday 25 July 2018

Apache Spark Interview Question and Answers

1.What is Spark?
Apache Spark is a fast, in-memory data processing engine which allows data workers to efficiently execute streaming, machine learning, or SQL workloads that require fast, iterative access to datasets.
2.Features of Apache Spark?
Apache Spark, being an open-source framework for Big Data, has various advantages over other big data solutions: it is dynamic in nature, supports in-memory computation of RDDs, and provides reusability, fault tolerance, real-time stream processing, and more.
a. Swift Processing
b. Dynamic in Nature
c. In-Memory Computation in Spark
d. Reusability
e. Fault Tolerance in Spark
f. Real-Time Stream Processing
g. Lazy Evaluation in Apache Spark
h. Support Multiple Languages
i. Active, Progressive and Expanding Spark Community
j. Support for Sophisticated Analysis
k. Integrated with Hadoop
l. Spark GraphX
m. Cost Efficient.
3.What is a Resilient Distributed Dataset (RDD) in Apache Spark?

·     RDDs can contain any type of object, including user-defined classes. 

·     An RDD is simply an encapsulation of a very large dataset. In Spark, all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. 

·     Under the hood, Spark will automatically distribute the data contained in RDDs across your cluster and parallelize the operations you perform on them. 


4.What is a Transformation in Apache Spark?

    ·A transformation applies a function to the data in an RDD to create a new RDD. 

    ·One of the most common transformations is filter, which returns a new RDD with a subset of the data in the original RDD. 
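As a minimal sketch (assuming a SparkContext named sc is available, e.g. in the Spark shell), a filter transformation produces a new RDD without modifying the original:

 val numbers = sc.parallelize(1 to 10)        // original RDD
 val evens = numbers.filter(_ % 2 == 0)       // transformation: returns a new RDD
 println(evens.collect().mkString(", "))      // action: 2, 4, 6, 8, 10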

5.What are security options in Apache Spark?

Spark currently supports authentication via a shared secret. Authentication can be configured to be on via the spark.authenticate configuration parameter. This parameter controls whether the Spark communication protocols do authentication using the shared secret. This authentication is a basic handshake to make sure both sides have the same shared secret and are allowed to communicate. If the shared secret is not identical they will not be allowed to communicate. The shared secret is created as follows:
·     For Spark on YARN deployments, configuring spark.authenticate to true will automatically handle generating and distributing the shared secret. Each application will use a unique shared secret.
·     For other types of Spark deployments, the Spark parameter spark.authenticate.secret should be configured on each of the nodes. This secret will be used by all the Master/Workers and applications.
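As an illustrative sketch (not a production setup), the shared-secret settings mentioned above could be supplied through SparkConf for a non-YARN deployment; the secret value here is only a placeholder:

 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .setAppName("secured-app")
   .set("spark.authenticate", "true")                 // enable shared-secret authentication
   .set("spark.authenticate.secret", "<your-secret>") // placeholder; must be identical on all nodes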

6.How will you monitor Apache Spark?

      Every SparkContext launches a web UI, by default on port 4040, that displays useful information about the application. This includes:
    ·      A list of scheduler stages and tasks
    ·      A summary of RDD sizes and memory usage
    ·      Environmental information
    ·      Information about the running executors
You can access this interface by simply opening http://<driver-node>:4040 in a web browser. If multiple SparkContexts are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc).
Note that this information is only available for the duration of the application by default. To view the web UI after the fact, set spark.eventLog.enabled to true before starting the application. This configures Spark to log Spark events that encode the information displayed in the UI to persisted storage.
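For example, the event-log settings mentioned above could be enabled like this (a sketch only; the HDFS directory is a placeholder):

 val conf = new org.apache.spark.SparkConf()
   .set("spark.eventLog.enabled", "true")                 // persist UI events after the application ends
   .set("spark.eventLog.dir", "hdfs:///spark-event-logs") // placeholder directory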

7.What are the main libraries of Apache Spark?

The main libraries shipped with Spark are Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing). Typical imports in a Spark program include:
      import org.apache.spark.SparkContext
      import org.apache.spark.SparkContext._
      import org.apache.spark.SparkConf
      import org.apache.spark.sql._
      import org.apache.spark.util._

8.What are the main functions of Spark Core in Apache Spark?

Spark Core is the central component of Apache Spark. It serves the following functions:
I. Distributed task dispatching
II. Job scheduling
III. Basic I/O functions
Spark Core uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines.

9.How will you do memory tuning in Spark?

Tuning is the process of adjusting the settings for memory, cores, and instances used by the system. It ensures that Spark performs optimally and prevents resource bottlenecks. Because Apache Spark computes in memory, cluster resources (CPU, memory, etc.) can become bottlenecked.
Sometimes, to decrease memory usage, RDDs are stored in serialized form. Data serialization plays an important role in good network performance and can also help reduce memory usage.
If used properly, tuning can:
·       Ensure effective use of all cluster resources.
·       Eliminate jobs that run too long.
·       Improve the response time of the system.
·       Guarantee that jobs run on the correct execution engine.
Consider the following three things in tuning memory usage:
·       Amount of memory used by objects (the entire dataset should fit in-memory)
·       The cost of accessing those objects
·       Overhead of garbage collection.
Java objects are fast to access but can consume 2-5x more space than the raw data inside their fields. The reasons for this are:
·       Every distinct Java object has an “object header” of about 16 bytes. If the object holds little data, the header can be bigger than the data itself.
·       A Java String carries about 40 bytes of overhead over the raw string data, and it stores each character as two bytes because of String’s internal UTF-16 encoding. A String of 10 characters can easily consume 60 bytes.
·       Common collection classes such as HashMap and LinkedList use linked data structures, where each entry has a “wrapper” object with both a header and a pointer (8 bytes each) to the next object in the list.
·       Collections of primitive types often store them as “boxed” objects, for example java.lang.Integer.
a. Spark Data Structure Tuning
By avoiding the Java features that add overhead we can reduce the memory consumption. There are several ways to achieve this:
·       Avoid the nested structure with lots of small objects and pointers.
·       Instead of using strings for keys, use numeric IDs or enumerated objects.
·       If the RAM size is less than 32 GB, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
b. Spark Garbage Collection Tuning
JVM garbage collection can be a problem when a program has a large churn of RDDs. To make room for new objects, Java removes old objects; it traces all old objects and finds the unused ones. The key point is that the cost of garbage collection in Spark is proportional to the number of Java objects, so it is better to use data structures with fewer objects. Another way to achieve this is to persist objects in serialized form; there will then be only one object (a byte array) per RDD partition.
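A minimal sketch of persisting an RDD in serialized form (assuming a SparkContext named sc):

 import org.apache.spark.storage.StorageLevel

 val rdd = sc.parallelize(1 to 1000000)
 rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // store each partition as a serialized byte array
 rdd.count()                                 // action materializes and caches the RDD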

Spark Garbage Collection Tuning

The first step in garbage collection tuning in Apache Spark is to gather statistics on how frequently garbage collection occurs and how much time is spent in it. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options (a configuration sketch follows at the end of this subsection). The next time a Spark job runs, a message is printed in the worker's log whenever garbage collection occurs. These logs appear on the worker nodes, not in the driver program.
The Java heap space is divided into two regions, Young and Old. The young generation holds short-lived objects, while the old generation holds objects with longer lifetimes. Garbage collection tuning aims to keep long-lived RDDs in the old generation and to keep the young generation large enough to hold short-lived objects, so that full garbage collections to clean up temporary objects created during task execution are avoided. Some steps that may help achieve this are:
·       If full garbage collection is invoked several times before a task completes, it indicates that there is not enough memory available to execute the task.
·       In the garbage collection statistics, if OldGen is close to full, we can reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, we can decrease the size of the young generation by lowering -Xmn.
The effect of Apache Spark garbage collection tuning depends on our application and amount of memory used.
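One hedged way to pass these GC options to the executors is through the spark.executor.extraJavaOptions setting, for example:

 val conf = new org.apache.spark.SparkConf()
   .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // enable GC logging on executors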

Other considerations for Spark Performance Tuning

a. Level of Parallelism

To use the full cluster, the level of parallelism of each program should be high enough. Based on the size of the input file, Spark sets the number of “map” tasks to run on it. The desired number of partitions can also be passed as a second argument to operations such as textFile or reduceByKey, and the default can be changed through the configuration property spark.default.parallelism.
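A small sketch of these ways to control parallelism (the file path and partition counts are placeholders):

 // spark.default.parallelism can be set on the SparkConf, e.g. .set("spark.default.parallelism", "100")
 val lines = sc.textFile("hdfs:///data/input.txt", 100)   // second argument: minimum number of partitions
 val counts = lines.flatMap(_.split(" "))
   .map((_, 1))
   .reduceByKey(_ + _, 100)                               // numPartitions passed as a second argument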

b. Memory Usage of Reduce Task in Spark

Even when RDDs fit in memory, we can run into an OutOfMemoryError because the working set of a task, say one built by groupByKey, is too large. We can fix this by increasing the level of parallelism so that each task's input set is smaller. We can safely increase the number of tasks well beyond the number of cores in the cluster, because Spark reuses one executor JVM across many tasks and has a low task-launching cost.

c. Broadcasting Large Variables

Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task. If a task uses a large object from the driver program, turn it into a broadcast variable. Generally, tasks larger than about 20 KB are worth optimizing.
Data Locality in Apache Spark
Data locality plays an important role in the performance of Spark jobs. When the data and the code that operates on it are together, computation is faster. If the two are separate, either the code must be moved to the data or vice versa. It is faster to move serialized code from place to place than a chunk of data, because the size of the code is much smaller than the data.
Based on the data's current location, there are several levels of locality. In order from closest to farthest:
·       PROCESS_LOCAL – the data resides in the same JVM as the running code. This is the best possible locality.
·       NODE_LOCAL – the data resides on the same node. This is slightly slower than PROCESS_LOCAL because the data has to travel between processes.
·       NO_PREF – the data is accessed equally quickly from anywhere and has no locality preference.
·       RACK_LOCAL – the data is on the same rack of servers. Since the data is on a different server of the same rack, it is sent over the network, typically through a single switch.
·       ANY – the data resides elsewhere on the network and not on the same rack.

10.What are the two ways to create RDD in Spark?

·      There are the following ways to create an RDD in Spark (a short sketch follows this list):
·      1. Using a parallelized collection.
·      2. From external datasets (referencing a dataset in an external storage system).
·      3. From existing Apache Spark RDDs.
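A minimal sketch of the three ways (the path is a placeholder; sc is the SparkContext):

 val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))          // 1. parallelized collection
 val fromFile = sc.textFile("hdfs:///data/sample.txt")            // 2. external dataset
 val fromExisting = fromCollection.map(_ * 2)                     // 3. from an existing RDD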

11.What are the main operations that can be done on a RDD in Apache Spark?
There are two main operations that can be performed on a RDD in Spark:
I. Transformation: This is a function that is used to create a new RDD out of an existing RDD.
II. Action: This is a function that returns a value to Driver program after running a computation on RDD.
12.What are the common Transformations in Apache Spark?
    Spark RDD operations are of two types: Transformations and Actions. A Transformation is a function that produces a new RDD from existing RDDs; an Action is performed when we want to work with the actual dataset. Common transformations include map, filter, flatMap, union, and reduceByKey, as sketched below.
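A quick sketch of a few common transformations (each returns a new RDD; nothing runs until an action is called):

 val words = sc.parallelize(Seq("spark makes big data simple", "spark is fast"))
 val tokens = words.flatMap(_.split(" "))      // flatMap
 val longOnes = tokens.filter(_.length > 4)    // filter
 val pairs = tokens.map((_, 1))                // map
 val counts = pairs.reduceByKey(_ + _)         // reduceByKey (wide transformation)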

13.What are the common Actions in Apache Spark?

Common actions include reduce(), collect(), count(), first(), take(n), and saveAsTextFile(); a short sketch follows this list. A typical workflow combines transformations and actions:
Filter: use .filter() to remove unwanted records (a transformation).
Analyse: use .map() or .flatMap() to transform the data according to the preferred analytic technique (transformations).
Aggregate: use an action such as .reduce() or .countByValue() to combine the data according to the needed criteria.
Visualize: create an attractive output / derivative for graphs, figures or tables.
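A short sketch of common actions on a simple RDD (the output path is a placeholder):

 val nums = sc.parallelize(1 to 5)
 println(nums.count())                    // action: 5
 println(nums.reduce(_ + _))              // action: 15
 println(nums.take(3).mkString(","))      // action: 1,2,3
 nums.saveAsTextFile("hdfs:///out/nums")  // action: writes the RDD to storage
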
14.What is a Shuffle operation in Spark?

Shuffling is a process of redistributing data across partitions (aka repartitioning) that may or may not cause moving data across JVM processes or even over the wire (between executors on separate machines).
Shuffling is the process of data transfer between stages.
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

15. What are the operations that can cause a shuffle in Spark?

The shuffle operation is used in Spark to re-distribute data across multiple partitions. It is a costly and complex operation.
In general, a single task in Spark operates on the elements of a single partition. To execute a shuffle, an operation has to touch elements from all partitions, which is why it is also called an all-to-all operation.
Operations that can trigger a shuffle include repartition and coalesce, ByKey operations such as groupByKey and reduceByKey, and join operations such as cogroup and join.
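A sketch of operations that typically trigger a shuffle:

 val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
 val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

 val repartitioned = pairs.repartition(8)     // shuffle: redistributes data across partitions
 val grouped = pairs.groupByKey()             // shuffle: ByKey operation
 val summed = pairs.reduceByKey(_ + _)        // shuffle (with map-side combine)
 val joined = pairs.join(other)               // shuffle: join operation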
16. What is the purpose of Spark SQL?

Apache Spark SQL is a Spark module that simplifies working with structured data using the DataFrame and Dataset abstractions in Python, Java, and Scala.
These abstractions are distributed collections of data organized into named columns, and Spark SQL provides a good optimization engine for them. Using Spark SQL we can query data both from inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).
In Apache Spark SQL we can use structured and semi-structured data in the following ways:
·       To simplify working with structured data, it provides the DataFrame abstraction in Python, Java, and Scala. A DataFrame is a distributed collection of data organized into named columns.
·       The data can be read and written in a variety of structured formats, for example JSON, Hive tables, and Parquet.
·       Using SQL we can query data, both from inside a Spark program and from external tools. The external tools connect to Spark SQL through standard database connectors (JDBC/ODBC).
·       The best way to use Spark SQL is inside a Spark application. This empowers us to load data and query it with SQL, while also combining it with “regular” program code in Python, Java or Scala.

·      Uses of Apache Spark SQL

  • It executes SQL queries.
  • We can read data from an existing Hive installation using Spark SQL.
  • When we run SQL within another programming language, we get the result as a Dataset/DataFrame (a minimal sketch follows).
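A minimal Spark SQL sketch (the JSON path is a placeholder and is assumed to contain name and age fields; spark is a SparkSession):

 import org.apache.spark.sql.SparkSession

 val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
 val people = spark.read.json("hdfs:///data/people.json")   // DataFrame from structured data
 people.createOrReplaceTempView("people")
 val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
 adults.show()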

17.What is a Data Frame in Spark SQL?

In Spark, a DataFrame is a distributed collection of data organized into named columns. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
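For instance, a DataFrame can be built from a local collection (a sketch; assumes a SparkSession named spark):

 import spark.implicits._

 val df = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
 df.printSchema()
 df.show()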

18. What is a Parquet file in Spark?

Apache Parquet is a free and open-source, column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to the other columnar storage file formats available in Hadoop, namely RCFile and Optimized RCFile (ORC). It is compatible with most of the data processing frameworks in the Hadoop environment.
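A small sketch of writing and reading Parquet with Spark (paths are placeholders; df is any DataFrame and spark is a SparkSession):

 df.write.parquet("hdfs:///out/people.parquet")                   // write columnar Parquet files
 val parquetDF = spark.read.parquet("hdfs:///out/people.parquet") // read them back as a DataFrame
 parquetDF.show()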

19.What is the difference between Apache Spark and Apache Hadoop MapReduce?

·       Apache Spark – It is an open-source big data framework that provides a faster, more general-purpose data processing engine. Spark is designed for fast computation and covers a wide range of workloads, for example batch, interactive, iterative, and streaming.
·       Hadoop MapReduce – It is also an open-source framework for writing applications. It processes structured and unstructured data stored in HDFS. Hadoop MapReduce is designed to process large volumes of data on a cluster of commodity hardware, and it processes data in batch mode.
2.2. Speed
·       Apache Spark – Spark is a lightning-fast cluster computing tool. It runs applications up to 100x faster in memory and 10x faster on disk than Hadoop, by reducing the number of read/write cycles to disk and storing intermediate data in memory.
·       Hadoop MapReduce – MapReduce reads from and writes to disk, which slows down the processing speed.
2.3. Difficulty
·       Apache Spark – Spark is easy to program because it provides many high-level operators on RDDs (Resilient Distributed Datasets).
·       Hadoop MapReduce – In MapReduce, developers need to hand-code each and every operation, which makes it much harder to work with.
2.4. Easy to Manage
·       Apache Spark – Spark can perform batch processing, interactive queries, machine learning, and streaming, all in the same cluster, which makes it a complete data analytics engine. There is no need to manage a different component for each need; installing Spark on a cluster is enough to handle all these requirements.
·       Hadoop MapReduce – MapReduce provides only a batch engine, so we depend on different engines, for example Storm, Giraph, or Impala, for other requirements, and it is very difficult to manage so many components.
2.5. Real-time analysis
·       Apache Spark – It can process real-time data, i.e. data coming from real-time event streams at the rate of millions of events per second, e.g. Twitter or Facebook data. Spark's strength is the ability to process live streams efficiently.
·       Hadoop MapReduce – MapReduce fails when it comes to real-time data processing, as it was designed to perform batch processing on voluminous amounts of data.
2.6. Latency
·       Apache Spark – Spark provides low-latency computing.
·       Hadoop MapReduce – MapReduce is a high latency computing framework.
2.7. Interactive mode
·       Apache Spark – Spark can process data interactively.
·       Hadoop MapReduce – MapReduce doesn’t have an interactive mode.
2.8. Streaming
·       Apache Spark – Spark can process real time data through Spark Streaming.
·       Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
2.9. Ease of use
·       Apache Spark – Spark is easier to use, since its RDD abstraction enables users to process data with high-level operators. It also provides rich APIs in Java, Scala, Python, and R.
·       Hadoop MapReduce – MapReduce is more complex; developers have to work with low-level APIs to process the data, which requires a lot of hand coding.
2.10. Recovery
·       Apache Spark – RDDs allow recovery of partitions on failed nodes by re-computation of the DAG. Spark also supports a recovery style similar to Hadoop's by way of checkpointing, which reduces the dependencies of an RDD.
·       Hadoop MapReduce – MapReduce is naturally resilient to system faults or failures, so it is a highly fault-tolerant system.
2.11. Scheduler
·       Apache Spark – Due to in-memory computation, Spark acts as its own flow scheduler.
·       Hadoop MapReduce – MapReduce needs an external job scheduler, for example Oozie, to schedule complex flows.
2.12. Fault tolerance
·       Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart the application from scratch in case of any failure.
·       Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so there is no need to restart the application from scratch in case of any failure.
2.13. Security
·       Apache Spark – Spark is a little less secure than MapReduce, because it supports only authentication through a shared secret.
·       Hadoop MapReduce – Apache Hadoop MapReduce is more secure because of Kerberos, and it also supports Access Control Lists (ACLs), a traditional file permission model.
2.14. Cost
·       Apache Spark – Spark requires a lot of RAM to run in memory, which increases the size of the cluster and therefore its cost.
·       Hadoop MapReduce – MapReduce is the cheaper option in terms of cost.
2.15. Language Developed
·       Apache Spark – Spark is developed in Scala.
·       Hadoop MapReduce – Hadoop MapReduce is developed in Java.
2.16. Category
·       Apache Spark – It is a data analytics engine, and therefore a popular choice for data scientists.
·       Hadoop MapReduce – It is a basic data processing engine.
2.17. License
·       Apache Spark – Apache License 2
·       Hadoop MapReduce – Apache License 2
2.18. OS support
·       Apache Spark – Spark supports cross-platform.
·       Hadoop MapReduce – Hadoop MapReduce also supports cross-platform.
2.19. Programming Language support
·       Apache Spark – Scala, Java, Python, R, SQL.
·       Hadoop MapReduce – Primarily Java, other languages like C, C++, Ruby, Groovy, Perl, Python are also supported using Hadoop streaming.
2.20. SQL support
·       Apache Spark – It enables the user to run SQL queries using Spark SQL.
·       Hadoop MapReduce – It enables users to run SQL queries using Apache Hive.
2.21. Scalability
·       Apache Spark – Spark is highly scalable, so we can keep adding nodes to the cluster. The largest known Spark cluster has about 8,000 nodes.
·       Hadoop MapReduce – MapReduce is also highly scalable, and we can keep adding nodes to the cluster. The largest known Hadoop cluster has about 14,000 nodes.
2.22. Lines of Code
·       Apache Spark – Apache Spark was developed in merely about 20,000 lines of code.
·       Hadoop MapReduce – Hadoop 2.0 has about 120,000 lines of code.
2.23. Machine Learning
·       Apache Spark – Spark has its own machine learning library, MLlib.
·       Hadoop MapReduce – Hadoop requires an external machine learning tool, for example Apache Mahout.
2.24. Caching
·       Apache Spark – Spark can cache data in memory for further iterations, which enhances system performance.
·       Hadoop MapReduce – MapReduce cannot cache data in memory for future requirements, so its processing speed is not as high as Spark's.
2.25. Hardware Requirements
·       Apache Spark – Spark needs mid to high-level hardware.
·       Hadoop MapReduce – MapReduce runs very well on commodity hardware.
2.26. Community
·       Apache Spark – Spark is one of the most active projects at Apache and has a very strong community.
·       Hadoop MapReduce – Much of the MapReduce community has shifted to Spark.

20.What are the main languages supported by Apache Spark?

Programming languages supported by Spark include:
·    Java.
·    Python.
·    Scala.
·    SQL.
·    R.

21.What are the file systems supported by Spark?

Spark supports a wide range of storage systems, such as MapR (file system and database), Google Cloud Storage, Amazon S3, Apache Cassandra, Apache Hadoop (HDFS), Apache HBase, Apache Hive, Berkeley's Tachyon project, relational databases, and MongoDB. The most commonly used with Spark are the local file system, HDFS, and Amazon S3.

22.What is a Spark Driver?

The spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. In practical terms, the driver is the program that creates the SparkContext, connecting to a given SparkMaster.

23.What is an RDD Lineage?

RDD lineage is nothing but the graph of all the parent RDDs of an RDD. It is also called the RDD operator graph or RDD dependency graph. It is the result of applying transformations to an RDD, and it forms a logical execution plan.
A logical execution plan starts with the earliest RDDs, i.e. RDDs that do not depend on any other RDDs (or that reference cached data), and ends with the RDD that produces the result of the action that has been called.
 val r00 = sc.parallelize(0 to 9)
 val r01 = sc.parallelize(0 to 90 by 10)
 val r10 = r00 cartesian r01
 val r11 = r00.map(n => (n, n)) 
 val r12 = r00 zip r01
 val r13 = r01.keyBy(_ / 20)
 val r20 = Seq(r11, r12, r13).foldLeft(r10)(_ union _)
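Assuming the definitions above (e.g. in a Spark shell), the resulting lineage can be inspected with toDebugString:

 println(r20.toDebugString)   // prints the RDD dependency (lineage) graph of r20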

24.What are the two main types of Vector in Spark?
There are two main types of Vector in Spark:
1. Dense Vector: A dense vector is backed by an array of double data type. This array contains the values.
E.g. {1.0 , 0.0, 3.0}
2. Sparse Vector:
A sparse vector is backed by two parallel arrays. One array is for indices and the other array is for values.
E.g. {3, [0,2], [1.0,3.0]}
In this representation, the first element is the number of elements in the vector, the second is the array of indices of the non-zero values, and the third is the array of the non-zero values themselves.
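A sketch using the MLlib vector factory methods (org.apache.spark.ml.linalg in Spark 2.x):

 import org.apache.spark.ml.linalg.Vectors

 val dense = Vectors.dense(1.0, 0.0, 3.0)                      // stores every value
 val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))  // size, indices of non-zeros, values
 println(dense)    // [1.0,0.0,3.0]
 println(sparse)   // (3,[0,2],[1.0,3.0])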
25.What are the different deployment modes of Apache Spark?

Let's try to look at the differences between client and cluster mode.
Client:
·      Driver runs on a dedicated server (master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
·      Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
·      Because the master node has dedicated resources of its own, you don't need to "spend" worker resources for the driver program.
·      If the driver process dies, you need an external monitoring system to restart it.
Cluster:
·      Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master.
·      Driver runs as a dedicated, standalone process inside the Worker.
·      The driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
·      The driver program can be monitored from the Master node using the --supervise flag and restarted in case it dies.
·      When working in cluster mode, all JARs related to the execution of your application need to be available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers.
The general spark-submit syntax is:

 ./bin/spark-submit \
   --class <main-class> \
   --master <master-url> \
   --deploy-mode <deploy-mode> \
   --conf <key>=<value> \
   ... # other options
   <application-jar> \
   [application-arguments]

For example, in cluster mode on YARN:

 $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 4g \
   --executor-memory 2g \
   --executor-cores 1 \
   --queue thequeue \
   examples/jars/spark-examples*.jar \
   10
26.What is lazy evaluation in Apache Spark?

1) Transformations on RDDs, and even loading data into an RDD, are not executed immediately; they are deferred until Spark sees an action. Because transformations are lazily evaluated, resources are used more efficiently.
2) Spark uses lazy evaluation to reduce the number of passes it has to take over the data by grouping operations together. With MapReduce, the developer has to spend a lot of time figuring out how to group operations to minimize the number of MapReduce passes. In Spark, there is no benefit to writing a single complex map instead of chaining together many simple operations; the user can organize the program into small operations, and Spark manages all of them efficiently using lazy evaluation.
3) Lazy evaluation helps optimize disk and memory usage in Spark.
4) In general, when computing on data, we have to consider both space and time complexity. With lazy evaluation Spark can address both: the actions are triggered only when the data is required, which reduces overhead.
5) It also saves computation and increases speed. Only the necessary values are computed instead of the whole dataset (this depends on the actions, and on a few transformations as well).
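A minimal sketch of lazy evaluation (assuming a SparkContext sc; the path is a placeholder):

 val lines = sc.textFile("hdfs:///data/log.txt")      // nothing is read yet
 val errors = lines.filter(_.contains("ERROR"))       // still nothing executed
 println(errors.count())                              // action: only now is the file read and filtered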

27.What are the core components of a distributed application in Apache Spark?

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, and interacting with storage systems. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction. At runtime, a distributed Spark application consists of a driver program, a cluster manager, and executors running on the worker nodes.


28. What is the difference between the cache() and persist() methods in Apache Spark?

Caching or persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so that they can be reused in subsequent stages. These interim results are kept as RDDs in memory (the default) or in more durable storage such as disk, and/or replicated.
RDDs can be cached using the cache operation. They can also be persisted using the persist operation.
The difference between cache and persist is purely syntactic: cache is a synonym for persist with the default storage level, i.e. cache() is merely persist(MEMORY_ONLY).
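A brief sketch of both methods, including removing an RDD from the cache (see question 29):

 import org.apache.spark.storage.StorageLevel

 val a = sc.parallelize(1 to 100)
 a.cache()                                   // same as a.persist(StorageLevel.MEMORY_ONLY)

 val b = sc.parallelize(1 to 100)
 b.persist(StorageLevel.MEMORY_AND_DISK)     // explicit storage level

 a.unpersist()                               // forcibly remove an RDD from the cache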

29.How will you remove data from cache in Apache Spark?

In general, Apache Spark automatically removes the unused objects from cache.
It uses Least Recently Used (LRU) algorithm to drop old partitions.
There are automatic monitoring mechanisms in Spark to monitor cache usage on each node.
If we want to forcibly remove an object from the cache in Apache Spark, we can use the RDD.unpersist() method.
30. What is the use of SparkContext in Apache Spark?

A SparkContext is a client of Spark’s execution environment and it acts as the master of the Spark application. SparkContext sets up internal services and establishes a connection to a Spark execution environment. You can create RDDs, accumulators and broadcast variables, access Spark services and run jobs (until SparkContext stops) after the creation of SparkContext. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
In Spark shell, a special interpreter-aware SparkContext is already created for the user, in the variable called sc.
The first step of any Spark driver application is to create a SparkContext. The SparkContext allows the Spark driver application to access the cluster through a resource manager. The resource manager can be YARN, Mesos, or Spark's standalone cluster manager.
Few functionalities which SparkContext offers are:
1. We can get the current status of a Spark application like configuration, app name.
2. We can set Configuration like master URL, default logging level.
3. One can create Distributed Entities like RDDs.
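A minimal sketch of creating a SparkContext (outside the shell, where sc already exists):

 import org.apache.spark.{SparkConf, SparkContext}

 val conf = new SparkConf().setAppName("my-app").setMaster("local[*]")
 val sc = new SparkContext(conf)
 println(sc.appName)        // current application name
 sc.stop()                  // stop before creating another SparkContext in the same JVM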

31. Do we need HDFS for running Spark application?

Spark does not strictly need HDFS to run; it can also run locally or in standalone mode with other storage systems. However, to run Spark in a distributed mode it is commonly installed on top of YARN, and Spark's advanced analytics applications are then used for data processing. Hence, if you run Spark in distributed mode using HDFS, you can achieve the maximum benefit by keeping the data and the computation together in the cluster.

32. What is Spark Streaming?

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

33. How does Spark Streaming work internally?

Stream processing is the low-latency processing and analysis of streaming data. Internally, Spark Streaming works as follows: it receives live input data streams and divides them into batches; the Spark engine then processes these batches to generate the final stream of results, also in batches.
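A classic word-count sketch of this batching model (the host and port are placeholders):

 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
 val ssc = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches
 val lines = ssc.socketTextStream("localhost", 9999)       // live input stream
 val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
 counts.print()                                            // results are produced batch by batch
 ssc.start()
 ssc.awaitTermination()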

34. What is a Pipeline in Apache Spark?

A Spark Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
Pipeline is a concept from Machine learning.
It is a sequence of algorithms that are executed for processing and learning from data.
A Pipeline is similar to a workflow. There can be one or more stages in a Pipeline.

35.How does Pipeline work in Apache Spark?

A Pipeline is a sequence of stages. Each stage in Pipeline can be a Transformer or an Estimator.
We run these stages in order. Initially a DataFrame is passed as input to the Pipeline, and this DataFrame keeps being transformed by each stage of the Pipeline.
Most of the time, runtime checking is done on the DataFrame passing through the Pipeline. We can also save a Pipeline to disk and re-read it at a later point in time.
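A hedged sketch of a small Pipeline (Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator; training is assumed to be a DataFrame with text and label columns, and the save path is a placeholder):

 import org.apache.spark.ml.Pipeline
 import org.apache.spark.ml.classification.LogisticRegression
 import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

 val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
 val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
 val lr = new LogisticRegression().setMaxIter(10)

 val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
 val model = pipeline.fit(training)                    // Estimators are fitted, Transformers just transform
 pipeline.write.overwrite().save("/tmp/my-pipeline")   // a Pipeline can be saved to disk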

36. What is the difference between Transformer and Estimator in Apache Spark?

A Transformer is an abstraction for feature transformers and learned models: it converts one DataFrame into another DataFrame. In a feature transformer, a DataFrame is the input and the output is a new DataFrame with a new mapped column. An Estimator is an abstraction for a learning algorithm that fits or trains on data.

37. What are the different types of Cluster Managers in Apache Spark?

There are three types of cluster managers in Spark:
·                 Spark Standalone
·                 Apache Mesos
·                 Hadoop YARN
38. How will you minimize data transfer while working with Apache Spark?

In Spark, data transfer can be reduced by avoiding operations that result in a data shuffle.
Avoid operations like repartition and coalesce, ByKey operations like groupByKey and reduceByKey, and join operations like cogroup and join.
Spark shared variables also help in reducing data transfer. There are two types of shared variables: broadcast variables and accumulators.
Broadcast variable:
If we have a large dataset, instead of transferring a copy of the data set for each task, we can use a broadcast variable, which is copied to each node once and shares the same data with every task on that node. Broadcast variables help to give a large data set to each node efficiently.
First, we create a broadcast variable using SparkContext.broadcast and then broadcast it to all nodes from the driver program. The value method can be used to access the shared value. A broadcast variable is most useful when tasks across multiple stages need the same data.
Accumulator:

Spark functions use variables defined in the driver program, and local copies of those variables are generated on the workers. Accumulators are shared variables that help to update variables in parallel during execution and to share the results from the workers back to the driver.
Generally, the shuffle operation in Spark leads to a large amount of data transfer.

We can configure Spark Shuffle process for optimum data transfer.
Some of the main points to do it are as follows:
I. spark.shuffle.compress: This configuration can be set to true to compress map output files. This reduces the amount of data transfer due to compression.
II. ByKey operations: We can minimize the use of ByKey operations to minimize the shuffle calls.
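For example, map-output compression could be enabled explicitly like this (it is on by default; a sketch only):

 val conf = new org.apache.spark.SparkConf()
   .set("spark.shuffle.compress", "true")   // compress map output files to reduce data transfer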



39.What is the main use of MLlib in Apache Spark?

MLlib is the machine-learning library of Apache Spark.
Some of the main uses of MLlib in Spark are as follows:
I. ML Algorithms: It contains machine learning algorithms such as classification, regression, clustering, and collaborative filtering.
II. Featurization: MLlib provides algorithms to work with features, such as feature extraction, transformation, dimensionality reduction, and selection.
III. Pipelines: It contains tools for constructing, evaluating, and tuning ML Pipelines.
IV. Persistence: It provides methods for saving and loading algorithms, models, and Pipelines.
V. Utilities: It contains utilities for linear algebra, statistics, data handling, etc.

40. What is Checkpointing in Apache Spark?

A Spark Streaming application needs to be operational 24/7, so the system must be fault tolerant, and if any data is lost the recovery should be fast. Spark Streaming accomplishes this using checkpointing.
Checkpointing is a process that truncates the RDD lineage graph. It periodically saves the application state to reliable storage (such as HDFS), so that recovery can take place when the driver restarts.
There are two types of data that we checkpoint in Spark:
1. Metadata checkpointing – Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. The metadata includes configurations, DStream operations, and incomplete batches. The configuration is the configuration used to create the streaming application, DStream operations are the operations that define the streaming application, and incomplete batches are batches that are queued but not yet complete.
2. Data checkpointing – This refers to saving the RDDs themselves to reliable storage, which is needed by some stateful transformations where an upcoming RDD depends on the RDDs of previous batches. Because of this, the dependency chain keeps growing over time; to avoid such an increase in recovery time, the intermediate RDDs are periodically checkpointed to reliable storage, which cuts down the dependency chain (a minimal sketch follows this list).
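A minimal sketch of enabling checkpointing for a streaming application (the HDFS directory is a placeholder; sc is an existing SparkContext):

 import org.apache.spark.streaming.{Seconds, StreamingContext}

 val ssc = new StreamingContext(sc, Seconds(10))
 ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")   // metadata and data checkpoints go here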

41.What is an Accumulator in Apache Spark?

An Accumulator is a shared variable in Apache Spark that is used to aggregate information across the cluster, in other words to aggregate information / values from the worker nodes back to the driver program.
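A small sketch using the built-in long accumulator (Spark 2.x API):

 val badRecords = sc.longAccumulator("badRecords")
 sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
   if (!s.forall(_.isDigit)) badRecords.add(1)   // updated on the workers
 }
 println(badRecords.value)                       // read back on the driver: 1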

42. What is a Broadcast variable in Apache Spark?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
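A brief sketch (the lookup table here is just an illustrative example):

 val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))
 val codes = sc.parallelize(Seq("IN", "US", "IN"))
 val names = codes.map(code => countryNames.value.getOrElse(code, "Unknown"))
 println(names.collect().mkString(", "))   // India, United States, India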

43. What is Structured Streaming in Apache Spark?

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
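A hedged sketch of a Structured Streaming word count over a socket source (host and port are placeholders; spark is a SparkSession):

 import spark.implicits._

 val lines = spark.readStream.format("socket")
   .option("host", "localhost").option("port", 9999).load()
 val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()
 val query = counts.writeStream.outputMode("complete").format("console").start()
 query.awaitTermination()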

44. How will you pass functions to Apache Spark?

In the Spark API, we pass functions to the driver program so that they can be run on the cluster. Two common ways to pass functions in Spark are as follows (a short sketch follows this list):
I. Anonymous function syntax: This is used for passing short pieces of code as anonymous functions (lambdas).
II. Static methods in a singleton object: We can also define static methods in an object with only one instance, i.e. a singleton. This object, along with its methods, can be shipped to the cluster nodes.
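A short sketch of both styles (the object name is just an example):

 // I. Anonymous function syntax
 val lengths = sc.parallelize(Seq("spark", "scala")).map(s => s.length)

 // II. Method in a singleton object (shipped to the executors along with the closure)
 object TextUtil {
   def clean(s: String): String = s.trim.toLowerCase
 }
 val cleaned = sc.parallelize(Seq(" Spark ", " SCALA")).map(TextUtil.clean)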
45. What is a Property Graph?

The term property graph has come to denote an attributed, multi-relational graph: that is, a graph where the edges are labeled and both vertices and edges can have any number of key/value properties associated with them. A simple example is a property graph with two vertices connected by one edge, where each vertex carries a name property and the edge carries a relationship property.

46. What is Neighborhood Aggregation in Spark?

Neighborhood Aggregation is a concept in Graph module of Spark. It refers to the task of aggregating information about the neighborhood of each vertex.
E.g. we may want to know the number of books that reference a given book, or the number of times a tweet is retweeted.
This concept is used in iterative graph algorithms.
Some of the popular uses of this concept are in PageRank, Shortest Path, etc.
We can use the aggregateMessages operation (together with a mergeMsg merge function) in Spark GraphX to implement Neighborhood Aggregation.
47. What are different Persistence levels in Apache Spark?

Different Persistence levels in Apache Spark are as follows:
I. MEMORY_ONLY: In this level, the RDD is stored as deserialized Java objects in the JVM. If an RDD doesn't fit in memory, the missing partitions are recomputed when needed.
II. MEMORY_AND_DISK: In this level, the RDD is stored as deserialized Java objects in the JVM. If an RDD doesn't fit in memory, the partitions that don't fit are stored on disk.
III. MEMORY_ONLY_SER: In this level, the RDD is stored as serialized Java objects in the JVM. This is more space-efficient than deserialized objects.
IV. MEMORY_AND_DISK_SER: In this level, the RDD is stored as serialized Java objects in the JVM. If an RDD doesn't fit in memory, the partitions that don't fit are stored on disk.
V. DISK_ONLY: In this level, the RDD is stored only on disk.
48. How will you select the storage level in Apache Spark?

We use storage level to maintain balance between CPU efficiency and Memory usage.
1.     If our RDD objects fit in memory, we use MEMORY_ONLY option. In this option, the performance is very good due to objects being in Memory only.
2.     In case our RDD objects cannot fit in memory, we go for MEMORY_ONLY_SER option and select a serialization library that can provide space savings with serialization. This option is also quite fast in performance.
3.     If our RDD objects cannot fit in memory and there is a big gap between the memory available and the total object size, we go for the MEMORY_AND_DISK option. In this option some RDD objects are stored on disk. For fast fault recovery we can also replicate objects to multiple partitions.

49. What are the options in Spark to create a Graph?

We can create a Graph in Spark from a collection of vertices and edges. Some of the options in Spark to create a Graph are as follows (a minimal sketch follows this list):
I. Graph.apply: This is the simplest option to create a graph. We use it to create a graph from RDDs of vertices and edges.
II. Graph.fromEdges: We can also create a graph from an RDD of edges. In this option, vertices are created automatically and a default value is assigned to each vertex.
III. Graph.fromEdgeTuples: We can also create a graph from only an RDD of tuples.
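A minimal GraphX sketch using Graph.apply (the vertex and edge data are arbitrary examples):

 import org.apache.spark.graphx.{Edge, Graph}

 val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
 val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
 val graph = Graph(vertices, edges)          // Graph.apply
 println(graph.numVertices)                  // 3 — one of the basic operators from question 50
 println(graph.numEdges)                     // 2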
50. What are the basic Graph operators in Spark?

Some of the common Graph operators in Apache
Spark are as follows:
I. numEdges
II. numVertices
III. inDegrees
IV. outDegrees
V. degrees
VI. vertices
VII. edges
VIII. persist
IX. cache
51.What is the partitioning approach used in GraphX of Apache Spark?

GraphX uses Vertex-cut approach to distributed graph partitioning.
In this approach, a graph is not split along edges; rather, it is partitioned along vertices.
These vertices can span multiple machines. This approach reduces communication and storage overheads.

Edges are assigned to different partitions based on the partition strategy that we select.