Below, we have compiled a list of the most frequently asked Spark interview questions and answers. Reading through them can help you deepen your knowledge of this computing system. Whether you are looking for a job change or starting your career with Spark, this list of Spark interview questions can help you build confidence and eventually land the job you want in this field.
Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in several programming languages, including Java, Python, and Scala, and its core purpose is to provide an optimized engine that supports general execution graphs.
Apache Spark | |
---|---|
What is Apache Spark? | It is an open-source, general-purpose cluster-computing framework. It offers over 80 high-level operators that make it easy to build parallel applications, and it can be used interactively from the Python, Scala, and SQL shells. |
Latest Version | 2.4.4 released on 1st September 2019 |
Created By | Matei Zaharia |
Written in | Scala (with APIs in Java, Python, R, and SQL) |
Official Website | https://spark.apache.org |
Operating System | Linux, Windows, macOS |
License | Apache License 2.0 |
In this article, we list frequently asked Spark interview questions and answers in the belief that they will help you perform well in your interviews. This article has been written under the guidance of industry professionals and covers the competencies currently expected in the field.
Spark has the following important features which help developers in many ways:

- Speed: in-memory computation makes it far faster than disk-based engines such as MapReduce.
- Ease of use: high-level APIs in Scala, Java, Python, and R.
- A unified stack: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) run on the same engine.
- Fault tolerance: lost RDD partitions are recomputed from their lineage.
- Lazy evaluation: transformations are executed only when an action is called, which lets Spark optimize the whole job.
Apache Spark is an open-source, general-purpose distributed data processing engine used to process and analyze large amounts of data efficiently. It has a wide array of uses, including ETL and SQL batch jobs, processing of sensor data, IoT data management, financial systems, and machine learning tasks.
Tungsten is the codename for the Apache Spark project whose main goal is to improve the execution engine. The Tungsten engine substantially increases the efficiency of memory and CPU usage for Spark applications by pushing performance closer to the limits of the underlying hardware.
Parquet is a column-based file format that is used to optimize query speed and is far more efficient than CSV or JSON file formats. Spark SQL supports both reading and writing Parquet files, and it captures the schema of the original data automatically.
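As a quick illustration, here is a minimal sketch of writing and reading Parquet with Spark SQL; it assumes an existing SparkSession named spark, and the sample data and /tmp/people.parquet path are made up for the example:

```scala
import spark.implicits._

// Hypothetical sample data turned into a DataFrame.
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// Write as Parquet; the schema (name: string, age: int) is stored in the files.
people.write.parquet("/tmp/people.parquet")

// Read it back; the schema is recovered automatically, no manual definition needed.
val loaded = spark.read.parquet("/tmp/people.parquet")
loaded.printSchema()
loaded.filter($"age" > 40).show()
```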
Spark is faster than Hive because it processes data in the main memory of the worker nodes, which prevents unnecessary I/O operations on disk.
The PageRank algorithm in Spark outputs a probability distribution that represents the likelihood that a person randomly clicking on links will arrive at a particular page.
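In Spark, PageRank is available through the GraphX library. Below is a hedged sketch, assuming an existing SparkSession named spark; the page names, links, and tolerance value are arbitrary:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical link graph: pages as vertices, links between pages as edges.
val pages = spark.sparkContext.parallelize(Seq((1L, "home"), (2L, "about"), (3L, "blog")))
val links = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(pages, links)

// Run PageRank until the ranks converge within the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// A higher rank means a higher probability that a random surfer lands on that page.
val ranksByName = pages.join(ranks).map { case (_, (name, rank)) => (name, rank) }
ranksByName.collect().foreach(println)
```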
Spark Streaming is an extension of the core Spark API. Its main use is to allow data engineers and data scientists to process real-time data from multiple sources such as Kafka, Amazon Kinesis, and Flume. The processed data can be exported to file systems, databases, and dashboards for further analysis.
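As an illustration, here is a minimal DStream-based sketch that counts words arriving on a local socket; the master setting, host, port, and 10-second batch interval are arbitrary choices for the example:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process text arriving on localhost:9999 in 10-second batches.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

wordCounts.print()   // printed to the console here; it could equally be written to files, databases, or dashboards

ssc.start()
ssc.awaitTermination()
```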
Spark | Hadoop |
---|---|
It is a data analytics engine. | It is a big data processing engine. |
Processes real-time data, for example streams of events from sources such as Twitter and Facebook. | Performs batch processing on huge volumes of data. |
Low-latency computing. | High-latency computing. |
Can process data interactively. | Processes data in batch mode. |
Easier to use; data is processed through high-level operators and abstractions. | Hadoop's model is more complex; developers need to handle low-level APIs. |
Performs in-memory computation, so no external job scheduler is required. | An external job scheduler is required. |
Somewhat less secure than Hadoop. | Highly secure. |
Costlier than Hadoop. | Less costly. |
In Spark, actions are RDD operations that return a value to the Spark driver program, which then kicks off a job to be executed on the cluster. reduce(), collect(), take(), and saveAsTextFile() are common examples of actions in Apache Spark.
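For example, a small sketch (assuming an existing SparkSession named spark; the numbers and output path are made up) showing actions returning results to the driver:

```scala
// Build an RDD on the cluster; the actions below bring results back to the driver or write them out.
val nums = spark.sparkContext.parallelize(1 to 10)

val total    = nums.reduce(_ + _)   // action: returns 55 to the driver
val firstTwo = nums.take(2)         // action: returns Array(1, 2) to the driver
val all      = nums.collect()       // action: brings the whole RDD back to the driver

nums.saveAsTextFile("/tmp/nums-output")  // action: writes the RDD out as text files
```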
The optimizer used by Spark SQL is the Catalyst optimizer. Its main job is to optimize queries written in Spark SQL and the DataFrame DSL. Queries that go through the Catalyst optimizer typically run much faster than equivalent hand-written RDD code.
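You can observe Catalyst at work by asking Spark to explain a DataFrame query. A brief sketch, assuming an existing SparkSession named spark and made-up sample data:

```scala
import spark.implicits._

val sales = Seq(("US", 100), ("DE", 250), ("US", 300)).toDF("country", "amount")
val query = sales.filter($"amount" > 150).groupBy($"country").sum("amount")

// explain(true) prints the parsed, analyzed, Catalyst-optimized, and physical plans.
query.explain(true)
query.show()
```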
In Spark, if any partition of an RDD is lost due to the failure of a worker node, that partition can be re-computed using the lineage of operations from the original fault-tolerant dataset.
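The lineage that makes this recomputation possible can be inspected with toDebugString. A quick sketch, reusing the article's /path/textFile.txt placeholder:

```scala
// Each transformation is recorded in the RDD's lineage; a lost partition is rebuilt by replaying it.
val base   = spark.sparkContext.textFile("/path/textFile.txt")
val errors = base.filter(_.contains("ERROR"))
val pairs  = errors.map(line => (line.split(" ")(0), 1))

println(pairs.toDebugString)  // prints the chain of parent RDDs (the lineage graph)
```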
GraphX is Spark's library for graphs and graph-parallel computation. Here are its main uses in Spark:

- Building and transforming graphs from RDDs of vertices and edges.
- Running built-in graph algorithms such as PageRank, connected components, and triangle counting.
- Combining graph processing with Spark's other APIs, since vertices and edges are ordinary RDDs.
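As a small sketch of these uses (assuming an existing SparkSession named spark; the users and edges are hypothetical):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical social graph: users as vertices, "follows" relationships as edges.
val users   = spark.sparkContext.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol"), (4L, "Dave")))
val follows = spark.sparkContext.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph   = Graph(users, follows)

// Built-in algorithm: label each vertex with the smallest vertex id in its connected component.
val components = graph.connectedComponents().vertices
components.collect().foreach(println)
```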
RDD | DataFrame |
---|---|
It represents an immutable, distributed collection of records (objects) for distributed computing. | It is used for storing data and is essentially the equivalent of a table in a relational database, with richer optimizations. |
It is a partitioned collection of object references representing a large data set. | It is a distributed collection of data organized into named columns. |
The datasets are logically partitioned across servers so they can be computed on different nodes of a cluster. | It has a table-like structure whose columns can hold different types, such as numeric, string, and so on. |
It supports compile-time type safety, being based on object-oriented programming. | There is no compile-time type safety; accessing a non-existent column only fails at runtime. |
Almost all data sources are supported. | DataFrames are typically built from structured sources such as JSON, CSV, or Avro files, or from storage systems such as Hive, HDFS, or MySQL tables. |
In Spark, coalesce() is another method for controlling how data is partitioned in a DataFrame. It is primarily used to reduce the number of partitions of a DataFrame, and it is most commonly used when the user wants to decrease the number of partitions without incurring a full shuffle.
Here are some of the advantages of using Spark rather than Hadoop's MapReduce:

- In-memory processing makes Spark considerably faster, especially for iterative algorithms.
- Its high-level APIs (RDDs, DataFrames, Spark SQL) are much easier to work with than low-level MapReduce code.
- A single engine supports batch, interactive, streaming, and machine learning workloads.
- Lazy evaluation lets Spark optimize an entire job as one DAG instead of a series of separate map and reduce stages.
Coalesce | Repartition |
---|---|
It can only decrease the number of partitions of a DataFrame. | This method can decrease or increase the number of partitions of a DataFrame. |
It reuses the existing partitions to minimize the amount of data being shuffled. | It creates new partitions and performs a full shuffle. |
The resulting partitions can be of unequal sizes. | The resulting partitions are roughly equal in size. |
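The difference can be seen directly from the resulting partition counts; a rough sketch, assuming an existing SparkSession named spark (the partition counts are arbitrary):

```scala
// A DataFrame/Dataset with some default number of input partitions.
val df = spark.range(0, 1000000)
println(df.rdd.getNumPartitions)

val fewer  = df.coalesce(4)      // narrow: merges existing partitions, avoids a full shuffle
val spread = df.repartition(16)  // wide: full shuffle, roughly equal-sized partitions

println(fewer.rdd.getNumPartitions)   // 4
println(spread.rdd.getNumPartitions)  // 16
```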
Spark uses lazy evaluation for the following reasons:

- Transformations are only recorded in the lineage (DAG); no work is done until an action is called.
- Deferring execution lets Spark optimize the whole chain of operations as a single job.
- It avoids unnecessary computation and data movement for intermediate results that are never used.
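A short sketch of this behaviour, reusing the article's /path/textFile.txt placeholder:

```scala
// Nothing is computed while transformations are declared; Spark only records the lineage.
val lines  = spark.sparkContext.textFile("/path/textFile.txt")
val errors = lines.filter(_.contains("ERROR"))   // transformation: recorded, not executed
val codes  = errors.map(_.split(" ")(0))         // transformation: still nothing runs

// The action below triggers the whole chain, letting Spark plan it as a single optimized job.
println(codes.count())
```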
cache() | persist() |
---|---|
The default storage level is always used: MEMORY_ONLY for an RDD and MEMORY_AND_DISK for a Dataset. | The user can choose from the various storage levels for both RDDs and Datasets. |
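A minimal sketch of the difference, assuming an existing SparkSession named spark; the /path/events.json input is hypothetical:

```scala
import org.apache.spark.storage.StorageLevel

// cache(): always uses the default storage level (MEMORY_ONLY for an RDD).
val logs = spark.sparkContext.textFile("/path/textFile.txt")
logs.cache()

// persist(): lets you choose the storage level explicitly.
val events = spark.read.json("/path/events.json")   // hypothetical input file
events.persist(StorageLevel.MEMORY_AND_DISK)
```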
RDDs or Resilient Distributed Datasets are the fundamental data structure present in Spark. They are immutable and fault-tolerant in nature. There are multiple ways to create RDDs in Spark. They are:
RDDs can be created by taking an existing collection in the driver program and passing it to SparkContext's parallelize() method. Here's an example:
val rdd = spark.sparkContext.parallelize(Seq(("Java", 10000),
  ("Python", 200000), ("Scala", 4000)))
rdd.foreach(println)
Output (the order may vary, since foreach runs on the partitions in parallel):
(Python,200000)
(Scala,4000)
(Java,10000)
In production systems, RDDs are most often created from files by simply reading the data. Let us see how:
val rdd = spark.sparkContext.textFile("/path/textFile.txt")
The above line of code creates an RDD in which each record represents one line of the file.
You can easily convert any DataFrame or Dataset into an RDD by using its rdd method. Here's how:
val myRdd2 = spark.range(20).toDF().rdd
In the above line of code, spark.range(20) creates a Dataset of 20 numbers, toDF() converts it into a DataFrame, and calling rdd on the DataFrame returns the underlying, newly created RDD.
There are two types of RDD operations in Spark. They are:

- Transformations, such as map(), filter(), and flatMap(), which return a new RDD and are evaluated lazily.
- Actions, such as count(), collect(), and reduce(), which trigger execution and return a value to the driver.
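A short sketch putting the two together, assuming an existing SparkSession named spark:

```scala
// Transformations are lazy; actions are eager.
val words = spark.sparkContext.parallelize(Seq("spark", "hadoop", "spark", "hive"))

val pairs  = words.map(word => (word, 1))   // transformation
val counts = pairs.reduceByKey(_ + _)       // transformation

counts.collect().foreach(println)           // action: triggers the computation
println(counts.count())                     // action
```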