How can RDDs be created in Spark?
RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark. They are immutable and fault-tolerant. There are several ways to create RDDs in Spark:
- Creating an RDD from a Seq or List using parallelize()
An RDD can be created by passing an existing collection in the driver program to SparkContext's parallelize() method. Here's an example:
val rdd = spark.sparkContext.parallelize(Seq(("Java", 10000),
("Python", 200000), ("Scala", 4000)))
rdd.foreach(println)
Output
(Python,200000)
(Scala,4000)
(Java,10000)
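As a side note, parallelize() also accepts an explicit partition count. The short sketch below (assuming the same spark session as above) distributes a range of numbers across four partitions:
val numbersRdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
println(numbersRdd.getNumPartitions)  // prints 4
println(numbersRdd.sum())             // prints 5050.0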
- Creating an RDD using a text file
In production systems, RDDs are most often created by reading data from files. Let us see how:
val rdd = spark.sparkContext.textFile("/path/textFile.txt")
The above line of code creates an RDD in which each record represents a single line of the text file.
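As a quick illustration (the file path here is only a placeholder), the resulting RDD can be transformed like any other, for example to count lines and words:
val lines = spark.sparkContext.textFile("/path/textFile.txt")
println(lines.count())                        // number of lines in the file
val words = lines.flatMap(_.split("\\s+"))
println(words.count())                        // number of whitespace-separated words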
- Creating RDDs from DataFrames and Datasets
Any DataFrame or Dataset can easily be converted into an RDD by accessing its rdd property. Here's how:
val myRdd2 = spark.range(20).toDF().rdd
In the above line of code, spark.range(20) creates a Dataset of 20 rows, toDF() converts it into a DataFrame, and rdd returns the underlying RDD.
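As an additional sketch (the Language case class below is introduced purely for illustration), calling rdd on a typed Dataset returns an RDD of that element type, whereas calling it on a DataFrame returns an RDD[Row]:
import spark.implicits._
case class Language(name: String, users: Long)
val ds = Seq(Language("Java", 10000), Language("Scala", 4000)).toDS()
val typedRdd = ds.rdd          // RDD[Language]
val rowRdd   = ds.toDF().rdd   // RDD[Row]
typedRdd.foreach(lang => println(lang.name))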