How can RDDs be created in Spark?
RDDs (Resilient Distributed Datasets) are the fundamental data structure in Spark. They are immutable and fault-tolerant. There are several ways to create RDDs in Spark:
- Creating an RDD from a Seq or List using parallelize()
An RDD can be created by passing an existing collection in the driver program to SparkContext's parallelize() method. Here's an example:
val rdd = spark.sparkContext.parallelize(Seq(("Java", 10000),
("Python", 200000), ("Scala", 4000)))
rdd.foreach(println)
Output
(Python,200000)
(Scala,4000)
(Java,10000)
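As a side note, parallelize() also accepts an explicit partition count. The short sketch below (assuming the same spark session as above) distributes a range of numbers across four partitions:
val numbersRdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
println(numbersRdd.getNumPartitions)  // prints 4
println(numbersRdd.sum())             // prints 5050.0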
- Creating an RDD using a text file
In production systems, RDDs are most often created by reading data from files. Let us see how:
val rdd = spark.sparkContext.textFile("/path/textFile.txt")
The above line of code creates an RDD in which each record represents a single line of the text file.
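As a quick illustration (the file path here is only a placeholder), the resulting RDD can be transformed like any other, for example to count lines and words:
val lines = spark.sparkContext.textFile("/path/textFile.txt")
println(lines.count())                        // number of lines in the file
val words = lines.flatMap(_.split("\\s+"))
println(words.count())                        // number of whitespace-separated words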
- Creating RDDs from DataFrames and Datasets
Any DataFrame or Dataset can easily be converted into an RDD by accessing its rdd property. Here's how:
val myRdd2 = spark.range(20).toDF().rdd
In the above line of code, spark.range(20) creates a Dataset of 20 rows, toDF() converts it into a DataFrame, and rdd returns the underlying RDD.
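As an additional sketch (the Language case class below is introduced purely for illustration), calling rdd on a typed Dataset returns an RDD of that element type, whereas calling it on a DataFrame returns an RDD[Row]:
import spark.implicits._
case class Language(name: String, users: Long)
val ds = Seq(Language("Java", 10000), Language("Scala", 4000)).toDS()
val typedRdd = ds.rdd          // RDD[Language]
val rowRdd   = ds.toDF().rdd   // RDD[Row]
typedRdd.foreach(lang => println(lang.name))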