Dataframes

Last updated on: 2025-05-30

What is a Dataframe?

A DataFrame is a distributed collection of data organized into named columns, much like a table in SQL. It provides a convenient and powerful abstraction for working with structured and semi-structured data. DataFrames are built on top of Resilient Distributed Datasets (RDDs) and are optimized for large-scale data processing.

Before we dive into using DataFrames, let’s start by setting up a Spark session. (You can learn how to create one in the Spark Session article.)
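If you don’t have a session yet, a minimal sketch looks like the following (the application name and local master are placeholder choices, not requirements):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a Spark session running locally
val spark = SparkSession.builder()
  .appName("DataFrameExamples")
  .master("local[*]")
  .getOrCreate()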

Creating an Empty DataFrame

After you initialize a Spark session, use the following code to create an empty DataFrame.

// Create a DataFrame with no columns and no rows
val emptyDf = spark.emptyDataFrame

emptyDf.show()

Output

++
||
++
++

This displays an empty DataFrame with no columns or rows. Next, we’ll define a schema to create a structured DataFrame.

Defining the Schema for a DataFrame

Before defining the schema, we have to import the following from org.apache.spark.sql:

  • org.apache.spark.sql.Row - used to create individual rows of a DataFrame.
  • org.apache.spark.sql.types._ - used to define data types and schema structures, such as StructType, StructField, etc.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Define the schema for a DataFrame of students and the marks they obtained in the final exam
val schema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Name", StringType, nullable = true),
  StructField("Marks", IntegerType, nullable = true)
))

// Create an empty DataFrame with this schema and show it
val schemaDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
schemaDf.show()

We define the schema of a DataFrame using StructType and StructField. To learn more about them, refer to the Struct Data Type article. We have to specify the data type of each column when defining the schema. To learn more about the different data types in Spark, refer to the Spark Data Types article.

Output

+----+----+-----+
|Roll|Name|Marks|
+----+----+-----+
+----+----+-----+
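Schemas are not limited to flat columns; a field’s type can itself be a StructType. As an illustrative sketch (the Address column below is hypothetical and not part of the running example):

import org.apache.spark.sql.types._

// A schema with a nested struct column (illustration only)
val nestedSchema = StructType(Seq(
  StructField("Roll", IntegerType, nullable = true),
  StructField("Address", StructType(Seq(
    StructField("City", StringType, nullable = true),
    StructField("Pin", IntegerType, nullable = true)
  )), nullable = true)
))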

Viewing a DataFrame's Schema

To view the schema of a DataFrame, we use the .printSchema() method, which prints the schema in an easy-to-read tree format.

// prints the schema of the dataframe
schemaDf.printSchema()

Output

root
 |-- Roll: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Marks: integer (nullable = true)

Alternatively, use the .schema attribute to get the underlying StructType:

//alternate way to print the schema of the created DataFrame
println(schemaDf.schema)

Output

StructType(StructField(Roll,IntegerType,true),StructField(Name,StringType,true),StructField(Marks,IntegerType,true))
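Because .schema returns a StructType, the fields can also be inspected programmatically, for example:

// Print each field's name and data type
schemaDf.schema.fields.foreach { f =>
  println(s"${f.name}: ${f.dataType}")
}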

Now that we know how to create a DataFrame directly, it is time to convert Spark DataFrames into other Spark abstractions, i.e., Spark RDDs and Spark Datasets.

Converting a Spark DataFrame to RDD

To convert a DataFrame into an RDD, call its .rdd method, which returns an RDD[Row]. (The spark.implicits._ import enables the .toDF() method used to build the DataFrame from a collection.) To print the elements of the RDD, call .collect().foreach(println).

import spark.implicits._

// Create a DataFrame with column names
val df = Seq(
  (1, "Ajay", 55),
  (2, "Bharghav", 63),
  (3, "Chaitra", 60),
  (4, "Kamal", 75),
  (5, "Sohaib", 70)
).toDF("Roll", "Name", "Marks")

// Convert DataFrame to RDD[Row]
val rddFromDF = df.rdd

// Print the RDD
rddFromDF.collect().foreach(println)

Output

[1,Ajay,55]
[2,Bharghav,63]
[3,Chaitra,60]
[4,Kamal,75]
[5,Sohaib,70]
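Since the result is an RDD[Row], individual values have to be extracted by position (or by field name). A brief sketch, assuming the column order above:

// Extract typed values from each Row by position
val tupleRdd = rddFromDF.map(row => (row.getInt(0), row.getString(1), row.getInt(2)))
tupleRdd.collect().foreach(println)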

Converting Spark DataFrame to Dataset

Datasets in Spark provide compile-time type safety. To convert a DataFrame into a typed Dataset, define a case class that matches the schema and use the .as[] method. Note: .as[] requires the encoders brought in by importing spark.implicits._.

import org.apache.spark.sql.Dataset
import spark.implicits._

// Define a case class matching the DataFrame's schema
case class Marks(Roll: Int, Name: String, Marks: Int)

// Create a DataFrame
val df = Seq(
  (1, "Ajay", 55),
  (2, "Bharghav", 63),
  (3, "Chaitra", 60),
  (4, "Kamal", 75),
  (5, "Sohaib", 70)
).toDF("Roll", "Name", "Marks")

// Convert DataFrame to Dataset[Marks]
val ds: Dataset[Marks] = df.as[Marks]

// Show the Dataset
ds.show()

Output

+----+--------+-----+
|Roll|    Name|Marks|
+----+--------+-----+
|   1|    Ajay|   55|
|   2|Bharghav|   63|
|   3| Chaitra|   60|
|   4|   Kamal|   75|
|   5|  Sohaib|   70|
+----+--------+-----+
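With a typed Dataset, transformations can refer to the case class’s fields directly, and mistakes are caught at compile time rather than at run time. A small sketch:

// Filter and project using the case class fields (checked at compile time)
val passed = ds.filter(_.Marks > 60).map(_.Name)
passed.show()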

Summary

In this article, we’ve covered:

  • What a DataFrame is and how it relates to SQL tables.

  • How to create both empty and structured DataFrames.

  • Ways to define and view schemas.

  • How to convert DataFrames into RDDs and Datasets for more flexible data manipulation.

These foundational concepts are essential for working efficiently with Spark’s powerful data processing capabilities.
