SparkSession

Last updated on: 2025-05-30

In this article, we'll explore what SparkSession is, why it is important, and how to use it effectively in your big data applications using Scala.

What is SparkSession?

  • SparkSession is the entry point for programming with Apache Spark.
  • It provides a unified interface to various Spark functionalities, such as loading data, creating DataFrames, running SQL queries, and managing resources.
  • Before Spark 2.0, you had to use separate contexts (SparkContext, SQLContext, HiveContext) for different functionalities.
  • SparkSession consolidates these contexts into a single, more convenient interface.

Why is SparkSession important?

SparkSession simplifies the process of working with Spark by offering the following:

  1. Unified Interface: Combining multiple Spark contexts into one.
  2. Ease of Use: Simplifying the codebase and improving readability.
  3. Functionality: Offering a wide range of methods for data manipulation and querying.
  4. Resource Management: Managing cluster resources more efficiently.

How to create a SparkSession in Scala?

Here’s how you can do it:

  • Add Spark SQL as a dependency to your build (a minimal build.sbt sketch follows this list)
  • In the root directory of your project, create src/main/scala
  • Inside the scala folder, create a Main.scala file
  • Import the necessary packages and instantiate a SparkSession object, as in the Main.scala example below
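
A minimal build.sbt sketch, assuming sbt and a Spark 3.x / Scala 2.12 setup (the exact version numbers below are assumptions; match them to your environment):

name := "spark-session-example"

scalaVersion := "2.12.18"  // Spark 3.x is published for Scala 2.12 and 2.13

// spark-sql brings in SparkSession, DataFrames, and the SQL engine
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"

With the dependency in place, Main.scala looks like this: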
import org.apache.spark.sql.SparkSession

object Main extends App {

  // Create a SparkSession
  val spark = SparkSession.builder
    .appName("SparkSessionExample")
    .master("local[*]")  // Change to your cluster master URL
    .getOrCreate()

  // Print the Spark version
  println(s"Spark Version: ${spark.version}")
}
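
Assuming the build.sbt above, you can run this from the project root with sbt run.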

In this example:

  • builder creates a new SparkSession.Builder instance.
  • appName sets the name of your Spark application.
  • master sets the cluster URL to connect to (use "local[*]" for local testing).
  • getOrCreate creates a new SparkSession or returns an existing one if available.
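
To see the "get or create" behavior in action, here is a small sketch: building a second session while one is active returns the same underlying object.

// Calling the builder again returns the already-active session
val spark2 = SparkSession.builder
  .appName("AnotherName")  // has no effect here: an active session already exists
  .getOrCreate()

println(spark eq spark2)  // prints true: both references point to the same session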

How to set configurations in SparkSession?

Configuration options can be set on the builder by passing a config key and a config value to the config method:

val spark = SparkSession.builder
  .appName("SparkSessionExample")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")  // partitions used for shuffles and joins
  .config("spark.executor.memory", "2g")        // memory allocated per executor
  .getOrCreate()

For a list of all the available options, you can refer to Spark Configurations.
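
Some SQL options can also be changed after the session has been created, through the runtime configuration:

// Adjust a runtime-settable option on an existing session
spark.conf.set("spark.sql.shuffle.partitions", "16")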

How to view all configurations in SparkSession?

The conf attribute of a SparkSession (a RuntimeConfig object) exposes all the configurations of the session. Since we already have a spark object, we can call getAll on spark.conf to retrieve them.

spark.conf.getAll.foreach(println)

The above code prints every configuration as a (key, value) pair.
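
To look up a single value instead of the whole map, get it by key; the fallback default below is just an illustration:

// Read one configuration value, with a default used when the key is unset
val appName = spark.conf.get("spark.app.name")
val partitions = spark.conf.get("spark.sql.shuffle.partitions", "200")
println(s"$appName runs with $partitions shuffle partitions")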

How to read data using SparkSession?

With SparkSession, you can read data from CSV, JSON, Parquet, and many other formats. You can find the full list of supported formats at Spark Data Sources.

Here’s an example of how to read a CSV file:

// Read a CSV file into a DataFrame
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("path/to/your/csvfile.csv")

// To view the first 5 rows of the DataFrame
df.show(5)

In this example:

  • option("header", "true") specifies that the first row of the CSV file contains column names.
  • option("inferSchema", "true") automatically infers the data types of the columns.

How to run SQL queries with SparkSession?

SparkSession also allows you to run SQL queries on your DataFrames. To do this, you first need to create a temporary view of the DataFrame, and then you can execute SQL queries using the sql method:

// Create a temporary view
df.createOrReplaceTempView("my_table")

// Run a SQL query
val result = spark.sql("SELECT * FROM my_table WHERE column_name > 100")

// Show the result of the query
result.show()

In this example:

  • createOrReplaceTempView("my_table") creates a temporary view named my_table.
  • spark.sql runs an SQL query to filter rows based on a condition.
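
For comparison, the same filter can be expressed with the DataFrame API instead of SQL (column_name is the placeholder column from the query above):

import org.apache.spark.sql.functions.col

// Equivalent to the SQL query above
val sameResult = df.filter(col("column_name") > 100)
sameResult.show()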

How to close the SparkSession?

After completing your data processing tasks, it is good practice to stop the SparkSession to release cluster resources:

// Stop the SparkSession
spark.stop()
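
In a standalone application, a common pattern is to wrap the processing in try/finally so the session is stopped even when the job fails:

try {
  // ... your data processing ...
  df.show(5)
} finally {
  // Stopping the session also stops the underlying SparkContext
  spark.stop()
}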

Conclusion

SparkSession is a crucial component of Apache Spark that simplifies working with big data. It provides a unified interface for creating DataFrames, running SQL queries, and managing Spark applications. By understanding and utilizing SparkSession, you can write cleaner, more efficient Spark code.
