Getting started with Spark

Last updated on: 2024-06-24

In this tutorial, we'll learn how to set up Apache Spark in three different ways:

  • Local installation
  • Docker
  • IntelliJ IDEA

Apache Spark Local Installation

In this section, we'll discuss how to set up Spark by downloading it directly from the official Apache Spark website.

Install Prerequisites

Spark runs on the JVM, so make sure a compatible JDK (Java 8, 11, or 17 for Spark 3.x) is installed and JAVA_HOME points to it before continuing.

Install Apache Spark

  1. Go to the Apache Spark Downloads page
  2. Choose the Spark release version (or go with the default option)
  3. Choose the package type (or go with the default option)
  4. Click on the link beside "Download Spark:"

Set up Spark environment variables

  1. Extract the downloaded archive to your desired location
  2. Download winutils.exe from here (needed only on Windows)
  3. Move it into the bin directory of the extracted Spark folder
  4. Now set the environment variables as follows:
    1. SPARK_HOME (pointing to the Spark folder location)
    2. HADOOP_HOME (pointing to the Spark folder location)
  5. Add "%SPARK_HOME%\bin" under Path in the environment variables

Run Spark

  1. Open Command Prompt in the bin directory of the Spark folder
  2. Run the following command:
spark-shell
  3. The shell prints the Spark and Scala versions in use and opens an interactive Scala shell
  4. SparkSession and SparkContext objects are available by default as spark and sc respectively (a quick check is shown below)
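
As a quick check, you can paste a couple of lines into the shell. They rely only on the predefined spark and sc objects; the numbers are just sample data for illustration:

spark.range(1, 6).toDF("n").show()       // small DataFrame with the numbers 1 to 5
sc.parallelize(Seq(1, 2, 3)).sum()       // RDD created via the SparkContext; returns 6.0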

Apache Spark Setup in Docker

In this section, we'll discuss how to set up Spark using Docker.

Install Docker Desktop

  1. Go to the Docker Desktop page
  2. Click "Download for Windows" or the option matching your OS
  3. The Docker Desktop setup file will be downloaded
  4. Run the setup file (Docker Desktop Installer.exe) to install Docker Desktop
  5. Restart your PC

Run a Docker container for Spark

  1. Run the following command in the command line:
docker run -it --rm spark /opt/spark/bin/spark-shell
  2. This will run spark-shell inside the Docker container
  3. Here you can execute Spark code in the CLI (Command Line Interface); a small example is shown below
  4. You have successfully set up Spark using Docker
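
The containerized shell behaves just like a local one, so a tiny word count is enough to confirm it works. The sample sentences below are placeholder data for illustration:

val lines = sc.parallelize(Seq("spark on docker", "spark in a container"))
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)   // prints each (word, count) pair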

Apache Spark Setup in IntelliJ IDEA

In this section, we'll discuss how to set up Spark using an IDE plugin/extension.

Setting up the IDE

We recommend using IntelliJ IDEA as your development environment. However, you are free to use any other IDE that supports the Scala (Metals) extension.

Create a new Scala project

  1. Go to File -> New -> Project
  2. Type your project name and select your project location
  3. Select "Scala" as the language and "sbt" as the build system
  4. Select your JDK, sbt, and Scala (2.12 or 2.13 only) versions
  5. Click "Create"
  6. Wait until the project is loaded. You can check the progress at the bottom of the IDE

Install Spark dependencies using build.sbt

  1. Go to build.sbt
  2. Define a value called sparkVersion and assign it "3.5.1" near the beginning of the file:
val sparkVersion: String = "3.5.1"
  3. Add the following lines at the end of the file. These pull in Spark Core and Spark SQL at the version defined above
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
  4. Now reload the sbt build changes. In IntelliJ IDEA, the reload prompt appears near the top-right corner, under the bell (notifications) icon
  5. This will download and resolve the Spark dependencies
  6. Congrats! You have successfully installed the Spark dependencies using sbt (a small verification program is shown below)
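
To verify the setup, you can add a small application under src/main/scala and run it from the IDE. This is a minimal sketch rather than part of the guide; the object name, app name, and the local[*] master setting are placeholders for illustration:

import org.apache.spark.sql.SparkSession

object SparkSetupCheck {
  def main(args: Array[String]): Unit = {
    // Run Spark locally, using all available cores
    val spark = SparkSession.builder()
      .appName("spark-setup-check")
      .master("local[*]")
      .getOrCreate()

    // A tiny DataFrame just to confirm the Spark SQL dependency resolves
    import spark.implicits._
    Seq(("spark", 3), ("sbt", 1)).toDF("tool", "count").show()

    spark.stop()
  }
}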
