Getting started with Spark
Last updated on: 2024-06-24
In this tutorial, we'll learn how to set up Apache Spark in three different ways:
- Local installation
- Docker
- IntelliJ IDEA
Apache Spark Local Installation
In this section, we'll discuss how to set up Spark through direct installation from the official Apache Spark website
Install Prerequisites
- Java 8/11/17 (Download JDK here: OpenJDK)
- Scala 2.12 or 2.13 (Download here: Scala Official Download)
Install Apache Spark
- Go to the Apache Spark Downloads page
- Choose the Spark release version (or go with the default option)
- Choose the package type (or go with the default option)
- Click on the link beside "Download Spark:"
Set up Spark environment variables
- Extract the downloaded archive (.tgz) to your desired location
- Download winutils.exe from here (needed to run Spark on Windows)
- Move it into the bin folder of the extracted Spark directory
- Now set the environment variables as follows:
- SPARK_HOME (set to the extracted Spark folder location)
- HADOOP_HOME (set to the same Spark folder, since winutils.exe now sits in its bin directory)
- Add "%SPARK_HOME%\bin" under Path in environment variables
Run Spark
- Open Command Prompt in the bin folder of the extracted Spark directory
- Run the following command:
spark-shell
- It shows the Spark and Scala versions in use and opens an interactive Scala shell
- SparkSession and SparkContext objects are available by default as spark and sc respectively
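To verify the shell works, you can try a couple of quick commands (the numbers here are only illustrative):
// Count the rows of a small Dataset created from a range of numbers
spark.range(1, 101).count()
// Use the SparkContext to parallelize a local collection and sum it
sc.parallelize(1 to 100).sum()
Type :quit to exit the shell.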
Apache Spark Setup in Docker
In this section, we'll discuss how to set up Spark using Docker.
Install Docker Desktop
- Go to the Docker Desktop page
- Click "Download for Windows" (or the appropriate option for your OS)
- Docker Desktop setup file will be downloaded
- Run the setup file (Docker Desktop Installer.exe) to install Docker Desktop
- Restart your PC
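To confirm the installation, you can check the Docker version from a command line:
docker --version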
Run a Docker container for Spark
- Run the following command in the command line:
docker run -it --rm spark /opt/spark/bin/spark-shell
- This will run spark-shell inside the Docker container
- Here you can execute Spark code in the CLI (command-line interface)
- You have successfully set up Spark using Docker
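If the containerized shell needs to read files from your machine, you can mount a local folder into the container. The host path (C:\data) and the mount point (/data) below are only example choices:
docker run -it --rm -v C:\data:/data spark /opt/spark/bin/spark-shell
Inside the shell, files placed under C:\data are then visible at /data (for example, spark.read.textFile("/data/sample.txt"), assuming such a file exists).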
Apache Spark Setup in IntelliJ IDEA
In this section, we'll discuss how to set up Spark using an IDE plugin/extension.
Setting up the IDE
We recommend using IntelliJ IDEA as your development environment. However, you are free to use any other IDE that supports Scala, for example through the Metals extension.
- Install IntelliJ IDEA by following this Jetbrains site -> Jetbrains IntelliJ IDEA Installation
- Install Scala plugin by following this Jetbrains site -> IntelliJ IDEA Scala Plugin Installation
- Work with Scala SBT in any IDE (like IDEA, VS Code, etc.) by following this guide -> Scala SBT
sbt is the standard build tool for Scala projects
Create a new Scala project
- Go to File -> New -> Project
- Type your project name and select your project location
- Select language as Scala and "Build System" as sbt
- Select your JDK, sbt and Scala (2.12 or 2.13 only) versions
- Click "Create"
- Wait till the project is loaded. You can check the progress at the bottom of the IDE
Install Spark dependencies using build.sbt
- Go to build.sbt
- Define a value called "sparkVersion" and assign it "3.5.1" near the beginning of the file:
val sparkVersion: String = "3.5.1"
- Add the following lines at the end of the file. This will pull in Spark Core and Spark SQL at that version:
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
- Now reload the sbt build changes. In IntelliJ IDEA, the reload action appears near the top-right corner, under the bell (notifications) icon
- This will download and install the Spark dependencies
- Congrats! You have successfully installed the Spark dependencies using sbt. A minimal application to verify the setup is sketched below.
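As a quick end-to-end check, here is a minimal sketch of a Spark application you could place under src/main/scala. The object name and the local[*] master are just example choices, not part of the official setup:
import org.apache.spark.sql.SparkSession

object SparkHello {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession; local[*] uses all available CPU cores
    val spark = SparkSession.builder()
      .appName("SparkHello")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Create a small DataFrame from a local collection and display it
    val df = Seq(("spark", 3), ("sbt", 1), ("scala", 2)).toDF("name", "count")
    df.show()
    println(s"Rows: ${df.count()}")

    spark.stop()
  }
}
You can run it with the IDE's Run action or with sbt run from the project root.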
References
- Local - OpenJDK, Scala, Apache Spark & winutils.exe
- Docker - Docker Desktop
- Jetbrains IntelliJ IDEA - IDEA Installation & Scala Plugin Installation
- Other IDEs - Scala SBT
- Apache Spark Official GitHub