Introduction to Spark
Last updated on: 2025-05-30
Apache Spark was started in 2009 as a research project at UC Berkeley's AMPLab.
In June 2013, Spark gained incubation status at the Apache Software Foundation (ASF) and was established as an Apache Top-Level Project in February 2014.
In this article, we will cover:
- What is Apache Spark?
- Why should one use Apache Spark?
- What languages does Apache Spark support?
- What is the Apache Spark ecosystem?
- What are the integrations of Apache Spark?
What is Apache Spark?
- Apache Spark is a multi-language engine for running big data workloads.
- It keeps intermediate computations in memory instead of writing them to disk.
- It is powered by an advanced distributed SQL engine designed for processing large-scale data.
- Apache Spark can run on a single-node machine or on a cluster.
Why should one use Apache Spark?
- It is fast, simple to use, and scalable.
- It is unified (one technology, many tasks).
- Apache Spark is open source and has a large, active community (over 2,000 contributors).
- It excels at iterative processing and does not write intermediate results to disk, which makes it faster than Hadoop MapReduce.
- It is used by roughly 80% of Fortune 500 companies for their data workloads.
- It is capable of both batch and stream processing.
What languages does Apache Spark support?
| Language | Supported Versions |
|---|---|
| Scala | 2.12 and 2.13 |
| Python | 3.8 and above |
| R | 3.5 and above |
| Java | 8, 11, and 17 |
What is the Apache Spark ecosystem?
Spark Core
- Spark Core is the base engine for large-scale parallel and distributed data processing.
- It handles functionalities like task scheduling, memory management, fault recovery, etc.
Spark SQL
- Spark SQL is built on top of Spark Core and is used for structured data processing.
- It is used to execute relational queries expressed in either SQL or the DataFrame/Dataset API.
- Spark SQL supports a variety of data sources and formats, such as CSV, JSON, Parquet, and Hive tables.
MLlib
- MLlib simplifies the implementation of machine learning at scale.
- It provides many algorithms and utilities, including regression, classification, clustering, and decision trees.
GraphX
- GraphX is used for graphs and graph-parallel computation.
- It provides ETL, exploratory analysis, and iterative graph computation in a single unified system.
Spark Structured Streaming
- Structured Streaming is Spark's current streaming engine, built on top of the Spark SQL engine; it supersedes the older DStream-based Spark Streaming API.
- It enables scalable, high-throughput, fault-tolerant processing of live data streams.
What are the integrations of Apache Spark?
Apache Spark can be integrated with numerous tools and technologies in various domains, such as:
- Pandas
- Power BI
- Tableau
- Cassandra
- MLflow
- Kubernetes
- Apache Kafka
To learn more about Apache Spark integrations, you can refer to the Awesome Spark repository.