Introduction to Spark
Last updated on: 2024-06-21
Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and it was established as an Apache Top-Level Project in February 2014.
In this article, we are going to understand:
- What is Apache Spark?
- Why should Apache Spark be used?
- What are the integrations of Apache Spark?
- What languages does Apache Spark support?
- What libraries does Apache Spark support?
What is Apache Spark?
- Apache Spark is a multi-language engine that is used for big data workloads.
- Apache Spark provides in-memory storage for intermediate computations.
- Apache Spark can run on a single-node machine or on a cluster.
- Apache Spark is used for executing data engineering, data science, and machine learning tasks.
Why should Apache Spark be used?
- Apache Spark is fast, easy to use, and scalable.
- Apache Spark is unified (one technology - many tasks).
- Apache Spark is open-source and has a huge community working on it (over 2000 contributors).
- Apache Spark is well suited to iterative processing: it keeps intermediate results in memory instead of writing them to disk, which makes it faster than Hadoop MapReduce for such workloads.
- Apache Spark is used by around 80% of Fortune 500 companies to solve their data workload tasks.
- Apache Spark is capable of both batch and stream processing.
What languages does Apache Spark support?
Language | Supported Versions
---|---
Scala | 2.12 and 2.13
Python | 3.8 and above
R | 3.5 and above
Java | 8, 11, and 17
What is the Apache Spark ecosystem?
Spark Core
- Spark Core is the base engine for large-scale parallel and distributed data processing.
- Spark Core handles functionalities like task scheduling, memory management, fault recovery, etc.
Spark SQL
- Spark SQL is built on top of Spark Core and is used for structured data processing.
- Spark SQL provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
- Spark SQL supports various data formats like csv, json, parquet, hive, etc.
MLlib
- MLlib is used to make implementation of Machine Learning simple.
- MLlib has implementations of regression, classification, clustering, etc.
GraphX
- GraphX is used for graphs and graph-parallel computation.
- GraphX provides ETL, exploratory analysis, and iterative graph computation in one unified system.
Spark Structured Streaming
- Spark Structured Streaming enables scalable, high-throughput, fault-tolerant processing of live data streams.
- The older DStream-based Spark Streaming API is a legacy project and no longer receives updates.
- The current streaming engine in Spark is Structured Streaming, which is built on the Spark SQL engine.
What are the integrations of Apache Spark?
Apache Spark can be integrated with a lot of tools and technologies in various domains. Some of the key tools are:
- Pandas
- Power BI
- Tableau
- Cassandra
- MLflow
- Kubernetes
- Apache Kafka
To learn more about Apache Spark integrations, you can refer to the Awesome Spark repo.