Introduction to Spark

Last updated on: 2025-05-30

Apache Spark was started in 2009 as a research project at UC Berkeley’s AMPLab.

In June 2013, Spark gained incubation status at the Apache Software Foundation (ASF) and was established as an Apache Top-Level Project in February 2014.

In this article, we are going to understand:

  • What is Apache Spark?
  • Why should Apache Spark be used?
  • What are the integrations of Apache Spark?
  • What languages does Apache Spark support?
  • What libraries does Apache Spark support?

What is Apache Spark?

  • Apache Spark is a multi-language engine for big data workloads.
  • It keeps intermediate computations in memory rather than writing them to disk.
  • It is powered by an advanced distributed SQL engine designed for large-scale data.
  • It can run on a single-node machine or on a cluster.

Why should one use Apache Spark?

  • It is fast, simple to use, and scalable.
  • It is unified: one technology for many tasks.
  • Apache Spark is open source and has a large, active community (over 2,000 contributors).
  • It excels at iterative processing: intermediate results stay in memory instead of being written to disk, which makes it faster than Hadoop MapReduce.
  • It is used by roughly 80% of Fortune 500 companies to handle their data workloads.
  • It supports both batch and stream processing.

What languages does Apache Spark support?

Language   Supported versions
Scala      2.12 and 2.13
Python     3.8 and above
R          3.5 and above
Java       8, 11, and 17

What is the Apache Spark ecosystem?

Spark Core

  • Spark Core is the base engine for large-scale parallel and distributed data processing.
  • It handles functionalities like task scheduling, memory management, fault recovery, etc.

Spark SQL

  • Spark SQL is built on top of Spark Core and is used for structured data processing.
  • It is used to execute relational queries expressed in either SQL or the DataFrame/Dataset API.
  • Spark SQL supports various data formats and sources such as CSV, JSON, Parquet, and Hive tables.

MLlib

  • MLlib is Spark's machine learning library; it simplifies building scalable ML pipelines.
  • It provides many algorithms and utilities, including regression, classification, clustering, and decision trees.

GraphX

  • GraphX is used for graphs and graph-parallel computation.
  • It provides ETL, exploratory analysis, and iterative graph computation in a single unified system.

Spark Structured Streaming

  • Spark Structured Streaming is Spark's current streaming engine, built on the Spark SQL engine; it supersedes the older DStream-based Spark Streaming API.
  • It enables scalable, high-throughput, fault-tolerant processing of live data streams.

What are the integrations of Apache Spark?

Apache Spark can be integrated with numerous tools and technologies in various domains, such as:

  • Pandas
  • Power BI
  • Tableau
  • Cassandra
  • MLflow
  • Kubernetes
  • Apache Kafka

To learn more about Apache Spark integrations, you can refer to the Awesome Spark repository.
