Introduction to Spark
Last updated on: 2025-05-30
Apache Spark was started in 2009 as a research project at UC Berkeley's AMPLab.
In June 2013, Spark gained incubation status at the Apache Software Foundation (ASF) and was established as an Apache Top-Level Project in February 2014.
In this article, we will cover:
- What is Apache Spark?
- Why should one use Apache Spark?
- What languages does Apache Spark support?
- What is the Apache Spark ecosystem?
- What are the integrations of Apache Spark?
What is Apache Spark?
- Apache Spark is a multi-language engine for running big data workloads.
- It keeps intermediate computations in memory instead of writing them to disk.
- It is powered by an advanced distributed SQL engine designed for processing large-scale data.
- Apache Spark can run on a single-node machine or on a cluster.
Why should one use Apache Spark?
- It is fast, simple to use, and scalable.
- It is unified (one technology, many tasks).
- Apache Spark is open source and has a large, active community (over 2,000 contributors).
- It excels at iterative processing and does not write intermediate results to disk, which makes it faster than Hadoop MapReduce.
- It is used by roughly 80% of Fortune 500 companies for their data workloads.
- It is capable of both batch and stream processing.
What languages does Apache Spark support?
| Language | Supported Versions |
|---|---|
| Scala | 2.12 and 2.13 |
| Python | 3.8 and above |
| R | 3.5 and above |
| Java | 8, 11, and 17 |
What is the Apache Spark ecosystem?
Spark Core
- Spark Core is the base engine for large-scale parallel and distributed data processing.
- It handles functionalities like task scheduling, memory management, fault recovery, etc.
Spark SQL
- Spark SQL is built on top of Spark Core and is used for structured data processing.
- It is used to execute relational queries expressed in either SQL or the DataFrame/Dataset API.
- Spark SQL supports a variety of data sources and formats, such as CSV, JSON, Parquet, and Hive tables.
MLlib
- MLlib simplifies the implementation of machine learning at scale.
- It provides many algorithms and utilities, including regression, classification, clustering, and decision trees.
GraphX
- GraphX is used for graphs and graph-parallel computation.
- It provides ETL, exploratory analysis, and iterative graph computation in a single unified system.
Spark Structured Streaming
- Structured Streaming is Spark's current streaming engine, built on top of the Spark SQL engine; it supersedes the older DStream-based Spark Streaming API.
- It enables scalable, high-throughput, fault-tolerant processing of live data streams.
What are the integrations of Apache Spark?
Apache Spark can be integrated with numerous tools and technologies in various domains, such as:
- Pandas
- Power BI
- Tableau
- Cassandra
- MLflow
- Kubernetes
- Apache Kafka
To learn more about Apache Spark integrations, you can refer to the Awesome Spark repository.