Introduction to Spark
Last updated on: 2024-06-21
Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and it was established as an Apache Top-Level Project in February 2014.
In this article, we are going to understand:
- What is Apache Spark?
- Why should Apache Spark be used?
- What are the integrations of Apache Spark?
- What languages does Apache Spark support?
- What libraries does Apache Spark support?
What is Apache Spark?
- Apache Spark is a multi-language engine that is used for big data workloads.
- Apache Spark provides in-memory storage for intermediate computations.
- Apache Spark can run on a single-node machine or on a cluster.
- Apache Spark is used for executing data engineering, data science, and machine learning tasks.
Why should Apache Spark be used?
- Apache Spark is fast, easy to use, and scalable.
- Apache Spark is unified (one technology - many tasks).
- Apache Spark is open-source and has a huge community working on it (over 2000 contributors).
- Apache Spark is well suited to iterative processing: it keeps intermediate results in memory instead of writing them to disk, which makes it faster than Hadoop MapReduce for such workloads.
- Apache Spark is used by around 80% of Fortune 500 companies to solve their data workload tasks.
- Apache Spark is capable of both batch and stream processing.
What languages does Apache Spark support?
Language | Supported Versions
---|---
Scala | 2.12 and 2.13
Python | 3.8 and above
R | 3.5 and above
Java | 8, 11, and 17
What is the Apache Spark ecosystem?
Spark Core
- Spark Core is the base engine for large-scale parallel and distributed data processing.
- Spark Core handles functionalities like task scheduling, memory management, fault recovery, etc.
Spark SQL
- Spark SQL is built on top of Spark Core and is used for structured data processing.
- Spark SQL provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
- Spark SQL supports various data formats like csv, json, parquet, hive, etc.
MLlib
- MLlib is used to make implementation of Machine Learning simple.
- MLlib has implementations of regression, classification, clustering, etc.
GraphX
- GraphX is used for graphs and graph-parallel computation.
- GraphX provides ETL, exploratory analysis, and iterative graph computation in one unified system.
Spark Structured Streaming
- Spark Structured Streaming enables scalable, high-throughput, fault-tolerant processing of live data streams.
- The older DStream-based Spark Streaming API is a legacy project and no longer receives updates.
- The current streaming engine in Spark is Structured Streaming, which is built on the Spark SQL engine.
What are the integrations of Apache Spark?
Apache Spark can be integrated with a lot of tools and technologies in various domains. Some of the key tools are:
- Pandas
- Power BI
- Tableau
- Cassandra
- MLflow
- Kubernetes
- Apache Kafka
To learn more about Apache Spark integrations, you can refer to the Awesome Spark repo.