Introduction to Spark

Last updated on: 2024-06-21

Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF), and it became an Apache Top-Level Project in February 2014.

In this article, we are going to understand:

  • What is Apache Spark?
  • Why should Apache Spark be used?
  • What are the integrations of Apache Spark?
  • What languages does Apache Spark support?
  • What libraries does Apache Spark support?

What is Apache Spark?

  • Apache Spark is a multi-language engine that is used for big data workloads.
  • Apache Spark provides in-memory storage for intermediate computations.
  • Apache Spark can be run on single-node machines or clusters.
  • Apache Spark is used for executing data engineering, data science, and machine learning tasks.

Why should Apache Spark be used?

  • Apache Spark is fast, simple to use, and scalable.
  • Apache Spark is unified (one technology - many tasks).
  • Apache Spark is open-source and has a huge community working on it (over 2000 contributors).
  • Apache Spark is good at iterative processing. It doesn't write intermediate results to disk, which makes it faster than Hadoop MapReduce for iterative workloads.
  • Apache Spark is used by around 80% of Fortune 500 companies to solve their data workload tasks.
  • Apache Spark is capable of both batch and stream processing.

What languages does Apache Spark support?

  Language    Supported Versions
  Scala       2.12 and 2.13
  Python      3.8 and above
  R           3.5 and above
  Java        8, 11, and 17

What is the Apache Spark ecosystem?

Spark Core

  • Spark Core is the base engine for large-scale parallel and distributed data processing.
  • Spark Core handles functionalities like task scheduling, memory management, fault recovery, etc.

Spark SQL

  • Spark SQL is built on top of Spark Core and is used for structured data processing.
  • Spark SQL provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
  • Spark SQL supports various data formats such as CSV, JSON, Parquet, Hive tables, etc.

MLlib

  • MLlib makes it simple to implement scalable machine learning.
  • MLlib has implementations of regression, classification, clustering, etc.

GraphX

  • GraphX is used for graphs and graph-parallel computation.
  • GraphX provides ETL, exploratory analysis, and iterative graph computation in one unified system.

Spark Structured Streaming

  • Structured Streaming enables scalable, high-throughput, fault-tolerant processing of live data streams.
  • The older DStream-based Spark Streaming API is a legacy project and no longer receives updates.
  • The current streaming engine in Spark is Structured Streaming, which is built on the Spark SQL engine.

What are the integrations of Apache Spark?

Apache Spark can be integrated with a lot of tools and technologies in various domains. Some of the key tools are:

  • Pandas
  • Power BI
  • Tableau
  • Cassandra
  • MLflow
  • Kubernetes
  • Apache Kafka

To learn more about Apache Spark integrations, you can refer to the Awesome Spark repository.
