Data Types

Last updated on: 2025-05-30

Apache Spark gives users the flexibility to handle different types of data seamlessly. Spark data types are broadly categorized into five groups:

  1. Numeric Type
  2. String Type
  3. Boolean Type
  4. DateTime Type
  5. Complex Type

In this article, we will explore these data types with a sample DataFrame to understand how they are represented.

// In spark-shell these implicits are already in scope; in a standalone
// application, import them explicitly
import org.apache.spark.sql.functions.{col, to_date}
import spark.implicits._

// Creating a new DataFrame
val df = Seq(
  (1.toShort, "Ajay", 'A'.toString, 10L, "2010-01-01", 55.5F, 92.75, true, Seq("Singing", "Sudoku")),
  (2.toShort, "Bharghav", 'B'.toString, 20L, "2009-06-04", 63.2F, 88.5, false, Seq("Dancing", "Painting")),
  (3.toShort, "Chaitra", 'C'.toString, 30L, "2010-12-12", 60.1F, 75.8, true, Seq("Chess", "Long Walks")),
  (4.toShort, "Kamal", 'D'.toString, 40L, "2010-08-25", 75.0F, 82.3, false, Seq("Reading Books", "Cooking")),
  (5.toShort, "Sohaib", 'E'.toString, 50L, "2009-04-14", 70.8F, 90.6, true, Seq("Singing", "Cooking"))
).toDF("ID", "Name", "Grade", "LongValue", "DOB", "FloatMarks", "DoubleMarks", "IsActive", "Hobbies")

// Convert DOB column from String to DateType
val dfWithDate = df.withColumn("DOB", 
  to_date(col("DOB"), "yyyy-MM-dd")
)

dfWithDate.show(truncate = false)
dfWithDate.printSchema()

Output

+---+--------+-----+---------+----------+----------+-----------+--------+------------------------+
|ID |Name    |Grade|LongValue|DOB       |FloatMarks|DoubleMarks|IsActive|Hobbies                 |
+---+--------+-----+---------+----------+----------+-----------+--------+------------------------+
|1  |Ajay    |A    |10       |2010-01-01|55.5      |92.75      |true    |[Singing, Sudoku]       |
|2  |Bharghav|B    |20       |2009-06-04|63.2      |88.5       |false   |[Dancing, Painting]     |
|3  |Chaitra |C    |30       |2010-12-12|60.1      |75.8       |true    |[Chess, Long Walks]     |
|4  |Kamal   |D    |40       |2010-08-25|75.0      |82.3       |false   |[Reading Books, Cooking]|
|5  |Sohaib  |E    |50       |2009-04-14|70.8      |90.6       |true    |[Singing, Cooking]      |
+---+--------+-----+---------+----------+----------+-----------+--------+------------------------+

root
 |-- ID: short (nullable = false)
 |-- Name: string (nullable = true)
 |-- Grade: string (nullable = true)
 |-- LongValue: long (nullable = false)
 |-- DOB: date (nullable = true)
 |-- FloatMarks: float (nullable = false)
 |-- DoubleMarks: double (nullable = false)
 |-- IsActive: boolean (nullable = false)
 |-- Hobbies: array (nullable = true)
 |    |-- element: string (containsNull = true)

  1. Numeric Type

ByteType

The ByteType represents 1-byte signed integers ranging from -128 to 127. To assign a value explicitly as a byte, use .toByte. This ensures Spark reads the value with the correct data type.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ByteType, StructField, StructType}

// Define a schema with a single ByteType column
val schema = StructType(Seq(
  StructField("byte_value", ByteType, nullable = false)
))
// Create DataFrame with ByteType
val data = Seq(
  Row(10.toByte), 
  Row(-100.toByte), 
  Row(127.toByte)
)

val byteTypeDf = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

byteTypeDf.show()

Output

+----------+
|byte_value|
+----------+
|        10|
|      -100|
|       127|
+----------+

ShortType

ShortType is used to represent 2-byte signed integers, supporting values between -32,768 and 32,767. You can convert a number to short type using the .toShort method:

In the sample DataFrame above, the ID values were defined as:

1.toShort
2.toShort
3.toShort

This ensures they are stored as ShortType rather than the default IntegerType.
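
The 2-byte range matters in practice: .toShort truncates values that fall outside it. A small plain-Scala sketch, independent of Spark:

```scala
// ShortType holds 2-byte signed integers: -32,768 to 32,767
val inRange: Short = 32767.toShort   // fits, stays 32767
val wrapped: Short = 32768.toShort   // out of range: truncated to -32768

println(inRange)   // 32767
println(wrapped)   // -32768
```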

LongType

LongType is ideal for large integers and stores 8-byte signed values, ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. To create a long integer, either use the L suffix or .toLong:

10L
20L
30L
40L
50L
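
The L suffix matters: without it, Scala treats the literal as an Int, which overflows past 2,147,483,647. A minimal sketch:

```scala
// Int arithmetic wraps around at 2^31 - 1; Long arithmetic does not
val intOverflow: Int = Int.MaxValue + 1       // wraps to -2147483648
val longOk: Long = Int.MaxValue.toLong + 1L   // 2147483648, well within LongType

println(intOverflow)  // -2147483648
println(longOk)       // 2147483648
```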

FloatType

Floating point numbers that occupy 4 bytes fall under FloatType. You can create a float using the F suffix or .toFloat conversion:

55.5F
63.2F
60.1F
75.0F
70.8F

DoubleType

DoubleType represents 8-byte floating-point numbers, offering higher precision than float. In Scala, floating-point numbers without a suffix default to DoubleType. You can also use .toDouble:

92.75
88.5
75.8
82.3
90.6
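
The precision difference between the two is easy to observe: a value like 0.1 has no exact binary representation, and the 4-byte float keeps fewer significant digits than the 8-byte double. A plain-Scala sketch:

```scala
// 55.5 is exactly representable in binary, so float and double agree
val exact = 55.5F.toDouble == 55.5   // true

// 0.1 is not exactly representable; the float approximation differs
// from the double approximation once widened
val approx = 0.1F.toDouble == 0.1    // false

println(exact)   // true
println(approx)  // false
```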

We can perform many operations on numeric data types that return valuable information about the DataFrame. To learn more about numeric operations, refer to the Mathematical Operations article.

  2. String Type

StringType

A string is a sequence of characters. To explicitly store a single character as a string, use the .toString method:

'A'.toString
'B'.toString
'C'.toString
'D'.toString
'E'.toString
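
Calling .toString on a Char simply produces a one-character String, which Spark then stores as StringType, so all the usual string operations apply:

```scala
// .toString turns a Char into a one-character String
val grade: String = 'A'.toString

println(grade)          // A
println(grade.length)   // 1
println(grade + "+")    // string concatenation now applies: A+
```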

  3. Boolean Type

BooleanType

Boolean variables represent true or false values. In the sample DataFrame, the IsActive column holds:

true
false
true
false
true
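
Boolean columns are most useful for filtering. A plain-Scala sketch of the same logic Spark applies when you filter on IsActive:

```scala
// Pairs of (Name, IsActive), mirroring the sample DataFrame
val students = Seq(("Ajay", true), ("Bharghav", false), ("Chaitra", true),
                   ("Kamal", false), ("Sohaib", true))

// Keep only the active students, as df.filter(col("IsActive")) would
val active = students.filter(_._2).map(_._1)

println(active)  // List(Ajay, Chaitra, Sohaib)
```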

  4. DateTime Type

DateType

Dates in Spark are stored using DateType in the format yyyy-MM-dd. While data might initially be read as strings, you can use the to_date() function to convert them into proper date objects:

"2010-01-01"
"2009-06-04"
"2010-12-12"
"2010-08-25"
"2009-04-14"

We can perform multiple operations on date and time columns to reformat the data and derive desired results. To read more about them, refer to the Spark Date Time Operations article.


  5. Complex Type

ArrayType

Array type variables represent a collection (or list) of elements of the same data type. In Spark, ArrayType is commonly used to store lists or multiple values in a single column.

Seq("Singing", "Sudoku")
Seq("Dancing", "Painting")
Seq("Chess", "Long Walks")
Seq("Reading Books", "Cooking")
Seq("Singing", "Cooking")

Summary

In this article, we explored key data types available in Spark DataFrames:

  • Numeric Types: Including ByteType, ShortType, LongType, FloatType, and DoubleType

  • StringType: For text and character data

  • BooleanType: For logical values (true/false)

  • DateType: For representing dates in standard format

  • ArrayType: For storing lists of values in a single column

A foundational understanding of Spark DataFrame types is crucial for data preprocessing, schema management, and ensuring compatibility when working with large-scale structured data in Spark.
