Data Locality in Spark

Post category:Spark
Post comments:0 Comments
Post author:nitendratech
Post last modified:June 6, 2022

Data locality refers to how close data is to the code processing it. Having the code and the data together tends to make computations faster in Apache Spark. If the code and data are separated, either one has to move to the other. Moving the serialized code through the network is faster than moving the data through the network. As the size of the code is smaller than the data itself, this helps in the use of fewer network resources.

There are several levels of locality based on the data’s current location. Below are the locality levels in order from closest to farthest.

Table of Contents

PROCESS_LOCAL

At this level, data is in the same JVM as the running code. This is the best locality possible in Spark.

NODE_LOCAL

At this level, data is on the same node. Examples might be in HDFS on the same node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes

NO_PREF

At this level, data is accessed equally quickly from anywhere and has no locality preference

RACK_LOCAL

At this level, data is on the same rack of servers. Data is on a different server on the same rack so needs to be sent over the network, typically through a single switch

ANY

At this level, data is elsewhere on the network and not in the same rack

References

https://spark.apache.org/docs/latest/tuning.html#data-locality

Tags: Spark

PROCESS_LOCAL

NODE_LOCAL

NO_PREF

RACK_LOCAL

ANY

References

Share this:

Like this:

You Might Also Like

Spark Tutorials

Accessing Hive in HDP3 using Apache Spark

Introduction to Apache Spark Streaming

What are the types of Cluster Manager in Spark?