What are Spark Shared Variables?

Shared variables are an abstraction in Apache Spark used by parallel operations running on different nodes. When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, however, a variable needs to be shared across tasks, or between tasks and the driver program.
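For example, the following sketch (in Scala, assuming a running SparkContext named sc) shows why an ordinary variable is not enough: each task updates its own copy shipped inside the closure, and the driver never sees the changes.

  var counter = 0
  val data = sc.parallelize(1 to 100)
  // each task increments its own local copy of counter
  data.foreach(x => counter += x)
  // the driver's copy is unchanged
  println(counter)  // prints 0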

Spark supports two types of shared variables: broadcast variables and accumulators.

Accumulator Variables

Accumulators are shared variables that tasks can only “add” to through an associative and commutative operation, which makes them well suited for counters and sums, much like counters in MapReduce. After a job has completed, the accumulator’s final value can be retrieved by the driver program.
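As a rough sketch (Scala, assuming a SparkContext named sc), a numeric accumulator can serve as a distributed counter or sum:

  val sum = sc.longAccumulator("sum")               // named Long accumulator
  sc.parallelize(1 to 4).foreach(x => sum.add(x))   // tasks can only add to it
  println(sum.value)                                // the driver reads the final value: 10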

Spark natively supports two types of accumulators (see the sketches above and below):

  • Numeric value types
  • Standard mutable collections
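The numeric kind is shown above. For the collection kind, newer Spark versions expose a CollectionAccumulator; a minimal sketch, again assuming a SparkContext named sc:

  import org.apache.spark.util.CollectionAccumulator

  // collect values from tasks into a shared list readable on the driver
  val seen: CollectionAccumulator[String] = sc.collectionAccumulator[String]("seen")
  sc.parallelize(Seq("a", "b", "a")).foreach(word => seen.add(word))
  println(seen.value)   // e.g. [a, b, a] (order not guaranteed)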

Accumulators act as a lightweight, offline debugging aid in Spark, playing a role similar to Hadoop Counters: tasks increment them as they run, and the driver inspects the totals after an action has completed.
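A sketch of that debugging pattern (Scala, assuming sc; the file path and record format are hypothetical):

  val badRecords = sc.longAccumulator("badRecords")
  val parsed = sc.textFile("hdfs:///logs/events.csv").flatMap { line =>
    val fields = line.split(",")
    if (fields.length == 3) Some(fields)
    else { badRecords.add(1); None }          // count malformed lines as tasks run
  }
  parsed.count()                              // run an action so the tasks actually execute
  println(s"malformed lines: ${badRecords.value}")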

Broadcast Variables

A broadcast variable is used to cache a value in memory on all nodes. It is serialized and sent to each executor, where it is cached so that later tasks can access it if needed.

Broadcast variables play a role similar to the distributed cache in MapReduce. In Spark, the broadcast value is first stored in memory and spilled to disk only when memory is exhausted. A broadcast variable is created by passing the value to be broadcast to the broadcast() method on the SparkContext.
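A minimal sketch (Scala, assuming a SparkContext named sc; the lookup table is a hypothetical example):

  // small read-only lookup table, broadcast once and cached on every executor
  val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

  val codes = sc.parallelize(Seq("US", "DE", "US"))
  // tasks read the cached copy on their executor instead of shipping the map with every task
  val names = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
  names.collect().foreach(println)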
