User-Defined Aggregate Functions (UDAF) Using Apache Spark

Post author:nitendratech
Post category:Spark
Post comments:0 Comments
Post published:May 31, 2019

UDAF stands for User-Defined Aggregate Function. An aggregate function performs a calculation on a set of values and returns a single value. Writing an aggregate function is harder than writing a User-Defined Function (UDF) because it must combine values across multiple rows and columns. An Apache Spark UDAF operates on more than one row or column and returns a single result.
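As a rough illustration of the idea in this excerpt, the plain-Python sketch below mimics the buffer lifecycle that a custom average aggregate typically follows: start from an empty buffer, update it row by row, merge the partial buffers from each partition, then evaluate a single result. It does not use the Spark API; the names AvgBuffer, update, merge, evaluate, and aggregate_partition are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class AvgBuffer:
    """Intermediate state for an average: running sum and row count."""
    total: float = 0.0
    count: int = 0


def update(buf, value):
    # Fold one input row into the buffer.
    return AvgBuffer(buf.total + value, buf.count + 1)


def merge(a, b):
    # Combine two partial buffers, e.g. from two partitions.
    return AvgBuffer(a.total + b.total, a.count + b.count)


def evaluate(buf):
    # Produce the single output value; None when no rows were seen.
    return buf.total / buf.count if buf.count else None


def aggregate_partition(rows):
    # Aggregate all rows of one partition, starting from an empty buffer.
    return reduce(update, rows, AvgBuffer())


# Three hypothetical partitions, one of them empty.
partitions = [[10.0, 20.0], [30.0], []]
partials = [aggregate_partition(p) for p in partitions]
result = evaluate(reduce(merge, partials, AvgBuffer()))
print(result)  # 20.0
```

Keeping the sum and count in the buffer, rather than a running average, is what lets partial results from different partitions be merged correctly.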

Submit Apache Spark Job with REST API

Post author:nitendratech
Post category:Spark
Post comments:1 Comment
Post published:April 29, 2018

When working with Apache Spark, there are times when you need to trigger a Spark job on demand from within or outside the cluster. There are two ways to submit an Apache Spark job to a cluster: a bash script and the REST API.

Introduction to Apache Spark SQL

Post author:nitendratech
Post category:Spark
Post comments:1 Comment
Post published:October 10, 2017

Introduction to Apache Spark Streaming

Post author:nitendratech
Post category:Spark
Post comments:1 Comment
Post published:October 4, 2017

The Apache Spark Streaming execution model has several advantages over traditional streaming systems: fast recovery from failures, dynamic load balancing, combined streaming and interactive analytics, and native integration.

Spark Resilient Distributed Dataset (RDD)

Post author:nitendratech
Post category:Spark
Post comments:5 Comments
Post published:September 30, 2017

Resilient Distributed Dataset (RDD) is the fault-tolerant and immutable primary data structure/abstraction in Apache Spark. It is a distributed collection of objects. The term ‘resilient’ in ‘Resilient Distributed Dataset’ refers…

What are Spark Shared Variables?

Post author:nitendratech
Post category:Spark
Post comments:0 Comments
Post published:September 5, 2017

Shared variables are an abstraction in Apache Spark that is used in parallel operations across different nodes. When Spark runs a function in parallel as a set of tasks on…

What is Apache Hadoop? An In-depth Look at This Big Data Tool

Post author:nitendratech
Post category:Hadoop
Post comments:18 Comments
Post published:July 1, 2017

What exactly is Apache Hadoop? Apache Hadoop is an open-source distributed processing framework that is used to store and process large datasets whose size ranges from gigabytes to petabytes of…

What is Apache Spark? The Unified engine for large-scale data analytics.

Post author:nitendratech
Post category:Spark
Post comments:28 Comments
Post published:May 10, 2017

Apache Spark is a distributed system, optimized for both in-memory and disk-based processing, that performs real-time analytics using Resilient Distributed Datasets (RDDs). Spark includes a streaming library and a rich set of programming interfaces to make data processing and transformation easier.