What is Apache NiFi? An Introduction

Background

While working with Kafka, I did a lot of research into event-driven messaging and event-based architecture. Along the way, I came across Apache NiFi, which helps build complex data flows for distributed or Internet of Things (IoT) applications. This write-up introduces NiFi, which I expect to become a key player in IoT-based applications.

Introduction

Apache NiFi is an open-source tool for automating and managing the flow of data between systems (databases, sensors, Hadoop, data platforms, and other sources). It solves the problem of collecting and transporting data in real time from a multitude of data sources, and it provides an interactive user interface for controlling live flows, with full, automated data provenance.

NiFi is a data-source-agnostic framework. It supports disparate, distributed sources whose data differs in format, schema, protocol, speed, and size. Such sources include:

  • machines
  • geolocation devices
  • clickstreams
  • files
  • social feeds
  • log files
  • videos

NiFi is configurable plumbing for moving data around, much as FedEx, UPS, and other courier services move parcels. And just as those services let you track a delivery, NiFi lets you trace your data in real time.
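To make the parcel-tracking analogy concrete, here is a minimal sketch of a provenance log in Python. It is only an illustration of the idea, not NiFi's provenance repository or its API: the class ProvenanceLog, the item ids, and the component names are all hypothetical. The event names RECEIVE, ROUTE, and SEND are borrowed from NiFi's provenance event types.

```python
import itertools


class ProvenanceLog:
    """Hypothetical in-memory event log; not NiFi's provenance repository."""

    def __init__(self):
        # item id -> list of (sequence number, event type, component)
        self._events = {}
        self._seq = itertools.count(1)

    def record(self, item_id, event_type, component):
        """Append one event for the item, stamped with a global sequence number."""
        event = (next(self._seq), event_type, component)
        self._events.setdefault(item_id, []).append(event)

    def lineage(self, item_id):
        """Return the item's events in the order they were recorded."""
        return [(etype, comp) for _, etype, comp in self._events.get(item_id, [])]


log = ProvenanceLog()
log.record("parcel-1", "RECEIVE", "http-listener")
log.record("parcel-2", "RECEIVE", "file-reader")
log.record("parcel-1", "ROUTE", "priority-router")
log.record("parcel-1", "SEND", "warehouse-db")

# Trace a single item through the flow, like looking up a tracking number.
for event_type, component in log.lineage("parcel-1"):
    print(event_type, "at", component)
```

Every step appends an event, so at any moment you can ask where a given piece of data has been, which is roughly what a tracking number gives you for a parcel.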

NiFi is written in Java, follows the flow-based programming model, and provides a web-based user interface for managing data flows in real time. It provides the data acquisition, simple event processing, transport, and delivery mechanisms needed to accommodate the diverse data flows generated by a world of connected people, systems, and things.

The project was developed inside the United States National Security Agency (NSA) for eight years under the name NiagaraFiles. In 2014, the NSA released it as open source through the Apache Software Foundation via its technology transfer program.

NiFi helps you build dataflows: you can move data from one system to another and process it along the way.
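The idea of moving data between systems while processing it in between can be sketched in a few lines of Python. This is a conceptual model only, not NiFi code. NiFi calls the unit that travels through a flow a FlowFile (content plus key/value attributes); the FlowFile class, the two processing steps, and run_flow below are hypothetical stand-ins written for this post.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class FlowFile:
    """Illustrative stand-in for a unit of data: raw content plus attributes."""
    content: bytes
    attributes: Dict[str, str] = field(default_factory=dict)


# A processing step takes one FlowFile and returns a (possibly new) FlowFile.
Processor = Callable[[FlowFile], FlowFile]


def to_upper(ff: FlowFile) -> FlowFile:
    """Transform the content; attributes are carried along unchanged."""
    return FlowFile(ff.content.upper(), dict(ff.attributes))


def tag_size(ff: FlowFile) -> FlowFile:
    """Leave the content alone and add a metadata attribute."""
    attrs = dict(ff.attributes)
    attrs["size.bytes"] = str(len(ff.content))
    return FlowFile(ff.content, attrs)


def run_flow(source: Iterable[bytes], processors: List[Processor]) -> List[FlowFile]:
    """Wrap each raw record in a FlowFile and pass it through every step in order."""
    delivered = []
    for raw in source:
        ff = FlowFile(raw, {"origin": "demo-source"})
        for step in processors:
            ff = step(ff)
        delivered.append(ff)
    return delivered


result = run_flow([b"sensor-1 ok", b"sensor-2 fail"], [to_upper, tag_size])
for ff in result:
    print(ff.content, ff.attributes)
```

In a real NiFi flow, the steps are processors that you configure and connect in the web UI rather than functions in a list, but the shape is the same: source, processing in between, destination.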

Apache NiFi Architecture

Figure: Apache NiFi Architecture

NiFi Real World Use Case

NiFi is used for data ingestion: it pulls data from numerous data sources and wraps it in FlowFiles. It can handle extremely large data sets, small records arriving at high rates, and data of variable size, which makes it suitable for a wide variety of use cases.

NiFi vs Kafka

Both Apache NiFi and Apache Kafka act as brokers that connect producers and consumers, but they do so in quite different ways, and the two are complementary when you look holistically at what it takes to connect the enterprise.

With Kafka, the logic of the data flow lives in the systems that produce data and the systems that consume it. NiFi decouples producer and consumer further and allows as much of the dataflow logic as possible, or desired, to live in the broker itself. This is why NiFi has interactive command and control to effect immediate change, and why it offers a processor API to operate on, alter, and route data streams as they flow. It is also why NiFi provides powerful back-pressure and congestion control features. NiFi's model gives you a point of central control with distributed execution, where you can address cross-cutting concerns such as compliance checks and tracking that you would not want to implement in every producer and consumer.
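Back-pressure is easier to picture with a small simulation. The sketch below is a simplified model, not NiFi's implementation: Connection and simulate are hypothetical names, and the single object-count limit stands in for a configurable threshold on the queue between two steps. When the queue is full, the upstream step simply has to wait.

```python
from collections import deque


class Connection:
    """Hypothetical bounded queue between two processing steps."""

    def __init__(self, max_objects):
        self.max_objects = max_objects
        self._queue = deque()

    def is_full(self):
        return len(self._queue) >= self.max_objects

    def offer(self, item):
        """Enqueue the item, or refuse it (return False) when the threshold is reached."""
        if self.is_full():
            return False
        self._queue.append(item)
        return True

    def poll(self):
        """Dequeue the oldest item, or return None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def simulate(total_items, max_objects, consume_every):
    """A producer offers one item per tick; the consumer polls every `consume_every` ticks."""
    conn = Connection(max_objects)
    produced = consumed = stalled_ticks = 0
    tick = 0
    while consumed < total_items:
        tick += 1
        if produced < total_items:
            if conn.offer("item-%d" % produced):
                produced += 1
            else:
                stalled_ticks += 1  # back-pressure: the producer waits this tick
        if tick % consume_every == 0 and conn.poll() is not None:
            consumed += 1
    return {"ticks": tick, "stalled_ticks": stalled_ticks}


slow_consumer = simulate(total_items=6, max_objects=2, consume_every=2)
roomy_queue = simulate(total_items=6, max_objects=100, consume_every=2)
print(slow_consumer, roomy_queue)
```

With a small threshold, the fast producer is held back instead of flooding the slow consumer. With a large one, it never stalls, but the queue has to absorb the difference.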

Push vs Pull Data Ingestion Pattern

In Kafka's ingestion pattern, producers push data to the Kafka broker and consumers pull data from it. It is a clean and scalable model, but it requires every participating system to accept and follow the Kafka protocol. NiFi, in contrast, imposes no single protocol: it supports both push and pull patterns for getting data into and out of NiFi.
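The difference between the two patterns comes down to who initiates the transfer. Here is a minimal sketch, with hypothetical class names (PullSource, Ingestor) that are not part of NiFi or Kafka. In NiFi terms, a processor such as ListenHTTP waits for senders to push data to it, while a processor such as GetFile goes out and pulls data on a schedule.

```python
class PullSource:
    """Hypothetical source that hands out records only when asked."""

    def __init__(self, records):
        self._records = list(records)

    def fetch(self, max_records):
        """Return up to max_records records and remove them from the source."""
        batch = self._records[:max_records]
        self._records = self._records[max_records:]
        return batch


class Ingestor:
    """Hypothetical ingestion point that supports both patterns."""

    def __init__(self):
        self.received = []

    def on_push(self, record):
        """Push: the sender decides when to call us."""
        self.received.append(("push", record))

    def pull_from(self, source, max_records):
        """Pull: we decide when to ask, and how much to take."""
        batch = source.fetch(max_records)
        self.received.extend(("pull", r) for r in batch)
        return len(batch)


ingestor = Ingestor()
ingestor.on_push("sensor reading A")           # sender-initiated
files = PullSource(["file-1", "file-2", "file-3"])
pulled = ingestor.pull_from(files, max_records=2)  # receiver-initiated
print(pulled, ingestor.received)
```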

High Availability

On the data plane, NiFi does not offer distributed data durability today, as Kafka does. As Lars pointed out, the NiFi community is adding distributed durability, but it will matter less for NiFi's use cases than it does for Kafka, because NiFi isn't holding data for the arbitrary consumer pattern that Kafka supports. If a NiFi node goes down, the data on it is delayed until the node comes back. Data loss, though, can be avoided with tried-and-true RAID or distributed block storage. NiFi's control plane already provides high availability: the cluster manager, and even multiple nodes in a cluster, can be lost while the live flow continues operating normally.

Performance

Kafka offers an impressive balance of both high throughput and low latency. But comparing the performance of Kafka and NiFi is not very meaningful, given that they do very different things. It would be best to discuss performance tradeoffs in the context of a particular use case.

Programming Language Supported by Apache NiFi

NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages.
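As a rough illustration of the scripting route, here is what the body of a Jython script for NiFi's ExecuteScript processor might look like. Inside NiFi, the processor supplies the session object and the REL_SUCCESS relationship; to keep this sketch runnable on its own, it wraps the logic in a hypothetical function (tag_flowfile) and uses a hypothetical FakeSession that represents FlowFiles as plain dictionaries. Only session.get(), session.putAttribute(), and session.transfer() correspond to real NiFi session methods. Which scripting engines are available depends on your NiFi version.

```python
def tag_flowfile(session, rel_success):
    """Take one FlowFile from the session, add an attribute, and route it onward.

    Inside ExecuteScript, `session` and `REL_SUCCESS` are provided by NiFi and
    this logic would sit at the top level of the script.
    """
    flow_file = session.get()
    if flow_file is None:
        return False  # nothing queued for this run
    flow_file = session.putAttribute(flow_file, "processed.by", "script-sketch")
    session.transfer(flow_file, rel_success)
    return True


class FakeSession:
    """Test double for running the sketch outside NiFi; NOT part of NiFi."""

    def __init__(self, incoming):
        self._incoming = list(incoming)
        self.transferred = []

    def get(self):
        return self._incoming.pop(0) if self._incoming else None

    def putAttribute(self, flow_file, key, value):
        # Return an updated copy and leave the original untouched.
        updated = dict(flow_file)
        updated[key] = value
        return updated

    def transfer(self, flow_file, relationship):
        self.transferred.append((relationship, flow_file))


session = FakeSession([{"filename": "reading.json"}])
handled = tag_flowfile(session, "success")
print(handled, session.transferred)
```

For anything performance-sensitive or reusable, a Java processor is the more usual choice; scripts are handy for quick, flow-specific logic.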

Conclusion

In this blog post, we looked at what Apache NiFi is, where it is used in the real world, and how it compares with Kafka.


References:

Apache NiFi

Real World Use Cases of Real-Time DataFlows in Record Time – Hortonworks