Important Big Data Terms Everyone Should Know

Anyone who works with Big Data needs to know many concepts and terms. In this post, I will introduce some important Big Data terms that everyone working in the field should know.

A

Artificial Intelligence

Artificial Intelligence, or AI, is the ability of machines to think, learn, and act on their own. The backbone of AI is machine learning, which learns from data, identifies patterns in it, solves problems, and makes decisions. AI is used in many areas of technology, such as speech recognition, decision-making, outcome prediction, and communication with human beings.

Analytics

It is the study of data using algorithms and strategies to derive meaning from raw data.

Aggregation

It is the process of searching, gathering, and processing data so that it can be used for reporting.

Avro

Avro is a data serialization system that can be used to encode Hadoop files. Because it stores the schema alongside the data, serialized records can be parsed back later.
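
As a rough sketch of what this looks like in practice, the snippet below writes and reads a small Avro file using the third-party fastavro library (one of several Avro libraries for Python; the schema and file name here are made up for illustration):

```python
from fastavro import writer, reader, parse_schema

# A made-up record schema for illustration
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Serialize: the schema is stored in the file alongside the data
with open("users.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize: records come back as plain dicts
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```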

Audits

Good auditing practice in an organization allows the identification of data sources and of application and data errors, as well as the recognition of security events.

B

Batch Processing

Batch processing is a method in which data is collected over a period of time and then processed in bulk at regular intervals, rather than record by record as it arrives.
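
As a minimal illustration of the idea, the sketch below processes every file that has accumulated in a drop directory in one pass, as a nightly job might. The directory layout and the user,amount record format are hypothetical:

```python
from pathlib import Path

def run_nightly_batch(input_dir: str) -> dict:
    """Sum each user's transactions from all files accumulated since the last run."""
    totals: dict[str, float] = {}
    for path in sorted(Path(input_dir).glob("*.csv")):
        for line in path.read_text().splitlines():
            user, amount = line.split(",")
            totals[user] = totals.get(user, 0.0) + float(amount)
    return totals

# Typically triggered on a schedule (e.g. nightly via cron), not per record
print(run_nightly_batch("incoming/"))
```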

Big Data

Big Data is a phrase used to describe data that is so large in volume and generated so quickly that it is difficult to store and analyze using traditional programming and storage techniques.

Business Intelligence

Business Intelligence is the process of analyzing raw data to extract valuable information for improving and better understanding a business. The use of Business Intelligence in an organization helps it make accurate and fast business decisions by creating reports, charts, dashboards, and graphs from the processed data.

C

Cloud Computing

Cloud computing is a distributed computing model that uses computers accessed over a network to store and process data off-premises. Some popular cloud computing platforms are AWS, Microsoft Azure, Google Cloud, Oracle Cloud, and IBM Cloud.

Cassandra

Apache Cassandra is a distributed, open-source NoSQL (non-relational) database that is designed to handle large amounts of distributed data. Furthermore, it is a key-value store that is installed across commodity servers while providing highly available services.
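
A minimal sketch of talking to Cassandra from Python with the DataStax cassandra-driver package is shown below. It assumes a single node is running locally, and the demo keyspace and users table are invented for illustration:

```python
# Requires the DataStax driver: pip install cassandra-driver
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # contact point; assumes a local node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# CQL looks much like SQL; %s placeholders bind the parameters
session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "Alice"))

for row in session.execute("SELECT id, name FROM demo.users"):
    print(row.id, row.name)

cluster.shutdown()
```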

D

Data cleansing

It is the process of cleaning data by filling in missing values and deleting duplicate entries, making the data more consistent.
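
A small sketch of both steps using pandas (an assumption; any data-frame library would do) looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", None],
    "age": [30, 25, 25, 41],
})

df = df.drop_duplicates()                  # delete duplicate entries
df["name"] = df["name"].fillna("unknown")  # fill in missing values
print(df)
```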

Data Scientist

A data scientist is the person responsible for developing the algorithms that make sense of big data.

Database

It is the organized collection of data that is stored and retrieved digitally from a remote or local computer system.

Data Lake

It is the central data repository in an organization that stores raw data collected from different parts of that organization. A data lake can contain data from billing, finance, retail, wholesale, and so on.

Distributed System

It is a system in which an application or a process is executed across multiple computers.

Dashboard

It is a graphical representation or visualization that presents the results of data analysis at a glance. Some popular dashboarding tools are Tableau, MicroStrategy, and Power BI.

Data aggregation

It is the process of collecting, ingesting, and analyzing the data from multiple data sources for reporting or analysis purposes.
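
As a toy illustration, the pandas sketch below combines records from two hypothetical sources and aggregates them by region for a report:

```python
import pandas as pd

# Toy records pulled from two hypothetical sources
sales = pd.concat([
    pd.DataFrame({"region": ["east", "west"], "amount": [100, 150]}),
    pd.DataFrame({"region": ["east", "west"], "amount": [200, 50]}),
])

# Aggregate per region for reporting
report = sales.groupby("region")["amount"].agg(["sum", "mean"])
print(report)
```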

E

ETL

ETL stands for Extract, Transform, and Load. It is a process used in data warehousing in which data is extracted from source systems, transformed into a suitable format, and loaded into a target store, where it can later be used for reporting or analytics.
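
A minimal sketch of the three phases in plain Python, assuming a hypothetical people.csv file with name and age columns, and with SQLite standing in for the warehouse:

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from the source file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: normalize names and cast ages to integers."""
    for row in rows:
        yield (row["name"].strip().title(), int(row["age"]))

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("people.csv")), conn)
```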

G

Governance

Governance provides oversight of an organization's data assets and sets the standards and policies for its information model, data classification, and data quality processes.

H

Hadoop

Apache Hadoop is an open-source data processing framework maintained by the Apache Software Foundation. It allows for the distributed processing of large data sets across clusters of computers using a simple programming model called MapReduce.

HDFS (Hadoop Distributed File system)

HDFS is the distributed, scalable, Java-based file system used for storing large volumes of unstructured data in Hadoop. It is the storage layer of the Hadoop framework.

I

In-Memory Database

It is a type of database that relies primarily on main memory (RAM) rather than disk for data storage, which makes reads and writes much faster.
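
SQLite, which ships with Python, can run entirely in memory, which makes for a compact illustration:

```python
import sqlite3

# SQLite keeps the whole database in RAM when given ":memory:"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO kv VALUES (?, ?)", ("greeting", "hello"))
print(conn.execute("SELECT value FROM kv WHERE key = ?", ("greeting",)).fetchone())
```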

K

Kerberos

Kerberos provides Single Sign-On (SSO) via a ticket-based authentication mechanism.

M

MapReduce

It is the data processing framework in Apache Hadoop. It works in two main phases, called Map and Reduce. The Map phase divides the query into multiple parts and processes the input data at the data node level. The Reduce phase aggregates the results of the Map phase and writes the result back to HDFS. In between, a shuffle-and-sort phase sorts the data before handing it to the Reduce phase.
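
The pure-Python sketch below imitates the three phases on a toy word-count job (the classic MapReduce example); a real Hadoop job would run the same logic distributed across data nodes:

```python
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map phase: emit (word, 1) pairs for each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle-and-sort phase: group the emitted values by key
groups = defaultdict(list)
for word, count in sorted(mapped):
    groups[word].append(count)

# Reduce phase: aggregate the values for each key
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```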

Metadata

Metadata is the information that describes other data, or, simply speaking, it is data about the data. It is the descriptive, administrative, and structural data that defines a firm’s data assets.

N

NoSQL

NoSQL, or Not Only SQL, refers to databases that do not necessarily follow the ACID properties of a relational database system. Some popular NoSQL databases are Apache HBase, Apache Cassandra, Redis, MongoDB, DynamoDB, and Riak.

O

On-Premise

It is a computing model in which data is stored and processed on computers within the organization's own network, rather than off-premises in the cloud.

R

RDBMS (Relational Database Management System)

RDBMS stands for Relational Database Management System. It stores data as a collection of tables, and relationships can be defined between the common fields of these tables. Some popular RDBMSs are MySQL, Oracle Database, Oracle Exadata, Microsoft SQL Server, PostgreSQL, MariaDB, IBM Db2, and Microsoft Azure SQL Database.

S

Serialization

Serialization refers to the process of turning data structures into byte streams, either for storage or transmission over a network.
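
A minimal illustration using Python's built-in pickle module (JSON, Avro, or Protocol Buffers would serve the same purpose):

```python
import pickle

record = {"user": "alice", "scores": [98, 87]}

# Serialize: data structure -> byte stream
payload = pickle.dumps(record)

# Deserialize: byte stream -> data structure
restored = pickle.loads(payload)
assert restored == record
print(type(payload), len(payload))
```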

Sequence Files

Sequence files are Hadoop flat files that store data in a binary format with a structure similar to CSV. Like CSV, sequence files do not store metadata with the data, so the only schema evolution option is appending new fields.

Stream Processing

It is the process in which data streams are analyzed or processed in real time or near-real time, as records arrive, rather than in batches.
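
The toy sketch below maintains a running average over a simulated stream, updating after every record instead of waiting for a batch; the sensor_stream generator is a hypothetical stand-in for an unbounded source such as a Kafka topic:

```python
import random

def sensor_stream(n: int):
    """Hypothetical stand-in for an unbounded source of sensor readings."""
    for _ in range(n):
        yield random.uniform(15.0, 25.0)

# Update the result after every record instead of waiting for a batch
total, count = 0.0, 0
for reading in sensor_stream(5):
    total += reading
    count += 1
    print(f"reading={reading:.1f}  running_average={total / count:.1f}")
```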

Scalability

It is the ability of a system or application to maintain or increase its performance as the workload grows, typically by adding more resources.

SQL

SQL stands for Structured Query Language. It is the standard language used for retrieving and manipulating data in relational database management systems.
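
The sketch below runs a standard SQL aggregate query against an in-memory SQLite database; the employees table is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Alice", "eng", 95000), ("Bob", "eng", 88000), ("Carol", "hr", 70000)],
)

# A standard SQL query: average salary per department
for dept, avg in conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept"
):
    print(dept, avg)
```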

Spark

Apache Spark is an open-source, distributed processing framework, optimized for both in-memory and disk-based computation, that supports real-time analytics using Resilient Distributed Datasets (RDDs).
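
A minimal PySpark sketch of working with an RDD on a local master is shown below (assuming PySpark is installed; cluster setup is omitted):

```python
# Requires PySpark: pip install pyspark
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD is a fault-tolerant, partitioned collection processed in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 5)
print(squares.collect())  # [9, 16, 25]

sc.stop()
```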

Structured Data

It is data that has some kind of defined structure, for example, flat files such as CSV and TSV files.

U

Unstructured Data

It is data that does not have a defined schema or identifiable structure. Examples: email, audio, and video data.

Conclusion

In this blog post, we learned about the important terms one needs to understand while working in the big data domain.

Please share this blog post on social media and leave a comment with any questions or suggestions.