Introduction to Hive Query Language(HQL)

Apache Hive is a data warehouse infrastructure project built on top of Apache Hadoop for providing data summarization, ad-hoc queries, data aggregation, and analysis of datasets. The Hive Query Language (HiveQL) is a query language for Hive to process and analyze structured data stored in Apache Hadoop and perform analysis on top of it.

Changing the default Metastore in Hive

Even though the Derby database is the default metastore in the hive, we can change it by editing hive-site.xml in the hive installation directory.

<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value></property>
<property><name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://<ip address of mysqlmachine>/hive?createDatabaseIfNotExist=true</value>
</property>
<property><name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value></property>
<property><name>javax.jdo.option.ConnectionUserName</name>
<value>username</value></property>
<property><name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
</configuration>

Hive Queries

Create Database

First, create the database testdb if does not exist and list the databases.

-- Create Database if does not exists
CREATE DATABASE IF NOT EXISTS testdb;

-- Use the newly craete Database
USE testdb;

-- List all the databases
SHOW DATABASES;

Create Database

SELECT Unique Records in Hive

Given below Sample Table

 hive> SELECT * FROM employees LIMIT 10;
+--------+------------+------------+-----------+--------+------------+
| emp_no | birth_date | first_name | last_name | gender | hire_date  |
+--------+------------+------------+-----------+--------+------------+
|  10001 | 1953-09-02 | Georgi     | Facello   | M      | 1986-06-26 |
|  10002 | 1964-06-02 | Bezalel    | Simmel    | F      | 1985-11-21 |
|  10003 | 1959-12-03 | Parto      | Bamford   | M      | 1986-08-28 |
|  10004 | 1954-05-01 | Chirstian  | Koblick   | M      | 1986-12-01 |
|  10005 | 1955-01-21 | Kyoichi    | Maliniak  | M      | 1989-09-12 |
|  10006 | 1953-04-20 | Anneke     | Preusig   | F      | 1989-06-02 |
|  10007 | 1957-05-23 | Tzvetan    | Zielinski | F      | 1989-02-10 |
|  10008 | 1958-02-19 | Saniya     | Kalloufi  | M      | 1994-09-15 |
|  10009 | 1952-04-19 | Sumant     | Peac      | F      | 1985-02-18 |
|  10010 | 1963-06-01 | Duangkaew  | Piveteau  | F      | 1989-08-24 |
+--------+------------+------------+-----------+--------+------------+
SELECT DISTINCT birth_date,first_name,last_name,gender,hire_date
FROM employees 
GROUP BY last_name;

Upload Data in HDFS

Upload the drivers.csv and timesheet data into HDFS(/user/maria-dev/) from the below data source

Data Source

Now Upload the Files

Upload File First Step
Upload File Second Step

Create an Internal Table for the Driver’s Data

As the database is recently created, no tables exist in the new database. The following query will result in an empty result.

-- Show all the Tables
SHOW TABLES;

Show tables

Now create a Table using the Hive DDL statement.

-- Create Table drivers if does not exists

CREATE TABLE IF NOT EXISTS drivers_int(driverId INT ,name STRING ,ssn BIGINT, location STRING ,certified STRING ,wageplan STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');

SHOW TABLES;

In this DDL statement notice the following things

  • FIELDS TERMINATED BY ','
    It means that the data for this table will be separated by a comma or the data will be in CSV format
  • STORED AS TEXTFILE
  • ROW FORMAT DELIMITED
  • TBLPROPERTIES

Table Definition in Hive

describe <table name>; 

Describe table schema in detail

describe extended <tablename>; 
-- Describes Table
-- describe extended <tablename>; 
DESCRIBE EXTENDED drivers_int;
Describe Table

Load the data by overwriting the existing data

-- Insert the drivers_int table with the drivers.csv data by overwriting existing data 
LOAD DATA INPATH '/user/maria_dev/drivers.csv' OVERWRITE INTO TABLE drivers_int;

--List the data from Tables
SELECT * FROM temp_drivers LIMIT 10; 

Load Data

After executing LOAD DATA we can see the table drivers_int was populated with data from drivers.csv.

Note that Hive consumed the data file drivers.csv during this step. If you look in the File Browser, you will see drivers.csv is no longer there.

As we are creating an internal table, data from HDFS will be moved to the Hive warehouse.

Create a table for the driver timesheet

-- Create Table TimeSheet
CREATE TABLE IF NOT EXISTS timesheet_int( driverId INT ,week INT ,hours_logged INT ,miles_loggged INT)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES('skip.header.line.count' = '1');

SHOW TABLES;
Create Table TimeSheet

Load CSV data into the Table timesheet

-- Load data into Time Sheet Table
LOAD DATA INPATH '/user/maria_dev/timesheet.csv' OVERWRITE INTO TABLE timesheet_int;

-- List the data from TimeSheet Table
SELECT * FROM timesheet_int LIMIT 10;

Load Data TimeSheet

Find the sum of hours and miles logged score for a year for each driver

SELECT driverId ,sum(hours_logged) ,sum(miles_loggged) FROM timesheet_int GROUP BY driverId LIMIT 10;

Drivers hourly mile run

Join the two tables to get the driver Name for aggregated data

-- Join the two tables to get the aggregrated hours and miles data along with the respective driver name

SELECT d.driverId, d.name ,t.total_hours ,t.total_miles
FROM drivers_int d
JOIN
(SELECT driverId ,sum(hours_logged) total_hours ,sum(miles_loggged) total_miles
 FROM timesheet_int
 GROUP BY driverId) t
 
 ON (d.driverId = t.driverId)
 LIMIT 10;

Join two tables

Change Column Data Type in Hive

ALTER TABLE <Table Name> CHANGE <column name> <column name> <New Data Type>;

//Example
ALTER TABLE Employee CHANGE salary salary BIGINT;

Rename a Table in Hive

ALTER TABLE hive_table_name RENAME TO new_table_name;

Retrieve Maximum Salary in Hive Table

Let us suppose we have a table employee consisting of name and salary

  • Second-Highest SalarySELECT * FROM (SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_no FROM employee GROUP BY SALARY) res WHERE res.row_no = 2 ;
  • Third-Highest Salary
SELECT * FROM 
(SELECT salary ,ROW_NUMBER() over (ORDER BY salary) as row_num FROM 
employee GROUP BY SALARY) res
WHERE res.row_num = 3;

Reference

[1] Data Processing with Hive