Learn Hadoop and NoSQL: Hive

Source and further read: programming hive by edward capriolo

Hive Port: 3306

Chapter 2 - Getting Started:

Command line Interface:
Variables and properties:
$ hive --define foo=barhive> set foo;foo=bar;
hive> set hivevar:foo;hivevar:foo=bar;
note: hivevar is the namespace
hive> create table toss1(i int, ${hivevar:foo} string);hive> describe toss1;
i intbar2 string
hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2;

i2 intbar2 string

hive --hiveconf hive.cli.print.current.db=truehive> set hive.cli.print.current.db=true;
hive (default)>

$ hive --hiveconf y=5

hive> set y;

y=5

hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};

$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";

Note: The query processor will see the literal number 2012 in the WHERE clause.

$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery

-S silent mode - ignore Ok and time taken

$ hive -S -e "set" | grep warehouse

hive.metastore.warehouse.dir=/user/hive/warehouse

hive.warehouse.subdir.inherit.perms=false

$ hive -f /path/to/file/withqueries.hql

$ hive

hive> source /path/to/file/withqueries.hql;

hive> SELECT xpath(\'<a><b id="foo">b1</b><b id="bar">b2</b></a>\',\'//@id\')> FROM src LIMIT 1;[foo","bar]

Also the source table must have at least one row of content in it:

$ echo "one row" > /tmp/myfile

$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;

Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands.

The following shows an example of a typical $HOME/.hiverc file:

ADD JAR /path/to/custom_hive_extensions.jar;

set hive.cli.print.current.db=true;

set hive.exec.mode.local.auto=true;

Autocomplete is supported in hive.

Command History

Hive saves the last 100,00 lines into a file $HOME/.hivehistory.

Shell Execution

You don’t need to leave the hive CLI to run simple bash shell commands. Simplytype ! followed by the command and terminate the line with a semicolon (;)

hive> ! /bin/echo "what up dog";

"what up dog"

hive> ! pwd;

/home/me/hiveplay

Hadoop dfs Commands from Inside Hive

hive> dfs -ls / ;

Found 3 items

drwxr-xr-x - root supergroup 0 2011-08-17 16:27 /etl
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users

Comments in Hive Scripts

SELECT * FROM massive_table;

The CLI does not parse these comment lines. If you paste them into the CLI, you’ll get errors. They only work when used in scripts executed with hive -f script_name.
Query Column Headers

hive> set hive.cli.print.header=true;

hive> SELECT * FROM system_logs LIMIT 3;

Chapter 3: Data types and File Formats

Primitive types: Tinyint, smallint, int, bigint, Boolean, float, double, string, Timestamp, Binary
Collection data types: Array, Map (Key-Value ""), Struct

TIMESTAMPS are interpreted as UTC times. Unix epoch time starts on 01-Jan-1970

Built-in functions for conversion to and from timezones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respectively.

Example:

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

The BINARY type is similar to the VARBINARY type found in many relational databases.

Hive will implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE.

Hive doesn’t have the concept of keys. However, you can index tables

the default file format is TEXTFILE

Schema on Read

Chapter 4 (HiveQL: Data Definition)

The Hive concept of a database is essentially just a catalog or namespace of tables.

hive> CREATE DATABASE IF NOT EXISTS financials;

You can also use the keyword SCHEMA instead of DATABASE in all the database-related
commands.

hive> SHOW DATABASES;
default
financials

hive> SHOW DATABASES LIKE 'h.*';

Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.

hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';

hive> DESCRIBE DATABASE financials;

financials Holds all financial tables

hdfs://master-server/user/hive/warehouse/financials.db

hive> USE financials;

hive> DROP DATABASE IF EXISTS financials;

hive> DROP DATABASE IF EXISTS financials CASCADE;

CASCADE keyword to the command, which will cause the Hive to drop the tables in the database first:

When a database is dropped, its directory is also deleted.

Alter Database

No other metadata about the database can be changed, including its name and directory location

Creating Tables:

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees

hive> SHOW TABLES;

hive> SHOW TABLES IN mydb;

hive> DESCRIBE EXTENDED mydb.employees;

Replacing EXTENDED with FORMATTED provides more readable but also more verbose output.

Managed Tables:

The tables we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

When we drop a managed table (see “Dropping Tables” on page 66), Hive deletes
the data in the table. However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, but we
want to run some queries against it, but not give Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.

External Tables

Let’s assume the data files are in the distributed filesystem directory /data/stocks. The following table declaration creates an external table that can read all the data files for this comma-delimited data in /data/stocks:

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.

Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables. Hive may technically own these directories and files, but it doesn’t have full control over them! Recall, in “Schema
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.

As for managed tables, you can also copy the schema (but not the data) of an existing
table:

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';

If you omit the EXTERNAL keyword and the original table is external, the new table will also be external. If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you
include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.

Partitioned, Managed Tables

Hive will now create subdirectories reflecting the partitioning structure. For
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK

users of the table don’t need to care if these “columns” are partitions or not, except when they want to optimize query performance.

the following query selects all employees in the state of Illinois in the United States:

SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';

Note that because the country and state values are encoded in directory names, there is no reason to have this data in the data files themselves

When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters.

However, a query across all partitions could trigger an enormous MapReduce job if the table data and number of partitions are large.

A highly suggested safety measure is putting Hive into “strict” mode, which prohibits queries of partitioned tables without a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the
following session:

hive> set hive.mapred.mode=strict;

hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"

hive> set hive.mapred.mode=nonstrict;

hive> SELECT e.name, e.salary FROM employees e LIMIT 100;

You can see the partitions that exist with the SHOW PARTITIONS command:
hive> SHOW PARTITIONS employees;
...
Country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK

hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK

The following example creates a US and CA (California) partition while loading data into it from a local
directory, $HOME/california-employees. You must specify a value for each partition column. Notice how we reference the HOME environment variable in HiveQL:

LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

External Partitioned Tables

CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Recall that when we created the nonpartitioned external stocks table, a LOCATION … clause was required. It isn’t used for external partitioned tables. Instead, an ALTER TABLE statement is used to add each partition separately. It must specify a value for each partition key, the year, month, and day,

ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

For example, each day we might use the following procedure to move data older than a month to S3:

• Copy the data for the partition being moved to S3. For example, you can use the hadoop distcp command:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02

• Alter the table to point the partition to the S3 location:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';

• Remove the HDFS copy of the partition using the hadoop fs -rmr command:

hadoop fs -rmr /data/log_messages/2011/01/02

As for managed partitioned tables, you can see an external table’s partitions with SHOW
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2

ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this feature
with managed tables.

**Customizing Table Storage Formats

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

You can replace TEXTFILE with one of the other built-in file formats supported by Hive,
including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O
bandwidth performance using binary encoding and optional compression

CLUSTERED BY SORY

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks'

The CLUSTERED BY … INTO … BUCKETS clause, with an optional SORTED BY … clause is used to optimize certain kinds of queries

**Dropping Tables

DROP TABLE IF EXISTS employees;

Actually, if you enable the Hadoop Trash feature, which is not on by
default, the data is moved to the .Trash directory in the distributed
filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable
this feature, set the property fs.trash.interval to a reasonable positive
number. It’s the number of minutes between “trash checkpoints”; 1,440
would be 24 hours. While it’s not guaranteed to work for all versions of
all distributed filesystems, if you accidentally drop a managed table with
important data, you may be able to re-create the table, re-create any
partitions, and then move the files from .Trash to the correct directories
(using the filesystem commands) to restore the data.

For external tables, the metadata is deleted but the data is not.

**Alter Table
ALTER TABLE modifies table metadata only. The data for the table is
untouched

**Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;

**Adding, Modifying, and Dropping a Table Partition

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'

Similarly, you can change a partition location, effectively moving it:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';

This command does not move the data from the old location, nor does it delete the old
data.

Finally, you can drop a partition:
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day = 2);
The IF EXISTS clause is optional, as usual. For managed tables, the data for the partition
is deleted, along with the metadata, even if the partition was created using ALTER TABLE
… ADD PARTITION.
For external tables, the data is not deleted.

You can rename a column, change its position, type, or comment:

ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type
is not changing.

As always, this command changes metadata only. If you are moving columns, the data
must already match the new schema or you must change it to match by some other
means.

**Adding Columns
You can add new columns to the end of the existing columns, before any partition
columns.

ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id LONG COMMENT 'The current session id');

**Deleting or Replacing Columns

The following example removes all the existing columns and replaces them with the
new columns specified:

ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity'
message STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and
process_id columns from the original schema definition. As for all ALTER statements,
only the table metadata is changed.

**Alter Storage Properties
The following statement changes the storage format for a partition to be SEQUENCE
FILE,

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

The PARTITION clause is required if the table is partitioned.
You can specify a new SerDe along with SerDe properties or change the properties for
the existing SerDe

The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a Hadoop
archive (HAR) file.
This only reduces the number of files in the filesystem, reducing
the load on the NameNode, but doesn’t provide any space savings (e.g., through
compression):

ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);

To reverse the operation, substitute UNARCHIVE for ARCHIVE.
This feature is only available
for individual partitions of partitioned tables.

The following statements prevent the partition
from being dropped and queried:

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;

To reverse either operation, replace ENABLE with DISABLE. These operations also can’t
be used with nonpartitioned tables.

****HiveQL: Data Manipulation

**Loading Data into Managed Tables

LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.

It is conventional practice to specify a path that is a directory, rather than an individual
file. Hive will copy all the files in the directory,

If the LOCAL keyword is used, the path is assumed to be in the local filesystem. The data
is copied into the final location. If LOCAL is omitted, the path is assumed to be in the
distributed filesystem. In this case, the data is moved from the path to the final location.

For distributed or pseudo-distributed mode, the path
is interpreted relative to the user’s home directory in the distributed filesystem, which
is /user/$USER by default in HDFS and MapRFS.

The PARTITION clause is required if the table is partitioned and you must specify a value
for each partition key.

the data will now exist in the following directory:
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=CA

Hive does not verify that the data you are loading matches the schema for the table.
However, it will verify that the file format matches the table definition. For example,
if the table was created with SEQUENCEFILE storage, the loaded files must be sequence
files.

**Inserting Data into Tables from Queries

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

This example suggests one common scenario where this feature is useful: data has been
staged in a directory, exposed to Hive as an external table, and now you want to put it
into the final, partitioned table. A workflow like this is also useful if you want the target
table to have a different record format than the source table.

Hive offers an
alternative INSERT syntax that allows you to scan the input data once and split it multiple
ways. The following example shows this feature for creating the employees partitions
for three states:

FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

If a record satisfied a given SELECT … WHERE … clause, it gets written to the specified table
and partition. To be clear, each INSERT clause can insert into a different table, when
desired, and some of those tables could be partitioned while others aren’t.

**Dynamic Partition Inserts

INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;

Hive determines the values of the partition keys, country and state, from the last two
columns in the SELECT clause

Suppose that staged_employees has data for a total of 100 country and state pairs. After
running this query, employees will have 100 partitions!

You can also mix dynamic and static partitions. This variation of the previous query
specifies a static value for the country (US) and a dynamic value for the state:

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';

Dynamic partitioning is not enabled by default. When it is enabled, it works in “strict”
mode by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions. For example,
you partition by timestamp and generate a separate partition for each second!
Perhaps you meant to partition by day or maybe hour instead. Several other properties
are also used to limit excess resource utilization.

So, for example, our first example using dynamic partitioning for all partitions might
actually look this, where we set the desired properties just before use:

hive> set hive.exec.dynamic.partition=true;

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> set hive.exec.max.dynamic.partitions.pernode=1000;

hive> INSERT OVERWRITE TABLE employees
> PARTITION (country, state)
> SELECT ..., se.cty, se.st
> FROM staged_employees se;

**Creating Tables and Loading Them in One Query

CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';

This feature can’t be used with external tables

Recall that “populating” a partition for
an external table is done with an ALTER TABLE statement, where we aren’t “loading”
data, per se, but pointing metadata to a location where the data can be found.

**Exporting Data

hadoop fs -cp source_path target_path

Otherwise, you can use INSERT … DIRECTORY …, as in this example:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';

Just like inserting data to tables, you can specify multiple inserts to directories:

FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'IL';

**** HiveQL: Queries
SELECT name, salary FROM employees;

hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []

The deductions is a MAP, where the JSON representation for maps is used, namely a
comma-separated list of key:value pairs, surrounded with {...}:

hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}

Finally, the address is a STRUCT, which is also written using the JSON map format:

hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

hive> SELECT name, subordinates[0] FROM employees;

hive> SELECT name, deductions["State Taxes"] FROM employees;

hive> SELECT name, address.city FROM employees;

** Specify Columns with Regular Expressions

The following query
selects the symbol column and all columns from stocks whose names start with the
prefix price:1

hive> SELECT symbol, `price.*` FROM stocks;

** Computing with Column Values

SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees;

** MATH Functions

round(d)
round(d, N)
floor(d)
ceil(d), ceiling(DOUBLE d)
rand(), rand(seed)
exp(d)
ln(d)
log10(d)
log2(d)
log(base, d)
pow(d, p), power(d, p)
sqrt(d)
bin(i)
hex(i)
hex(str)
unhex(i)
conv(i, from_base, to_base)
conv(str, from_base,
to_base)
abs(d)
pmod(i1, i2)
pmod(d1, d2)
sin(d)
asin(d)
cos(d)
acos(d)
tan(d)
atan(d)
degrees(d)
radians(d)
positive(i)
positive(d)
negative(i)
negative(d)
sign(d)
e()
pi()

AGGREGATE Functions

count(*)
count(expr)
count(DISTINCT expr[, expr_.])
sum(col)
sum(DISTINCT col)
avg(col)
avg(DISTINCT col)
min(col)
max(col)
variance(col), var_pop(col)
var_samp(col)
stddev_pop(col)
stddev_samp(col)
covar_pop(col1, col2)
covar_samp(col1, col2)

You can usually improve the performance of aggregation by setting the following property
to true, hive.map.aggr, as shown here:

hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;

This setting will attempt to do “top-level” aggregation in the map phase, as in this
example. (An aggregation that isn’t top-level would be aggregation after performing a
GROUP BY.) However, this setting will require more memory.

** Other built-in functions

test in(val1, val2, …)
length(s)
reverse(s)
concat(s1, s2, …)
concat_ws(separator, s1, s2,
…)
substr(s, start_index)
substr(s, int start, int
length)
upper(s)
ucase(s)
lower(s)
lcase(s)
trim(s)
ltrim(s)
rtrim(s)
regexp_replace(s, regex,
replacement)
regexp_extract(subject,
regex_pattern, index)
parse_url(url, partname, key)
size(map<K.V>)
size(array<T>)
cast(<expr> as <type>)
from_unixtime(int unixtime)
to_date(timestamp)
year(timestamp)
month(timestamp)
day(timestamp)
get_json_object(json_string,path)
space(n)
repeat(s, n)
ascii(s)
lpad(s, len, pad)
rpad(s, len, pad)
split(s, pattern)
find_in_set(s, commaSeparated
String)
locate(substr, str, pos])
instr(str, substr)
str_to_map(s, delim1, delim2)
sentences(s, lang, locale)
ngrams(array<array<string>>,
N, K, pf)
con
text_ngrams(array<array<string>>,array<string>,int K, intpf)
in_file(s, filename)

**Other Query Samples

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees
> LIMIT 2;

COlumn Alias:

SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees LIMIT 2;

Nested SELECT Statements:

hive> FROM (
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;

CASE … WHEN … THEN Statements:
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;

**When Hive Can Avoid MapReduce:

Hive implements some kinds of queries
without using MapReduce, in so-called local mode, for example:

SELECT * FROM employees;

In this case, Hive can simply read the records from employees and dump the formatted
output to the console.

This even works for WHERE clauses that only filter on partition keys, with or without
LIMIT clauses

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;

Trust us, you want to add set hive.exec.mode.local.auto=true;
to your $HOME/.hiverc file

** WHERE Clauses

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';

hive> SELECT name, salary, deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;

hive> SELECT e.* FROM
> (SELECT name, salary, deductions["Federal Taxes"] as ded,
> salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
> FROM employees) e
> WHERE round(e.salary_minus_fed_taxes) > 70000;

**Gotchas with Floating-Point Comparisons

hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;

John Doe 100000.0 0.2
Mary Smith 80000.0 0.2

Actually, it doesn’t work. Here’s why. The number 0.2 can’t be represented exactly in
a FLOAT or DOUBLE

To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and
0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after
the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive,
it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001.
That’s why the query results appear to use >= not >

Workaround:
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
> WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);

**LIKE and RLIKE

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';

A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more
powerful minilanguage for specifying matches

hive> SELECT name, address.street
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';

** GROUP BY..HAVING Clauses

hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;

**JOIN Statements:

INNER JOIN:

hive> SELECT a.ymd, a.price_close, b.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';

**Join Optimizations:

hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';

In above example, every ON clause uses a.ymd as one of the join keys. In this case,
Hive can apply an optimization where it joins all three tables in a single MapReduce
job.

When joining three or more tables, if every ON clause uses the same join
key, a single MapReduce job will be used

Hive also assumes that the last table in the query is the largest. It attempts to buffer the
other tables and then stream the last table through, while performing joins on individual
records. Therefore, you should structure your join queries so the largest table is last.

**LEFT OUTER JOIN

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';

**RIGHT OUTER JOIN Gotcha

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';

** FULL Outer join

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';

** LEFT SEMI-JOIN:

A left semi-join returns records from the lefthand table if records are found in the righthand
table that satisfy the ON predicates.

SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);

**Cartesian Product JOINs
SELECTS * FROM stocks JOIN dividends;

Learn Hadoop and NoSQL

Tuesday, September 1, 2015

Hive Quick Notes from Programming Hive book