Source and further read: programming hive by edward capriolo
Hive Port: 3306
Chapter 2 - Getting Started:
Chapter 2 - Getting Started:
Command line Interface:
Variables and properties:
$ hive --define foo=barhive> set foo;foo=bar;
hive> set hivevar:foo;hivevar:foo=bar;
note: hivevar is the namespace
hive> create table toss1(i int, ${hivevar:foo} string);hive> describe toss1;
i intbar2 string
hive> create table toss2(i2 int, ${foo} string);
Variables and properties:
$ hive --define foo=barhive> set foo;foo=bar;
hive> set hivevar:foo;hivevar:foo=bar;
note: hivevar is the namespace
hive> create table toss1(i int, ${hivevar:foo} string);hive> describe toss1;
i intbar2 string
hive> create table toss2(i2 int, ${foo} string);
hive> describe toss2;
i2 intbar2 string
hive --hiveconf hive.cli.print.current.db=truehive> set hive.cli.print.current.db=true;
hive (default)>
hive --hiveconf hive.cli.print.current.db=truehive> set hive.cli.print.current.db=true;
hive (default)>
$ hive --hiveconf y=5
hive> set y;
y=5
hive> SELECT * FROM whatsit WHERE
i = ${hiveconf:y};
$ YEAR=2012 hive -e "SELECT *
FROM mytable WHERE year = ${env:YEAR}";
Note: The query processor will see the
literal number 2012 in the WHERE clause.
$ hive -S -e "select * FROM
mytable LIMIT 3" > /tmp/myquery
-S silent mode - ignore Ok and time
taken
$ hive -S -e "set" | grep
warehouse
hive.metastore.warehouse.dir=/user/hive/warehouse
hive.warehouse.subdir.inherit.perms=false
$ hive -f
/path/to/file/withqueries.hql
$ hive
hive> source
/path/to/file/withqueries.hql;
hive> SELECT xpath(\'<a><b
id="foo">b1</b><b
id="bar">b2</b></a>\',\'//@id\')> FROM src LIMIT 1;[foo","bar]
Also the source table must have at
least one row of content in it:
$ echo "one row" >
/tmp/myfile
$ hive -e "LOAD DATA LOCAL
INPATH '/tmp/myfile' INTO TABLE src;
Hive automatically looks for a file
named .hiverc in your HOME directory and runs the commands.
The following shows an example of a
typical $HOME/.hiverc file:
ADD JAR
/path/to/custom_hive_extensions.jar;
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
Autocomplete is supported in hive.
Command History
Hive saves the last 100,00 lines
into a file $HOME/.hivehistory.
Shell Execution
You don’t need to leave the hive
CLI to run simple bash shell commands. Simplytype ! followed by the command and
terminate the line with a semicolon (;)
hive> ! /bin/echo "what up
dog";
"what up dog"
hive> ! pwd;
/home/me/hiveplay
Hadoop dfs Commands from Inside
Hive
hive> dfs -ls / ;
Found 3 items
drwxr-xr-x - root supergroup 0
2011-08-17 16:27 /etl
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users
Comments in Hive Scripts
-- Copyright (c) 2012 Megacorp, LLC.-- This is the best Hive script
evar!!
SELECT * FROM massive_table;
The CLI does not parse these comment
lines. If you paste them into the CLI, you’ll get errors. They only
work when used in scripts executed with hive -f script_name.
Query Column Headers
Query Column Headers
hive> set
hive.cli.print.header=true;
hive> SELECT * FROM system_logs
LIMIT 3;
Chapter 3: Data types and File Formats
Chapter 3: Data types and File Formats
Primitive types: Tinyint, smallint, int, bigint, Boolean, float, double, string, Timestamp, Binary
Collection data types: Array, Map (Key-Value ""), Struct
TIMESTAMPS are interpreted as UTC times. Unix epoch time starts on 01-Jan-1970
Built-in functions for conversion to and from timezones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respectively.
Collection data types: Array, Map (Key-Value ""), Struct
TIMESTAMPS are interpreted as UTC times. Unix epoch time starts on 01-Jan-1970
Built-in functions for conversion to and from timezones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respectively.
Example:
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
The BINARY type is similar to the VARBINARY type found in many relational databases.
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
The BINARY type is similar to the VARBINARY type found in many relational databases.
Hive will implicitly cast any
integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE.
Hive doesn’t have the concept
of keys. However, you can index tables
the default file format is TEXTFILE
Schema on Read
Chapter 4 (HiveQL: Data Definition)
The Hive concept of a database is
essentially just a catalog or namespace of tables.
hive> CREATE DATABASE IF NOT
EXISTS financials;
You can also use the keyword SCHEMA
instead of DATABASE in all the database-related
commands.
commands.
hive> SHOW DATABASES;
default
financials
default
financials
hive> SHOW DATABASES LIKE 'h.*';
Hive will create a directory for
each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.
hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';
> LOCATION '/my/preferred/directory';
hive> DESCRIBE DATABASE
financials;
financials Holds all financial
tables
hdfs://master-server/user/hive/warehouse/financials.db
hive> USE financials;
hive> DROP DATABASE IF EXISTS
financials;
hive> DROP DATABASE IF EXISTS
financials CASCADE;
CASCADE keyword to the command,
which will cause the Hive to drop the tables in the
database first:
When a database is dropped, its
directory is also deleted.
Alter Database
No other metadata about the database can be changed, including its name and directory location
Creating Tables:
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';
CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees
hive>
SHOW
TABLES;
hive>
SHOW
TABLES
IN
mydb;
hive>
DESCRIBE
EXTENDED
mydb.employees;
Replacing EXTENDED
with FORMATTED
provides more readable but
also more verbose output.
Managed Tables:
The tables we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
Managed Tables:
The tables we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.
When we drop a managed table (see “Dropping Tables” on page 66), Hive deletes
the data in the table. However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, but we
want to run some queries against it, but not give Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.
the data in the table. However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, but we
want to run some queries against it, but not give Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.
External Tables
Let’s assume the data files are in the distributed filesystem directory /data/stocks. The following table declaration creates an external table that can read all the data files for this comma-delimited data in /data/stocks:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.
Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables. Hive may technically own these directories and files, but it doesn’t have full control over them! Recall, in “Schema
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.
As for managed tables, you can also copy the schema (but not the data) of an existing
table:
table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
LIKE mydb.employees
LOCATION '/path/to/data';
If you omit the EXTERNAL keyword and the original table is external, the new table will also be external. If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you
include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.
Partitioned, Managed Tables
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
Hive will now create subdirectories reflecting the partitioning structure. For
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK
users of the table don’t need to care if these “columns” are partitions or not, except when they want to optimize query performance.
the following query selects all employees in the state of Illinois in the United States:
SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
WHERE country = 'US' AND state = 'IL';
Note that because the country and state values are encoded in directory names, there is no reason to have this data in the data files themselves
When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters.
However, a query across all partitions could trigger an enormous MapReduce job if the table data and number of partitions are large.
A highly suggested safety measure is putting Hive into “strict” mode, which prohibits queries of partitioned tables without a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the
following session:
following session:
hive> set hive.mapred.mode=strict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
You can see the partitions that exist with the SHOW PARTITIONS command:
hive> SHOW PARTITIONS employees;
...
Country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
hive> SHOW PARTITIONS employees;
...
Country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK
hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK
The following example creates a US and CA (California) partition while loading data into it from a local
directory, $HOME/california-employees. You must specify a value for each partition column. Notice how we reference the HOME environment variable in HiveQL:
directory, $HOME/california-employees. You must specify a value for each partition column. Notice how we reference the HOME environment variable in HiveQL:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
External Partitioned Tables
CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Recall that when we created the nonpartitioned external stocks table, a LOCATION … clause was required. It isn’t used for external partitioned tables. Instead, an ALTER TABLE statement is used to add each partition separately. It must specify a value for each partition key, the year, month, and day,
ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';
For example, each day we might use the following procedure to move data older than a month to S3:
• Copy the data for the partition being moved to S3. For example, you can use the hadoop distcp command:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
• Alter the table to point the partition to the S3 location:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
• Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/01/02
As for managed partitioned tables, you can see an external table’s partitions with SHOW
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2
ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this feature
with managed tables.
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this feature
with managed tables.
**Customizing Table Storage Formats
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
You can replace TEXTFILE with one of the other built-in file formats supported by Hive,
including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O
bandwidth performance using binary encoding and optional compression
including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O
bandwidth performance using binary encoding and optional compression
CLUSTERED BY SORY
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks'
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks'
The CLUSTERED BY … INTO … BUCKETS clause, with an optional SORTED BY … clause is used to optimize certain kinds of queries
**Dropping Tables
DROP TABLE IF EXISTS employees;
Actually, if you enable the Hadoop Trash feature, which is not on by
default, the data is moved to the .Trash directory in the distributed
filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable
this feature, set the property fs.trash.interval to a reasonable positive
number. It’s the number of minutes between “trash checkpoints”; 1,440
would be 24 hours. While it’s not guaranteed to work for all versions of
all distributed filesystems, if you accidentally drop a managed table with
important data, you may be able to re-create the table, re-create any
partitions, and then move the files from .Trash to the correct directories
(using the filesystem commands) to restore the data.
default, the data is moved to the .Trash directory in the distributed
filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable
this feature, set the property fs.trash.interval to a reasonable positive
number. It’s the number of minutes between “trash checkpoints”; 1,440
would be 24 hours. While it’s not guaranteed to work for all versions of
all distributed filesystems, if you accidentally drop a managed table with
important data, you may be able to re-create the table, re-create any
partitions, and then move the files from .Trash to the correct directories
(using the filesystem commands) to restore the data.
For external tables, the metadata is deleted but the data is not.
**Alter Table
ALTER TABLE modifies table metadata only. The data for the table is
untouched
ALTER TABLE modifies table metadata only. The data for the table is
untouched
**Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;
ALTER TABLE log_messages RENAME TO logmsgs;
**Adding, Modifying, and Dropping a Table Partition
ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'
Similarly, you can change a partition location, effectively moving it:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
SET LOCATION 's3n://ourbucket/logs/2011/01/02';
This command does not move the data from the old location, nor does it delete the old
data.
data.
Finally, you can drop a partition:
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day = 2);
The IF EXISTS clause is optional, as usual. For managed tables, the data for the partition
is deleted, along with the metadata, even if the partition was created using ALTER TABLE
… ADD PARTITION.
For external tables, the data is not deleted.
You can rename a column, change its position, type, or comment:
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;
You have to specify the old name, a new name, and the type, even if the name or type
is not changing.
is not changing.
As always, this command changes metadata only. If you are moving columns, the data
must already match the new schema or you must change it to match by some other
means.
must already match the new schema or you must change it to match by some other
means.
**Adding Columns
You can add new columns to the end of the existing columns, before any partition
columns.
You can add new columns to the end of the existing columns, before any partition
columns.
ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id LONG COMMENT 'The current session id');
app_name STRING COMMENT 'Application name',
session_id LONG COMMENT 'The current session id');
**Deleting or Replacing Columns
The following example removes all the existing columns and replaces them with the
new columns specified:
new columns specified:
ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity'
message STRING COMMENT 'The rest of the message');
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity'
message STRING COMMENT 'The rest of the message');
This statement effectively renames the original hms column and removes the server and
process_id columns from the original schema definition. As for all ALTER statements,
only the table metadata is changed.
process_id columns from the original schema definition. As for all ALTER statements,
only the table metadata is changed.
**Alter Storage Properties
The following statement changes the storage format for a partition to be SEQUENCE
FILE,
The following statement changes the storage format for a partition to be SEQUENCE
FILE,
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;
The PARTITION clause is required if the table is partitioned.
You can specify a new SerDe along with SerDe properties or change the properties for
the existing SerDe
You can specify a new SerDe along with SerDe properties or change the properties for
the existing SerDe
The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a Hadoop
archive (HAR) file.
This only reduces the number of files in the filesystem, reducing
the load on the NameNode, but doesn’t provide any space savings (e.g., through
compression):
archive (HAR) file.
This only reduces the number of files in the filesystem, reducing
the load on the NameNode, but doesn’t provide any space savings (e.g., through
compression):
ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);
PARTITION(year = 2012, month = 1, day = 1);
To reverse the operation, substitute UNARCHIVE for ARCHIVE.
This feature is only available
for individual partitions of partitioned tables.
The following statements prevent the partition
from being dropped and queried:
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;
ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;
To reverse either operation, replace ENABLE with DISABLE. These operations also can’t
be used with nonpartitioned tables.
be used with nonpartitioned tables.
****HiveQL: Data Manipulation
**Loading Data into Managed Tables
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');
This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.
then copy the data to it.
It is conventional practice to specify a path that is a directory, rather than an individual
file. Hive will copy all the files in the directory,
file. Hive will copy all the files in the directory,
If the LOCAL keyword is used, the path is assumed to be in the local filesystem. The data
is copied into the final location. If LOCAL is omitted, the path is assumed to be in the
distributed filesystem. In this case, the data is moved from the path to the final location.
is copied into the final location. If LOCAL is omitted, the path is assumed to be in the
distributed filesystem. In this case, the data is moved from the path to the final location.
For distributed or pseudo-distributed mode, the path
is interpreted relative to the user’s home directory in the distributed filesystem, which
is /user/$USER by default in HDFS and MapRFS.
is interpreted relative to the user’s home directory in the distributed filesystem, which
is /user/$USER by default in HDFS and MapRFS.
The PARTITION clause is required if the table is partitioned and you must specify a value
for each partition key.
for each partition key.
the data will now exist in the following directory:
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=CA
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=CA
Hive does not verify that the data you are loading matches the schema for the table.
However, it will verify that the file format matches the table definition. For example,
if the table was created with SEQUENCEFILE storage, the loaded files must be sequence
files.
However, it will verify that the file format matches the table definition. For example,
if the table was created with SEQUENCEFILE storage, the loaded files must be sequence
files.
**Inserting Data into Tables from Queries
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
This example suggests one common scenario where this feature is useful: data has been
staged in a directory, exposed to Hive as an external table, and now you want to put it
into the final, partitioned table. A workflow like this is also useful if you want the target
table to have a different record format than the source table.
staged in a directory, exposed to Hive as an external table, and now you want to put it
into the final, partitioned table. A workflow like this is also useful if you want the target
table to have a different record format than the source table.
Hive offers an
alternative INSERT syntax that allows you to scan the input data once and split it multiple
ways. The following example shows this feature for creating the employees partitions
for three states:
alternative INSERT syntax that allows you to scan the input data once and split it multiple
ways. The following example shows this feature for creating the employees partitions
for three states:
FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';
If a record satisfied a given SELECT … WHERE … clause, it gets written to the specified table
and partition. To be clear, each INSERT clause can insert into a different table, when
desired, and some of those tables could be partitioned while others aren’t.
and partition. To be clear, each INSERT clause can insert into a different table, when
desired, and some of those tables could be partitioned while others aren’t.
**Dynamic Partition Inserts
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
Hive determines the values of the partition keys, country and state, from the last two
columns in the SELECT clause
columns in the SELECT clause
Suppose that staged_employees has data for a total of 100 country and state pairs. After
running this query, employees will have 100 partitions!
running this query, employees will have 100 partitions!
You can also mix dynamic and static partitions. This variation of the previous query
specifies a static value for the country (US) and a dynamic value for the state:
specifies a static value for the country (US) and a dynamic value for the state:
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
Dynamic partitioning is not enabled by default. When it is enabled, it works in “strict”
mode by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions. For example,
you partition by timestamp and generate a separate partition for each second!
Perhaps you meant to partition by day or maybe hour instead. Several other properties
are also used to limit excess resource utilization.
mode by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions. For example,
you partition by timestamp and generate a separate partition for each second!
Perhaps you meant to partition by day or maybe hour instead. Several other properties
are also used to limit excess resource utilization.
So, for example, our first example using dynamic partitioning for all partitions might
actually look this, where we set the desired properties just before use:
actually look this, where we set the desired properties just before use:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.exec.max.dynamic.partitions.pernode=1000;
hive> INSERT OVERWRITE TABLE employees
> PARTITION (country, state)
> SELECT ..., se.cty, se.st
> FROM staged_employees se;
> PARTITION (country, state)
> SELECT ..., se.cty, se.st
> FROM staged_employees se;
**Creating Tables and Loading Them in One Query
CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
AS SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
This feature can’t be used with external tables
Recall that “populating” a partition for
an external table is done with an ALTER TABLE statement, where we aren’t “loading”
data, per se, but pointing metadata to a location where the data can be found.
an external table is done with an ALTER TABLE statement, where we aren’t “loading”
data, per se, but pointing metadata to a location where the data can be found.
**Exporting Data
hadoop fs -cp source_path target_path
Otherwise, you can use INSERT … DIRECTORY …, as in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
SELECT name, salary, address
FROM employees
WHERE se.state = 'CA';
Just like inserting data to tables, you can specify multiple inserts to directories:
FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'IL';
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cty = 'US' and se.st = 'IL';
**** HiveQL: Queries
SELECT name, salary FROM employees;
hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []
The deductions is a MAP, where the JSON representation for maps is used, namely a
comma-separated list of key:value pairs, surrounded with {...}:
comma-separated list of key:value pairs, surrounded with {...}:
hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Finally, the address is a STRUCT, which is also written using the JSON map format:
hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
hive> SELECT name, subordinates[0] FROM employees;
hive> SELECT name, deductions["State Taxes"] FROM employees;
hive> SELECT name, address.city FROM employees;
** Specify Columns with Regular Expressions
The following query
selects the symbol column and all columns from stocks whose names start with the
prefix price:1
selects the symbol column and all columns from stocks whose names start with the
prefix price:1
hive> SELECT symbol, `price.*` FROM stocks;
** Computing with Column Values
SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
** MATH Functions
round(d)
round(d, N)
floor(d)
ceil(d), ceiling(DOUBLE d)
rand(), rand(seed)
exp(d)
ln(d)
log10(d)
log2(d)
log(base, d)
pow(d, p), power(d, p)
sqrt(d)
bin(i)
hex(i)
hex(str)
unhex(i)
conv(i, from_base, to_base)
conv(str, from_base,
to_base)
abs(d)
pmod(i1, i2)
pmod(d1, d2)
sin(d)
asin(d)
cos(d)
acos(d)
tan(d)
atan(d)
degrees(d)
radians(d)
positive(i)
positive(d)
negative(i)
negative(d)
sign(d)
e()
pi()
round(d, N)
floor(d)
ceil(d), ceiling(DOUBLE d)
rand(), rand(seed)
exp(d)
ln(d)
log10(d)
log2(d)
log(base, d)
pow(d, p), power(d, p)
sqrt(d)
bin(i)
hex(i)
hex(str)
unhex(i)
conv(i, from_base, to_base)
conv(str, from_base,
to_base)
abs(d)
pmod(i1, i2)
pmod(d1, d2)
sin(d)
asin(d)
cos(d)
acos(d)
tan(d)
atan(d)
degrees(d)
radians(d)
positive(i)
positive(d)
negative(i)
negative(d)
sign(d)
e()
pi()
AGGREGATE Functions
count(*)
count(expr)
count(DISTINCT expr[, expr_.])
sum(col)
sum(DISTINCT col)
avg(col)
avg(DISTINCT col)
min(col)
max(col)
variance(col), var_pop(col)
var_samp(col)
stddev_pop(col)
stddev_samp(col)
covar_pop(col1, col2)
covar_samp(col1, col2)
count(expr)
count(DISTINCT expr[, expr_.])
sum(col)
sum(DISTINCT col)
avg(col)
avg(DISTINCT col)
min(col)
max(col)
variance(col), var_pop(col)
var_samp(col)
stddev_pop(col)
stddev_samp(col)
covar_pop(col1, col2)
covar_samp(col1, col2)
You can usually improve the performance of aggregation by setting the following property
to true, hive.map.aggr, as shown here:
to true, hive.map.aggr, as shown here:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;
hive> SELECT count(*), avg(salary) FROM employees;
This setting will attempt to do “top-level” aggregation in the map phase, as in this
example. (An aggregation that isn’t top-level would be aggregation after performing a
GROUP BY.) However, this setting will require more memory.
example. (An aggregation that isn’t top-level would be aggregation after performing a
GROUP BY.) However, this setting will require more memory.
** Other built-in functions
test in(val1, val2, …)
length(s)
reverse(s)
concat(s1, s2, …)
concat_ws(separator, s1, s2,
…)
substr(s, start_index)
substr(s, int start, int
length)
upper(s)
ucase(s)
lower(s)
lcase(s)
trim(s)
ltrim(s)
rtrim(s)
regexp_replace(s, regex,
replacement)
regexp_extract(subject,
regex_pattern, index)
parse_url(url, partname, key)
size(map<K.V>)
size(array<T>)
cast(<expr> as <type>)
from_unixtime(int unixtime)
to_date(timestamp)
year(timestamp)
month(timestamp)
day(timestamp)
get_json_object(json_string,path)
space(n)
repeat(s, n)
ascii(s)
lpad(s, len, pad)
rpad(s, len, pad)
split(s, pattern)
find_in_set(s, commaSeparated
String)
locate(substr, str, pos])
instr(str, substr)
str_to_map(s, delim1, delim2)
sentences(s, lang, locale)
ngrams(array<array<string>>,
N, K, pf)
con
text_ngrams(array<array<string>>,array<string>,int K, intpf)
in_file(s, filename)
length(s)
reverse(s)
concat(s1, s2, …)
concat_ws(separator, s1, s2,
…)
substr(s, start_index)
substr(s, int start, int
length)
upper(s)
ucase(s)
lower(s)
lcase(s)
trim(s)
ltrim(s)
rtrim(s)
regexp_replace(s, regex,
replacement)
regexp_extract(subject,
regex_pattern, index)
parse_url(url, partname, key)
size(map<K.V>)
size(array<T>)
cast(<expr> as <type>)
from_unixtime(int unixtime)
to_date(timestamp)
year(timestamp)
month(timestamp)
day(timestamp)
get_json_object(json_string,path)
space(n)
repeat(s, n)
ascii(s)
lpad(s, len, pad)
rpad(s, len, pad)
split(s, pattern)
find_in_set(s, commaSeparated
String)
locate(substr, str, pos])
instr(str, substr)
str_to_map(s, delim1, delim2)
sentences(s, lang, locale)
ngrams(array<array<string>>,
N, K, pf)
con
text_ngrams(array<array<string>>,array<string>,int K, intpf)
in_file(s, filename)
**Other Query Samples
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees
> LIMIT 2;
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees
> LIMIT 2;
COlumn Alias:
SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees LIMIT 2;
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees LIMIT 2;
Nested SELECT Statements:
hive> FROM (
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;
> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;
CASE … WHEN … THEN Statements:
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
**When Hive Can Avoid MapReduce:
Hive implements some kinds of queries
without using MapReduce, in so-called local mode, for example:
without using MapReduce, in so-called local mode, for example:
SELECT * FROM employees;
In this case, Hive can simply read the records from employees and dump the formatted
output to the console.
output to the console.
This even works for WHERE clauses that only filter on partition keys, with or without
LIMIT clauses
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;
WHERE country = 'US' AND state = 'CA'
LIMIT 100;
Trust us, you want to add set hive.exec.mode.local.auto=true;
to your $HOME/.hiverc file
to your $HOME/.hiverc file
** WHERE Clauses
SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';
WHERE country = 'US' AND state = 'CA';
hive> SELECT name, salary, deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
hive> SELECT e.* FROM
> (SELECT name, salary, deductions["Federal Taxes"] as ded,
> salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
> FROM employees) e
> WHERE round(e.salary_minus_fed_taxes) > 70000;
**Gotchas with Floating-Point Comparisons
hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;
John Doe 100000.0 0.2
Mary Smith 80000.0 0.2
Mary Smith 80000.0 0.2
Actually, it doesn’t work. Here’s why. The number 0.2 can’t be represented exactly in
a FLOAT or DOUBLE
a FLOAT or DOUBLE
To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and
0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after
the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive,
it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001.
That’s why the query results appear to use >= not >
0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after
the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive,
it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001.
That’s why the query results appear to use >= not >
Workaround:
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
> WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
> WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
**LIKE and RLIKE
hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';
A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more
powerful minilanguage for specifying matches
powerful minilanguage for specifying matches
hive> SELECT name, address.street
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
** GROUP BY..HAVING Clauses
hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;
**JOIN Statements:
INNER JOIN:
hive> SELECT a.ymd, a.price_close, b.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
**Join Optimizations:
hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';
In above example, every ON clause uses a.ymd as one of the join keys. In this case,
Hive can apply an optimization where it joins all three tables in a single MapReduce
job.
Hive can apply an optimization where it joins all three tables in a single MapReduce
job.
When joining three or more tables, if every ON clause uses the same join
key, a single MapReduce job will be used
key, a single MapReduce job will be used
Hive also assumes that the last table in the query is the largest. It attempts to buffer the
other tables and then stream the last table through, while performing joins on individual
records. Therefore, you should structure your join queries so the largest table is last.
other tables and then stream the last table through, while performing joins on individual
records. Therefore, you should structure your join queries so the largest table is last.
**LEFT OUTER JOIN
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';
**RIGHT OUTER JOIN Gotcha
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
** FULL Outer join
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
> FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';
** LEFT SEMI-JOIN:
A left semi-join returns records from the lefthand table if records are found in the righthand
table that satisfy the ON predicates.
table that satisfy the ON predicates.
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
**Cartesian Product JOINs
SELECTS * FROM stocks JOIN dividends;