Saturday, November 14, 2015

Sqoop Commands Complete Reference



sqoop version


sqoop help

sqoop list-databases --connect jdbc:mysql://localhost/ --username root --password ram

 

First-time error:
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
16/01/15 22:57:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/01/15 22:57:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/01/15 22:57:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
 


Error 1:
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.

Fix 1:
gedit .bashrc
export HCAT_HOME=${HIVE_HOME}/hcatalog/
export PATH=$HCAT_HOME/bin:$PATH

source .bashrc

Error 2:
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

Fix 2:
1. Download the MySQL connector here: http://dev.mysql.com/downloads/file/?id=459312
2. Extract the .tar.gz and copy the .jar file into the /usr/lib/sqoop/lib folder (see the commands below).
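
For example (a hedged sketch; the connector version and file names below are assumptions and depend on the archive you actually downloaded):

tar -xzf mysql-connector-java-5.1.38.tar.gz
# the version number is an assumption; use the jar that came out of your archive
sudo cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /usr/lib/sqoop/lib/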

sqoop list-tables --connect jdbc:mysql://localhost/sales --username root --password ram
MySQL: setting up the environment

mysql -u root -p
create database sales;
use sales;
create table employee (empno int, empname varchar(50), empcity varchar(50));
insert into employee values (1001, 'Karthik', 'Chennai');
insert into employee values (1002, 'Saravanan', 'NY');
select * from employee;


sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee

 
ERROR tool.ImportTool: Error during import: No primary key could be found for table employee.
Please specify one with --split-by or perform a sequential import with '-m 1'.

sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001,Karthik,Chennai
1002,Saravanan,NY

When you run the same import again:
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://localhost:9000/user/ram/employee already exists

Appending the records in hdfs:

 
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1 --append
or
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --autoreset-to-one-mapper --append

ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:13:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         39 2015-11-14 09:13 /user/ram/employee/part-m-00001


sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --split-by empno --append
15/11/14 09:17:45 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=267100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=221
        HDFS: Number of bytes written=39
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=12800
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=12800
        Total vcore-seconds taken by all map tasks=12800
        Total megabyte-seconds taken by all map tasks=13107200
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=221
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=142
        CPU time spent (ms)=2880
        Physical memory (bytes) snapshot=363220992
        Virtual memory (bytes) snapshot=1681674240
        Total committed heap usage (bytes)=191889408
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=39


ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:19:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         39 2015-11-14 09:13 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         21 2015-11-14 09:17 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         18 2015-11-14 09:17 /user/ram/employee/part-m-00003
ram@hadoop:~$

Note: two files were created (part-m-00002 and part-m-00003) because two mappers were executed.

sqoop --options-file sqoopconf.conf

sqoopconf.conf
list-tables
--connect
jdbc:mysql://localhost/sales
--username
root
--password-file
/usr/local/hadoop/password.txt

Error while loading password file: The provided password file .password does not exist!

ram@hadoop:~$ cat .password
ram
ram@hadoop:~$

Note: save the password in a file in the user's home directory with 400 permissions and specify the path to that file using the --password-file argument (see the sketch below).
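
For example (a hedged sketch; the HDFS path /user/ram/.password is an assumption, and Sqoop resolves --password-file through the Hadoop FileSystem API, so prefix a local path with file:// if you do not want to keep the file in HDFS):

echo -n "ram" > $HOME/.password      # -n avoids a trailing newline in the password
chmod 400 $HOME/.password
hadoop fs -put $HOME/.password /user/ram/.password   # assumed HDFS home directory

sqoop list-tables --connect jdbc:mysql://localhost/sales \
     --username root \
     --password-file /user/ram/.password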

<<Need to check>>


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram --table employee \
     --split-by empno --append \
     --target-dir /user/ram/employee_result \
     --query "select empno, empname from employee" \
    --where "where empno=1001"

Error: Cannot specify --query and --table together.


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --split-by empno --append \
     --target-dir /user/ram/employee_result \
     --query "select empno, empname from employee" \
    --where "where empno=1001"

Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select empno, empname from employee] must contain '$CONDITIONS' in WHERE clause.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result \
     --table employee \
     --split-by empno \
     --append \
      --where " empno=1001 and empname = 'karthik' "

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result \
     --table employee \
     --columns "empno, empname" \
     --split-by empno \
     --append \
      --where " empno=1001 and empname = 'karthik' "

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query 'select empno, empname from employee where empname =  and $CONDITIONS' \
     --num-mappers 1 \
     --append

Free-form Query Imports

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query "select empno, empname from employee where empno < 100 and \$CONDITIONS " \
     --num-mappers 1 \
     --append


Note:

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query 'select empno, empname from employee where $CONDITIONS ' \
     --append \
     --num-mappers 1 



sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --num-mappers 1 \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n'

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 22:05 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 22:05 /user/ram/employee/part-m-00000
ram@hadoop:~$


ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
1002|Saravanan|NY
ram@hadoop:~$

Boundary Query and Null Handling:

Boundary Query:

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.

Null handling:
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.


mysql> insert into employee (empno, empname) values (1003,'kavitha');
Query OK, 1 row affected (0.07 sec)

mysql> select * from employee;
+-------+-----------+---------+
| empno | empname   | empcity |
+-------+-----------+---------+
|  1001 | Karthik   | Chennai |
|  1002 | Saravanan | NY      |
|  1003 | kavitha   | NULL    |
+-------+-----------+---------+
3 rows in set (0.00 sec)


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --boundary-query "select min(empno), max(empno) from employee" \
     --split-by empno \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n' \
     --null-string '\\N' \
     --null-non-string '\\N'
--null-string '\\N': The string to be interpreted as null for string columns.
--null-non-string '\\N' :  The string to be interpreted as null for non string columns.

note:
select min(empno), max(empno) from employee -- will work
select min (empno), max (empno) from employee -- will throw function not found error


ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 22:29 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-14 22:29 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 22:29 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         16 2015-11-14 22:29 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00002
1003|kavitha|\N

Relaxed Isolation:

By default Sqoop uses the read committed isolation level.
Use --relaxed-isolation to use read uncommitted (not supported by all RDBMSs).

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --boundary-query "select min(empno), max(empno) from employee" \
     --split-by empno \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n' \
     --null-string '\\N' \
     --null-non-string '\\N' \
     --relaxed-isolation
ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-m-*
1001|Karthik|Chennai
1002|Saravanan|NY
1003|kavitha|\N
ram@hadoop:~$ 




 --as-avrodatafile:

Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-avrodatafile

ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"namc =Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"name":"empcity","type":["null","string"],"default":null,"columnName":"empcity","sqlType":"12"}],"tableName":"employee"} � ���'h�S�T���� $ Saravanan NY � ���'h�S�T����Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"name":"empcity","type":["null","string"],"default":null,"columnName":"empcity","sqlType":"12"}],"tableName":"employee"}򐘩N�&��~�c�H�, kavitha򐘩N�&��~�c�H�,ram@hadoop:~$


$ hadoop dfs -text /user/ram/employee/part-*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$


 --as-sequencefile:
SequenceFiles are a binary format that stores individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-sequencefile

ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
SEQ !org.apache.hadoop.io.LongWritablemployee;��|ǀ�@E�A�'qf �KarthikChennaiSEQ !org.apache.hadoop.io.LongWritablemployee�
�      Saravanan NYSEQ !org.apache.hadoop.io.LongWritablemployeecb�[N���sc �<� �kavitha

Further reference: reading a sequence file via a Java program:
http://hadooptutorial.info/reading-and-writing-sequencefile-example/

Compression:
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-avrodatafile \
     --compress
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:11 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup        481 2015-11-14 23:11 /user/ram/employee/part-m-00000.avro
-rw-r--r--   1 ram supergroup        478 2015-11-14 23:11 /user/ram/employee/part-m-00001.avro
-rw-r--r--   1 ram supergroup        473 2015-11-14 23:11 /user/ram/employee/part-m-00002.avro
ram@hadoop:~$

ram@hadoop:~$ hadoop dfs -text /user/ram/employee/part*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --compress \
    --compression-codec org.apache.hadoop.io.compress.BZip2Codec

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:20 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         62 2015-11-14 23:20 /user/ram/employee/part-m-00000.bz2
-rw-r--r--   1 ram supergroup         62 2015-11-14 23:20 /user/ram/employee/part-m-00001.bz2
-rw-r--r--   1 ram supergroup         59 2015-11-14 23:20 /user/ram/employee/part-m-00002.bz2
ram@hadoop:~$

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Error: native snappy library not available: SnappyCompressor has not been loaded.
<<how to download and configure snappy:>>



--direct: use the direct connector if one exists for the database.

For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --direct

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:32 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/part-m*
-rw-r--r--   1 ram supergroup         21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00002
ram@hadoop:~$


incremental imports

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --direct    \
     --incremental lastmodified \
     --last-value 0 \
     --check-column empno
Note: --delete-target-dir cannot be used with incremental imports.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --table employee \
     --split-by empno \
     --direct \
     --check-column empno \
     --incremental lastmodified \
     --last-value 0
Error: java.lang.RuntimeException: Column type is neither timestamp nor date!
note: --incremental lastmodified requires a date or timestamp check column; empno is an int, so use --incremental append instead.


sqoop import --connect jdbc:mysql://localhost/sales \
    --username root --password ram \
    --table employee \
    --split-by empno \
    --direct \
    --check-column empno \
    --incremental append \
    --last-value 0

15/11/15 05:19:26 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/11/15 05:19:26 INFO tool.ImportTool:  --incremental append
15/11/15 05:19:26 INFO tool.ImportTool:   --check-column empno
15/11/15 05:19:26 INFO tool.ImportTool:   --last-value 1003
15/11/15 05:19:26 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')



ram@hadoop:~$ hadoop fs -ls /user/ram/
Found 4 items
drwxr-xr-x   - ram supergroup          0 2015-11-15 05:19 /user/ram/_sqoop
drwxr-xr-x   - ram supergroup          0 2015-11-15 05:19 /user/ram/employee
drwxr-xr-x   - ram supergroup          0 2015-11-14 16:10 /user/ram/employee_result
drwxr-xr-x   - ram supergroup          0 2015-11-14 16:28 /user/ram/employee_result2

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 3 items
-rw-r--r--   1 ram supergroup         21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*
1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
ram@hadoop:~$


Insert 2 more records:

mysql> insert into employee values (1004, 'ramkumar','NJ');
Query OK, 1 row affected (0.07 sec)

mysql> insert into employee values (1005, 'gopal','NJ');
Query OK, 1 row affected (0.05 sec)


sqoop import --connect jdbc:mysql://localhost/sales \
    --username root --password ram \
    --table employee \
    --split-by empno \
    --direct \
    --check-column empno \
    --incremental append \
    --last-value 1003
15/11/15 05:27:21 INFO tool.ImportTool:  --incremental append
15/11/15 05:27:21 INFO tool.ImportTool:   --check-column empno
15/11/15 05:27:21 INFO tool.ImportTool:   --last-value 1005
15/11/15 05:27:21 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 5 items
-rw-r--r--   1 ram supergroup         21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         17 2015-11-15 05:27 /user/ram/employee/part-m-00003
-rw-r--r--   1 ram supergroup         14 2015-11-15 05:27 /user/ram/employee/part-m-00004

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*
1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
1004,ramkumar,NJ
1005,gopal,NJ
ram@hadoop:~$


Create and execute Sqoop jobs:

sqoop job --create employee_bulk_import \
     -- import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno

sqoop job --list
sqoop job --show employee_bulk_import
sqoop job --exec employee_bulk_import
sqoop job --delete employee_bulk_import
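
Note: if the job was created with --password, sqoop job --exec will prompt for the database password on every run, because Sqoop does not store passwords in its job metastore by default. A hedged alternative, assuming a password file was created as shown earlier (the path and the job name employee_bulk_import2 are illustrative), is to define the job with --password-file:

sqoop job --create employee_bulk_import2 \
     -- import \
     --username root \
     --password-file /user/ram/.password \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno

sqoop job --exec employee_bulk_import2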




Hive Import


- If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

if the Hive table already exists, you can specify the --hive-overwrite

Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.

Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string

It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.


Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data

Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.

You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values. 

The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments.

You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat.
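
A hedged sketch of such an import (assumes the hadoop-lzo libraries are installed on the cluster and registered with Hadoop; hive_employee_lzo is just an illustrative table name):

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --table employee \
     --split-by empno \
     --hive-import \
     --hive-table hive_employee_lzo \
     --compress \
     --compression-codec com.hadoop.compression.lzo.LzopCodec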
 


Note: External tables/partitions are not supported
Note: Partitions are created under the default location
Note: Existing Hive table can’t be truncated before the import
The --delete-target-dir option is not respected.


sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-table=hive_employee


15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Transferred 88 bytes in 35.049 seconds (2.5108 bytes/sec)
15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Retrieved 5 records.
15/11/15 06:17:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
15/11/15 06:17:35 INFO hive.HiveImport: Loading uploaded data into Hive
15/11/15 06:17:54 INFO hive.HiveImport: Time taken: 2.839 seconds
15/11/15 06:17:54 INFO hive.HiveImport: Loading data to table default.hive_employee
15/11/15 06:17:55 INFO hive.HiveImport: Table default.hive_employee stats: [numFiles=5, totalSize=88]
15/11/15 06:17:55 INFO hive.HiveImport: Hive import complete.
15/11/15 06:17:55 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
ram@hadoop:~$

ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employee
Found 5 items
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004

ram@hadoop:~$ hadoop fs -cat /user/hive/warehouse/hive_employee/part-m*
1001 Karthik Chennai
1002 Saravanan NY
1003 kavitha NULL
1004 ramkumar NJ
1005 gopal NJ
ram@hadoop:~$

ram@hadoop:~$ hive -e "select * from hive_employee"
OK
1001    Karthik    Chennai
1002    Saravanan    NY
1003    kavitha    NULL
1004    ramkumar    NJ
1005    gopal    NJ
Time taken: 2.727 seconds, Fetched: 5 row(s)

ram@hadoop:~$ hive -e "select * from hive_employee where empname = 'gopal'"
1005    gopal    NJ
Time taken: 3.182 seconds, Fetched: 1 row(s)
ram@hadoop:~$


What happens when you run the import again:
The --delete-target-dir option is not respected.

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-table=hive_employee


ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employee
Found 10 items
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00000_copy_1
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00001_copy_1
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00002_copy_1
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00003_copy_1
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00004_copy_1
 

ram@hadoop:~$ hive -e "select * from hive_employee"
OK
1001    Karthik    Chennai
1001    Karthik    Chennai
1002    Saravanan    NY
1002    Saravanan    NY
1003    kavitha    NULL
1003    kavitha    NULL
1004    ramkumar    NJ
1004    ramkumar    NJ
1005    gopal    NJ
1005    gopal    NJ
Time taken: 2.56 seconds, Fetched: 10 row(s)
ram@hadoop:~$



sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite




Note: --hive-overwrite will delete all the files in the table's warehouse folder (here, the employee folder) before loading the data.

hive -e "drop table employee"

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite \
     --hive-drop-import-delims  \
     --hive-partition-key empno \
    --hive-partition-value 1002
ERROR: Direct import currently do not support dropping hive delimiters, please remove parameter --hive-drop-import-delims.


hive -e "drop table employee"

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite \
     --hive-partition-key "empcity" \
     --hive-partition-value "NJ"

<<this creates files under /user/ram/employee. need to check>>

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
15/11/15 15:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 6 items
-rw-r--r--   1 ram supergroup          0 2015-11-15 15:34 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-15 15:34 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 15:34 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 15:34 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         17 2015-11-15 15:34 /user/ram/employee/part-m-00003
-rw-r--r--   1 ram supergroup         14 2015-11-15 15:34 /user/ram/employee/part-m-00004



Friday, November 13, 2015

Installing MySQL on Ubuntu 14.04


Simple Steps:

1. Install MySQL by entering the command below:
sudo apt-get install mysql-server

Note: you will be asked to set the root password during the installation.

2. We are done. MySQL is installed.

3. Now let's log in to MySQL and do some basic CRUD operations.

mysql -u root -p
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

4.  

mysql> create table employee (empno int, empname varchar(50), empcity varchar(50));
ERROR 1046 (3D000): No database selected
mysql> create database sales
    -> ;
Query OK, 1 row affected (0.00 sec)


mysql> use sales
Database changed
mysql> create table employee (empno int, empname varchar(50), empcity varchar(50));
Query OK, 0 rows affected (0.10 sec)

mysql> insert into employee values (1001, 'Karthik', 'Chennai');
Query OK, 1 row affected (0.06 sec)

mysql> insert into employee values (1002, 'Saravanan', 'NY');
Query OK, 1 row affected (0.06 sec)

mysql> select * from employee;
+-------+-----------+---------+
| empno | empname   | empcity |
+-------+-----------+---------+
|  1001 | Karthik   | Chennai |
|  1002 | Saravanan | NY      |
+-------+-----------+---------+
2 rows in set (0.00 sec)

mysql> exit
Aborted
ram@hadoop:~$ 

How to install Sqoop 1.4.6 on Hadoop 2.7.1 in Ubuntu 14.04


Steps:
1. Download the Sqoop .tar.gz file from the path below:
http://sqoop.apache.org/

2. Extract the .tar.gz file; you will get the Sqoop setup folder.

3. Move the Sqoop setup folder to /usr/lib/sqoop:
sudo mv hadoop/sqoop-1.4.6.bin__hadoop-0.23 /usr/lib/sqoop

4. Add the lines below to the .bashrc file:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

5. Run the command below to reload .bashrc:
source .bashrc

6. Check the Sqoop version:
sqoop version

15/11/13 22:36:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Sqoop 1.4.6
git commit id c0c5a81723759fa575844a0a1eae8f510fa32c25
Compiled by root on Mon Apr 27 14:20:17 CST 2015



Tuesday, September 1, 2015

Hive Quick Notes from Programming Hive book


Source and further reading: Programming Hive by Edward Capriolo
Hive metastore database (MySQL) port: 3306

Chapter 2 - Getting Started:
Command line Interface:
Variables and properties:
$ hive --define foo=bar
hive> set foo;
foo=bar;
hive> set hivevar:foo;
hivevar:foo=bar;
note: hivevar is the namespace
hive> set hivevar:foo=bar2;
hive> create table toss1(i int, ${hivevar:foo} string);
hive> describe toss1;
i int
bar2 string
hive> create table toss2(i2 int, ${foo} string);

hive> describe toss2;

i2 int
bar2 string

$ hive --hiveconf hive.cli.print.current.db=true
hive> set hive.cli.print.current.db=true;
hive (default)>

$ hive --hiveconf y=5

hive> set y;

y=5

hive> SELECT * FROM whatsit WHERE i = ${hiveconf:y};
$ YEAR=2012 hive -e "SELECT * FROM mytable WHERE year = ${env:YEAR}";

Note: The query processor will see the literal number 2012 in the WHERE clause.


$ hive -S -e "select * FROM mytable LIMIT 3" > /tmp/myquery

-S silent mode - ignore Ok and time taken
$ hive -S -e "set" | grep warehouse

hive.metastore.warehouse.dir=/user/hive/warehouse

hive.warehouse.subdir.inherit.perms=false
$ hive -f /path/to/file/withqueries.hql

$ hive

hive> source /path/to/file/withqueries.hql;
hive> SELECT xpath(\'<a><b id="foo">b1</b><b id="bar">b2</b></a>\',\'//@id\')
    > FROM src LIMIT 1;
["foo","bar"]
Also the source table must have at least one row of content in it:

$ echo "one row" > /tmp/myfile

$ hive -e "LOAD DATA LOCAL INPATH '/tmp/myfile' INTO TABLE src;"

Hive automatically looks for a file named .hiverc in your HOME directory and runs the commands. 
The following shows an example of a typical $HOME/.hiverc file:

ADD JAR /path/to/custom_hive_extensions.jar;

set hive.cli.print.current.db=true;

set hive.exec.mode.local.auto=true;
Autocomplete is supported in hive.
Command History

Hive saves the last 100,000 lines into a file, $HOME/.hivehistory.
Shell Execution

You don’t need to leave the hive CLI to run simple bash shell commands. Simply type ! followed by the command and terminate the line with a semicolon (;).
hive> ! /bin/echo "what up dog";

"what up dog"

hive> ! pwd;

/home/me/hiveplay

Hadoop dfs Commands from Inside Hive

hive> dfs -ls / ;

Found 3 items

drwxr-xr-x - root supergroup 0 2011-08-17 16:27 /etl
drwxr-xr-x - edward supergroup 0 2012-01-18 15:51 /flag
drwxrwxr-x - hadoop supergroup 0 2010-02-03 17:50 /users

Comments in Hive Scripts


-- Copyright (c) 2012 Megacorp, LLC.
-- This is the best Hive script evar!!

SELECT * FROM massive_table;

The CLI does not parse these comment lines. If you paste them into the CLI, you’ll get errors. They only work when used in scripts executed with hive -f script_name.
Query Column Headers

hive> set hive.cli.print.header=true;

hive> SELECT * FROM system_logs LIMIT 3;

Chapter 3: Data types and File Formats

Primitive types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY
Collection data types: ARRAY, MAP (key-value pairs), STRUCT

TIMESTAMPS are interpreted as UTC times. Unix epoch time starts on 01-Jan-1970

Built-in functions for conversion to and from timezones are provided by Hive, to_utc_timestamp and from_utc_timestamp, respectively.
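For example (a hedged sketch; the second argument is a timezone name, and the exact output depends on the Hive version and timezone data):

hive> SELECT to_utc_timestamp('2015-11-14 03:00:00', 'America/New_York');
2015-11-14 08:00:00
hive> SELECT from_utc_timestamp('2015-11-14 08:00:00', 'America/New_York');
2015-11-14 03:00:00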
Example:
CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';


The BINARY type is similar to the VARBINARY type found in many relational databases.
Hive will implicitly cast any integer to the larger of the two integer types, cast FLOAT to DOUBLE, and cast any integer value to DOUBLE.
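For example (a hedged sketch using the employees table above, where salary is FLOAT; comparing it with a DOUBLE literal relies on the implicit FLOAT-to-DOUBLE cast, and an explicit cast is shown for contrast):

hive> SELECT name, salary FROM employees WHERE salary > 70000.0;
hive> SELECT name, salary FROM employees WHERE cast(salary AS DOUBLE) > 70000.0;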
Hive doesn’t have the concept of keys. However, you can index tables

the default file format is TEXTFILE 
Schema on Read
Chapter 4 (HiveQL: Data Definition)

The Hive concept of a database is essentially just a catalog or namespace of tables.
hive> CREATE DATABASE IF NOT EXISTS financials;

You can also use the keyword SCHEMA instead of DATABASE in all the database-related
commands.
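
For example, the same database can also be created and listed with:

hive> CREATE SCHEMA IF NOT EXISTS financials;
hive> SHOW SCHEMAS;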

hive> SHOW DATABASES;
default
financials

hive> SHOW DATABASES LIKE 'h.*';

Hive will create a directory for each database. Tables in that database will be stored in
subdirectories of the database directory. The exception is tables in the default database,
which doesn’t have its own directory.

hive> CREATE DATABASE financials
> LOCATION '/my/preferred/directory';

hive> DESCRIBE DATABASE financials;

financials Holds all financial tables

hdfs://master-server/user/hive/warehouse/financials.db
hive> USE financials;

hive> DROP DATABASE IF EXISTS financials;

hive> DROP DATABASE IF EXISTS financials CASCADE;

CASCADE keyword to the command, which will cause the Hive to drop the tables in the database first:
When a database is dropped, its directory is also deleted.
Alter Database
No other metadata about the database can be changed, including its name and directory location
Creating Tables:

CREATE TABLE IF NOT EXISTS mydb.employees (
name STRING COMMENT 'Employee name',
salary FLOAT COMMENT 'Employee salary',
subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
deductions MAP<STRING, FLOAT>
COMMENT 'Keys are deductions names, values are percentages',
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
COMMENT 'Home address')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00', ...)
LOCATION '/user/hive/warehouse/mydb.db/employees';

CREATE TABLE IF NOT EXISTS mydb.employees2 LIKE mydb.employees

hive> SHOW TABLES;

hive> SHOW TABLES IN mydb;

hive> DESCRIBE EXTENDED mydb.employees;
Replacing EXTENDED with FORMATTED provides more readable but also more verbose output.
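For example:

hive> DESCRIBE FORMATTED mydb.employees;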

Managed Tables:

The tables we have created so far are called managed tables or sometimes called internal
tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen,
Hive stores the data for these tables in a subdirectory under the directory defined by
hive.metastore.warehouse.dir (e.g., /user/hive/warehouse), by default.

When we drop a managed table (see “Dropping Tables” on page 66), Hive deletes
the data in the table. However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, but we
want to run some queries against it, but not give Hive ownership of the data. We can
define an external table that points to that data, but doesn’t take ownership of it.
External Tables

Let’s assume the data files are in the distributed filesystem directory /data/stocks. The following table declaration creates an external table that can read all the data files for this comma-delimited data in /data/stocks:
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';

Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted.
Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables. Hive may technically own these directories and files, but it doesn’t have full control over them! Recall, in “Schema
on Read” on page 48, we said that Hive really has no control over the integrity of the
files used for storage and whether or not their contents are consistent with the table
schema. Even managed tables don’t give us this control.
Still, a general principle of good software design is to express intent. If the data is shared
between tools, then creating an external table makes this ownership explicit.

As for managed tables, you can also copy the schema (but not the data) of an existing
table:
CREATE EXTERNAL TABLE IF NOT EXISTS mydb.employees3
LIKE mydb.employees
LOCATION '/path/to/data';
If you omit the EXTERNAL keyword and the original table is external, the new table will also be external. If you omit EXTERNAL and the original table is managed, the new table will also be managed. However, if you
include the EXTERNAL keyword and the original table is managed, the new table will be external. Even in this scenario, the LOCATION clause will still be optional.

Partitioned, Managed Tables
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

Hive will now create subdirectories reflecting the partitioning structure. For
example:
...
.../employees/country=CA/state=AB
.../employees/country=CA/state=BC
...
.../employees/country=US/state=AL
.../employees/country=US/state=AK

users of the table don’t need to care if these “columns” are partitions or not, except when they want to optimize query performance.
the following query selects all employees in the state of Illinois in the United States:

SELECT * FROM employees
WHERE country = 'US' AND state = 'IL';
Note that because the country and state values are encoded in directory names, there is no reason to have this data in the data files themselves

When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters.
However, a query across all partitions could trigger an enormous MapReduce job if the table data and number of partitions are large.

A highly suggested safety measure is putting Hive into “strict” mode, which prohibits queries of partitioned tables without a WHERE clause that filters on partitions. You can set the mode to “nonstrict,” as in the
following session:
hive> set hive.mapred.mode=strict;

hive> SELECT e.name, e.salary FROM employees e LIMIT 100;
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
hive> set hive.mapred.mode=nonstrict;
hive> SELECT e.name, e.salary FROM employees e LIMIT 100;

You can see the partitions that exist with the SHOW PARTITIONS command:
hive> SHOW PARTITIONS employees;
...
country=CA/state=AB
country=CA/state=BC
...
country=US/state=AL
country=US/state=AK

hive> SHOW PARTITIONS employees PARTITION(country='US');
country=US/state=AL
country=US/state=AK
...
hive> SHOW PARTITIONS employees PARTITION(country='US', state='AK');
country=US/state=AK

The following example creates a US and CA (California) partition while loading data into it from a local
directory, $HOME/california-employees. You must specify a value for each partition column. Notice how we reference the HOME environment variable in HiveQL:
LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
INTO TABLE employees
PARTITION (country = 'US', state = 'CA');


External Partitioned Tables

CREATE EXTERNAL TABLE IF NOT EXISTS log_messages (
hms INT,
severity STRING,
server STRING,
process_id INT,
message STRING)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Recall that when we created the nonpartitioned external stocks table, a LOCATION … clause was required. It isn’t used for external partitioned tables. Instead, an ALTER TABLE statement is used to add each partition separately. It must specify a value for each partition key, the year, month, and day,

ALTER TABLE log_messages ADD PARTITION(year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';

For example, each day we might use the following procedure to move data older than a month to S3:
• Copy the data for the partition being moved to S3. For example, you can use the hadoop distcp command:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02

• Alter the table to point the partition to the S3 location:
ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

• Remove the HDFS copy of the partition using the hadoop fs -rmr command:
hadoop fs -rmr /data/log_messages/2011/12/02

As for managed partitioned tables, you can see an external table’s partitions with SHOW
PARTITIONS:
hive> SHOW PARTITIONS log_messages;
...
year=2011/month=12/day=31
year=2012/month=1/day=1
year=2012/month=1/day=2
ALTER TABLE … ADD PARTITION is not limited to external tables. You can use it with
managed tables, too, when you have (or will have) data for partitions in directories
created outside of the LOAD and INSERT options we discussed above. You’ll need to
remember that not all of the table’s data will be under the usual Hive “warehouse”
directory, and this data won’t be deleted when you drop the managed table! Hence,
from a “sanity” perspective, it’s questionable whether you should dare to use this feature
with managed tables.

**Customizing Table Storage Formats

CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
You can replace TEXTFILE with one of the other built-in file formats supported by Hive,
including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O
bandwidth performance using binary encoding and optional compression

CLUSTERED BY and SORTED BY
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
CLUSTERED BY (exchange, symbol)
SORTED BY (ymd ASC)
INTO 96 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks'

The CLUSTERED BY … INTO … BUCKETS clause, with an optional SORTED BY … clause is used to optimize certain kinds of queries


**Dropping Tables

DROP TABLE IF EXISTS employees;

Actually, if you enable the Hadoop Trash feature, which is not on by
default, the data is moved to the .Trash directory in the distributed
filesystem for the user, which in HDFS is /user/$USER/.Trash. To enable
this feature, set the property fs.trash.interval to a reasonable positive
number. It’s the number of minutes between “trash checkpoints”; 1,440
would be 24 hours. While it’s not guaranteed to work for all versions of
all distributed filesystems, if you accidentally drop a managed table with
important data, you may be able to re-create the table, re-create any
partitions, and then move the files from .Trash to the correct directories
(using the filesystem commands) to restore the data.


For external tables, the metadata is deleted but the data is not.

**Alter Table
ALTER TABLE modifies table metadata only. The data for the table is
untouched

**Renaming a Table
ALTER TABLE log_messages RENAME TO logmsgs;


**Adding, Modifying, and Dropping a Table Partition

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2011, month = 1, day = 1) LOCATION '/logs/2011/01/01'
PARTITION (year = 2011, month = 1, day = 2) LOCATION '/logs/2011/01/02'
PARTITION (year = 2011, month = 1, day = 3) LOCATION '/logs/2011/01/03'

Similarly, you can change a partition location, effectively moving it:

ALTER TABLE log_messages PARTITION(year = 2011, month = 12, day = 2)
SET LOCATION 's3n://ourbucket/logs/2011/12/02';

This command does not move the data from the old location, nor does it delete the old
data.


Finally, you can drop a partition:
ALTER TABLE log_messages DROP IF EXISTS PARTITION(year = 2011, month = 12, day = 2);
The IF EXISTS clause is optional, as usual. For managed tables, the data for the partition
is deleted, along with the metadata, even if the partition was created using ALTER TABLE
… ADD PARTITION.
For external tables, the data is not deleted.

You can rename a column, change its position, type, or comment:

ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'The hours, minutes, and seconds part of the timestamp'
AFTER severity;

You have to specify the old name, a new name, and the type, even if the name or type
is not changing.

As always, this command changes metadata only. If you are moving columns, the data
must already match the new schema or you must change it to match by some other
means.

**Adding Columns
You can add new columns to the end of the existing columns, before any partition
columns.

ALTER TABLE log_messages ADD COLUMNS (
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'The current session id');

**Deleting or Replacing Columns

The following example removes all the existing columns and replaces them with the
new columns specified:

ALTER TABLE log_messages REPLACE COLUMNS (
hours_mins_secs INT COMMENT 'hour, minute, seconds from timestamp',
severity STRING COMMENT 'The message severity',
message STRING COMMENT 'The rest of the message');

This statement effectively renames the original hms column and removes the server and
process_id columns from the original schema definition. As for all ALTER statements,
only the table metadata is changed.



**Alter Storage Properties
The following statement changes the storage format for a partition to be SEQUENCE
FILE,

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1)
SET FILEFORMAT SEQUENCEFILE;

The PARTITION clause is required if the table is partitioned.
You can specify a new SerDe along with SerDe properties or change the properties for
the existing SerDe
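
A hedged sketch (com.example.JSONSerDe and the property name are placeholders, not a SerDe shipped with Hive):

ALTER TABLE log_messages
SET SERDE 'com.example.JSONSerDe'   -- placeholder class name
WITH SERDEPROPERTIES ('prop1' = 'value1');

-- or change a property on the existing SerDe:
ALTER TABLE log_messages
SET SERDEPROPERTIES ('prop1' = 'value1');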

The ALTER TABLE … ARCHIVE PARTITION statement captures the partition files into a Hadoop
archive (HAR) file.
This only reduces the number of files in the filesystem, reducing
the load on the NameNode, but doesn’t provide any space savings (e.g., through
compression):

ALTER TABLE log_messages ARCHIVE
PARTITION(year = 2012, month = 1, day = 1);


To reverse the operation, substitute UNARCHIVE for ARCHIVE.
This feature is only available
for individual partitions of partitioned tables.
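
For example, to reverse the archive operation above:

ALTER TABLE log_messages UNARCHIVE
PARTITION(year = 2012, month = 1, day = 1);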


The following statements prevent the partition
from being dropped and queried:

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) ENABLE OFFLINE;

To reverse either operation, replace ENABLE with DISABLE. These operations also can’t
be used with nonpartitioned tables.
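
For example, to reverse them:

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE NO_DROP;

ALTER TABLE log_messages
PARTITION(year = 2012, month = 1, day = 1) DISABLE OFFLINE;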


****HiveQL: Data Manipulation

**Loading Data into Managed Tables

LOAD DATA LOCAL INPATH '${env:HOME}/california-employees'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'CA');

This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.

It is conventional practice to specify a path that is a directory, rather than an individual
file. Hive will copy all the files in the directory,

If the LOCAL keyword is used, the path is assumed to be in the local filesystem. The data
is copied into the final location. If LOCAL is omitted, the path is assumed to be in the
distributed filesystem. In this case, the data is moved from the path to the final location.

For distributed or pseudo-distributed mode, the path
is interpreted relative to the user’s home directory in the distributed filesystem, which
is /user/$USER by default in HDFS and MapRFS.

The PARTITION clause is required if the table is partitioned and you must specify a value
for each partition key.

the data will now exist in the following directory:
hdfs://master_server/user/hive/warehouse/mydb.db/employees/country=US/state=CA

Hive does not verify that the data you are loading matches the schema for the table.
However, it will verify that the file format matches the table definition. For example,
if the table was created with SEQUENCEFILE storage, the loaded files must be sequence
files.

**Inserting Data into Tables from Queries

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

This example suggests one common scenario where this feature is useful: data has been
staged in a directory, exposed to Hive as an external table, and now you want to put it
into the final, partitioned table. A workflow like this is also useful if you want the target
table to have a different record format than the source table.

Hive offers an
alternative INSERT syntax that allows you to scan the input data once and split it multiple
ways. The following example shows this feature for creating the employees partitions
for three states:


FROM staged_employees se
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'OR')
SELECT * WHERE se.cnty = 'US' AND se.st = 'OR'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE se.cnty = 'US' AND se.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'IL')
SELECT * WHERE se.cnty = 'US' AND se.st = 'IL';

If a record satisfied a given SELECT … WHERE … clause, it gets written to the specified table
and partition. To be clear, each INSERT clause can insert into a different table, when
desired, and some of those tables could be partitioned while others aren’t.

**Dynamic Partition Inserts

INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;

Hive determines the values of the partition keys, country and state, from the last two
columns in the SELECT clause.

Suppose that staged_employees has data for a total of 100 country and state pairs. After
running this query, employees will have 100 partitions!
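
You can confirm which partitions were created with SHOW PARTITIONS:

hive> SHOW PARTITIONS employees;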

You can also mix dynamic and static partitions. This variation of the previous query
specifies a static value for the country (US) and a dynamic value for the state:

INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state)
SELECT ..., se.cnty, se.st
FROM staged_employees se
WHERE se.cnty = 'US';

Dynamic partitioning is not enabled by default. When it is enabled, it works in “strict”
mode by default, where it expects at least some columns to be static. This helps protect
against a badly designed query that generates a gigantic number of partitions. For example,
if you partition by timestamp, you could end up with a separate partition for each second,
when you probably meant to partition by day or perhaps hour instead. Several other properties
are also used to limit excess resource utilization.
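
The related properties include the following (the values shown here are only illustrative; check your installation's defaults):

hive> set hive.exec.max.dynamic.partitions=1000;
hive> set hive.exec.max.dynamic.partitions.pernode=100;
hive> set hive.exec.max.created.files=100000;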

So, for example, our first example using dynamic partitioning for all partitions might
actually look like this, where we set the desired properties just before use:

hive> set hive.exec.dynamic.partition=true;

hive> set hive.exec.dynamic.partition.mode=nonstrict;

hive> set hive.exec.max.dynamic.partitions.pernode=1000;

hive> INSERT OVERWRITE TABLE employees
> PARTITION (country, state)
> SELECT ..., se.cnty, se.st
> FROM staged_employees se;


**Creating Tables and Loading Them in One Query

CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE state = 'CA';

This feature can’t be used with external tables

Recall that “populating” a partition for
an external table is done with an ALTER TABLE statement, where we aren’t “loading”
data, per se, but pointing metadata to a location where the data can be found.
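
For example, assuming log_messages is such an external table (the HDFS path is illustrative):

ALTER TABLE log_messages ADD IF NOT EXISTS
PARTITION (year = 2012, month = 1, day = 2)
LOCATION 'hdfs://master_server/data/log_messages/2012/01/02';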

**Exporting Data

If the data files are already formatted the way you want, it is simple enough to copy the directories or files:

hadoop fs -cp source_path target_path

Otherwise, you can use INSERT … DIRECTORY …, as in this example:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

Just like inserting data to tables, you can specify multiple inserts to directories:

FROM staged_employees se
INSERT OVERWRITE DIRECTORY '/tmp/or_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'OR'
INSERT OVERWRITE DIRECTORY '/tmp/ca_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'CA'
INSERT OVERWRITE DIRECTORY '/tmp/il_employees'
SELECT * WHERE se.cnty = 'US' and se.st = 'IL';


**** HiveQL: Queries
SELECT name, salary FROM employees;

hive> SELECT name, subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []

The deductions column is a MAP, where the JSON representation for maps is used, namely a
comma-separated list of key:value pairs, surrounded with {...}:

hive> SELECT name, deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}

Finally, the address is a STRUCT, which is also written using the JSON map format:

hive> SELECT name, address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

hive> SELECT name, subordinates[0] FROM employees;

hive> SELECT name, deductions["State Taxes"] FROM employees;

hive> SELECT name, address.city FROM employees;

** Specify Columns with Regular Expressions

The following query
selects the symbol column and all columns from stocks whose names start with the
prefix price:

hive> SELECT symbol, `price.*` FROM stocks;

** Computing with Column Values

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees;

** MATH Functions

round(d)
round(d, N)
floor(d)
ceil(d), ceiling(DOUBLE d)
rand(), rand(seed)
exp(d)
ln(d)
log10(d)
log2(d)
log(base, d)
pow(d, p), power(d, p)
sqrt(d)
bin(i)
hex(i)
hex(str)
unhex(i)
conv(i, from_base, to_base)
conv(str, from_base, to_base)
abs(d)
pmod(i1, i2)
pmod(d1, d2)
sin(d)
asin(d)
cos(d)
acos(d)
tan(d)
atan(d)
degrees(d)
radians(d)
positive(i)
positive(d)
negative(i)
negative(d)
sign(d)
e()
pi()
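
A couple of these in action (querying employees only so there is a table to select from):

hive> SELECT round(3.14159, 2), floor(9.9), sqrt(16.0), pmod(7, 3)
    > FROM employees LIMIT 1;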


** AGGREGATE Functions

count(*)
count(expr)
count(DISTINCT expr[, expr...])
sum(col)
sum(DISTINCT col)
avg(col)
avg(DISTINCT col)
min(col)
max(col)
variance(col), var_pop(col)
var_samp(col)
stddev_pop(col)
stddev_samp(col)
covar_pop(col1, col2)
covar_samp(col1, col2)



You can usually improve the performance of aggregation by setting the property
hive.map.aggr to true, as shown here:

hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;

This setting will attempt to do “top-level” aggregation in the map phase, as in this
example. (An aggregation that isn’t top-level would be aggregation after performing a
GROUP BY.) However, this setting will require more memory.

** Other built-in functions

test in(val1, val2, …)
length(s)
reverse(s)
concat(s1, s2, …)
concat_ws(separator, s1, s2, …)
substr(s, start_index)
substr(s, int start, int length)
upper(s)
ucase(s)
lower(s)
lcase(s)
trim(s)
ltrim(s)
rtrim(s)
regexp_replace(s, regex, replacement)
regexp_extract(subject, regex_pattern, index)
parse_url(url, partname, key)
size(map<K.V>)
size(array<T>)
cast(<expr> as <type>)
from_unixtime(int unixtime)
to_date(timestamp)
year(timestamp)
month(timestamp)
day(timestamp)
get_json_object(json_string, path)
space(n)
repeat(s, n)
ascii(s)
lpad(s, len, pad)
rpad(s, len, pad)
split(s, pattern)
find_in_set(s, commaSeparatedString)
locate(substr, str[, pos])
instr(str, substr)
str_to_map(s, delim1, delim2)
sentences(s, lang, locale)
ngrams(array<array<string>>, N, K, pf)
context_ngrams(array<array<string>>, array<string>, int K, int pf)
in_file(s, filename)
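
A quick sketch using a few of these against the employees table:

hive> SELECT concat_ws('-', name, address.city),
    > substr(name, 1, 4),
    > split(name, ' ')
    > FROM employees LIMIT 2;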

**Other Query Samples

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"])) FROM employees
> LIMIT 2;


Column Alias:

hive> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees LIMIT 2;

Nested SELECT Statements:

hive> FROM (
> SELECT upper(name) as name, salary, deductions["Federal Taxes"] as fed_taxes,
> round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name, e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;




CASE … WHEN … THEN Statements:
hive> SELECT name, salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;


**When Hive Can Avoid MapReduce:

Hive implements some kinds of queries
without using MapReduce, in so-called local mode, for example:

SELECT * FROM employees;

In this case, Hive can simply read the records from employees and dump the formatted
output to the console.


This even works for WHERE clauses that only filter on partition keys, with or without
LIMIT clauses:

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA'
LIMIT 100;

Trust us, you want to add set hive.exec.mode.local.auto=true;
to your $HOME/.hiverc file, so Hive tries local mode automatically whenever the query is small enough.

** WHERE Clauses

SELECT * FROM employees
WHERE country = 'US' AND state = 'CA';

hive> SELECT name, salary, deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;


hive> SELECT e.* FROM
> (SELECT name, salary, deductions["Federal Taxes"] as ded,
> salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
> FROM employees) e
> WHERE round(e.salary_minus_fed_taxes) > 70000;


**Gotchas with Floating-Point Comparisons

Suppose you want the employees whose federal tax deduction is greater than 0.2:

hive> SELECT name, salary, deductions['Federal Taxes']
> FROM employees WHERE deductions['Federal Taxes'] > 0.2;

John Doe 100000.0 0.2
Mary Smith 80000.0 0.2

Actually, it doesn’t work. Here’s why. The number 0.2 can’t be represented exactly in
a FLOAT or DOUBLE

To simplify things a bit, let’s say that 0.2 is actually 0.2000001 for FLOAT and
0.200000000001 for DOUBLE, because an 8-byte DOUBLE has more significant digits (after
the decimal point). When the FLOAT value from the table is converted to DOUBLE by Hive,
it produces the DOUBLE value 0.200000100000, which is greater than 0.200000000001.
That’s why the query results appear to use >= not >

Workaround:
hive> SELECT name, salary, deductions['Federal Taxes'] FROM employees
> WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);

**LIKE and RLIKE

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';

A Hive extension is the RLIKE clause, which lets us use Java regular expressions, a more
powerful minilanguage for specifying matches

hive> SELECT name, address.street
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';


** GROUP BY … HAVING Clauses

hive> SELECT year(ymd), avg(price_close) FROM stocks
> WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
> GROUP BY year(ymd)
> HAVING avg(price_close) > 50.0;


**JOIN Statements:

INNER JOIN:

hive> SELECT a.ymd, a.price_close, b.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';

**Join Optimizations:

hive> SELECT a.ymd, a.price_close, b.price_close , c.price_close
> FROM stocks a JOIN stocks b ON a.ymd = b.ymd
> JOIN stocks c ON a.ymd = c.ymd
> WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM' AND c.symbol = 'GE';

In the above example, every ON clause uses a.ymd as one of the join keys. In this case,
Hive can apply an optimization where it joins all three tables in a single MapReduce
job.

When joining three or more tables, if every ON clause uses the same join
key, a single MapReduce job will be used

Hive also assumes that the last table in the query is the largest. It attempts to buffer the
other tables and then stream the last table through, while performing joins on individual
records. Therefore, you should structure your join queries so the largest table is last.
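
If reordering the tables isn't convenient, you can instead tell Hive which table to stream with the STREAMTABLE hint; a sketch using the earlier stocks/dividends join:

hive> SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';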

**LEFT OUTER JOIN

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
> WHERE s.symbol = 'AAPL';


**RIGHT OUTER JOIN Gotcha

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';

** FULL Outer join

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
> FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
> WHERE s.symbol = 'AAPL';


** LEFT SEMI-JOIN:

A left semi-join returns records from the lefthand table if records are found in the righthand
table that satisfy the ON predicates. Conceptually it answers the same question as this
IN … subquery, which classic HiveQL does not support (shown here only to illustrate the intent):

SELECT s.ymd, s.symbol, s.price_close FROM stocks s
WHERE s.ymd, s.symbol IN
(SELECT d.ymd, d.symbol FROM dividends d);
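
The supported HiveQL form uses LEFT SEMI JOIN; note that the righthand table's columns may
only appear in the ON clause, not in the SELECT list:

hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;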


**Cartesian Product JOINs
SELECT * FROM stocks JOIN dividends;