Saturday, November 14, 2015

Sqoop Commands Complete Reference

SQOOP Commands - Complete reference


sqoop version


sqoop help

sqoop list-databases --connect jdbc:mysql://localhost/ --username root --password ram

 

/first time error/
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
16/01/15 22:57:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/01/15 22:57:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/01/15 22:57:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
 


Error
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.

Fix1:
gedit .bashrc
export HCAT_HOME=${HIVE_HOME}/hcatalog/
export PATH=$HCAT_HOME/bin:$PATH

source .bashrc

Error 2:
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver

note:
1. download connector for MySQL here http://dev.mysql.com/downloads/file/?id=459312
2. Extract the .gz and copy the .jar file in /usr/lib/sqoop/lib folder

sqoop list-tables --connect jdbc:mysql://localhost/sales --username root --password ram
Mysql: setting up environment

mysql -u root -p
create database sales
use sales
create table employee (empno int, empname varchar(50), empcity varchar(50));
insert into employee values (1001, 'Karthik', 'Chennai');
insert into employee values (1002, 'Saravanan', 'NY');
select * from employee;


sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee

 
ERROR tool.ImportTool: Error during import: No primary key could be found for table employee.
Please specify one with --split-by or perform a sequential import with '-m 1'.

sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001,Karthik,Chennai
1002,Saravanan,NY

When you run the same import again:
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://localhost:9000/user/ram/employee already exists

Appending the records in hdfs:

 
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1 --append
or
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --autoreset-to-one-mapper --append

ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:13:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         39 2015-11-14 09:13 /user/ram/employee/part-m-00001


sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --split-by empno --append
15/11/14 09:17:45 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=267100
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=221
        HDFS: Number of bytes written=39
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Job Counters
        Launched map tasks=2
        Other local map tasks=2
        Total time spent by all maps in occupied slots (ms)=12800
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=12800
        Total vcore-seconds taken by all map tasks=12800
        Total megabyte-seconds taken by all map tasks=13107200
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=221
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=142
        CPU time spent (ms)=2880
        Physical memory (bytes) snapshot=363220992
        Virtual memory (bytes) snapshot=1681674240
        Total committed heap usage (bytes)=191889408
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=39


ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:19:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         39 2015-11-14 09:13 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         21 2015-11-14 09:17 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         18 2015-11-14 09:17 /user/ram/employee/part-m-00003
ram@hadoop:~$

Note: two files are imported as two mappers executed (002 and 003)

sqoop --options-file sqoopconf.conf

sqoopconf.conf
list-tables
--connect
jdbc:mysql://localhost/sales
--username
root
--password-file
/usr/local/hadoop/password.txt

Error while loading password file: The provided password file .password does not exist!ram@hadoop:~$ cat .password
ram
ram@hadoop:~$

note: You should save the password in a file on the users home directory with 400 permissions and
specify the path to that file using the --password-file argument

<<Need to check>>


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram --table employee \
     --split-by empno --append \
     --target-dir /user/ram/employee_result \
     --query "select empno, empname from employee" \
    --where "where empno=1001"

Error: Cannot specify --query and --table together.


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --split-by empno --append \
     --target-dir /user/ram/employee_result \
     --query "select empno, empname from employee" \
    --where "where empno=1001"

Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select empno, empname from employee] must contain '$CONDITIONS' in WHERE clause.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result \
     --table employee \
     --split-by empno \
     --append \
      --where " empno=1001 and empname = 'karthik' "

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result \
     --table employee \
     --columns "empno, empname" \
     --split-by empno \
     --append \
      --where " empno=1001 and empname = 'karthik' "

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query 'select empno, empname from employee where empname =  and \$CONDITIONS  \
     --num-mappers 1 \
     --append

Free-form Query Imports

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query "select empno, empname from employee where empno < 100 and \$CONDITIONS " \
     --num-mappers 1 \
     --append


Note:

If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --target-dir /user/ram/employee_result2 \
     --query 'select empno, empname from employee where $CONDITIONS ' \
     --append \
     --num-mappers 1 



sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --num-mappers 1 \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n'

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 22:05 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         39 2015-11-14 22:05 /user/ram/employee/part-m-00000
ram@hadoop:~$


ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
1002|Saravanan|NY
ram@hadoop:~$

Boundary Query and Null Handling:

Boundary Query:

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.

Null handling:
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.


mysql> insert into employee (empno, empname) values (1003,'kavitha');
Query OK, 1 row affected (0.07 sec)

mysql> select * from employee;
+-------+-----------+---------+
| empno | empname   | empcity |
+-------+-----------+---------+
|  1001 | Karthik   | Chennai |
|  1002 | Saravanan | NY      |
|  1003 | kavitha   | NULL    |
+-------+-----------+---------+
3 rows in set (0.00 sec)


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --boundary-query "select min(empno), max(empno) from employee" \
     --split-by empno \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n' \
     --null-string '\\N' \
     --null-non-string '\\N'
--null-string '\\N': The string to be interpreted as null for string columns.
--null-non-string '\\N' :  The string to be interpreted as null for non string columns.

note:
select min(empno), max(empno) from employee -- will work
select min (empno), max (empno) from employee -- will throw function not found error


ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 22:29 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-14 22:29 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 22:29 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         16 2015-11-14 22:29 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00002
1003|kavitha|\N

Relaxed Isolation:

By default Sqoop uses Read Comitted isolation level
--relaxed isolation to use read uncommitted (not supported by all rdbms)

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --boundary-query "select min(empno), max(empno) from employee" \
     --split-by empno \
     --fetch-size 1000 \
     --fields-terminated-by '|' \
     --lines-terminated-by '\n' \
     --null-string '\\N' \
     --null-non-string '\\N' \
     --relaxed-isolation
ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-m-*1001|Karthik|Chennai
1002|Saravanan|NY
1003|kavitha|\N
ram@hadoop:~$ 




 --as-avrodatafile:

Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.

By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-avrodatafile

ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"namc =Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"name":"empcity","type":["null","string"],"default":null,"columnName":"empcity","sqlType":"12"}],"tableName":"employee"} � ���'h�S�T���� $ Saravanan NY � ���'h�S�T����Obj avro.schema� {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],"default":null,"columnName":"empno","sqlType":"4"},{"name":"empname","type":["null","string"],"default":null,"columnName":"empname","sqlType":"12"},{"name":"empcity","type":["null","string"],"default":null,"columnName":"empcity","sqlType":"12"}],"tableName":"employee"}򐘩N�&��~�c�H�, kavitha򐘩N�&��~�c�H�,ram@hadoop:~$


$ hadoop dfs -text /user/ram/employee/part-*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$


 --as-sequencefile:
SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-sequencefile

ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
SEQ !org.apache.hadoop.io.LongWritablemployee;��|ǀ�@E�A�'qf �KarthikChennaiSEQ !org.apache.hadoop.io.LongWritablemployee�
�      Saravanan NYSEQ !org.apache.hadoop.io.LongWritablemployeecb�[N���sc �<� �kavitha

further reference: reading sequencial file via java program
http://hadooptutorial.info/reading-and-writing-sequencefile-example/

Compression:
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --as-avrodatafile \
     --compress
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:11 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup        481 2015-11-14 23:11 /user/ram/employee/part-m-00000.avro
-rw-r--r--   1 ram supergroup        478 2015-11-14 23:11 /user/ram/employee/part-m-00001.avro
-rw-r--r--   1 ram supergroup        473 2015-11-14 23:11 /user/ram/employee/part-m-00002.avro
ram@hadoop:~$

ram@hadoop:~$ hadoop dfs -text /user/ram/employee/part*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$


sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --compress \
    --compression-codec org.apache.hadoop.io.compress.BZip2Codec

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:20 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         62 2015-11-14 23:20 /user/ram/employee/part-m-00000.bz2
-rw-r--r--   1 ram supergroup         62 2015-11-14 23:20 /user/ram/employee/part-m-00001.bz2
-rw-r--r--   1 ram supergroup         59 2015-11-14 23:20 /user/ram/employee/part-m-00002.bz2
ram@hadoop:~$

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --compress \
    --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Error: native snappy library not available: SnappyCompressor has not been loaded.
<<how to download and configure snappy:>>



--direct to Use direct connector if exists for the database

For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --direct

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r--   1 ram supergroup          0 2015-11-14 23:32 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/part-m*
-rw-r--r--   1 ram supergroup         21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-14 23:32 /user/ram/employee/part-m-00002
ram@hadoop:~$


incremental imports

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno \
     --direct    \
     --incremental lastmodified \
     --last-value 0 \
     --check-column empno
note: --delete-target-dir can not be used with incremental imports.

sqoop import --connect jdbc:mysql://localhost/sales \
     --username root --password ram \
     --table employee \
     --split-by empno \
     --direct    \
     --check-column empno \
     --incremental append \
     --last-value 0
Error: java.lang.RuntimeException: Column type is neither timestamp nor date!


sqoop import --connect jdbc:mysql://localhost/sales \
    --username root --password ram \
    --table employee \
    --split-by empno \
    --direct\
    --check-column empno \
    --incremental append \
    --last-value 0

15/11/15 05:19:26 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/11/15 05:19:26 INFO tool.ImportTool:  --incremental append
15/11/15 05:19:26 INFO tool.ImportTool:   --check-column empno
15/11/15 05:19:26 INFO tool.ImportTool:   --last-value 1003
15/11/15 05:19:26 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')



ram@hadoop:~$ hadoop fs -ls /user/ram/
Found 4 items
drwxr-xr-x   - ram supergroup          0 2015-11-15 05:19 /user/ram/_sqoop
drwxr-xr-x   - ram supergroup          0 2015-11-15 05:19 /user/ram/employee
drwxr-xr-x   - ram supergroup          0 2015-11-14 16:10 /user/ram/employee_result
drwxr-xr-x   - ram supergroup          0 2015-11-14 16:28 /user/ram/employee_result2

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 3 items
-rw-r--r--   1 ram supergroup         21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00002

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*
1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
ram@hadoop:~$


Insert 2 more records:

mysql> insert into employee values (1004, 'ramkumar','NJ');
Query OK, 1 row affected (0.07 sec)

mysql> insert into employee values (1005, 'gopal','NJ');
Query OK, 1 row affected (0.05 sec)


sqoop import --connect jdbc:mysql://localhost/sales \
    --username root --password ram \
    --table employee \
    --split-by empno \
    --direct \
    --check-column empno \
    --incremental append \
    --last-value 1003
15/11/15 05:27:21 INFO tool.ImportTool:  --incremental append
15/11/15 05:27:21 INFO tool.ImportTool:   --check-column empno
15/11/15 05:27:21 INFO tool.ImportTool:   --last-value 1005
15/11/15 05:27:21 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 5 items
-rw-r--r--   1 ram supergroup         21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 05:19 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         17 2015-11-15 05:27 /user/ram/employee/part-m-00003
-rw-r--r--   1 ram supergroup         14 2015-11-15 05:27 /user/ram/employee/part-m-00004

ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
1004,ramkumar,NJ
1005,gopal,NJ
ram@hadoop:~$


Create and execute Sqoop jobs:

sqoop job --create employee_bulk_import \
     -- import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --target-dir /user/ram/employee \
     --table employee \
     --split-by empno

sqoop job --list
sqoop job --show employee_bulk_import
sqoop job --exec employee_bulk_import
sqoop job --delete employee_bulk_import




Hive Import


- If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data’s layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.

if the Hive table already exists, you can specify the --hive-overwrite

Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive’s types, and a LOAD DATA INPATH statement to move the data files into Hive’s warehouse directory.

Even though Hive supports escaping characters, it does not handle escaping of new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string

t is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive’s input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.


Hive will have problems using Sqoop-imported data if your database’s rows contain string fields that have Hive’s default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string on import to give Hive-compatible text data

Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive’s defaults.

You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values. 

The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.

Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments.

You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting. When importing tables with this codec, Sqoop will automatically index the files for splitting and configuring a new Hive table with the correct InputFormat.
 


Note: External tables/partitions are not supported
Note: Partitions are created under the default location
Note: Existing Hive table can’t be truncated before the import
The --delete-target-dir option is not respected.


sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-table=hive_employee


15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Transferred 88 bytes in 35.049 seconds (2.5108 bytes/sec)
15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Retrieved 5 records.
15/11/15 06:17:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
15/11/15 06:17:35 INFO hive.HiveImport: Loading uploaded data into Hive
15/11/15 06:17:54 INFO hive.HiveImport: Time taken: 2.839 seconds
15/11/15 06:17:54 INFO hive.HiveImport: Loading data to table default.hive_employee
15/11/15 06:17:55 INFO hive.HiveImport: Table default.hive_employee stats: [numFiles=5, totalSize=88]
15/11/15 06:17:55 INFO hive.HiveImport: Hive import complete.
15/11/15 06:17:55 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
ram@hadoop:~$

ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employeeFound 5 items
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004

ram@hadoop:~$ hadoop fs -cat /user/hive/warehouse/hive_employee/part-m*1001 Karthik Chennai
1002 Saravanan NY
1003 kavitha NULL
1004 ramkumar NJ
1005 gopal NJ
ram@hadoop:~$

ram@hadoop:~$ hive -e "select * from hive_employee"OK
1001    Karthik    Chennai
1002    Saravanan    NY
1003    kavitha    NULL
1004    ramkumar    NJ
1005    gopal    NJ
Time taken: 2.727 seconds, Fetched: 5 row(s)

ram@hadoop:~$ hive -e "select * from hive_employee where empname = 'gopal'"
1005    gopal    NJ
Time taken: 3.182 seconds, Fetched: 1 row(s)
ram@hadoop:~$


What happens when you run the import again:
The --delete-target-dir option is not respected.

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-table=hive_employee


ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employee
Found 10 items
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x   1 ram supergroup         21 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00000_copy_1
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00001_copy_1
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x   1 ram supergroup         18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00002_copy_1
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x   1 ram supergroup         17 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00003_copy_1
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004
-rwxrwxr-x   1 ram supergroup         14 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00004_copy_1
 

ram@hadoop:~$ hive -e "select * from hive_employee"
 OK
1001    Karthik    Chennai
1001    Karthik    Chennai
1002    Saravanan    NY
1002    Saravanan    NY
1003    kavitha    NULL
1003    kavitha    NULL
1004    ramkumar    NJ
1004    ramkumar    NJ
1005    gopal    NJ
1005    gopal    NJ
Time taken: 2.56 seconds, Fetched: 10 row(s)
ram@hadoop:~$



sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --delete-target-dir \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite




Note: Hive-Overwrite will delete all the files employee folder before loading the data

hive -e "drop table employee"

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite \
     --hive-drop-import-delims  \
     --hive-partition-key empno \
    --hive-partition-value 1002
ERROR: Direct import currently do not support dropping hive delimiters, please remove parameter --hive-drop-import-delims.


hive -e "drop table employee"

sqoop import \
     --username root \
     --password ram \
     --connect jdbc:mysql://localhost/sales \
     --table employee \
     --split-by empno \
     --direct \
     --hive-import \
     --hive-overwrite \
     --hive-partition-key "empcity" \
     --hive-partition-value "NJ"

<<this creates files under /user/ram/employee. need to check>>

ram@hadoop:~$ hadoop fs -ls /user/ram/employee
15/11/15 15:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 6 items
-rw-r--r--   1 ram supergroup          0 2015-11-15 15:34 /user/ram/employee/_SUCCESS
-rw-r--r--   1 ram supergroup         21 2015-11-15 15:34 /user/ram/employee/part-m-00000
-rw-r--r--   1 ram supergroup         18 2015-11-15 15:34 /user/ram/employee/part-m-00001
-rw-r--r--   1 ram supergroup         18 2015-11-15 15:34 /user/ram/employee/part-m-00002
-rw-r--r--   1 ram supergroup         17 2015-11-15 15:34 /user/ram/employee/part-m-00003
-rw-r--r--   1 ram supergroup         14 2015-11-15 15:34 /user/ram/employee/part-m-00004



4 comments:

  1. I am using the below command:
    sqoop import --connect jdbc:mysql://localhost:3306/siva --table authors --username root --password sivasakthi --hive-import --hive-table default.authors -m 1

    I was getting the bellow error
    ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/usr/local/sqoop-1.4.6/lib/ant-contrib-1.0b3.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)


    What do to in this case.

    thanks in advance

    ReplyDelete
  2. Thanks for giving great kind of information. So useful and practical for me. Thanks for your excellent blog, nice work keep it up thanks for sharing the knowledge.
    AWS Training in Chennai | AWS Training Institute in Chennai

    ReplyDelete