SQOOP Commands - Complete reference
sqoop version
sqoop help
sqoop list-databases --connect jdbc:mysql://localhost/ --username root --password ram
First-time run, error output:
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
16/01/15 22:57:28 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/01/15 22:57:28 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/01/15 22:57:28 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
Error 1:
Warning: /usr/lib/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Fix 1: set HCAT_HOME and add it to PATH (for example in .bashrc):
gedit .bashrc
export HCAT_HOME=${HIVE_HOME}/hcatalog/
export PATH=$HCAT_HOME/bin:$PATH
source .bashrc
Error 2:
16/01/15 22:57:28 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
Fix 2:
1. Download the MySQL Connector/J driver from http://dev.mysql.com/downloads/file/?id=459312
2. Extract the .gz archive and copy the .jar file into the /usr/lib/sqoop/lib folder
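A minimal sketch of step 2, assuming the downloaded archive is named mysql-connector-java-5.1.38.tar.gz (adjust the version to whatever you actually downloaded):
# extract the archive and copy the driver jar into Sqoop's lib directory
tar -xzf mysql-connector-java-5.1.38.tar.gz
sudo cp mysql-connector-java-5.1.38/mysql-connector-java-5.1.38-bin.jar /usr/lib/sqoop/lib/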
sqoop list-tables --connect jdbc:mysql://localhost/sales --username root --password ram
MySQL: setting up the environment
mysql -u root -p
create database sales;
use sales;
create table employee (empno int, empname varchar(50), empcity varchar(50));
insert into employee values (1001, 'Karthik', 'Chennai');
insert into employee values (1002, 'Saravanan', 'NY');
select * from employee;
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee
ERROR tool.ImportTool: Error during import: No primary key could be found for table employee.
Please specify one with --split-by or perform a sequential import with '-m 1'.
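An alternative fix (a hedged sketch; the -m 1 and --split-by runs below work without it): give the source table a primary key so Sqoop can infer the split column on its own.
# add a primary key on empno in the sales database (prompts for the MySQL password)
mysql -u root -p sales -e "alter table employee add primary key (empno);"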
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 39 2015-11-14 08:58 /user/ram/employee/part-m-00000
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001,Karthik,Chennai
1002,Saravanan,NY
When you run the same import again:
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1
ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException:
Output directory hdfs://localhost:9000/user/ram/employee already exists
Appending records in HDFS:
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee -m1 --append
or
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --autoreset-to-one-mapper --append
ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:13:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 39 2015-11-14 09:13 /user/ram/employee/part-m-00001
sqoop import --connect jdbc:mysql://localhost/sales --username root --password ram --table employee --split-by empno --append
15/11/14 09:17:45 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=267100
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=221
HDFS: Number of bytes written=39
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=4
Job Counters
Launched map tasks=2
Other local map tasks=2
Total time spent by all maps in occupied slots (ms)=12800
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=12800
Total vcore-seconds taken by all map tasks=12800
Total megabyte-seconds taken by all map tasks=13107200
Map-Reduce Framework
Map input records=2
Map output records=2
Input split bytes=221
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=142
CPU time spent (ms)=2880
Physical memory (bytes) snapshot=363220992
Virtual memory (bytes) snapshot=1681674240
Total committed heap usage (bytes)=191889408
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=39
ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
15/11/14 09:19:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 5 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 08:58 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 39 2015-11-14 08:58 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 39 2015-11-14 09:13 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 21 2015-11-14 09:17 /user/ram/employee/part-m-00002
-rw-r--r-- 1 ram supergroup 18 2015-11-14 09:17 /user/ram/employee/part-m-00003
ram@hadoop:~$
Note: two new files were imported because two mappers executed (part-m-00002 and part-m-00003).
sqoop --options-file sqoopconf.conf
sqoopconf.conf
list-tables
--connect
jdbc:mysql://localhost/sales
--username
root
--password-file
/usr/local/hadoop/password.txt
Error while loading password file: The provided password file .password does not exist!
ram@hadoop:~$ cat .password
ram
ram@hadoop:~$
note: You should save the password in a file in the user's home directory with 400 permissions and
specify the path to that file using the --password-file argument.
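A hedged sketch of setting this up (paths are illustrative; Sqoop resolves --password-file against the cluster's default filesystem, so a local file may need the file:// prefix):
# write the password without a trailing newline and lock the file down
echo -n "ram" > $HOME/.password
chmod 400 $HOME/.password
sqoop list-tables --connect jdbc:mysql://localhost/sales --username root --password-file file://$HOME/.password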
<<Need to check>>
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram --table employee \
--split-by empno --append \
--target-dir /user/ram/employee_result \
--query "select empno, empname from employee" \
--where "where empno=1001"
Error: Cannot specify --query and --table together.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--split-by empno --append \
--target-dir /user/ram/employee_result \
--query "select empno, empname from employee" \
--where "where empno=1001"
Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException:
Query [select empno, empname from employee] must contain '$CONDITIONS' in WHERE clause.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result \
--table employee \
--split-by empno \
--append \
--where " empno=1001 and empname = 'karthik' "
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result \
--table employee \
--columns "empno, empname" \
--split-by empno \
--append \
--where " empno=1001 and empname = 'karthik' "
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result2 \
--query 'select empno, empname from employee where empname = "karthik" and $CONDITIONS' \
--num-mappers 1 \
--append
Free-form Query Imports
When importing a free-form query, you must specify a destination directory with --target-dir.
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result2 \
--query "select empno, empname from employee where empno < 100 and \$CONDITIONS " \
--num-mappers 1 \
--append
Note:
If you are issuing the query wrapped with double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to disallow your shell from treating it as a shell variable. For example, a double quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"
The facility of using free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries such as queries that have sub-queries or joins leading to ambiguous projections can lead to unexpected results.
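A hedged sketch of a parallel free-form query import, combining the notes above: it needs both \$CONDITIONS and --split-by (the target directory employee_result3 is just an illustration):
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result3 \
--query "select empno, empname from employee where \$CONDITIONS" \
--split-by empno \
--num-mappers 2 \
--append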
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--target-dir /user/ram/employee_result2 \
--query 'select empno, empname from employee where $CONDITIONS ' \
--append \
--num-mappers 1
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--num-mappers 1 \
--fetch-size 1000 \
--fields-terminated-by '|' \
--lines-terminated-by '\n'
ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 2 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 22:05 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 39 2015-11-14 22:05 /user/ram/employee/part-m-00000
ram@hadoop:~$
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
1002|Saravanan|NY
ram@hadoop:~$
Boundary Query and Null Handling:
Boundary Query:
By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.
Null handling:
Sqoop will by default import NULL values as string null. Hive is however using string \N to denote NULL values and therefore predicates dealing with NULL (like IS NULL) will not work correctly. You should append parameters --null-string and --null-non-string in case of import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.
mysql> insert into employee (empno, empname) values (1003,'kavitha');
Query OK, 1 row affected (0.07 sec)
mysql> select * from employee;
+-------+-----------+---------+
| empno | empname | empcity |
+-------+-----------+---------+
| 1001 | Karthik | Chennai |
| 1002 | Saravanan | NY |
| 1003 | kavitha | NULL |
+-------+-----------+---------+
3 rows in set (0.00 sec)
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--boundary-query "select min(empno), max(empno) from employee" \
--split-by empno \
--fetch-size 1000 \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--null-string '\\N' \
--null-non-string '\\N'
--null-string '\\N': The string to be interpreted as null for string columns.
--null-non-string '\\N': The string to be interpreted as null for non-string columns.
note: select min(empno), max(empno) from employee -- will work
select min (empno), max (empno) from employee -- throws a function-not-found error (because of the space before the parenthesis)
ram@hadoop:~$ hadoop fs -ls /user/ram/employee/
Found 4 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 22:29 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 21 2015-11-14 22:29 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-14 22:29 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 16 2015-11-14 22:29 /user/ram/employee/part-m-00002
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00000
1001|Karthik|Chennai
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m-00002
1003|kavitha|\N
Relaxed Isolation:
By default Sqoop uses the Read Committed isolation level.
--relaxed-isolation: use Read Uncommitted instead (not supported by all RDBMSs).
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--boundary-query "select min(empno), max(empno) from employee" \
--split-by empno \
--fetch-size 1000 \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--null-string '\\N' \
--null-non-string '\\N' \
--relaxed-isolation
ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-m-*
1001|Karthik|Chennai
1002|Saravanan|NY
1003|kavitha|\N
ram@hadoop:~$
--as-avrodatafile:
Avro data files are a compact, efficient binary format that provides interoperability with applications written in other programming languages. Avro also supports versioning, so that when, e.g., columns are added or removed from a table, previously imported data files can be processed along with new ones.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--as-avrodatafile
ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
(binary Avro output; each part file embeds the JSON schema {"type":"record","name":"employee","doc":"Sqoop import of employee","fields":[{"name":"empno","type":["null","int"],...},{"name":"empname","type":["null","string"],...},{"name":"empcity","type":["null","string"],...}],"tableName":"employee"} followed by the binary-encoded rows, so -cat is not human-readable; use -text instead, as below)
$ hadoop dfs -text /user/ram/employee/part-*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$
--as-sequencefile:
SequenceFiles are a binary format that store individual records in custom record-specific data types. These data types are manifested as Java classes. Sqoop will automatically generate these data types for you.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--as-sequencefile
ram@hadoop:~$ hadoop dfs -cat /user/ram/employee/part-*
(binary SequenceFile output; each part file starts with a SEQ header naming the key class org.apache.hadoop.io.LongWritable and the generated employee record class, followed by the binary-encoded rows, so -cat is not human-readable)
further reference: reading a SequenceFile via a Java program
http://hadooptutorial.info/reading-and-writing-sequencefile-example/
Compression:
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--as-avrodatafile \
--compress
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 23:11 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 481 2015-11-14 23:11 /user/ram/employee/part-m-00000.avro
-rw-r--r-- 1 ram supergroup 478 2015-11-14 23:11 /user/ram/employee/part-m-00001.avro
-rw-r--r-- 1 ram supergroup 473 2015-11-14 23:11 /user/ram/employee/part-m-00002.avro
ram@hadoop:~$
ram@hadoop:~$ hadoop dfs -text /user/ram/employee/part*
{"empno":{"int":1001},"empname":{"string":"Karthik"},"empcity":{"string":"Chennai"}}
{"empno":{"int":1002},"empname":{"string":"Saravanan"},"empcity":{"string":"NY"}}
{"empno":{"int":1003},"empname":{"string":"kavitha"},"empcity":null}
ram@hadoop:~$
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--compress \
--compression-codec org.apache.hadoop.io.compress.BZip2Codec
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 23:20 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 62 2015-11-14 23:20 /user/ram/employee/part-m-00000.bz2
-rw-r--r-- 1 ram supergroup 62 2015-11-14 23:20 /user/ram/employee/part-m-00001.bz2
-rw-r--r-- 1 ram supergroup 59 2015-11-14 23:20 /user/ram/employee/part-m-00002.bz2
ram@hadoop:~$
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
Error: native snappy library not available: SnappyCompressor has not been loaded.
<<how to download and configure snappy:>>
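Before configuring Snappy, a hedged first check: hadoop checknative reports which native libraries and codecs the local Hadoop build can actually load.
# list the native libraries (zlib, snappy, lz4, bzip2, ...) visible to this Hadoop installation
hadoop checknative -a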
--direct: use the direct connector if one exists for the database.
For example, MySQL provides the mysqldump tool which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--direct
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/
Found 4 items
-rw-r--r-- 1 ram supergroup 0 2015-11-14 23:32 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 18 2015-11-14 23:32 /user/ram/employee/part-m-00002
ram@hadoop:~$ hadoop dfs -ls /user/ram/employee/part-m*
-rw-r--r-- 1 ram supergroup 21 2015-11-14 23:32 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-14 23:32 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 18 2015-11-14 23:32 /user/ram/employee/part-m-00002
ram@hadoop:~$
Incremental imports:
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno \
--direct \
--incremental lastmodified \
--last-value 0 \
--check-column empno
note: --delete-target-dir cannot be used with incremental imports.
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--table employee \
--split-by empno \
--direct \
--check-column empno \
--incremental lastmodified \
--last-value 0
Error: java.lang.RuntimeException: Column type is neither timestamp nor date!
(lastmodified mode requires --check-column to be a date or timestamp column; empno is an int, so the next attempt switches to --incremental append)
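For reference, a hedged sketch of what a lastmodified incremental import would look like if the table had a date/timestamp column to check (last_updated and the target directory below are hypothetical, not part of this employee table):
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--table employee \
--split-by empno \
--target-dir /user/ram/employee_lastmod \
--check-column last_updated \
--incremental lastmodified \
--last-value "2015-11-14 00:00:00"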
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--table employee \
--split-by empno \
--direct \
--check-column empno \
--incremental append \
--last-value 0
15/11/15 05:19:26 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/11/15 05:19:26 INFO tool.ImportTool: --incremental append
15/11/15 05:19:26 INFO tool.ImportTool: --check-column empno
15/11/15 05:19:26 INFO tool.ImportTool: --last-value 1003
15/11/15 05:19:26 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
ram@hadoop:~$ hadoop fs -ls /user/ram/
Found 4 items
drwxr-xr-x - ram supergroup 0 2015-11-15 05:19 /user/ram/_sqoop
drwxr-xr-x - ram supergroup 0 2015-11-15 05:19 /user/ram/employee
drwxr-xr-x - ram supergroup 0 2015-11-14 16:10 /user/ram/employee_result
drwxr-xr-x - ram supergroup 0 2015-11-14 16:28 /user/ram/employee_result2
ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 3 items
-rw-r--r-- 1 ram supergroup 21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 18 2015-11-15 05:19 /user/ram/employee/part-m-00002
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*
1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
ram@hadoop:~$
Insert 2 more records:
mysql> insert into employee values (1004, 'ramkumar','NJ');
Query OK, 1 row affected (0.07 sec)
mysql> insert into employee values (1005, 'gopal','NJ');
Query OK, 1 row affected (0.05 sec)
sqoop import --connect jdbc:mysql://localhost/sales \
--username root --password ram \
--table employee \
--split-by empno \
--direct \
--check-column empno \
--incremental append \
--last-value 1003
15/11/15 05:27:21 INFO tool.ImportTool: --incremental append
15/11/15 05:27:21 INFO tool.ImportTool: --check-column empno
15/11/15 05:27:21 INFO tool.ImportTool: --last-value 1005
15/11/15 05:27:21 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')
ram@hadoop:~$ hadoop fs -ls /user/ram/employee
Found 5 items
-rw-r--r-- 1 ram supergroup 21 2015-11-15 05:19 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-15 05:19 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 18 2015-11-15 05:19 /user/ram/employee/part-m-00002
-rw-r--r-- 1 ram supergroup 17 2015-11-15 05:27 /user/ram/employee/part-m-00003
-rw-r--r-- 1 ram supergroup 14 2015-11-15 05:27 /user/ram/employee/part-m-00004
ram@hadoop:~$ hadoop fs -cat /user/ram/employee/part-m*
1001,Karthik,Chennai
1002,Saravanan,NY
1003,kavitha,NULL
1004,ramkumar,NJ
1005,gopal,NJ
ram@hadoop:~$
Create and execute Sqoop jobs:
sqoop job --create employee_bulk_import \
-- import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--delete-target-dir \
--target-dir /user/ram/employee \
--table employee \
--split-by empno
sqoop job --list
sqoop job --show employee_bulk_import
sqoop job --exec employee_bulk_import
sqoop job --delete employee_bulk_import
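The incremental import above can also be saved as a job, so Sqoop remembers and updates --last-value between runs. A hedged sketch (the job name is illustrative; --exec prompts for the database password unless a metastore or password file is configured):
sqoop job --create employee_incremental_import \
-- import \
--connect jdbc:mysql://localhost/sales \
--username root \
--table employee \
--split-by empno \
--check-column empno \
--incremental append \
--last-value 0
sqoop job --exec employee_incremental_import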
Hive Import
- If you have a Hive metastore associated with your HDFS cluster, Sqoop can also import the data into Hive by generating and executing a CREATE TABLE statement to define the data's layout in Hive. Importing data into Hive is as simple as adding the --hive-import option to your Sqoop command line.
- If the Hive table already exists, you can specify the --hive-overwrite option to indicate that the existing table in Hive must be replaced.
- Sqoop will generate a Hive script containing a CREATE TABLE operation defining your columns using Hive's types, and a LOAD DATA INPATH statement to move the data files into Hive's warehouse directory.
- Even though Hive supports escaping characters, it does not handle escaping of the new-line character. Also, it does not support the notion of enclosing characters that may include field delimiters in the enclosed string.
- It is therefore recommended that you choose unambiguous field and record-terminating delimiters without the help of escaping and enclosing characters when working with Hive; this is due to limitations of Hive's input parsing abilities. If you do use --escaped-by, --enclosed-by, or --optionally-enclosed-by when importing data into Hive, Sqoop will print a warning message.
- Hive will have problems using Sqoop-imported data if your database's rows contain string fields that have Hive's default row delimiters (\n and \r characters) or column delimiters (\01 characters) present in them. You can use the --hive-drop-import-delims option to drop those characters on import to give Hive-compatible text data. Alternatively, you can use the --hive-delims-replacement option to replace those characters with a user-defined string.
- Sqoop will pass the field and record delimiters through to Hive. If you do not set any delimiters and do use --hive-import, the field delimiter will be set to ^A and the record delimiter will be set to \n to be consistent with Hive's defaults.
- You should append the parameters --null-string and --null-non-string in case of an import job or --input-null-string and --input-null-non-string in case of an export job if you wish to properly preserve NULL values.
- The table name used in Hive is, by default, the same as that of the source table. You can control the output table name with the --hive-table option.
- Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments.
- You can import compressed tables into Hive using the --compress and --compression-codec options. One downside to compressing tables imported into Hive is that many codecs cannot be split for processing by parallel map tasks. The lzop codec, however, does support splitting: when importing tables with this codec, Sqoop will automatically index the files for splitting and configure the new Hive table with the correct InputFormat.
Note: External tables/partitions are not supported
Note: Partitions are created under the default location
Note: Existing Hive table can’t be truncated before the import
The --delete-target-dir option is not respected.
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--delete-target-dir \
--table employee \
--split-by empno \
--direct \
--hive-import \
--hive-table=hive_employee
15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Transferred 88 bytes in 35.049 seconds (2.5108 bytes/sec)
15/11/15 06:17:35 INFO mapreduce.ImportJobBase: Retrieved 5 records.
15/11/15 06:17:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `employee` AS t LIMIT 1
15/11/15 06:17:35 INFO hive.HiveImport: Loading uploaded data into Hive
15/11/15 06:17:54 INFO hive.HiveImport: Time taken: 2.839 seconds
15/11/15 06:17:54 INFO hive.HiveImport: Loading data to table default.hive_employee
15/11/15 06:17:55 INFO hive.HiveImport: Table default.hive_employee stats: [numFiles=5, totalSize=88]
15/11/15 06:17:55 INFO hive.HiveImport: Hive import complete.
15/11/15 06:17:55 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
ram@hadoop:~$
ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employee
Found 5 items
-rwxrwxr-x 1 ram supergroup 21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x 1 ram supergroup 17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x 1 ram supergroup 14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004
ram@hadoop:~$ hadoop fs -cat /user/hive/warehouse/hive_employee/part-m*
1001 Karthik Chennai
1002 Saravanan NY
1003 kavitha NULL
1004 ramkumar NJ
1005 gopal NJ
ram@hadoop:~$
ram@hadoop:~$ hive -e "select * from hive_employee"OK
1001 Karthik Chennai
1002 Saravanan NY
1003 kavitha NULL
1004 ramkumar NJ
1005 gopal NJ
Time taken: 2.727 seconds, Fetched: 5 row(s)
ram@hadoop:~$ hive -e "select * from hive_employee where empname = 'gopal'"1005 gopal NJ
Time taken: 3.182 seconds, Fetched: 1 row(s)
ram@hadoop:~$
What happens when you run the import again? The --delete-target-dir option is not respected.
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--delete-target-dir \
--table employee \
--split-by empno \
--direct \
--hive-import \
--hive-table=hive_employee
ram@hadoop:~$ hadoop fs -ls /user/hive/warehouse/hive_employee
Found 10 items
-rwxrwxr-x 1 ram supergroup 21 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00000
-rwxrwxr-x 1 ram supergroup 21 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00000_copy_1
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00001
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00001_copy_1
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00002
-rwxrwxr-x 1 ram supergroup 18 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00002_copy_1
-rwxrwxr-x 1 ram supergroup 17 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00003
-rwxrwxr-x 1 ram supergroup 17 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00003_copy_1
-rwxrwxr-x 1 ram supergroup 14 2015-11-15 06:17 /user/hive/warehouse/hive_employee/part-m-00004
-rwxrwxr-x 1 ram supergroup 14 2015-11-15 06:25 /user/hive/warehouse/hive_employee/part-m-00004_copy_1
ram@hadoop:~$ hive -e "select * from hive_employee"
OK
1001 Karthik Chennai
1001 Karthik Chennai
1002 Saravanan NY
1002 Saravanan NY
1003 kavitha NULL
1003 kavitha NULL
1004 ramkumar NJ
1004 ramkumar NJ
1005 gopal NJ
1005 gopal NJ
Time taken: 2.56 seconds, Fetched: 10 row(s)
ram@hadoop:~$
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--delete-target-dir \
--table employee \
--split-by empno \
--direct \
--hive-import \
--hive-overwrite
Note: --hive-overwrite deletes all the files in the employee table's warehouse folder before loading the data.
hive -e "drop table employee"
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--table employee \
--split-by empno \
--direct \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--hive-partition-key empno \
--hive-partition-value 1002
ERROR: Direct import currently do not support dropping hive delimiters, please remove parameter --hive-drop-import-delims.
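A hedged workaround sketch: run the import without --direct (and without the partition options), since the plain JDBC import path does accept --hive-drop-import-delims:
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--table employee \
--split-by empno \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--hive-table hive_employee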
hive -e "drop table employee"
sqoop import \
--username root \
--password ram \
--connect jdbc:mysql://localhost/sales \
--table employee \
--split-by empno \
--direct \
--hive-import \
--hive-overwrite \
--hive-partition-key "empcity" \
--hive-partition-value "NJ"
<<this creates files under /user/ram/employee. need to check>>
ram@hadoop:~$ hadoop fs -ls /user/ram/employee
15/11/15 15:36:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 6 items
-rw-r--r-- 1 ram supergroup 0 2015-11-15 15:34 /user/ram/employee/_SUCCESS
-rw-r--r-- 1 ram supergroup 21 2015-11-15 15:34 /user/ram/employee/part-m-00000
-rw-r--r-- 1 ram supergroup 18 2015-11-15 15:34 /user/ram/employee/part-m-00001
-rw-r--r-- 1 ram supergroup 18 2015-11-15 15:34 /user/ram/employee/part-m-00002
-rw-r--r-- 1 ram supergroup 17 2015-11-15 15:34 /user/ram/employee/part-m-00003
-rw-r--r-- 1 ram supergroup 14 2015-11-15 15:34 /user/ram/employee/part-m-00004
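A hedged way to check where the partitioned import above actually landed (the Hive table name defaults to employee since no --hive-table was given, and /user/hive/warehouse is assumed to be the default warehouse location):
hive -e "show partitions employee"
hadoop fs -ls /user/hive/warehouse/employee/empcity=NJ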