Wednesday, July 29, 2015

Hadoop Admin Commands quick reference

Reference: http://www.thegeekstuff.com/
 
Hadoop filesystem commands


hadoop fs -mkdir /dir
hadoop fs -ls
hadoop fs -cat <filename>
hadoop fs -rm <filename>
hadoop fs -mv file:///data/datafile /user/hduser/data
hadoop fs -touchz <filename>   <<create an empty file>>
hadoop fs -stat <filename>
hadoop fs -expunge   <<empty the trash on HDFS>>
ram@ram:/etc/init.d$ hadoop fs -du /user
50270  /user/1.log
0      /user/hive

hadoop fs -copyFromLocal <source> <destination>
hadoop fs -copyToLocal <source> <destination>
hadoop fs -put <source> <destination>   -- copy from the local filesystem to HDFS
hadoop fs -get <source> <destination>   -- copy from HDFS to the local filesystem
hadoop distcp hdfs://192.168.0.8:8020/input hdfs://192.168.0.8:8020/output
-- Copy data from one cluster to another using the cluster URL
hadoop fs -setrep -w 3 file1
hadoop fs -getmerge mydir bigfile
-- Merge files in mydir directory and download it as one big file




Hadoop Job Commands

hadoop job -submit <job-file>
hadoop job -status <job-id>
hadoop job -history <job-output-dir>
hadoop job -kill-task <task-id>
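
The hadoop job script is deprecated in Hadoop 2 (see the output below); the same operations can be run through the mapred command, for example:

mapred job -submit <job-file>
mapred job -status <job-id>
mapred job -list all
mapred job -kill <job-id>
mapred job -kill-task <task-id>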


ram@ram:/etc/init.d$ hadoop job -list all
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/07/29 21:03:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/29 21:03:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total jobs:0
                  JobId         State         StartTime        UserName           Queue      Priority     UsedContainers     RsvdContainers     UsedMem RsvdMem     NeededMem       AM info


ram@ram:/etc/init.d$ hadoop job -list-active-trackers
DEPRECATED: Use of this script to execute mapred command is deprecated.
Instead use the mapred command for it.

15/07/29 21:04:24 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
tracker_ram:49874



Hadoop Namenode commands

hadoop namenode -format
hadoop namenode -upgrade
hadoop namenode -recover -force
-- Recover namenode metadata after a cluster failure (may lose data)
hadoop fsck / -delete    <<delete corrupted files>>
hadoop fsck / -move    <<move corrupted files to the lost+found folder>>

ram@ram:/etc/init.d$ stop-dfs.sh
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode

ram@ram:/etc/init.d$ stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop


ram@ram:/etc/init.d$ start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-ram-namenode-ram.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-ram-datanode-ram.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-ram-secondarynamenode-ram.out


ram@ram:/etc/init.d$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-ram-resourcemanager-ram.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-ram-nodemanager-ram.out
ram@ram:/etc/init.d$


ram@ram:/etc/init.d$ jps
6330 NodeManager
6192 ResourceManager
5827 DataNode
6649 Jps
6028 SecondaryNameNode
5664 NameNode



ram@ram:/etc/init.d$ hadoop fsck /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://localhost:50070/fsck?ugi=ram&path=%2F
FSCK started by ram (auth:SIMPLE) from /127.0.0.1 for path / at Wed Jul 29 20:56:55 IST 2015
.
/user/1.log:  Under replicated BP-393036986-127.0.1.1-1437358619878:blk_1073741825_1001. Target Replicas is 3 but found 1 replica(s).
Status: HEALTHY
 Total size:    50270 B
 Total dirs:    7
 Total files:    1
 Total symlinks:        0
 Total blocks (validated):    1 (avg. block size 50270 B)
 Minimally replicated blocks:    1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    1 (100.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    1.0
 Corrupt blocks:        0
 Missing replicas:        2 (66.666664 %)
 Number of data-nodes:        1
 Number of racks:        1
FSCK ended at Wed Jul 29 20:56:55 IST 2015 in 3 milliseconds


The filesystem under path '/' is HEALTHY



ram@ram:/etc/init.d$ hadoop fsck / -files -blocks -locations -racks
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.


Connecting to namenode via http://localhost:50070/fsck?ugi=ram&files=1&blocks=1&locations=1&racks=1&path=%2F
FSCK started by ram (auth:SIMPLE) from /127.0.0.1 for path / at Wed Jul 29 20:58:22 IST 2015
/ <dir>
/tmp <dir>
/tmp/hive <dir>
/tmp/hive/ram <dir>
/user <dir>
/user/1.log 50270 bytes, 1 block(s):  Under replicated BP-393036986-127.0.1.1-1437358619878:blk_1073741825_1001. Target Replicas is 3 but found 1 replica(s).
0. BP-393036986-127.0.1.1-1437358619878:blk_1073741825_1001 len=50270 repl=1 [/default-rack/127.0.0.1:50010]

/user/hive <dir>
/user/hive/warehouse <dir>
Status: HEALTHY
 Total size:    50270 B
 Total dirs:    7
 Total files:    1
 Total symlinks:        0
 Total blocks (validated):    1 (avg. block size 50270 B)
 Minimally replicated blocks:    1 (100.0 %)
 Over-replicated blocks:    0 (0.0 %)
 Under-replicated blocks:    1 (100.0 %)
 Mis-replicated blocks:        0 (0.0 %)
 Default replication factor:    3
 Average block replication:    1.0
 Corrupt blocks:        0
 Missing replicas:        2 (66.666664 %)
 Number of data-nodes:        1
 Number of racks:        1
FSCK ended at Wed Jul 29 20:58:22 IST 2015 in 3 milliseconds


The filesystem under path '/' is HEALTHY



Hadoop dfsadmin commands

ram@ram:/etc/init.d$ hadoop dfsadmin -report
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Configured Capacity: 98496679936 (91.73 GB)
Present Capacity: 80164052992 (74.66 GB)
DFS Remaining: 80163958784 (74.66 GB)
DFS Used: 94208 (92 KB)
DFS Used%: 0.00%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 127.0.0.1:50010 (localhost)
Hostname: ram
Decommission Status : Normal
Configured Capacity: 98496679936 (91.73 GB)
DFS Used: 94208 (92 KB)
Non DFS Used: 18332626944 (17.07 GB)
DFS Remaining: 80163958784 (74.66 GB)
DFS Used%: 0.00%
DFS Remaining%: 81.39%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Jul 29 21:06:41 IST 2015



ram@ram:/etc/init.d$ hadoop dfsadmin -setQuota 10 /user
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.



ram@ram:/etc/init.d$ hadoop fs -count -q /user
          10               6            none             inf            3            1              50270 /user
ram@ram:/etc/init.d$
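
The -count -q columns are QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. Related quota commands (a quick sketch; the path and size here are only examples):

hadoop dfsadmin -setSpaceQuota 1g /user     <<limit the space consumed under /user to 1 GB>>
hadoop dfsadmin -clrSpaceQuota /user        <<remove the space quota>>
hadoop dfsadmin -clrQuota /user             <<remove the name (file count) quota>>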

ram@ram:/etc/init.d$ hadoop dfsadmin -safemode enter
Safe mode is ON


ram@ram:/etc/init.d$ hadoop dfsadmin -saveNamespace
<<Backup Metadata (fsimage & edits). Put cluster in safe mode before this command.>>
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Save namespace successful


ram@ram:/etc/init.d$ hadoop dfsadmin -safemode get
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is ON



ram@ram:/etc/init.d$ hadoop dfsadmin -safemode leave
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Safe mode is OFF
ram@ram:/etc/init.d$



Hadoop yarn commands
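
A few commonly used YARN CLI commands (quick reference, assuming Hadoop 2.x):

yarn application -list                        <<list running applications>>
yarn application -status <application-id>
yarn application -kill <application-id>
yarn node -list                               <<list node managers>>
yarn logs -applicationId <application-id>     <<fetch logs of a finished application>>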



Hadoop Balancer commands

ram@ram:/etc/init.d$ start-balancer.sh
starting balancer, logging to /usr/local/hadoop/logs/hadoop-ram-balancer-ram.out

hadoop dfsadmin -setBalancerBandwidth <bandwidth-in-bytes-per-second>
ram@ram:/etc/init.d$ hadoop balancer -threshold 20
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/07/29 21:16:22 INFO balancer.Balancer: Using a threshold of 20.0
15/07/29 21:16:22 INFO balancer.Balancer: namenodes  = [hdfs://localhost:9000]
15/07/29 21:16:22 INFO balancer.Balancer: parameters = Balancer.Parameters[BalancingPolicy.Node, threshold=20.0, max idle iteration = 5, number of nodes to be excluded = 0, number of nodes to be included = 0]
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  Bytes Being Moved
15/07/29 21:16:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/07/29 21:16:24 INFO net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:50010
15/07/29 21:16:24 INFO balancer.Balancer: 0 over-utilized: []
15/07/29 21:16:24 INFO balancer.Balancer: 0 underutilized: []
The cluster is balanced. Exiting...
29 Jul, 2015 9:16:24 PM           0                  0 B                 0 B               -1 B
29 Jul, 2015 9:16:24 PM  Balancing took 2.217 seconds
ram@ram:/etc/init.d$





Hadoop Notes

Hadoop Install directory - /usr/lib/hadoop-0.20/

The NameNode web UI port ends in 70 (50070), the job tracker's in 30 (50030) and the task tracker's in 60 (50060).

3 config files: core-site.xml, mapred-site.xml, hdfs-site.xml
Spill factor is the fraction of the in-memory sort buffer after which map output spills to disk; the Hadoop temp directory is used for this.
hdfs-site.xml properties (see the sketch after these notes):
dfs.name.dir, dfs.data.dir and fs.checkpoint.dir
Fsck – file system check
Jps – to check if hadoop daemons are running
Restart hadoop daemons
start-yarn.sh, stop-yarn.sh
start-all.sh, stop-all.sh
The slaves and masters files are used by the startup and shutdown scripts.
Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers.
Masters contain a list of hosts, one per line, that are to host secondary namenode servers.
hadoop-env.sh provides the environment for Hadoop to run; JAVA_HOME is set here.
The mapred.job.tracker property (in mapred-site.xml) specifies which of your nodes acts as the job tracker (see the sketch after these notes).
/etc/init.d is where daemons (services) are placed and where you can check their status. It is Linux-specific and has nothing to do with Hadoop.
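
A minimal sketch tying the notes above together; the directories, hostname and values are placeholders, not recommendations:

-- hdfs-site.xml
<property><name>dfs.name.dir</name><value>/var/hadoop/name</value></property>
<property><name>dfs.data.dir</name><value>/var/hadoop/data</value></property>
<property><name>fs.checkpoint.dir</name><value>/var/hadoop/namesecondary</value></property>

-- mapred-site.xml
<property><name>mapred.job.tracker</name><value>hadoopmaster:8021</value></property>
<property><name>io.sort.mb</name><value>100</value></property>                <!-- sort buffer size in MB -->
<property><name>io.sort.spill.percent</name><value>0.80</value></property>    <!-- spill factor -->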


Which are the three modes in which Hadoop can be run?

1. standalone (local) mode – no daemons, all on single JVM, no dfs, only local file system.
2. Pseudo-distributed mode – all daemons run on a single machine, each in its own JVM, with HDFS on the local host (see the config sketch after this list).
3. Fully distributed mode – daemons running on clusters.
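
For pseudo-distributed mode the usual minimal settings look roughly like this (localhost:9000 matches the defaults used elsewhere in this post; treat the values as examples):

-- core-site.xml
<property><name>fs.defaultFS</name><value>hdfs://localhost:9000</value></property>

-- hdfs-site.xml
<property><name>dfs.replication</name><value>1</value></property>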


How can we check whether Namenode is working or not?
To check whether the Namenode is working, use the command /etc/init.d/hadoop-0.20-namenode status, or simply run jps.
Default Ports
SSH – 22
The NameNode web UI port ends in 70, the JobTracker's in 30 and the TaskTracker's in 60:
http://Hadoopmaster:50070/ – web UI of the NameNode daemon
http://Hadoopmaster:50030/ – web UI of the JobTracker daemon
http://Hadoopmaster:50060/ – web UI of the TaskTracker daemon


Quickly switching hadoop modes

hadoop@computer:~$ cd /your/hadoop/installation/
hadoop@computer:~$ cp -R conf conf.standalone
hadoop@computer:~$ cp -R conf conf.pseudo
hadoop@computer:~$ cp -R conf conf.distributed
hadoop@computer:~$ rm -R conf

ln – creates a symbolic link to a folder.

Switching to standalone mode:
hadoop@computer:~$ ln -s conf.standalone conf
Switching to pseudo-distributed mode:
hadoop@computer:~$ ln -s conf.pseudo conf
Switching to fully distributed mode:
hadoop@computer:~$ ln -s conf.distributed conf
(remove the existing conf link before re-linking to a different mode)

Map and reduce slots are controlled in mapred-site.xml (see the sketch below):
mapreduce.tasktracker.map.tasks.maximum
mapreduce.tasktracker.reduce.tasks.maximum
Important: If you change these settings, restart all of the TaskTracker nodes.
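
A sketch of those two settings in mapred-site.xml (the values 4 and 2 are only examples; tune them to your hardware):

<property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>4</value></property>
<property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property>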


What are the network requirements for Hadoop?

The Hadoop core uses SSH to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slave and secondary machines.
SSH is a secure shell protocol that runs on port 22; by default each connection asks for a password (or key passphrase), which is why Hadoop is set up with key-based, password-less SSH (see the sketch below).
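
A typical way to set up password-less SSH from the master to a slave (assumes OpenSSH; hduser and slave1 are placeholder names):

ssh-keygen -t rsa -P ""              # generate a key pair with an empty passphrase
ssh-copy-id hduser@slave1            # append the public key to the slave's authorized_keys
ssh hduser@slave1                    # should now log in without prompting for a password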

What happens to job tracker when Namenode is down?

When Namenode is down, your cluster is OFF, this is because Namenode is the single point of failure in HDFS.


What happens to a Namenode when the job tracker is down?

When the job tracker is down, MapReduce jobs cannot be submitted or run, but the Namenode is still up. So the cluster (HDFS) remains accessible as long as the Namenode is working, even if the job tracker is not.

 

Does the HDFS client decide the input split or Namenode?

No, the client does not decide it on its own; the input split size is driven by the configuration (see the sketch below).
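
The properties that typically drive the split size (a sketch for Hadoop 2 MapReduce; the values are examples only):

<property><name>dfs.blocksize</name><value>134217728</value></property>                                   <!-- 128 MB -->
<property><name>mapreduce.input.fileinputformat.split.minsize</name><value>1</value></property>
<property><name>mapreduce.input.fileinputformat.split.maxsize</name><value>268435456</value></property>   <!-- 256 MB -->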

Monday, July 20, 2015

Install hive 1.2.1 on ubuntu 14.04

1. download Hive from the path below
http://apache.mirrors.hoobly.com/hive/

2. extract the .tar.gz file and move the folder to /usr/lib/hive path
sudo mv Downloads/apache-hive-1.2.1-bin /usr/lib/hive

3. provide access to hive path
sudo chown -R ram /usr/lib/hive

4. configure environment variables in .bashrc

export HIVE_HOME=/usr/lib/hive/
export PATH=$PATH:$HIVE_HOME/bin

5. apply the changes
source ~/.bashrc

6. create folders for hive in HDFS

hadoop fs -mkdir -p /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

7. run hive

hive
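
A quick sanity check once the hive> prompt appears (the table name is just an example):

show databases;
create table test_tbl (id int, name string);
show tables;
drop table test_tbl;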

8. starting hive server1 and server2

$HIVE_HOME/bin/hive --service hiveserver
$HIVE_HOME/bin/hiveserver2
(note: HiveServer1 was removed in Hive 1.0, so on Hive 1.2.1 only hiveserver2 will actually start)
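
Once HiveServer2 is running you can connect to it with beeline (10000 is the default HiveServer2 port; adjust if you changed it):

beeline -u jdbc:hive2://localhost:10000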

Note: Hive Web UI

1. start hwi service
hive --service hwi

2. http://localhost:9999/hwi/index.jsp



HDFS frequently used commands

ram@ram:~$ hadoop fsck /
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

15/07/20 20:30:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://localhost:50070/fsck?ugi=ram&path=%2F
FSCK started by ram (auth:SIMPLE) from /127.0.0.1 for path / at Mon Jul 20 20:30:04 IST 2015
Status: HEALTHY
 Total size:    0 B
 Total dirs:    2
 Total files:    0
 Total symlinks:        0
 Total blocks (validated):    0
 Minimally replicated blocks:    0
 Over-replicated blocks:    0
 Under-replicated blocks:    0
 Mis-replicated blocks:        0
 Default replication factor:    3
 Average block replication:    0.0
 Corrupt blocks:        0
 Missing replicas:        0
 Number of data-nodes:        1
 Number of racks:        1
FSCK ended at Mon Jul 20 20:30:04 IST 2015 in 3 milliseconds


The filesystem under path '/' is HEALTHY

ram@ram:~$ hadoop fs -copyFromLocal /home/ram/Documents/1.log /user
ram@ram:~$ hadoop fs -ls /user
Found 1 items
-rw-r--r--   3 ram supergroup      50270 2015-07-20 20:34 /user/1.log

ram@ram:~$ hadoop fs -copyToLocal /user/1.log ~/
15/07/20 20:36:38 WARN hdfs.DFSClient: DFSInputStream has been closed already

ram@ram:~$ ls
1.log  Desktop  Documents  Downloads  examples.desktop  Music  Pictures  Public  Templates  Videos  zookeeper

ram@ram:~$ hadoop fs -setrep -w 3 /user/1.log
Replication 3 set: /user/1.log

ram@ram:~$ hadoop fs -cat /user/1.log
ram@ram:~$ hadoop fs -du /user
50270  /user/1.log


ram@ram:~$ hadoop fs -ls hdfs://localhost:9000/
Found 2 items
drwxrwxr-x   - ram supergroup          0 2015-07-20 21:45 hdfs://localhost:9000/tmp
drwxr-xr-x   - ram supergroup          0 2015-07-20 21:43 hdfs://localhost:9000/user
ram@ram:~$

Install HBase 1.0.1.1 on Ubuntu 14.04

Steps:

1. Download the HBase tar.gz file from the path below
http://apache.mirrors.ionfish.org/hbase/

2. unzip the file and move the extracted folder to /usr/lib/hbase path
sudo mv ~/Downloads/hbase-1.0.1.1 /usr/lib/hbase

3. edit the /usr/lib/hbase/conf/hbase-env.sh file (e.g. with gedit) to include the JAVA_HOME path
# The java implementation to use.  Java 1.6 required.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64


4. update hbase path in .bashrc file

#HBASE_PATH
export HBASE_HOME=/usr/lib/hbase/
export PATH=$PATH:$HBASE_HOME/bin


5. update conf/hbase-site.xml with working path

<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/lib/hbase/data_dir</value>
</property>

<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/ram/zookeeper</value>
</property>
</configuration>


6. start hbase

start-hbase.sh

7. using hbase shell

hbase shell
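
A quick smoke test inside the shell (the table and column family names are just examples):

status
create 'test', 'cf'
put 'test', 'row1', 'cf:a', 'value1'
scan 'test'
disable 'test'
drop 'test'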

8. stop hbase daemon

stop-hbase.sh


Errors and fixes:

Error 1 :
ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4


reason: compatibility issue between Hadoop 2.x and HBase 0.94.6 (the 0.94.6 client speaks IPC version 4, while the Hadoop 2 server speaks version 9).
fix: install the latest HBase version (1.0.1.1), which ships a Hadoop 2 compatible client.