Wednesday, July 29, 2015

Hadoop Notes

Hadoop Install directory - /usr/lib/hadoop-0.20/

The web UI port for the NameNode is 50070, for the JobTracker 50030, and for the TaskTracker 50060 (hence the shorthand '70', '30' and '60').

3 config files: core-site.xml, mapred-site.xml, hdfs-site.xml
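All three files share the same property format. As a rough sketch, a minimal core-site.xml for a pseudo-distributed setup might look like this (hdfs://localhost:8020 is a placeholder; your host and port may differ):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>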
The spill factor is the fraction of the in-memory sort buffer at which map output starts spilling to temporary files on disk; the Hadoop temp directory (hadoop.tmp.dir) is used for this.
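In Hadoop 0.20 / MRv1 the relevant mapred-site.xml properties, with their usual defaults, are:
io.sort.mb = 100 (size of the in-memory sort buffer, in MB)
io.sort.spill.percent = 0.80 (the spill factor: start spilling at 80% full)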
hdfs-site.xml properties:
dfs.name.dir, dfs.data.dir and fs.checkpoint.dir
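A sketch of how these look in hdfs-site.xml (the paths are placeholders):

<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn</value>   <!-- where the NameNode stores its metadata -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn</value>   <!-- where DataNodes store blocks -->
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/data/1/dfs/snn</value>  <!-- where the secondary NameNode keeps checkpoints -->
</property>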
fsck – file system check
jps – to check whether the Hadoop daemons are running
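Example usage (fsck on / checks the whole filesystem):
hadoop@computer:~$ hadoop fsck / -files -blocks
hadoop@computer:~$ jps
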
Restarting Hadoop daemons:
start-yarn.sh, stop-yarn.sh
start-all.sh, stop-all.sh
The slaves and masters files are used by the startup and shutdown commands.
The slaves file consists of a list of hosts, one per line, that host the DataNode and TaskTracker servers.
The masters file contains a list of hosts, one per line, that host the secondary NameNode servers.
hadoop-env.sh provides the environment for Hadoop to run; JAVA_HOME is set here.
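For example, in conf/hadoop-env.sh (the JDK path below is a placeholder; point it at your own installation):
export JAVA_HOME=/usr/lib/jvm/java-6-sun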
mapred.job.tracker is a configuration property (set in mapred-site.xml), not a command; it tells the cluster which node acts as the JobTracker.
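For instance (Hadoopmaster:8021 is a placeholder host:port):

<property>
  <name>mapred.job.tracker</name>
  <value>Hadoopmaster:8021</value>
</property>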
/etc/init.d is where daemon (service) scripts are placed and where you can check their status. This is a Linux convention and has nothing to do with Hadoop itself.


What are the three modes in which Hadoop can be run?

1. Standalone (local) mode – no daemons; everything runs in a single JVM against the local file system (no HDFS).
2. Pseudo-distributed mode – all daemons run on a single machine, each in its own JVM, simulating a small cluster.
3. Fully distributed mode – daemons run across a cluster of machines. (The key configuration difference between the modes is sketched below.)
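Roughly, the mode is selected by fs.default.name (and mapred.job.tracker for MapReduce); the hostnames and ports below are placeholders:

standalone:         fs.default.name = file:/// (the default – local file system)
pseudo-distributed: fs.default.name = hdfs://localhost:8020
fully distributed:  fs.default.name = hdfs://Hadoopmaster:8020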


How can we check whether Namenode is working or not?
To check whether the NameNode is working, use the command /etc/init.d/hadoop-0.20-namenode status, or simply run jps.
Default Ports
SSH – 22
http://Hadoopmaster:50070/ – web UI of the NameNode daemon
http://Hadoopmaster:50030/ – web UI of the JobTracker daemon
http://Hadoopmaster:50060/ – web UI of the TaskTracker daemon


Quickly switching Hadoop modes

hadoop@computer:~$ cd /your/hadoop/installation/
hadoop@computer:~$ cp -R conf conf.standalone
hadoop@computer:~$ cp -R conf conf.pseudo
hadoop@computer:~$ cp -R conf conf.distributed
hadoop@computer:~$ rm -R conf

ln – creates a symbolic link to a file or folder.

Switching to standalone mode:
hadoop@computer:~$ ln -s conf.standalone conf

Switching to pseudo-distributed mode:
hadoop@computer:~$ ln -s conf.pseudo conf

Switching to fully distributed mode:
hadoop@computer:~$ ln -s conf.distributed conf

(To switch again later, remove the existing conf symlink first: rm conf.)
Map and reduce slots are controlled in mapred-site.xml:
mapreduce.tasktracker.map.tasks.maximum
mapreduce.tasktracker.reduce.tasks.maximum
(In Hadoop 0.20 the older names are mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.)
Important: If you change these settings, restart all of the TaskTracker nodes.
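A sketch of the corresponding mapred-site.xml entries (the slot counts 4 and 2 are just example values):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>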


What are the network requirements for Hadoop?

The Hadoop core uses SSH to launch the server processes on the slave nodes. It requires a password-less SSH connection between the master and all the slave and secondary machines.
SSH (secure shell) is a protocol for secure communication that works on port 22. By default an SSH login asks for a password, which is why, for Hadoop, you set up key-based (password-less) authentication instead, as sketched below.
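A minimal sketch of setting up password-less SSH (slave1 is a placeholder hostname):

hadoop@computer:~$ ssh-keygen -t rsa -P ""
hadoop@computer:~$ ssh-copy-id hadoop@slave1
hadoop@computer:~$ ssh slave1     # should now log in without a password prompt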

What happens to the JobTracker when the NameNode is down?

When the NameNode is down, the whole cluster is down. This is because the NameNode is the single point of failure in HDFS.


What happens to the NameNode when the JobTracker is down?

When the JobTracker is down, MapReduce jobs cannot run, but the NameNode will still be present. So the cluster is still accessible if the NameNode is working, even though the JobTracker is not.

Does the HDFS client decide the input split, or the NameNode?

No, the client does not decide. The input split is already specified in the configuration: by default it follows the HDFS block size, and it can be tuned through the split-size properties, as sketched below.
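A sketch of the relevant properties (the values shown are the common 0.20-era defaults):

dfs.block.size = 67108864 (64 MB, in hdfs-site.xml)
mapred.min.split.size = 0 (in mapred-site.xml; raise it to force larger splits)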
