Source: http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Introduction
- The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.
- HDFS is highly fault-tolerant.
Hardware Failure
An HDFS instance may consist of
hundreds or thousands of server machines, each storing part of the
file system’s data. Because there are a huge number of components, each with a non-trivial probability of failure, some component of HDFS is effectively always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
Streaming Data Access
Applications that run on HDFS need
streaming access to their data sets.
HDFS is designed more for batch processing than for interactive use by users.
The emphasis is on high throughput of data access rather than low latency of data access.
Large Data Sets
Applications that run on HDFS have
large data sets. A typical file in HDFS is gigabytes to terabytes in
size.
Simple Coherency Model
HDFS applications need a
write-once-read-many access model for files.
A file once created, written, and
closed need not be changed.
A Map/Reduce application or a web crawler application fits perfectly with this model. There is a plan
to support appending-writes to files in the future.
“Moving Computation is Cheaper than Moving Data”
A computation requested by an
application is much more efficient if it is executed near the data it
operates on. HDFS provides interfaces for applications to move
themselves closer to where the data is located.
Portability Across Heterogeneous Hardware and Software Platforms
HDFS has been designed to be easily portable from one platform to another.
NameNode and DataNodes
HDFS has a master/slave
architecture. An HDFS cluster consists of a single NameNode, a master
server that manages the file system namespace and regulates access to
files by clients. In addition, there are a number of DataNodes,
usually one per node in the cluster, which manage storage attached to
the nodes that they run on.
Internally, a file is split into one
or more blocks and these blocks are stored in a set of DataNodes. The
DataNodes are responsible for serving read and write requests from
the file system’s clients. The DataNodes also perform block
creation, deletion, and replication upon instruction from the
NameNode.
The NameNode and DataNode machines typically run a GNU/Linux operating system (OS).
HDFS is built using the Java
language; any machine that supports Java can run the NameNode or the
DataNode software.
A typical deployment has a dedicated
machine that runs only the NameNode software.
Each of the other machines in the
cluster runs one instance of the DataNode software.
The architecture does not preclude
running multiple DataNodes on the same machine but in a real
deployment that is rarely the case.
The existence of a single NameNode
in a cluster greatly simplifies the architecture of the system. The
NameNode is the arbitrator and repository for all HDFS metadata. The
system is designed in such a way that user data never flows through
the NameNode.
The File System Namespace
A user or an application can create
directories and store files inside these directories.
HDFS does not support hard links or
soft links. However, the HDFS architecture does not preclude
implementing these features.
The NameNode maintains the file
system namespace. Any change to the file system namespace or its
properties is recorded by the NameNode. An application can specify
the number of replicas of a file that should be maintained by HDFS.
The number of copies of a file is called the replication factor of
that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store
very large files across machines in a large cluster. It stores each
file as a sequence of blocks; all blocks in a file except the last block are the same size.
The blocks of a file are replicated
for fault tolerance. The block size and replication factor are
configurable per file.
An application can specify the
number of replicas of a file. The replication factor can be specified
at file creation time and can be changed later. Files in HDFS are
write-once and have strictly one writer at any time.
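As a minimal sketch of how an application might do this through the FileSystem Java API (the path and values below are illustrative assumptions, not taken from this document):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationFactorExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/example/data.log");   // hypothetical path

        // Specify a replication factor of 3 at file creation time.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
          out.writeBytes("example record\n");
        }
        fs.close();
      }
    }

The same FileSystem handle can later change the factor with setReplication, as noted in the Decrease Replication Factor section below.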
The NameNode makes all decisions
regarding replication of blocks. It periodically receives a Heartbeat
and a Blockreport from each of the DataNodes in the cluster. A
Blockreport contains a list of all blocks on a DataNode.
Replica Placement: The First Baby Steps
Large HDFS instances run on a cluster of computers that is commonly spread across many racks.
Communication between two nodes in different racks has to go through
switches. In most cases, network bandwidth between machines in the
same rack is greater than network bandwidth between machines in
different racks.
A simple but non-optimal policy is
to place replicas on unique racks. This prevents losing data when an
entire rack fails and allows use of bandwidth from multiple racks
when reading data.
For the common case, when the
replication factor is three, HDFS’s placement policy is to put one
replica on one node in the local rack, another on a different node in
the local rack, and the last on a different node in a different rack.
This policy cuts the inter-rack write traffic, which generally improves write performance. With this policy, the replicas of a file
do not evenly distribute across the racks. One third of replicas are
on one node, two thirds of replicas are on one rack, and the other
third are evenly distributed across the remaining racks. This policy
improves write performance without compromising data reliability or
read performance.
Replica Selection
To minimize global bandwidth
consumption and read latency, HDFS tries to satisfy a read request
from a replica that is closest to the reader. If there exists a
replica on the same rack as the reader node, then that replica is
preferred to satisfy the read request. If an HDFS cluster spans
multiple data centers, then a replica that is resident in the local
data center is preferred over any remote replica.
Safemode
On startup, the NameNode enters a
special state called Safemode. Replication of data blocks does not
occur when the NameNode is in the Safemode state. The NameNode
receives Heartbeat and Blockreport messages from the DataNodes. A
Blockreport contains the list of data blocks that a DataNode is
hosting. Each block has a specified minimum number of replicas. A
block is considered safely replicated when the minimum number of
replicas of that data block has checked in with the NameNode. After a
configurable percentage of safely replicated data blocks checks in
with the NameNode (plus an additional 30 seconds), the NameNode exits
the Safemode state. It then determines the list of data blocks (if
any) that still have fewer than the specified number of replicas. The
NameNode then replicates these blocks to other DataNodes.
The Persistence of File System Metadata
The HDFS namespace is stored by the
NameNode. The NameNode uses a transaction log called the EditLog to
persistently record every change that occurs to file system metadata.
The NameNode uses a file in its local host OS file system to store
the EditLog. The entire file system namespace, including the mapping
of blocks to files and file system properties, is stored in a file
called the FsImage. The FsImage is stored as a file in the NameNode’s
local file system too.
The NameNode keeps an image of the
entire file system namespace and file Blockmap in memory. A NameNode with 4 GB of RAM is plenty to support a huge number of files and
directories. When the NameNode starts up, it reads the FsImage and
EditLog from disk, applies all the transactions from the EditLog to
the in-memory representation of the FsImage, and flushes out this new
version into a new FsImage on disk. It can then truncate the old
EditLog because its transactions have been applied to the persistent
FsImage. This process is called a checkpoint. In the current
implementation, a checkpoint only occurs when the NameNode starts up.
Work is in progress to support periodic checkpointing in the near
future.
The DataNode stores HDFS data in
files in its local file system. The DataNode has no knowledge about
HDFS files. It stores each block of HDFS data in a separate file in
its local file system. The DataNode does not create all files in the
same directory. Instead, it uses a heuristic to determine the optimal
number of files per directory and creates subdirectories
appropriately. It is not optimal to create all local files in the
same directory because the local file system might not be able to
efficiently support a huge number of files in a single directory.
When a DataNode starts up, it scans through its local file system,
generates a list of all HDFS data blocks that correspond to each of
these local files and sends this report to the NameNode: this is the
Blockreport.
Communication Protocols
All HDFS communication protocols are
layered on top of the TCP/IP protocol. A client establishes a
connection to a configurable TCP port on the NameNode machine. It
talks the ClientProtocol with the NameNode. The DataNodes talk to the
NameNode using the DataNode Protocol. A Remote Procedure Call (RPC)
abstraction wraps both the Client Protocol and the DataNode Protocol.
By design, the NameNode never initiates any RPCs. Instead, it only
responds to RPC requests issued by DataNodes or clients.
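As an illustrative sketch (the NameNode host name and port are assumptions; the actual address is whatever the cluster configures), a client typically reaches that configurable endpoint through the fs.defaultFS setting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ClientConnectionExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; the TCP port is configurable per cluster.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        // The returned client speaks the ClientProtocol to the NameNode over RPC.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Using file system: " + fs.getUri());
        fs.close();
      }
    }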
Robustness
The
primary objective of HDFS is to store data reliably even in the
presence of failures. The three common types of failures are NameNode
failures, DataNode failures and network partitions.
Data Disk Failure, Heartbeats and Re-Replication
Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks
DataNodes without recent Heartbeats as dead and does not forward any
new IO requests to them. Any data that was registered to a dead
DataNode is not available to HDFS any more. DataNode death may cause
the replication factor of some blocks to fall below their specified
value. The NameNode constantly tracks which blocks need to be
replicated and initiates replication whenever necessary. The
necessity for re-replication may arise due to many reasons: a
DataNode may become unavailable, a replica may become corrupted, a
hard disk on a DataNode may fail, or the replication factor of a file
may be increased.
Cluster Rebalancing
The
HDFS architecture is compatible with data rebalancing schemes. A
scheme might automatically move data from one DataNode to another if
the free space on a DataNode falls below a certain threshold. In the
event of a sudden high demand for a particular file, a scheme might
dynamically create additional replicas and rebalance other data in
the cluster. These types
of data rebalancing schemes are not yet implemented.
Data Integrity
It is possible that a block of data
fetched from a DataNode arrives corrupted. This corruption can occur
because of faults in a storage device, network faults, or buggy
software. The HDFS client software implements checksum checking on
the contents of HDFS files. When a client creates an HDFS file, it
computes a checksum of each block of the file and stores these
checksums in a separate hidden file in the same HDFS namespace. When
a client retrieves file contents it verifies that the data it
received from each DataNode matches the checksum stored in the
associated checksum file. If not, then the client can opt to retrieve
that block from another DataNode that has a replica of that block.
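The idea can be pictured with a small, self-contained sketch; this is not HDFS's actual implementation, only an illustration of comparing a freshly computed checksum against one recorded at write time (CRC32 is used here purely as an example algorithm):

    import java.util.zip.CRC32;

    public class ChecksumSketch {
      // Returns true if the received block bytes match the checksum recorded at write time.
      static boolean blockLooksValid(byte[] receivedBlock, long storedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(receivedBlock, 0, receivedBlock.length);
        return crc.getValue() == storedChecksum;
      }

      public static void main(String[] args) {
        byte[] block = "example block contents".getBytes();
        CRC32 writerCrc = new CRC32();
        writerCrc.update(block, 0, block.length);
        long storedChecksum = writerCrc.getValue();   // what the writer would record

        // A mismatch here would prompt the client to fetch the block from another replica.
        System.out.println(blockLooksValid(block, storedChecksum));
      }
    }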
Metadata Disk Failure
The
FsImage and the EditLog are central data structures of HDFS. A
corruption of these files can cause the HDFS instance to be
non-functional. For this reason, the NameNode can be configured to
support maintaining multiple copies of the FsImage and EditLog. Any
update to either the FsImage or EditLog causes each of the FsImages
and EditLogs to get updated synchronously. This synchronous updating
of multiple copies of the FsImage and EditLog may degrade the rate of
namespace transactions per second that a NameNode can support.
However, this degradation is acceptable because even though HDFS
applications are very data intensive in nature, they are not metadata
intensive. When a NameNode restarts, it selects the latest consistent
FsImage and EditLog to use.
The
NameNode machine is a single point of failure for an HDFS cluster. If
the NameNode machine fails, manual intervention is necessary.
Currently, automatic
restart and failover of the NameNode software to another machine is
not supported.
Snapshots
Snapshots
support storing a copy of data at a particular instant of time. One
usage of the snapshot feature may be to roll back a corrupted HDFS
instance to a previously known good point in time. HDFS
does not currently support snapshots but will in a future release.
Data Blocks
HDFS supports write-once-read-many
semantics on files. A typical block size used by HDFS is 64 MB. Thus,
an HDFS file is chopped up into 64 MB chunks, and if possible, each
chunk will reside on a different DataNode.
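For instance, with a 64 MB block size a 200 MB file is stored as three full 64 MB blocks plus one 8 MB block. As a rough sketch (the path, buffer size and overwrite flag below are assumptions made for illustration), the per-file block size and replication factor can be passed to FileSystem.create:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long blockSize = 64L * 1024 * 1024;   // 64 MB, as described above
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                 new Path("/user/example/big.dat"), true, 4096, (short) 3, blockSize)) {
          out.writeBytes("payload\n");
        }
        fs.close();
      }
    }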
Staging
A client request to create a file
does not reach the NameNode immediately. In fact, initially the HDFS
client caches the file data into a temporary local file. Application
writes are transparently redirected to this temporary local file.
When the local file accumulates data worth over one HDFS block size,
the client contacts the NameNode. The NameNode inserts the file name
into the file system hierarchy and allocates a data block for it. The
NameNode responds to the client request with the identity of the
DataNode and the destination data block. Then the client flushes the
block of data from the local temporary file to the specified
DataNode. When a file is closed, the remaining un-flushed data in the
temporary local file is transferred to the DataNode. The client then
tells the NameNode that the file is closed. At this point, the
NameNode commits the file creation operation into a persistent store.
If the NameNode dies before the file is closed, the file is lost.
Replication Pipelining
When a client is writing data to an
HDFS file, its data is first written to a local file as explained in
the previous section. Suppose the HDFS file has a replication factor
of three. When the local file accumulates a full block of user data,
the client retrieves a list of DataNodes from the NameNode. This list
contains the DataNodes that will host a replica of that block. The
client then flushes the data block to the first DataNode. The first
DataNode starts receiving the data in small portions, writes each
portion to its local repository and transfers that portion to the
second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third DataNode.
Finally, the third DataNode writes the data to its local repository.
Thus, a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. In this way, the data is pipelined from one DataNode to the next.
Accessibility
Natively, HDFS provides a FileSystem
Java API
for applications to use.
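A minimal sketch of that API (the path and file contents are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemApiExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/user/example/hello.txt");   // hypothetical path

        // Write a small file.
        try (FSDataOutputStream out = fs.create(path)) {
          out.writeUTF("Hello, HDFS");
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path)) {
          System.out.println(in.readUTF());
        }
        fs.close();
      }
    }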
In addition, an HTTP browser can
also be used to browse the files of an HDFS instance.
FS Shell
HDFS allows user data to be
organized in the form of files and directories. It provides a command-line interface called FS shell that lets a user interact with
the data in HDFS.
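A few representative commands (the paths here are examples only; these follow the standard hadoop fs syntax):

    hadoop fs -mkdir /user/example/foodir
    hadoop fs -put localfile.txt /user/example/foodir
    hadoop fs -cat /user/example/foodir/localfile.txt
    hadoop fs -ls /user/example
    hadoop fs -rm /user/example/foodir/localfile.txt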
Browser Interface
A
typical HDFS install configures a web server to expose the HDFS
namespace through a configurable TCP port. This allows a user to
navigate the HDFS namespace and view the contents of its files using
a web browser.
File Deletes and Undeletes
When
a file is deleted by a user or an application, it is not immediately
removed from HDFS. Instead, HDFS first renames it to a file in the
/trash
directory. The file can be restored quickly as long as it remains in
/trash.
A file remains in /trash
for a configurable amount of time. After the expiry of its life in
/trash,
the NameNode deletes the file from the HDFS namespace. The deletion
of a file causes the blocks associated with the file to be freed.
Note that there could be an appreciable time delay between the time a
file is deleted by a user and the time of the corresponding increase
in free space in HDFS.
A
user can undelete a file after deleting it as long as it remains in
the /trash
directory. If a user wants to undelete a file that he/she has
deleted, he/she can navigate to the /trash
directory and retrieve the file. The /trash
directory contains only the latest copy of the file that was deleted.
The /trash
directory is just like any other directory with one special feature:
HDFS applies specified policies to automatically delete files from
this directory. The current default trash interval is set to 0 (files are deleted without being stored in trash). This value is a configurable parameter, fs.trash.interval, set in core-site.xml.
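For illustration, enabling trash with a 24-hour retention (1440 minutes; the value here is only an example, not a recommendation) would look like this in core-site.xml:

    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>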
Decrease Replication Factor
When the replication factor of a
file is reduced, the NameNode selects excess replicas that can be
deleted. The next Heartbeat transfers this information to the
DataNode. The DataNode then removes the corresponding blocks and the
corresponding free space appears in the cluster. Once again, there
might be a time delay between the completion of the setReplication
API call and the appearance of free space in the cluster.
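A minimal sketch of that setReplication call through the FileSystem Java API (the path and the new factor are assumptions for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DecreaseReplicationExample {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask for a replication factor of 2 on an existing (hypothetical) file;
        // the NameNode will later schedule removal of the excess replicas.
        fs.setReplication(new Path("/user/example/data.log"), (short) 2);
        fs.close();
      }
    }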