HBase vs Cassandra
Note: The entire content of this blog post is copied from the two sources below.
Please refer to the sources for more details.
Source: http://bigdatanoob.blogspot.in/2012/11/hbase-vs-cassandra.html
Source: http://www.javaworld.com/article/2140805/big-data/big-data-showdown-cassandra-vs-hbase.html
Similarities
- Both Cassandra and HBase are open source projects managed under the Apache Software Foundation.
- Both are available free under the Apache License, version 2.0.
- Cassandra descends from both Bigtable and Amazon's Dynamo.
- HBase describes itself as an "open source Bigtable implementation".
- Both Cassandra and HBase are NoSQL databases.
  - Generally, this means you cannot manipulate the database with SQL.
  - However, Cassandra has implemented CQL (Cassandra Query Language), whose syntax is clearly modeled after SQL.
- Both are designed to manage extremely large data sets (on the order of billions of rows).
  - For anything less, you are advised to stick with an RDBMS.
- Both are distributed databases, not only in how data is stored, but also in how the data can be accessed.
  - Clients can connect to any node in the cluster and access any data.
- Both claim near-linear scalability. Need to manage twice the data? Then double the number of nodes in your cluster.
- Both guard against data loss from cluster node failure via replication.
  - If the primary node fails, its data can still be fetched from one of the replica nodes.
- Both are referred to as column-oriented databases.
  - Unlike in a relational database, no two rows in a column-oriented database need have the same columns.
  - You can add columns to a row on the fly.
  - The number of columns per row is limited, but you are unlikely to hit the limit even if you add tens of thousands of columns.
- Both implement similar write paths that begin with logging write operations to a log file (the write-ahead log, WAL) to ensure durability.
  - Next, the data is written to a memory cache, and finally to disk via a large, sequential write (essentially a copy of the memory cache).
  - The overall memory-and-disk data structure used by both Cassandra and HBase is more or less a log-structured merge tree.
  - The disk component in Cassandra is the SSTable; in HBase it is the HFile.
- Both provide command-line shells (the HBase shell is implemented in JRuby; Cassandra's cqlsh is written in Python), and both are written largely in Java.
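The shared write path described above (log first, then memory cache, then one sequential flush to a sorted file) can be sketched in a few lines. This is a toy illustration only, not how either database is implemented: the class name `TinyLSM`, the JSON file layout, and the `memtable_limit` parameter are all assumptions made for the example.

```python
import json
import os
import tempfile


class TinyLSM:
    """Toy log-structured store: WAL -> memory cache -> sorted file on flush."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # in-memory cache of recent writes
        self.segments = []                    # flushed files, oldest first
        self.wal_path = os.path.join(directory, "wal.log")

    def put(self, key, value):
        # 1. Append the operation to the write-ahead log for durability.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([key, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Apply the write to the memory cache.
        self.memtable[key] = value
        # 3. Once the cache is full, write it out in one sequential pass.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        if not self.memtable:
            return
        name = f"segment-{len(self.segments)}.json"
        path = os.path.join(self.directory, name)
        with open(path, "w") as segment:
            # Entries are stored in key order, as in an SSTable or HFile.
            json.dump(sorted(self.memtable.items()), segment)
        self.segments.append(path)
        self.memtable = {}
        # The flushed data is on disk now, so the log can be truncated.
        open(self.wal_path, "w").close()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Check the newest segment first so that later writes win.
        for path in reversed(self.segments):
            with open(path) as segment:
                for stored_key, value in json.load(segment):
                    if stored_key == key:
                        return value
        return None


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = TinyLSM(directory, memtable_limit=2)
        store.put("row1", "a")
        store.put("row2", "b")  # fills the cache and triggers a flush
        store.put("row1", "c")  # newer value stays in memory
        return store.get("row1"), store.get("row2"), len(store.segments)
```

Reads consult the memory cache before the flushed files, which is why the newer value for `row1` is returned even though an older one sits on disk. Real systems also merge (compact) the flushed files in the background; that step is left out here.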
Differences:
1. Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for intra-cluster communication. Meanwhile, on HBase, you must press some nodes into serving as master nodes, whose job is to monitor and coordinate the actions of region servers.
Thus, Cassandra guarantees high availability by allowing multiple seed nodes in a cluster, while HBase guarantees the same via standby master nodes, one of which will become the new master should the current master fail.
2. Cassandra uses the Gossip protocol for internode communication, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper, an entirely separate distributed application, to handle the corresponding tasks.
3. Cassandra lets you create additional, secondary indexes on column values. HBase has no built-in secondary index option.
4. While the data manipulation commands of HBase are not as rich as CQL, HBase does have a "filter" capability that executes on the server side of a session and improves scanning (search) throughput.
5. HBase's reliance on Zookeeper -- a separate application -- introduces an additional point of failure (and the attendant difficulties troubleshooting the source of a problem) that Cassandra avoids.
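The contrast between points 3 and 4 can be illustrated with plain Python dictionaries. This is a conceptual sketch only and uses neither database's API: the `ROWS` data and both function names are hypothetical. A secondary index maps a column value straight to the matching row keys, while a server-side filter still visits every row but returns only the matches to the client.

```python
from collections import defaultdict

# Hypothetical table: row key -> {column name: value}.
ROWS = {
    "user1": {"city": "Pune", "age": 31},
    "user2": {"city": "Chennai", "age": 27},
    "user3": {"city": "Pune", "age": 45},
}


def build_secondary_index(rows, column):
    """Map each value of `column` to the row keys that hold it."""
    index = defaultdict(list)
    for row_key, columns in rows.items():
        if column in columns:
            index[columns[column]].append(row_key)
    return index


def filtered_scan(rows, predicate):
    """Visit every row in key order and keep only the rows that match.

    The predicate stands in for a filter evaluated on the server, so
    non-matching rows are never sent back to the client.
    """
    return [key for key in sorted(rows) if predicate(rows[key])]


city_index = build_secondary_index(ROWS, "city")
by_index = city_index["Pune"]
by_filter = filtered_scan(ROWS, lambda columns: columns.get("city") == "Pune")
```

Both approaches return the same rows here. The difference is cost: the index lookup touches only the matching entries, whereas the filtered scan reads the whole table and saves only network transfer.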
| Point | HBase | Cassandra |
| --- | --- | --- |
| CAP Theorem Focus | Consistency, Availability | Availability, Partition-Tolerance |
| Consistency | Strong | Eventual (Strong is Optional) |
| Single Write Master | Yes | No (R + W > N to get Strong Consistency) |
| Optimized For | Reads | Writes |
| Main Data Structure | CF, RowKey, Name Value Pair Set | CF, RowKey, Name Value Pair Set |
| Dynamic Columns | Yes | Yes |
| Column Names as Data | Yes | Yes |
| Static Columns | No | Yes |
| RowKey Slices | Yes | No |
| Static Column Value Indexes | No | Yes |
| Sorted Column Names | Yes | Yes |
| Cell Versioning Support | Yes | No |
| Bloom Filters | Yes | Yes (only on Key) |
| CoProcessors | Yes | No |
| Triggers | Yes (Part of Coprocessor) | No |
| Push Down Predicates | Yes (Part of Coprocessor) | No |
| Atomic Compare and Set | Yes | No |
| Explicit Row Locks | Yes | No |
| Row Key Caching | Yes | Yes |
| Partitioning Strategy | Ordered Partitioning | Random Partitioning recommended |
| Rebalancing | Automatic | Not Needed with Random Partitioning |
| Availability | N-Replicas across Nodes | N-Replicas across Nodes |
| Data Node Failure | Graceful Degradation | Graceful Degradation |
| Data Node Failure - Replication | N-Replicas Preserved | (N-1) Replicas Preserved + Hinted Handoff |
| Data Node Restoration | Same as Node Addition | Requires Node Repair Admin-action |
| Data Node Addition | Rebalancing Automatic | Rebalancing Requires Token-Assignment Adjustment |
| Data Node Management | Simple (Roll In, Roll Out) | Human Admin Action Required |
| Cluster Admin Nodes | Zookeeper, NameNode, HMaster | All Nodes are Equal |
| SPOF | Now, all the Admin Nodes are Fault Tolerant | All Nodes are Equal |
| Write.ANY | No, but Replicas are Node Agnostic | Yes (Writes Never Fail if this option is used) |
| Write.ONE | Standard, HA, Strong Consistency | Yes (often used), HA, Weak Consistency |
| Write.QUORUM | No (not required) | Yes (often used with Read.QUORUM for Strong Consistency) |
| Write.ALL | Yes (performance penalty) | Yes (performance penalty, not HA) |
| Asynchronous WAN Replication | Yes, but it needs testing on corner cases. | Yes (Replicas can span data centers) |
| Synchronous WAN Replication | No | Yes with Write.QUORUM or Write.EACH-QUORUM |
| Compression Support | Yes | Yes |
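The Write.QUORUM / Read.QUORUM rows rest on a simple counting argument: with N replicas, a read of R replicas is guaranteed to see the latest write acknowledged by W replicas whenever R + W > N, because the two sets must then share at least one replica. The sketch below checks that rule by brute force. The function names are made up for this illustration and are not part of any Cassandra API.

```python
from itertools import combinations


def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def is_strongly_consistent(n, r, w):
    """The R + W > N rule for overlapping read and write sets."""
    return r + w > n


def quorums_always_overlap(n, r, w):
    """Check every possible write set against every possible read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With N = 3, quorum reads plus quorum writes always overlap,
# while Read.ONE plus Write.ONE can miss the latest write.
n = 3
quorum_ok = quorums_always_overlap(n, quorum(n), quorum(n))
one_one_ok = quorums_always_overlap(n, 1, 1)
```

The brute-force check agrees with the formula for every combination of R and W, which is why pairing Write.QUORUM with Read.QUORUM gives strong consistency while Write.ONE with Read.ONE gives only eventual consistency.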
| Point | HBase | Cassandra |
| --- | --- | --- |
| | HBase is based on BigTable (Google). | Cassandra is based on Dynamo (Amazon). It was initially developed at Facebook by former Amazon engineers, which is one reason why Cassandra supports multiple data centers. Rackspace is a big contributor to Cassandra because of its multi-data-center support. |
| Infrastructure | HBase uses the Hadoop infrastructure (Zookeeper, NameNode, HDFS). Organizations that will deploy Hadoop anyway may be comfortable leveraging their Hadoop knowledge by using HBase. | Cassandra started and evolved separately from Hadoop, and its infrastructure and operational knowledge requirements are different from Hadoop's. However, for analytics, many Cassandra deployments use Cassandra + Storm (which uses Zookeeper) and/or Cassandra + Hadoop. |
| Infrastructure Simplicity and SPOF | The HBase-Hadoop infrastructure has several "moving parts": Zookeeper, NameNode, HBase Master, and Data Nodes. Zookeeper is clustered and naturally fault tolerant. The NameNode needs to be clustered to be fault tolerant. | Cassandra uses a single node type. All nodes are equal and perform all functions. Any node can act as a coordinator, ensuring no SPOF. Adding Storm or Hadoop, of course, adds complexity to the infrastructure. |
| Read Intensive Use Cases | HBase is optimized for reads, supported by its single-write master and the resulting strict consistency model, as well as its use of Ordered Partitioning, which supports row scans. HBase is well suited for range-based scans. | Cassandra has excellent single-row read performance as long as eventual consistency semantics are sufficient for the use case. Cassandra quorum reads, which are required for strict consistency, will naturally be slower than HBase reads. Cassandra does not support range-based row scans, which may be limiting in certain use cases. Cassandra is well suited for single-row queries, or for selecting multiple rows based on a column-value index. |
| Multi-Data Center Support and Disaster Recovery | HBase provides for asynchronous replication of an HBase cluster across a WAN. HBase clusters cannot be set up to achieve zero RPO, but in steady state HBase should be roughly failover-equivalent to any other DBMS that relies on asynchronous replication over a WAN. Fall-back processes and procedures (e.g. after failover) are TBD. | Cassandra Random Partitioning provides for replication of a single row across a WAN, either asynchronous (write.ONE, write.LOCAL_QUORUM) or synchronous (write.QUORUM, write.ALL). Cassandra clusters can therefore be set up to achieve zero RPO, but each write will require at least one WAN ACK back to the coordinator to achieve this capability. |
| Write.ONE Durability | Writes are replicated in a pipeline fashion: the first data node for the region persists the write and then sends it to the next Natural Endpoint, and so on down the pipeline. HBase's commit log "acks" a write only after *all* of the nodes in the pipeline have written the data to their OS buffers. The first Region Server in the pipeline must also have persisted the write to its WAL. | Cassandra's coordinators send parallel write requests to all Natural Endpoints. The coordinator will "ack" the write after exactly one Natural Endpoint has "acked" it, which means that node has also persisted the write to its WAL. The write may or may not have been committed to any other Natural Endpoint. |
| Ordered Partitioning | HBase only supports Ordered Partitioning. This means that rows for a CF are stored in RowKey order in HFiles, where each HFile contains a "block" or "shard" of all the rows in a CF. HFiles are distributed across all data nodes in the cluster. | Cassandra officially supports Ordered Partitioning, but no production user of Cassandra uses Ordered Partitioning because of the "hot spots" it creates and the operational difficulties such hot spots cause. Random Partitioning is the only recommended Cassandra partitioning scheme, and rows are distributed across all nodes in the cluster. |
| RowKey Range Scans | Because of ordered partitioning, HBase queries can be formulated with partial start and end row keys, and can locate rows inclusive or exclusive of these partial row keys. The start and end row keys in a range scan need not even exist in HBase. | Because of random partitioning, partial row keys cannot be used with Cassandra. RowKeys must be known exactly. Counting rows in a CF is complicated. It is highly recommended that, for these types of use cases, data be stored in columns in Cassandra, not in rows. |
| Linear Scalability for large tables and range scans | Due to Ordered Partitioning, HBase will easily scale horizontally while still supporting RowKey range scans. | If data is stored in columns in Cassandra to support range scans, the practical limit on row size in Cassandra is tens of megabytes. Rows larger than that cause problems with compaction overhead and time. |
| Atomic Compare and Set | HBase supports Atomic Compare and Set. HBase supports transactions within a row. | Cassandra does not support Atomic Compare and Set. Counters require dedicated counter column families which, because of eventual consistency, require that all replicas in all Natural Endpoints be read and updated with an ACK. However, hinted-handoff mechanisms can make even these built-in counters suspect for accuracy. FIFO queues are difficult (if not impossible) to implement with Cassandra. |
| Read Load Balancing - single Row | HBase does not support Read Load Balancing against a single row. A single row is served by exactly one region server at a time. Other replicas are used only in case of a node failure. Scalability is primarily supported by partitioning, which statistically distributes reads of different rows across multiple data nodes. | Cassandra supports Read Load Balancing against a single row. However, this is primarily supported by Read.ONE, and eventual consistency must be taken into consideration. Scalability is primarily supported by partitioning, which distributes reads of different rows across multiple data nodes. |
| Bloom Filters | Bloom Filters can be used in HBase as another form of indexing. They work on the basis of RowKey or RowKey+ColumnName to reduce the number of data blocks that HBase has to read to satisfy a query. Bloom Filters may exhibit false positives (reading too much data), but never false negatives (reading not enough data). | Cassandra uses bloom filters for key lookup. |
| Triggers | Triggers are supported by the CoProcessor capability in HBase. They allow HBase to observe the get/put/delete events on a table (CF) and then execute the trigger logic. Triggers are coded as Java classes. | Cassandra does not support co-processor-like functionality (as far as we know). |
| Secondary Indexes | HBase does not natively support secondary indexes, but one use case of Triggers is that a trigger on a "put" can automatically keep a secondary index up to date, and therefore not put the burden on the application (client). | Cassandra supports secondary indexes on column families where the column name is known (not on dynamic columns). |
| Simple Aggregation | HBase CoProcessors support simple aggregations out of the box: SUM, MIN, MAX, AVG, STD. Other aggregations can be built by defining Java classes to perform the aggregation. | Aggregations in Cassandra are not supported by the Cassandra nodes; the client must provide aggregations. When the aggregation requirement spans multiple rows, Random Partitioning makes aggregations very difficult for the client. The recommendation is to use Storm or Hadoop for aggregations. |
| HIVE Integration | HIVE can access HBase tables directly (uses de-serialization under the hood that is aware of the HBase file format). | Work in Progress (https://issues.apache.org/jira/browse/CASSANDRA-4131) |
| PIG Integration | PIG has native support for writing into/reading from HBase. | Cassandra 0.7.4+ |
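The difference between ordered and random partitioning for RowKey range scans can be seen with a sorted list and a hash function. This is only a sketch of the idea: the row keys are invented, `range_scan` and `hash_partition` are hypothetical helpers, and MD5 is used here simply as a stable hash.

```python
import bisect
import hashlib

# Hypothetical row keys, kept in sorted order as ordered partitioning would.
ROW_KEYS = sorted([
    "user#0001", "user#0002", "user#0107", "video#0001", "video#0002",
])


def range_scan(sorted_keys, start, stop):
    """Return the keys in [start, stop).

    Neither boundary has to be an existing key; partial keys work
    because only the sort order matters.
    """
    low = bisect.bisect_left(sorted_keys, start)
    high = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[low:high]


def hash_partition(key, node_count):
    """Pick a node from a hash of the key, ignoring the key's sort order."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


# Ordered: one contiguous slice answers "all keys starting with user#".
users = range_scan(ROW_KEYS, "user#", "user$")

# Random: adjacent keys are assigned to nodes independently, so the
# same query would have to ask every node.
placement = {key: hash_partition(key, 4) for key in ROW_KEYS}
```

With sorted keys the scan is two binary searches and a slice. With hashed placement, neighbouring keys carry no locality, which is why the Cassandra column above recommends knowing RowKeys exactly or modelling the range inside a single row's columns.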