Replication architecture in Cassandra and HBase March 19, 2010
Posted by Terry.Cho in Distributed System.Tags: Cassandra, Distributed System, geo replication, Hadoop, HBase
I am currently researching distributed database architecture.
I found a very interesting article:
http://www.roadtofailure.com/2009/10/29/hbase-vs-cassandra-nosql-battle/comment-page-1/
Apache Cassandra and Hadoop HBase are among the most popular distributed databases.
Twitter and Facebook are using Cassandra.
Both solutions started from Google Bigtable, so their data models are very similar.
The data model is called a “column database”. I will introduce the model later.
However, my concern is how to replicate data across regions (data centers in different regions).
Here is some very interesting information.
In the case of Cassandra, data is replicated on every transaction. A coordinator captures the changes and propagates them to the other nodes.
But a fiber-based, low-latency network is required, and there are no reference deployments yet.
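To make the idea concrete, here is a minimal Python sketch of replication driven by a coordinator on every write. It is only a toy model under my own assumptions: the `Replica` and `Coordinator` classes and the node names are hypothetical and are not Cassandra's real API, and a plain method call stands in for the network.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical storage node that keeps its own copy of the data."""
    name: str
    data: dict = field(default_factory=dict)

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledge the write


class Coordinator:
    """Hypothetical coordinator that forwards each write to every replica."""

    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # The change is propagated as part of the write itself, so in a real
        # system the caller would wait on the network round trips here.
        return sum(1 for replica in self.replicas if replica.apply(key, value))


replicas = [Replica("dc1-node"), Replica("dc2-node")]
coordinator = Coordinator(replicas)
acks = coordinator.write("user:1", "Terry")
```

Because every write touches the remote node directly, the latency of the link between data centers shows up in the write path. That is why the network quality matters in this model.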
The HBase data replication architecture looks very practical.
It captures the change log and puts it into a replication queue. The replication messages are then passed to the other nodes.
This mechanism is very similar to CDC (Change Data Capture).
Oracle GoldenGate, Quest SharePlex, and MySQL geo replication use this mechanism.
HBase replication looks more reasonable. It has a common architecture and it has a reference.
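The change-log approach can be sketched in a few lines of Python as well. Again this is only an illustration under my own assumptions: `SourceCluster`, `SinkCluster`, and `ship` are hypothetical names, not HBase classes, and an in-memory `deque` stands in for the replication queue.

```python
from collections import deque


class SourceCluster:
    """Hypothetical primary cluster that logs every change it accepts."""

    def __init__(self):
        self.data = {}
        self.replication_queue = deque()

    def write(self, key, value):
        # The write completes locally; the change is only queued for later.
        self.data[key] = value
        self.replication_queue.append((key, value))


class SinkCluster:
    """Hypothetical remote cluster that replays shipped changes."""

    def __init__(self):
        self.data = {}

    def apply(self, entry):
        key, value = entry
        self.data[key] = value


def ship(source, sink):
    """Drain the queue in order and replay each change on the sink."""
    shipped = 0
    while source.replication_queue:
        sink.apply(source.replication_queue.popleft())
        shipped += 1
    return shipped


source, sink = SourceCluster(), SinkCluster()
source.write("user:1", "Terry")
source.write("user:2", "Cho")
before_ship = dict(sink.data)   # the sink has not seen the writes yet
shipped = ship(source, sink)    # now the queued changes are replayed
```

In this sketch the write returns before the remote side sees anything, so a slow link delays replication but not the write itself. The cost is that the remote copy lags behind until the queue is drained.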
===
After I wrote this article, I got some feedback. According to the comment, the article I referenced was written by a fan of HBase. Cassandra supports geo replication and has a reference at Facebook, and Digg will deploy Cassandra across different data centers.
But as far as I know, even though Facebook has two data centers, they have a fiber link between them, so it is not real geo replication. I will research Cassandra's data replication features further and post about this issue again later.


Hi Terry,
The “nosql battle” post you cite is written by an HBase fanboy and it’s unfortunately basically an anti-Cassandra FUD piece. Most of what it says about Cassandra, and some of what it says about HBase, is completely wrong.
Cassandra replication works very well in realtime across normal WAN links; Facebook’s largest Cassandra cluster spans East and West coast data centers, and Digg is deploying to 2 DCs soon.
Section 5.2 of the Cassandra whitepaper covers how this works in more detail: http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
Thank you very much for the information.
I will review the white paper.