jump to navigation

Apache Cassandra Quick tour March 22, 2010

Posted by Terry.Cho in Distributed System.
Tags: , , , , ,

Cassandra is distributed database system. It is donated to Apache open source group by Facebook at 2008.The Cassandra is based on Google Big Table data model and Facebook Dynamo distributed architecture. It doesn’t use SQL and optimized to high scale size of data & transaction handling. Even though Cassandra is implemented with Java language, other language can use the Cassandra as a client. (It supports Ruby,Perl,Python,Scala,PHP etc).

It is used to High Scale Size SNS like Face book,Digg,Twitter etc. It doesn’t support complex relationship like Foreign Key. It just provides Key & Value relationship like Java Hashmap. It is very easy to install and use.

Let’s look at data model of Cassandra

Data Model

Cassandra is based on google big table data model. It is called “Column DB”. It is totally different from traditional RDBMS.


Column data structure which consists of column name and column value.

{name: emailAddress, value:cassandra@apache.org}
{name:age , value:20}

Column Family

Column family is set of columns. It is similar to row in RDBMS table. I will explain more detail about difference between Column Family and row in RDBMS later. Column Family has a key which identify each row in data set. Each row has a number of Columns.

For example, one row is

Cassandra = { emailAddress:casandra@apache.org , age:20}

“Cassandra” is key for the row, and the row has two columns. Keys of the columns are “emailAddress” and “age”. Each column value is “casandra@apache.org” and “20”.

Let’s look at Column Family which has a number of rows.

Cassandra={ emailAddress:”casandra@apache.org” , age:”20”}
TerryCho= { emailAddress:”terry.cho@apache.org” , gender:”male”}
Cath= { emailAddress:”cath@apache.org” , age:”20”,gender:”female”,address:”Seoul”}

One of interest thing is each row can have different scheme. Cassandra row has “emailAddress” ,”age” column. TerryCho row has “emailAddress”,”gender” column. This characteristic is called as “Schemeless” (Data structure of each row in column family can be different)


Keyspace is logical set of column family for management perspective. It doesn’t impact data structure.

Super Column & Super Column Family

As I mentioned earlier, column value can have a column itself. (Similar to Java Hashtable can have ValueObject class as a ‘Object’ type)

value: firstname{name:”firstname”,value=”Terry”}
value: lastname{name:”lastname”,value=”Cho”}

As a same way column family also can have column family like this


UserList column family has two rows with key “Cath” and “Terry”. Each of the “Carry” and “Terry” row  has two column families – “Cath” row has “username” and “address’ column family, “Terry” row has “username” and “account” column family.

Cassandra Quick Test

Download Cassandra from http://incubator.apache.org/cassandra/ Extract zip file and run bin/cassandra.bat

We will connect Cassandra node with CLI interface. It is located in /bin/cassandra-cli.bat

The default TCP port number is 9160. You can change the port number in “conf/storage-conf.xml”

In “/conf/storage-conf.xml” file, default key space with name “Keyspace1” is defined. Column family type of the Keyspace is like this

Let’s put a new row with key name “Terry” which has Column (key=”gender”, value=”Male”)


Replication architecture in Cassandra and HBase March 19, 2010

Posted by Terry.Cho in Distributed System.
Tags: , , , ,
Now i’m research distributed database architecture.
I found a very interesting article.
Apache Cassandra and Hadoop Hbase are most popular distributed database.
Twitter and Facebook are using Cassandra.
These solution is started from Google Big Table. So the data model is very similar.
The data model is called “Column database”. I will introduce the model later.
However my concern is how to replicate data across region (data centers in different region)
Here is very interesting information.
In case of Cassandra, it replicates data in every transaction. A coordinator captures changes and propagate it to other nodes.
But fiber based low latency network is required and there are no reference yet.
HBase data replication architecture looks very practical.
It captures change log and put it into replication queue. The replication message is passed to other nodes.
This mechanism is very similar to CDC (Change Data Capture).
Oracle Goden Gate, Quest Share Flex, MySQL geo replication are using this mechanism.

HBase replication looks more reasonable. It has common architecture and they have a reference.


After i had written this article, i got a feed back. Followed by the comment the article which i referenced is written by  fan of Hbase. Cassandra supports geo replication and has reference in face book. And Digg will deploy Cassandra in different data center.

But as i know even if facebook has two data center, they have fiber-link between the center. It is not a real geo replication. I will more research about cassandra data replication feature and re-post about this issue later.