Apache Cassandra Quick Tour

March 22, 2010

Posted by Terry.Cho in Distributed System.

Cassandra is a distributed database system. Facebook released it as open source in 2008, and it later became an Apache project. Cassandra combines the data model of Google Bigtable with the distributed architecture of Amazon Dynamo. It does not use SQL and is optimized for handling very large volumes of data and transactions. Although Cassandra is implemented in Java, clients can be written in other languages (Ruby, Perl, Python, Scala, PHP, etc.).

It is used by large-scale social networking services such as Facebook, Digg, and Twitter. It does not support complex relationships such as foreign keys; it provides only a key/value relationship, much like a Java HashMap. It is very easy to install and use.

Let’s look at the data model of Cassandra.

Data Model

Cassandra’s data model is based on Google Bigtable. This kind of store is called a “column DB”, and it is quite different from a traditional RDBMS.

Column

A column is a data structure that consists of a column name and a column value.

{name: emailAddress, value: cassandra@apache.org}
{name: age, value: 20}
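As a rough illustration, a column can be modelled as a small name/value record. The sketch below uses Python, and the `Column` class is a hypothetical stand-in for illustration only, not part of any Cassandra client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical model of a column: a name paired with a value."""
    name: str
    value: str


# The two example columns shown above.
email = Column(name="emailAddress", value="cassandra@apache.org")
age = Column(name="age", value="20")

print(email.name, "=", email.value)
print(age.name, "=", age.value)
```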

Column Family

A column family is a set of rows, and each row holds a number of columns. It is roughly comparable to a table in an RDBMS, although the two differ in important ways that I will explain in more detail later. Each row in a column family is identified by a key.

For example, here is one row:

Cassandra = {emailAddress: cassandra@apache.org, age: 20}

“Cassandra” is the key of the row, and the row has two columns. The column names are “emailAddress” and “age”, and the column values are “cassandra@apache.org” and “20”.

Let’s look at a column family that has a number of rows.

UserProfile = {
  Cassandra = {emailAddress: "cassandra@apache.org", age: "20"}
  TerryCho = {emailAddress: "terry.cho@apache.org", gender: "male"}
  Cath = {emailAddress: "cath@apache.org", age: "20", gender: "female", address: "Seoul"}
}

One interesting point is that each row can have a different schema. The “Cassandra” row has “emailAddress” and “age” columns, while the “TerryCho” row has “emailAddress” and “gender” columns. This characteristic is called “schemaless”: the data structure of each row in a column family can be different.
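The UserProfile example can be sketched in Python as a dictionary of dictionaries. This is only an in-memory model under that assumption, not how Cassandra stores data, and the helper `column_names` is a hypothetical name introduced for illustration.

```python
# Hypothetical in-memory model: row key -> {column name -> column value}.
user_profile = {
    "Cassandra": {"emailAddress": "cassandra@apache.org", "age": "20"},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {
        "emailAddress": "cath@apache.org",
        "age": "20",
        "gender": "female",
        "address": "Seoul",
    },
}


def column_names(column_family, row_key):
    """Return the sorted column names stored in one row."""
    return sorted(column_family[row_key])


# Each row carries its own set of columns; nothing forces them to match.
for key in user_profile:
    print(key, column_names(user_profile, key))
```

Because each row is its own mapping, adding a column to one row does not affect any other row.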

KeySpace

A keyspace is a logical grouping of column families for management purposes. It does not affect the data structure.

Super Column & Super Column Family

A column value can itself contain columns; such a column is called a super column. (This is similar to a Java Hashtable holding a ValueObject class as a value of type ‘Object’.)

{name: "username",
  value: {
    firstname: {name: "firstname", value: "Terry"},
    lastname: {name: "lastname", value: "Cho"}
  }
}

In the same way, a column family can hold super columns; it is then called a super column family:

UserList = {
  Cath: {
    username: {firstname: "Cath", lastname: "Yoon"}
    address: {city: "Seoul", postcode: "1234"}
  }
  Terry: {
    username: {firstname: "Terry", lastname: "Cho"}
    account: {bank: "hana", accounted: "1234"}
  }
}

The UserList column family has two rows, with the keys “Cath” and “Terry”. Each of the “Cath” and “Terry” rows has two super columns: the “Cath” row has “username” and “address”, and the “Terry” row has “username” and “account”.
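The extra level of nesting can be sketched in Python as a three-level dictionary. Again, this is only an in-memory model for illustration; `get_subcolumn` is a hypothetical helper, not a Cassandra API.

```python
# Hypothetical model: row key -> super column name -> {column name -> value}.
user_list = {
    "Cath": {
        "username": {"firstname": "Cath", "lastname": "Yoon"},
        "address": {"city": "Seoul", "postcode": "1234"},
    },
    "Terry": {
        "username": {"firstname": "Terry", "lastname": "Cho"},
        "account": {"bank": "hana", "accounted": "1234"},
    },
}


def get_subcolumn(super_cf, row_key, super_column, column):
    """Look up one value: row key -> super column name -> column name."""
    return super_cf[row_key][super_column][column]


print(get_subcolumn(user_list, "Terry", "username", "lastname"))
print(sorted(user_list["Cath"]))
```

Compared with a plain column family, every lookup needs one more name: the super column that groups the columns.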

Cassandra Quick Test

Download Cassandra from http://incubator.apache.org/cassandra/ and extract the zip file. Then run bin/cassandra.bat

We will connect to the Cassandra node with the CLI, which is located at bin/cassandra-cli.bat

The default TCP port number is 9160. You can change the port number in “conf/storage-conf.xml”

The “conf/storage-conf.xml” file defines a default keyspace named “Keyspace1”. The column family type of that keyspace is like this

Let’s put a new row with the key “Terry” that has one column (key=“gender”, value=“Male”).
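To make the shape of that operation concrete, here is a minimal in-memory sketch in Python. It does not talk to a Cassandra node and is not the CLI or client API; `put`, `get`, and the column family name `ExampleColumnFamily` are hypothetical names chosen for illustration. Only “Keyspace1”, “Terry”, “gender”, and “Male” come from the text above.

```python
from collections import defaultdict

# Hypothetical stand-in for a node:
# keyspace -> column family -> row key -> {column name -> value}
store = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(keyspace, column_family, row_key, column, value):
    """Write one column into a row, creating the row if it is missing."""
    store[keyspace][column_family][row_key][column] = value


def get(keyspace, column_family, row_key):
    """Return a copy of all columns stored under one row key."""
    return dict(store[keyspace][column_family][row_key])


# "ExampleColumnFamily" is an assumed name, not taken from storage-conf.xml.
put("Keyspace1", "ExampleColumnFamily", "Terry", "gender", "Male")
print(get("Keyspace1", "ExampleColumnFamily", "Terry"))
```

The point of the sketch is the addressing: a write names the keyspace, the column family, the row key, and the column before it supplies the value.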