Apache Cassandra Quick tour March 22, 2010
Posted by Terry.Cho in Distributed System. Tags: Apache Cassandra, Distributed database, Intro, NoSQL, overview, Tutorial
Cassandra is a distributed database system. Facebook released it as open source in 2008, and it later became an Apache project. Cassandra combines the data model of Google Bigtable with the distributed architecture of Amazon Dynamo. It does not use SQL and is optimized for handling very large volumes of data and transactions. Although Cassandra is implemented in Java, clients can be written in other languages (Ruby, Perl, Python, Scala, PHP, etc.).
It is used by large-scale social networking services such as Facebook, Digg, and Twitter. It does not support complex relationships such as foreign keys; it provides only key-to-value lookups, much like a Java HashMap. It is very easy to install and use.
Let’s look at the data model of Cassandra.
Data Model
Cassandra is based on the Google Bigtable data model, which is often called a “column DB”. It is totally different from a traditional RDBMS.
Column
A column is a data structure that consists of a column name and a column value.
{name: "emailAddress", value: "cassandra@apache.org"}
{name: "age", value: "20"}
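As a rough illustration only, a column can be modelled in Python as a small immutable name/value pair. The `Column` class below is a hypothetical stand-in written for this post, not part of any Cassandra client library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical model of a column: one name paired with one value."""
    name: str
    value: str


# The two columns from the example above.
email = Column(name="emailAddress", value="cassandra@apache.org")
age = Column(name="age", value="20")

for column in (email, age):
    print(column.name, "->", column.value)
```

Values are kept as strings here simply because the example above writes them that way.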
Column Family
A column family is a set of rows, and each row is a set of columns. It plays a role similar to a table in an RDBMS, although the two differ in ways I will explain in more detail later. Each row in a column family is identified by a key, and each row holds a number of columns.
For example, one row is:
Cassandra = {emailAddress: "cassandra@apache.org", age: "20"}
“Cassandra” is the key of the row, and the row has two columns. The column names are “emailAddress” and “age”, and the column values are “cassandra@apache.org” and “20”.
Let’s look at a column family that has a number of rows.
UserProfile = {
  Cassandra = {emailAddress: "cassandra@apache.org", age: "20"}
  TerryCho = {emailAddress: "terry.cho@apache.org", gender: "male"}
  Cath = {emailAddress: "cath@apache.org", age: "20", gender: "female", address: "Seoul"}
}
One interesting point is that each row can have a different schema. The Cassandra row has “emailAddress” and “age” columns, while the TerryCho row has “emailAddress” and “gender” columns. This characteristic is called “schemaless”: the data structure of each row in a column family can be different.
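The schemaless behaviour can be sketched with plain Python dictionaries: the column family is a dict of rows, and each row is a dict of columns. This is only an in-memory analogy under that assumption, and the helper `column_names` is a hypothetical name introduced for the example.

```python
# In-memory stand-in for the UserProfile column family (not a real client call).
user_profile = {
    "Cassandra": {"emailAddress": "cassandra@apache.org", "age": "20"},
    "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
    "Cath": {
        "emailAddress": "cath@apache.org",
        "age": "20",
        "gender": "female",
        "address": "Seoul",
    },
}


def column_names(column_family, row_key):
    """Return the sorted column names stored under one row key."""
    return sorted(column_family[row_key])


# Every row carries its own set of columns.
for row_key in user_profile:
    print(row_key, column_names(user_profile, row_key))

# A column that a row never stored is simply absent, not NULL-padded.
print(user_profile["TerryCho"].get("age"))
```

Nothing forces the three rows to agree on their columns, which is the property described above.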
KeySpace
A keyspace is a logical grouping of column families for management purposes. It doesn’t affect the data structure.
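One way to picture this is as one extra level of naming on top of the column family. In the sketch below, the keyspace name "MyKeyspace" and the function `get_column` are hypothetical, chosen only to show that a value is addressed by keyspace, column family, row key, and column name.

```python
# Hypothetical nesting: keyspace -> column family -> row key -> column name.
keyspaces = {
    "MyKeyspace": {
        "UserProfile": {
            "TerryCho": {"emailAddress": "terry.cho@apache.org", "gender": "male"},
        },
    },
}


def get_column(store, keyspace, column_family, row_key, column_name):
    """Walk the four naming levels and return the stored column value."""
    return store[keyspace][column_family][row_key][column_name]


print(get_column(keyspaces, "MyKeyspace", "UserProfile", "TerryCho", "gender"))
```

The keyspace only adds a name in front of the column family; the rows and columns underneath keep the same shape.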
Super Column & Super Column Family
A column value can itself contain columns; such a column is called a super column. (This is similar to a Java Hashtable whose values are of type ‘Object’ and can therefore hold a value object.)
{name: "username",
  value: {
    firstname: {name: "firstname", value: "Terry"},
    lastname: {name: "lastname", value: "Cho"}
  }
}
In the same way, the rows of a column family can hold super columns instead of plain columns. Such a column family is called a super column family:
UserList = {
  Cath: {
    username: {firstname: "Cath", lastname: "Yoon"}
    address: {city: "Seoul", postcode: "1234"}
  }
  Terry: {
    username: {firstname: "Terry", lastname: "Cho"}
    account: {bank: "hana", accounted: "1234"}
  }
}
The UserList column family has two rows, with keys “Cath” and “Terry”. Each row has two super columns: the “Cath” row has “username” and “address”, and the “Terry” row has “username” and “account”.
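The extra level of nesting can be sketched in Python as a dict of dicts of dicts. Again this is only an in-memory analogy; `get_subcolumn` is a hypothetical helper, and the data simply mirrors the UserList example above.

```python
# In-memory stand-in for the UserList super column family.
user_list = {
    "Cath": {
        "username": {"firstname": "Cath", "lastname": "Yoon"},
        "address": {"city": "Seoul", "postcode": "1234"},
    },
    "Terry": {
        "username": {"firstname": "Terry", "lastname": "Cho"},
        "account": {"bank": "hana", "accounted": "1234"},
    },
}


def get_subcolumn(super_column_family, row_key, super_column, column):
    """Look up one value: row key, then super column, then column name."""
    return super_column_family[row_key][super_column][column]


print(get_subcolumn(user_list, "Cath", "address", "city"))
print(get_subcolumn(user_list, "Terry", "username", "lastname"))

# Rows may hold different super columns, just as rows may hold different columns.
print(sorted(user_list["Cath"]), sorted(user_list["Terry"]))
```

Compared with a plain column family, a lookup needs one more name (the super column) before it reaches a value.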
Cassandra Quick Test
Download Cassandra from http://incubator.apache.org/cassandra/ then extract the zip file and run bin/cassandra.bat
We will connect to the Cassandra node with the CLI, which is located at bin/cassandra-cli.bat
The default TCP port number is 9160. You can change the port number in “conf/storage-conf.xml”.
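Before starting the CLI, it can be handy to check that something is actually listening on that port. The snippet below is a generic TCP reachability check using only the Python standard library; `is_port_open` is a hypothetical helper, and the host "localhost" is an assumption about where the node runs. It does not speak the Cassandra protocol, it only tests whether the port accepts a connection.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# 9160 is the default port mentioned above; adjust it if you changed the config.
print("port 9160 reachable:", is_port_open("localhost", 9160))
```

If this prints False right after starting the node, the server may still be starting up or may be configured with a different port.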
In the “conf/storage-conf.xml” file, a default keyspace named “Keyspace1” is defined. The column families of this keyspace are defined like this
Let’s put a new row with the key “Terry” that has one column (name = “gender”, value = “Male”)