Monday, November 8, 2010

Apache Cassandra with Hector - An Example

Recently I attended the Strange Loop conference in Saint Louis. While there I indulged in two things primarily: booze with old buddies, and NoSQL at the conference.

In particular, I found a lot of mention of Apache Cassandra. Why would one care about Cassandra? How about a 150 TB cluster spanning over 150 machines at Facebook? Cassandra is used by organizations such as Digg and Twitter that deal with large amounts of data. I could attempt to write more on Cassandra, but there is a great presentation by Eric Evans on the subject.

If I'm not talking about Cassandra itself, what am I talking about? Well, I wanted to use Cassandra to get a grasp of how Columns and Super Columns work. Yeah, I hear it: WTF are Super Columns? I found myself asking the same question at the conference, but luckily for me I found a nice blog by Arin Sarkissian, aptly titled "WTF is a SuperColumn?", explaining the same. I wanted to translate his example schema of a blog application into a working example that uses Cassandra and provides a playground for someone like me wanting to try Cassandra.

So I am only going to use Java; sorry, no Ruby or Scala for me right now. There is a Thrift Java client for Cassandra, but it is limited in functionality, so I proceeded to use Hector.

The model created was based on Arin's schema with a few enhancements. I have updated the Author schema to contain a user name and password, with the user name being the "row key" for the column family Authors.
    ColumnFamily: Authors
    We'll store all the author data here.

    Row Key = Author user name
    Column Name: an attribute of the author (name, twitterId, etc.)
    Column Value: value of the associated attribute

    Access: get author by userName (aka grab all columns from a specific Row)

    Authors : { // CF
        sacharya : { // row key
            // and the columns as "profile" attributes
            name: "Sanjay Acharya",
            twitterId: "sterling23",
            biography: "bla bla bla"
        },
        // and the other authors
        dduck : {
            ...
        }
    }

The corresponding column family definition:

<ColumnFamily CompareWith="BytesType" Name="Authors"/>
The above Column Family translated to a simple Author POJO as shown below:
public class Author {
  private String userName;
  private String password;
  private String name;
  private String twitterId;
  private String biography;
  // ... Getters and Setters
}
Using Hector directly, a DAO to create an author might look like:
public void create(Author author) {
    Mutator<String> mutator = HFactory.createMutator(keySpace, StringSerializer.get());
    String userName = author.getUserName();

    mutator.addInsertion(userName, COLUMN_FAMILY_NAME,
            HFactory.createColumn("password", author.getPassword(), StringSerializer.get(),
                StringSerializer.get()))
        .addInsertion(userName, COLUMN_FAMILY_NAME,
            HFactory.createColumn("name", author.getName(), StringSerializer.get(),
                StringSerializer.get()))
        .addInsertion(userName, COLUMN_FAMILY_NAME,
            HFactory.createColumn("biography", author.getBiography(), StringSerializer.get(),
                StringSerializer.get()))
        .addInsertion(userName, COLUMN_FAMILY_NAME,
            HFactory.createColumn("twitterId", author.getTwitterId(), StringSerializer.get(),
                StringSerializer.get()));

    mutator.execute();
}
The above code felt rather verbose, so with a small compromise (column names must match the attribute names of the POJO, and the POJO must have a default constructor), I present an AbstractColumnFamilyDao that a concrete DAO such as an AuthorDao would extend:
public abstract class AbstractColumnFamilyDao<KeyType, T> {
  private final Class<T> persistentClass;
  private final Class<KeyType> keyTypeClass;
  protected final Keyspace keySpace;
  private final String columnFamilyName;
  private final String[] allColumnNames;

  public AbstractColumnFamilyDao(Keyspace keyspace, Class<KeyType> keyTypeClass, Class<T> persistentClass,
      String columnFamilyName) {
    this.keySpace = keyspace;
    this.keyTypeClass = keyTypeClass;
    this.persistentClass = persistentClass;
    this.columnFamilyName = columnFamilyName;
    this.allColumnNames = DaoHelper.getAllColumnNames(persistentClass);
  }

  public void save(KeyType key, T model) {
    Mutator<Object> mutator = HFactory.createMutator(keySpace, SerializerTypeInferer.getSerializer(keyTypeClass));
    for (HColumn<?, ?> column : DaoHelper.getColumns(model)) {
      mutator.addInsertion(key, columnFamilyName, column);
    }
    mutator.execute();
  }

  public T find(KeyType key) {
    SliceQuery<Object, String, byte[]> query = HFactory.createSliceQuery(keySpace,
      SerializerTypeInferer.getSerializer(keyTypeClass), StringSerializer.get(), BytesSerializer.get());

    QueryResult<ColumnSlice<String, byte[]>> result = query.setColumnFamily(columnFamilyName)
        .setKey(key).setColumnNames(allColumnNames).execute();

    if (result.get().getColumns().size() == 0) {
      return null;
    }

    try {
      T t = persistentClass.newInstance();
      DaoHelper.populateEntity(t, result);
      return t;
    } catch (Exception e) {
      throw new RuntimeException("Error creating persistent class", e);
    }
  }

  public void delete(KeyType key) {
    Mutator<Object> mutator = HFactory.createMutator(keySpace, SerializerTypeInferer.getSerializer(keyTypeClass));
    mutator.delete(key, columnFamilyName, null, SerializerTypeInferer.getSerializer(keyTypeClass));
  }
}
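The DaoHelper used above is not shown in this post. As a rough illustration of the compromise it relies on (one column per POJO field, named after the field), here is a hypothetical sketch of its column-name discovery using plain reflection; the real helper also builds HColumns from getters and populates entities via setters, which is omitted here:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public final class DaoHelper {

  // Each declared field of the POJO maps to a column of the same name.
  public static String[] getAllColumnNames(Class<?> persistentClass) {
    List<String> names = new ArrayList<String>();
    for (Field field : persistentClass.getDeclaredFields()) {
      names.add(field.getName());
    }
    return names.toArray(new String[names.size()]);
  }

  // Minimal stand-in POJO, only used to exercise the helper.
  static class Author {
    private String userName;
    private String password;
    private String name;
    private String twitterId;
    private String biography;
  }

  public static void main(String[] args) {
    // Prints the column names derived from the Author fields.
    System.out.println(java.util.Arrays.toString(getAllColumnNames(Author.class)));
  }
}
```

Note that getDeclaredFields() does not guarantee any particular ordering, which is fine here since columns are addressed by name, not position.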
One might ask, why not just annotate the POJO with JPA annotations and handle the persistence that way? I did head down that route but found a project that was already doing the same, i.e., Kundera. For this reason, I kept the example more focused on Hector. Also, I am a bit wary regarding whether the JPA specs will be a good fit for a sparse column store like Cassandra.
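With the abstract class doing the work, a concrete DAO reduces to a thin subclass. A minimal sketch (the constructor wiring and the "Authors" column family name follow the example schema above; treat this as illustrative rather than the exact code in the download):

```java
// Sketch of a concrete DAO: save/find/delete plumbing is inherited
// from AbstractColumnFamilyDao; only the type parameters and the
// column family name need to be supplied.
public class AuthorDao extends AbstractColumnFamilyDao<String, Author> {

  public AuthorDao(Keyspace keyspace) {
    super(keyspace, String.class, Author.class, "Authors");
  }
}
```

Saving and loading an author then reduces to authorDao.save(author.getUserName(), author) and authorDao.find("sacharya").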

With the above-mentioned DAO, I modeled the rest of my code on Arin's example schema. The sample code provided contains a Blog Simulation, a multi-threaded test that simulates the workings of the blog application: authors being created, blog entries being created, authors commenting on blog entries, finding all blog entries created, getting blog entries by a tag, getting comments for a blog entry, and so on.

The example can be DOWNLOADED HERE. You will not need to install a Cassandra server, as the example uses an embedded server. The code, however, does not demonstrate any fail-over or consistency strategies. Enjoy!


sathsy said...

Hey Sanjay... was just browsing about cassandra and I felt that you must have blogged something about it. Good one...

Sanjay Acharya said...

Are you using Cassandra at work?

sathsy said...

No not right now. May be in the future. :)

B said...

Impressive the effort you put in all your articles.. thanks..

Sanjay Acharya said...

@B thanks

David said...

Why do you mention fail over and consistency strategies at the end of the article?

Doesn't Cassandra automatically handle this with settings in the .yaml and quorum reads? What am I missing about that?

What is the purpose of creating the AbstractColumnFamilyDao? Why not just use Kundera or some other open source product?

Sanjay Acharya said...

Sorry about the delay in replying, vacation.
I only mentioned fail over and consistency as the example in question does not attempt to 'demonstrate' the same in any way.

In the example mentioned, I am using Hector, an open source project. However, I found myself repeating certain patterns and thus introduced the AbstractColumnFamilyDao. Clearly it is not a requirement, but it served my case. If one wants to use Kundera, I would love to hear about or try an example of the same.

Vivs said...

Nice post. The recent Kundera release 2.0.4 performs well; the numbers are published at: