Taking Bytes from Cassandra Clients: An Apache Gora Perspective

Presentation on implementing a pluggable client architecture for the gora-cassandra module of Apache Gora. For technical information please see https://issues.apache.org/jira/browse/GORA-224. This is being presented at Cassandra Summit 2013.

Lewis McGibbney

on 4 February 2016

Transcript of Taking Bytes from Cassandra Clients: An Apache Gora Perspective

Renato Marroquin* & Lewis John McGibbney**


Taking Bytes from Cassandra Clients:
An Apache Gora Perspective

Introduction to Gora
Enough already... what's the presentation about?
The End
What is Gora?
Introduction Cont'd
Why Gora?
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support. Gora uses the Apache Software License v2.0.
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap.
Project Goals
Simple... to become the standard data representation and persistence framework for big data. Can be grouped as follows:
Data Persistence
: Persisting objects to column stores such as Apache HBase™, Apache Cassandra™, Hypertable; key-value stores such as Voldemort, Redis, etc; SQL databases, such as MySQL, HSQLDB; flat files in local file system or Hadoop HDFS;
Data Access
: An easy to use Java-friendly common API for accessing the data regardless of its location;
: Persisting objects to Apache Lucene and Apache Solr indexes, accessing/querying the data with Gora API;
: Accessing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading;
MapReduce support
: Out-of-the-box and extensive MapReduce (Apache Hadoop™) support for data in the data store.
The following operations (as one would expect) are native in Gora:

delete(K key)
deleteByQuery(Query<K,T> query)
get(K key)
get(K key, String[] fields)
put(K key, T obj)
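The operations listed above can be illustrated with a minimal in-memory sketch. This is not Gora's own DataStore implementation: the class name `InMemoryStore` is hypothetical, records are modelled as plain field-name-to-value maps rather than Avro-backed beans, and the query is reduced to a predicate over keys.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified stand-in for a Gora-style data store.
// Records are field-name -> value maps, not Avro-generated persistent beans.
class InMemoryStore<K> {
    private final Map<K, Map<String, Object>> rows = new HashMap<>();

    // put(K key, T obj): insert or overwrite the record stored under key.
    void put(K key, Map<String, Object> obj) {
        rows.put(key, new HashMap<>(obj));
    }

    // get(K key): return a copy of the whole record, or null if absent.
    Map<String, Object> get(K key) {
        Map<String, Object> row = rows.get(key);
        return row == null ? null : new HashMap<>(row);
    }

    // get(K key, String[] fields): return only the requested fields.
    Map<String, Object> get(K key, String[] fields) {
        Map<String, Object> row = rows.get(key);
        if (row == null) {
            return null;
        }
        Map<String, Object> projected = new HashMap<>();
        for (String field : fields) {
            if (row.containsKey(field)) {
                projected.put(field, row.get(field));
            }
        }
        return projected;
    }

    // delete(K key): true if a record was actually removed.
    boolean delete(K key) {
        return rows.remove(key) != null;
    }

    // deleteByQuery(query): the query is modelled as a predicate over keys
    // (an assumption of this sketch); returns the number of records removed.
    long deleteByQuery(Predicate<K> query) {
        long before = rows.size();
        rows.keySet().removeIf(query);
        return before - rows.size();
    }
}
```

The point of the sketch is only that application code sees the same five calls whichever backend sits underneath.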
Design Rationale
The main justification lies in simplifying the interchange of different C*Clients.
We _do_not_ want to rewrite our applications if we want to change C* Client.
The most important thing is our data within Cassandra!!! That is how we sleep at night...ZZZZzzzzzzzzz
Version Control
Well...what do we currently support?
Gora is so much more than ORM for
NoSQL Data Stores
Another Observation
NoSQL community is new(ish)... NoSQL clients are newer!
There are many... many of them... for many different languages.
Which one(s) to use? Can't I have them all?
Existing Hector Client Architecture
Existing Configuration
5.1.18 (mysql-connector-java)
Introduction to Us
We're both Apache Gora devs
We met @ApacheCon 2011 in Vancouver
Forged a friendship during GSoC 2012 on a project consisting of an Amazon DynamoDB module and WebService API for Gora
In between the coding and discussion, we both enjoy a good drink... oh and very, very happy to be here @CassandraSummit
Renato Marroquin
Lewis John McGibbney
Scottish expat fae Glasgow
Post Doc @Stanford University: Engineering Informatics
Quantity Surveyor/Cost Consultant by profession
Keen OSS enthusiast @TheASF and beyond
Cycling mad

How do we go about deciding which client(s) we wish to use?
Mmmm... and how do we implement all of this in a user friendly way?
Based on the previous statement...
How expensive are such operations when using different Cassandra Clients?
What kind of overhead do we add by essentially wrapping client code for Gora operations?
Operations In a Nutshell
Introduction (dusted)
Problem Brief and some observations
Big Data Client Wars
Pluggable Client Architecture for Gora (technical)
Goraci (if we have time)
Discussion & Questions
What's Next for Gora
A lot...
A (brief) Story
Apache Gora
Java Driver
Hector Client
...and a bunch more classes, but the ones shown are where the magic happens
Avro schema
Compile the .avsc into Persistent classes...
Some (Historic) Specifics about Gora
Bean Definition
Pluggable Client Architecture (proposed)
Pluggable Configuration
Avro schema
Have a banana
Wait a Minute...
Problems So Far
Some Positive Outcomes
Now go!!!
Apache Accumulo has a test suite that verifies that data is not lost at scale. This test, continuous ingest, runs many ingest clients that continually create linked lists containing 25 million nodes. At some point the clients are stopped and a MapReduce job is run to ensure no linked list has a hole. A hole indicates data was lost.

Goraci is a version of the test suite written using the Gora API.
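The hole check described above can be sketched in a few lines. This is a single-process illustration, not the Goraci code: the real verification runs as a MapReduce job over the data store. The class name `LinkedListVerifier` and the representation of a node as an id mapped to the id of its previous node are assumptions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical single-process version of the "no holes" check.
final class LinkedListVerifier {
    private LinkedListVerifier() {}

    // nodes maps each stored node id to the id of the node it points back to;
    // a null value marks the head of a list.
    // Returns the ids that are referenced but were never stored, i.e. the holes.
    static Set<Long> findHoles(Map<Long, Long> nodes) {
        Set<Long> holes = new TreeSet<>();
        for (Long previous : nodes.values()) {
            if (previous != null && !nodes.containsKey(previous)) {
                holes.add(previous);
            }
        }
        return holes;
    }
}
```

An empty result means every referenced node exists. Any id in the result is a node that some other node points to but that cannot be found, which is the data-loss signal the test looks for.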
Thank you so much, have a great C* Summit 2013
Why are you guys telling us this?
The reason is simple... it's potentially opening up Apache Cassandra to be used as the primary storage mechanism in many, many places and projects. Remember, Gora is much more than ORM for NoSQL.
GSoC 2013
Integration with Cascading
Cascading is a nice framework for working with MapReduce at a higher level. Cascading defines a Tap architecture which is the source/sink for records. This is very similar to Gora's DataStores. The project will develop a GoraTap as an adapter for gora->cascading. This way any data store Gora supports can be used in Cascading.
GSoC 2013
OracleNoSQL Module
Expanding datastore support is key to Gora becoming the standard big data persistence framework. The goal is to implement a new module, gora-oraclenosql, which will allow Gora users and developers to use the functionality of the enterprise-class Oracle NoSQL database and vice versa.
Apache Giraph
Apache Giraph is a graph-processing framework which can be used as regular Hadoop jobs in order to leverage existing Hadoop infrastructure.
Gora will be used to provide a new vertex input format for Giraph and help Giraph provide a wider spectrum of data sources where graph processing could be done and stored.
Flexible Metadata Cataloging in Apache OODT CAS FileMgr
OODT is metadata for middleware (and vice versa):
Transparent access to distributed resources
Data discovery and query optimization
Distributed processing and virtual archives

Gora will eventually enable OODT users to direct the persistence of their metadata cataloging to many storage layers. It will also enable them to access, query and analyze it in a much more data-focused manner than they can currently.
Vacant Slide...
Oh no it's not
Can Gora solve your {$problem}?
Let us know
Paragliding fan
Master in Computer Science @Pontifical Catholic University of Rio de Janeiro
Data management consultant
Big data and OpenSource @TheASF enthusiast
Serialization was being done by persisting data 'as it was' e.g. in its native type.
Now serialization is done by persisting everything as bytes, and using a schema inferrer to determine its type.
Right now we support Avro as an extra backend, but also as a serialization model.
In the future, we plan to support other serialization libraries such as Twitter's Parquet. LinkedIn's Voldemort is an excellent example of how this can be done.
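The idea of persisting everything as bytes and recovering the type from the schema can be sketched as follows. This is an illustration under stated assumptions, not the gora-cassandra serializer: `FieldType` and `ByteCodec` are hypothetical names, only three types are handled, and the schema type is passed in directly rather than inferred from an Avro schema.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical subset of the types a schema might declare for a field.
enum FieldType { INT, LONG, STRING }

// Hypothetical codec: values go to the store as raw bytes, and the
// declared schema type tells us how to read them back.
final class ByteCodec {
    private ByteCodec() {}

    static byte[] toBytes(Object value) {
        if (value instanceof Integer) {
            return ByteBuffer.allocate(Integer.BYTES).putInt((Integer) value).array();
        }
        if (value instanceof Long) {
            return ByteBuffer.allocate(Long.BYTES).putLong((Long) value).array();
        }
        if (value instanceof String) {
            return ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("Unsupported type: " + value.getClass());
    }

    static Object fromBytes(byte[] bytes, FieldType type) {
        switch (type) {
            case INT:
                return ByteBuffer.wrap(bytes).getInt();
            case LONG:
                return ByteBuffer.wrap(bytes).getLong();
            case STRING:
                return new String(bytes, StandardCharsets.UTF_8);
            default:
                throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }
}
```

The bytes carry no type information of their own in this sketch, so reading them back correctly depends entirely on the schema.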
Defining only a schema frees the user from the burden of persisting data in different backends.
Having to generate a common data bean across different data stores is a challenging task.
Data bean state is managed using its Avro schema, and a state manager ... let Gora worry about it.
Different clients have different abstraction levels.
Error handling is also different.
Concurrency control is also different.
Handling super columns and composite columns is different.
Handling Cassandra's new features is different across all clients.
Every client is different and has its own unique characteristics... which of course (unfortunately) creates problems... for us!!!
Serialization will be done by each client
Concurrency will be handled by each client with a little help from Gora.
Data modeling will be the main focus . . . finally!!!
"Freedom" from clients by letting us try them without too much burden
Ease of extensibility... if you know your client API, then plug and play. This is a real nice one.
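The plug-and-play idea can be sketched as a small client interface chosen by configuration. Everything named here is hypothetical: `CassandraClientSketch`, `ClientRegistry`, the property key `sketch.cassandra.client`, and the two in-memory stand-ins, which do not call the Hector or Java Driver APIs. It shows the shape of the approach and is not the GORA-224 implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical abstraction: the data store talks only to this interface.
interface CassandraClientSketch {
    String name();
    void write(String key, byte[] value);
    byte[] read(String key);
}

// Shared in-memory behaviour so the sketch runs without a Cassandra cluster.
abstract class InMemoryClientBase implements CassandraClientSketch {
    private final Map<String, byte[]> data = new HashMap<>();

    @Override
    public void write(String key, byte[] value) {
        data.put(key, value.clone());
    }

    @Override
    public byte[] read(String key) {
        byte[] value = data.get(key);
        return value == null ? null : value.clone();
    }
}

// Stand-in for a Hector-backed client; a real one would wrap the Hector API.
class FakeHectorClient extends InMemoryClientBase {
    @Override
    public String name() { return "hector"; }
}

// Stand-in for a Java Driver-backed client.
class FakeJavaDriverClient extends InMemoryClientBase {
    @Override
    public String name() { return "java-driver"; }
}

// Picks the client named in configuration, so application code never changes.
final class ClientRegistry {
    // Hypothetical property key, used only for this sketch.
    static final String CLIENT_PROPERTY = "sketch.cassandra.client";

    private static final Map<String, Supplier<CassandraClientSketch>> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put("hector", FakeHectorClient::new);
        FACTORIES.put("java-driver", FakeJavaDriverClient::new);
    }

    private ClientRegistry() {}

    static CassandraClientSketch fromProperties(Properties props) {
        String id = props.getProperty(CLIENT_PROPERTY, "hector");
        Supplier<CassandraClientSketch> factory = FACTORIES.get(id);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown client: " + id);
        }
        return factory.get();
    }
}
```

In this sketch, swapping clients is a one-line configuration change, and supporting a new client means implementing the interface and registering it.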
Integration with Apache Bigtop
Initiative to make Gora part of the Bigtop distribution.
What it would accomplish is a much tighter integration between Gora and HBase/Hadoop/other big data projects and ready-made availability in the Bigtop binary distro (and potentially commercial Hadoop vendors).
Although this has been ported to gora-cassandra, due to lack of resources (and time) we've not been able to execute over the pluggable architecture and see the numbers. In the long term we would like to have this as part of the Bigtop distro mentioned earlier.
In Conclusion
We're nearly there, we only started a month or so ago! You can check out the development:
You can also see the Jira integration
If we didn't try this one, we would have been kicking ourselves, that is for sure! We would really like to hear from vendors/developers of other (maintained and active) clients, who would like to further expose their product to more users.
In Review
There are many, many clients out there. We were not content with simply accepting that the choice of client is a trivial decision... it is not.
We were even more shocked to discover that 'big data guys' do not necessarily even consider client architecture as part of their agenda... strange but true.
We've learned a good bit more about client characteristics e.g. serialization, abstraction, error handling, concurrency, etc. and have therefore been able to learn more about where we want (and see) such aspects of Gora going in the future.
Of the data stores we support in Gora, the pluggable client observation only came to the surface within the gora-cassandra module. This in itself is interesting, and actually says more about Cassandra than it does about Gora. It's a great, vibrant project and we like it.
Questions perhaps???