The State of Gora
Transcript of The State of Gora
The familiar concept
Shadowing the notion that large datasets are split over nodes and that data is distributed similarly, we move the computation to the data rather than move data to computation.
This allows Hadoop & consequently Gora to achieve high data locality which in turn yields high performance. Achieving data locality is a core part of any datastore implementation aiding queries and subsequently analysis tasks submitted as Gora jobs.
o.a.g.store.DataStore#getPartitions(Query<K, T> query) partitions the query and returns a list of PartitionQuery objects, each of which will execute on local data. The query parameter represents the base query to create the partitions for. If the query is null, the data store returns the partitions for the default query (returning EVERY object).
This method is implemented via @Override within the datastore implementation. It is currently supported in both HBaseStore and AccumuloStore. It requires the implementation of a datastore-specific InputFormat, which obtains the InputSplits required for the computation.
Generally speaking, the idea is to obtain a List<PartitionQuery<K, T>> whose entries correspond to the nodes on which the data (for computation) resides.
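The partitioning idea can be illustrated with a toy model. This is only a sketch and not Gora's implementation: KeyRange and partition are hypothetical stand-ins, and it assumes the key space is split into contiguous ranges, each hosted on a single node. The query range is intersected with every hosted range, giving one sub-query per node that holds matching data.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of query partitioning. None of these types are Gora classes;
// they are hypothetical stand-ins used only to illustrate the idea.
final class PartitionSketch {

    // A contiguous slice of the key space on one node: [start, end).
    static final class KeyRange {
        final String node;
        final long start;
        final long end;

        KeyRange(String node, long start, long end) {
            this.node = node;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return node + " [" + start + ", " + end + ")";
        }
    }

    // Intersects the query range [queryStart, queryEnd) with every hosted range
    // and returns one sub-range per node that overlaps the query.
    static List<KeyRange> partition(long queryStart, long queryEnd, List<KeyRange> hosted) {
        List<KeyRange> partitions = new ArrayList<>();
        for (KeyRange r : hosted) {
            long start = Math.max(queryStart, r.start);
            long end = Math.min(queryEnd, r.end);
            if (start < end) { // skip ranges that do not overlap the query
                partitions.add(new KeyRange(r.node, start, end));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<KeyRange> hosted = List.of(
                new KeyRange("node-a", 0, 100),
                new KeyRange("node-b", 100, 200),
                new KeyRange("node-c", 200, 300));
        // The query touches all three nodes; each sub-range can be processed
        // on the node that already holds the data.
        for (KeyRange p : partition(50, 250, hosted)) {
            System.out.println(p);
        }
    }
}
```

In a real datastore the hosted ranges would come from the backend (for example, the splits reported by its InputFormat) rather than from a hard-coded list.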
As with many things in Gora, all datastore implementations are different!
Upgrade to Avro 1.7.X
Complete overhaul of Persistent API: Removal of StateManager as the NEW and readable object states were not being used
Implement a mechanism 'inside' of persistent objects to indicate whether or not objects or fields are dirty. A __g__ field was therefore added, which contains minimal bytes to track which fields within an object are dirty.
GoraCompiler now extends AvroCompiler. We now also use Velocity templates, making the compiler flexible.
Addition of wrapper classes within o.a.g.persistency.impl e.g. DirtyMap/List/Set wrappers to track structural modifications to these types.
Plenty, plenty more...
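The dirty-field mechanism described above can be sketched roughly as follows. This is an illustration only, not the code GoraCompiler generates: the class, the field indices and the single-byte mask are all assumptions made for the example. The point is that one bit per field is enough to record which fields changed since the object was last written.

```java
// Hypothetical bean illustrating per-field dirty tracking with a bit mask.
// This is not generated Gora code; names and layout are assumptions.
final class DirtyTrackingSketch {
    static final int NAME = 0;   // bit index for the name field
    static final int DRINK = 1;  // bit index for the drink field

    private String name;
    private String drink;
    private byte dirtyBits;      // one bit per field, enough for up to 8 fields

    void setName(String name) {
        this.name = name;
        markDirty(NAME);
    }

    void setDrink(String drink) {
        this.drink = drink;
        markDirty(DRINK);
    }

    String getName() { return name; }

    String getDrink() { return drink; }

    // True if the given field changed since the last clearDirty().
    boolean isDirty(int fieldIndex) {
        return (dirtyBits & (1 << fieldIndex)) != 0;
    }

    // True if any field changed since the last clearDirty().
    boolean isDirty() {
        return dirtyBits != 0;
    }

    // Called after a successful write so later writes only send new changes.
    void clearDirty() {
        dirtyBits = 0;
    }

    private void markDirty(int fieldIndex) {
        dirtyBits |= (byte) (1 << fieldIndex);
    }

    public static void main(String[] args) {
        DirtyTrackingSketch user = new DirtyTrackingSketch();
        user.setName("Lewis");
        System.out.println("name dirty:  " + user.isDirty(NAME));
        System.out.println("drink dirty: " + user.isDirty(DRINK));
    }
}
```

A datastore could consult isDirty(fieldIndex) when persisting and write only the changed columns, then call clearDirty().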
Upgrade to Avro 1.7.X Cont'd
User user1 = new User();
User user1 = new User("Lewis", "Whisky");
User user1 =
Builders automatically set default values specified in the schema. Using object constructors generally offers better performance, as builders create a copy of the data structure before it is written.
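The trade-off between constructors and builders can be sketched with a hand-written stand-in. The User class below is hypothetical and is not the class the compiler generates; the "Water" default and the field names are assumptions made for the example. It shows the two behaviours mentioned above: the builder pre-fills a default, and build() copies the builder's state into a fresh object.

```java
// Hand-written stand-in for a generated bean; not actual compiler output.
final class BuilderSketch {

    static final class User {
        String name;   // no default in this hypothetical schema
        String drink;  // hypothetical schema default: "Water"

        // Plain constructor: nothing is pre-filled, fields stay null.
        User() {}

        User(String name, String drink) {
            this.name = name;
            this.drink = drink;
        }

        static Builder newBuilder() {
            return new Builder();
        }
    }

    static final class Builder {
        private String name;
        private String drink = "Water"; // the builder applies the default

        Builder setName(String name) {
            this.name = name;
            return this;
        }

        Builder setDrink(String drink) {
            this.drink = drink;
            return this;
        }

        // Copies the builder's state into a new User; this extra copy is
        // the cost that the direct constructor avoids.
        User build() {
            return new User(name, drink);
        }
    }

    public static void main(String[] args) {
        User viaConstructor = new User();
        User viaBuilder = User.newBuilder().setName("Lewis").build();
        System.out.println("constructor drink: " + viaConstructor.drink);
        System.out.println("builder drink:     " + viaBuilder.drink);
    }
}
```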
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support. Gora uses the Apache Software License v2.0.
Compiling data beans
Me, myself and I
Currently PostDoc (Engineering Informatics)
Keen open source enthusiast
Apache member and committer on several projects
GSoC mentor 2012/13
What's on the agenda?
An introduction to Apache Gora.
Basic API design and usage.
MapReduce support in Gora
Developments in trunk (0.4-SNAPSHOT)
Monday 16th December 2013
Dublin NoSQL Meetup
Compiling data beans
We define our data model using Avro schemas.
Gora uses Apache Avro for bean definition, not bytecode enhancement or annotations. Avro has improved dramatically over the last few years... more of this to come. First let's look at an example schema.
The overall goal for Gora is to become the standard representation and persistence framework for BIG data. The roadmap for Gora can be defined as follows:
Data Persistence: Persisting Java objects to column stores such as Apache HBase, Apache Cassandra, Hypertable, etc.; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB, etc.; flat files (sequence) which reside within HDFS.
Data Access: An easy to use Java-friendly common API for accessing data regardless of its location.
Indexing: Persisting objects to Apache Lucene and Apache Solr indexes, accessing the data with the Gora API.
Analysis: Accessing the data and easing analysis through adapters for Apache Pig, Apache Hive, Cascading, etc.
MapReduce support: Out-of-the-box and extensive MapReduce support for data in the data store.
Without further ado, let's dig into the API.
Supported primitive data types: null, boolean, int, long, float, double, bytes, string
Complex types: records, enums, arrays, maps, unions, fixed
Object-to-datastore mappings are backend specific so the full data model and functionality of the datastore can be utilized.
Two important properties
gora.datastore.default = org.apache.gora.cassandra.store.CassandraStore
Remaining properties are datastore specific
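As a rough illustration of how such a property could be read, the sketch below parses properties text with the standard java.util.Properties class. It assumes the setting lives in an ordinary Java properties file; the helper name defaultDataStore is hypothetical and this is not how Gora itself loads its configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Minimal sketch: reading the default datastore class name from properties text.
// The helper below is hypothetical and not part of Gora.
final class GoraPropertiesSketch {

    // Returns the value of gora.datastore.default, or null if it is absent.
    static String defaultDataStore(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("gora.datastore.default");
    }

    public static void main(String[] args) {
        String text = "gora.datastore.default = org.apache.gora.cassandra.store.CassandraStore\n";
        System.out.println(defaultDataStore(text));
    }
}
```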
Piece of cake
We need to use the GoraCompiler, which is soon to be released as an independent Maven plugin and can be included in your pom.xml.
Core class is DataStore.java (or FileBackedDataStore). This handles actual object persistence. Objects can be persisted, fetched, queried or deleted by the DataStore methods. DataStore instances can be constructed by invoking DataStoreFactory.createDataStore.
Base class for client interaction is o.a.g.store.impl.DataStoreBase (or FileBackedDataStoreBase). Used to indirectly interact with the DataStore. All DataStore implementations extend this class.
Core classes are BeanFactory, Persistent and State. BeanFactory enables the construction of keys and persistent objects. The objects persisted by Gora implement Persistent, and State defines the actual state of an object or field. State is managed through the StateManager. Objects can be NEW, CLEAN (UNMODIFIED), DIRTY (MODIFIED) or DELETED.
As with the Store API, persistency has base classes for client interaction e.g. o.a.g.p.impl.BeanFactoryImpl, o.a.g.p.impl.PersistentBase and o.a.g.p.impl.StateManagerImpl respectively.
All objects persisted by Gora extend PersistentBase.
Core classes are Query, PartitionQuery and Result.
PartitionQuery divides the results of the Query into multiple partitions so that the queries can be run locally on the nodes that hold the data. PartitionQuery objects are used for generating Hadoop InputSplits.
Queries are constructed by the DataStore implementation via DataStore.newQuery().
For consistency within the gora-core API clients access queries through the implementations in o.a.g.q.impl.
Contains ALL of the Gora MR functionality and Utilities required to make use of Gora within a MR environment.
This includes GoraMapper (extends Mapper), GoraReducer (extends Reducer), and all record counter, reader and writer implementations.
Persistent and String serialization/deserialization is done using o.a.g.avro.PersistentDatumWriter (for writing Avro's dirty and readable information) with Avro's BinaryEncoder.
There were, and still are, some problems, namely:
Subsequent upgrade to HBase 0.90.4 --> 0.94.14
Subsequent upgrade to Cassandra 1.0.2 --> 2.0.2
Numerous additional logging libraries on the classpath. Unnecessary.