In the Mix with gora-hbase
'Pluggable' storage mechanism via the DataStore interface design. DataStores are initialized via DataStoreFactory...
Why Gora... #2
Why not... Kiji?
++ Kiji has 'excellent' support for schema evolution
-- 'Only' supports writing and reading data to HDFS, HBase and Avro files.
Why not Avro?
Gora can be considered as an acronym for 'Generic Object Representation using Avro'... it is our native Object Datum Read/Write + Object Serialization mechanism.
Go and ask: Apache Nutch, Giraph, Log4j 2.X, BigTop, OODT, CloudStack... many non-ASF projects.
The fact that the storage engine is pluggable e.g. [insert the sh*t hot data store of the moment here] but with schema?
The out of the box MR / analysis... getPartitions(Query<K, T> query)
The ability to provide a simple API for data access and querying based on a Key Value storage model
Good track history of innovative implementations including the Google Summer of Code successes we had in 2012, 2013 and... fingers crossed in 2014... beyond.
A datastore-first methodology where we provide a simple API for easy consumption... mappings are datastore specific so that the full data model is utilized.
A dynamic community with 18 PMC members, including emeritus members...
Implementing new features (such as UNION support) has proved time consuming across the supported data stores, e.g. finding the trade-off between consistent implementation of common functionality across data stores vs. sticking with our morals of leveraging the underlying data model/mappings... this is difficult!
There are undoubtedly other instances of such behavior slowing development based on committer input.
What happens if you do not want to use Avro as your underlying data representation format?
Support for schema evolution is a WIP... see AVRO-1124 (RESTful service for holding schemas). Currently dynamic changes/revisions to schemas 'can' result in some nasty outcome(s). An NPE is typical...
In the Mix with gora-hbase
Lewis John McGibbney
London Meetup - 19/05/2014
Me, Myself and I
Glasgow, Scotland - PhD Legislative Informatics
Engineering Informatics (PostDoc)
Engineering Application Software Engineer NASA Jet Propulsion Laboratory/CalTech
Open Source advocate/enthusiast
Apache Nutch, Gora, Any23, OODT, TAC, Tika, Usergrid, OCW...
Crawler Commons, Hector C* Client...
GSoC Mentor 2012-present
Interests: cycling... IR, Web search, open data initiatives, NoSQL, distributed systems engineering, persistence and serialization...
Making new friends and meeting old ones through shared interests.
This talk is centred around mapping data to HBase: what functionality gora-hbase currently provides and approaches to making this better.
What do we project committers see as the core strengths of Gora as an O2DM framework?
Why Gora over some alternatives?
What is the overhead of using it... what are the cons?
Iterate through the Avro schema fields in obj; if a field is dirty, obtain the column mapping for that field. We then call:
Topmost fields of the record are persisted in "raw" format (not avro serialized). This behavior happens in maps and arrays too.
For UNIONs of type ["null","type"] (a.k.a. optional fields) we persist the field as if it were ["type"]; however, the column is deleted if the object value is null (so a value read afterwards will be null).
We define our data model using Avro schemas.
Gora uses Apache Avro for bean definition, not bytecode enhancement or annotations. Avro has improved dramatically over the last few years... more of this to come. First let's look at an example schema.
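The put walk described above can be sketched in plain Java, with no Gora or HBase dependency. Everything in it is a hypothetical stand-in: `PutSketch`, `mutationsFor`, the `family:qualifier` column strings and the `PUT`/`DELETE` labels are illustrative names, not Gora's API. It only shows the shape of the logic: skip clean fields, write dirty ones, and delete the column when an optional field is null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of the put() walk; NOT Gora's actual implementation. */
final class PutSketch {

  /**
   * @param values  field name -> current value (null models an unset optional field)
   * @param dirty   names of the fields modified since the last write
   * @param mapping field name -> column, written here as "family:qualifier" (assumed format)
   * @return the mutations that would be sent to the table, in field order
   */
  static List<String> mutationsFor(Map<String, Object> values,
                                   Set<String> dirty,
                                   Map<String, String> mapping) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, Object> e : values.entrySet()) {
      if (!dirty.contains(e.getKey())) {
        continue; // clean fields are not written again
      }
      String column = mapping.get(e.getKey());
      if (e.getValue() == null) {
        // optional field set to null: remove the column so a later read yields null
        out.add("DELETE " + column);
      } else {
        // top-level value is stored as-is, under its mapped column
        out.add("PUT " + column);
      }
    }
    return out;
  }
}
```

With fields `url` (dirty, set), `title` (dirty, null) and `content` (clean), the sketch yields one put for `url`, one delete for `title`, and nothing for `content`.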
Supports primitive data types:
- null, boolean, int, long, float, double, bytes, string
Supports complex data types:
- records, enums, arrays, maps, unions, fixed
Object-to-datastore mappings are backend specific so the full data model and functionality of the datastore can be utilized.
Implementation and extension of DataStore class
Create table in HBase if one doesn't already exist
HBase client scanner caching, which improves scan performance in HBase (default 0)
HBase autoflushing. Enabling this decreases write performance. Default=false
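As a rough illustration of how these four settings might be supplied and read, here is a small Java sketch using only `java.util.Properties`. The property keys are the ones I believe the gora-hbase documentation uses for these descriptions; treat them as an assumption and check them against your Gora version. `GoraHBaseSettings` and its methods are hypothetical helpers, and the value 1000 is just an example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper; the keys below are assumed from the gora-hbase docs. */
final class GoraHBaseSettings {

  /** Example gora.properties content (values are illustrative only). */
  static final String EXAMPLE = String.join("\n",
      "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore",
      "gora.datastore.autocreateschema=true",
      "gora.datastore.scanner.caching=1000",
      "hbase.client.autoflush.default=false");

  /** Parses properties-file text into a Properties object. */
  static Properties parse(String text) {
    Properties p = new Properties();
    try {
      p.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return p;
  }

  /** Scanner caching; falls back to 0, the default stated above. */
  static int scannerCaching(Properties p) {
    return Integer.parseInt(p.getProperty("gora.datastore.scanner.caching", "0"));
  }

  /** Autoflush; falls back to false, the default stated above. */
  static boolean autoflush(Properties p) {
    return Boolean.parseBoolean(p.getProperty("hbase.client.autoflush.default", "false"));
  }
}
```

Leaving autoflush off lets the client batch writes, which is why enabling it costs write performance.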
Put simply, this method wraps
Passing in equivalent to * || all fields
We obtain a List<Field> schemaFields for the object by accessing a cached persistent object relating to the K key.
schemaFields contains a 'magic' __g__dirty field that records which fields are dirty. We iterate over schemaFields and add the fields to a new ArrayList<Field>();
Return this new ArrayList<Field>() as the fields with which to construct the org.apache.hadoop.hbase.client.Get.
Once the Get has been defined we can call HBaseTableConnection.get(get).
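The field-selection step above can be sketched as follows, in plain Java without Gora or HBase on the classpath. Two things here are my assumptions rather than statements from the slides: that a null field array stands for "all fields", and that the `__g__dirty` bookkeeping field itself is left out of the list handed to the Get. `FieldSelection` and `fieldsToQuery` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of choosing which fields a Get should fetch. */
final class FieldSelection {

  /** Name of the 'magic' bookkeeping field mentioned above. */
  static final String DIRTY_FIELD = "__g__dirty";

  /**
   * @param requested    fields the caller asked for, or null for all (assumption)
   * @param schemaFields every field name in the record schema, bookkeeping included
   */
  static List<String> fieldsToQuery(String[] requested, List<String> schemaFields) {
    if (requested != null) {
      return Arrays.asList(requested); // the caller named the fields explicitly
    }
    List<String> out = new ArrayList<>();
    for (String name : schemaFields) {
      // assumption: the bookkeeping field is not a real column, so skip it
      if (!DIRTY_FIELD.equalsIgnoreCase(name)) {
        out.add(name);
      }
    }
    return out;
  }
}
```

The resulting names would then be translated to columns through the mapping and added to the Get before it is executed.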
This is REALLY simple... HBaseTableConnection.delete(new Delete(toBytes(key)));
N.B. HBase does not return success information, and executing a get to confirm success is a bit costly.
Similar to DataStore.get(K key, String[] fields), we obtain a String[] schemaFields for the object by accessing a cached persistent object relating to the K key.
Asserts whether all fields are queried, which means that complete rows will be deleted.
We execute the query via query.execute(). NOTE that Query here is an interface in Gora and the API is consistent across DataStores.
This returns us a Result object.
If all fields are queried... we do the delete right there and then
Otherwise we continue to iterate through the result members, buffering Deletes before executing them
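The two branches of the delete-by-query flow can be sketched against an in-memory map standing in for the HBase table. This is a hedged illustration only: `DeleteByQuerySketch`, `deleteByQuery`, and the map-of-maps "table" are hypothetical, and `matchingKeys` stands in for the rows a real Result would iterate over.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical model of deleteByQuery; NOT Gora's actual implementation. */
final class DeleteByQuerySketch {

  /** True when the query names every schema field, i.e. whole rows go. */
  static boolean coversAllFields(String[] queried, String[] schemaFields) {
    return new HashSet<>(Arrays.asList(queried)).containsAll(Arrays.asList(schemaFields));
  }

  /**
   * @param table        row key -> (field -> value), standing in for the HBase table
   * @param matchingKeys row keys the query matched
   * @return number of rows touched
   */
  static long deleteByQuery(Map<String, Map<String, String>> table,
                            List<String> matchingKeys,
                            String[] queried,
                            String[] schemaFields) {
    boolean wholeRows = coversAllFields(queried, schemaFields);
    List<String[]> buffered = new ArrayList<>(); // pending {rowKey, field} deletes
    long touched = 0;
    for (String key : matchingKeys) {
      if (wholeRows) {
        table.remove(key); // all fields queried: delete the row right away
      } else {
        for (String field : queried) {
          buffered.add(new String[] {key, field}); // buffer per-column deletes
        }
      }
      touched++;
    }
    // execute the buffered column deletes in one pass
    for (String[] d : buffered) {
      Map<String, String> row = table.get(d[0]);
      if (row != null) {
        row.remove(d[1]);
      }
    }
    return touched;
  }
}
```

Buffering the partial deletes keeps the iteration over results separate from the writes, so they can go to the table as one batch.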