Deploying Apache Gora as a Query Broker

Presentation given at ApacheCon NA 2014 in Denver, CO. For more information, please see http://sched.co/1bsM9es

Lewis McGibbney

on 21 November 2014

Transcript of Deploying Apache Gora as a Query Broker

Lewis John McGibbney
ApacheCon NA 2014

Deploying Apache Gora as a Query Broker
Federated Web Search
DataStore Support
Apache Gora
Me, Myself and I
A bit about myself
PostDoc Stanford University, CA '13 - '14 Engineering Informatics
PhD Candidate Glasgow Caledonian University, UK '09 - '12 Legislative Informatics
<'09 ...Quantity Surveyor!!!
Search Engines, Web Search, NoSQL, Open Data (probably in that order)
Analyzing and addressing REAL problems.
Apache Member, Nutch PMC, Gora PMC, Any23 PMC, OODT PMC, Apache TAC, Usergrid (incubating) PPMC
In the beginning...
The Story Begins... Invisible Engines
Domain of Application: Federated Web Search
The TREC FWS Track
FWS Theory
Apache Gora
Gora as a Query Broker
Feel free to ask questions as we go through
In this case totally invisible!!!
Invisibility can be a strength...
local/national trade
Global: eCommerce
Some points to ponder over as we move on
1 - In 2008 it was easy to equate Windows OS and the PC industry (70% 2008 to about 30% in 2012*).
2 - It is however less trivial to equate Android OS and the mobile market** as Android is only one element in a complex structure that links mobile phone operators, handset makers, application providers, and software platform makers.

Quote of the day
"underlying software platform technology shape these industries, and the business strategies employed by firms in those industries, in fundamental and important ways"

Apache Web Server.. Apache Hadoop... _insert_next_game_changer

*According to Forrester Research
**Android >80% market share: 211.6m units in 3Q13
Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results.
Federated search is an information retrieval methodology that allows the simultaneous search of multiple searchable resources. A user makes a single query request which is distributed to the search engines participating in the federation. The federated search then aggregates the results that are received from the search engines for presentation to the user.
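The fan-out-and-aggregate flow described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: each search engine is modelled as a plain function from a query string to a list of result strings, and the class name FederatedSearchSketch and the toy engines are hypothetical stand-ins for real remote services.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical sketch: an "engine" is any function from a query to results.
class FederatedSearchSketch {

  static List<String> search(String query,
                             List<Function<String, List<String>>> engines) {
    // Distribute the single query to every engine in the federation.
    List<CompletableFuture<List<String>>> pending = new ArrayList<>();
    for (Function<String, List<String>> engine : engines) {
      pending.add(CompletableFuture.supplyAsync(() -> engine.apply(query)));
    }
    // Aggregate the responses, in engine order, into one list for presentation.
    List<String> merged = new ArrayList<>();
    for (CompletableFuture<List<String>> future : pending) {
      merged.addAll(future.join());
    }
    return merged;
  }

  public static void main(String[] args) {
    // Toy engines (assumptions for illustration only).
    Function<String, List<String>> news = q -> List.of("news:" + q);
    Function<String, List<String>> images = q -> List.of("images:" + q);
    System.out.println(search("gora", List.of(news, images)));
  }
}
```

Note that this naive version contacts every engine for every query, which is exactly the cost that resource selection tries to avoid.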
What do they all have in common?
Typical comparison sites are a good example of how FWS can serve a specific purpose well. However, consider the following:
On _most_ of the sites currently available, it appears that ALL data sources are involved in the federated query process for every query. Assume that query results are returned as an unorganized list (faster responses enter the list earlier). Finding the minimum element then requires linear time, Ω(n) as a lower bound. Add variables such as third-party response times, and this explains why we have to wait for query execution and the subsequent presentation of results. In practice, the total running time is therefore considerably worse than Ω(n).
Although the user cannot then query the results, consider sorting the list/array: only one initial, expensive sort is needed, followed by many cheap selection operations. Selection is O(1) (upper bound) in a sorted array, but O(n) in a linked list, even a sorted one, because a list lacks random access. In general, comparison-based sorting requires O(n log n) time, where n is the length of the list.
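The trade-off between a linear scan and a one-off sort can be illustrated with a small Java sketch. The class name PriceSelectionSketch and the sample prices are assumptions made for illustration; they do not come from any real comparison site.

```java
import java.util.Arrays;

// Hypothetical sketch contrasting the two strategies on integer prices.
class PriceSelectionSketch {

  // Unsorted results: every element must be inspected once, so O(n).
  static int minByScan(int[] prices) {
    int min = prices[0];
    for (int p : prices) {
      if (p < min) {
        min = p;
      }
    }
    return min;
  }

  // One up-front O(n log n) sort; afterwards the k-th cheapest
  // result is a constant-time array read.
  static int[] sortedCopy(int[] prices) {
    int[] copy = Arrays.copyOf(prices, prices.length);
    Arrays.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    int[] prices = {42, 17, 99, 23};
    System.out.println(minByScan(prices)); // prints 17
    int[] sorted = sortedCopy(prices);
    System.out.println(sorted[0]);         // cheapest: prints 17
    System.out.println(sorted[2]);         // third cheapest: prints 42
  }
}
```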
Is there a better way of doing this?

Do we need to query EVERY underlying data source EVERY time?

What happens when we are dealing with domains other than price comparisons, where the data are not integers?
The track investigates techniques for the selection and combination of search results from a large number of real on-line web search services. A list of 157 search engines is made available with sampled search results from each of these engines.
Task1: Vertical Selection
In web search, a vertical is associated with content dedicated to either a topic (e.g. “finance”), a media type (e.g. “images”) or a genre (e.g. “news”). For example, an “image” vertical contains resources such as Flickr and Picasa.
Therefore, the system should select a subset of verticals to retrieve from.
Input: A query
Output: A set of relevant verticals
Evaluation: Based on standard classification metrics: F-measure (main metric), precision and recall. The set of relevant verticals will be based on the relevance of the individual search results provided by the resources in that vertical.
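For readers less familiar with these classification metrics, here is a minimal Java sketch of precision, recall and the balanced F-measure (F1) over sets of vertical names. The choice of F1 (equal weight on precision and recall) is an assumption, as the slide does not say which weighting the track uses; the class name and sample verticals are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: set-based precision, recall and F1.
class VerticalSelectionMetrics {

  // Number of selected verticals that are also relevant.
  private static int overlap(Set<String> selected, Set<String> relevant) {
    Set<String> both = new HashSet<>(selected);
    both.retainAll(relevant);
    return both.size();
  }

  static double precision(Set<String> selected, Set<String> relevant) {
    return selected.isEmpty() ? 0.0
        : (double) overlap(selected, relevant) / selected.size();
  }

  static double recall(Set<String> selected, Set<String> relevant) {
    return relevant.isEmpty() ? 0.0
        : (double) overlap(selected, relevant) / relevant.size();
  }

  // Harmonic mean of precision and recall (assumed weighting).
  static double f1(Set<String> selected, Set<String> relevant) {
    double p = precision(selected, relevant);
    double r = recall(selected, relevant);
    return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
  }

  public static void main(String[] args) {
    Set<String> selected = Set.of("images", "news", "video");
    Set<String> relevant = Set.of("news", "video", "sports", "finance");
    // Two of three selected verticals are relevant; two of four relevant
    // verticals were found, so F1 = 4/7 (about 0.571).
    System.out.println(f1(selected, relevant));
  }
}
```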
Task2: Resource Selection
For practical reasons, it is not possible to query all available resources (search engines) when a query is issued to a federated search system. Therefore, the system first needs to select the appropriate search engines for the given query.
For example, suitable resources for a query such as 'Pittsburgh Steelers News' might be ESPN, Fox Sports, etc. To simulate a realistic setting, the participants are not allowed to sample or retrieve results from the resources themselves.
Input: A query
Output: A ranking of resources (the most appropriate resources are ranked highest)
Evaluation: The relevance of each resource is determined by calculating the graded precision* on its top 10 results.
*Using graded relevance assessments in IR evaluation, J. Kekäläinen and K. Järvelin, JASIST 53(13), 2002
Task3: Results Merging
The goal of results merging is to merge the search result snippets from previously selected resources in a single ranked list similar to that which we see in our price comparison sites, etc.
A query
A ranking of resources (the most appropriate resources are ranked highest)
Using two metrics: nDCG* to measure topical relevance, and IA-nDCG to measure diversity between verticals in addition to topical relevance.
*Christopher Burges et al. (Learning to rank using gradient descent. ICML 2005).
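As a rough illustration of the topical-relevance metric, the following Java sketch computes nDCG using the gain 2^rel - 1 and a log2(rank + 1) discount. It makes one simplifying assumption that should be stated loudly: the ideal ordering is derived from the returned list itself, not from all judged documents. IA-nDCG is not covered here. The class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of nDCG over a ranked list of relevance grades.
class NdcgSketch {

  // DCG with gain 2^rel - 1 and discount log2(rank + 1), ranks from 1.
  static double dcg(int[] grades) {
    double sum = 0.0;
    for (int i = 0; i < grades.length; i++) {
      double gain = Math.pow(2, grades[i]) - 1;
      double discount = Math.log(i + 2) / Math.log(2);
      sum += gain / discount;
    }
    return sum;
  }

  // Normalise by the DCG of the same grades sorted in descending order
  // (assumption: the ideal list is built from the returned grades only).
  static double ndcg(int[] grades) {
    int[] ideal = grades.clone();
    Arrays.sort(ideal);
    for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
      int tmp = ideal[i];
      ideal[i] = ideal[j];
      ideal[j] = tmp;
    }
    double idealDcg = dcg(ideal);
    return idealDcg == 0.0 ? 0.0 : dcg(grades) / idealDcg;
  }

  public static void main(String[] args) {
    // Already ideally ordered, so the ratio is 1.0.
    System.out.println(ndcg(new int[] {3, 2, 1, 0}));
    // The only relevant result sits at rank 2, so the score drops below 1.
    System.out.println(ndcg(new int[] {0, 1}));
  }
}
```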
The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop™ MapReduce support.
Generic Object Representation using Avro
Several data stores supported
2.0.39 (client driver)
Secret Store: MemStore
MemStore API
Gora as a Query Broker: Approach
Gora as a Query Broker: Proposed Solution
Data Model
Resource Selection Algorithm
Future Aim
The End

A huge thank you for the last 40 or so minutes.
Enjoy the rest of ApacheCon and your time in Denver

put(K key, T obj)
get(K key, String[] fields)
delete(K key)
deleteByQuery(Query<K, T> query)
execute(Query<K, T> query)
exp. av log(n)
exp. av log(n)
exp. av log(n)
map.firstKey() = O(1) constant
map.lastKey() = O(n) linear unless map is modified
If no fields are requested, we get ALL fields
Create a map.subMap first and last keys inclusive
return Result(K, T)
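The operations listed above can be mimicked with a sorted concurrent map from the Java standard library. This is a sketch under assumptions, not Gora's MemStore code: the class InMemoryStoreSketch is hypothetical, queries are reduced to a start key and an end key, and the fields argument of get is omitted. The java.util.concurrent.ConcurrentSkipListMap class documents an expected average log(n) time cost for containsKey, get, put and remove.

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical in-memory store; NOT the Gora MemStore implementation.
class InMemoryStoreSketch<K extends Comparable<? super K>, T> {

  private final ConcurrentSkipListMap<K, T> map = new ConcurrentSkipListMap<>();

  void put(K key, T obj) {
    map.put(key, obj);
  }

  T get(K key) {
    return map.get(key);
  }

  boolean delete(K key) {
    return map.remove(key) != null;
  }

  // Range query: a null bound falls back to the first or last key,
  // and both ends of the sub-map are inclusive.
  NavigableMap<K, T> execute(K startKey, K endKey) {
    if (map.isEmpty()) {
      return new ConcurrentSkipListMap<>();
    }
    K from = (startKey != null) ? startKey : map.firstKey();
    K to = (endKey != null) ? endKey : map.lastKey();
    return map.subMap(from, true, to, true);
  }

  // Deletes everything in the range; the sub-map is a live view,
  // so clearing it removes the entries from the backing map.
  long deleteByQuery(K startKey, K endKey) {
    NavigableMap<K, T> range = execute(startKey, endKey);
    long deleted = range.size();
    range.clear();
    return deleted;
  }

  public static void main(String[] args) {
    InMemoryStoreSketch<String, Integer> store = new InMemoryStoreSketch<>();
    store.put("a", 1);
    store.put("b", 2);
    store.put("c", 3);
    System.out.println(store.execute("a", "b").keySet()); // prints [a, b]
    System.out.println(store.deleteByQuery("b", null));   // prints 2
  }
}
```

The sketch ignores races between size() and clear(); a production store would need to decide how strict those counts have to be.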
Focus on
Resource Selection and Results Merging
Simulate geographically distributed data in heterogeneous storage mediums
Utilize a strength of Gora: access data regardless of its location (persistent location as well as geographical)
Use Apache Mahout* to build dictionaryMaps representing tf-idf term-to-frequency mappings.
Use MemStore as a cache and as a broker between incoming queries and data store selection
Improve upon this implementation and integrate it as an example in Gora trunk.

Gora REST API so that applications can call the Query Broker.

int TOP_K = 10;
// resource name -> term-frequency dictionary of each matching resource
Map<String, Map<String, Integer>> simObjs = new HashMap<>();
int count = 0;
for (Map<String, Integer> map : maps) {
  for (String term : terms) {
    if (map.containsKey(term)) {
      int freq = map.get(term);
      if (freq > 0) {
        higher = freq;
        simObjs.put(resourceName, map);
      } if (count < TOP_K)
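Because the slide fragment above is cut off, the following is a self-contained Java sketch of one way such a resource selection step could work, not a reconstruction of the original. It assumes each resource has a term-to-frequency dictionary, scores a resource by summing the frequencies of the query terms, and returns the names of the top K scoring resources. The scoring rule, the class name ResourceSelectionSketch and the sample data are all assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dictionary-based resource selection.
class ResourceSelectionSketch {

  static List<String> selectTopK(Map<String, Map<String, Integer>> dictionaries,
                                 List<String> terms, int topK) {
    // Score each resource by the summed frequency of the query terms
    // (assumed scoring rule; the slide does not show the full algorithm).
    Map<String, Integer> scores = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Integer>> entry : dictionaries.entrySet()) {
      int score = 0;
      for (String term : terms) {
        score += entry.getValue().getOrDefault(term, 0);
      }
      if (score > 0) {
        scores.put(entry.getKey(), score);
      }
    }
    // Rank the matching resources, highest score first, and keep the top K.
    List<String> ranked = new ArrayList<>(scores.keySet());
    ranked.sort(Comparator.comparing((String name) -> scores.get(name)).reversed());
    return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
  }

  public static void main(String[] args) {
    Map<String, Map<String, Integer>> dictionaries = new LinkedHashMap<>();
    dictionaries.put("espn", Map.of("steelers", 5, "news", 2));
    dictionaries.put("flickr", Map.of("photo", 9));
    dictionaries.put("fox", Map.of("news", 4));
    // espn scores 7, fox scores 4, flickr scores 0 and is dropped.
    System.out.println(selectTopK(dictionaries, List.of("steelers", "news"), 2));
  }
}
```

Only the selected resources would then be queried, so the broker avoids contacting every underlying data store for every query.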
MemStore cannot be shared across multiple JVMs. We cannot share memory.
Given that "map" is static, it will be shared by many MemStore instances within a JVM. The keys may or may not be of the same type, in which case the code: startKey = (K) map.firstKey(); could throw an exception for an illegal cast.
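The casting hazard can be reproduced with a small, self-contained Java sketch. It is not Gora code: the class SharedStaticMapSketch is hypothetical and only mimics the pattern of a static raw map shared by differently typed store instances. Because generics are erased at runtime, the unchecked cast inside firstKey() does not fail by itself; the ClassCastException surfaces where the caller uses the key as its declared type.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of a static raw map shared by typed stores.
class SharedStaticMapSketch<K, T> {

  @SuppressWarnings("rawtypes")
  static final ConcurrentSkipListMap SHARED = new ConcurrentSkipListMap();

  @SuppressWarnings("unchecked")
  void put(K key, T value) {
    SHARED.put(key, value);
  }

  @SuppressWarnings("unchecked")
  K firstKey() {
    // Unchecked cast: erased at runtime, so no check happens here.
    return (K) SHARED.firstKey();
  }

  // Returns true if reading the first key through a store with a
  // different key type fails with a ClassCastException.
  static boolean castFails() {
    SHARED.clear();
    SharedStaticMapSketch<String, String> stringStore = new SharedStaticMapSketch<>();
    SharedStaticMapSketch<Long, String> longStore = new SharedStaticMapSketch<>();
    stringStore.put("key", "value");
    try {
      // The compiler checks the cast to Long at this call site,
      // but the shared map actually holds a String key.
      long first = longStore.firstKey();
      return false;
    } catch (ClassCastException expected) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(castFails()); // prints true
  }
}
```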