
From MapReduce to Pig to SPARQLPig

A very quick introduction to MapReduce and Pig, and then to how we can process SPARQL queries using Pig

Spyros Kotoulas

on 6 March 2011


Transcript of From MapReduce to Pig to SPARQLPig

MapReduce
Developed as a framework for large-scale parallel processing
Hides much of the complexity of parallel programming

Programs consist of a series of Map and Reduce functions:
Map: <k1, v1> to <k2, v2>
Reduce: <k2, Iterator<v2>> to <k3, v3>
(diagram taken from: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/)

Open-source implementation (Hadoop)
Mostly developed by Yahoo!
Map and Reduce functions are written in Java
Widely used
+ Very scalable (thousands of nodes)
+ Relatively easy to program
- Load balancing in the Reduce phase
- Significant coordination overhead
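The Map and Reduce signatures above can be made concrete with the classic word-count example. This is a minimal single-process sketch of the programming model, not Hadoop itself; the grouping step stands in for the framework's shuffle phase:

```python
# Minimal sketch of the MapReduce contract:
#   map:    (k1, v1)        -> list of (k2, v2)
#   reduce: (k2, [v2, ...]) -> (k3, v3)
from collections import defaultdict

def map_fn(_, line):
    # Emit one (word, 1) pair per word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Sum all counts grouped under the same word.
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)          # the framework's shuffle/group step
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_mapreduce(enumerate(["to pig to sparql", "pig"]), map_fn, reduce_fn)
# counts == {"to": 2, "pig": 2, "sparql": 1}
```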
Pig
Framework for large-scale analysis

Pig script:
A = LOAD 'tenGBfile' AS (line:chararray);
B = FILTER A BY line MATCHES '.*amsterdam.*';
C = GROUP B ALL;
D = FOREACH C GENERATE COUNT(B);

Joins, filters, regexes, legacy programs, unions, user-defined functions, aggregates, all compiled to a MapReduce program
(figure taken from: http://girlincomputerscience.blogspot.com/)
+ Easier to deal with load balancing
+ Very easy to use
- Will only do very basic optimisations

SPARQL
Query language for the Semantic Web

SELECT ?N WHERE { ?X type Person . ?X name ?N . ?X livesIn Amsterdam . }
CONSTRUCT { ?X type Amsterdammer } WHERE { ?X type Person . ?X livesIn Amsterdam . }
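The mapping to relational algebra can be seen by treating each triple pattern as a selection over a triple table and each shared variable as a join key. A toy sketch in Python, evaluating the SELECT query above over made-up data (the triples and nested-loop join are illustrative only, not the system described here):

```python
# Toy triple table; subjects, predicates and objects are illustrative.
triples = [
    ("alice", "type", "Person"), ("alice", "name", "Alice"),
    ("alice", "livesIn", "Amsterdam"),
    ("bob", "type", "Person"), ("bob", "name", "Bob"),
    ("bob", "livesIn", "Utrecht"),
]

def match(pattern):
    """Selection: return one {variable: value} binding per matching triple.
    Terms starting with '?' are variables."""
    out = []
    for t in triples:
        binding = {}
        for term, value in zip(pattern, t):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            out.append(binding)
    return out

def join(left, right):
    """Natural join on shared variables (nested-loop join)."""
    return [{**l, **r} for l in left for r in right
            if all(l[v] == r[v] for v in l.keys() & r.keys())]

# SELECT ?N WHERE { ?X type Person . ?X name ?N . ?X livesIn Amsterdam . }
result = match(("?X", "type", "Person"))
result = join(result, match(("?X", "name", "?N")))
result = join(result, match(("?X", "livesIn", "Amsterdam")))
names = [b["?N"] for b in result]
# names == ["Alice"]
```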
Maps to relational algebra

Approach:
Translate SPARQL to Pig Latin
Use Pig to translate Pig Latin to MapReduce
Run on large clusters for very big datasets
Focus on expensive queries
No preprocessing, no indexing

Problems: query optimization and load balancing

Query Optimization
Traditional query optimization techniques do not always work:
Data access is very slow
Overhead for operations is very high (minutes)
Pig is most efficient when we batch operations

Example: SELECT ?P WHERE { ?X type Person . ?X hasBeenIn ?P . ?P name "Amsterdam" . }
BAD join order: start with (?X type Person), then join (?X hasBeenIn ?P), then (?P name "Amsterdam")
GOOD join order: start with the selective pattern (?P name "Amsterdam"), then join (?X hasBeenIn ?P), then (?X type Person)

Dynamic query optimization:
Optimize the query
Run part of the query
Repeat

Load Balancing
SELECT ?X WHERE { ?X type ?Y . ?Y subclassOf LivingThing }
MapReduce will group operations by possible values of ?Y.
Some ?Y are very common (think of Person).
Some groups will be very large, resulting in poor parallelization.

Replicate one side:
?Y subclassOf LivingThing is given to all nodes
No load balancing problems
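The replicate-one-side trick is essentially a map-side (broadcast) join: the small relation is shipped to every node, so no grouping on the skewed key ?Y is needed. A minimal sketch, with made-up class data standing in for (?Y subclassOf LivingThing):

```python
# Small side, replicated to every node: classes that are LivingThing.
subclass_of_living_thing = {"Person", "Dog"}

# Large side, partitioned across nodes: (?X type ?Y) pairs.
partitions = [
    [("alice", "Person"), ("rex", "Dog")],
    [("bob", "Person"), ("hal", "Computer")],
]

def map_side_join(partition, small_side):
    # Each node joins its own partition against the full replicated set,
    # so a skewed ?Y value (e.g. Person) never funnels into one reducer.
    return [x for x, y in partition if y in small_side]

living = [x for p in partitions
          for x in map_side_join(p, subclass_of_living_thing)]
# living == ["alice", "rex", "bob"]
```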
Use cases:
Data transformation: integrating data in FOAF, vCard, etc.
Cleaning up data
(Simple) Reasoning
Complex and expensive queries
Constructing views

Targets:
Show that we can utilize massive hardware
To go beyond the state of the art in terms of size and query complexity
But sacrificing response time along the way

Sorry, no results yet :-)

From MapReduce to Pig to SPARQL and back
Spyros Kotoulas (VUA), Jacopo Urbani (VUA), Peter Mika (Yahoo!) and Peter Boncz (CWI & VUA)