Pig

by Peter Toth on 3 June 2014

Transcript of Pig

Pig - a "pig" data flow engine
What is Pig (useful for)?
Apache open source project (http://pig.apache.org)
A high level engine for executing data flows in parallel on Hadoop
It includes a language, Pig Latin, for expressing these data flows, including operators for many of the traditional data flow operations (join, sort, filter, etc.)
Three main categories of use cases:
data pipelines
research on raw data
iterative processing
"If you need to process GBs or TBs of data in batch mode, Pig is a good choice. For problems that require writing single or small groups of records, or looking up many different records in random order, Pig (and MapReduce) is not a good choice."
Pig's Philosophy
Pigs eat anything
Pig can operate on relational, nested, unstructured... data.
Pigs live anywhere
It is not tied to one particular parallel framework...
Pigs are domestic
Pig is designed to be easily controlled and modified by the user...
Pigs fly
Pig processes data quickly...
Install and run
The official version comes packaged with all the JARs needed.
Pig does not need to be installed on the Hadoop cluster: it runs on the machine (gateway machine) from which you launch Hadoop jobs. You can also install it on your local machine, even without Hadoop installed.
Requires a Unix environment, but is portable across operating systems because it is written in Java.
Download, unpack, add Pig's bin directory to PATH and set JAVA_HOME. Then start it with pig -x local (runs against the local file system) or pig -x mapreduce (runs on a Hadoop cluster).
A guiding example
Let's assume we have two kinds of datasets.

activities_2013-09-01.csv:

"cid"; "time"; "product"; "action"
1; 2013.09.01 12:34:54; PR1; view
1; 2013.09.01 13:23:43; PR2; buy
2; 2013.09.01 14:23:56; PR123; view
...
customers.csv:

"cid"; "gender"; "age"
1; female; 32
2; male; 26
34; female; 19
23; male; 456
...
and many questions...
IO
Storing a relation creates a result folder in which the output is written as part-r-0000* files, one per reducer used (typically one in local mode).
There are many built-in load and store functions. CSVExcelStorage is just one of them...
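A minimal load/store sketch for the activities file follows. It assumes that piggybank.jar (which ships CSVExcelStorage) is in the working directory and that the file has been cleaned to plain semicolon-separated rows without the header line or padding spaces. The jar path and the 'result' folder name are illustrative.

```pig
-- Assumption: piggybank.jar is in the current directory; adjust the path as needed.
REGISTER piggybank.jar;

-- Load one day of activities and declare a schema for the four columns.
activities = LOAD 'activities_2013-09-01.csv'
             USING org.apache.pig.piggybank.storage.CSVExcelStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

-- Write the relation back out; 'result' becomes a folder of part files.
STORE activities INTO 'result'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage(';');
```

The built-in PigStorage(';') would work the same way here without registering a jar; CSVExcelStorage is mainly worth the extra step when fields are quoted.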
Relational Operations
Let's see how to aggregate views & buys per cid:
(Slide callouts on the script: alias, schema, relation, setting the environment)
The group statement collects together records with the same key into a bag.
A bag is an unordered collection of tuples. A tuple is a fixed-length, ordered collection of Pig data elements (fields).
foreach applies expressions to every record in the pipeline
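As a sketch of how group and foreach fit together, the script below counts the activities of each customer. It assumes the activities file is plain semicolon-separated text without a header line or padding spaces; the relation and field names other than the four source columns are illustrative.

```pig
-- Assumption: header line and padding spaces were removed beforehand.
activities = LOAD 'activities_2013-09-01.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

-- One output record per cid: the key (named 'group') plus a bag
-- (named 'activities') holding all tuples that share that key.
by_cid = GROUP activities BY cid;
DESCRIBE by_cid;

-- foreach evaluates its expressions once per grouped record;
-- COUNT counts the tuples inside each bag.
counts = FOREACH by_cid GENERATE group AS cid, COUNT(activities) AS n_actions;
DUMP counts;
```

DESCRIBE prints the schema of the grouped relation, which is a quick way to see the bag-of-tuples nesting.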
nested foreach
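One way to aggregate views and buys per cid is a nested foreach, which runs relational operators on the bag of each group before generating the output record. This is a sketch, not necessarily the script shown on the original slide. It assumes a cleaned, semicolon-separated input file without a header line.

```pig
-- Assumption: header line and padding spaces were removed beforehand.
activities = LOAD 'activities_2013-09-01.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

by_cid = GROUP activities BY cid;

-- Inside the braces, 'activities' refers to the bag of the current group.
per_cid = FOREACH by_cid {
    views = FILTER activities BY action == 'view';
    buys  = FILTER activities BY action == 'buy';
    GENERATE group AS cid, COUNT(views) AS n_views, COUNT(buys) AS n_buys;
};

DUMP per_cid;
```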
Assume we want to compute the number of buys of customers on a daily basis ordered by time in descending order:
group by multiple fields
the order statement sorts your data based on the types of the fields
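The daily count could be sketched as follows. The glob pattern assumes one cleaned, header-free file per day named like the example file, and the day is taken as the first ten characters of the time field (for example 2013.09.01), which sorts chronologically as a string. Both are assumptions for illustration.

```pig
-- Assumption: one semicolon-separated file per day, no header lines.
activities = LOAD 'activities_*.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

buys = FILTER activities BY action == 'buy';

-- Keep only the date part of the timestamp, e.g. '2013.09.01'.
with_day = FOREACH buys GENERATE cid, SUBSTRING(time, 0, 10) AS buy_day;

-- Group by multiple fields: the key becomes a tuple (cid, buy_day).
by_cid_day = GROUP with_day BY (cid, buy_day);
daily = FOREACH by_cid_day GENERATE group.cid AS cid,
                                    group.buy_day AS buy_day,
                                    COUNT(with_day) AS n_buys;

-- buy_day is a chararray, so it is sorted as a string, newest day first.
sorted = ORDER daily BY buy_day DESC, cid ASC;
DUMP sorted;
```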
Assume we have a campaign and we want to see if young female customers (18 <= age <= 30) have bought more products in the last month than the month before:
Pig supports a rich set of join operations
schema after join !!!
distinct works on entire records, not single fields
filter by multiple predicates
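A sketch of the campaign question, combining filter, join, and distinct. It assumes cleaned, header-free input files, and it interprets "bought more products" as the number of distinct (customer, product) pairs per month. Both are assumptions, as are all relation names.

```pig
-- Assumption: semicolon-separated files without header lines or padding spaces.
customers  = LOAD 'customers.csv' USING PigStorage(';')
             AS (cid:int, gender:chararray, age:int);
activities = LOAD 'activities_*.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

-- Filter by multiple predicates, combined with AND.
target = FILTER customers BY gender == 'female' AND age >= 18 AND age <= 30;
buys   = FILTER activities BY action == 'buy';

joined = JOIN buys BY cid, target BY cid;
-- After the join both inputs contribute a cid field, so it has to be
-- referenced with its relation prefix: buys::cid or target::cid.
DESCRIBE joined;

-- Keep the month part of the timestamp, e.g. '2013.09'.
monthly = FOREACH joined GENERATE buys::cid AS cid,
                                  SUBSTRING(buys::time, 0, 7) AS buy_month,
                                  buys::product AS product;

-- distinct compares whole records, so project down to the fields
-- that define a duplicate before applying it.
unique_buys = DISTINCT monthly;

by_month = GROUP unique_buys BY buy_month;
result   = FOREACH by_month GENERATE group AS buy_month,
                                     COUNT(unique_buys) AS n_products;
DUMP result;
```

Comparing the rows for the last two months in the result answers the question.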
Advanced Pig Latin
U(ser) D(efined) F(unction)s
register JARs containing possibly many UDFs
There are many types of UDFs, for example evaluation functions and storage functions. There are also many libraries that already contain very useful UDFs, e.g. DataFu from LinkedIn, so consider using them before writing your own.
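Registering a JAR and calling one of its UDFs might look like the sketch below. It uses the Reverse evaluation function from Piggybank purely as an illustration. The jar location and the cleaned, header-free input file are assumptions.

```pig
-- Assumption: piggybank.jar is in the current directory.
REGISTER piggybank.jar;

-- Give the fully qualified UDF class a short alias.
DEFINE Reverse org.apache.pig.piggybank.evaluation.string.Reverse();

activities = LOAD 'activities_2013-09-01.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

-- An evaluation function is called like a built-in, once per record.
reversed = FOREACH activities GENERATE cid, Reverse(product) AS product_reversed;
DUMP reversed;
```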
Embedded Pig
As Pig is a dataflow language, it does not include control flow constructs such as if and for.

There is an embedding interface written in Java, hence one can naturally embed Pig in Java projects, but also in Python scripts (using Jython!).
The interface class is called PigServer.
Using its methods one can create Pig queries very easily as part of a larger job / project.
It is also a very good way to debug applications using Pig.
MapReduce
The mapreduce operator takes as its first argument the JAR containing the code to run a MapReduce job.
It uses store and load to specify how data will be moved from Pig to MapReduce and back.
one can pass arguments & Java options to the invocation of the Java command that will run the MR job
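The general shape of a mapreduce statement is sketched below. The JAR productscorer.jar, the class com.example.ProductScorer, the two directory names, and the output schema are all hypothetical. The job is assumed to read its first argument as the input directory and write tab-separated (product, score) lines to the second.

```pig
-- Assumption: cleaned, header-free input file.
activities = LOAD 'activities_2013-09-01.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);

products = FOREACH activities GENERATE product;

-- Hypothetical JAR and main class, shown only to illustrate the syntax.
-- STORE hands 'products' to the job, LOAD reads the job's output back,
-- and the backquoted string holds the class name and its arguments.
scored = MAPREDUCE 'productscorer.jar'
         STORE products INTO 'mr_input'
         LOAD 'mr_output' AS (product:chararray, score:double)
         `com.example.ProductScorer mr_input mr_output`;

DUMP scored;
```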
Making Pig fly
Filter & Project early and often
Set up your joins properly
Use multiquery when possible
Choose the right data type
Select the right level of parallelism
Use compression for intermediate results
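Several of these tips translate into one-line settings or clauses. The sketch below shows temporary-file compression, a default reducer count, a replicated join for a small customers table, and a per-operator parallel clause. The numbers are placeholders rather than recommendations, and the input files are assumed to be cleaned and header-free.

```pig
-- Compress the temporary files Pig writes between MapReduce jobs.
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec gz;

-- Default number of reducers for the whole script (placeholder value).
SET default_parallel 10;

activities = LOAD 'activities_*.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);
customers  = LOAD 'customers.csv' USING PigStorage(';')
             AS (cid:int, gender:chararray, age:int);

-- 'replicated' loads the last relation into memory on every map task,
-- which suits a join where that relation is small.
joined = JOIN activities BY cid, customers BY cid USING 'replicated';

-- PARALLEL overrides the default for this one operator (placeholder value).
by_product = GROUP joined BY activities::product PARALLEL 20;
counts     = FOREACH by_product GENERATE group AS product, COUNT(joined) AS n_actions;

STORE counts INTO 'product_counts';
```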
projection
project again
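Filtering and projecting early could look like the following sketch, which drops unneeded rows and columns before the join and projects once more afterwards. The question asked (buys per gender) and the relation names are illustrative, and the input files are assumed to be cleaned and header-free.

```pig
activities = LOAD 'activities_*.csv' USING PigStorage(';')
             AS (cid:int, time:chararray, product:chararray, action:chararray);
customers  = LOAD 'customers.csv' USING PigStorage(';')
             AS (cid:int, gender:chararray, age:int);

-- Filter first, so fewer records reach the join.
buys = FILTER activities BY action == 'buy';

-- Projection: carry only the columns the rest of the script needs.
buys_slim = FOREACH buys GENERATE cid, product;
cust_slim = FOREACH customers GENERATE cid, gender;

joined = JOIN buys_slim BY cid, cust_slim BY cid;

-- Project again: the join key is no longer needed after the join.
slim = FOREACH joined GENERATE buys_slim::product AS product,
                               cust_slim::gender AS gender;

by_gender = GROUP slim BY gender;
result    = FOREACH by_gender GENERATE group AS gender, COUNT(slim) AS n_buys;
DUMP result;
```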