Transcript of Pig
What is Pig (useful for)?
Apache open source project (http://pig.apache.org)
A high level engine for executing data flows in parallel on Hadoop
It includes a language, Pig Latin, for expressing these data flows, including operators for many of the traditional data flow operations (join, sort, filter, etc.)
Three main categories of use cases:
research on raw data
"If you need to process GB's or TB 's of data in a batch processing mode, Pig is a good choice. For problems, that require writing single or small groups of records or looking up many different records in random order, Pig (& MapReduce) is not a good choice."
Pigs eat anything
Pig can operate on relational, nested, unstructured... data...
Pigs live anywhere
It is not tied to one particular parallel framework...
Pigs are domestic
Pig is designed to be easily controlled and modified by the user...
Pigs fly
Pig processes data quickly...
Install and run
The official version comes packaged with all the JARs needed
Pig does not need to be installed on the Hadoop cluster. It runs on the machine (gateway machine) from which you launch Hadoop jobs, or you can install it on your local machine (even without Hadoop being installed)
Requires a Unix environment, but is portable across OSs as it is written in Java
Download, unpack, add to PATH and set JAVA_HOME, then start Pig in one of two modes:
pig -x local (local mode) vs. pig -x mapreduce (on a Hadoop cluster)
A guiding example
Let's assume we have two kinds of datasets.
"cid"; "time"; "product"; "action"
1; 2013.09.01 12:34:54; PR1; view
1; 2013.09.01 13:23:43; PR2; buy
2; 2013.09.01 14:23:56; PR123; view
"cid"; "gender"; "age"
1; female; 32
2; male; 26
34; female; 19
23; male; 456
and many questions...
A result folder is created, in which the "result" is stored as part-r-0000* files, one per reducer used (typically one in local mode).
There are many built-in load and store functions. CSVExcelStorage is just one of them...
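To make the shape of the loaded data concrete, here is a minimal Python sketch (not Pig Latin, and not how CSVExcelStorage is implemented) that parses the semicolon-delimited sample from above into records. The function name load and the inline sample string are assumptions made for illustration.

```python
import csv
import io

# Sample rows copied from the example above; in a real job they would live in a file.
EVENTS = '''"cid"; "time"; "product"; "action"
1; 2013.09.01 12:34:54; PR1; view
1; 2013.09.01 13:23:43; PR2; buy
2; 2013.09.01 14:23:56; PR123; view
'''


def load(text, delimiter=";"):
    """Parse delimited text into a list of dicts keyed by the header names."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    # Skip blank lines and strip stray whitespace around each field.
    return [
        dict(zip(header, (field.strip() for field in row)))
        for row in reader
        if row
    ]


events = load(EVENTS)
print(events[0])
```

Every field arrives as a string here; casting cid to a number is left out to keep the sketch short.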
Let's see how to aggregate views & buys per cid:
setting the environment
The group statement collects together records with the same key into a bag.
A bag is an unordered collection of tuples. A tuple is a fixed-length, ordered collection of Pig data elements containing fields.
foreach applies expressions to every record in the pipeline
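The aggregation of views and buys per cid can be sketched in plain Python to show what grouping and per-record expressions do. This is a stand-in for the semantics, not Pig Latin: a bag is modelled as a list, and the helper names group_by and count_actions are hypothetical.

```python
from collections import defaultdict

# (cid, time, product, action) tuples, taken from the example data above.
events = [
    (1, "2013.09.01 12:34:54", "PR1", "view"),
    (1, "2013.09.01 13:23:43", "PR2", "buy"),
    (2, "2013.09.01 14:23:56", "PR123", "view"),
]


def group_by(records, key):
    """Collect records that share a key into one 'bag' (here a list) per key."""
    bags = defaultdict(list)
    for record in records:
        bags[key(record)].append(record)
    return dict(bags)


def count_actions(bag):
    """Per-group expression: count the views and buys inside one bag."""
    views = sum(1 for r in bag if r[3] == "view")
    buys = sum(1 for r in bag if r[3] == "buy")
    return views, buys


grouped = group_by(events, key=lambda r: r[0])
# One output record per group: (cid, number of views, number of buys).
per_cid = [(cid, *count_actions(bag)) for cid, bag in grouped.items()]
print(per_cid)
```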
Assume we want to compute the number of buys of customers on a daily basis ordered by time in descending order:
group by multiple fields
the order statement sorts your data based on the types of the fields
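The daily buy count can be sketched in Python as a filter, a grouping on the composite key (cid, day), and a sort. This is an illustration of the data flow, not Pig Latin. The two rows dated 2013.09.02 are hypothetical and were added only so that the ordering is visible.

```python
from collections import Counter

events = [
    (1, "2013.09.01 12:34:54", "PR1", "view"),
    (1, "2013.09.01 13:23:43", "PR2", "buy"),
    (2, "2013.09.01 14:23:56", "PR123", "view"),
    # Hypothetical extra rows, not part of the original sample.
    (2, "2013.09.02 09:00:00", "PR1", "buy"),
    (2, "2013.09.02 10:30:00", "PR2", "buy"),
]

# Filter to buys, then count per composite key (cid, day).
# The day is the part of the timestamp before the space.
buys_per_day = Counter(
    (cid, time.split(" ")[0])
    for cid, time, _product, action in events
    if action == "buy"
)

# Order by day, descending. The yyyy.mm.dd strings sort chronologically
# when compared as text, so no date parsing is needed in this sketch.
daily_buys = sorted(
    ((cid, day, n) for (cid, day), n in buys_per_day.items()),
    key=lambda row: row[1],
    reverse=True,
)
print(daily_buys)
```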
Assume we have a campaign and we want to see if young female customers (18 <= age <= 30) have bought more products in the last month than the month before:
Pig supports a rich set of
works on entire records, not single fields
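For the campaign question, a minimal Python sketch of the data flow is: filter the customers, match them against the events, and count buys per month. It is not Pig Latin. The buy events for cid 34, the month strings, and the helper buys_in_month are assumptions made for illustration, and the join is simplified to a set-membership test.

```python
# (cid, gender, age) tuples from the customer dataset above.
customers = [
    (1, "female", 32),
    (2, "male", 26),
    (34, "female", 19),
    (23, "male", 456),
]

# (cid, time, product, action); the rows for cid 34 are hypothetical.
events = [
    (1, "2013.09.01 13:23:43", "PR2", "buy"),
    (34, "2013.08.15 10:00:00", "PR1", "buy"),
    (34, "2013.09.03 11:00:00", "PR7", "buy"),
    (34, "2013.09.20 16:45:00", "PR9", "buy"),
    (34, "2013.09.21 08:00:00", "PR9", "view"),
]


def buys_in_month(events, cids, month):
    """Count buy events of the given customers whose timestamp starts with month."""
    return sum(
        1
        for cid, time, _product, action in events
        if action == "buy" and cid in cids and time.startswith(month)
    )


# Filter first, so that only young female customers take part in the match.
target = {cid for cid, gender, age in customers if gender == "female" and 18 <= age <= 30}

last_month = buys_in_month(events, target, "2013.09")
month_before = buys_in_month(events, target, "2013.08")
print(last_month > month_before)
```

Filtering the customers before matching them against the events keeps the intermediate data small, which is the same idea as the "filter early" advice in the performance tips.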
Advanced Pig Latin
register JARs containing possibly many UDFs
There are many types of UDFs, for example evaluation functions and storage functions. There are also many libraries that already contain very useful UDFs, e.g. DataFu from LinkedIn, so consider using them before writing your own.
As Pig is a dataflow language, it does not include control flow constructs such as if and for.
There is an embedding interface written in Java, hence one can naturally embed Pig in Java projects, but also in Python scripts (using Jython!).
The interface class is called PigServer.
Using its methods one can create Pig queries very easily as part of a larger job / project.
It is also a very good way to debug applications using Pig.
mapreduce takes as its first argument the JAR containing the code to run a MapReduce job.
The store and load clauses specify how data will be moved from Pig to MapReduce and back
one can pass arguments & Java options to the invocation of the Java command that will run the MR job
Making Pig fly
Filter & Project early and often
Set up your joins properly
Use multiquery when possible
Choose the right data type
Select the right level of parallelism
Use compression for intermediate results