Hadoop and Hive in practice

by Lajos Domaniczky on 5 June 2014

Transcript of Hadoop and Hive in practice

Hadoop and Hive in practice
Introduction
Hadoop, HDFS, Map/Reduce
Hive, HiveQL
Practical issues
Questions

HADOOP
Parallel Computing Platform
Distributed File System (HDFS)
Parallel Processing Model (Map/Reduce)
Computation in any language
Job Exec framework for M/R
Open Source
Most popular Apache project
Highly extensible Java stack
Runs on EC2
Advantages of commodity hardware
Cheap but (mostly) reliable
Data Local computing (no need for high speed network)
Highly Scalable
HDFS
Separation of Metadata from Data
Metadata = Inodes, attributes, block locations, replication info
Large blocks for files
~128MB
Highly reliable
Default replication is 3x
Block checksum verification
Single Namenode
All metadata stored in memory
Client talks to both namenode and datanodes
Bulk data from datanode to client
Custom client library in Java/C/Thrift
Not POSIX, not NFS
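To make the block-size and replication numbers above concrete, here is a minimal arithmetic sketch (plain Python, not HDFS code; the constants mirror the ~128MB blocks and 3x replication mentioned on the slide):

```python
# Illustrative sketch (not HDFS code): how a file splits into
# fixed-size blocks and how 3x replication multiplies raw storage.

BLOCK_SIZE = 128 * 1024 * 1024   # ~128 MB blocks, as above
REPLICATION = 3                  # default replication factor

def hdfs_layout(file_size: int) -> dict:
    """Return block count and raw bytes consumed for a file."""
    blocks = -(-file_size // BLOCK_SIZE)       # ceiling division
    return {
        "blocks": blocks,
        "raw_bytes": file_size * REPLICATION,  # last block is not padded
    }

# A 1 GB file occupies 8 blocks and 3 GB of raw cluster storage.
layout = hdfs_layout(1024 * 1024 * 1024)
```

Large blocks keep the namenode's in-memory metadata small: one inode plus a handful of block locations per file instead of thousands.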
Map/Reduce
Map Function:
Apply to input data
Emits reduction key and value

Reduce Function
Apply to data grouped by reduction key
Often indeed reduces data

Hadoop groups data by sorting
Reductions multiple times: Combiners
Partitioning, Sorting, Grouping, etc.
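The map/reduce flow above can be simulated locally with the classic word count. This is a sketch of the model, not Hadoop API code: map emits (reduction key, value) pairs, a sort stands in for the shuffle/group phase, and reduce collapses each key group:

```python
# Minimal word-count sketch of the Map/Reduce model: map emits
# (key, value) pairs, the framework groups by key (Hadoop groups
# by sorting), reduce collapses each group. Local simulation only.

from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Collapse all values for one key into a single count.
    return (key, sum(values))

def run_mapreduce(lines):
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))                 # shuffle/sort phase
    return [reduce_fn(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=itemgetter(0))]

counts = dict(run_mapreduce(["hive on hadoop", "hadoop hdfs"]))
```

A combiner is just `reduce_fn` applied early, on each mapper's local output, to shrink the data before the sort ships it across the network.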
Introduction
Explosion of Data
Web logs, ad-server logs, etc.
User generated content
Declining Revenue/GB
Hardware Trends
Commodity hardware
Software Trends
Open Source
Software as a Service
LAMP
Amazon EC2
...why Hive?
Large installed base of SQL users
map/reduce is for geeks...
much easier to write SQL
Analytics SQL queries translate very well to map/reduce
Extend files with Metadata
Tables, Schemas, Partitions, Indices
Metadata allows optimization, discovery, browsing
Work on any data format
Hive QL - Join
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u
ON (pv.userid = u.userid)
Hive QL - Outer Join
INSERT INTO TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM page_view pv
FULL OUTER JOIN user u
ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03'
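Hive typically compiles joins like these into a reduce-side join: mappers tag each row with its source table and key it by the join column, so matching rows meet in the same reducer. A rough local simulation (sample rows invented; table and column names taken from the queries above):

```python
# Rough simulation of a reduce-side join, the plan Hive commonly
# generates for the JOIN above: map tags rows by source table and
# keys them by userid; reduce crosses the two groups per key.

from collections import defaultdict

page_view = [("userid1", "pageA"), ("userid2", "pageB")]  # (userid, pageid)
user      = [("userid1", 25), ("userid2", 32)]            # (userid, age)

buckets = defaultdict(lambda: {"pv": [], "u": []})
for userid, pageid in page_view:          # map side: tag "pv"
    buckets[userid]["pv"].append(pageid)
for userid, age in user:                  # map side: tag "u"
    buckets[userid]["u"].append(age)

# reduce side: cross the two tagged groups for each join key
pv_users = [(pageid, age)
            for key, groups in sorted(buckets.items())
            for pageid in groups["pv"]
            for age in groups["u"]]
```

For the outer-join variant, a reducer would also emit rows whose "pv" or "u" group is empty, padding the missing side with NULLs.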
Hive QL - Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age
Hive QL - Group By with Distinct
SELECT pageid,
COUNT(DISTINCT userid)
FROM page_view
GROUP BY pageid;
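GROUP BY compiles to map/reduce the same way: map emits the group key, the sort phase brings identical keys together, and reduce aggregates each group. For COUNT(DISTINCT userid) the mapper can emit the composite key (pageid, userid) so duplicates collapse during the sort. A pure-Python illustration with invented sample rows:

```python
# Sketch of how the GROUP BY ... COUNT(DISTINCT userid) query above
# maps to map/reduce: emit (pageid, userid) composite keys, let the
# sort/dedupe collapse duplicates, then count per pageid.

from itertools import groupby

page_view = [("pageA", "u1"), ("pageA", "u1"),
             ("pageA", "u2"), ("pageB", "u3")]

# map: emit composite key; sort == shuffle; set() == DISTINCT
keys = sorted(set(page_view))

# reduce: count distinct userids per pageid
distinct_counts = {pageid: sum(1 for _ in grp)
                   for pageid, grp in groupby(keys, key=lambda kv: kv[0])}
```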
Dealing with structured data
Type system
Primitive types
Recursively build up using Composition/Maps/Lists
ObjectInspector interface for user-defined types
to recursively list schema
to recursively access fields within a row object
Generic Serialization/Deserialization interface (SerDe)
Serialization families implement interface
Delimited text
Own SerDe (XML, Json, etc)
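As a toy illustration of what a SerDe does (this is not the Java SerDe interface, just the idea): deserialization turns one stored row into typed fields according to a schema, the way the delimited-text family does. The schema and sample row below are invented for the example:

```python
# Toy deserializer in the spirit of Hive's delimited-text SerDe:
# split one text row on a delimiter and apply the schema's types
# field by field. Schema and row invented for illustration.

SCHEMA = [("userid", int), ("pageid", str), ("age", int)]

def deserialize(row: str, delimiter: str = ",") -> dict:
    """Turn one delimited text row into a typed record."""
    fields = row.split(delimiter)
    return {name: typ(raw) for (name, typ), raw in zip(SCHEMA, fields)}

record = deserialize("42,pageA,31")
```

Serialization is the inverse mapping; a custom SerDe (XML, JSON, etc.) swaps the split/format logic while the engine keeps seeing typed rows.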
Metastore
Stores Table and Partition properties
Table schema and SerDe library
Table locations on HDFS
Logical partitioning keys and types
Partition level metadata
Other information
Thrift API
Current clients: PHP and Python interfaces to Hive, Java (query engine and CLI)
Metadata stored in any backend (Derby, MySQL)
HIVE
Outline
HiveQL: Basic SQL
DDL
From clause subquery
ANSI JOIN
Multi-table insert
Multi group-by
Sampling
Object traversal
Extensibility
Pluggable Map/Reduce scripts using TRANSFORM
How-to: Use a SerDe in Apache Hive:
https://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
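A TRANSFORM script is just a program that reads tab-separated rows on stdin and writes tab-separated rows on stdout; Hive pipes each selected row through it. A minimal sketch (the column layout and bucketing logic are invented, and the script name in a query like `SELECT TRANSFORM(userid, pageid) USING 'python bucket.py' AS (userid, bucket) FROM page_view` is hypothetical):

```python
# Minimal Hive TRANSFORM script sketch: Hive streams tab-separated
# rows to stdin and reads tab-separated rows back from stdout.
# Columns and bucketing logic are invented for the example.
import sys

def transform(line: str) -> str:
    userid, pageid = line.rstrip("\n").split("\t")
    bucket = "front" if pageid == "home" else "other"   # toy logic
    return f"{userid}\t{bucket}"

if __name__ == "__main__":
    for line in sys.stdin:
        print(transform(line))
```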
...in practice
Don't reinvent the wheel
Domain specific knowledge
Importance of ETL
Data migration issues
Don't reinvent the wheel
Existing project: usage data already collected
describes domain-specific knowledge in "hierarchical" CSV
XML-like query language developed in-house instead of using (waiting for?) Hive
CSV far from ideal
app1
category: general
startupTime:7s
helpIssued
etc.
app2
category: general
startupTime:3s
Importance of ETL
ETL can fix data-collection problems
KeyCollectorJob, QueryJob, AnalystJob, ObfuscatorJob, ReportCollectorJob, ReportCountValidatorJob: all kinds of jobs were needed.
Had the data model been designed better in advance, many of these ETL jobs could have been avoided
<?xml version="1.0" encoding="UTF-8"?>
<rdl subkey-delimiter="/">
<report name="BreakdownRegion">
<table name="Counts">
<rows>
<nReports/>
<nMachines/>
<nUsers/>
</rows>
<bigRows>
<breakdown type="key" key="Region"/>
</bigRows>
<columns>
<breakdown type="key" key="OSName"/>
</columns>
</table>
</report>
</rdl>
Data migration issues
Identify patterns
Tune data structure to discovered pattern
Run jobs to get data in new form
Very easy to do in the cloud (not cheap though)
Domain Specific Knowledge
What are you measuring?
Identify and capture common patterns in measured data
Categories, tags, etc.

Questions?
DDL
CREATE EXTERNAL TABLE page_view(
viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';