Data in and Data out: Using Hadoop to create data products at LinkedIn

Many of LinkedIn's products are driven by data. This presentation covers how LinkedIn addresses the challenges of creating and maintaining new data products built with Hadoop.
by Richard Park on 23 October 2013

Transcript of Data in and Data out: Using Hadoop to create data products at LinkedIn

Data In and Data Out
Creating large-scale data products at LinkedIn
Richard Park
Senior Software Engineer, Hadoop Infrastructure
(LinkedIn's SNA Team www.sna-projects.com)
Who Am I?
InMaps
Some of LinkedIn's Data Products

Visualize clusters in your connections
>5 terabytes processed, once a week
16 GB served
A part of LinkedIn Labs. inmaps.linkedinlabs.com
What do we do?
Social Network for Professionals
Over 100 million users, >50% outside the US
~1 million new members a week
Skills
People You May Know (PYMK)
Avatara
Dozens of data projects, each very different, but all use Hadoop and face the same challenges that come with it.
Pull Data
Process Data
Push Data
Common Problems
Storage
Convert text (XML, CSV files) to binary formats such as binary JSON or Avro (sketched below)
Less storage, fewer tasks, less time
Converted ~1 TB to ~100 GB
Block compressed to take advantage of grouping similar data
Saw ~30% reduction in data size
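As a rough sketch of the conversion step above (not code from the talk; the record schema, field names, and paths are invented for illustration), writing block-compressed Avro from tab-separated text looks like this:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class TextToAvro {

    // Hypothetical schema; real tracking events carry much richer fields.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"memberId\",\"type\":\"long\"},"
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"time\",\"type\":\"long\"}]}");

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             DataFileWriter<GenericRecord> out =
                 new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            // Block compression is where the ~30% size reduction mentioned above comes from.
            out.setCodec(CodecFactory.deflateCodec(6));
            out.create(SCHEMA, new File(args[1]));

            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                GenericRecord record = new GenericData.Record(SCHEMA);
                record.put("memberId", Long.parseLong(fields[0]));
                record.put("url", fields[1]);
                record.put("time", Long.parseLong(fields[2]));
                out.append(record);
            }
        }
    }
}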
Database Replication
JDBC
Good for smaller tables; just replicate them whole
Need deltas for large tables; use Hadoop to merge them (see the sketch below)
Throughput concerns and connection counts
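To make the delta idea concrete, here is a minimal sketch (the members table, its columns, and the tab-separated output are hypothetical; the actual extraction ran through internal tooling):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DeltaPull {
    public static void main(String[] args) throws SQLException {
        String jdbcUrl = args[0];   // e.g. a read replica, never the primary
        String since = args[1];     // high-water mark from the previous run, e.g. "2011-06-27"

        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT member_id, headline, last_modified FROM members WHERE last_modified > ?")) {
            stmt.setString(1, since);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // Emit only the changed rows; a later Hadoop job merges them
                    // with the previous full snapshot.
                    System.out.println(rs.getLong("member_id") + "\t"
                        + rs.getString("headline") + "\t"
                        + rs.getString("last_modified"));
                }
            }
        }
    }
}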
Flat Files
Use a script or tool to extract data into files
Dump the files into HDFS (see the sketch below)
Can be higher throughput
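A minimal sketch of the load step using the Hadoop FileSystem API (the paths are placeholders; a plain hadoop fs -put does the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadDumpFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy a locally extracted dump into HDFS so MapReduce/Pig/Hive can read it.
        fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        fs.close();
    }
}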
Kafka
Scalable OLAP
Create cubes in Hadoop
Offline and online component
Profile stats, Company stats
Distributed, high-throughput, persistent publish-subscribe messaging system.
Kafka
Open-source Scala project
Pull, not push
Disk, not memory
Kafka in a Nutshell
Tracking hundreds of different event types
Over a billion records a day
Several terabytes a day
Kafka in LinkedIn
Production Data Center
Hadoop
Avro Serialization Format
Service Metrics
What we Use it For
Tracking Events
Page Views
Profile Views
Searches
Network Updates
Logins
Impressions
...
Some Tracking Examples
CPU/disk usage statistics
Health checks on services
Reporting/graphs
Service Metric Examples
Streaming feed
News feed
Operations
Security
Subscribers of Feed
Message Queues (ActiveMQ)
Log Aggregators (Flume, Scribe)
Kafka goes here
Producer
Consumer
Example
Easy Kafka Integration Hint
Logger logger = Logger.getLogger(KafkaAppender.class);
logger.info("Look how easy it is!");
Example
For Hadoop
Use the simple consumer and handle your own offsets (see the sketch below)
Parallel fetch: one mapper per broker and partition
Be careful with failures; you need to recover and restart safely
Watch the number of small files!
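To make "handle your own offsets" concrete, here is a sketch using the modern Java consumer rather than the original SimpleConsumer from the talk; the broker, topic, partition, and stored offset are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetManagedFetch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("group.id", "hadoop-etl");
        props.put("enable.auto.commit", "false");          // we track offsets ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long lastGoodOffset = 0L;  // in practice, read this from the previous run's output

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("PageViewEvent", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, lastGoodOffset);              // resume exactly where the last run stopped

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Write the record to HDFS here; only persist the new offset once the write succeeds.
                lastGoodOffset = record.offset() + 1;
            }
            System.out.println("Next offset to persist: " + lastGoodOffset);
        }
    }
}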
Java MapReduce
Apache Pig
Apache Hive
Running the Job
Database
Voldemort
Kafka
Writing jobs is time-consuming
Little code reuse
More ways to shoot yourself in the foot
Sometimes necessary for performance or memory constraints
SQL!!!
Great for ad hoc queries
Had to write Avro support for Hive
Still very new at LinkedIn
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Good performance
Good for ad hoc work
Easier to tune
Tuples and Bags... weird
Vast majority of production jobs are in Pig
Extendable through User Defined Functions (UDFs)
Spend more time on the algorithm, less time on setup, boilerplate code, and debugging
At LinkedIn
Examples
register linkedin-pig.jar;

trackingData = load '/data/pageTracking/' using BinaryJSON();
memberData = load '/data/memberAccounts/' using BinaryJSON();

filteredTracking = filter trackingData by some.udf.IsAfter('2011-06-27');
joinedData = join filteredTracking by memberId, memberData by memberId;

store joinedData into '/myoutputDir/results/' using BinaryStorage();
Pig
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PageFilterDate {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // ... parse the line, keep records after the "date" set in the Configuration,
            // and emit (word, one) pairs (body omitted on the slide)
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // ... sum the counts for each key and write the total (body omitted on the slide)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("date", "2011-06-27");   // cutoff date passed down to the mappers

        Job job = new Job(conf, "test");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }

}
Java (Incomplete, I got lazy)
Workflow Manager
Need workflow manager, scheduler
Yahoo Oozie?
Cron?
At LinkedIn, we use Azkaban
Azkaban
Azkaban is a simple workflow manager and scheduler
It manages job dependencies
Notifies on failures
Some visualization of the workflow
Simple to use
Job File (MyJobFile.job)
# set job type
type=java
job.class=org.myjob.RunThisJob
# dependencies are specified
dependency=someotherjobfile,someotherjobfile2
property1=MyName
# variable replacement
property2=${YourName}

Property File (MyPropertyFile.properties)
someGlobalProperties=test
YourName=foo
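The class named by job.class above is plain Java. Roughly (and hedged: the legacy Azkaban java job type invokes a run() method reflectively, and the exact requirements vary by version, so treat this as a sketch):

package org.myjob;

// Illustrative only: the body is whatever pull/process/push work the job should do.
public class RunThisJob {
    public void run() throws Exception {
        System.out.println("Running my scheduled Hadoop work");
    }
}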
User Interface
Scheduler
Job Page
Execution History
Job Details
Workflow Graph
Open-source on www.sna-projects.com
Work in progress... lots of improvements coming
Will need security for newer Hadoop versions
Additionally...
Using JDBC to push can be dangerous
Have the DBAs pull from an HDFS folder instead
What about rolling back?
Requirements
Reliable
Fast
Frequent
Safe
Requirements?
Create products quickly
Less boiler plate code
Less maintenance
Less effort
Run reliably
Run frequently
Requirements
Push large amounts of data
Frequently
Reliably
Quickly
Be able to rollback
What is Project Voldemort?
Distributed key-value storage system
Based on Amazon Dynamo
Automatically replicated
Automatically partitioned
Scalable
Fault tolerant
Pluggable
Open-source (www.project-voldemort.com)
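For a feel of the client side, reading and writing a store looks roughly like this with the standard Voldemort client API (bootstrap URL, store name, and keys are placeholders; read-only stores built in Hadoop are only ever read this way, never put to):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortClientExample {
    public static void main(String[] args) {
        // Bootstrap URL and store name are placeholders.
        StoreClientFactory factory = new SocketStoreClientFactory(
            new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("member-features");

        client.put("member:12345", "some precomputed value");   // read-write stores only
        Versioned<String> value = client.get("member:12345");    // read path used by the product
        System.out.println(value.getValue());
    }
}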
Read-Write
Read-write stores use vector clocks for eventual consistency
Pluggable read-write storage engines (e.g., Berkeley DB)
Read-only storage uses the OS filesystem
Indexes and data files are pre-computed in Hadoop
Parallel Push
Have reusable Azkaban jobs to do this
vs Read Only
6 TB of data
1 TB updated daily
Dozens of different data stores
Ability to roll back to a backup
Key piece of many data products
Voldemort at LinkedIn
What products use it?
People you may know
Avatara
Skills
InMaps
...many others
Great for read-only data, but what about read-write?
Kafka! Remember me?
Distributed, high-throughput, persistent publish-subscribe messaging system.
What if a Kafka consumer wrote to a DB or other read-write storage?
Reverse the flow
Used for read-write data updates
No good rollback mechanism
No direct access from Hadoop to consumers
For more info, go to our team site www.sna-projects.com
LinkedIn Labs: www.linkedinlabs.com
And of course, join us at LinkedIn: www.linkedin.com!
www.linkedin.com/in/richardbhpark

Finally...
Thank You for Attending!
>80 Hadoop Jobs
16 TB of intermediate data
700 GB of data output
Machine learning model
Bloom filter
~5 test algorithms running concurrently
Classification
Network analysis, PageRank
16 TB processing
Open-source MapReduce framework
Distributed processing
Distribute data (HDFS)
Hadoop
Other Data Center
Skills Hadoop Jobs
Richard Park (rpark@linkedin.com)
AvroStorage is part of Piggybank