Introduction to Distributed Processing
by Alberto Lorente Leal
23 December 2014

Introduction to Distributed Processing
Brief History
Data Processing Frameworks
Job Scheduling
Both Flink and Spark have the same starting point:
Build a DAG (Directed Acyclic Graph)
Identify stages (see the sketch below)
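A minimal sketch, assuming a local Spark 1.x setup (the app name and input path below are placeholders), of how a chain of transformations builds the DAG and where the stages split:

import org.apache.spark.{SparkConf, SparkContext}

// Each transformation adds a node to the DAG; stage boundaries fall at
// wide (shuffle) dependencies such as reduceByKey.
val sc = new SparkContext(new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // hypothetical input file
  .flatMap(_.split(" "))                // narrow dependency: stays in the same stage
  .map(word => (word, 1))               // narrow dependency: stays in the same stage
  .reduceByKey(_ + _)                   // wide dependency: a shuffle starts a new stage

// toDebugString prints the lineage DAG, with indentation marking stage boundaries
println(counts.toDebugString)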
Processing Architecture
Both Flink and Spark follow the same structure:
Central Manager
Multiple worker/task managers
Work on top of a data layer
Playing with Spark
Configure a Spark context
Speaking Transformations
Workflows are expressed as transformations and actions performed on data (see the sketch below):
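A minimal sketch (assuming an existing SparkContext sc) of the difference:

// Transformations (map, filter, flatMap, ...) are lazy: they only extend the DAG.
// Actions (reduce, count, collect, ...) trigger the actual computation.
val nums    = sc.parallelize(1 to 10)
val doubled = nums.map(_ * 2)        // transformation: nothing runs yet
val total   = doubled.reduce(_ + _)  // action: the job executes, total = 110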
Mining Data with style
Author: Alberto Lorente Leal
Spark Example
Let's do a simple word count example (code further below):
Timeline:
2002: Doug Cutting & Mike Cafarella started working on Nutch
2003: Google publishes the GFS & MapReduce papers
2004: Doug Cutting adds DFS & MapReduce support to Nutch
2006: Yahoo! hires Cutting; Hadoop spins out of Nutch
2010: Apache Spark paper published and Spark released as an open source project in Apache (started by Matei Zaharia at UC Berkeley AMPLab in 2009)
2014: In February 2014, Spark became one of the top-level Apache projects
Big Data boom
Big Data Analytics
e.g. data scientists
"It has over 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among Big Data open source projects."
Wikipedia, Apache Spark
Hadoop 2.0
Apache Spark
Apache Flink
Spark proposes the use of in-memory data processing and sharing.
Compatible with a variety of storage systems, including HDFS, Cassandra, OpenStack Swift, and Amazon S3.
Supports Python, Java & Scala.
In Memory
A new data processing framework.
Based on the Stratosphere research project started at TU Berlin in 2009.
Released as an Apache incubator project (version 0.7.0).
[Diagram: tasks run against data held in RAM, sharing results in memory across steps]
import org.apache.spark.{SparkConf, SparkContext}
// appName appears in the cluster UI; master can be "local[*]" or a cluster URL
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
Specialized Libraries
val file = spark.textFile("hdfs://...")             // load text from HDFS
val counts = file.flatMap(line => line.split(" "))  // split lines into words
                 .map(word => (word, 1))            // pair each word with a count of 1
                 .reduceByKey(_ + _)                // sum the counts per word
counts.saveAsTextFile("hdfs://...")
More examples: https://spark.apache.org/examples.html
plus more in the source code.
Simple Text Search example:
val file = spark.textFile("hdfs://...")                   // load the log file
val errors = file.filter(line => line.contains("ERROR"))  // keep only ERROR lines
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
Spark SQL
Allows seamless integration with structured data, like SQL.
Data is loaded through a SchemaRDD
Can be generated from:
Other SchemaRDDs
JSON
Parquet
HiveQL
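A minimal sketch, assuming Spark 1.x and a hypothetical people.json file, of loading JSON into a SchemaRDD and querying it:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)              // sc: an existing SparkContext

// jsonFile infers the schema and returns a SchemaRDD
val people = sqlContext.jsonFile("people.json")  // hypothetical path
people.registerTempTable("people")

// Run SQL directly against the registered table
val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teens.collect().foreach(println)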
Spark Streaming
Enables scalable, high-throughput, fault-tolerant stream processing of live data streams
MLlib
Spark's machine learning library, with algorithms and utilities like:
Classification
Regression
Clustering
Collaborative filtering
Dimensionality reduction
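A minimal sketch, assuming Spark 1.x MLlib and an existing SparkContext sc, of one item from this list (clustering) via k-means on toy points:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy 2-D points; in practice these would be parsed from real input data
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

// Train k-means with k = 2 clusters and at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)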

GraphX
API for graph-parallel computation
There is large interest in processing large-scale graph data:
Social Networks
Language modeling
Web crawling
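A minimal sketch, assuming Spark 1.x GraphX and a hypothetical followers.txt edge list, of loading a graph and ranking its vertices, e.g. for a social network:

import org.apache.spark.graphx.GraphLoader

// Load a graph from a whitespace-separated "srcId dstId" edge list
val graph = GraphLoader.edgeListFile(sc, "followers.txt")

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the five highest-ranked vertices
ranks.sortBy(_._2, ascending = false).take(5).foreach(println)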
Inferring through Reflection
Code source: https://spark.apache.org/docs/latest/sql-programming-guide.html
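A minimal sketch, following the reflection pattern from the guide linked above (Spark 1.x API; the Person class and people.txt path are illustrative):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Person] -> SchemaRDD conversion

// Parse a "name,age" text file into Person objects; the schema is inferred
// from the case class fields via reflection
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18").collect().foreach(println)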
Parquet Files
Code source: https://spark.apache.org/docs/latest/sql-programming-guide.html
Parquet is a columnar format
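A minimal sketch (Spark 1.x API, hypothetical paths) of writing a SchemaRDD out as Parquet and reading it back with its schema preserved:

// people is a SchemaRDD, e.g. from the reflection example above
people.saveAsParquetFile("people.parquet")    // write columnar Parquet files

// Reading the files back yields a SchemaRDD with the same schema
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect().foreach(println)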
Spark Streaming breaks the live input stream into chunks.
These chunks get processed by the Spark engine,
which generates the final result stream.
It uses a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.
Example: Network Word Count
What is stream processing?
It is a mechanism where we transform data in real time (aggregate, enrich) so that other services can later consume this information.
Count the words in a live network stream (see the sketch below):
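A minimal sketch, condensed from the NetworkWordCount example linked below, that counts words arriving on a local socket in one-second batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))    // 1-second batch interval

// Listen for text on localhost:9999 (e.g. fed by: nc -lk 9999)
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()                                  // print each batch's counts

ssc.start()
ssc.awaitTermination()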
More:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
Code Example:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/NetworkWordCount.scala
Deploying
Sources & Acknowledgements
Main Spark source: https://spark.apache.org/docs/latest/index.html
Contains documentation and hands-on examples.

Advanced exercises for those interested:
https://databricks-training.s3.amazonaws.com/index.html

Apache Flink: http://flink.incubator.apache.org/
Apache Hadoop: http://hadoop.apache.org/
Apache Nutch: http://nutch.apache.org/
MapReduce paper: http://static.googleusercontent.com/media/research.google.com/es/us/archive/mapreduce-osdi04.pdf
Google File System: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
Spark Research Papers: https://spark.apache.org/research.html
Note:
Hadoop MapReduce depended on disk I/O between processing steps.
In the case of Spark, it can work with multiple setups:

Local mode: on your personal workstation
Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster.
Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications.
Hadoop YARN: the resource manager in Hadoop 2.

In addition, Spark's EC2 launch scripts make it easy to launch a standalone cluster on Amazon EC2.
Each application gets its own executor processes for the lifetime of the computation.
This guarantees isolation of applications from each other, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs).
Cluster-agnostic architecture.
Submit the application as a .jar or an sbt artifact, depending on your language of choice (see the sketch below).
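A minimal sketch of how each setup is selected through the master URL passed to SparkConf (host names and ports below are placeholders):

import org.apache.spark.SparkConf

val local      = new SparkConf().setMaster("local[*]")                 // local mode
val standalone = new SparkConf().setMaster("spark://master-host:7077") // standalone cluster
val mesos      = new SparkConf().setMaster("mesos://master-host:5050") // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn-cluster")             // Hadoop YARN (Spark 1.x)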
[Diagram: an RDD lineage DAG showing narrow dependencies, wide dependencies, and cached partitions]