fastdata @Torino 21/03/2016

No description
by

Claudia Misale

on 26 April 2016


Transcript of fastdata @Torino 21/03/2016

Spark: Batch Processing
Example: Map+Reduce dataflows
Data = Resilient Distributed Dataset (RDD)
Spark Program = algebra of stateless operators on RDDs
Dataflow = RDD as token + kernel on single RDD item
A generalization of MapReduce (a PySpark sketch follows below)
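As a concrete illustration (not part of the original slides), here is a minimal PySpark sketch of a Map+Reduce dataflow over an RDD; the application name and the sample data are assumptions.

    from pyspark import SparkContext

    # Build the RDD: the "data" tokens of the dataflow.
    sc = SparkContext("local[*]", "wordcount-sketch")
    lines = sc.parallelize(["a rose is a rose", "big data never ask what they are"])

    # Stateless operators composed into a dataflow; each kernel acts on single items.
    counts = (lines
              .flatMap(lambda line: line.split())    # map-like kernel: line -> words
              .map(lambda word: (word, 1))           # map kernel: word -> (word, 1)
              .reduceByKey(lambda x, y: x + y))      # reduce kernel: associative combinator

    print(counts.collect())
    sc.stop()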
One Programming Model to dominate them all
Claudia Misale
Maurizio Drocco
DataFlow Model
Hadoop, Spark, Flink, Storm, TensorFlow: Instructions for Use
[Figure: dataflow actors with FIFO queues carrying tokens; example: a Map+Reduce dataflow]
STORM: Stream Processing
Data = stream of tuples
Storm Program = semantics dataflow
Dataflow = tuples as tokens + kernel on single tuple
Stateful processing = a modification to the system state that is made visible at some future point (sketched below)
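To make stateful processing concrete, here is a small pure-Python sketch (not from the slides; all names and data are illustrative): a per-tuple kernel that updates actor-local state, whose effect becomes visible in the output of this and later tuples.

    # Sketch of a per-tuple kernel with actor-local state (illustrative, not the Storm API).
    def make_counting_kernel():
        state = {"count": 0}                      # actor-local state
        def kernel(tup):
            state["count"] += 1                   # modification to system state...
            key, value = tup
            return (key, value, state["count"])   # ...made visible on this and future tuples
        return kernel

    kernel = make_counting_kernel()
    stream = [("a", 1), ("b", 2), ("a", 3)]
    for tup in stream:                            # the runtime would fire the kernel once per token
        print(kernel(tup))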
Actors execute user-defined kernels
Firing rules determine when a kernel can be executed
Parallelism among independent actors
Definition of operations and data dependencies, instead of procedures (a minimal sketch follows below)
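The following pure-Python sketch (illustrative only; the class and names are assumptions) shows the three ingredients above: an actor with FIFO input queues of tokens, a firing rule that requires one token per input, and a user-defined kernel.

    import queue

    class Actor:
        """Illustrative dataflow actor: fires its kernel when every input queue holds a token."""
        def __init__(self, kernel, num_inputs=1):
            self.kernel = kernel
            self.inputs = [queue.Queue() for _ in range(num_inputs)]  # FIFO queues of tokens
            self.output = queue.Queue()

        def can_fire(self):
            # Firing rule: one token available on each input (a "from-all" rule).
            return all(not q.empty() for q in self.inputs)

        def fire(self):
            tokens = [q.get() for q in self.inputs]
            self.output.put(self.kernel(*tokens))  # kernel consumes tokens, produces a token

    # Independent actors could run in parallel; here a single actor is fired sequentially.
    add = Actor(lambda x, y: x + y, num_inputs=2)
    add.inputs[0].put(3)
    add.inputs[1].put(4)
    if add.can_fire():
        add.fire()
    print(add.output.get())   # -> 7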
1. Partitioning: low-level aspects arising from the runtime
2. Heavyweight runtime: needed for easy functional-programming support (e.g. function serialization and distribution)
3. Master-Worker execution model

Programming model (sketched below):
Global state
Explicit synchronizations
Explicit parallelism (actor = thread)
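A minimal pure-Python sketch of that programming model (illustrative only): global state, explicit synchronization, and one thread per actor managed by the programmer.

    import threading

    state = {"count": 0}              # global state
    lock = threading.Lock()           # explicit synchronization

    def actor(n):
        for _ in range(n):
            with lock:                # programmer-managed critical section
                state["count"] += 1

    # Explicit parallelism: actor = thread.
    threads = [threading.Thread(target=actor, args=(1000,)) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(state["count"])             # -> 4000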
Thank you

some history...
Map+Reduce
Syntax and Semantics (a Python rendering follows below)
map f a = [f a1, f a2, ..., f an]
a: a collection
f: elemental function, a.k.a. map-kernel
reduce f a = a1 f a2 f ... f an   (f applied as an infix binary combinator over all the ai)
f: binary combinator, a.k.a. reduce-kernel
f must be associative (and commutative), so the reduction order does not matter
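A minimal Python rendering of these definitions (illustrative; data and kernels are assumptions):

    from functools import reduce

    a = [1, 2, 3, 4]

    # map f a = [f a1, f a2, ..., f an]
    squares = list(map(lambda x: x * x, a))        # map-kernel applied elementwise

    # reduce f a = a1 f a2 f ... f an, with f an associative binary combinator
    total = reduce(lambda x, y: x + y, squares)    # order-independent because + is associative

    print(squares, total)   # [1, 4, 9, 16] 30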
Data processing
Data-centric programming = functional composition, with a denotational semantics
DataFlow
Parallel execution DF: data dependencies
Stream processing: input from-any, output broadcast; kernel = Clojure/Java code
Example: PageRank semantics dataflow (a sketch follows below)
Parallel execution dataflow: input from-any, output scatter; a data-parallel pipeline
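For illustration, one PageRank iteration written as a Map+Reduce dataflow in plain Python (not from the slides; the link graph and damping factor are assumptions):

    # Illustrative one-iteration PageRank as a Map+Reduce dataflow.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {p: 1.0 / len(links) for p in links}

    # Map kernel: each page emits (neighbour, contribution) tokens.
    contribs = [(dst, ranks[src] / len(dsts))
                for src, dsts in links.items()
                for dst in dsts]

    # Reduce kernel: sum contributions per page (associative combinator), then apply damping.
    new_ranks = {}
    for dst, c in contribs:
        new_ranks[dst] = new_ranks.get(dst, 0.0) + c
    new_ranks = {p: 0.15 / len(links) + 0.85 * r for p, r in new_ranks.items()}

    print(new_ranks)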
FLINK: Stream Processing
Data = DataStream
Flink Program = algebra of "stateful" operators
Dataflow = DataStream as token + kernels on single stream items
Parallel execution dataflow, firing rules: input from-all, output scatter
Bulk Synchronous Programming
Windowing: a specialized DataStream type; the DataStream must be PARTITIONED (a sketch follows below)
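A pure-Python sketch of keyed tumbling windows (illustrative only, not the Flink API; events and window size are assumptions): the stream is partitioned by key, each partition is chopped into fixed-size windows, and each window is reduced with an associative sum.

    from collections import defaultdict

    # (timestamp_seconds, key, value) events.
    events = [(0, "a", 1), (1, "b", 2), (3, "a", 4), (6, "a", 1), (7, "b", 5)]
    WINDOW = 5  # tumbling window of 5 seconds

    # Partition by key, then by window index; reduce each window.
    windows = defaultdict(int)
    for ts, key, value in events:
        windows[(key, ts // WINDOW)] += value

    for (key, w), total in sorted(windows.items()):
        print(f"key={key} window=[{w * WINDOW},{(w + 1) * WINDOW}) sum={total}")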
Spark Streaming: Stream Processing
Data = DStream, a continuous series of RDDs
Spark Program = algebra of stateless operators on DStreams
Dataflow = DStream as token + kernel on single RDD item (a PySpark sketch follows below)
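A minimal PySpark Streaming sketch of a stateless operator algebra over a DStream (illustrative; the hostname, port, and batch interval are assumptions):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "dstream-sketch")
    ssc = StreamingContext(sc, batchDuration=1)      # DStream = series of 1-second RDDs

    lines = ssc.socketTextStream("localhost", 9999)  # assumed text source
    counts = (lines.flatMap(lambda l: l.split())     # the same stateless kernels as in batch,
                   .map(lambda w: (w, 1))            # applied to each RDD of the series
                   .reduceByKey(lambda x, y: x + y))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()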
State
Actor-local state can be emulated:
read-only closures? write-only accumulators? actor-local state?
What do they share?
map f^s a = [f^s a1, f^s a2, ...]   (stateful kernel f^s: the state s is hidden inside the kernel)
map f a = [f (s0, a1), f (s1, a2), ...]   (state threaded explicitly: si is the state after the first i items; a sketch follows below)
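An illustrative Python rendering of the two formulations above (all names and data are assumptions): the same running-sum kernel with state hidden in a closure, versus state threaded explicitly through each application.

    # Stateful kernel f^s: the state lives inside the closure.
    def make_f_s():
        s = {"sum": 0}
        def f_s(x):
            s["sum"] += x
            return s["sum"]
        return f_s

    f_s = make_f_s()
    print([f_s(x) for x in [1, 2, 3]])    # [1, 3, 6]

    # Explicit threading: f takes (state, item) and returns (new_state, output).
    def f(s, x):
        return s + x, s + x

    s, out = 0, []
    for x in [1, 2, 3]:
        s, y = f(s, x)
        out.append(y)
    print(out)                             # [1, 3, 6]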
TensorFlow: Batch Processing
Data = Tensors (multi-dimensional arrays)
TF Program = dataflow
Dataflow = tensors as tokens + kernels on single tensor items
Specific for machine learning (a sketch follows below)
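A minimal sketch of a TensorFlow program as a dataflow graph, using the graph-mode API contemporary with the talk (reached through tf.compat.v1 on current TensorFlow); shapes and values are illustrative.

    import tensorflow as tf

    tf.compat.v1.disable_eager_execution()           # build a graph first, run it later

    x = tf.compat.v1.placeholder(tf.float32, shape=[None, 2])  # input tensor node
    w = tf.constant([[1.0], [2.0]])                             # constant tensor node
    y = tf.matmul(x, w)                                         # kernel node in the dataflow graph

    # Tensors flow through the graph only when it is executed.
    with tf.compat.v1.Session() as sess:
        print(sess.run(y, feed_dict={x: [[3.0, 4.0]]}))  # -> [[11.]]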
Big Data: never ask what they are
High expressiveness at the cost of low-level programming
Unified Dataflow Model
Timeline
Processing Big Data poses many challenges: high performance, easy programming, ...

1978: Backus' Functional Programming manifesto
2004: Google MapReduce, first publication
2011: higher-order functionals
2012
2014: program semantics dataflow; execution dataflow; parallel runtime
2015: data dependencies
" Given a 8CPUs core, roughly speaking,
how many workers would make sense to use?"
Spark, Flink, Storm, Hadoop hard to say they are High Performance oriented...
TensorFlow is moving to High Performance
at the cost of exposing low-level aspects
The Dataflow model has been proved
to describe all layers
of a big data analytics framework
Conclusion (1)
Conclusion (2)
Semantics DF: not a new concept...
[Figure: the Map+Reduce program b = R(M(a)) drawn as a dataflow: input a flows through map M, then reduce R, producing b]