fastdata @Torino 21/03/2016
Example: Map+Reduce dataflows
= Resilient Distributed Dataset (RDD)
= algebra of stateless operators on RDDs
RDD as token + kernel on single RDD item
generalization of MapReduce
One Programming Model to dominate them all
Hadoop, Spark, Flink, Storm, Tensorflow: Instructions for Use
= stream of tuples
= tuples as tokens + kernel on single tuple
to system state
at some future point
Actors execute user defined
determine when kernel can be executed
among independent actors
operations and data dependencies
instead of procedures
: low level aspects rising from runtime
: needed for easy functional programming support
(e.g. function serialization and distribution)
Explicit Parallelism (actor = thread)
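The "actor = thread" reading above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `actor`, `run_pipeline`, and `STOP` are hypothetical, each actor has a single input and a single output channel, and the firing rule is simply "one token is available on the input".

```python
import queue
import threading

STOP = object()  # hypothetical sentinel token that closes a channel


def actor(kernel, inbox, outbox):
    """Run `kernel` once per input token; it fires whenever a token is available."""
    while True:
        token = inbox.get()       # blocks until the data dependency is satisfied
        if token is STOP:
            outbox.put(STOP)      # propagate end-of-stream downstream
            return
        outbox.put(kernel(token))


def run_pipeline(kernels, tokens):
    """Start one thread per kernel, connect them with FIFO channels, collect the output."""
    channels = [queue.Queue() for _ in range(len(kernels) + 1)]
    threads = [
        threading.Thread(target=actor, args=(k, channels[i], channels[i + 1]))
        for i, k in enumerate(kernels)
    ]
    for t in threads:
        t.start()
    for tok in tokens:
        channels[0].put(tok)
    channels[0].put(STOP)
    results = []
    while True:
        out = channels[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


result = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
print(result)
```

The actors share nothing except the channels, so the only synchronization is the data dependency itself.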
Syntax and Semantics
map f a = [f a1, f a2, ..., f an]
f: elemental function a.k.a.
reduce f a = a1 f a2 f ... f an = F(ai)
f: binary combinator a.k.a.
- f associative/commutative
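The two definitions above can be tried out with a minimal Python sketch. The helper names `map_`, `reduce_`, and `parallel_reduce` are hypothetical, and the partitioning scheme is an assumption chosen only to show why associativity of f matters when the reduction is split across workers.

```python
from functools import reduce
from operator import add, sub


def map_(f, a):
    """map f a = [f a1, f a2, ..., f an]"""
    return [f(x) for x in a]


def reduce_(f, a):
    """reduce f a = a1 f a2 f ... f an, folded left to right."""
    return reduce(f, a)


def parallel_reduce(f, a, parts=2):
    """Reduce each partition on its own, then combine the partial results.

    Agrees with reduce_(f, a) only when f is associative.
    Assumes `a` is non-empty.
    """
    size = max(1, -(-len(a) // parts))  # ceiling division
    chunks = [a[i:i + size] for i in range(0, len(a), size)]
    return reduce(f, [reduce(f, c) for c in chunks])


data = [1, 2, 3, 4, 5, 6]
squares = map_(lambda x: x * x, data)

total = reduce_(add, squares)
total_by_parts = parallel_reduce(add, squares, parts=3)

# Subtraction is not associative, so the partitioned result differs.
diff = reduce_(sub, squares)
diff_by_parts = parallel_reduce(sub, squares, parts=3)
```

With addition the sequential and partitioned results coincide; with subtraction they do not, which is why the combinator is required to be associative before the reduce can be parallelized.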
Parallel execution DF:
kernel = Clojure/Java code
Example: PageRank semantics dataflow
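As a rough illustration of how PageRank fits the Map+Reduce shape, the sketch below writes one iteration as a map step (each page emits contributions to its targets) followed by a reduce-by-key step (contributions are summed per page). The function name, the damping factor 0.85, the fixed iteration count, and the requirement that every page has at least one outgoing link are all assumptions made for this example; they are not taken from the slides.

```python
from collections import defaultdict


def pagerank(links, iterations=20, damping=0.85):
    """links: dict mapping each page to a non-empty list of pages it links to."""
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        # map: every page emits (target, contribution) pairs
        contribs = [
            (dst, ranks[src] / len(outs))
            for src, outs in links.items()
            for dst in outs
        ]
        # reduce by key: sum the contributions received by each page
        sums = defaultdict(float)
        for dst, c in contribs:
            sums[dst] += c
        ranks = {p: (1 - damping) / n + damping * sums[p] for p in links}
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The map step touches one page at a time and the reduce step uses an associative, commutative sum, so both can be partitioned across workers.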
Parallel Execution Dataflow
= algebra of "stateful" operators
= DataStream as token + kernels on single stream items
Parallel execution dataflow
Bulk Synchronous Programming
Specialized DataStream type
DataStream must be
Data = DStream as continuous series of RDDs
Spark Program = algebra of stateless operators on DStream
Dataflow = DStream as token + kernel on single RDD item
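The idea of a stream represented as a series of finite batches, with a stateless operator applied to each batch, can be sketched in plain Python. This is not the Spark API: `dstream` and `map_batches` are hypothetical names, and batches are cut by item count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from itertools import islice


def dstream(source, batch_size):
    """Cut a (possibly unbounded) iterator into a series of finite batches."""
    it = iter(source)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def map_batches(f, stream):
    """Stateless operator: apply f to every item of every batch independently."""
    for batch in stream:
        yield [f(x) for x in batch]


batches = list(map_batches(lambda x: x * 10, dstream(range(7), batch_size=3)))
print(batches)
```

Each batch plays the role of one RDD flowing through the dataflow as a token, and the kernel still works on single items inside it.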
What do they share
map f^s a = f^s a1, f^s a2, ...
map f a = f (s0,a1), f (s1,a2), ...
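The difference between the stateless and the stateful map can be made concrete with a small Python sketch. One detail is an assumption: the slides only show f applied to (state, item) pairs, so the convention that f returns a (new state, output) pair is chosen here for illustration, as are the names `stateless_map`, `stateful_map`, and `running_sum`.

```python
def stateless_map(f, stream):
    """Each output depends only on the current item."""
    for a in stream:
        yield f(a)


def stateful_map(f, s0, stream):
    """f takes (state, item) and returns (new_state, output); state is threaded through."""
    s = s0
    for a in stream:
        s, out = f(s, a)
        yield out


def running_sum(s, a):
    s = s + a
    return s, s


doubled = list(stateless_map(lambda a: 2 * a, [1, 2, 3, 4]))
sums = list(stateful_map(running_sum, 0, [1, 2, 3, 4]))
```

The stateless version can process items in any order or in parallel, while the stateful version introduces a dependency from each item to the previous one through the state.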
= Tensors (multi-dimensional arrays)
tensors as tokens + kernels on single tensor items
Specific for machine learning
what they are
High expressiveness at cost of low-level programming
Unified Dataflow Model
processing Big Data
poses many challenges
Backus' Functional Programming manifesto
Google MapReduce 1st publication
Program semantics dataflow
"Given a 8CPUs core, roughly speaking,
how many workers would make sense to use?"
It is hard to say that Spark, Flink, Storm, and Hadoop are High Performance oriented...
TensorFlow is moving to High Performance
at the cost of exposing low-level aspects
The Dataflow model has been shown to describe all layers of a big data analytics framework
Not a new concept...