Apache Spark

by

Anant Asthana

on 23 March 2016

Transcript of Apache Spark

Apache
Anant Asthana
anant.asty@gmail.com
github.com/anantasty
linkedin.com/anantasthana

Objectives
Questions?
https://amplab.cs.berkeley.edu/benchmark/
http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/
https://spark.apache.org/
https://github.com/apache/spark/graphs/contributors
http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf
What is Spark
Open Source - Apache License 2.0
Alternative to the Map Reduce paradigm
Low latency cluster computing system
Scalable for very large datasets
Up to 100x faster than Map Reduce for:
Iterative algorithms
Interactive data mining
Can be used with Hadoop/HDFS
RDD - Resilient Distributed Dataset
Fault-tolerant
Distributed
Can be operated on in parallel
RDDs are the fundamental unit of data in Spark

Performance benchmarks
Spark Stack
Spark SQL
Spark Streaming
MLlib
GraphX
Logistic Regression Spark vs Hadoop
Multiple iterations over dataset
RDD - Sources & Operations
Data Sources
HDFS
S3
File System
HBase, Parquet, Cassandra
Hadoop input formats

Transformations
map
filter
flatMap
sample
union
intersection
distinct
groupByKey
reduceByKey
join
Actions
reduce
collect
count
first
take
takeSample
countByKey
Operations
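A minimal PySpark sketch of how transformations and actions fit together. Transformations such as filter and map only describe a new RDD; work starts when an action such as reduce is called. The function and helper names here are illustrative, and `sc` is assumed to be an existing SparkContext (for example `SparkContext("local[*]", "demo")`).

```python
def is_even(n):
    """Predicate passed to the filter transformation."""
    return n % 2 == 0


def square(n):
    """Function passed to the map transformation."""
    return n * n


def sum_of_even_squares(sc, numbers):
    """Sum the squares of the even numbers; `sc` is an assumed SparkContext."""
    rdd = sc.parallelize(numbers)      # distribute a local collection as an RDD
    evens = rdd.filter(is_even)        # transformation: nothing is computed yet
    squares = evens.map(square)        # transformation: still lazy
    # Action: triggers the job and returns a plain Python value to the driver.
    # Note that reduce raises an error on an empty RDD.
    return squares.reduce(lambda a, b: a + b)
```

The helpers are ordinary Python functions, so they can be checked locally before being shipped to a cluster.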
What Is Spark Being Used For?
ETL
BigData SQL
Machine Learning
Stream Processing
Graph Processing
Recommendation Engines
Data Science
Log Analytics
Video Optimization

Examples
examples: https://github.com/anantasty/spark-examples
docker containers: https://github.com/anantasty/docker-spark
Wordcount
SparkSQL
ETL
Log Analysis
Recommender
K-Means Clustering
Spark UI

Working with RDD

RDD Example - Distributed grep
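The original slide's code is not in this transcript. As a stand-in, here is a hedged PySpark sketch of a distributed grep: load a text file as an RDD, filter the lines in parallel, and collect the matches. The names `contains` and `distributed_grep` are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
def contains(pattern):
    """Build a predicate that is true for lines containing `pattern`."""
    def predicate(line):
        return pattern in line
    return predicate


def distributed_grep(sc, path, pattern):
    """Return every line under `path` that contains `pattern`."""
    lines = sc.textFile(path)                    # one RDD element per line of text
    matches = lines.filter(contains(pattern))    # evaluated in parallel, per partition
    # collect() pulls all matches back to the driver, so it suits small result sets;
    # take(n) is a safer action when the number of matches may be large.
    return matches.collect()
```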
Learn about Map Reduce
What is Spark
Features
What Spark excels at
Performance comparisons
Basic usage examples
What Spark excels at
Languages Supported
Java
Scala
Python
Ease of use
Multiple languages
Interactive shell
Speed
In-memory computation
Shared memory
broadcast variables
accumulators
More operations than the MapReduce paradigm
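A short PySpark sketch of the two shared-variable types named above. A broadcast variable ships a read-only value to the workers once; an accumulator lets tasks add to a counter that the driver reads afterwards. The function names and the lookup-table scenario are made up for illustration, and `sc` is assumed to be an existing SparkContext.

```python
def lookup_name(table, code):
    """Return the name for `code`, or None when the code is unknown."""
    return table.get(code)


def resolve_codes(sc, codes, lookup):
    """Translate codes to names and count the codes that had no match."""
    table = sc.broadcast(lookup)    # read-only copy made available to every worker
    unknown = sc.accumulator(0)     # counter that tasks can only add to

    def resolve(code):
        name = lookup_name(table.value, code)
        if name is None:
            unknown.add(1)          # updates made inside a transformation may
        return name                 # repeat if a task is re-executed

    names = (sc.parallelize(codes)
               .map(resolve)
               .filter(lambda name: name is not None)
               .collect())          # action: runs the job
    return names, unknown.value     # the accumulator value is read on the driver
```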
Word Count Example
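The slide's code did not survive in this transcript. A minimal PySpark word count, offered as a sketch rather than the presenter's version, could look like this. `tokenize` and `spark_word_count` are illustrative names, and `sc` is assumed to be an existing SparkContext.

```python
from operator import add


def tokenize(line):
    """Split one line into lower-cased words."""
    return line.lower().split()


def spark_word_count(sc, path):
    """Return a list of (word, count) pairs for the text file at `path`."""
    return (sc.textFile(path)                 # RDD of lines
              .flatMap(tokenize)              # RDD of words
              .map(lambda word: (word, 1))    # RDD of (word, 1) pairs
              .reduceByKey(add)               # sum the counts for each word
              .collect())                     # action: bring results to the driver
```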
RDD Caching and Fault Tolerance
Fault Tolerance
RDDs track lineage - easy to recompute lost data
HDFS File
Filtered File
Mapped RDD
RDD caching - using rdd.cache()
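A hedged PySpark sketch of the chain above (file, filtered RDD, mapped RDD) with caching. The log layout (tab-separated, host name first) and all function names are assumptions for illustration; `sc` is assumed to be an existing SparkContext.

```python
def is_error(line):
    """Keep only lines that mention ERROR."""
    return "ERROR" in line


def first_field(line):
    """Return the first tab-separated field (assumed here to be a host name)."""
    return line.split("\t")[0]


def error_sources(sc, path):
    """Count error lines and distinct error sources in the file at `path`."""
    lines = sc.textFile(path)            # HDFS file (or any supported path)
    errors = lines.filter(is_error)      # filtered RDD
    sources = errors.map(first_field)    # mapped RDD
    sources.cache()                      # keep computed partitions in memory
    total = sources.count()              # first action: computes and caches
    unique = sources.distinct().count()  # second action: reuses cached partitions
    return total, unique
```

If a cached partition is lost, Spark can rebuild it by replaying the recorded lineage (textFile, filter, map) for that partition only.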

presentation: http://goo.gl/Pn61TB

Word Count - Map Reduce (Java)
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Unified Data Platform
Concise code
The Purpose Of Cluster Computing
Scalability
Easy to add nodes
High availability
Fault tolerance
Intro to Map Reduce
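To make the Map Reduce model concrete, here is a single-process Python sketch of its three phases (map, shuffle, reduce) applied to word counting. It is a toy model that uses no Hadoop or Spark APIs; all function names are illustrative.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def local_word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data across the network.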
Thanks to our Sponsors!