Apache Spark
Anant Asthana
anant.asty@gmail.com
github.com/anantasty
linkedin.com/anantasthana
Objectives
Questions?
https://amplab.cs.berkeley.edu/benchmark/
http://blog.cloudera.com/blog/2014/03/apache-spark-a-delight-for-developers/
https://spark.apache.org/
https://github.com/apache/spark/graphs/contributors
http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf
What is Spark
Open source - Apache License 2.0
Alternative to the MapReduce paradigm
Low latency cluster computing system
Scalable for very large datasets
Up to 100x faster than MapReduce for:
Iterative algorithms
Interactive data mining
Can be used with Hadoop/HDFS
RDD - Resilient Distributed Dataset
Fault-tolerant
Distributed
Can be operated on in parallel
RDDs are the fundamental unit of data in Spark
Performance benchmarks
Spark Stack
Spark SQL
Spark Streaming
MLlib
GraphX
Logistic Regression Spark vs Hadoop
Multiple iterations over dataset
RDD - Sources & Operations
Data Sources
HDFS
S3
File System
HBase, Parquet, Cassandra
Hadoop input formats
Transformations
map
filter
flatMap
sample
union
intersection
distinct
groupByKey
reduceByKey
join
Actions
reduce
collect
count
first
take
takeSample
countByKey
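A minimal sketch of how transformations and actions differ: transformations only record work, and nothing runs until an action asks for a result. `MiniRDD` is a hypothetical, single-process stand-in written in plain Python. It is not the Spark API and covers only a few of the operations listed above.

```python
from functools import reduce as _reduce
from itertools import islice


class MiniRDD:
    """Hypothetical single-process stand-in for an RDD (not the Spark API)."""

    def __init__(self, source, steps=()):
        self._source = source  # callable that returns a fresh iterable
        self._steps = steps    # recorded transformations, applied lazily

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, f):
        return MiniRDD(self._source, self._steps + (lambda it: (f(x) for x in it),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (lambda it: (x for x in it if pred(x)),))

    def flatMap(self, f):
        return MiniRDD(self._source, self._steps + (lambda it: (y for x in it for y in f(x)),))

    # Actions: run the recorded pipeline and return a value.
    def _run(self):
        it = iter(self._source())
        for step in self._steps:
            it = step(it)
        return it

    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def take(self, n):
        return list(islice(self._run(), n))

    def reduce(self, f):
        return _reduce(f, self._run())


calls = []  # records every element the map function actually sees


def noisy_square(x):
    calls.append(x)
    return x * x


rdd = MiniRDD(lambda: range(1, 6)).map(noisy_square).filter(lambda x: x % 2 == 1)
calls_before_action = list(calls)  # still empty: no action has run yet
odd_squares = rdd.collect()        # the action triggers the computation
```

Every action here re-runs the whole pipeline from the source, which is the cost that caching is meant to avoid.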
Operations
What Is Spark Being Used For?
ETL
Big Data SQL
Machine Learning
Stream Processing
Graph Processing
Recommendation Engines
Data Science
Log Analytics
Video Optimization
Examples
examples: https://github.com/anantasty/spark-examples
docker containers: https://github.com/anantasty/docker-spark
Wordcount
SparkSQL
ETL
Log Analysis
Recommender
K-Means Clustering
Spark UI
Working with RDD
RDD Example - Distributed grep
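A rough sketch of the distributed grep idea in plain Python: the data is split into partitions, each partition is filtered independently, and the matches are combined. In PySpark the same thing is usually written along the lines of `sc.textFile(path).filter(lambda line: "ERROR" in line).collect()`. The thread pool, the sample log lines, and the function names below are assumptions made for illustration, not part of the original example.

```python
from concurrent.futures import ThreadPoolExecutor


def grep_partition(lines, needle):
    """Filter one partition: keep only the lines that contain the needle."""
    return [line for line in lines if needle in line]


def distributed_grep(partitions, needle, workers=4):
    """Filter every partition in parallel, then concatenate the matches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves partition order, so output order is deterministic.
        matched = pool.map(lambda part: grep_partition(part, needle), partitions)
        return [line for part in matched for line in part]


# Hypothetical log data, already split into three partitions.
partitions = [
    ["INFO start", "ERROR disk full"],
    ["WARN slow", "INFO ok"],
    ["ERROR timeout", "INFO done"],
]
errors = distributed_grep(partitions, "ERROR")
```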
Learn about Map Reduce
What is Spark
Features
What Spark excels at
Performance comparisons
Basic usage examples
What Spark excels at
Languages Supported
Java
Scala
Python
Ease of use
Multiple languages
Interactive shell
Speed
In memory computation
Shared memory
broadcast variables
accumulators
More operations than the MapReduce paradigm
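A small plain-Python sketch of the two shared-memory ideas above: a broadcast variable is a read-only value that every task can look up, and an accumulator is a counter that tasks add to and the driver reads afterwards. PySpark exposes these as `sc.broadcast(...)` and `sc.accumulator(...)`. The `Accumulator` class, the country lookup, and the thread pool below are hypothetical stand-ins, not Spark code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType


class Accumulator:
    """Hypothetical add-only counter that is safe to update from many tasks."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


# "Broadcast" value: a read-only lookup table shared by every task.
country_names = MappingProxyType({"US": "United States", "IN": "India"})
unknown_codes = Accumulator()


def expand(code):
    """Translate a country code, counting the codes that cannot be resolved."""
    name = country_names.get(code)
    if name is None:
        unknown_codes.add(1)
        return "unknown"
    return name


partitions = [["US", "IN"], ["XX", "US"], ["IN", "YY"]]
with ThreadPoolExecutor(max_workers=3) as pool:
    expanded = list(pool.map(lambda part: [expand(c) for c in part], partitions))
```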
Word Count Example
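A minimal word count sketch in plain Python that mirrors the usual Spark pipeline of `flatMap`, `map`, and `reduceByKey`. The input lines and the `word_count` helper are assumptions for illustration; a real job would read the lines from a file or HDFS.

```python
from collections import defaultdict


def word_count(lines):
    # flatMap: split every line into words, producing one flat stream.
    words = (word for line in lines for word in line.lower().split())
    # map: pair every word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share the same word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


lines = ["to be or not to be", "to do is to be"]
counts = word_count(lines)
```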
RDD Caching and Fault Tolerance
Fault Tolerance
RDDs track lineage, so lost data is easy to recompute
HDFS File
Filtered File
Mapped RDD
RDD caching - using rdd.cache()
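A sketch of how lineage and caching fit together, following the HDFS File, Filtered File, Mapped RDD chain above. Each dataset remembers its parent and the function that produced it, so a lost result can be rebuilt by replaying the chain. `LineageRDD` and `lose_cached_data` are hypothetical names for a single-process illustration, not Spark internals.

```python
class LineageRDD:
    """Hypothetical dataset that remembers how it was derived (its lineage)."""

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's data
        self.source = source        # raw data, set only on the root dataset
        self._cache_enabled = False
        self._cached = None
        self.computations = 0       # how many times this dataset was computed

    def map(self, f):
        return LineageRDD(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return LineageRDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def cache(self):
        self._cache_enabled = True
        return self

    def lose_cached_data(self):
        """Simulate a node failure that drops the cached result."""
        self._cached = None

    def collect(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            # Replay the lineage: compute the parent, then apply our transform.
            data = self.transform(self.parent.collect())
        if self._cache_enabled:
            self._cached = data
        return data


hdfs_file = LineageRDD(source=["ERROR a", "INFO b", "ERROR c"])
filtered = hdfs_file.filter(lambda line: line.startswith("ERROR"))
mapped = filtered.map(lambda line: line.split()[1]).cache()

first = mapped.collect()                    # computed from lineage, then cached
second = mapped.collect()                   # served from the cache
computations_with_cache = mapped.computations
mapped.lose_cached_data()                   # cached result is lost
recovered = mapped.collect()                # rebuilt by replaying the lineage
```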
presentation: http://goo.gl/Pn61TB
Word Count - MapReduce (Java)
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Unified Data Platform
Concise code
The Purpose Of Cluster Computing
Scalability
Easy to add nodes
High availability
Fault tolerance
Intro to Map Reduce
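A bare-bones sketch of the three MapReduce phases (map, shuffle, reduce) in plain Python. The `map_reduce` helper, the log lines, and the mapper and reducer names are assumptions for illustration; a real framework runs each phase across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    # Map phase: every record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def status_mapper(line):
    """Emit (status_code, 1) for one hypothetical access-log line."""
    yield line.split()[-1], 1


def sum_reducer(key, values):
    return sum(values)


log_lines = ["GET /index 200", "GET /missing 404", "POST /login 200", "GET /index 200"]
hits_per_status = map_reduce(log_lines, status_mapper, sum_reducer)
```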
Thanks to our Sponsors!