Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Spark Gotchas and Anti-Patterns

No description
by

Michael Malak

on 29 July 2015

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Spark Gotchas and Anti-Patterns

Gotchas and Anti-Patterns
Michael Malak
November 5, 2014
`
RDDs
AP: Ignore Partitioning
Alternative:
group by x.substring(0,2)
Image from Matei's paper
Chart from performanceterracotta.blogspot.com
AP: Ignore GC
Solution 1:
Use Tachyon
Solution 2:
Break up node into multiple workers
Workers Per Node
Optimal in this case: 3
From http://hpsc-spark.github.io/
G: Sharing RDDs
Solution 1:
Tachyon and/or Succinct
Solution 2:
"god" daemon/router w/SparkContext
(RDDs tied to SparkContext)
AP: Ignore Closure
From engineering.sharethrough.com
JSONParser needs to be Serliazable
JSONParser needs to be stateless
JSONParser cannot be outside of main()
What is Closure?
Scala does not have Closure with Serialization
Scala has Closure
"Spores" library not part of even Scala 2.11
So AMPLab wrote their own!
Spark has serilizable closure even though Scala does not
G: Trying to Do Graceful Shutdown
API
Application
Fixed in 1.0.0: SPARK-1331 Graceful shutdown of Spark Streaming computation
Fixed in 1.0.1: SPARK-2034 KafkaInputDStream doesn't close resources and may prevent JVM shutdown
Still open: SPARK-2892 Socket Receiver does not stop when streaming context is stopped
Still open: SPARK-2383 With auto.offset.reset, KafkaReceiver potentially deletes Consumer nodes from Zookeeper
Save/Restore State
Open SPARK-3660: Initial RDD for updateStateByKey transformation
Standard updateStateByKey()
(keys under private control of Spark API)
Alternative updateStateByKey() overload provided by standard API
(full access to all keys)
G: Overrunning Batch Duration
E.g. Batch duration of 10 seconds,
but one particular batch takes 15 seconds to process
"Undefined" behavior
"Solution": Set batch size for worst case rather than average case
Batch Size Tradeoff
Latency vs. Throughput
G: No transactions
If a node deep in the DAG fails to handle incoming data, Spark Streaming has no facility to percolate that error back up the DAG (e.g. for future reprocessing or for flow control)
Major advantage of Storm over Spark Streaming
Spark 1.1.0 SPARK-1729 Make Flume pull data from source, rather than the current push model
Some reliability when using Flume
This makes sure that the if the Spark executor running the receiver goes down, the new receiver on a new node can still get data from Flume.
Other data souces (Kafka? Socket?)
G: Kafka "At Least Once" Semantics
Good News: Exactly once when Kafka nodes aren't going down
But this is a Kafka issue, not a Spark issue
G: Multiple Kafka Input Streams
Open SPARK-2388: Streaming from multiple different Kafka topics is problematic
G: Graphs are immutable
Spark 1.2.0: SPARK-2365 Add IndexedRDD, an efficient updatable key-value store
"GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD."
"We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL."
G: No Built-in API to Read RDF
Solution: Wait for Spark GraphX In Action mid-2015
G: INAIPL But... PageRank is patented
G: GraphX Performance
(Relative to Established Graph Tools)
"One of the reasons for PowerGraph's superior performance can be attributed to its highly optimized C++ implemenation."
From http://riverpublishers.com/journal/journal_articles/RP_Journal_2245-1439_333.pdf
G: Spark SQL Still Catching up to Hive (and Shark)
SPARK-2562 DATE data type not supported
SPARK-4154 "NOT BETWEEN" not supported
SPARK-4120 JOIN syntax "FROM T1, T2, T3" not supported (but INNER JOIN / OUTER JOIN syntax is supported)
SPARK-2866 ORDER BY attributes must appear in SELECT clause
SPARK-2693 Support for UDAF Hive Aggregates like PERCENTILE
SPARK-3807 SparkSql does not work for tables created using custom serde
Fixed in 1.2.0
Open
SPARK-4135 Error reading Parquet file generated with SparkSQL
SPARK-4073 Parquet+Snappy can cause significant off-heap memory usage
SPARK-4131 INSERT OVERWRITE LOCAL (filesystem)
Other (non-Hive-like) Issues
Fixed in 1.2.0: SPARK-3500 coalesce() and repartition() of SchemaRDD is broken
G: Performance Relative to Impala
AP: Ignore Performance Tuning
Partitioning
Compression
COMPUTE STATISTICS
Not caching tables larger than memory
SQLContext config options
G: Data Skipping From Shark Not Supported
From paper Liwen Sun et al
From AMPLab
Fixed in 1.2.0: SPARK-2961 Use statistics to skip partitions when reading from in-memory columnar data
G: Hot-Adding Nodes to Cluster
YARN:
Perform a bunch of manual steps, and then
su -l yarn -c "yarn rmadmin -refreshNodes"
Mesos:
Probably not possible based on open JIRA tickets:
MESOS-1739: Allow slave reconfiguration on restart
MESOS-890: Figure out a way to migrate a live Mesos cluster to a different ZooKeeper cluster
G: Not as Tunable as Mahout
AP: Thinking Spark is "In-Memory"
Shuffle files get written to disk
At least in 1.2.0, reduce happens on node with largest shuffle file (but shuffle files still written to disk)
AP: Using groupByKey() instead of reduceByKey()
From Aaron Davidson presentation at Spark Summit July 2014
From Aaron Davidson presentation at Spark Summit July 2014
AP: Using map() instead of mapPartitions()
From Patrick Wendell's Spark Summit 2013 Presentation
AP: Going Down the Kryo Rabbit Hole
Many groups just use Serializable instead, despite 10x performance hit
Kryo support is much better than it used to be, though -- dozens of Jira tickets resolved in 1.0.0 and 1.1.0
Jira tickets still outstanding:
SPARK-3601 Kryo NPE for output operations on Avro complex Objects even after registering
SPARK-3630 Identify cause of Kryo+Snappy PARSING_ERROR
G: BlinkDB Requires Spark 0.9.0
AP: Mixing Accumulators With Lazy Operations
From blog.madhukaraphatak.com
Above results in 0,0
Alternative: Use mapPartitions() and forEach() within each partition
G: Key Algorithms Missing
SPARK-2352 Add Artificial Neural Network (ANN) to Spark
Anomaly Detection
Decision Tree Pruning
Full transcript