Stability Patterns at Campanja

David Almroth

on 28 October 2013

Transcript of Stability Patterns at Campanja

Example 1:
Use virtual servers
If one server crashes, the rest are (often) not affected.
Stability Patterns
David Almroth,
February 2013
Example 2:
Java thread pools

Do not only use the
default single thread pool. Make more thread pools.
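
A minimal sketch of thread pool bulkheads in Java, assuming two kinds of work that should not be able to starve each other. The class name, the pool names ("report" and "checkout") and the pool sizes are made up for illustration and are not from the talk.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: one fixed-size pool per kind of work, so slow
// report jobs can only exhaust their own threads.
final class BulkheadPools {
    // Pool sizes are illustrative assumptions; tune them per workload.
    private final ExecutorService reportPool = Executors.newFixedThreadPool(2);
    private final ExecutorService checkoutPool = Executors.newFixedThreadPool(2);

    // Slow, less important work goes to its own pool.
    <T> Future<T> submitReport(Callable<T> task) {
        return reportPool.submit(task);
    }

    // Important work keeps its threads even when the report pool is full.
    <T> Future<T> submitCheckout(Callable<T> task) {
        return checkoutPool.submit(task);
    }

    // Stops both pools and interrupts tasks that are still running.
    void shutdown() {
        reportPool.shutdownNow();
        checkoutPool.shutdownNow();
    }
}
```

If every report thread is stuck, a call like `pools.submitCheckout(() -> "ok")` still gets a thread, because the two pools share nothing.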
Example 4:
Erlang processes are
by definition bulkheads.

Erlang supervisor hierarchies are bigger bulkheads
Example 3:
Database connection

Use a separate pool for
"extra important" connections
Circuit Breaker
Bulkheads protect your software from outside threats by limiting the damage.

A ship should not sink if it gets one single hole in the hull.

Your website should not go down when one of the site's many backend systems stops responding.

Java DevOps: Your website should not halt just because one single Java thread pool gets exhausted or one single database connection pool gets exhausted.

Erlang has superior support for bulkheads. For example: No shared data between processes. If one process crashes, a supervisor can handle it and limit the damage.

Use supervisor hierarchies to make larger bulkheads in your system.
Circuit breakers protect your software from "internal heat".

The same way a fuse breaks and saves a house from burning down completely, a circuit breaker in software shuts down one small malfunctioning part of the system so the failure does not snowball and crash the whole system.

If your system can't connect to a database and has done X retries, the circuit breaker will "break" and your system will stop calling the database. The operator should be informed, of course.

In a Java shop I would use Hystrix.
In Erlang I would use supervisors and child definitions. It is easy to set max restarts per time interval in a child definition.
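
The idea can be sketched in a few lines of plain Java. This is not Hystrix and not how Erlang supervisors work internally. It is a hand-rolled illustration, and the class name, the failure threshold and the open interval are all assumptions.

```java
import java.util.function.Supplier;

// Minimal circuit breaker sketch: after maxFailures consecutive failures
// the breaker opens and calls fail fast with a fallback value. Once
// openMillis have passed, calls are let through to the backend again.
final class CircuitBreaker {
    private final int maxFailures;
    private final long openNanos;
    private int failures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    CircuitBreaker(int maxFailures, long openMillis) {
        this.maxFailures = maxFailures;
        this.openNanos = openMillis * 1_000_000L;
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean shouldRejectNow() {
        return open && System.nanoTime() - openedAt < openNanos;
    }

    private synchronized void recordSuccess() {
        failures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        failures++;
        if (failures >= maxFailures) {
            open = true;
            openedAt = System.nanoTime();
            // A real system would also inform the operator here.
        }
    }

    // Runs backendCall unless the breaker is open; returns fallback
    // when the call is rejected or throws.
    <T> T call(Supplier<T> backendCall, T fallback) {
        if (shouldRejectNow()) {
            return fallback; // fail fast, do not touch the backend
        }
        try {
            T result = backendCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback;
        }
    }
}
```

With `new CircuitBreaker(2, 60_000)`, two failed calls in a row open the breaker, and the third call returns the fallback without touching the backend.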
Do not let one malfunctioning piece of
electrical equipment burn down the house.
Cascading failure
Integration points
Sockets have so many risks!

Libraries using sockets often add even more risks.
Limit return data
Socket risk examples:
What if I can't make the initial handshake?
What if it takes ten minutes to do the handshake?
What if I always get a disconnect after the initial handshake was OK?
What if I can make the connection, but I never get any response after the initial handshake?
What if it takes X minutes to respond to my request?
What happens if 10 000 requests come in as a burst?
How long can the TCP connection be quiet before one of the firewalls in between drops it?
Library can add risks like these:
No timeout can be set in the client.
No data volume max can be set.
The socket is not closed properly, only closed later by garbage collection.
Throws the same exception for connectivity problems, bad input problems, and bad response data problems.
Resource leaks
Hides exceptions from lower (socket) layers
Attack of self denial
Use timeouts
Connection timeout
Read timeout
Not Stable
How to make it stable
Default values are usually too high.
A high timeout is also dangerous

Sometimes the 3rd party client lib you are using supports this, and then it is a good library.

This is one of the first things to check when evaluating a new client side lib.

Configure connection timeouts and read timeouts.

Always configure this on database connection pools (in the app server / web server)
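
On a plain Java socket the two timeouts are set separately. The sketch below uses `Socket.connect(SocketAddress, int)` for the connection timeout and `Socket.setSoTimeout(int)` for the read timeout. The class name, the method name and the returned strings are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper: connects with an explicit connection timeout,
// then waits for one byte with an explicit read timeout.
final class SocketTimeouts {
    static String readOneByte(String host, int port, int connectMs, int readMs) {
        try (Socket socket = new Socket()) {
            try {
                // Connection timeout: give up if the handshake takes too long.
                socket.connect(new InetSocketAddress(host, port), connectMs);
            } catch (SocketTimeoutException e) {
                return "connect timeout";
            }
            // Read timeout: give up if the peer stays silent too long.
            socket.setSoTimeout(readMs);
            try {
                int b = socket.getInputStream().read();
                return b < 0 ? "closed by peer" : "got data";
            } catch (SocketTimeoutException e) {
                return "read timeout";
            }
        } catch (IOException e) {
            // Any other connectivity problem, for example connection refused.
            return "io error";
        }
    }
}
```

A call such as `SocketTimeouts.readOneByte(host, port, 1000, 200)` returns within roughly the sum of the two timeouts, even if the server accepts the connection and never answers.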

A really nice lib is "lhttpc"
Case 3:
Clients reconnect without a delay between attempts.
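
One common way to avoid this is to wait longer after each failed attempt and add some randomness, so that many clients do not reconnect at the same moment. A minimal sketch follows. The class name, the doubling factor and the jitter range are assumptions, not something the talk prescribes.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for reconnect delays: exponential backoff with a
// cap, plus random jitter between half of the delay and the full delay.
final class ReconnectBackoff {
    // attempt is 0 for the first retry; baseMs and maxMs must be positive.
    static long delayMillis(int attempt, long baseMs, long maxMs) {
        long delay = baseMs;
        // Double the delay once per earlier attempt, but stop at the cap.
        for (int i = 0; i < attempt && delay < maxMs; i++) {
            delay *= 2;
        }
        delay = Math.min(delay, maxMs);
        long half = delay / 2;
        // Pick a random value in [half, delay] to spread clients out.
        return half + ThreadLocalRandom.current().nextLong(delay - half + 1);
    }
}
```

A reconnect loop would sleep for `ReconnectBackoff.delayMillis(attempt, 100, 5000)` milliseconds before each new attempt, instead of retrying immediately.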
Case 1: Rendering an unlimited data set on a web page. For example a big report.
Small cracks lead to big failures
The system behaves well,
even when abused
Case 2:
Endless loop
Reason for one cascading failure last year in the RTT:
Biggest customer launched mobile site
Traffic increased by 30%. We did not add more web servers.
We did a deployment. When we stopped one server,
the other server in the same availability zone was
overloaded. It crashed with an out-of-memory error.
The load on the last 4 servers got too high and they crashed too.
Unknown bugs & risks
Is there a way to limit the impact?
For example:
slow memory leaks
race conditions
bursty traffic
hardware failures
packet loss on network
bugs after the next deployment
In this talk I will focus on stability for distributed

Who is David?
I have been working with Java
since 1999, mostly banks and finance.
Erlang since 2009.
I love operations because I get to see my code run
in the real world.
There is a reason why Google limits your
search results. Use this pattern to make clients and servers in your distributed system more stable (under massive load).

At Campanja we store massive amounts of data in HBase. When we query the database, we always limit the number of lines returned.
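
On the client side, the same rule can be sketched as a hard cap on how many rows are copied out of a result iterator. This is plain Java, not the HBase API. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: never materialize more than maxRows rows,
// no matter how many the data source is able to produce.
final class LimitedQuery {
    static <T> List<T> fetchAtMost(Iterator<T> rows, int maxRows) {
        List<T> page = new ArrayList<>();
        // Stop as soon as the cap is reached, even if more rows exist.
        while (page.size() < maxRows && rows.hasNext()) {
            page.add(rows.next());
        }
        return page;
    }
}
```

The cap protects the client's memory. Where the database supports it, the limit should also go into the query itself, so the server does not do the extra work either.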
Implement and use
"Hood openers"
If you can't open your system's hood, it is hard to track down a problem.
Besides the classic Linux toolset we are using the Erlang console and Erlang tools like:
dtop (cpu + memory + inbox per process)
redbug (trace input and output to any function)
folsom (metrics)

riemann (monitoring based on metrics)
graphite (graphs based on metrics)

Sometimes we need more detailed metrics. Then we code them and deploy them within a few hours, just to get more knowledge about our system.

We use folsom to make sure the added metrics do not crash our system.
Stability Patterns
Use timeouts
Limit return data
Circuit breakers
Hood openers
Decoupling middleware
Thank you for
Contact me:
We are hiring!

Like us on Facebook
Netflix Hystrix
Erlang child specifications
Circuit Breaker
Use decoupling middleware
We are using
Use in your clients.
Set a low time.