DevOps meetup September, 2013

Monitoring socks!
by Zoltan Nagy on 2 September 2013

Transcript of DevOps meetup September, 2013

$ /usr/local/monitoring/scripts/monitoring_if.sh bond0 rx bytes
3228148
sensu client
subscribes to checks
runs check scripts
publishes results
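
The client loop above (subscribe, run, publish) can be sketched roughly as follows. This is a toy illustration, not Sensu's code or API: the `ToyClient` class, the hash layout of a check request, and the in-process `Queue` objects standing in for the RabbitMQ queues are all assumptions made for the example.

```ruby
require "open3"

# Hypothetical in-process stand-ins for the RabbitMQ check and result queues.
check_queue  = Queue.new
result_queue = Queue.new

# A toy client: it only runs checks addressed to one of its subscriptions.
class ToyClient
  def initialize(name, subscriptions, result_queue)
    @name = name
    @subscriptions = subscriptions
    @result_queue = result_queue
  end

  # Runs the check command if we subscribe to it and publishes the result.
  # Returns false when the check is not meant for this client.
  def handle(check)
    return false unless @subscriptions.include?(check[:subscription])

    output, status = Open3.capture2e(check[:command])
    @result_queue << {
      client: @name,
      check: check[:name],
      output: output.strip,
      status: status.exitstatus
    }
    true
  end
end

client = ToyClient.new("host1", ["base"], result_queue)
check_queue << { name: "echo_metric", subscription: "base", command: "echo 3228148" }
check_queue << { name: "db_check", subscription: "database", command: "echo ignored" }

# Drain the check queue; only the "base" check produces a result.
client.handle(check_queue.pop) until check_queue.empty?
result = result_queue.pop
puts "#{result[:client]} #{result[:check]} #{result[:status]} #{result[:output]}"
```

In a real deployment the queues live in RabbitMQ and the server side consumes the results; the sketch only shows the shape of the client's job.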
goals
anyone can add new monitoring
hosts register themselves and their services = no manual configuration
service owners can get woken up (PagerDuty, HipChat, E-mail, ...)
useful dashboards
SaaS
didn't fit our model
immature api
inefficient web ui
riemann
data collector
riemann-tools
cpu
memory
disk
application
custom application metrics
monitoring scripts
custom system metrics
event
index
state
events have TTL
event stream processor
query
dashboard
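
To make "index", "state" and "events have TTL" a bit more concrete, here is a minimal sketch of an index that keeps the latest event per host and service and expires events whose TTL has passed. It is an illustration only, not Riemann's implementation or API: the `ToyIndex` class and the event hash keys are assumptions chosen for the example.

```ruby
# Toy index: stores the latest event per (host, service) pair and
# drops events once their TTL has elapsed. Not Riemann's actual code.
class ToyIndex
  def initialize
    @events = {}
  end

  # event is a Hash with :host, :service, :state, :metric, :time and :ttl (seconds)
  def update(event)
    @events[[event[:host], event[:service]]] = event
  end

  # Removes events whose TTL has elapsed at time `now` and returns
  # copies of them with the state set to "expired".
  def expire(now)
    expired, live = @events.partition { |_key, e| e[:time] + e[:ttl] < now }
    @events = live.to_h
    expired.map { |_key, e| e.merge(state: "expired") }
  end

  # Query the current state of one service on one host (nil if unknown).
  def lookup(host, service)
    @events[[host, service]]
  end
end

index = ToyIndex.new
index.update(host: "host1", service: "cpu",  state: "ok", metric: 0.3, time: 100, ttl: 10)
index.update(host: "host1", service: "disk", state: "ok", metric: 0.5, time: 100, ttl: 60)

expired = index.expire(115)
expired.each { |e| puts "#{e[:host]} #{e[:service]} #{e[:state]}" }
```

The point of the TTL is that a host that stops reporting does not stay "ok" forever: its last event ages out and can itself be treated as a signal.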
forwarders shipped with riemann
graphite
email
PagerDuty
Librato
another riemann server (scaling!)
lots more
Event Stream Processor
Dashboards
Graphite
monitoring "dynect_qps" do
script_template "dynect_qps.py.erb"
script_arguments "prezi.com"
script_cookbook "monitoring"
script_variables(
:dynect_user => dynect_user,
:dynect_password => dynect_password
)
alerts [
{
:maximum => 350,
:severity => :warning
},
{
:minimum => 40,
:severity => :critical
}
]
end
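
How the `alerts` list might be turned into a status can be sketched as below. The semantics are an assumption, since the slides do not spell them out: an alert is taken to fire when the value is above its `:maximum` or below its `:minimum`, and the most severe firing alert wins. The `evaluate` helper and `SEVERITY_RANK` table are hypothetical names, not part of the cookbook.

```ruby
# Thresholds copied from the resource above.
ALERTS = [
  { :maximum => 350, :severity => :warning },
  { :minimum => 40,  :severity => :critical }
].freeze

# Assumed ordering of severities, lowest to highest.
SEVERITY_RANK = { :ok => 0, :warning => 1, :critical => 2 }.freeze

# Returns :ok, :warning or :critical for a single metric value.
# Assumption: above :maximum or below :minimum fires the alert.
def evaluate(value, alerts)
  fired = alerts.select do |alert|
    (alert[:maximum] && value > alert[:maximum]) ||
      (alert[:minimum] && value < alert[:minimum])
  end
  fired.map { |alert| alert[:severity] }.max_by { |s| SEVERITY_RANK.fetch(s) } || :ok
end

[100, 400, 10].each do |qps|
  puts "#{qps} -> #{evaluate(qps, ALERTS)}"
end
```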

check queues
result queue
sensu server
schedules and publishes checks
runs handlers on results
sensu dashboard
RabbitMQ
RabbitMQ
one on each host
any number of instances
custom events
service
host 1
Chef
monitoring API
register/update
config generation
Graphite
periodic config generation
- icinga needs reload - it's slow and blocks icinga
- latency between hosts registering themselves and appearing in monitoring
NRPE
NRPE
polling
network sensitive
causes load spikes
no authentication
IP-based access control makes scaling harder
host 2
host 3
owner
alerts
easily add checks
useful dashboards
check script
wrapper script
1 metric
thresholding
service status
NRPE
lots of scripts
lots of resources
[Diagram labels: icinga master, icinga slave (x2), nsca (x2), worker, data center (x2), cloud, host (x10), check queue, result queue]
Chef
check script
wrapper script
1 metric
threshold
service status
NRPE
lots of scripts
lots of resources
Graphios
{
"broker": "icinga-ec2.prezi.com",
"id": "689",
"attributes": "{}",
"name": "graphite-i-433fdb2c.ec2-us-east-1e",
"ip": "75.101.201.119",
"resource_uri": "/api/v1/hosts/689/",
"roles": "[\"ubuntu_base\",\"site_ec2_us-east-1\",\"graphite_ec2\"]"
}
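
Note that in the record above `roles` and `attributes` are JSON documents encoded as strings inside the outer JSON, so a consumer has to decode them a second time. A minimal sketch using Ruby's standard `json` library; the variable names are only illustrative.

```ruby
require "json"

# The host record from the monitoring API, as shown above.
# The quoted heredoc keeps the backslashes exactly as they appear.
record_json = <<~'JSON'
  {
    "broker": "icinga-ec2.prezi.com",
    "id": "689",
    "attributes": "{}",
    "name": "graphite-i-433fdb2c.ec2-us-east-1e",
    "ip": "75.101.201.119",
    "resource_uri": "/api/v1/hosts/689/",
    "roles": "[\"ubuntu_base\",\"site_ec2_us-east-1\",\"graphite_ec2\"]"
  }
JSON

host = JSON.parse(record_json)

# Second decoding step: these two fields are strings containing JSON.
roles      = JSON.parse(host["roles"])
attributes = JSON.parse(host["attributes"])

puts "#{host["name"]} (#{host["ip"]}) has roles: #{roles.join(", ")}"
```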
solution
notification
status
$ /usr/local/monitoring/checks/net_bond0_rx_bytes
3228148|value=3228148;
$ cat /etc/nagios/nrpe.d/net_bond0_rx_bytes.cfg
command[net_bond0_rx_bytes]='/usr/local/monitoring/checks/net_bond0_rx_bytes'
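
The two artifacts above, the NRPE command definition and the check output with the metric after the `|`, can be sketched as follows. The helper names `nrpe_command_line` and `parse_check_output` are hypothetical, and the parser only handles the simple `label=value;` form shown here, not every variant of performance data.

```ruby
# Directory of generated check wrappers, taken from the example above.
CHECK_DIR = "/usr/local/monitoring/checks"

# Builds the NRPE command definition for one generated check.
def nrpe_command_line(check_name)
  "command[#{check_name}]='#{CHECK_DIR}/#{check_name}'"
end

# Splits "text|label=value;..." into the human-readable text and
# a Hash of metric labels to numeric values.
def parse_check_output(line)
  text, perfdata = line.strip.split("|", 2)
  metrics = {}
  perfdata.to_s.split.each do |item|
    label, rest = item.split("=", 2)
    next if rest.nil? # skip anything that is not label=value
    metrics[label] = rest.split(";").first.to_f
  end
  { :text => text, :metrics => metrics }
end

puts nrpe_command_line("net_bond0_rx_bytes")
parsed = parse_check_output("3228148|value=3228148;")
puts parsed[:metrics]["value"]
```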
icinga master
Multiple data centers
Distributed monitoring on EC2
We get woken up at 3am
Anyone can define their own monitoring
Anyone can assemble their own dashboards
check
check
check
But...
Achievements
Smaller icinga instances - faster reload - more frequent config generation?
Don't use icinga at all?
Icinga 2 may solve these
or
or
use push-style data collection
reliable transport
and proper authentication
with local scheduling
a number has no meaning
can we return multiple values while keeping it simple enough?
can we schedule monitoring more evenly?
we need some description for the values
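
One possible answer to "can we schedule monitoring more evenly?" when scheduling locally is to give every check a stable offset inside its interval, so checks do not all fire at the same second. This is only a sketch of that general idea, not something the slides describe: the `splay_offset` and `next_run` helpers and the choice of CRC32 as the hash are assumptions.

```ruby
require "zlib"

# Hypothetical helper: a deterministic offset (in seconds) within the
# check interval, derived from the host and check names.
def splay_offset(host, check, interval)
  Zlib.crc32("#{host}:#{check}") % interval
end

# First run time at or after `now` for a check that runs every
# `interval` seconds, shifted by its stable offset.
def next_run(host, check, interval, now)
  offset = splay_offset(host, check, interval)
  base = now - (now % interval) + offset
  base >= now ? base : base + interval
end

%w[net_bond0_rx_bytes dynect_qps].each do |check|
  puts "#{check}: offset #{splay_offset("host1", check, 60)}s"
end
```

Because the offset depends only on the names, a host computes the same schedule after every restart without any central coordination.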
Monitoring!
How we do it at Prezi
DevOps Meetup
September, 2013

Endre Hirling
Zoltan Nagy
abesto

grab a can of pizza and a slice of beer
Infrastructure developers at Prezi.com
(everyone knows Nagios, right?)
Icinga server
has a full inventory of hosts and services
connects to them and checks their status
receives status data
sends notifications
monitors stuff
sends metrics
schedules checks
receives status data