At Real Geeks we have lots of internal dashboards: Graphite, Logstash, Munin, and some other internal tools with status pages. Obviously, I didn’t want these open to the public. I was using one of two different ways of restricting access, and neither of them worked very well.
Failed Technique #1: Require a password
The first technique I was using was to set up HTTP Basic Auth passwords for our employees, using a hashed password file sitting on the server. This was relatively simple to set up, but also a huge pain. Every time someone forgot their password, I would have to update the server configuration (at first by SSHing into the server, later by changing the Puppet configuration). Adding or removing employees was a manual process too. And unless the server is running on HTTPS, the passwords are sent in plain text.
Failed Technique #2: Restrict access with a firewall to IP addresses
I also tried restricting access to these dashboards by requiring certain IP addresses in the firewall. This solved some of the problems with technique #1 but introduced new ones. Employees working from home or on their phones couldn’t access the dashboards. An office VPN could probably solve this, but setting one up just for dashboard access is a lot of work. Also, our office IP address would occasionally change, which again meant a bunch of configuration changes. Our firewall rules filled up with random residential IP addresses so employees could connect from home.
The solution: OAuth2
After feeling the pain for a while, I knew there had to be a better way. All of our technical employees have accounts in the Real Geeks GitHub organization, and GitHub has an OAuth2 provider. If you’re not familiar with OAuth2, it’s the protocol behind those “Log in with Google/Facebook/etc” buttons you see on a lot of websites. It allows a website to delegate authentication and account management to another service, in my case GitHub.
The next challenge was to come up with a way to add OAuth2 support to all of my projects. I didn’t want to run forks of Graphite, Logstash, and Munin. I wanted my solution to be easy to deploy, lightweight, and secure.
After a bit of thinking and searching around, I found google_auth_proxy. It’s an open-source project written in Go that almost solves my problem. It acts as a guardian that sits in front of your application, requiring the user to authenticate with Google’s OAuth2 server before allowing them to continue. Once they have authenticated, it stores a session in an HMAC-signed cookie and proxies requests through to the backend.
The only problem is, google_auth_proxy only works with Google’s OAuth2 servers. So I forked it and made a new project, oauth_proxy. oauth_proxy is intended to work with any OAuth2 provider, not just Google. Since it’s written in Go, it compiles to a static binary that can be easily deployed.
Here is an example invocation of the oauth_proxy server:
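Something like the following (the flag names here are my reconstruction of what the description below calls for; check the project’s --help output for the real spellings, and the hostname is made up):

```shell
./oauth_proxy \
  --client-id="<your-github-client-id>" \
  --client-secret="<your-github-client-secret>" \
  --upstream="http://127.0.0.1:8080/" \
  --cookie-secret="<long-random-string>" \
  --login-url="https://github.com/login/oauth/authorize" \
  --redirect-url="https://dashboards.example.com/oauth2/callback" \
  --redemption-url="https://github.com/login/oauth/access_token"
```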
The client ID and client secret come from the GitHub OAuth control panel. Upstream points to the upstream server I’m protecting; it should only be listening on localhost. The cookie secret is used to sign the session cookie. The login URL is the authorization endpoint of the OAuth2 provider, and the redirect URL should point to the redirect handler on oauth_proxy (it’s always /oauth2/callback, but the hostname changes depending on your server). Finally, the redemption URL is your OAuth2 provider’s token redemption endpoint, where the code from the redirect gets posted and exchanged for an access token.
Getting fancy with auth tokens
Sometimes, getting an auth token from the OAuth2 provider is not sufficient to grant access to your backend. For example, maybe you want to make sure that the person authenticating belongs to a specific GitHub organization. For this, you can create a user verification command. You can see an example script in contrib that verifies that a given access token belongs to a user in a specific organization. In order for this script to work, I have to request the “user” scope from GitHub.
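As a sketch of what such a verification command might do (this is my own Python approximation, not the actual contrib script, and the org name is made up), using GitHub’s /user/orgs API:

```python
import json
import urllib.request

def user_orgs(access_token):
    """Fetch the org logins visible for this token (requires the "user" scope)."""
    req = urllib.request.Request(
        "https://api.github.com/user/orgs",
        headers={"Authorization": "token " + access_token},
    )
    with urllib.request.urlopen(req) as resp:
        return [org["login"] for org in json.load(resp)]

def is_member(org_logins, required_org):
    """True when the required organization appears in the user's org list."""
    return required_org in org_logins

# A verification command would exit 0 when
# is_member(user_orgs(token), "RealGeeks") is True, and non-zero otherwise.
```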
This was my first time using Go, but I found it fairly easy to figure out what was going on. The go fmt command is pretty amazing, especially for a Go beginner with no idea about Go style. I would like to eventually improve the login screen, since right now it’s just a small button in the corner that says “login”, which is pretty boring. There are probably a lot of improvements to be made (better support for other OAuth2 providers?), so please send a pull request!
I’ve been reading through the second edition of The Joy of Clojure for the ClojureHNL meetup group and wanted to see how I could improve my Vim setup for editing Clojure code. This is mostly a note to myself but I thought someone else might find it interesting. The impetus behind this was feeling the pain of trying to balance parentheses and seeing some Emacs jockeys kick ass with nREPL and wanting in on the action. So, without further ado, here is what I did.
1. Install pathogen
If you’ve played with Vim plugins at all, you probably already know about Pathogen, but if you don’t, you’re in for a treat. Pathogen vastly simplifies installing and managing your Vim plugins. Follow the installation instructions and rejoice.
2. Run the newest version of vim
Vim versions newer than 7.3.803 ship with the vim-clojure-static runtime files. These tell Vim how to indent and syntax-highlight Clojure.
3. Install vim-fireplace
vim-fireplace is a plugin that puts a Clojure repl inside your Vim. It connects to nREPL (which is what you get when you run lein repl). Here’s how I installed it:
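With Pathogen in place, installing is just a clone into the bundle directory (assuming the standard ~/.vim/bundle layout):

```shell
cd ~/.vim/bundle
git clone https://github.com/tpope/vim-fireplace.git
```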
4. Install paredit
paredit is the magic plugin that makes it so you don’t have to worry about matching parentheses anymore. Clone it into your Pathogen bundle directory (~/.vim/bundle):
git clone https://github.com/vim-scripts/paredit.vim
I usually start by going to my Clojure project and running lein repl. This gives vim-fireplace something to connect to. Now you can type cqc to get a repl, or cqq to put the code under your cursor into the repl.
Also, indentation works and matching parentheses are automatically added. It’s actually impossible to type parens that don’t match. You can do other awesome stuff with paredit, including slurp and barf! Read more in the paredit docs.
Here’s a heavily pared down minimal vimrc that works.
execute pathogen#infect()
syntax on
filetype plugin indent on
Graphite and I have a long history. It all started when I read the amazing blog post introducing StatsD by Etsy’s engineering team. If you haven’t read it yet, I recommend it as a nice illustration of why you would want to install Graphite. If you’ve read it already, you probably have been meaning to get Graphite going but haven’t gotten around to it.
Basically, Graphite makes you feel like a badass engineer working on a huge project, intimately familiar with the inner workings of everything. Why should nuclear engineers be the only ones with awesome control panels?
Here is what Graphite looks like:
Having quick access to all these statistics helps you be more productive. But there is quite a bit of work to get off the ground. That brings me to item #1:
1. Deploying Graphite is a pain in the ass
Graphite is such a strange beast. It was developed at Orbitz (maybe to track how many airlines still give out free peanuts?) and is open-sourced under the Apache license. It is written in Python and is actually made up of 3 main parts:
Graphite-web: A Django project that provides a GUI for graphite. This is what you see when you visit a Graphite install in your browser. It also has an admin area that lets you set up users that are allowed to create and save graphs. You can deploy this like any other Django app via wsgi.
Carbon: Carbon is a daemon that runs constantly in the background. Its main job is to take in statistics from the network and write them to disk. It also does something both complicated and fascinating called “aggregation” which I will explore later.
Whisper: Whisper is a time-series database file format. It doesn’t actually run as a server or anything; it’s just a standard format for storing time-series data. These databases are a fixed size, and once the data gets old it falls off the other end. Whisper is a replacement for RRDtool, which I will talk about later. Whisper will also eventually be replaced by its successor, Ceres.
I have tried and failed a few times to get Graphite into a state where it could be deployed automatically, usually giving up with “wow, this tool is hard to deploy, and Munin is good enough anyway.”
Why is deploying Graphite a pain in the ass?
I think the main reason is that Django is a pain in the ass to deploy. After noodling around with this for a while, I figured out a pretty good way to deploy Graphite, and have made a Puppet module for installing Graphite on a RHEL 6 or CentOS 6 server.
The reason this works so well is that the EPEL repository (if you use CentOS or RHEL you pretty much have to install EPEL) has, awesomely, packaged Graphite. It’s actually in 3 packages: python-whisper, python-carbon, and graphite-web. The packages make a few assumptions, and after a bit of trial-and-error I was able to figure out what they were expecting:
Graphite and Carbon get installed in the system site-packages directory (/usr/lib/python2.6/site-packages/)
Graphite and Carbon expect their config files to end up in /etc/
You have to run manage.py syncdb after configuring the database in /etc/. This database is just used for a few things on the front-end, not actual graph data, so you don’t have to worry too much about using a real database. I just used sqlite.
Apache will be configured to serve Graphite on port 80
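For the syncdb step, that works out to something like this (the site-packages path comes from the EPEL package layout above; yours may differ):

```shell
cd /usr/lib/python2.6/site-packages/graphite
sudo python manage.py syncdb     # creates the sqlite front-end database
```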
You will probably also want to add some HTTP basic auth so creepers don’t start creeping your graphs. The puppet module will do that for you.
2. Graphite is more powerful than Munin
In the beginning, there was RRDtool. Chances are, you have seen an RRDtool graph, or are staring at one right now.
For years, I’ve been using a project called Munin. Munin is by far one of my favorite tools. It is beautiful in its simplicity. Each server you are monitoring runs a process called munin-node, and on your main graph server, a cron job periodically connects to all of these servers and requests data about them. It then writes this info into an RRDtool database and generates a graph, as well as an HTML page linking to all the graphs it has generated.
The munin-node daemon that runs on all your servers is incredibly lightweight, the network protocol is easy to debug, and writing plugins is a joy. However, the graphs that are generated are just static images.
Graphite lets you manipulate your graphs in ways I never dreamed of when I was using Munin. You can run functions on your graphs to change the time scale, summarize the data, and mash together several graphs into a frankengraph at will.
3. There must be 50 ways to feed your Graphite
Once you manage to get Graphite running, there are a lot of great ways to get more data into it. Feeding more and more data into Graphite became a bit of an obsession for me.
StatsD is the big daddy that started it all, and it works great. It’s written in nodejs, so you’ll need that running on your server. The key thing that StatsD is doing is listening to UDP packets (see the “Why UDP” section in the Etsy blog post for more on this) and batching them up before sending the statistics into Graphite. I use it to send all my application-level statistics.
Also, for some reason there are a zillion client libraries for StatsD. For Python, I like this one (pip install statsd).
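Under the hood these clients are just formatting tiny UDP packets. Here is a minimal sketch of the wire format (the metric names are made up, and real clients do more than this):

```python
import socket

def statsd_packet(name, value, metric_type="c"):
    """Format a metric the way the StatsD wire protocol expects: name:value|type."""
    return "%s:%s|%s" % (name, value, metric_type)

def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    # UDP is fire-and-forget: no connection, and no error if nothing is listening.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(name, value, metric_type).encode(), (host, port))
    sock.close()

send_metric("signups", 1)                # increment a counter
send_metric("response.time", 320, "ms")  # record a timer, in milliseconds
```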
Collectd is a daemon that runs on each of your servers and reports back info to Graphite. It comes with a bunch of plugins and there are a zillion more for specialized servers that help you track more statistics. Collectd is in EPEL, but it’s a pretty old version.
Newer versions of Collectd can connect directly to Graphite. The version I found in EPEL is a bit old, so I used bucky to connect them together.
Bucky is a Python daemon that listens for both StatsD and collectd packets, and sends them to Graphite. So it’s a replacement for StatsD but it also knows how to use Collectd metrics. It is pretty great and easy to deploy. It’s also on the EPEL repository, so it’s only a yum install away on Centos/RHEL.
Etsy also has this logster project which is a Python script that can tail your log files and generate metrics to send back to Graphite. I haven’t actually tried it but it looks interesting. They don’t seem to have many parsers though so I think you’ll have to implement some of your own. Logstash can also send to Graphite.
There are many more ways to feed Graphite in the docs.
4. Make sure that your StatsD flush interval is at least as long as your Graphite interval
This is probably the most important lesson I learned.
What the heck is a StatsD flush interval? Remember how I said that StatsD batches up statistics before sending them to Graphite? Well, you can configure how often StatsD sends those batches. By default the flush interval is 10 seconds, so every 10 seconds StatsD sends a batch of metrics to Graphite.
What happens if you start sending stats faster than Graphite’s highest resolution? Graphite will start dropping metrics.
Perhaps you’re tracking clicks on a button that says “Kevin is Awesome”. This is a very tempting button, so people click it about 5 times a second.
StatsD starts batching things: 10 seconds of people mashing the button means that each batch will have 50 clicks in it.
For this example, let’s say that Graphite’s interval is set to a minute. That means that during that minute, we will have sent 6 batches of 50 clicks for a total of 300 clicks. However, Graphite will only record one of those batches, so it will only record 50 clicks instead of 300.
To make this concept a little easier to understand, here is a little Processing sketch I whipped up that simulates the process. Events are generated at the Event Generation Rate and show up in the bottom box. They are also sent into StatsD as they happen. Next, events are flushed from StatsD into the Graphite buffer at the StatsD Flush Rate. Finally, events are written from the Graphite buffer to disk at the Graphite Storage Interval. Feel free to play around with the numbers to see how they affect the count. If your StatsD Flush Rate is smaller than the Graphite Storage Interval, you can see how stats start to get lost. Let it run for a while and watch the number of events that make it into Graphite (top box) fall far below the number that actually happened (bottom box).
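The same idea can be shown in a few lines of Python (numbers taken from the button example above): each StatsD flush lands in the same 60-second Graphite slot and overwrites the previous one.

```python
clicks_per_second = 5
statsd_flush = 10        # seconds between StatsD flushes
graphite_interval = 60   # finest Graphite storage resolution, in seconds

slot = None  # what Graphite ends up storing for this one-minute interval
for t in range(0, graphite_interval, statsd_flush):
    batch = clicks_per_second * statsd_flush  # 50 clicks per flush
    slot = batch  # each flush overwrites the slot rather than adding to it

print("clicks that actually happened:", clicks_per_second * graphite_interval)
print("clicks Graphite recorded:", slot)
```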
5. Turning off legacy namespacing
Once you start exploring the data sent into Graphite, you’ll probably notice that your counter values are stored twice. Depending on your version of Graphite and whether you have turned off something called legacyNamespace, they will either be in a folder called “stats_counts” as well as in the “stats” folder, or they will be in the “stats/counters” folder.
I think it makes more sense to have all the statsd metrics contained in the “statsd” folder. In order to do this, you have to set the legacyNamespace configuration option to false (it defaults to true).
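In StatsD’s config file, that looks something like this (only the graphite section matters here; the host and port lines are just illustrative defaults):

```
{
  graphiteHost: "127.0.0.1",
  graphitePort: 2003,
  graphite: {
    legacyNamespace: false
  }
}
```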
For the rest of this article, I’ll assume that you have turned off legacy namespacing and refer to the counter values with their non-legacy names.
But why are there two versions of the same counter stats, each with different values? That brings us to the next lesson I learned:
6. The difference between yourstat.count and yourstat.rate
In StatsD, counters keep track of events that happen, as opposed to gauges, which keep track of a value that fluctuates. There are really two interesting things about counts that we might want to graph.
How many times did this thing happen?
At what rate is this thing happening?
Yourstat.count keeps the total number of times that this event happened, and yourstat.rate keeps the rate at which it happens per second. Since yourstat.count is actually keeping track of the number of times a thing happened, it can also be thought of as a rate, but instead of happening per second, it’s happening per Graphite storage interval. This is because the count is flushed to disk every time the Graphite storage interval elapses.
Ah, but how do you configure the storage interval? Glad you asked!
7. Setting up retention
Graphite lets you be insanely specific about how long you want data kept around. You can keep one metric at per-second resolution for a year, and another at per-day resolution for a month. You can even keep a metric at high resolution for a few days, then reduce the resolution as it ages. Lower resolutions require less space.
Retention is configured in a file called storage-schemas.conf. You can read more about it in the official documentation, but here is an example:
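Reconstructed from the description that follows, the rule looks like this in storage-schemas.conf:

```
[nginx_requests]
pattern = nginx\.requests\.count$
retentions = 1s:7d,1m:30d,15m:5y
```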
This rule is named nginx_requests and matches all stats with the pattern nginx\.requests\.count$. It will keep 1-second resolution for 7 days, 1-minute resolution for 30 days after that, and 15-minute resolution for 5 years after that.
It’s also interesting that the data files that store this information have a fixed size after creation. This is good, since they won’t explode and fill up your disk while you’re on vacation. However, it means that if you ever want to change the retention of your metrics after you’ve started collecting them, you will need to resize the databases on disk manually. There is a command called whisper-resize.py that can do this for you.
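For example (the metric path here is made up; point it at the actual .wsp file you want to resize):

```shell
whisper-resize.py /var/lib/carbon/whisper/nginx/requests/count.wsp 1s:7d 1m:30d 15m:5y
```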
8. Aggregating data: What??
Here’s something potentially confusing that falls out of letting you define your retention with such complicated schemes: you have to drop your resolution when you transition from higher resolution to lower resolution. This means throwing away data. This process is called aggregation. Normally, this works fine and you don’t even have to care about it. However, there are a couple of subtle situations that can trip you up.
If you have a rate metric that tracks, for example, the number of times somebody signed up on your website per second, and you want to convert it to the number of times per minute, you can just take the rate from each of the 60 seconds in a minute and average them. This is the default aggregation method. However, what happens if you have a bunch of null values in your graph for that minute? Let’s say your network was malfunctioning for 59 seconds, so you didn’t record any events, and in the last second you recorded 100 signups. Is it fair to say that you were averaging 100 signups the whole time? Probably not. But if you only lost 1 second of data, you might be a lot more comfortable with that average. By default, you need at least 50% of the data to be non-null for the average to be stored. This is configurable by changing a setting called xFilesFactor, which is pretty much my favorite variable name ever. Why? No reason.
The other thing that can trip you up is aggregating counts. What happens when we average 60 seconds of counts? Well, we end up with only about 1/60th of the events that happened. So for count statistics, we actually want to sum the number of times that something happened. You can set this with the aggregationMethod option in storage-aggregation.conf.
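A storage-aggregation.conf rule covering both settings might look like this (the pattern is an assumption; match it against your own metric names):

```
[counts]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
```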
9. hitcount() vs summarize()
I had a bit of trouble figuring out the difference between the hitcount() function and the summarize() function in Graphite, so here it is:
hitcount() is used for rate variables, to turn them into counts as well as change the resolution of a graph.
summarize() just changes the resolution of the graph.
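As render targets (the metric names are made up), that difference looks like:

```
# turn a per-second rate series into hourly event counts
hitcount(stats.counters.signups.rate, "1hour")

# re-bucket an existing count series into hourly sums
summarize(stats.counters.signups.count, "1hour", "sum")
```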
10. Holy crap there are a lot of Graphite Dashboards
Corrections
The original version of this article stated that Collectd was unable to connect directly to Graphite. This is incorrect; newer versions have added this feature.
I also said that Graphite requires memcached. This is apparently incorrect (and I’m glad it doesn’t!)
devicenull pointed this out on Reddit: “Graphite loves SSDs. If you put it on a normal drive you’re going to have to upgrade, so just use a SSD from the beginning.” My stats server is running fine on an oldschool drive, but I can see how IO could start going crazy.