Sunday, December 7, 2014

Graphite scaling and my evaluation of the Zipper Graphite stack

Hello!
Many Dev/Ops teams out there are using Graphite, a nice tool for collecting and graphing various metrics from your software and/or hardware. It is a really nice tool and a good example of solid architecture – you can check out the Graphite chapter from the famous AOSA book.
But it's also not a big secret that, despite Graphite being a great tool, scaling it is really not an easy task. As long as you run it on a single server everything is fine, and you can easily spread it over a couple of servers, but beyond that…
·      Problem one. If you are using whisper storage (the default and the only production-ready option), then the only way to cluster is the normal Graphite clustering mechanism. But then, after adding or removing nodes, you need to rebalance the cluster, e.g. using the carbonate tool. That's fine, but for a loaded cluster with hundreds of thousands of metrics and tens of servers it can take not hours but days or weeks, and during rebalancing the Graphite cluster will produce very funny results.
·      Problem two. Current Graphite clustering is based on HTTP requests (remote nodes query each other using the same Graphite-web engine), and the current code is quite non-optimal, especially for aggregating functions across many nodes. Fixing that is in progress: we already have @bmhatfield's patch in the 0.9.x branch (plus a not-yet-merged parallelization improvement), and another approach from @jraby which was turned into the patched graphite-web from @datacratic.
·      Problem three is relay / aggregator performance. Graphite relays and aggregators are CPU-bound (unlike carbon-caches, which are IO-bound), and because of Python's GIL a single process can't use more than one core for CPU-intensive calculations. So you end up building load-balancer-based configurations with many relay processes (see the sketch right after this list), which doesn't make things less complex, believe me.
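To show what I mean by that last point, here is a rough sketch of such a setup with haproxy in front of several single-process relays (the ports, names and instance count are invented for the example; each relay instance would also need its own section in carbon.conf):

# haproxy.cfg fragment: fan the incoming line-protocol traffic out over
# several single-process carbon-relay instances, because one Python
# process can't use more than one core
listen carbon_relays
    bind *:2003
    mode tcp
    balance roundrobin
    server relay-a 127.0.0.1:2013 check
    server relay-b 127.0.0.1:2014 check
    server relay-c 127.0.0.1:2015 check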

So, if you are facing Graphite scaling problems, you have the following options:

1. Migrate to OpenTSDB.
Unfortunately, it will require quite a big effort if you have had your Graphite installation up and running for a while and have many tools and dashboards built around it. Also, scaling HBase is a little bit harder than just throwing more servers into the pool...

2. Migrate to new storage engines.
Let me explain this option a little bit. Which alternatives do we have for now?
a)    @pyr's cyanite
Quite mature, with some production instances running. It uses Cassandra as storage, which looks like a natural choice for Graphite data because of its high write-load tolerance.
I evaluated it and found out that for our metrics it would require 4x more space for data storage. That made it a "no-go" for us at the time; I also suspected that its scalability was not so great back then (that was a year ago, now it's much better).
b)    InfluxDB
It also looks like quite a natural choice for storage - it's a native time series database with built-in sharding and clustering. Dieter Plaetinck wrote graphite-influxdb for Vimeo and runs it there in production.
In my tests InfluxDB looked much better as storage - it took almost exactly the same space as whisper, but only for one month. That's because InfluxDB still has no retentions with aggregation (current InfluxDB retentions just purge old data without aggregating it) - so if you need to query a big timespan (e.g. a year), you either have to store all data at high precision (and waste space) or do some aggregation on your own - which, of course, will hurt performance quite badly. Theoretically, you can use continuous queries for aggregation at the InfluxDB level, but support for them is not integrated into graphite-influxdb (see the sketch after this list).
And according to @dieterbe, he's running graphite-influxdb on a single node for now, so I suspect that scaling it across many nodes could be quite an adventurous journey too.
c)    ceres. It looks abandoned for now. I know that @dkulikovskiy made some changes in his own repo, including a new roll-up mechanism – and Yandex is running that at quite a big scale - but anyway, it doesn't look like the right path to take.
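As for the continuous queries I mentioned in the InfluxDB item above, in InfluxDB 0.8 they looked roughly like this (the series pattern and target names are made up, and I haven't run this particular query myself, so treat it as a sketch of the idea rather than a recipe):

-- roll raw points up into 1-hour means, one rolled-up series per original series
select mean(value) from /^stats\..*/ group by time(1h) into 1h.:series_name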

So, we still stick with whisper, and thus have no solution for problem one yet. But if you are only starting to run your Graphite – maybe it's a good idea to run some new storage in parallel – especially if you already have some Cassandra or InfluxDB in production.
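The cheapest way I know to feed a second storage in parallel is rules-based relaying in the stock carbon-relay, which simply sends a copy of every matching metric to each destination of the rule (hosts and ports below are placeholders):

# carbon.conf, [relay] section (sketch)
RELAY_METHOD = rules
DESTINATIONS = 127.0.0.1:2004:a, 10.x.y.z9:2003

# relay-rules.conf: duplicate everything to the local carbon-cache
# and to the new storage's carbon-compatible listener
[default]
default = true
destinations = 127.0.0.1:2004:a, 10.x.y.z9:2003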
What next? What about the other problems?

We also hit relay scaling problems quite fast, and after struggling with them a little bit we adopted @markchadwick's Scala-based graphite-relay – I just made it work correctly with graphite hashing. We lived on that for about a year, but then it also started to consume too much CPU.
Around that time, my boss @vlazarenko pointed me to this video from Linux.conf.au 2014 – it's only 20 minutes long and worth watching. I found out from it that Booking.com uses Graphite, and uses it under quite a big load. In this video, Devdas Bhagat also mentioned that after struggling with relay scalability they developed (and, even better, open-sourced) a new and shiny C-based graphite relay named carbon-c-relay. (Edit: at first I mistakenly called it graphite-c-relay, d'oh... the real name is and always was carbon-c-relay.) We started using it immediately, and from that time up to now it has worked very well.
Its main contributor, Fabian Groffen, has also implemented aggregation and regexes over the last couple of months, so for now carbon-c-relay looks like a complete and pretty sane alternative to the Python-based relay/aggregation daemons. I can find only one downside – its output is still line-based and not pickle-based, but that's completely OK for us. Edit: as @grobian mentioned (and I completely agree), pickle is insecure and bloated - the line protocol is better in this case.
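To give an idea of what its configuration looks like, here is a minimal sketch with consistent-hash routing to three backends and one aggregation rule (hosts, ports and the regex are invented for the example; check the carbon-c-relay documentation for the exact syntax of your version):

# carbon-c-relay config sketch
cluster graphite
    carbon_ch replication 1
        10.x.y.z1:2003
        10.x.y.z2:2003
        10.x.y.z3:2003
    ;

aggregate ^servers\..*\.cpu\.user$
    every 60 seconds
    expire after 75 seconds
    compute sum write to
        servers.all.cpu.user
    ;

match *
    send to graphite
    ;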

So, the third problem is solved for now.
But I really wondered how Booking.com copes with the second scaling problem, and after following @grobian on GitHub I found out how. It seems that they have rewritten most parts of the Graphite stack in Go! And the results are quite impressive – they're running about 90 backend servers with more than 55TB of whisper files!
And it looks like this project was implemented by only two people - Damian "@dgryski" Gryski and @grobian. I contacted Damian, asked a couple of questions and checked how their solution works. He also said that he and @grobian would write a blog post about their stack, but they still haven't - so I'll try to do so. :)

So, what does a normal Graphite cluster stack look like? I'll take part of Jamie's picture to illustrate:

Do you see those 3 graphite-web servers below? They're communicating with each other (or a single frontend with all backends), so just imagine what happens if you have 10-20-30... backend servers instead of two - you can imagine how slow rendering gets then.
Then check out Booking.com's solution (it needs some cool name; I will call it zipper-stack for brevity). Please also bear with my non-existent painting skills:


See? Graphite-web talks to a single daemon named carbonzipper. It talks to all the backends, but not over the old "pickle-over-HTTP" protocol - it uses a new "protobuf-over-HTTP" protocol instead - so on the backends we have a separate daemon that speaks it, named carbonserver. They also have a special carbonapi daemon - it talks to the zipper daemon too, but it has a subset of Graphite functions re-implemented in Go, so it is blazingly fast - you can switch all your monitoring queries (the ones rendered as text and not PNG) to it.
So, looks good - let's deploy it.
As we are only doing an evaluation we did it quick-and-dirty – just compiled the binaries and ran them on a server; for production you'll need proper packaging and configuration, of course. Also, this is not a manual for Graphite installation – I assume that you already have a working Graphite cluster and just want to check how the zipper stack works.
The initial build is easy - go to your build VM, install and set up Go there (with GOPATH set, e.g. to ~/go, so the binaries land in ~/go/bin) and run:
root@vagrant-ubuntu-precise-64:~# go get github.com/dgryski/carbonapi
root@vagrant-ubuntu-precise-64:~# go get github.com/dgryski/carbonzipper
root@vagrant-ubuntu-precise-64:~# go get github.com/grobian/carbonserver
root@vagrant-ubuntu-precise-64:~# ls -al ~/go/bin/
total 53552
drwxr-xr-x 2 root root    4096 Nov 28 16:32 .
drwxr-xr-x 5 root root    4096 Aug  8 13:20 ..
-rwxr-xr-x 1 root root 8796848 Oct 28 16:38 carbonapi
-rwxr-xr-x 1 root root 8526040 Oct 28 16:36 carbonserver
-rwxr-xr-x 1 root root 8515904 Oct 28 16:37 carbonzipper

Copy the carbonserver binary to the graphite backends, and the carbonzipper and carbonapi binaries to your frontend. If you do not have a separate frontend - just make one: install a separate server with graphite-web and put the binaries there.

Go to the backends first and run carbonserver there - do not forget to run it under the same user as the carbon-caches (e.g. in screen when testing):
~$ ./carbonserver -p=8080 -stdout=true -v=true -vv=true -w="/opt/graphite/storage/whisper"
2014/12/07 16:13:10 starting carbonserver (development build)
2014/12/07 16:13:10 reading whisper files from: /opt/graphite/storage/whisper
2014/12/07 16:13:10 set GOMAXPROCS=12
2014/12/07 16:13:10 listening on :8080

Next, run carbonzipper on the frontend. Create its config file first:
~$ cat zipper.json
{
    "Backends": [
        "http://10.x.y.z1:8080",
        "http://10.x.y.z2:8080",
        "http://10.x.y.z3:8080"
    ]
}
I think you get the pattern. :)
Then run it in debug mode:
~$ ./carbonzipper -c="./zipper.json" -p=8080 -stdout -d=3
2014/12/07 15:58:44 starting carbonzipper (development version)
2014/12/07 15:58:44 setting GOMAXPROCS= 1
2014/12/07 15:58:44 querying servers= [http://10.x.y.z1:8080 http://10.x.y.z2:8080 http://10.x.y.z3:8080] uri= /metrics/find/?format=protobuf&query=%2A
2014/12/07 15:58:44 listening on :8080

Now you need to patch graphite-web a little bit - by default, putting your own IP into CLUSTER_SERVERS is not allowed (local addresses are filtered out) - and that's exactly what we need to do:

--- a/webapp/graphite/storage.py
+++ b/webapp/graphite/storage.py
@@ -31,7 +31,7 @@
   def __init__(self, directories=[], remote_hosts=[]):
     self.directories = directories
     self.remote_hosts = remote_hosts
-    self.remote_stores = [ RemoteStore(host) for host in remote_hosts if not is_local_interface(host) ]
+    self.remote_stores = [ RemoteStore(host) for host in remote_hosts ]

     if not (directories or remote_hosts):
       raise ValueError("directories and remote_hosts cannot both be empty")

Then run your frontend with CLUSTER_SERVERS = ['127.0.0.1:8080'] in local_settings.py.
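For clarity, this is the only change in local_settings.py on the frontend for my test (the "before" values are just an example of what a pre-zipper config might look like):

# local_settings.py on the frontend (sketch)
# before: CLUSTER_SERVERS = ['10.x.y.z1:80', '10.x.y.z2:80', '10.x.y.z3:80']
# now everything goes through the local carbonzipper instead:
CLUSTER_SERVERS = ['127.0.0.1:8080']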

Everything is prepared; you can test your frontend with normal graphs.
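If you prefer the command line to clicking around in a browser, a quick smoke test with the standard render API looks something like this (the target is just an example - any metric you actually have will do, and replace <frontend> with your frontend's address):

~$ curl -o test.png 'http://<frontend>/render/?target=carbon.agents.*.metricsReceived&from=-1h'
~$ curl 'http://<frontend>/render/?target=carbon.agents.*.metricsReceived&from=-1h&format=json'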
You can check how carbonapi works too, but first check which functions have been re-implemented in Go in https://github.com/dgryski/carbonapi/blob/master/expr.go:

~$ ./carbonapi -z="http://localhost:8080" -stdout=true -p=9090 -tz="Europe/Amsterdam,3600"
2014/12/07 16:01:27 starting carbonapi (development build)
2014/12/07 16:01:27 using zipper http://localhost:8080
2014/12/07 16:01:27 using fixed timezone Europe/Amsterdam, offset 3600
2014/12/07 16:01:27 listening on port 9090
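Then you can throw the same kind of render query at carbonapi on port 9090 - just stick to functions from expr.go and to text formats like json (again, the target is only an example):

~$ curl 'http://localhost:9090/render/?target=sumSeries(carbon.agents.*.metricsReceived)&from=-1h&format=json'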

That's mostly it.
But I want to mention one important thing. When I tested my graphs on the zipper stack with curl, the rendering speed was quite good. But when I looked at my last-hour graph generated by the zipper-stack instance with my own eyes, it looked like this:


The normal one looks like this:

Huh? Do you know what happened there? I do, unfortunately.
As I mentioned, Graphite is a really good piece of software and uses quite good engineering tricks to handle quite a big load in pure Python, without even dropping to C. The best-known part of this trick is carbon-cache. When you push metrics into Graphite, they usually are not flushed to disk instantly but go to RAM first, via the carbon-cache daemon, which keeps them in memory, flushes them to disk periodically, and answers queries from graphite-web by merging the on-disk and in-memory results.
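In a stock cluster that in-memory part is wired up with standard settings roughly like this (the values shown are just the usual defaults/examples, nothing special):

# carbon.conf, [cache] section on each backend - graphite-web asks this
# port for the datapoints that are still sitting in RAM
CACHE_QUERY_PORT = 7002
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500

# local_settings.py of graphite-web on the same backend:
# CARBONLINK_HOSTS = ["127.0.0.1:7002:a"]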
As you can see on my diagram, there are no lines from carbonzipper to the carbon-caches in the zipper stack. That's right - carbonserver just reads whisper files from disk and doesn't ask the carbon-caches! It seems that on Booking.com's instances the flush delay for every single metric is below 60 seconds, so with "1-minute retention" whisper files (which both they and we are using) every file is updated and fresh after each minute. But our Graphite installation is different - we are using SAN disks and plenty of RAM instead of SSDs, so for us it can take up to 40 minutes to flush a metric to disk, and that's why we get that gap at the end of "1-hour" graphs…
So, for us it doesn't look like a viable solution, alas! Unless we implement a carbon-cache interface for carbonserver in Go. But at the moment I tested the zipper stack we were running an internal version of the hack which shortly became PR #1010, and it works quite well for now, so maybe I'll return to this later. :)
Edit: @grobian is working on a Go-based write daemon now - https://github.com/grobian/carbonwriter - so maybe it could be combined with carbonserver into something similar to carbon-cache.
But YMMV, of course; just try the zipper stack - it looks very good and promising.





8 comments:

Unknown comments...

Hi!

I'll try to add my $0.02:
1) You can make graphite-web/api, carbon, etc. a lot faster by using pypy 2.4+ instead of python as the interpreter. In my tests it consumes 3x less CPU, but uses more RAM (about twice as much).
2) About Ceres - yes, upstream is dead, though it's also a lot faster than Whisper. I've done tests recently: you can achieve write performance at least on par with influxdb 0.8, and read performance is several times better for ceres. Though I haven't finished my tests, and the only things I've tested more or less properly were Ceres and InfluxDB. Also, ceres data is a bit more compact and easier to maintain (one metric per folder is definitely easier than several metrics per file). But it has the same problems with rebalancing and fetching data as Whisper. And it has been used at Yandex in production since late 2013, I think, though they are using not the most optimal layout, but decent hardware. By migrating to pypy, throwing away carbon-relay in favor of Dieterbe's carbon-relay-ng or carbon-c-relay, and caching more aggressively, it could achieve at least 3x the performance (and I think even more) they have now (and they have 1M datapoints per minute per server, but the server has quite good specs - 2xE5-2660, 64GB RAM, RAID 10 of 240GB Intel SSDs).

Also you are right, it's better to have 2-4 SSDs instead of having SAN/Shelf/etc.

And I think the main problem of Graphite is that it doesn't have a proper backend right now. All the solutions have their flaws, and if somebody writes a normal backend, you'll be able to achieve at least 10x the performance you currently can. Cassandra has its own problems, InfluxDB doesn't behave very well under mixed load and even its write performance is far from what it could be. Whisper and Ceres are written in Python and are CPU-bound; they could also have a better caching mechanism that would reduce IO pressure a lot.

deniszh comments...

Cool!
Thanks a lot, Vladimir, for your additions!
1. I'll definitely try pypy.
2. About Ceres - not only is upstream dead; much worse, it seems nobody cares...
How does the space efficiency of Ceres compare to InfluxDB?
3. I absolutely agree with your point about the backend. I still have an idea to make a prototype of a new backend on Riak 2 or Elliptics / HistoryDB...

Unknown comments...

2. Yandex cared, and dkulikovsky did a lot of work to make it stable. There are also several groups that use it a lot. I've sent them some PRs that should make it compatible with graphite-api, and so on. I was working at Yandex and we used it there as-is (in a different team, so we had only 2 servers, including a sort of "replica"), and at my current company I'm using a modified ceres, with some fixes, "yet another rewrite of rollup" and "yet another rewrite of merge". It has been handling a load of approx. 500k datapoints/minute on last year's variant of Hetzner's EX-40 SSD (i7-2600, 32GB RAM, 2x120GB OCZ Vertex 3) for 3 months now. We also get a lot of read requests (10 rps to graphite-api on average, with peaks of more than 100). In terms of space efficiency - ceres uses approx. 140 bytes for metadata (per node, one file, so it's in fact the cluster size) and then 8 bytes per point, constantly. In my experience it's at least twice as space-efficient as InfluxDB 0.8 with the RocksDB engine. And in terms of read performance too, both ceres and whisper are far better than Influx.
3. I have the same thoughts, but I'd like to design and implement my own database, partly based on Ceres concepts (plus some ideas I got while working at Yandex), and write it in C++. Though I don't have time for that right now :) In fact, people at Yandex are thinking about migrating to one of Yandex's in-house databases, which can handle approx. 70M datapoints/minute on the same node (it's purely network-bound in that case, and you'll also need to figure out where to store that amount of data :) ), so it's definitely possible to get far more performance than we have right now with Ceres/Cassandra/Influx/Whisper.

Unknown comments...

Did you forget your native language over there in your Amsterdams? Can't understand a thing)))

deniszh comments...

If you work in IT - learn English, my friend, it will come in handy. There is also a Russian-language blog, but translating texts back and forth is quite a hassle.

Unknown comments...

but carbon-c-relay's aggregation still uses only one thread :(

Unknown comments...

but carbon-c-relay still uses only one thread for aggregation...

deniszh comments...

Yes, but it's the best aggregator we have for now, unfortunately. :( The Python-based aggregator is much worse, and single-threaded too - and aggregator-friendly load balancing for the Python relay doesn't really work well either...
We're not using any aggregator at all, just aggregating some values at the collectd level, using the aggregation plugin.