Sunday, March 26, 2017

TSDB on PostgreSQL? But why?

This week I stumbled across two new time-series databases (TSDBs) - which is good, but both are based on PostgreSQL, which is... confusing, to be honest.

The first is named Tgres. It was created by Grisha Trubetskoy and has already reached v0.10.0 after 18 months in development, which is quite impressive (no joke). It's written in Go, partially Graphite-compatible, and even outperforms "ye olde Graphite" on a single EC2 i2.2xlarge instance (not go-carbon, just plain Python Graphite, which is not quite fair IMO). It's also based on the same rollup-archive idea as Whisper and is more efficient as storage - around 8 bytes per point (Whisper uses 12 bytes per point).
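For reference, here's a minimal Go sketch of where those per-point numbers come from, assuming Whisper's documented on-disk point layout (a 4-byte Unix timestamp plus an 8-byte float64 value) and a value-only slot whose timestamp is implied by its position in the archive, which is my reading of how Tgres gets down to roughly 8 bytes per datapoint. The struct and variable names are purely illustrative:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// whisperPoint mirrors Whisper's on-disk datapoint: a 4-byte Unix
// timestamp followed by an 8-byte float64 value, i.e. 12 bytes per point.
type whisperPoint struct {
	Timestamp uint32
	Value     float64
}

func main() {
	var buf bytes.Buffer

	// One explicit-timestamp point, Whisper-style.
	p := whisperPoint{Timestamp: 1490486400, Value: 42.0}
	binary.Write(&buf, binary.BigEndian, p)
	fmt.Println("whisper-style point:", buf.Len(), "bytes") // 12

	// A round-robin slot that stores only the value and derives the
	// timestamp from its position (start time + slot index * step),
	// which is roughly how an array-of-values layout gets to ~8 bytes.
	buf.Reset()
	binary.Write(&buf, binary.BigEndian, float64(42.0))
	fmt.Println("value-only slot:   ", buf.Len(), "bytes") // 8
}
```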

I'm totally fine with people who want to develop something, and I'm not going to say that Grisha doesn't understand what he's doing - he's an experienced developer and Tgres looks very impressive. But to be honest, the whole rationale behind Tgres really puzzles me. You can read it at the link above (you can jump to the part named "Avoid Solving the Storage Problem", but the whole article is worth reading).
Grisha says: "Someone once said that “anything is possible when you don’t know what you’re talking about”, and nowhere is it more evident than in data storage. File systems and relational databases trace their origin back to the late 1960s and over half a century later I doubt that any field experts would say “the storage problem is solved”. And so it seems almost foolish to suppose that by throwing together a key-value store and a consensus algorithm or some such it is possible to come up with something better? Instead of re-inventing storage, why not focus on how to structure the data in a way that is compatible with a storage implementation that we know works and scales reliably?"
With all respect, I think that's the wrong direction. Yes, filesystems and databases have been in development since the 1960s - and what result do we have? The storage problem is indeed not solved, but saying "OK, screw it, let's build something on top of a weak foundation and hope it'll be fine" is also wrong.
I think the storage engine is the most important part of any database: it both defines and limits the DB - relational, columnar or time-series, it doesn't matter. Whisper is a good example. It has its weak points (e.g. no sub-second resolution, IO intensity, 12 bytes per point, local storage only) - and its good points (quite good speed, built-in rollups). Most Graphite users know these limitations very well; on the one hand they constrain how far Graphite can go, but on the other hand they gave rise to the whole new generation of TSDBs and monitoring solutions that have been flourishing lately.
In the same way, Tgres inherits all the scalability flaws that PostgreSQL (like any relational database) has: good vertical scalability, but quite weak horizontal scalability. Yes, the author mentions clustering for Tgres, but it's the same approach we already saw with Whisper - external clustering, not something built into the storage.
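To make that distinction concrete, here's a toy Go sketch of what "external" clustering means in practice: a relay in front of the storage decides which node owns a series, much like carbon-relay does in front of Whisper. The backend addresses and the hashing scheme are purely illustrative, not how Tgres or carbon actually route:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickBackend illustrates external clustering: the routing decision
// (which node owns a series) lives outside the storage engine, in a
// relay/proxy layer. The storage on each node knows nothing about it.
func pickBackend(metric string, backends []string) string {
	h := fnv.New32a()
	h.Write([]byte(metric))
	return backends[int(h.Sum32())%len(backends)]
}

func main() {
	// Hypothetical storage nodes.
	backends := []string{"store-a:2003", "store-b:2003", "store-c:2003"}
	for _, m := range []string{"app.requests.count", "app.errors.count"} {
		fmt.Printf("%-20s -> %s\n", m, pickBackend(m, backends))
	}
}
```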

The other PostgreSQL-based database, named TimescaleDB, looks a bit better - it's still based on Postgres, although it uses its own storage engine with built-in clustering and sharding. You can check their paper, it's quite interesting. Right now it looks like early InfluxDB, but the authors say their approach is better because you can use the full power of real SQL across all your time series (see the sketch below).
TimescaleDB is quite young, less than 6 months in development, so maybe we'll get something useful out of it. They have a good and stable foundation; let's see how it fits into the TSDB world.
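As an illustration of the "real SQL" argument, here's a hedged Go sketch of what querying a TimescaleDB hypertable could look like through the standard Postgres driver. The connection string, table and column names are hypothetical; create_hypertable() and time_bucket() are used here as TimescaleDB documents them:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq" // TimescaleDB speaks the plain Postgres wire protocol
)

func main() {
	// Hypothetical connection string and schema.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/metrics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// An ordinary table, then turned into a hypertable partitioned on the time column.
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS conditions (
		time        TIMESTAMPTZ NOT NULL,
		device_id   TEXT,
		temperature DOUBLE PRECISION
	)`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`SELECT create_hypertable('conditions', 'time')`); err != nil {
		log.Fatal(err) // errors if it is already a hypertable; fine for a sketch
	}

	// The "SQL power" part: a plain aggregate with GROUP BY, using
	// TimescaleDB's time_bucket() to roll raw points into hourly buckets.
	rows, err := db.Query(`SELECT time_bucket('1 hour', time) AS bucket,
	                              device_id,
	                              avg(temperature)
	                       FROM conditions
	                       GROUP BY bucket, device_id
	                       ORDER BY bucket`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var bucket time.Time
		var device string
		var avg float64
		if err := rows.Scan(&bucket, &device, &avg); err != nil {
			log.Fatal(err)
		}
		fmt.Println(bucket.Format(time.RFC3339), device, avg)
	}
}
```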

I still hold a strong opinion that in the database world the storage engine is king, and horizontal scalability is a must for any modern data software.