Lies, Damn Lies and Database Benchmarks

57 points by eigenBasis 5 days ago|28 comments

•

bitlad 3 days ago

Reminds me of the recent Terminal Bench controversy [1][2][3]

If there's a benchmark, people will cheat, lie, and optimize for that benchmark. Honesty depends on the compliance enforced on teams. But if compliance itself is weak, it is going to be taken advantage of. It's like growing up in India: you would optimize for the exam and not for what you learn from it.

[1] https://news.ycombinator.com/item?id=47920787

[2] https://www.tbench.ai/news/leaderboard-integrity-update

[3] https://debugml.github.io/cheating-agents/

•

SOLAR_FIELDS 2 days ago

There are just too many nuances to take any measured difference of less than an order of magnitude seriously without further investigation. Even a 2x can come down to a simple configuration change for these things. Usually differences of more than one order of magnitude are large enough that you cannot hand-wave them away unless there is a grossly obvious oversight in configuration.
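
A rough sketch of that rule of thumb, purely for illustration. The function names, the labels, and the 1.0 threshold (one order of magnitude) are assumptions made for this example, not part of any benchmark suite:

```python
import math


def orders_of_magnitude(baseline_s: float, candidate_s: float) -> float:
    """Size of the gap between two timings, in powers of ten."""
    return abs(math.log10(baseline_s / candidate_s))


def verdict(baseline_s: float, candidate_s: float, threshold: float = 1.0) -> str:
    """Label a result by whether the gap reaches the (assumed) threshold."""
    if orders_of_magnitude(baseline_s, candidate_s) >= threshold:
        return "hard to explain away"
    return "needs investigation"


# A 2x gap is about 0.3 orders of magnitude, so it stays in the
# "could be a config change" bucket; a 20x gap does not.
print(verdict(2.0, 1.0))
print(verdict(20.0, 1.0))
```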

•

puzpuzpuz-hn 3 days ago

Exactly! The task gets even trickier when you're benchmarking lots of systems of different kinds: cloud databases, self-hosted ones, embedded engines, CLI tools.

•

jaapz 3 days ago

Anyone here using QuestDB in production? What is your use case? What is your experience?

We want to migrate away from InfluxDB eventually (because of their 180 on OSS, and their tendency to reinvent the product every major release), and QuestDB seems like an interesting option.

•

hansvm 3 days ago

I used it for a while in "prod" for a toy project. I was scraping nearly every sale across every Target location for a while. It was fast and easy to use. At the time, many (read) queries had bugs of various kinds, requiring strange workarounds to get certain joins and whatnot to work (and not just throwing errors either -- often reporting zero data when there was data there, things like that), and they weren't really composable. Their CEO (CTO?) responded when I said as much at one point and mentioned that they had just spent a lot of time fixing all their query logic and writing enough tests that it's actually usable now. I haven't had a reason to check yet, but the next time I'm doing any time-series thing and don't want to write the data layer myself, I probably will.

•

gandreani 2 days ago

Man, I have the strangest déjà vu with this comment. I swear it's like the third time I've read this??

•

hansvm 2 days ago

At most the second from me, and last time I didn't talk about the project (haven't talked about the project online at all IIRC).

•

gandreani 2 days ago

I believe you! I wasn't implying anything about your project. I just can't quite shake the feeling.

•

hansvm 2 days ago

I'm probably not the only culprit then :) I'd be curious to know who else has done things like that.

•

simplesocieties 2 days ago

Been using it for half a year now in prod to collect sensor data from IoT devices.

My only complaints are:

1) Memory usage is a bit high. We went with the AWS instance they recommended in the docs and even that went over our provisioned memory. It's not much, but I think it could be improved.

2) You need to buy their enterprise plan if what you're storing is remotely sensitive like health data, PII, etc. Any row level security or credential features are locked behind that license. Our use case isn't that sensitive so we can get away with putting it in a VPN and password protecting it, but if you need DB-level security the FOSS license is severely behind Postgres in terms of features.

Other than that, it's never gone down, it's very, very fast, and it comes with its own web UI for querying your data. We migrated from AWS Timestream and couldn't be happier with the switch.

•

kjellsbells 3 days ago

The database wars of the late 1990s were full of this kind of stuff. Oracle, Sybase, IBM etc invested heavily in tuning specifically for benchmarks like TPC-C just so they could post ads in the Wall St Journal saying theirs was faster.

I do sympathize with OP, though. Their objection to measuring cold-start queries is incomplete without also describing how often a cold start needs to happen. If you restart once every five years, then it doesn't matter as much if it takes 20 minutes to be warm. If you restart every hour, that would be a real problem.
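
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the simplest possible model, where the system is "cold" for a fixed warm-up period after every restart; the function name is made up for this example:

```python
def cold_fraction(warmup_minutes: float, restart_interval_minutes: float) -> float:
    """Share of wall-clock time spent warming up, capped at 100%."""
    return min(1.0, warmup_minutes / restart_interval_minutes)


MINUTES_PER_YEAR = 365 * 24 * 60

# 20-minute warm-up, restarted once every five years vs. once an hour.
rare = cold_fraction(20, 5 * MINUTES_PER_YEAR)
hourly = cold_fraction(20, 60)

print(f"every five years: {rare:.4%} of the time cold")
print(f"every hour:       {hourly:.1%} of the time cold")
```

With these inputs, the hourly case spends about a third of its time cold, while the five-year case is well under a thousandth of a percent.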

•

ozgrakkurt 3 days ago

The dataset they use is <14GB of parquet [1] so the "cold start" seems to be intended to also measure having a dataset that doesn't fit in memory in a way.

I don't think this is an oversight; it is just what they found to be feasible. This is explicitly written in [1]. Also, the guy who set up this benchmark is very serious about benchmarking under difficult conditions [2].

My personal opinion is that you need a massive amount of data and massive number of different variables to test for separately. For example you might want to monitor how many cache misses/hits there were, p99 latency etc. And you want to do it under full load, expected load etc. And you want to compare the different versions of the same database because comparing different databases makes things combinatorially more difficult, unless you have a real production use case that you are optimizing for ofc.

The Swiss table talk at CppCon is a good example of a useful benchmark and optimization that shows how difficult it is to really assess the performance effects of even "small" changes. [3]

[1] https://github.com/ClickHouse/ClickBench#data-loading

[2] https://www.youtube.com/watch?v=CAS2otEoerM

[3] https://www.youtube.com/watch?v=ncHmEUmJZf4

•

hilariously 3 days ago

Yeah, the tl;dr is that benchmarking is freaking hard because what you actually care about is "does my workload today and in the future run better or worse given the current setup?" But you first have to identify what your workload actually is, what systems you are going to be allowed to run it on, and what tweaks would even be possible if you know the internals of a system and how it aligns with your hardware. And it all comes with the price tag of "and if you do anything different tomorrow with any of these variables it might not hold."

•

fragmede 3 days ago

Yeah, but also, I want to know the p50 warm performance, not just the p99. Run the same query twice in a row after cold start. And then another 10 times. Then do another different set of queries and at the end of the day, or a week, still have no real idea how the system will perform in prod for your particular use case.

Benchmarking is hard, no argument from me!
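
For what it's worth, the "run it once cold, then N more times" loop is easy to sketch. This is a minimal illustration, not a real harness: `run_query` is a placeholder for whatever executes your query, the nearest-rank percentile is one choice among several, and nothing here actually drops OS or database caches before the first run (you would have to do that yourself).

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def time_call(fn):
    """Wall-clock seconds taken by a single call to fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def cold_then_warm(run_query, warm_runs=10):
    """Time the first (assumed cold) run, then warm_runs repeats."""
    cold = time_call(run_query)
    warm = [time_call(run_query) for _ in range(warm_runs)]
    return {
        "cold_s": cold,
        "warm_p50_s": percentile(warm, 50),
        "warm_p99_s": percentile(warm, 99),
        "warm_samples": warm,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real query against your database.
    result = cold_then_warm(lambda: sum(range(100_000)))
    print(result["cold_s"], result["warm_p50_s"], result["warm_p99_s"])
```

With only 10 warm samples the nearest-rank p99 is just the slowest run, which is one more reason a handful of repeats tells you little about prod.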

•

hilariously 3 days ago

Yep, I actually want to know the system has some sort of baseline performance that only hockey sticks under conditions I can monitor and control... but also the business wants to try new feature X and vendor is promising new performance for feature Y, and new patches are coming in affecting ???.

•

fragmede 2 days ago

And oh, you just have to set this one setting, except you need to pay a consultant who knows all the settings in order to tweak and tune the database for your use case. Still, the TPC competitions that Oracle won because they managed to hand-tune assembly from five instructions down to four were an impressive bit of computer engineering nerdery.

•

Boxxed 3 days ago

ClickBench is of very limited utility already because it doesn't have a single join in it. Which is maybe less weird in the context of ClickHouse not being great at joins.

•

sdairs 3 days ago

ClickHouse is not bad at joins these days. But it's true, maybe ClickBench needs some joins and larger data sets these days.

•

N_Lens 3 days ago

Same with LLM benchmarks these days.

•

Metaluim 3 days ago

Well, the pelican benchmark is easily verifiable.

•

echoangle 3 days ago

Kind of hard to judge though, it’s not really objective how good a pelican looks.

•

supercoco9 3 days ago

Or a bicycle!

•

simplesocieties 2 days ago

Having déjà vu reading this. Remember just recently when SpacetimeDB fudged their benchmark numbers for their 2.0 release?

https://www.youtube.com/watch?v=C7gJ_UxVnSk

•

Devont 2 days ago

SpacetimeDB's “benchmarks” might be the most egregious, blatantly misleading benchmarks I’ve ever seen. Lost a lot of respect for the team behind it.

•

dkdcdev 3 days ago

See also “Fair Benchmarking Considered Difficult: Common Pitfalls In Database Performance Testing” by the DuckDB folks, with a classic Figure 1.

•

puzpuzpuz-hn 3 days ago

Thanks for the reference. Will check!

•

ozgrakkurt 3 days ago

Really respectable writing and perspective. QuestDB blog posts that get posted here never disappoint.

•

puzpuzpuz-hn 3 days ago

Thanks! We do our best to be as transparent as possible when it comes to benchmarking.