Hacker News
HyperDex: A Searchable Distributed Key-Value Store (hyperdex.org)
139 points by lpgauth on Feb 22, 2012 | 88 comments


From the FAQ <http://hyperdex.org/faq/>:

"So, the CAP Theorem says that you can only have one of C, A, and P. Which are you sacrificing?

HyperDex is designed to operate within a single datacenter. The CAP Theorem holds only for asynchronous environments, and well-administered datacenters enable us to sidestep this tradeoff entirely."

I'd like to see how they pull that off when a node goes down. I guess in "well-administered" data centers, nodes don't go down.

Sounds like they're sacrificing "A" to me because they're doing synchronous replication.


It's a bit of a middle ground. Yes, the replication is synchronous, which impacts availability. However, the master can remove a failed replica from the chain fairly quickly. In principle, with proper tuning, a node failure would merely cause a brief hiccup. This would feel more like a period of increased latency than a full-blown outage. So there really needn't be much sacrifice of availability.

However, there's also a sacrifice of partition tolerance. If the master is unable to communicate with any replica, the system can't serve requests. Also, the master is implemented as a collection of Paxos nodes; if these nodes are partitioned from one another, the entire system would grind to a halt.

Since this is intended for intra-datacenter use, one could argue that a full network partition might be unlikely. (Depending on what sort of data center you hang out in.) But in CAP terms, it's possible, of course.

(I base all this on the value-dependent chaining paper cited below.)


This sounds like a CP system to me. There's nothing wrong with that, by the way; I don't know why people are so reluctant to admit it.

AP systems have some useful properties, but they're also (typically) more difficult to reason about.

The "hiccups" you describe are periods of unavailability. The increased latency is caused by an element of the system waiting for the data to become available again, a totally valid strategy for coping with transient failures/partitions.

Your argument about intra-datacenter partitions being unlikely is true, but they do happen. You also make a good point about such partitions also affecting client applications. Both of these are indicative of CP systems and, like I said: there's nothing wrong with that.

Personally, I think both AP and CP distributed systems are equally interesting. What I consider a red flag is attempting to rationalize how a system "beats CAP".


That's what I was getting at. It appears to be a CP system. I wish all new distributed databases came with a nice little badge that says "CP", "AP", or "CA".

Whenever they claim to be distributed but not subject to CAP, I automatically become skeptical.

There's nothing wrong with a CP database; HBase is a CP database and it's quite popular.


To be precise, I would argue this is actually a CA system.

"C" because there are consistency guarantees, which are upheld even in the face of failures / partition.

"A" because the system will continue making progress even after a node failure. What I called a "hiccup" can be made arbitrarily short, in principle at least. The system can work around failed nodes, it does not need to wait for them to be repaired.

Not "P", because if the network falls apart sufficiently, the system will generally not be able to progress.


> Not "P", because if the network falls apart sufficiently, the system will generally not be able to progress.

Not being able to progress means it is not available.


There is no CA.

You cannot guarantee consistency and availability simultaneously in the face of network partitions. Once the line of communications is cut or overloaded (slow enough = a partition), you have to pick one or the other. It's basic physics.

If two entities can't communicate, they can't synchronize state, so one (or both) of them have to quit acting like they have a consistent view of the data.
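The forced choice can be shown with a toy model (purely illustrative, not any real system): once two replicas stop syncing, a read must either return possibly-stale data or refuse to answer.

```python
# Toy model of the C-vs-A choice during a partition (illustrative only).

class Replica:
    def __init__(self):
        self.value = None
        self.peer = None
        self.partitioned = False

    def write(self, value):
        self.value = value
        if not self.partitioned:
            self.peer.value = value  # synchronous replication to the peer

    def read(self, refuse_if_stale=False):
        if self.partitioned and refuse_if_stale:
            # CP-style choice: refuse rather than risk a stale answer
            raise RuntimeError("unavailable: cannot verify consistency")
        return self.value  # AP-style choice: answer, possibly stale

a, b = Replica(), Replica()
a.peer, b.peer = b, a

a.write("v1")                       # replicated to b
a.partitioned = b.partitioned = True
a.write("v2")                       # b never sees this update

assert b.read() == "v1"             # AP: stale read, consistency is gone
try:
    b.read(refuse_if_stale=True)    # CP: refuse, availability is gone
except RuntimeError:
    pass
```

Either branch gives something up; the model just makes the "pick one" explicit.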


It's not exactly clear from the paper what CA should mean.

I've seen people claim it means "you guarantee both consistency and availability, as long as there are no network partitions (you don't have to handle those because you haven't chosen P)". That's a supportable claim. So you can do that, but it's kind of a useless choice, because as long as there are no network partitions, both CP and AP systems can also guarantee full consistency and availability.

I lay the CAP family out thus:

* CP: on network partition, lose availability

* AP: on network partition, lose consistency

* CA: on network partition, lose both

It seems to be a common thread among distributed systems engineers who claim to have beaten CAP: "network partitions don't matter for whatever reason, so therefore I can always guarantee both availability and consistency and so CAP must be wrong yaaay!!"

Sorry, no, nice try.


> If two entities can't communicate, they can't synchronize state, so one (or both) of them have to quit acting like they have a consistent view of the data.

With the exception of quantum entanglement, of course.



That blog is quite wrong about partition tolerance. First, the definition is just bizarre:

"Handling a crashed machine counts as partition-tolerance. (Update: I was wrong about this part."

He then goes on to give Stonebraker crap about claiming that "failures" never happen, simply because he doesn't understand the difference between failures and partitions.

Look, the point of CAP is this: if you assume nothing about the network, then you cannot guarantee CA in the presence of ARBITRARY network partitions. It doesn't say that you can't provide CA under some or even many network partition scenarios. So the question you should be asking is: "what kinds of network partitions happen in practice?"

Stonebraker's point was that network partitions are such rare and wholly catastrophic events that worrying about them pulls focus away from much more practical concerns. HyperDex's point on partition tolerance (admittedly not clearly spelled out) is much more subtle: they offer tolerance of a specific class of partitions. To simply say they are "AP" or "CP" ignores the very important fact that they do in fact tolerate partitions while maintaining CA. This whole "CAP, pick any two" framing is a gross over-simplification that obscures very real distinctions like this.


What you said was right on. I just wanted to add a few things.

The coordinator is only involved in recovering from failures, so the cluster can still serve requests until server (non-coordinator) nodes start failing too.

I would also add that if there is an intra-datacenter partition so severe as to violate HyperDex's failure assumptions, it will likely impact applications built on top of HyperDex as well. Surviving such failures would require an inter-datacenter system (which could be built on top of HyperDex).


Well, first: thank you for providing another data storage possibility, and a great one at that.

I'd just like to ask - the benchmarks where HyperDex beats even Redis - these are strictly clustered benchmarks - is that true?

Or is the way HyperDex stores data so efficient, that it beats Redis even on a single core / single thread?

Thanks!


Without more context, graphs like that are pretty useless. For all we know they just invented those numbers. I'm not saying they did, but you get my point.

Those numbers seem way too low for running on the same machine, and if not, shouldn't the network be the bottleneck and show similar results for both?

I'm sure there's a reasonable explanation, just as I'm sure they picked benchmarks that make themselves look good.


"if these nodes are partitioned from one another, the entire system would grind to a halt."

Keeping consistency at the expense of availability in the event of a partition is precisely what makes this a CP system.


HyperDex uses value-dependent chaining, which offers fault tolerance properties similar to those provided by chain replication (http://www.cs.cornell.edu/home/rvr/papers/osdi04.pdf).

A single node failure will be recovered from quickly without issue. Multiple concurrent failures are handled the same as the single failure case, so long as our failure assumptions are not violated (e.g., every node in the datacenter fails simultaneously).


That can work if the server process is killed so that the master is immediately notified. But what about other failure modes? For example, if the disk has soft errors and writes start taking several seconds to complete, the system can't decide in a small amount of time that the node is dead.


In this case, the node should report its own failure.

It must be able to do so, otherwise it will be deemed faulty by the entity it reports to.


When a node dies, the master reconfigures all the servers and clients with a new topology excluding the failed node. "Operations which are interrupted by reconfiguration exhibit at-most-once semantics." So while the system is reconfiguring after a node failure, updates can be lost.

Time windows of "at most once semantics" mean the system has none of C, A, or P. Which doesn't mean it's not a good database for many purposes.


The only scenario in which the operation has "at most once semantics" is when the node the client is directly communicating with fails. No other failure is visible to the clients. Furthermore, every failure scenario provides the following guarantees:

* If the result of an operation is visible to one client, it is visible to all clients, always and immediately

* Updates to the same key are always applied in the same order on all servers.

The presence of "at most once semantics" does not harm our consistency guarantee. In the database world, this would be equivalent to a client sending the final "commit" message and then losing internet connectivity. In such a scenario, the operation may or may not happen, but the client will not know one way or the other.

Edit: Formatting of the list


The only scenario. In essentially every case of a node failing in an active database, clients will be communicating with it. When talking about durability, you have to assume things are failing in operation.


HyperDex utilizes value-dependent chains for replication. Updates move forward in the chains, while acknowledgements flow in reverse.

To issue a PUT or a GET, the client contacts the head of the chain responsible for the object it is modifying/accessing. If other nodes in the chain fail, the chain will transparently recover. If the point leader fails (the head of the chain), then the client does not know if the operation completed.

This is analogous to a database library opening a socket and sending "BEGIN; INSERT INTO data ("x", "y", "z"); COMMIT" and then the client losing connection (or crashing entirely). There is always some point at which the server may complete, and then the client may immediately crash before receiving notification that the operation is complete. Even if this happens, however, HyperDex's GET and PUT operations are linearizable.
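The flow described above can be sketched in miniature. This is a toy model of chain replication in general (after the Cornell OSDI paper cited elsewhere in the thread), not HyperDex's actual implementation:

```python
# Minimal chain-replication sketch: writes enter at the head and
# propagate node-by-node; the client's ACK comes only from the tail,
# so any acknowledged write already exists on every node in the chain.

class Node:
    def __init__(self):
        self.store = {}
        self.next = None  # next node down the chain (None = tail)

    def put(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.put(key, value)  # forward down the chain
        return "ACK"  # tail reached: the write is fully replicated

head, mid, tail = Node(), Node(), Node()
head.next, mid.next = mid, tail

assert head.put("x", 1) == "ACK"
# Every node holds the value before the client sees the ACK, so a
# single node failure cannot lose an acknowledged write.
assert all(n.store["x"] == 1 for n in (head, mid, tail))
```

The window the comment describes corresponds to a write that entered the head but whose ACK never made it back to the client; the chain itself never ends up in a dangling state.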


What happens if a crash happens part way through the acknowledgement chain?

For example, in your insert, if the node containing "x" crashes before it receives the ACK from the node containing "y" - do the dangling "y" and "z" insertions ever need to be cleaned up?


The chains heal in a similar fashion to chain replication (http://www.cs.cornell.edu/home/rvr/papers/osdi04.pdf).

There will be no dangling insertions.


"I'd like to see how they pull that off when a node goes down. I guess in "well-administered" data centers, nodes don't go down."

They offer f-fault tolerance. They can have f nodes go down in a single "zone" and keep chugging as long as no more nodes in the same zone go down before the master reconfigures. Note that the f faults are per-zone, not per-system, so in fact many more than f nodes can be down in a single system without a problem.

But, more importantly, you seem to be confusing partition tolerance and fault tolerance. CAP is about partition tolerance: offering "CA" in the presence of arbitrary partitions. They offer a specific form of fault tolerance: "CA" in the presence of any failure or partition that affects less than f nodes.


> "So, the CAP Theorem says that you can only have one of C, A, and P. Which are you sacrificing?

Which is wrong. You can only have two of the three at the same time: CA, CP or AP.

If they get something as fundamental as this wrong, you have to wonder about the rest of the project.


Or it could just be a typo on the brand new webpage of a brand new project. Which it obviously is if you look at their explanation of CAP.


Technically, you can have at most two of the three at the same time. It's entirely possible for some stores to provide only one property, or none at all!


Well, yes. But that's not what the FAQ says; it says you can have at most one. The distinction is significant.


Just a simple typo. I've fixed it.

Thanks for pointing this out.


PACELC[1] helps distinguish just the kind of CAP sacrifice this makes.

[1] http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-an...


This is a pretty nice implementation of an old design concept that has merit for modern distributed problems. It is worth studying as a model for distributed systems.

I am aware of a couple commercial distributed systems that use it (not this implementation but the underlying algorithm). Several organizations seem to have reinvented it over the last five years.

Organizing data in this way was studied in the 1980s but was poorly suited to the computing systems of the time. Relatively little was written about it because it was viewed as a dead end, and modern literature has all but forgotten about it. Most of the research was done by companies rather than academics. Designs like this have fallen into the black hole of "if it is not on the Internet then it doesn't exist". Back when I was studying these models I found more crusty old patents related to these types of models than relevant papers on the Internet. It will be valuable to have some modern literature pertaining to these designs.


For those that want to take a quick look at the source code without physically cloning it, there's a GitHub link buried somewhere in the site (I forget where I found it): https://github.com/rescrv/HyperDex

Their Python client seems to be using Cython for extra speed: https://github.com/rescrv/HyperDex/blob/master/hyperclient/p...


I'm looking for a good paper or a book detailing patterns or strategies of storing complex data structures and relations, or designing complex schemas on top of key/value stores, possibly with real-world use cases / success stories. A bonus would be the same on top of key/value/range(search) stores, just like HyperDex is. Any suggestions HN?

Technically, it should be even possible to implement SQL on top of a store like HyperDex (there's the Tenzing paper from Google). But I'm not looking specifically at SQL, but at more general scope.


It looks like they trade the ability to scan ranges of keys for the ability to get single objects via multiple attributes. The value-dependent stuff is also a neat way of solving the consistency issues with multiple node updates. Interesting stuff.

If you were to swap this with your Cassandra cluster, you'd be losing multi-datacenter replication. Although partitions within a data center are pretty rare, you'd also lose some availability there as well. However, Cassandra is usually hash-partitioned so it needs to do broadcast for a scan (AFAICT, it even needs to do a broadcast for a lookup on a single secondary attribute), so you'd probably gain quite a bit of performance with HyperDex.

I can't tell if it's possible to dynamically change the set of secondary attributes being indexed without rebuilding the entire data set. Or how value-chaining works with missing attributes.

Also, apparently consistency has some... gaps... when you search via a secondary attribute:

"The searches are not strongly consistent with concurrently modified objects because there is a small window of time during which a client may observe inconsistency."


I'm reading through the paper, and I'm curious if I understand the implications of key subspacing and value dependent chaining correctly - do all reads and writes for a given key get forced to a single node? I understand how the replication that's described allows for failover when the point-leader fails, but does it also allow for scaling writes and key lookups?


Reads and writes for a given key always go to a single node. As you add more machines with the same workload, you are handling fewer keys per machine. The flip side is, if the number of keys per machine stays constant, more machines directly translates into the ability to handle more keys.

HyperDex scales linearly as you can see in our scalability graph.


I think you're missing the point of the question. I'm not asking how you scale in general, I'm asking whether your replication scheme is only for failover or if it contributes to your scaling story - i.e., if a given key goes hot, does everything have to run through the point leader, or can each replica take writes/serve requests? It sounds like the former, which is fine, just wanted to clarify.


It's the former. I'd rather not sacrifice our strong consistency by using replicas to serve GET requests.


How does it compare to elasticsearch?


Elasticsearch is a Fulltext search solution not a K/V Database.


Technically yes, but you can use elasticsearch as a distributed key-value store. I use it in production and it works amazingly well for me. Yes, searches may return stale data (one second old by default) but that is not much of a problem in my app.


That sounds both interesting and non-intuitive. Can you go into more detail about how you have been using ElasticSearch as a k/v store?

EDIT: Color me surprised. I didn't see it mentioned at all in the ElasticSearch docs or in their Github repo, but apparently this is a fairly popular thing to do.

"Elastic search as a database?" http://stackoverflow.com/questions/6636508/elastic-search-as...

"ElasticSearch, datastore for user and social data?" http://stackoverflow.com/questions/8190288/elasticsearch-dat...

"Elasticsearch and NoSql database" http://stackoverflow.com/questions/8026880/elasticsearch-and...


Did you compare the performance with (local) BerkeleyDB?

Also, is there a locking mechanism?

I'm the author of Torrus (torrus.org), and BerkeleyDB stability and non-network nature are quite painful. But I'm relying on its speed, concurrent locking, and some ways to acquire an exclusive lock on a table. It would be interesting to offer an alternative backend for torrus.


I'm not sure exactly what you're looking for, but Berkeley DB is embedded in your application on a single host, while HyperDex is distributed. It sounds like that's exactly what you want.

What do you use exclusive locks for?


I use exclusive locks where multiple processes compete for data access, and these locks guarantee data consistency.

For example, the GUI engine may eventually start cleaning its old cache, and the exclusive lock prevents other GUI processes (which are mod_perl or fcgi processes) from doing the same.

Another example is when the database is big enough and multiple collector processes start at the same time: I use the exclusive lock to let them initialize and slurp the data sequentially, which prevents excessive disk seeks.

but as you mentioned in another thread, your solution is locked to a single architecture, and I cannot afford that.


And yes, you're certainly right that BDB is embedded into the application process. Any slight memory corruption may then destroy the database and lock up the whole service. We've seen that, especially with graphics libraries.

But BDB offers tremendous speed, so it won't be easy to decouple the DB engine from the application process.


Is it open source? I browsed around the site and FAQ and didn't spot a mention of the license.


The code is open source. However, they applied for patents on the algorithm:

http://www.wipo.int/patentscope/search/en/WO2011137189


While waiting for the answer: what does it generally mean if the patent is granted? Since the code is open-sourced under BSD, can the algorithm be used in other open-source BSD-licensed products, and can corporations then use those products to store data without paying royalties on the patent?


The BSD license does not contain a patent grant, so it's technically possible that you'll have to pay to use this software in the future even though it's under the BSD license. I'd really like to see a switch to the Apache license, which does contain a patent grant.


So if I release my code under Apache license, and later it is found that it happens to infringe on a patent, am I liable to pay the damages on behalf of anyone using my code and getting sued based on the patent infringement?

Or, to put it another way - what is the best "free" license for a developer to release his/her code under? I don't really even care about mentioning my name in derivative works.


I'm no expert, but as far as I'm aware the Apache license grants the user a license for your relevant patents. The GPLv3 goes a step further and grants the user a license for any patents you have a license for, whether it's your own or one you licensed from somebody else. Since neither situation likely applies in your case, the BSD and Apache licenses are basically functionally identical. However, by choosing the Apache license you are telling potential users that you don't have any "sleeper" patents, so you might get better uptake.

If somebody later asserts a patent against your code, both you and your users are screwed no matter what license you choose. But the choice of license might influence who might help you out if that happens...


Yes! We've released HyperDex under the 3-clause BSD license.

You can get binaries and source from the downloads page (http://hyperdex.org/download/).


Any comment on the patent application that jandrewrogers pointed out?


I think maybe only certain versions of Ubuntu are supported? It won't install because I don't have a repo that provides libgoogle-glog0


The packages are built for Ubuntu 11.10. What version of Ubuntu are you running?


Thanks for letting me know the target version. I am running 10.04.4 LTS


Since they compare it with redis, I wonder if this can handle data bigger than memory. The other properties seem nice :)


We compared it to Redis because it is one of the many key-value/document-database systems whose high-throughput, low-latency properties make it comparable to HyperDex. We are expanding the set of systems we benchmark against, aiming to compare against as many as possible.

HyperDex is designed for bigger-than-memory data.


The comparison with Redis is a bit strange at first glance. What I mean is, if in Redis I replace the actual fetching of data with "return 'foo'" at every query, it still returns 150k requests per second for every operation. The same goes for memcached per single core. So either your code has some good trick to improve speed in the networking layer, or you are comparing multiple cores/nodes to a single core/node.

About benchmark "E", this is a primitive that Redis lacks, so it is in my opinion a bit misleading.

Another question: in the benchmarks with Redis, does the data set also fit inside RAM for HyperDex (or the OS cache)? In such a case you are comparing two in-memory databases in practical terms, but at the same time you say that HyperDex is suitable for datasets bigger than RAM (in a portion of text near the Redis benchmark). The numbers users will experience in actual production setups will be different if the data set does not fit in RAM.

Finally, the benchmark does not make clear that Redis can perform a lot of work in a single operation. For example, a variadic LPUSH can insert 10 elements into a list in roughly the same time a plain LPUSH inserts 1 element. If you multiply this by N cores in a standalone Linux box you get 1 billion operations per second or so, but I would not be comfortable writing that on the Redis home page. I mean, if you don't specify the methodology very well, this is unfortunately just marketing material without real interest for the field.


Both systems were single process per host.

I agree that Redis does not have good support for "E", so anyone looking at the results should look at A-D,F.

Both systems performed entirely out of main memory to avoid touching disk.

The Redis benchmark results only came in last night. We'll be writing up the methodology soon. We used the YCSB Redis binding from branch master. I'd be interested in your opinion on how well the bindings map to what is truly best for Redis.


Redis uses a single thread, not just a single process, so if HyperDex is multi-threaded you are comparing a single core vs. multiple cores. As you can see with memcached, which is instead able to use multiple cores (a feature that Redis is going to implement soon), this leads to a big performance improvement in this kind of benchmark.

EDIT: (I checked that HyperDex actually uses threads and multiple cores.) If you want a fair comparison, you should run HyperDex on a single core as well, or run N instances of Redis (one per core) and write the benchmark so that it uses all the instances.

I'll check the YCSB Redis bindings, I never looked at them before. Thanks for the reply.

EDIT2: there are also problems with the YCSB Redis bindings:

1) It basically forces an object-store data model on Redis: it only uses hashes to store objects, and every time an object is stored or deleted, a sorted set is updated as well. This is a possible use case of Redis but not a very idiomatic/representative one.

2) Even for benchmarks not involving searching, the sorted set is updated anyway.

3) There is no pipelining used to alter the object and update the sorted set, so every operation pays 2x the round-trip time in the Redis bindings.

4) Even worse, there is no pipelining in the "search" operation, so you may pay the RTT many times when you do a scan operation with these bindings.

The minimal change is to modify the YCSB bindings to use pipelining when possible (almost always, actually). Still, I think that an intermediate layer to turn Redis into an automatically-indexing object store does not make sense. Another big problem is that you are comparing multi-core vs. single-core.
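The 2x round-trip cost in point 3 is worth making concrete. A toy calculation (the RTT figure is an assumption, not a measurement from these benchmarks) shows how per-operation round trips dominate:

```python
# Back-of-envelope latency cost of skipping pipelining in a binding
# that issues two commands (hash update + sorted-set update) per
# logical operation. The RTT here is an assumed figure for
# intra-datacenter traffic, not a measured one.

rtt_us = 200        # assumed round trip, in microseconds
ops = 10_000

unpipelined_us = ops * 2 * rtt_us   # two round trips per operation
pipelined_us = ops * 1 * rtt_us     # both commands share one round trip

assert unpipelined_us == 2 * pipelined_us
print(unpipelined_us / 1000, pipelined_us / 1000)  # 4000.0 2000.0 (ms)
```

Whatever the real RTT is, an unpipelined binding spends exactly twice as long waiting on the network, which is the distortion being pointed out.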

So if you really are interested in a comparison between HyperDex and Redis, you should pick a use case and model it with the best tools of both databases.

I would recommend the following use cases.

1) Populate the two DBs with 10 million hashes, then write a benchmark where 50 clients simultaneously get and set specific fields.

2) Like "1" but increment a field by 10 at every write query.

3) Simulate a leaderboard where 50 clients simultaneously update the scores of different "players" with one operation, and ask for the top 10 users in the leaderboard with another.

4) Simulate a capped collection where you always add the latest news on a web site and can fetch the top 10 to show on the home page. Each of the 50 clients should write a single item and fetch the top 10 items.

And so forth. Always use 10 million objects and mixed reads and writes with 50 clients at the same time. Write the best code for both DBs (I can help with Redis).

This time you are truly comparing the two DBs in a real world scenario.
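Use case 3 above has a simple in-memory shape. A toy Python model of the workload (just its shape with made-up data, not a Redis benchmark; in Redis this would be ZADD plus ZREVRANGE against a sorted set) might look like:

```python
# Toy model of the leaderboard workload: score updates interleaved
# with top-10 queries. Purely illustrative; real benchmarks would
# run this against the actual databases with concurrent clients.
import heapq
import random

scores = {}  # player -> score

def update(player, delta):
    scores[player] = scores.get(player, 0) + delta

def top(n=10):
    # what ZREVRANGE ... WITHSCORES answers in one call against a sorted set
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])

random.seed(42)
for _ in range(1000):
    update(f"player{random.randrange(100)}", random.randrange(50))

leaders = top(10)
assert len(leaders) == 10
assert all(leaders[i][1] >= leaders[i + 1][1] for i in range(9))
```

The interesting benchmark question is how each database answers `top()` under concurrent updates: Redis keeps the sorted set incrementally, while a store without that primitive has to index or scan.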


I think YCSB is meant so that experts in each technology can tweak it for their system. So, it looks to me like you're in a great place to make that happen, and I'd love to see it!

Also, it's extremely difficult to get a speedup of 4x on 4 cores, so I'll believe that argument when I see it. It seems to me that with the current implementation of Redis, you'd run into some serious memory-management problems if you ran 4 Redis nodes on one 4-core machine.

Also, you say: "This is a possible use case of Redis but not a very idiomatic / representative one." Can you elaborate? What is idiomatic? From the front page of the Redis site: "Redis is an open source, advanced key-value store."

Also, on a slightly different topic, I can't seem to find any real documentation about the consistency guarantees of Redis, so I thought you might be able to point me in the right direction. It appears that the master/slave replication scheme in Redis just backs up onto the slaves eventually, but the master immediately returns. Is that true? If a master goes down (disk and all), could a write have been confirmed to the client which is missing on a slave? What replication protocol does it use? I find the documentation lacking in this regard and it'd be great if you could point me to some actual technical specifications. Thanks!


This reminds me of http://xanadu.com/zigzag/ .


HyperDex is different. ZigZag focuses on data visualization. HyperDex provides you the same key-value interface of other systems, while also providing an efficient search primitive.

Check out http://hyperdex.org/tutorial/ for examples of the Python API.


Is this better than Hadoop?


The FAQ mentions that you target the x86_64 architecture. Does the server work on i386?

Also, I hope it's doing proper memory alignment and endianness independence? (Because Mongo sucks at that.)


We use many lock-free data structures which rely heavily upon the x86_64 architecture. Further, the expanded virtual address space enables us to mmap everything.

All network traffic is packed, and in network byte order.


> the expanded virtual address space enables us to mmap everything

This sounds like a disaster waiting to happen when a node's working set is larger than RAM. Have you considered the impact on performance due to excessive I/O resulting from this kind of overcommit?

Redis performance, for example, suffers tremendously when its database size exceeds the available RAM, which is why the authors advise implementors to cap its database size. The only real difference I can see here is that Redis allocates its storage anonymously, while Hyperdex uses file-backed pages; in either case, performance under memory pressure will largely be governed by how the VM chooses to move pages in and out of its backing store -- behavior the application has no control over.

Optimizing the performance of working sets larger than RAM is hard. Redis had a (thankfully) aborted attempt to do so (the VM is gone in the most recent stable release); and the InnoDB buffer pool in MySQL has been refined for many years and is subject to quite a bit of tuning for specific workloads. (See also http://blog.kennejima.com/post/1226487020/thoughts-on-redis for some thoughtful discussion on the subject.)

You claim on the one hand that Hyperdex was designed to work on data sets larger than RAM, but on the other hand you admit that all the benchmarks were performed with a working set smaller than RAM. I'd like to see comparative benchmarks where the working set is larger than RAM to be convinced.


> This sounds like a disaster waiting to happen

Well, define "disaster". A memory-mapped dataset will cause pathological performance only when there is thrashing: if you are accessing a small percentage of the entire dataset, then unused data will be paged out and remain paged out. If the bulk of the data being accessed exceeds the amount of available physical RAM, then you will get I/O thrashing.

Memory-mapping is a good alternative to static allocation or a home-grown paging system because it lets the kernel handle the dynamics of allocation, letting your application transparently and gracefully handle memory pressure by relinquishing memory to other apps. Kernels (including Linux and Windows) and CPUs are extremely efficient at paging I/O, much more efficient than a hand-written paging system, because there's no need for the application's code to check whether a page is in physical memory: that's handled by the CPU itself.

Of course, any I/O incurred by a too-large dataset will drastically reduce performance compared to in-memory speed. But paging in itself does not necessarily lead to "disaster".
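To make the file-backed-pages point concrete, here's a minimal Python sketch of memory-mapped storage (the file is a throwaway temp file; the page size and contents are just placeholders):

```python
import mmap
import os
import tempfile

# Create a one-page, file-backed mapping; the kernel pages it in and out
# on demand, so the process never issues explicit read()/write() calls.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 4096)      # reserve one page of backing store
buf = mmap.mmap(fd, 4096)

buf[0:5] = b"hello"         # an ordinary memory write; the kernel decides
                            # when the dirty page is written back to disk
buf.flush()                 # ...or we can force writeback ourselves

with open(path, "rb") as f:
    data = f.read(5)        # the write is visible through the file
print(data)

buf.close()
os.close(fd)
os.unlink(path)
```

Under memory pressure the kernel can simply drop clean pages of this mapping and fault them back in later, which is the graceful degradation described above.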


Did you read my entire comment? I specifically said, "when the working set is larger than RAM."

Your observations, while true in the abstract, don't reflect real-world behavior under these conditions.


A working set larger than RAM doesn't imply thrashing. It all depends on the page replacement algorithm employed by the OS, and on the access patterns of the applications that are running.


If you're building a distributed KV store, you should benchmark against other distributed KV stores. Mongo and Cassandra aren't really. But Riak is.

Plus, since it's "distributed," here's the benchmark I'd like to see:

1. Set up a cluster of 8 nodes. Set data replication to 3.

2. Load 3TB of data into the cluster, across 1M documents (or some data set of that order).

3. Run your tests. Optimize each of the DBs for the best way to access them (e.g. link walking vs. map reduce on Riak if that's faster, or secondary indexes if that's faster, or Riak Search if that's faster -- there are many ways to search Riak).

4. Throw out the results of #3.

5. Pull the plug on 2 nodes, i.e. shut them down completely, no longer on the network at all. Poof, gone. Pick the two nodes by rolling the dice. If you have SPOFs, and a bad roll of the dice would have brought the whole cluster down, remove the word "distributed" from your marketing.

6. Let the cluster sit for 30 minutes.

7. Run your benchmarks.

The benchmarks developed in #7 are the ones I want to see.

I have no clue how HyperDex would perform in this situation. It could kick Riak's butt (but then, small-cluster performance is just one of the criteria that are important to me). I just wish people did benchmarks like this (though I know it's a PITA to do it this way).
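For what it's worth, the fault-injection steps above are easy to script. Here's a hedged Python sketch: the node names, passwordless SSH access, and the `sudo poweroff` shutdown command are all assumptions about your particular cluster, not anything specific to Riak or HyperDex:

```python
import random
import subprocess
import time

# Hypothetical 8-node cluster; substitute your real hostnames.
NODES = [f"node{i}" for i in range(1, 9)]

def pick_victims(nodes, k=2, seed=None):
    """Choose k nodes to kill 'by rolling the dice' (step 5)."""
    rng = random.Random(seed)
    return rng.sample(nodes, k)

def run_degraded_benchmark(nodes):
    victims = pick_victims(nodes)
    for host in victims:
        # Hard shutdown -- the node simply vanishes from the network.
        subprocess.run(["ssh", host, "sudo poweroff"], check=False)
    time.sleep(30 * 60)  # step 6: let the cluster settle for 30 minutes
    # Step 7: re-run the same workload against the survivors.
    survivors = [n for n in nodes if n not in victims]
    return survivors
```

The interesting numbers are the ones your workload generator produces after `run_degraded_benchmark` returns.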


Why would a competitor build this Riak benchmark? How are they supposed to know which Riak configuration performs best?

That's the point of YCSB. Each vendor can submit the optimal configuration for their system, and they all run the same benchmark.

At the end of the day, each vendor is going to publish benchmarks that show their system performing better than all others.

It's your job, not theirs, to verify those benchmarks for your particular work load.
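For reference, this is how YCSB separates the shared part from the vendor part: every system runs the same workload definition, along the lines of the CoreWorkload properties below (values here are illustrative), while each vendor supplies only its own client binding and connection settings.

```ini
recordcount=1000000
operationcount=1000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
readallfields=true
readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0
requestdistribution=zipfian
```

Tuning the store itself is fair game; tuning the workload is not.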

Edit: Also, while Mongo wouldn't qualify as a distributed store, Cassandra definitely would. In fact, it's more "distributed" than Riak, since you have to pay Basho for multi-DC support (unless that's in the open source version?), whereas Cassandra is completely free.


If things haven't changed in the past year, you are right about not being able to do replicated multi-DC support.


Have you looked at YCSB? It's a pathologically poorly designed benchmark. It essentially tests how good a developer is at implementing a Java wrapper around a toy object model and interfacing that with a database.

A proper benchmark would hold constant things which are reasonably expected to be constant: the use case scenario, the data, the warmup requirements, the concurrency, and the count. If feeling frisky, the hardware, operating system and networking environment. And then step back.

For example, "On an EC2 Extra Large instance running Ubuntu 11.10 with whatever tuning the package recommends, what is the transactions per second when 50 vendor-provided clients on the same LAN are attempting to write and read random records out of a pool of 10mm 5k records?"


What exactly makes Cassandra "not really" a KV store, while Riak "really" is a KV store? I'm calling No True Scotsman.


In my opinion they're both KV stores. Where they differ is in what data structure they store: Cassandra stores BigTable-style rows indexed by a key, while Riak stores a binary blob + metadata indexed by a key.

I suppose it all depends on how we define a key/value store. Is it simply a data structure that can be accessed by a primary key? If that's the case, nearly all databases fall under that description. If, on the other hand, a key/value store is simply a database that stores a blob indexed by a key, then nirvana is correct in saying Cassandra is not a KV store, but where does that get us?
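The distinction can be sketched in a few lines of Python (keys, field names, and values are invented for illustration):

```python
# Riak-style: an opaque blob plus metadata, indexed by a key. The store
# has no idea what is inside the value.
riak_like = {
    "user:1": {
        "value": b'{"name": "alice"}',
        "meta": {"content_type": "application/json"},
    },
}

# Cassandra-style: a wide row of named columns, indexed by a key. The
# store understands (and can index) the individual columns.
cassandra_like = {
    "user:1": {"name": "alice", "email": "alice@example.com"},
}

# Both are "a value looked up by a primary key"; they differ only in how
# much structure the store understands inside the value.
blob = riak_like["user:1"]["value"]
row = cassandra_like["user:1"]
```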

I think that comparing HyperDex to Cassandra is a valid comparison. Both support primary-key lookups and secondary-key lookups. The same is true of Riak. It would be nice to see a YCSB benchmark for a properly configured Riak cluster.

I am pretty ignorant of the YCSB benchmark but it would be nice to have a benchmark that tests fault tolerance as well as raw performance.


Has it been proven in a pr0n environment? If not, I'll wait and see.


Interesting.. keep it coming.

We need a unified NoSQL query language -- basically what SQL is for relational databases.


What's wrong with SQL?


* COBOL-inspired syntax that is neither human- nor machine-parseable (easily).

* Strange nested query rules

* The language encourages Cartesian products for virtually every non-trivial calculation. This makes join order crucial for reasoning about performance.

* Humans don't have the brain capacity to reason about these joins, forcing the logic onto the query planner.

* Humans don't have the brain capacity to reason about the query planner reasoning about the joins.

* End result: bizarro-English syntax that's supremely hacked up to tweak the query planner into doing your bidding.
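To make the planner point concrete, here's a small sketch using Python's built-in sqlite3 (the tables and data are invented): the query states only *what* to join, and the planner picks the order; EXPLAIN QUERY PLAN is how you peek at what it chose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total INTEGER);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 10), (11, 1, 5), (12, 2, 1);
""")

# Declarative: nothing here says which table gets scanned first.
query = """
    SELECT u.name, SUM(o.total)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('alice', 15), ('bob', 1)]

# The join order is the planner's call, not the author's:
for step in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(step)
```

Whether that opacity is a feature or a bug is exactly what the two sides of this thread are arguing about.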


What? SQL is virtually plain English and is efficient across a range of database architectures for moderate-to-complex join situations. It's just an implementation of Boolean algebra with a thin abstraction layer that uses words like SELECT, JOIN, FROM, WHERE, etc.

If you have performance issues, it's invariably because the structure of your data is less than optimal, or you have many linked external tables. Or -- you are really bad at efficient query design. Even highly experienced SQL-ers always look at old queries and realise that they could be three times faster and half as big, and if there is a performance issue then that's the first thing they'll review.

Humans have been 'understanding' SQL very well for nearly forty years -- without modern refinements like graphical query designers, either.

Not all data is a suitable candidate for an RDBMS solution, I am the first to agree, but there's not actually much wrong with SQL when what you need is a relational table store.

There are a few bad implementations of it, though, it is true, and a lot of people who can't seem to grasp SQL for what it is.

Edit: I am myself a huge fan of NoSQL solutions and always seek to rationally justify a non-relational datastore where practical or possible. But NoSQL isn't a replacement for SQL (with all its imperfections) for the kind of slicing and dicing where a relational store is needed.


SQL is a very, very poor implementation of relational algebra. For example, the output of an SQL query is not even a valid relation (in E. F. Codd's terminology): it can contain duplicate rows.

If you want to look at how SQL might have looked if it had been relational, take a look at "Tutorial D", the language proposed in a book called The Third Manifesto, by Hugh Darwen and C. J. Date: http://thethirdmanifesto.com/. There are several experimental implementations.
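The duplicate-rows point is easy to demonstrate with Python's sqlite3 (any SQL engine behaves the same way): a relation is a *set* of tuples, but a SQL query result is a bag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,)])

# A relation could not contain the tuple (1,) twice; this result does.
rows = conn.execute("SELECT x FROM t").fetchall()
print(sorted(rows))  # [(1,), (1,), (2,)]
```

You have to opt in to set semantics with SELECT DISTINCT; in a truly relational language it would be the only behavior.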


Nothing, really. When people talk about the NoSQL movement, they're mostly referring to the 'non-relational database' movement. The fact that relational databases all use SQL as their DSL is incidental.


There is SPARQL.




