31/07/2012 | Blog
So you probably built an application or two with Node.js. You saw the power of it and started wanting even more. You probably discovered Cluster module and had another wave of awe. Now you could easily scale your application to multiple cpus or cores. It couldn”t get any better than that. Or so you thought… Collective.js is here to change that. It’s something small, but very powerful, something simple, but with the ability to handle complex things. It’s a data synchronization tool across multiple Node.js instances going beyond one machine boundaries.
A little of background
The architecture of Collective.js
There is no much of it to be honest. I mean what architecture can be drawn from
250 275 lines of code? What I did was create two one-way connections to all Node.js instances using Net module which then would be used to synchronize data. Meaning, if one instance updated it’s data, all the other would receive the same update. The updates themselves are non-blocking and asynchronous, and performance is not hindered at all (I hope). Unless, of course, your Tcp connections are very slow and you are trying to pump out huge amounts of data. In which case Collective.js is not for you. This is because the only true bottleneck is your tcp connection. If you overflow it with data faster than it can drain it – you’ll have problems.
So how does Collective.js works again?
Here’s a quick and dirty explanation. Let’s imagine you have 3 Node.js instances. They are on the same or different machines (does not matter really) with different configuration ports.
- You start your first instance. Collective.js looks for any other instances which were provided in the configuration.
- It finds none and that does not matter. You use set/get commands normally.
- Then you start another instance, which imediatelly looks and finds that there is another instance alive.
- Two one-way tcp connections are established and are kept alive for as long as instances are running.
- All of the data from the first instance is sent to the second one.
- After that, any set command will be synchronized with both instances.
- You start your third instance and it’s the same as mentioned above with the exception that initial data synchronization (5) is done from a random connection. Meaning, it could come from the first one or the second one.
Pretty simple, huh? It is! You can use it with Cluster or on different machines.
But Memcached is simple and clean, and fast, and more reliable, and stable, and and…
Yes it is. Memcached is a great thing for distributed caching – there’s not doubt it. I love and use it constantly. It’s just that with Node.js you have the option to don’t care about it anymore. Plus, and this is very speculative, Collective.js has it’s advantages over Memcached:
- You don’t require Memcached dependacy (well, duhh…).
- Memcached stores data in a simple key/value pair, Collective.js can have deep storage (Remember? It’s a JS var).
- Memcached distributes data across it’s servers, Collective.js replicates it (Yes, this point is really debatable).
- There was another very important point, but somehow I forgot it…
The bad and the ugly
The system is not without it’s caveats. The main problem – race conditions. There’s a possibility of overwriting newer data with older one. For example, the second and the third instances do a `set` command, the latter being 1 ms later. You could expect that the final data would be as is in the third instance. But that may not be the case. If the drain event is lagging in the second instance, it’s quite possible that the final data can become as is in that instance. Is it a big problem? Yes and no. It depends. If you need 100% precision – it’s a problem. If you do A LOT (thousands) of sets – it’s a problem. But, if you mostly do reads and don’t care if the data differs in a timeframe of a few miliseconds across instances, then it’s not a problem. All is not lost though, I’m figuring out how to prevent these kind of situations. Few attempts failed miserably so far.
This has been fixed in v0.2 release. The solution was to use data set timestamps and compare them when new data arrives. If a received command contains newer timestamp – a replace will occur. If not – it is ignored. While this solution is not perfect (the data is still sent every time and you rely on time itself)) it gives a slight control over unexpected overwrites.
I also have ideas how to reduce total traffic across instances. Currently the system supports only the REPLACE command, meaning, that the whole chunk of data has to be sent even on a small change in it. A possible APPEND command would significantly reduce traffic if there’s a lot of arrays for example. An automated checking of what data has been changed is also a possibility, but I have doubts about this in terms of performance. It would introduce a significant overhead in my opinion. Maybe, as an optional thing, where’s bottleneck is not processing power, but the network.
The last ugly thing are counters and atomic operations. Not supported yet, but planned in future releases.
Support of simple math (+/-) has arrived in v0.2 release too. Now you can have a fail-safe way to increment/decrement (it’s not limited to a number of 1) your counters.