Collective.js – increase your Node.js application performance even more.

NOTE: Updated to reflect changes in v0.2 release.

So you probably built an application or two with Node.js. You saw the power of it and started wanting even more. You probably discovered Cluster module and had another wave of awe. Now you could easily scale your application to multiple cpus or cores. It couldn”t get any better than that. Or so you thought… Collective.js is here to change that. It’s something small, but very powerful, something simple, but with the ability to handle complex things. It’s a data synchronization tool across multiple Node.js instances going beyond one machine boundaries.

A little of background

When I switched to Node.js from Php the first thing I noticed and took a liking to was the fact that I could have easily accessible data outside the http request. No storing to database, using Memcached or anything like that. I just had a plain javascript variable with any data I wanted and I could access it directly without any problems. I quickly developed a habit of preloading data outside my http server. The performance boost, as expected, was apparent. But this technique somewhat lacked when using with more than one Node.js instance. The problem was with data synchronization. Because every instance had it’s own preloaded data – a desynchronization would occur if any of the instances changed it locally. There, of course, were obvious solutions, such as using databases to indirectly modify the data and constantly forcing a reload on every instance. Another solution was to use Cluster messaging system, but it constrained everything to local Node.js instances. I didn’t like either of those. I wanted it simple and clean, and, most importantly, fast. Thus, after thinking for a long time, I came up with the idea of using TCP connections to synchronize data across all Node.js instances.

The architecture of Collective.js

There is no much of it to be honest. I mean what architecture can be drawn from 250 275 lines of code? What I did was create two one-way connections to all Node.js instances using Net module which then would be used to synchronize data. Meaning, if one instance updated it’s data, all the other would receive the same update. The updates themselves are non-blocking and asynchronous, and performance is not hindered at all (I hope). Unless, of course, your Tcp connections are very slow and you are trying to pump out huge amounts of data. In which case Collective.js is not for you. This is because the only true bottleneck is your tcp connection. If you overflow it with data faster than it can drain it – you’ll have problems.

So how does Collective.js works again?

Here’s a quick and dirty explanation. Let’s imagine you have 3 Node.js instances. They are on the same or different machines (does not matter really) with different configuration ports.

  1. You start your first instance. Collective.js looks for any other instances which were provided in the configuration.
  2. It finds none and that does not matter. You use set/get commands normally.
  3. Then you start another instance, which imediatelly looks and finds that there is another instance alive.
  4. Two one-way tcp connections are established and are kept alive for as long as instances are running.
  5. All of the data from the first instance is sent to the second one.
  6. After that, any set command will be synchronized with both instances.
  7. You start your third instance and it’s the same as mentioned above with the exception that initial data synchronization (5) is done from a random connection. Meaning, it could come from the first one or the second one.

Pretty simple, huh? It is! You can use it with Cluster or on different machines.

But Memcached is simple and clean, and fast, and more reliable, and stable, and and…

Yes it is. Memcached is a great thing for distributed caching – there’s not doubt it. I love and use it constantly. It’s just that with Node.js you have the option to don’t care about it anymore. Plus, and this is very speculative, Collective.js has it’s advantages over Memcached:

  1. You don’t require Memcached dependacy (well, duhh…).
  2. Memcached stores data in a simple key/value pair, Collective.js can have deep storage (Remember? It’s a JS var).
  3. Memcached distributes data across it’s servers, Collective.js replicates it (Yes, this point is really debatable).
  4. There was another very important point, but somehow I forgot it…

Anyway, Collective.js is no contender to Memcached, it’s just another approach to data caching/syncing.

The bad and the ugly

The system is not without it’s caveats. The main problem – race conditions. There’s a possibility of overwriting newer data with older one. For example, the second and the third instances do a `set` command, the latter being 1 ms later. You could expect that the final data would be as is in the third instance. But that may not be the case. If the drain event is lagging in the second instance, it’s quite possible that the final data can become as is in that instance. Is it a big problem? Yes and no. It depends. If you need 100% precision – it’s a problem. If you do A LOT (thousands) of sets – it’s a problem. But, if you mostly do reads and don’t care if the data differs in a timeframe of a few miliseconds across instances, then it’s not a problem.

All is not lost though, I’m figuring out how to prevent these kind of situations. Few attempts failed miserably so far.

This has been fixed in v0.2 release. The solution was to use data set timestamps and compare them when new data arrives. If a received command contains newer timestamp – a replace will occur. If not – it is ignored. While this solution is not perfect (the data is still sent every time and you rely on time itself)) it gives a slight control over unexpected overwrites.

I also have ideas how to reduce total traffic across instances. Currently the system supports only the REPLACE command, meaning, that the whole chunk of data has to be sent even on a small change in it. A possible APPEND command would significantly reduce traffic if there’s a lot of arrays for example. An automated checking of what data has been changed is also a possibility, but I have doubts about this in terms of performance. It would introduce a significant overhead in my opinion. Maybe, as an optional thing, where’s bottleneck is not processing power, but the network. The last ugly thing are counters and atomic operations. Not supported yet, but planned in future releases.

Support of simple math (+/-) has arrived in v0.2 release too. Now you can have a fail-safe way to increment/decrement (it’s not limited to a number of 1) your counters.

Final words

So there you have it. A simple and unoriginal tool. Feedback is always welcome. Forking and improving – even more. Hope you enjoy my first public Node.js module. You can find it at github and npm.



1 Comment

  1. Pingback: Collective.js – a data synchronization tool for Node.js.

Add Comment

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>