The paper can be found here: http://research.microsoft.com/apps/pubs/default.aspx?id=64551
I’ve read a number of Jim Gray’s papers before & always found them to be incredibly insightful & useful. Typically I’ve also been almost shocked that they could have been written so long ago, and yet be so relevant now, for example ‘The Transaction Concept: Virtues and Limitations‘ published back in 1981 and which next time I re-read it I’ll try & get something up here about. So coming to this paper, published just recently in 2004, I was very much looking forward to seeing what I might get from it.
The abstract sets out the scene as Jim Gray saw it in 2004, some of which remains very reasonable as a statement of current affairs, for example, the movement of intelligence to the periphery of the network, the integration of queues in to relational database platforms, the expectations we have that RDMSs are highly available and low cost to maintain, the rise of column oriented storage (vNext of SQL Server for example has an engine which provides a column oriented store, not too mention the various BigTable like stores out there now). In fact the only prediction/statement of his which I can draw serious contention with, having the obvious advantage of hindsight, is that ‘XML and xQuery will be the main data structure and access pattern’. I can’t express just how glad I am that this isn’t the case, but that’s for another post (or more typically a boozy twrant).
For me though, the abstract is where Gray stops being right. The revolution which Gray describes is I think perhaps only a revolution for the manufacturer of the RDBMS, but is very little one for its user, and is not perhaps a revolution caused by being in touch with their market, but from losing contact with it. And yet taken a little differently, his analysis seems, perhaps unsurprisingly, almost spot-on.
These are not the revolutions you’re looking for
Firstly let me explain why I feel that Grays revolution is not in fact the revolution at all. Gray’s revolution is:
- the ability to execute ‘OO’ code in the RDBMS (so Java in Oracle, & C# in SQL Server);
- relational databases presenting services that are accessible from the web;
- the inclusion of queuing systems inside the RDBMS platform;
- the arrival of cubes as a way of managing & modeling aggregations;
- the arrival of data mining;
- the ‘rebirth’ of column stores;
- that RDBMs now deal with ‘messy’ data better, such as text, temporal, and spatial data;
- RDBMs working with semi-structured data, in particular Gray points to the integration of the RDBMS and the filesystem (so, I imagine that he’s thinking here of something like SQL Server’s Filestream attribute & the capability that this enables). He also mentions XML.
- The requirement for Stream Processing – as he puts it in the abstact, that now the data finds millions of queries, rather than the queries acting over millions of rows of data.
- A movement towards pub-sub styles of replication;
- A need for query plans which take into account changing load on the system, skews in the distribution of the data, and changing statistics about the data;
- The substantial changes in terms of the size of available storage, both on disk & in memory, and the consequential shift in the latency of reads from each of these. The relative increase in the cost of random access reads over sequential reads.
- The possibility to move the RDBMS platform down to the disk, so that instead of a disk being organised around files, it becomes organised as a relational database.
- ‘Self-managing and always up’ is how he puts it, and we are all familiar with that idea now surely.
In all of these areas I think, in terms of the change in capabilities, Gray is clearly correct. However if I look at that list I think I could split it into those areas where, although capability has been added the existence of the capability has had little relative uptake or impact; those areas where other technologies than the relational database are dominating, or look set to; and, those areas that whilst the capability has been added, and has had uptake, it has not had a revolutionary impact.
The ability to execute imperative (or OO if you like) code inside of the database has had little uptake, and in general has been rejected by both the DBA community and the programmer communities. Similarly, the inclusion of messsage queues inside the RDBMS has had little uptake or impact, with most organisations which choose to adopt message queuing approaches preferring technologies which exist outside of their RDBMS (such as WebSphereMQ, Tibco, ActiveMQ, RabbitMQ, etc…) I’m not going to go into why I think this has happened here, and I don’t have any figures to back this up – but I’ve certainly not witnessed any uptake of these things, nor widespread discussion of & interest in them. I’m not suggesting that these capabilities are never used, not that they are never appropriate nor useful, just that they haven’t had anything like a revolutionary impact – unless you’re a company/engineer concerned with making RDBMSs. This said, I’ve certainly heard of platforms like Redis being used as high performance queues, and I’ve used CouchDB myself as a distributed, durable queue. So I think Jim Gray is correct that there is a union between database platforms and queuing technologies that is happening to an extent (though I’m not sure it constitutes a revolution), but despite these capabilities being bolted onto RDBMSs, they’re not where its happening.
Similarly, though relational database platforms have now enabled the exposing of data as ‘services’ through the web for a number of years, I have not seen this become a popular thing. Risk averse organisations are typically unwilling to risk exposing there data platforms in these ways (perhaps because they have been trained to see the DB as the king in a chess game of security), and the capabilities which these platforms expose (& I’m thinking specifically of SQL Server Astoria here). I could suggest that a lot of webservices which I’ve seen do very little except to directly expose CRUD like operations onto databases, but these remain an external wrapper around the database. If the relational database platform has been unsuccessful in promoting this approach then perhaps the NoSQL community has been more successful. Products like CouchDB take exposing the database as a web service to their core, and other products such as Neo4J, Riak, and HBase have followed in providing various interpretations of RESTful webservices. Perhaps this could be because the kind of organisation which might adopt NoSql stores assesses & manages risks in different ways to many which will not adopt these. So in one sense here Jim Gray is absolutely correct, it’s just that this hasn’t really happened for the RDBMS.
Perhaps paralleling this, Complex Event Processing definitely seems to have gained momentum over the last few years with platforms & products growing in capability & number which enable the concurrent, near-Real-Time processing of vast streams of events. Again however the RDBMS platform has failed to achieve much traction in this space, regardless of the capabilities which have been shoe-horned on to these products. Even the languages being used in many of the CEP products are clearly inspired by SQL, but this is where the closeness between the platforms perhaps end. Jim Gray is clearly correct about there being a revolution, perhaps still on the way, which will place stream processing far more centrally in how IT can enable, but it has failed, at least to date, to be reflected in the usage, or demands, of the RDBMS.
Cubes and Data mining are an area where clearly the RDBMS market has done very well, and can be viewed as an area where, in terms of usage & demand a revolution might be seen to have occurred in the market, rather than purely inside the vendors. It is also an area where the RDBMS vendors face a lot of competition as a plethora of other approaches and platforms have exploded on to the market competing in this space, such as the Map-Reduce implementations found in products like Hadoop and Greenplum.
When it comes to the changes in the hardware-scape onto which RDBMS vendors must prepare their products for deployment the revolution that Gray describes has most definitely occurred & is still occurring in every sense, and with the growth also of flash memory & SSDs perhaps more so than he anticipated. The challenges which he describes the engineers of RDBMS platforms as facing seem as relevant as when he wrote this piece, if not more so.
In conclusion then, I think that if the revolution is viewed as a revolution in the skills & projects of the teams working on Oracle or SQL Server then maybe Jim Gray is right in practically every aspect. If it is viewed as a list of capabilities that various different DBMSs may become popular for providing, then again, he is broadly correct. If though it is to be viewed as a revolution in what the market will demand and use in an all singing, all dancing, RDBMS then I think not only is Jim Gray wrong, but that this thinking which has clearly imbued places like Oracle and SQL Server over the last few years (decades?) is perhaps why these RDBMSs have become the hulking great behemoths that they now are, so overladen with features and capabilities, that both they & the organisations that place them at their core may find it difficult to maintain a level of organisational agility which, at least through technology, might allow them to achieve a competitive advantage. In particular the paper sets me thinking (& perhaps I’m echoing them already) about the papers by Michael Stonebraker on ‘The End of an Architectural Era’.