Posts Tagged ‘database’

It seems self-evident that to be relevant a benchmark must speak to a particular class of problems. For example, it should be obvious that a benchmark designed to measure the performance of transaction processing will not be a good fit when assessing the appropriateness of systems intended soley for use in reporting. The specificity argument can though of course lead to ever more specific benchmarks. In ‘The Set Query Benchmark‘, a paper which forms a part of ‘The Benchmark Handbook‘, a volume edited by Jim Gray on Benchmarking databases, Gray presents 4 criteria which a domain-specific benchmark must meet if it is to be useful:

  • Relevant;
  • Portable;
  • Scaleable;
  • Simple.
Patrick O’Neil, in his paper  The Set Query Benchmark published in the same volume, gives 4 characteristics of his Set Query benchmark. Whilst two of these are held in common with the 4 criteria that Jim Gray proposes, namely Portability and Scalability, (and his benchmark meets the other 2) he also elects to include Functional Coverage and Selectivity Coverage. These are, perhaps, a little less self-explanatory than the criteria that Gray proposes. Selectivity Coverage refers to the extent to which a benchmark covers the potential spectrum of selectivity, from one row being returned by a query (most selective), to all of the rows being returned by a query (no selectivity). Functional Coverage refers to the extent to which the benchmark covers the range of queries commonly run in commercial settings. For both of these criteria, O’Neil points out that users of his benchmark can examine the subset of measurements (so, in an interstitial space existing at the intersection of functionality & selectivity, as well as hardware/infrastructure & scale).

The Star Schema Benchmark is described in a 2007 paper of which the lead author, Pat O’Neil, is also the author of The Set Query Benchmark discussed above. It describes a domain-specific benchmark that has been specifically designed to enable the comparison of star schema performance across different products. The benchmark itself is a derivative of the TPC-H standard but where the structure of the database has been transformed into a star schema, also dropping columns, for instance, text columns from the fact table, and in other ways described at length in the paper reworking the database so that it aligns with the advise and practices considered optimal by Kimball. Following the The Set Query Benchmark paper, this paper also contains a consideration of the functional and selectivity coverage aimed for by this benchmark. In terms of the functional coverage, the benchmark provides a relatively small number of queries, each exploring different numbers of predicates on dimension, and fact, columns. From this point of view a number of common star schema query scenarios are not provided, or possible, including, for example, where degenerate dimensions, junk dimensions, factless fact tables, fact dimensions, or joins from the fact table to non-leaf level dimension attributes are present. Selectivity coverage is provided for by varying across the queries the number of rows from the fact table which must be fetched in order to provide the results. The queries then are split into four ‘flights’ of queries, where Flights 1 – 4 each have involve restrictions on the corresponding number of dimensions (so Flight one has a restriction on dimension, while Flight 4 has a restriction on 4 dimensions).

Another thing which the Star Schema Benchmark brings to the table, and that I have found useful, is a tool which will generate synthetic datasets with consistent cardinalities between the fact table & dimension tables, and distributions within these tables. In the course of my work, whether in selecting a database platform, or in proving the effectiveness of different approaches to a query design, having access to a suitably modeled & distributed dataset is very helpful. An option which I have used previously & seen being used is the use, sanitised of course, of a ‘real world’ (aka client) dataset. Beyond the sanitisation required such an approach also brings a number of other complicating factors. DeWitt in his paper, The Wisconsin Benchmark: Past, Present, and Future, explains why in the creation of this benchmark he & his team opted for synthetic rather than empirical (so pre-existing real world) data. The arguments he puts forward are that:

  • Empirical databases are hard to scale;
  • The values in empirical databases make it more difficult to systematically benchmark a system. E.g. creating queries that allow for precise levels of selectivity;
  • Empirical databases don’t tend to have uniformly distributed values;
  • Through the use of a synthetic database the simplicity of the structure and distributions of attribute values could be ensured enabling those using the benchmark to quickly understand the database and to design new queries for it.

Whilst if, through the investment of effort, these difficulties are overcome such a dataset may result in a benchmark which concerns a more focused, and therefore perhaps, relevant domain the cost of this effort should not be underestimated (though in IT of course, it probably will be). Also, depending on the questions which must be answered, the usefulness of such a specific dataset may not be any greater. In Doing Your Own Benchmark (again part of the same Jim Gray volume) Sawyer suggests that before, & whilst, undertaking a benchmarking exercise (or adventure as he puts it) it is necessary to ask three interacting questions:

  1. What do you want to learn?
  2. How much are you prepared to invest?
  3. What are you prepared to give up?

A lot of the time I suspect, when faced with the answer to 2. and given the alternative of generating an otherwise fit for purpose synthetic dataset, the answer to 3. will include giving up an empirical dataset, along perhaps with some of the more specific questions it might have additionally answered, but which probably are not of core importance.

The dbgen tool which was created for The Star Schema Benchmark is derived from the TPC-H dbgen tool and enables its user to generate files containing pipe separated records which match the tables required by this benchmark. Furthermore, in using this tool a scale-factor value is provided which acts as a multiplier on the number of rows which will be generated. So that with a scale-fact of 1 nearly 6,000,000 fact table rows are generated with various smaller amounts of rows for each of the dimension tables depending on their cardinality relative to the fact table, but given a scale factor of 10 the figure is 60 million with the rows for the dimension tables being scaled appropriately (so for dates not at all).

I’ve briefly blogged about the getting started using the dbgen tool with SQL Server here.

The Benchmark Handbook can be found freely available online here, courtesy of Microsoft Research. Together with links to pdfs for each of the chapters from it that I have cited here.

There are two Star Schema Benchmark papers that I have been able to access & use. The 2007 paper here, and the 2009 revision (the 3rd revision apparently) of this paper here.

There is a version of the dbgen tool available here on Github.

Read Full Post »

The paper can be found here:

I’ve read a number of Jim Gray’s papers before & always found them to be incredibly insightful & useful. Typically I’ve also been almost shocked that they could have been written so long ago, and yet be so relevant now, for example ‘The Transaction Concept: Virtues and Limitations‘ published back in 1981 and which next time I re-read it I’ll try & get something up here about. So coming to this paper, published just recently in 2004, I was very much looking forward to seeing what I might get from it.

The abstract sets out the scene as Jim Gray saw it in 2004, some of which remains very reasonable as a statement of current affairs, for example, the movement of intelligence to the periphery of the network, the integration of queues in to relational database platforms, the expectations we have that RDMSs are highly available and low cost to maintain, the rise of column oriented storage (vNext of SQL Server for example has an engine which provides a column oriented store, not too mention the various BigTable like stores out there now). In fact the only prediction/statement of his which I can draw serious contention with, having the obvious advantage of hindsight, is that ‘XML and xQuery will be the main data structure and access pattern’. I can’t express just how glad I am that this isn’t the case, but that’s for another post (or more typically a boozy twrant).

For me though, the abstract is where Gray stops being right. The revolution which Gray describes is I think perhaps only a revolution for the manufacturer of the RDBMS, but is very little one for its user, and is not perhaps a revolution caused by being in touch with their market, but from losing contact with it. And yet taken a little differently, his analysis seems, perhaps unsurprisingly, almost spot-on.

These are not the revolutions you’re looking for

Firstly let me explain why I feel that Grays revolution is not in fact the revolution at all. Gray’s revolution is:

  1. the ability to execute ‘OO’ code in the RDBMS (so Java in Oracle, & C# in SQL Server);
  2. relational databases presenting services that are accessible from the web;
  3. the inclusion of queuing systems inside the RDBMS platform;
  4. the arrival of cubes as a way of managing & modeling aggregations;
  5. the arrival of data mining;
  6. the ‘rebirth’ of column stores;
  7. that RDBMs now deal with ‘messy’ data better, such as text, temporal, and spatial data;
  8. RDBMs working with semi-structured data, in particular Gray points to the integration of the RDBMS and the filesystem (so, I imagine that he’s thinking here of something like SQL Server’s Filestream attribute & the capability that this enables). He also mentions XML.
  9. The requirement for Stream Processing – as he puts it in the abstact, that now the data finds millions of queries, rather than the queries acting over millions of rows of data.
  10. A movement towards pub-sub styles of replication;
  11. A need for query plans which take into account changing load on the system, skews in the distribution of the data, and changing statistics about the data;
  12. The substantial changes in terms of the size of available storage, both on disk & in memory, and the consequential shift in the latency of reads from each of these. The relative increase in the cost of random access reads over sequential reads.
  13. The possibility to move the RDBMS platform down to the disk, so that instead of a disk being organised around files, it becomes organised as a relational database.
  14. ‘Self-managing and always up’ is how he puts it, and we are all familiar with that idea now surely.

In all of these areas I think, in terms of the change in capabilities, Gray is clearly correct. However if I look at that list I think I could split it into those areas where, although capability has been added the existence of the capability has had little relative uptake or impact; those areas where other technologies than the relational database are dominating, or look set to; and, those areas that whilst the capability has been added, and has had uptake, it has not had a revolutionary impact.

The ability to execute imperative (or OO if you like) code inside of the database has had little uptake, and in general has been rejected by both the DBA community and the programmer communities. Similarly, the inclusion of messsage queues inside the RDBMS has had little uptake or impact, with most organisations which choose to adopt message queuing approaches preferring technologies which exist outside of their RDBMS (such as WebSphereMQ, Tibco, ActiveMQ, RabbitMQ, etc…) I’m not going to go into why I think this has happened here, and I don’t have any figures to back this up – but I’ve certainly not witnessed any uptake of these things, nor widespread discussion of & interest in them. I’m not suggesting that these capabilities are never used, not that they are never appropriate nor useful, just that they haven’t had anything like a revolutionary impact – unless you’re a company/engineer concerned with making RDBMSs. This said, I’ve certainly heard of platforms like Redis being used as high performance queues, and I’ve used CouchDB myself as a distributed, durable queue. So I think Jim Gray is correct that there is a union between database platforms and queuing technologies that is happening to an extent (though I’m not sure it constitutes a revolution), but despite these capabilities being bolted onto RDBMSs, they’re not where its happening.

Similarly, though relational database platforms have now enabled the exposing of data as ‘services’ through the web for a number of years, I have not seen this become a popular thing. Risk averse organisations are typically unwilling to risk exposing there data platforms in these ways (perhaps because they have been trained to see the DB as the king in a chess game of security), and the capabilities which these platforms expose (& I’m thinking specifically of SQL Server Astoria here). I could suggest that a lot of webservices which I’ve seen do very little except to directly expose CRUD like operations onto databases, but these remain an external wrapper around the database. If the relational database platform has been unsuccessful in promoting this approach then perhaps the NoSQL community has been more successful. Products like CouchDB take exposing the database as a web service to their core, and other products such as Neo4J, Riak, and HBase have followed in providing various interpretations of RESTful webservices. Perhaps this could be because the kind of organisation which might adopt NoSql stores assesses & manages risks in different ways to many which will not adopt these. So in one sense here Jim Gray is absolutely correct, it’s just that this hasn’t really happened for the RDBMS.

Perhaps paralleling this, Complex Event Processing definitely seems to have gained momentum over the last few years with platforms & products growing in capability & number which enable the concurrent, near-Real-Time processing of vast streams of events. Again however the RDBMS platform has failed to achieve much traction in this space, regardless of the capabilities which have been shoe-horned on to these products. Even the languages being used in many of the CEP products are clearly inspired by SQL, but this is where the closeness between the platforms perhaps end. Jim Gray is clearly correct about there being a revolution, perhaps still on the way, which will place stream processing far more centrally in how IT can enable, but it has failed, at least to date, to be reflected in the usage, or demands, of the RDBMS.

Cubes and Data mining are an area where clearly the RDBMS market has done very well, and can be viewed as an area where, in terms of usage & demand a revolution might be seen to have occurred in the market, rather than purely inside the vendors. It is also an area where the RDBMS vendors face a lot of competition as a plethora of other approaches and platforms have exploded on to the market competing in this space, such as the Map-Reduce implementations found in products like Hadoop and Greenplum.

When it comes to the changes in the hardware-scape onto which RDBMS vendors must prepare their products for deployment the revolution that Gray describes has most definitely occurred & is still occurring in every sense, and with the growth also of flash memory & SSDs perhaps more so than he anticipated. The challenges which he describes the engineers of RDBMS platforms as facing seem as relevant as when he wrote this piece, if not more so.

In conclusion then, I think that if the revolution is viewed as a revolution in the skills & projects of the teams working on Oracle or SQL Server then maybe Jim Gray is right in practically every aspect. If it is viewed as a list of capabilities that various different DBMSs may become popular for providing, then again, he is broadly correct. If though it is to be viewed as a revolution in what the market will demand and use in an all singing, all dancing, RDBMS then I think not only is Jim Gray wrong, but that this thinking which has clearly imbued places like Oracle and SQL Server over the last few years (decades?) is perhaps why these RDBMSs have become the hulking great behemoths that they now are, so overladen with features and capabilities, that both they & the organisations that place them at their core may find it difficult to maintain a level of organisational agility which, at least through technology, might allow them to achieve a competitive advantage. In particular the paper sets me thinking (& perhaps I’m echoing them already) about the papers by Michael Stonebraker on ‘The End of an Architectural Era’.

Read Full Post »

%d bloggers like this: