Archive for the ‘Benchmarking’ Category

So, having described a bit about the Star Schema Benchmark, let’s put it into action with SQL Server 2008. Whilst I could run it locally, I thought it would be more interesting to run it on an Amazon EC2 SQL Server instance. Now obviously, in terms of HDD performance & all-round I/O this is not an ideal platform for this kind of thing, but the point of this post is just to provide a bootstrap for someone looking to use this benchmark & its data generation tool with SQL Server.

First things first, a quick trip to my EC2 console & in less than 2 minutes I have a nice Windows instance up and running (though of course I have to wait > 15 mins to get the password :( The pains of Windows on EC2 instead of Linux, but ah well). I plumped for an m1.xlarge instance, which gives me 4 virtual cores & 15GB of RAM on Windows Server 2008 R2 Datacenter edition, with SQL Server 2008 R2 Standard edition also installed and ready. Not only that, but given that the cluster machine types are not available with the Windows OSes, it also gives me the best I/O of the selection. With the instance started, the administrator password retrieved, and me now logged in, I took care of some basic tasks: firstly, getting git installed courtesy of msysgit; secondly, cloning my repo for this project from GitHub; and thirdly, starting up the SQL Server service.

With that all done, I was ready to generate some files. From the command prompt, and with the dbgen tool & the dists.dss file in a new folder ready to hold the generated files I ran the command:

dbgen -s 2 -T a

to generate all of the tables with a scale factor of 2. With that taking less than the time it took me to flip back to my laptop & send some tweets, I drew confidence and ran the same command again, but this time with a scale factor of 10, which gave me 59,986,214 rows in my fact table. Not a big database by any standards, but big enough to get going with, I figure. Generating the data took about 10 mins (I didn’t time it, but that’s how it felt), and I then ran my script of bulk insert statements to load all of the data into the database (I’d created the database & tables using my scripts whilst the data was being gen’d).

During the load into SQL Server I took the opportunity to look at the resource monitor and, unsurprisingly, it showed that whilst memory & CPU were hardly being touched, disk I/O was going at full pelt as it read from the lineorder file and pushed the data into tempdb. Unfortunately, during the load of the LineOrder table I also ran out of disk space for tempdb to use. Lesson learned: use a bigger HDD.

So, I created a new 25GB EBS volume for tempdb & another of 50GB for the database, both of which I then attached to the instance before restarting it. With the instance now showing the new drive, I remapped the tempdb data and log files to it, and detached the StarSchemaBenchmark database so that I could move its files to the new drive & reattach them there:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'D:\SqlData\tempdb.mdf');

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'D:\SqlData\tempdb.ldf');

With tempdb relocated (a change that only takes effect the next time the SQL Server service starts), I could run the import of the LineOrder table again.
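For anyone wanting to script the load step, here is a minimal sketch in Python that just builds the T-SQL BULK INSERT statements rather than running them. It is not the script from my repo: the table names, the `.tbl` file names, the `dbo` schema and the `D:\SsbData` folder are all assumptions, as is the trailing pipe at the end of each record, so check them against your own dbgen output and schema first.

```python
from pathlib import PureWindowsPath

# Hypothetical table names and data folder: adjust these to match your own
# schema and wherever dbgen wrote its output files.
TABLES = ["customer", "part", "supplier", "date", "lineorder"]
DATA_DIR = PureWindowsPath(r"D:\SsbData")


def bulk_insert_statement(table, data_dir=DATA_DIR):
    """Build one T-SQL BULK INSERT statement for a pipe-separated file."""
    path = data_dir / f"{table}.tbl"
    # Assumes every record ends with a trailing pipe before the newline,
    # hence the '|\n' row terminator in the generated SQL.
    return (
        f"BULK INSERT dbo.[{table}] FROM '{path}' "
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '|\\n', TABLOCK);"
    )


if __name__ == "__main__":
    # Dimensions first, fact table last.
    for name in TABLES:
        print(bulk_insert_statement(name))
```

The printed statements can be pasted into a query window or saved as a script. TABLOCK is there because it is commonly used to speed up loads into an empty table; drop it if other sessions need the table during the load.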

This time it was successful. So I ran the standard queries as a batch which, in case anyone cares, took 2 minutes 05 seconds to complete. A bit of index tuning later and I had that down to 21 seconds. But don’t pay attention to the results; the point of this article (& the series it’s a part of) is to help people in using the Star Schema Benchmark & its dbgen tool.

From here I could, and will, create cubes out of the same data, compare different approaches to writing SQL to see where they might carry performance benefits, investigate indexing, all sorts really. I’ll probably also be setting up a server I bought off eBay a while back (amazing what £100 will get you in terms of hardware) and putting a 500 million row dataset on it to see how that works out. That will also let me play around more with the effect of moving things around different drives, partitioning, etc., not to mention running the Enterprise edition (and, of course, the whole, much larger world that isn’t MS SQL Server, probably starting with Greenplum CE, Postgres, & MonetDB).

The scripts I used for this are all available from my GitHub account here.

Nick Haslam has blogged about working with the TPC-H standard & SQL Server here.


It seems self-evident that to be relevant a benchmark must speak to a particular class of problems. For example, it should be obvious that a benchmark designed to measure the performance of transaction processing will not be a good fit when assessing the appropriateness of systems intended solely for use in reporting. The specificity argument can, of course, lead to ever more specific benchmarks. In ‘The Set Query Benchmark’, a paper which forms a part of ‘The Benchmark Handbook’, a volume edited by Jim Gray on benchmarking databases, Gray presents 4 criteria which a domain-specific benchmark must meet if it is to be useful:

  • Relevant;
  • Portable;
  • Scaleable;
  • Simple.
Patrick O’Neil, in his paper The Set Query Benchmark, published in the same volume, gives 4 characteristics of his Set Query benchmark. Two of these, Portability and Scalability, are held in common with the 4 criteria that Jim Gray proposes (and his benchmark meets the other 2), but he also elects to include Functional Coverage and Selectivity Coverage. These are, perhaps, a little less self-explanatory than the criteria that Gray proposes. Selectivity Coverage refers to the extent to which a benchmark covers the potential spectrum of selectivity, from one row being returned by a query (most selective) to all of the rows being returned by a query (no selectivity). Functional Coverage refers to the extent to which the benchmark covers the range of queries commonly run in commercial settings. For both of these criteria, O’Neil points out that users of his benchmark can examine just a subset of the measurements (that is, the region at the intersection of functionality & selectivity, as well as hardware/infrastructure & scale, that matches their own needs).

The Star Schema Benchmark is described in a 2007 paper whose lead author, Pat O’Neil, is also the author of The Set Query Benchmark discussed above. It describes a domain-specific benchmark that has been designed to enable the comparison of star schema performance across different products. The benchmark itself is a derivative of the TPC-H standard, but one where the structure of the database has been transformed into a star schema: columns have been dropped (for instance, text columns from the fact table), and the database has been reworked in other ways, described at length in the paper, so that it aligns with the advice and practices considered optimal by Kimball.

Following The Set Query Benchmark paper, this paper also contains a consideration of the functional and selectivity coverage aimed for by this benchmark. In terms of functional coverage, the benchmark provides a relatively small number of queries, each exploring different numbers of predicates on dimension and fact columns. From this point of view a number of common star schema query scenarios are not provided, or possible, including, for example, those involving degenerate dimensions, junk dimensions, factless fact tables, fact dimensions, or joins from the fact table to non-leaf level dimension attributes. Selectivity coverage is provided for by varying across the queries the number of rows from the fact table which must be fetched in order to provide the results. The queries are split into four ‘flights’, where Flights 1 – 4 each involve restrictions on the corresponding number of dimensions (so Flight 1 has a restriction on one dimension, while Flight 4 has restrictions on 4 dimensions).
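To make the selectivity idea concrete, here is a small sketch of how the fraction of fact rows a query touches can be estimated by multiplying the selectivity of each restriction together. It assumes the predicates are independent of one another, and the numbers in the example are purely illustrative rather than taken from the paper.

```python
from math import prod


def combined_filter_factor(factors):
    """Multiply per-predicate selectivities, assuming they are independent."""
    for factor in factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"selectivity out of range: {factor}")
    return prod(factors)


def expected_fact_rows(total_rows, factors):
    """Estimate how many fact rows survive all of the restrictions."""
    return round(total_rows * combined_filter_factor(factors))


if __name__ == "__main__":
    # Illustrative only: one restriction keeping 1 row in 7, another keeping half.
    restrictions = (1 / 7, 0.5)
    print(expected_fact_rows(6_000_000, restrictions))
```

Adding restrictions on more dimensions shrinks the product, which is how a set of queries can be spread from fetching a large share of the fact table down to a tiny sliver of it.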

Another thing which the Star Schema Benchmark brings to the table, and that I have found useful, is a tool which will generate synthetic datasets with consistent cardinalities between the fact table & dimension tables, and consistent distributions within these tables. In the course of my work, whether in selecting a database platform or in proving the effectiveness of different approaches to a query design, having access to a suitably modelled & distributed dataset is very helpful. One option, which I have used previously & seen others use, is a ‘real world’ (aka client) dataset, sanitised of course. Beyond the sanitisation required, such an approach also brings a number of other complicating factors. DeWitt, in his paper The Wisconsin Benchmark: Past, Present, and Future, explains why in the creation of this benchmark he & his team opted for synthetic rather than empirical (so pre-existing, real world) data. The arguments he puts forward are that:

  • Empirical databases are hard to scale;
  • The values in empirical databases make it more difficult to systematically benchmark a system. E.g. creating queries that allow for precise levels of selectivity;
  • Empirical databases don’t tend to have uniformly distributed values;
  • Through the use of a synthetic database the simplicity of the structure and distributions of attribute values could be ensured enabling those using the benchmark to quickly understand the database and to design new queries for it.
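As a rough illustration of the second and fourth points, the sketch below generates a tiny synthetic table whose columns are derived from the row number, so that a predicate hits an exactly known share of the rows. The column names and the modulo scheme are my own simplification for the purposes of the example, not the Wisconsin Benchmark’s actual schema.

```python
def make_rows(n):
    """Generate n synthetic rows with predictable value distributions."""
    return [
        {
            "unique_id": i,          # every value distinct
            "one_percent": i % 100,  # 100 distinct values, 1% of rows each
            "ten_percent": i % 10,   # 10 distinct values, 10% of rows each
        }
        for i in range(n)
    ]


def selectivity(rows, predicate):
    """Return the fraction of rows for which the predicate holds."""
    matched = sum(1 for row in rows if predicate(row))
    return matched / len(rows)


if __name__ == "__main__":
    rows = make_rows(10_000)  # a multiple of 100 keeps the shares exact
    print(selectivity(rows, lambda r: r["one_percent"] < 5))
    print(selectivity(rows, lambda r: r["ten_percent"] == 3))
```

Getting the same precision out of an empirical dataset would mean profiling every column before writing each query.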

If, through the investment of effort, these difficulties are overcome, such a dataset may result in a benchmark which concerns a more focused, and therefore perhaps more relevant, domain. The cost of this effort should not be underestimated, though (in IT, of course, it probably will be). Also, depending on the questions which must be answered, the usefulness of such a specific dataset may not be any greater. In Doing Your Own Benchmark (again part of the same Jim Gray volume) Sawyer suggests that before, & whilst, undertaking a benchmarking exercise (or adventure, as he puts it) it is necessary to ask three interacting questions:

  1. What do you want to learn?
  2. How much are you prepared to invest?
  3. What are you prepared to give up?

A lot of the time, I suspect, when faced with the answer to question 2, and given the alternative of generating an otherwise fit-for-purpose synthetic dataset, the answer to question 3 will include giving up an empirical dataset, along perhaps with some of the more specific questions it might additionally have answered, but which probably are not of core importance.

The dbgen tool which was created for The Star Schema Benchmark is derived from the TPC-H dbgen tool and enables its user to generate files containing pipe-separated records which match the tables required by this benchmark. In using this tool a scale factor value is provided, which acts as a multiplier on the number of rows generated. With a scale factor of 1, nearly 6,000,000 fact table rows are generated, with various smaller numbers of rows for each of the dimension tables depending on their cardinality relative to the fact table. With a scale factor of 10 the figure is nearly 60 million, with the rows for the dimension tables being scaled appropriately (so for dates, not at all).
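As a rough sketch of both points, the snippet below estimates the fact table size for a given scale factor and reads pipe-separated records into lists of fields. The 6,000,000 multiplier is only the approximate figure quoted above, the sample records are made up, and the trailing pipe at the end of each record is an assumption worth checking against your own generated files.

```python
import csv
import io

# Approximate fact rows per unit of scale factor, per the figures quoted above.
FACT_ROWS_PER_SCALE_FACTOR = 6_000_000


def approx_fact_rows(scale_factor):
    """Rough upper estimate of fact table rows for a given scale factor."""
    return FACT_ROWS_PER_SCALE_FACTOR * scale_factor


def read_records(lines):
    """Yield each pipe-separated record as a list of string fields."""
    for fields in csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip blank lines
        if fields[-1] == "":
            fields.pop()  # drop the empty field left by a trailing pipe
        yield fields


if __name__ == "__main__":
    # Made-up sample records, not real dbgen output.
    sample = io.StringIO("1|Customer#1|ASIA|\n2|Customer#2|EUROPE|\n")
    for record in read_records(sample):
        print(record)
    print(approx_fact_rows(10))
```

In practice `read_records` can be handed an open file object instead of the in-memory sample, which makes it easy to eyeball a few rows or count them before loading anything into a database.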

I’ve briefly blogged about getting started with the dbgen tool and SQL Server here.

The Benchmark Handbook is freely available online here, courtesy of Microsoft Research, together with links to PDFs of each of the chapters that I have cited here.

There are two Star Schema Benchmark papers that I have been able to access & use. The 2007 paper here, and the 2009 revision (the 3rd revision apparently) of this paper here.

There is a version of the dbgen tool available here on GitHub.
