So, having described a bit about the Star Schema Benchmark, let’s put it into action with SQL Server 2008. Whilst I could run it locally, I thought it would be more interesting to run it on an Amazon EC2 SQL Server instance. Obviously, in terms of HDD performance & all-round I/O this is not an ideal platform for this kind of thing, but the point of this post is just to provide a bootstrap for anyone looking to use this benchmark & its data generation tool with SQL Server.
First things first: a quick trip to my EC2 console & in less than 2 minutes I have a nice Windows instance up and running (though of course I have to wait more than 15 minutes to get the password; such are the pains of Windows on EC2 instead of Linux, but ah well). I plumped for an m1.xlarge instance, which gives me 4 virtual cores & 15GB of RAM on Windows Server 2008 R2 Datacenter edition, with SQL Server 2008 R2 Standard edition also installed and ready. Not only that, but given that the cluster instance types are not available with Windows, it also gives me the best I/O of the selection. With the instance started, the administrator password retrieved, and me now logged in, I took care of some basic tasks: firstly, getting git installed courtesy of msysgit; secondly, cloning my repo for this project from GitHub; and thirdly, starting up the SQL Server service.
With that all done, I was ready to generate some files. From the command prompt, with the dbgen tool & the dists.dss file in a new folder ready to hold the generated files, I ran the command:
dbgen -s 2 -T a
to generate all of the tables with a scale factor of 2. With that taking less than the time it took me to flip back to my laptop & send some tweets, I drew confidence and ran the same command again, this time with a scale factor of 10, which gave me 59,986,214 rows in my fact table. Not a big database by any standards, but big enough to get going with, I figure. Generating the data took about 10 minutes (I didn’t time it, but that is how it felt), and I then ran my script of bulk insert statements to load all of the data into the database (I’d created the database & tables using my scripts whilst the data was being generated).
During the load into SQL Server I took the opportunity to look at the resource monitor and, unsurprisingly, it showed that whilst memory & CPU were hardly being touched, disk I/O was going at full pelt as it read the generated lineorder data and pushed it into tempdb. Unfortunately, during the load of the LineOrder table I also ran out of disk space for tempdb to use. Lesson learned: use a bigger HDD.
So, I created a new 25GB EBS volume for tempdb & another of 50GB for the database, both of which I then attached to the instance before restarting it. With the instance now showing the new drive, I remapped the tempdb data and log files to it, and detached the StarSchemaBenchmark database so that I could move its files to the new drive & reattach them there:
ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'D:\SqlData\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'D:\SqlData\tempdb.ldf');
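The statements above only cover tempdb. For the detach, move, and reattach of the benchmark database, a minimal sketch might look like the following. The two file names under D:\SqlData are assumptions for illustration (use whatever your .mdf and .ldf files are actually called), and the tempdb change only takes effect once the SQL Server service has been restarted.

```sql
-- Check the paths now recorded for tempdb's files.
SELECT name, physical_name
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');

-- Detach the benchmark database so its files can be moved.
USE master;
EXEC sp_detach_db @dbname = N'StarSchemaBenchmark';

-- At this point, move the data and log files to the new drive
-- (Explorer, move, or robocopy all do the job).

-- Reattach from the new location.
-- ASSUMPTION: both file names below are hypothetical placeholders.
CREATE DATABASE StarSchemaBenchmark
ON (FILENAME = N'D:\SqlData\StarSchemaBenchmark.mdf'),
   (FILENAME = N'D:\SqlData\StarSchemaBenchmark_log.ldf')
FOR ATTACH;
```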
Now I could run the import of the LineOrder table again.
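For anyone without the scripts to hand, a bulk insert for the fact table might look roughly like this. It is a sketch rather than the exact statement from my scripts: the file path and the dbo.LineOrder table name are assumptions, as is the idea that each row dbgen writes ends with a trailing pipe before the newline (check a few lines of your own file first).

```sql
-- ASSUMPTIONS: the path, the table name, and the trailing '|' on each row.
BULK INSERT dbo.LineOrder
FROM 'C:\ssb\lineorder.tbl'
WITH (
    FIELDTERMINATOR = '|',    -- dbgen output is pipe-delimited
    ROWTERMINATOR   = '|\n',  -- assumes a trailing delimiter on every row
    BATCHSIZE       = 1000000, -- commit every million rows as its own transaction
    TABLOCK                   -- take a table-level lock for the duration of the load
);

-- Sanity check: at scale factor 10 this should match the 59,986,214 rows generated.
SELECT COUNT_BIG(*) AS row_count
FROM dbo.LineOrder;
```

BATCHSIZE is there so that a failure part-way through does not roll back the whole load; whether it changes how much tempdb gets used will depend on the indexes already on the table.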
This time it was successful. So I ran the standard queries as a batch which, in case anyone cares, took 2 minutes 05 seconds to complete. A bit of index tuning later and I had that down to 21 seconds. But don’t pay attention to the results; the point of this article (& the series it’s a part of) is to help people use the Star Schema Benchmark & its dbgen tool.
From here I could, and will, create cubes out of the same data, compare different approaches to writing SQL to see where they might carry performance benefits, investigate indexing, all sorts really. I’ll probably also be setting up a server I bought off eBay a while back (amazing what £100 will get you in terms of hardware) and putting a 500 million row dataset on it to see how that works out, which will also let me play around more with the effect of moving things around different drives, partitioning, etc., not to mention running the Enterprise edition (and, of course, the whole, much larger world that isn’t MS SQL Server, probably starting with Greenplum CE, Postgres, & MonetDB).
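To give a flavour of the kind of index tuning mentioned above, here is one illustrative example. It is not the set of indexes I actually used; it assumes the fact table is dbo.LineOrder and that the columns follow the SSB naming (lo_orderdate, lo_quantity, lo_discount, lo_extendedprice), which are the ones the first query flight filters and aggregates on.

```sql
-- Report elapsed and CPU time for each statement in the Messages tab.
SET STATISTICS TIME ON;

-- ASSUMPTION: table and column names follow the SSB specification.
-- A covering index for queries that join on the order date and
-- filter/aggregate on quantity, discount, and extended price.
CREATE NONCLUSTERED INDEX IX_LineOrder_OrderDate
ON dbo.LineOrder (lo_orderdate)
INCLUDE (lo_quantity, lo_discount, lo_extendedprice);
```

Run the query batch before and after creating an index like this and compare the timings, rather than taking my numbers as any kind of guide.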
The scripts I used for this are all available from my GitHub account here.
Nick Haslam has blogged about working with the TPC-H standard & SQL Server here.