Working with emacs


When it comes to coding, the use of IDEs is generally very popular, and dominant within most corporate/enterprise spaces. In particular:

  • Visual Studio (with Resharper) for my day-to-day development tasks in .NET land (and have done since 2002), along with its SQL Server derivatives for working with that platform (SSMS, BIDS, & now SSDT)
  • Eclipse & its derivative IBM Rational System Architect
  • Apple’s XCode for MacOS and iOS development with Objective-C
  • JetBrains IDEs – especially IntelliJ for Java

These are great, but with the possible exception of Eclipse, they tend to be very focused on working with a particular language or platform (eg JVM or CLR). When working with these mainstream techsets, despite some bloat-related slowness, they provide a very productive development experience for the investment in skills required to use them effectively. They are intuitive, simple, and can be very powerful.

When working with less mainstream technologies, and for those willing to invest the effort to master it, GNU Emacs offers a very effective experience. It also offers an advantage when attempting to learn a new language and its associated toolset. Unlike environments like Visual Studio, Eclipse, and IntelliJ, which often hide some of what is going on from the user (such as working with the compiler, or debuggers), its typically less complete support in these areas requires the learner to understand them in order to work. This provides a range of advantages after transferring back to an IDE, in particular when things don’t work as expected.

Emacs benefits from having a wide range of modes available for it, and with the use of an emacs package manager (I use Marmalade) getting these packages installed is typically very simple. Even where the mode you want is not available as a compatible package, it tends to be very simple to get it installed (typically a matter of putting it in a suitable location) and then editing the .emacs file using emacs. Most languages tend to have a mode developed for them, and often before they become available in IDEs. In some communities emacs tends to be the standard editor to use, with no IDEs gaining comparable mindshare. For example with lisp based languages like common lisp, clojure and scheme (emacs is written in lisp, and can be fully customised using its own elisp variant). Additionally erlang has great support in emacs, with the Wrangler refactoring tool complementing its major mode.

Emacs has a certain reputation for the complexity of its keyboard shortcuts. Its use of keyboard combinations such as Ctrl-x, r, Space, #register_number to save the current position of the point (the caret) in a register draws humorous derision from users of other editors (in particular the Vi(m) community). Such complaints are unusual amongst emacs users themselves, and typically come from the outside. It can, though, be intimidating to a new user of the system.

Because of this I’ve created a small github site to capture in one place common emacs commands. Twitter bootstrap was used to put the site quickly together.

You can find it here: emacs_shortcuts. Pages are included for SML Mode (Standard ML Mode) and the HTML Mode. Other modes may be added in time (though use of Ctrl-h, m makes this a bit moot), and pull requests are very welcome.

Writing Code


When starting new projects, one of the first things to do, especially with a new team, is to agree how the code should be written. Writing code is a process, a part of the wider process of shipping valued software, and so any advice has to cover the process of coding, not just the output itself. This is one of the reasons that I don’t find particular value in ‘coding standards’ documents. It’s not that they aren’t useful, it’s just that their usefulness is quite limited.

There are many great books written on this subject, and for me it’s important that developers read these and discuss what they think. Good code is written by thinking developers, developers who are reflective practitioners.

I’ve purposely left out in what follows popular acronyms like DRY, SOLID, CQS, DI, etc…, but they’re definitely echoed.

What follows has a C# bias, but I think it has a general applicability well beyond that (simply substitute Resharper for another similar tool, or NUnit for JUnit). I might be accused of being a software conservative from reading it. That would probably be right. I don’t think that’s a bad thing. C# as a language has been designed with that bias, to buck it, to go against its grain, is to seek problems. Besides that, I think to prefer the explicit is a good thing. I think type systems are a good thing. I think …

So…

  1. Use your professional judgement. As a professional you should have, develop, and apply your judgement. Ultimately that’s the hard bit of being a coder. If you don’t use your judgement, you won’t develop it. Recognise though the weaknesses in your ability to judge and seek the counsel of others, none of us are perfect, and none of us are right all of the time.
  2. Be consistent. Preferably be consistent with others. More preferably still be consistent in making the right choices. If in doubt about what the right choice is, ask a colleague, if they don’t know ask another colleague. If a colleague is very sure, challenge their view – very little is black and white. Don’t spend all day doing this, balance the need to maintain momentum.
  3. A Resharper settings file should be shared across the team (solution team shared). Always save Resharper settings to this. Before changing settings, check with your team.
  4. Listen to Resharper – generally, it is right and you are wrong. If you are confident you are right, check with a colleague. Try to ensure that there are no Resharper warnings present in the file – Resharper should show your code as ‘green’. If Resharper is wrong, add a comment to explain why. If the thing Resharper is warning about is not important then check with your team, and if they agree, alter the settings to mute the warning (perhaps making it just a hint) and save this setting to the team shared Resharper settings file.
  5. If adding comments (including TODO comments) to the code, also add your name, or an abbreviation of it that will be recognised (I typically use first letter of my first name followed by the first two letters of my surname, eg NRO for Neil Robbins). It’s very useful to know who made a comment so it can be followed up in person if helpful.
  6. If a method can be static, then make it static. Not marking a method that could be static as static does not make it better. Then consider whether it belongs in the class it is in. Generally speaking, static functions belong in static classes, mixing static and non-static methods in the same class suggests a lack of internal cohesion to the class.
  7. Methods which return state (non void methods) should not mutate observable state (so, it’s ok to log the call, but not to change some state in a way that affects future calls to this or other methods). Methods which mutate observable state should not return anything (they should be void). This is known as the Command Query Separation principle. It is fundamental to ensuring code can easily be reasoned about. Code should not have surprising effects. Code should have a rather dull, Ronseal like quality. Code which cannot be easily reasoned about (by someone other than you) makes a project more expensive.
  8. Understand the code you are writing, and the tools, libraries, frameworks, etc… you are using. If you don’t understand these things then either gain that understanding, or don’t use them. Think about the code you are writing and its effects:
    1. computational (does it produce a correct result, does it do this efficiently)
    2. business (does it meet the need of the business)
    3. social (other people who will need to understand it).
  9. Names should be descriptive of purpose/role. Don’t be afraid of long names, auto-completion will save your fingers from caring. Don’t use abbreviations or acronyms without very good reason, unless they are totally ubiquitous (like Id for Identifier).
  10. When naming things try to stick to idiomatic practices. If Resharper warns you about your naming (eg you called something userID instead of userId), you’re not being idiomatic.
  11. Interfaces represent roles, things that objects can do. Separate roles demand separate interfaces. Try to keep the number of methods exposed by your interfaces minimal.
  12. Think about how you will test your code. Even if you’re not doing a test first style of development, consider how you or others will be able to test it. Testing things typically requires the thing being tested to be isolatable from its dependencies. This encourages loose coupling. This keeps the codebase flexible. A good thing.
  13. Keep your classes small. As a general rule, if your class has more than 100 lines, it’s too big. If the class inherits from another class, include the number of lines in each class in the hierarchy in this count.
  14. Keep your methods small. If a class should only do one thing, then that applies at least as much to a method. Composition starts with the composing of appropriately named functions.
  15. Prefer composition over inheritance. This piece of advice is the best thing by far in the Gang of Four patterns book. Inheriting behaviour is to be questioned. Providing protected ‘utility’ functions in base classes is an anti-pattern. This isn’t to say that inheritance is bad, it’s not, it is a very powerful, useful thing. It’s also massively abused. Prefer composition through the passing of collaborators to the objects/functions that require them.
  16. Seek internal cohesion. Ideally, every method in a class would use all of that class’s fields. If you find that one subset of methods uses one subset of fields, and another subset uses another subset of fields, then your class lacks internal cohesion. Extract the subsets, creating classes out of them. An object should do one thing only, that is, serve only one purpose.
  17. Couple to interfaces for services. Function parameters that are purely data can be classes, but parameters which provide services should be specified in terms of interfaces. As mentioned earlier, these interfaces represent roles, so the function (and it could be the constructor) is specifying the roles that it requires collaborators to fulfil.
  18. Pass in collaborators, don’t construct them within functions. Preferably this will be through the constructor, if that is not possible then pass them into the function itself. Avoid the use of properties to provide collaborators.
  19. Don’t abstract too early. When is too early? It’s hard to tell, but wait until the last responsible moment, which is to say, until not having the abstraction is patently absurd/costly. Do abstract though. A nice maxim is, do it once, do it twice, refactor and abstract. Which is to say don’t be afraid to write essentially the same code three times before recognising and harvesting the commonality. This may seem like waste, but the cost of going down proverbial rabbit holes with early inappropriate abstractions tends to be far higher.
  20. Create types to express concepts, even if that means you have a lot more types. If you capture a suppliers name, create a type called SuppliersName, don’t just use a string. It takes seconds, but makes a codebase a lot more readable. Use implicit operators to provide for transparent casting back to a string, or from a string, if that makes it simpler to use when persisting to a database or similar. This is the closest I’ve found you can get to aliasing types like you can in languages like F#. Once you have this concept in your codebase you may find that you start to find behaviour that should belong to it. This applies also to collection types. Creating custom collection types that hide the data structure which they use internally, can provide far richer semantics making code much easier to understand, and giving behaviours a single, sensible, home. We’re paid to write code, don’t be afraid to write it. Less code is not always a good thing (cough APL - life←{↑1 ⍵∨.∧3 4=+/,¯1 0 1∘.⊖¯1 0 1∘.⌽⊂⍵} ). Making things explicit always is.
  21. Use Ruby/sentence style naming for tests (eg when_something_happens given_a_context_then_something_is_observed). Don’t feel required to use BDD style semantics. Use the semantic that best describes what you are testing (as with any method). A separation between the fundamentals of setting up the context, performing the action being tested, and examining the result to see if it is within the bounds of correctness, is pretty fundamental and useful however.
  22. Use plain-old NUnit (assuming C#). It does the job plenty well. Most other C# testing libraries are overly complicated opinionated guff, or just a lesser copy of NUnit. So, no Specflow, mspec, Cucumber, Gherkin, MSTest, MBUnit, etc… If NUnit seems to lack a killer feature of another library, you almost certainly don’t need it, and probably do need to simplify what you’re doing. If you’re lucky enough to work with stakeholders who are happy to sit with you and co-own tests that conform to the Gherkin DSL then congratulations, you are the <1% and using a Cucumber clone (or Cucumber) might benefit you. For everyone else it won’t. It’ll just add extra cost to what you’re doing. If you need libraries like Selenium that drive applications then great, use them, just use them with NUnit.
  23. If an action being tested requires a number of separate assertions consider creating each assertion as a separate method, with the action itself and the context it requires being carried out in a separate method, and only once for the fixture (use the TestFixtureSetUp attribute). Name the assertion methods appropriately. Consider moving any complex logic required for the assertion into separate functions. I often use extension methods for this, or hand crafted mocks or spies.
  24. Have as few Projects in your Solution as reasonable. More projects == slower compilation. A project produces a .dll which is a unit of deployment/distribution. If the code needs to be deployed/distributed differently to the way the other projects are, then you need a new project. If not, then you probably don’t. One test project is usually plenty. Use test categories to mark different types of tests (the Category attribute of NUnit), don’t create extra projects.
  25. Not all external code (including a lot from Microsoft – cough their ASP.NET and ASP.NET MVC stacks) supports good practice. Consider using gateway code, that is, writing adaptors and façades. This decouples your code from the externally owned code and can keep the issues in the external code from bleeding into yours. It can also help to keep your code reasonably testable.
  26. Mocking is there to provide test doubles (classes that take the place of ones used in production) which are capable of verifying their use in a context. If you don’t need to verify anything then you don’t need a mock, you just need to provide a data structure. If interfaces are properly defined and properly narrow then this shouldn’t be a problem. If they aren’t then change your code so that they are.
  27. Don’t mock types that you don’t own. This advice, at least popularised by the people who created the idea of mocking and the first mocking library, is good advice. A lot of the time, like with Microsoft frameworks, you won’t even be able to mock it anyway without resorting to ‘powerful’ mocking tools. These powerful mocking tools are unnecessary, because you shouldn’t be mocking anything you don’t own.
  28. Not all code needs to be of an equal standard. Some code can be pretty shocking and it won’t matter so much. Where it costs no more to get it right than to write crappy code (which is most of the time I find) try to think about the code you’re writing and write the better code.
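Two of the points above lend themselves to a quick sketch in code. A minimal illustration of Command Query Separation (point 7) and of creating a small type to express a concept (point 20) – written in Python for brevity rather than C#, with the names SupplierName and SupplierRegister invented purely for the example:

```python
class SupplierName:
    """A small type expressing a concept (point 20), rather than a bare string."""
    def __init__(self, value: str):
        if not value:
            raise ValueError("A supplier name cannot be empty")
        self.value = value

    def __str__(self) -> str:
        # The nearest Python analogue to a C# implicit operator back to string.
        return self.value


class SupplierRegister:
    """Illustrates Command Query Separation (point 7)."""
    def __init__(self):
        self._names = []

    def register(self, name: SupplierName) -> None:
        # Command: mutates observable state, returns nothing.
        self._names.append(name)

    def count(self) -> int:
        # Query: returns state, mutates nothing.
        return len(self._names)
```

register is a command (void, mutating) and count is a query (pure); a caller can reason about each without surprises, and the SupplierName type gives the concept a single, sensible home for any behaviour it later acquires.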

I’m using Standard ML of New Jersey, the latest release available at this time being the 110.74 version from Jan 20, 2012. This can be found here (where any newer versions released since should also be linked to): http://www.smlnj.org/dist/working/

Unfortunately, at least for me, this page isn’t the easiest to find, as Google’s top hit sent me to the Sourceforge site for Standard ML of New Jersey (http://smlnj.sourceforge.net/) where the latest version available is the 110.60 version from back in December 2006.

I used the Intel specific Mac OS installer package and this placed the files in the folder /usr/local/smlnj-110.74. Navigating to there and running sml from the bin (./bin/sml) provides access to a repl console. To exit the repl use ctrl+d.

Next up for me was to get the sml-mode for emacs working. Sadly I couldn’t see any elisp packages available to do this for me (M-x list-packages), and besides, doing these things manually provides more learnings I find. To get sml and emacs working together happily involved:

  1. getting the sml-mode elisp files and placing them in a new folder '~/.emacs.d/sml-mode' – I like to keep all my emacs mode & other mod files together here
    1. I downloaded the sml-mode using the link here: http://www.iro.umontreal.ca/~monnier/elisp/ – I plumped for the Download 5.0 July 2012. Running make for me resulted in a fair few warnings, but nothing too concerning I thought.
  2. editing my ~/.emacs file so that it can find that folder and will use the sml-mode when a .sml file is loaded into emacs, and so that the sml inferior mode is available to give me access to the sml repl from within emacs

;;; use sml-mode when file ends in .sml
(autoload 'sml-mode "sml-mode" "Major mode for editing Standard ML code." t)
(autoload 'run-sml "sml-proc" "Inferior mode for running the compiler in a separate buffer." t)
(setq auto-mode-alist
      (append '(("\\.sml$" . sml-mode)) auto-mode-alist))
(setq sml-program-name "/usr/local/smlnj-110.74/bin/sml")

With this done, if you load a file with the .sml extension, or otherwise when you M-x sml-mode, then you get some syntax highlighting and help with indentation.

Also, if you M-x run-sml and then hit enter (presuming that you’ve adjusted the sml-program-name from my path above if necessary) then you’ll get a new buffer opened with the sml repl running.

EDIT: I’ve created a small site with details of useful emacs commands, as well as detailing the SML Mode commands: http://neilrobbins.github.com/emacs_shortcuts/

To get this going I found this site on sml-mode very helpful and I recommend looking at it: http://www.smlnj.org/doc/Emacs/sml-mode.html#SML-Mode

This page also has some useful advice: http://www.cs.washington.edu/education/courses/cse341/07sp/help/sml-repl.html

So, having described a bit about the Star Schema Benchmark, let’s put it into action with SQL Server 2008. Whilst I could run it locally, I thought it would be more interesting to run it on an Amazon EC2 SQL Server instance. Now obviously, in terms of HDD performance & all round IO this is not an ideal platform for this kind of thing, but the point of this post is just to provide a bootstrap for someone looking to use this benchmark & its data generation tool with SQL Server.

First things first, a quick trip to my EC2 console & in less than 2 minutes I have a nice Windows instance up and running (though of course I have to wait > 15 mins to get the password :( – the pains of Windows on EC2 instead of Linux, but ah well). I plumped for a m1.xlarge instance which gives me 4 virtual cores & 15GB of RAM on Windows Server 2008R2 Datacentre edition with SQL Server 2008R2 Standard edition also installed and ready. Not only that, but given that the cluster machine types are not available with the Windows OSs, it also gives me the best I/O of the selection. With the instance started, the administrator password retrieved, and me now logged in, I took care of some basic tasks. Firstly, getting git installed courtesy of msysgit. Secondly, cloning my repo for this project from GitHub, and thirdly starting up the SQL Server service.

With that all done, I was ready to generate some files. From the command prompt, and with the dbgen tool & the dists.dss file in a new folder ready to hold the generated files I ran the command:

dbgen -s 2 -T a

to generate all of the tables with a scale-factor of 2. With that taking less than the time it took me to flip back to my laptop & send some tweets, I drew confidence and ran the same command again, but this time with a scale factor of 10, which gave me 59,986,214 rows in my fact table. Not a big database by any standards, but big enough to get going with I figure. This took about 10 mins to generate the data (I didn’t time it, but it seemed like about 10 mins to me) and so I then ran my script with bulk insert statements to insert all of the data into the database (I’d generated the database & tables using my scripts whilst the data was being gen’d). During the load into SQL Server I took the opportunity to look at the resource monitor and, unsurprisingly, it showed that whilst memory & CPU were hardly being touched, Disk I/O was going at full pelt as it read from the lineorder file and pushed the data into the tempdb. Unfortunately, during the upload of the LineOrder table I also ran out of disk space for tempdb to use. Lesson Learned: Use a bigger HDD. So, I created a new 25GB EBS Volume for tempdb & another of 50GB for the database, both of which I then attached to the instance before restarting it. With the instance now showing the new drives I remapped the tempdb data and log files to the new drive, and detached the StarSchemaBenchmark database so that I could move its files to the new drive & reattach them there:

ALTER DATABASE tempdb MODIFY FILE (NAME = 'tempdev', FILENAME = 'D:\SqlData\tempdb.mdf')

ALTER DATABASE tempdb MODIFY FILE (NAME = 'templog', FILENAME = 'D:\SqlData\tempdb.ldf')

Now I could run the import of the LineOrder table again.

This time it was successful. So I ran the standard queries as a batch, which, in case anyone cares, took 2 minutes 05 seconds to complete. A bit of index tuning later and I had that down to 21 seconds. But don’t pay attention to the results, the point of this article (& the series it’s a part of) is to help people in using the Star Schema Benchmark & its dbgen tool. From here I could, and will, create cubes out of the same data, compare different approaches to writing sql to see where they might carry performance benefits, investigate indexing, all sorts really. I’ll probably also be setting up a server I bought off ebay a while back (amazing what £100 will get you in terms of hardware) and putting a 500 million row dataset on it to see how that works out, which will also let me play around more with the effect of moving things around different drives, partitioning, etc… not to mention running the Enterprise edition. (And of course, the whole, much larger world that isn’t MS SQL Server, prob. starting with Greenplum CE, Postgres, & MonetDB.)
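For anyone writing their own load script rather than using mine, a hedged sketch of the general approach – generating one T-SQL BULK INSERT statement per pipe-separated dbgen output file (the table list, folder, and file names here are assumptions for illustration, not my actual script):

```python
# Sketch: emit one T-SQL BULK INSERT per dbgen output file.
# dbgen writes pipe-separated files; the table names and data folder
# below are assumptions for illustration.
TABLES = ["customer", "part", "supplier", "date", "lineorder"]

def bulk_insert_sql(table: str, data_dir: str = "C:\\ssb-data") -> str:
    return (
        f"BULK INSERT dbo.{table}\n"
        f"FROM '{data_dir}\\{table}.tbl'\n"
        "WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\\n', TABLOCK);"
    )

# One batch per table, separated by GO.
script = "\nGO\n".join(bulk_insert_sql(t) for t in TABLES)
```

TABLOCK is worth having for a load like this, as it allows minimally logged bulk loading under the simple or bulk-logged recovery models.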

The scripts I used for this are all available from my GitHub account here.

Nick Haslam has blogged about working with the TPC-H standard & SQL Server here.


It seems self-evident that to be relevant a benchmark must speak to a particular class of problems. For example, it should be obvious that a benchmark designed to measure the performance of transaction processing will not be a good fit when assessing the appropriateness of systems intended solely for use in reporting. The specificity argument can, though, of course lead to ever more specific benchmarks. In his introduction to ‘The Benchmark Handbook‘, a volume he edited on benchmarking databases, Jim Gray presents 4 criteria which a domain-specific benchmark must meet if it is to be useful:

  • Relevant;
  • Portable;
  • Scaleable;
  • Simple.
Patrick O’Neil, in his paper The Set Query Benchmark published in the same volume, gives 4 characteristics of his Set Query benchmark. Whilst two of these are held in common with the 4 criteria that Jim Gray proposes, namely Portability and Scalability (and his benchmark meets the other 2), he also elects to include Functional Coverage and Selectivity Coverage. These are, perhaps, a little less self-explanatory than the criteria that Gray proposes. Selectivity Coverage refers to the extent to which a benchmark covers the potential spectrum of selectivity, from one row being returned by a query (most selective), to all of the rows being returned by a query (no selectivity). Functional Coverage refers to the extent to which the benchmark covers the range of queries commonly run in commercial settings. For both of these criteria, O’Neil points out that users of his benchmark can examine the subset of measurements relevant to their own needs (so, in an interstitial space existing at the intersection of functionality & selectivity, as well as hardware/infrastructure & scale).

The Star Schema Benchmark is described in a 2007 paper of which the lead author, Pat O’Neil, is also the author of The Set Query Benchmark discussed above. It describes a domain-specific benchmark that has been specifically designed to enable the comparison of star schema performance across different products. The benchmark itself is a derivative of the TPC-H standard, but where the structure of the database has been transformed into a star schema, also dropping columns, for instance, text columns from the fact table, and in other ways described at length in the paper reworking the database so that it aligns with the advice and practices considered optimal by Kimball. Following The Set Query Benchmark paper, this paper also contains a consideration of the functional and selectivity coverage aimed for by this benchmark. In terms of the functional coverage, the benchmark provides a relatively small number of queries, each exploring different numbers of predicates on dimension, and fact, columns. From this point of view a number of common star schema query scenarios are not provided, or possible, including, for example, where degenerate dimensions, junk dimensions, factless fact tables, fact dimensions, or joins from the fact table to non-leaf level dimension attributes are present. Selectivity coverage is provided for by varying across the queries the number of rows from the fact table which must be fetched in order to provide the results. The queries are split into four ‘flights’, where Flights 1 – 4 each involve restrictions on the corresponding number of dimensions (so Flight 1 has a restriction on one dimension, while Flight 4 has restrictions on 4 dimensions).

Another thing which the Star Schema Benchmark brings to the table, and that I have found useful, is a tool which will generate synthetic datasets with consistent cardinalities between the fact table & dimension tables, and distributions within these tables. In the course of my work, whether in selecting a database platform, or in proving the effectiveness of different approaches to a query design, having access to a suitably modeled & distributed dataset is very helpful. An option which I have used previously & seen being used is the use, sanitised of course, of a ‘real world’ (aka client) dataset. Beyond the sanitisation required, such an approach also brings a number of other complicating factors. DeWitt in his paper, The Wisconsin Benchmark: Past, Present, and Future, explains why in the creation of this benchmark he & his team opted for synthetic rather than empirical (so pre-existing real world) data. The arguments he puts forward are that:

  • Empirical databases are hard to scale;
  • The values in empirical databases make it more difficult to systematically benchmark a system. E.g. creating queries that allow for precise levels of selectivity;
  • Empirical databases don’t tend to have uniformly distributed values;
  • Through the use of a synthetic database the simplicity of the structure and distributions of attribute values could be ensured enabling those using the benchmark to quickly understand the database and to design new queries for it.
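The second and third of DeWitt’s points are easy to see in miniature: with a synthetic, uniformly distributed column you can construct a predicate with any selectivity you like by construction, something much harder to arrange with empirical data. A toy sketch:

```python
# With a uniform synthetic column of values 1..N, the predicate
# "value <= k" matches exactly k rows: selectivity is k/N by construction.
N = 10_000
column = list(range(1, N + 1))  # perfectly uniform, as Wisconsin-style generators arrange

def rows_matching(k: int) -> int:
    """Number of rows the predicate 'value <= k' would return."""
    return sum(1 for v in column if v <= k)
```

So a query with exactly 1% selectivity is just rows_matching(100) out of 10,000 – precision an empirical dataset, with its skews and clusters, rarely allows.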

Whilst, if these difficulties are overcome through the investment of effort, such a dataset may result in a benchmark which concerns a more focused, and therefore perhaps more relevant, domain, the cost of this effort should not be underestimated (though in IT of course, it probably will be). Also, depending on the questions which must be answered, the usefulness of such a specific dataset may not be any greater. In Doing Your Own Benchmark (again part of the same Jim Gray volume) Sawyer suggests that before, & whilst, undertaking a benchmarking exercise (or adventure as he puts it) it is necessary to ask three interacting questions:

  1. What do you want to learn?
  2. How much are you prepared to invest?
  3. What are you prepared to give up?

A lot of the time I suspect, when faced with the answer to 2. and given the alternative of generating an otherwise fit for purpose synthetic dataset, the answer to 3. will include giving up an empirical dataset, along perhaps with some of the more specific questions it might have additionally answered, but which probably are not of core importance.

The dbgen tool which was created for The Star Schema Benchmark is derived from the TPC-H dbgen tool and enables its user to generate files containing pipe separated records which match the tables required by this benchmark. Furthermore, in using this tool a scale-factor value is provided which acts as a multiplier on the number of rows which will be generated. So that with a scale-factor of 1 nearly 6,000,000 fact table rows are generated, with various smaller numbers of rows for each of the dimension tables depending on their cardinality relative to the fact table, but given a scale factor of 10 the figure is 60 million, with the rows for the dimension tables being scaled appropriately (so for dates not at all).
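The scaling behaviour described above can be captured as a back-of-envelope calculation. A small sketch, treating the fact table as scaling at roughly 6,000,000 rows per unit of scale factor and the date dimension as fixed (it covers a fixed span of dates); the exact per-table scaling rules are in the benchmark paper:

```python
# Approximate row counts from the SSB dbgen tool at a given scale factor.
# lineorder scales roughly linearly (just under 6,000,000 rows per unit of
# scale factor - a scale factor of 10 actually gave me 59,986,214 rows);
# the date dimension covers a fixed date range and does not scale.
LINEORDER_ROWS_PER_SF = 6_000_000  # approximate

def approx_lineorder_rows(scale_factor: int) -> int:
    return LINEORDER_ROWS_PER_SF * scale_factor
```

Useful when sizing disks before a load – which, as the tempdb episode in the companion post shows, is worth doing up front.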

I’ve briefly blogged about the getting started using the dbgen tool with SQL Server here.

The Benchmark Handbook can be found freely available online here, courtesy of Microsoft Research, together with links to pdfs for each of the chapters from it that I have cited here.

There are two Star Schema Benchmark papers that I have been able to access & use. The 2007 paper here, and the 2009 revision (the 3rd revision apparently) of this paper here.

There is a version of the dbgen tool available here on Github.


The paper can be found here: http://research.microsoft.com/apps/pubs/default.aspx?id=64551

I’ve read a number of Jim Gray’s papers before & always found them to be incredibly insightful & useful. Typically I’ve also been almost shocked that they could have been written so long ago, and yet be so relevant now, for example ‘The Transaction Concept: Virtues and Limitations‘ published back in 1981, which I’ll try & get something up here about next time I re-read it. So coming to this paper, published just recently, in 2004, I was very much looking forward to seeing what I might get from it.

The abstract sets out the scene as Jim Gray saw it in 2004, some of which remains very reasonable as a statement of current affairs, for example, the movement of intelligence to the periphery of the network, the integration of queues into relational database platforms, the expectations we have that RDBMSs are highly available and low cost to maintain, the rise of column oriented storage (vNext of SQL Server for example has an engine which provides a column oriented store, not to mention the various BigTable like stores out there now). In fact the only prediction/statement of his with which I can draw serious contention, having the obvious advantage of hindsight, is that ‘XML and xQuery will be the main data structure and access pattern’. I can’t express just how glad I am that this isn’t the case, but that’s for another post (or more typically a boozy twrant).

For me though, the abstract is where Gray stops being right. The revolution which Gray describes is, I think, perhaps only a revolution for the manufacturers of the RDBMS, and very little of one for its users; it is perhaps a revolution caused not by being in touch with their market, but by losing contact with it. And yet taken a little differently, his analysis seems, perhaps unsurprisingly, almost spot-on.

These are not the revolutions you’re looking for

Firstly let me explain why I feel that Gray’s revolution is not in fact the revolution at all. Gray’s revolution is:

  1. the ability to execute ‘OO’ code in the RDBMS (so Java in Oracle, & C# in SQL Server);
  2. relational databases presenting services that are accessible from the web;
  3. the inclusion of queuing systems inside the RDBMS platform;
  4. the arrival of cubes as a way of managing & modeling aggregations;
  5. the arrival of data mining;
  6. the ‘rebirth’ of column stores;
  7. that RDBMSs now deal with ‘messy’ data better, such as text, temporal, and spatial data;
  8. RDBMs working with semi-structured data, in particular Gray points to the integration of the RDBMS and the filesystem (so, I imagine that he’s thinking here of something like SQL Server’s Filestream attribute & the capability that this enables). He also mentions XML.
  9. The requirement for Stream Processing – as he puts it in the abstract, that now the data finds millions of queries, rather than the queries acting over millions of rows of data.
  10. A movement towards pub-sub styles of replication;
  11. A need for query plans which take into account changing load on the system, skews in the distribution of the data, and changing statistics about the data;
  12. The substantial changes in terms of the size of available storage, both on disk & in memory, and the consequential shift in the latency of reads from each of these. The relative increase in the cost of random access reads over sequential reads.
  13. The possibility to move the RDBMS platform down to the disk, so that instead of a disk being organised around files, it becomes organised as a relational database.
  14. ‘Self-managing and always up’ is how he puts it, and we are all familiar with that idea now surely.

In all of these areas, in terms of the change in capabilities, I think Gray is clearly correct. However, looking at that list, I could split it into: those areas where, although the capability has been added, it has had little relative uptake or impact; those areas where technologies other than the relational database are dominating, or look set to; and those areas where the capability has been added and has had uptake, but has not had a revolutionary impact.

The ability to execute imperative (or OO if you like) code inside of the database has had little uptake, and in general has been rejected by both the DBA community and the programmer communities. Similarly, the inclusion of message queues inside the RDBMS has had little uptake or impact, with most organisations which choose to adopt message queuing approaches preferring technologies which exist outside of their RDBMS (such as WebSphereMQ, Tibco, ActiveMQ, RabbitMQ, etc…). I’m not going to go into why I think this has happened here, and I don’t have any figures to back this up – but I’ve certainly not witnessed any uptake of these things, nor widespread discussion of & interest in them. I’m not suggesting that these capabilities are never used, nor that they are never appropriate nor useful, just that they haven’t had anything like a revolutionary impact – unless you’re a company/engineer concerned with making RDBMSs. This said, I’ve certainly heard of platforms like Redis being used as high performance queues, and I’ve used CouchDB myself as a distributed, durable queue. So I think Jim Gray is correct that there is a union between database platforms and queuing technologies that is happening to an extent (though I’m not sure it constitutes a revolution), but despite these capabilities being bolted onto RDBMSs, they’re not where it’s happening.

Similarly, though relational database platforms have now enabled the exposing of data as ‘services’ through the web for a number of years, I have not seen this become a popular thing. Risk averse organisations are typically unwilling to risk exposing their data platforms in these ways (perhaps because they have been trained to see the DB as the king in a chess game of security), and the capabilities which these platforms expose (& I’m thinking specifically of SQL Server Astoria here) have seen little adoption. I could suggest that a lot of webservices which I’ve seen do very little except directly expose CRUD-like operations onto databases, but these remain an external wrapper around the database. If the relational database platform has been unsuccessful in promoting this approach then perhaps the NoSQL community has been more successful. Products like CouchDB take exposing the database as a web service to their core, and other products such as Neo4J, Riak, and HBase have followed in providing various interpretations of RESTful webservices. Perhaps this could be because the kind of organisation which might adopt NoSQL stores assesses & manages risk in different ways to many which will not adopt them. So in one sense here Jim Gray is absolutely correct, it’s just that this hasn’t really happened for the RDBMS.

Perhaps paralleling this, Complex Event Processing definitely seems to have gained momentum over the last few years, with platforms & products growing in capability & number which enable the concurrent, near-real-time processing of vast streams of events. Again however the RDBMS platform has failed to achieve much traction in this space, regardless of the capabilities which have been shoe-horned on to these products. Even the languages being used in many of the CEP products are clearly inspired by SQL, but this is where the closeness between the platforms perhaps ends. Jim Gray is clearly correct about there being a revolution, perhaps still on the way, which will place stream processing far more centrally in what IT can enable, but it has failed, at least to date, to be reflected in the usage, or demands, of the RDBMS.

Cubes and Data mining are an area where clearly the RDBMS market has done very well, and can be viewed as an area where, in terms of usage & demand a revolution might be seen to have occurred in the market, rather than purely inside the vendors. It is also an area where the RDBMS vendors face a lot of competition as a plethora of other approaches and platforms have exploded on to the market competing in this space, such as the Map-Reduce implementations found in products like Hadoop and Greenplum.

When it comes to the changes in the hardware-scape onto which RDBMS vendors must prepare their products for deployment the revolution that Gray describes has most definitely occurred & is still occurring in every sense, and with the growth also of flash memory & SSDs perhaps more so than he anticipated. The challenges which he describes the engineers of RDBMS platforms as facing seem as relevant as when he wrote this piece, if not more so.

In conclusion then, I think that if the revolution is viewed as a revolution in the skills & projects of the teams working on Oracle or SQL Server then maybe Jim Gray is right in practically every aspect. If it is viewed as a list of capabilities that various different DBMSs may become popular for providing, then again, he is broadly correct. If though it is to be viewed as a revolution in what the market will demand and use in an all singing, all dancing RDBMS, then I think not only is Jim Gray wrong, but that this thinking, which has clearly imbued places like Oracle and SQL Server over the last few years (decades?), is perhaps why these RDBMSs have become the hulking great behemoths that they now are. So overladen with features and capabilities are they, that both they & the organisations that place them at their core may find it difficult to maintain a level of organisational agility which, at least through technology, might allow them to achieve a competitive advantage. In particular the paper sets me thinking (& perhaps I’m echoing them already) about the papers by Michael Stonebraker on ‘The End of an Architectural Era’.


I’ve spent a lot of the last month or so learning Erlang in my free time. It’s been a very rewarding experience.

I wanted to look at Erlang because for a number of years now I’ve been coming across very impressive systems (amongst others RabbitMQ, CouchDB & Riak) written using this language and its OTP framework. A number of the core ideas in Erlang also aligned very well with a number of my recent interests in IT, including distributed systems design, building reliable systems, functional approaches to programming, messaging, and building systems that can support working with ‘big data’. Additionally I’ve noticed over the past few years the growing influence of the ideas that Erlang has at its core, in particular the Actor approach to concurrency (which now has a firm place in the Scala/Java world thanks to the Akka framework).

So, enough background as to why I have found myself ever more enamoured with Erlang; what have I learned which may help you to get coding in Erlang?

Well, for starters, code is organised into modules; a module has a name (which you give it), and the file it’s saved in shares this name, with the extension ‘.erl’.

So your first move might be to create a file ‘hello_erlang.erl’. In that add the code:

-module(hello_erlang).

Note two things about this code straight away:

  1. The text ‘hello_erlang’ is not in quotes, this is because it’s not a string. Erlang has the concept of an atom. More on these later, but be aware that this is an atom;
  2. The full stop at the end of the line. Erlang writes a lot like English does, so as you finish a sentence with a full-stop in English, so you finish a statement in Erlang with a full-stop.

Straight away you have something which, once saved, can be compiled. So let’s do that. Save it, and then start the erlang console.

Couple of things at this point.

  1. I’m going to assume that you have erlang installed. If you don’t then I recommend the instructions I found on the Basho site, they’ve got me going on both Ubuntu & MacOS very nicely. You can find them here: http://wiki.basho.com/Installing-Erlang.html (brew FTW on the MacOS). If using Windows I’d recommend running up a VM to use and installing something like Ubuntu or CentOS on it, but JIC you don’t want to do that you can find it here http://www.erlang.org/download.html.
  2. I’m assuming that, like me, you’re doing your editing in emacs & happy at a command prompt. If neither of these things are true then I’d strongly recommend that you make them true. Getting the basics of emacs only takes about an hour, but Erlang has some integration with emacs out of the box, has other projects which offer further integration, and it’ll be worth the initial WTFness of it I think.

Ok, so you’ve got erlang installed now (and hopefully emacs too). Now navigate (in the command prompt/bash) to the folder where you saved hello_erlang.erl and start erlang (type erl & hit enter). What this will give you is the erlang REPL console (Read-Evaluate-Print Loop), in many ways like the ones you may have come across if using languages like Ruby or F#. From here you can straight away try some things & see what happens, like:

  1. Fred = "Fred".
    • note the full stop to complete the statement
  2. Fred == "Fred".
    • should evaluate to true
  3. Fred = "Bert".
    • should complain that ‘no match of right hand side value "Bert"’. This is because once a variable has been assigned a value, it cannot be assigned another value. Like you’ll find in F#, variables are immutable.
  4. fred == Fred.
    • will give false, you just compared an atom to a variable that holds the value of a string
  5. Bob = bob.
    • assigns the atom bob to the variable Bob
  6. Bob == bob.
    • will evaluate to true as the variable Bob holds the atom bob, which is clearly the same as the atom bob
  7. true == false.
    • will evaluate to false because the atom true is not equal to the atom false.
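The console session above can also be captured in a tiny module. Here’s a sketch (the module name match_demo is mine, not from the console session) that uses the match operator itself to assert each result, which only succeeds because every comparison evaluates as described above:

```erlang
-module(match_demo).
-export([demo/0]).

%% Demonstrates single assignment and atom vs string comparison.
demo() ->
    Fred = "Fred",            % binds the variable Fred to a string
    true = (Fred == "Fred"),  % match succeeds: the comparison is true
    false = (fred == Fred),   % the atom fred is not the string "Fred"
    Bob = bob,                % binds the variable Bob to the atom bob
    true = (Bob == bob),      % the atom bob equals itself
    ok.
```

Compile it with c(match_demo). and then match_demo:demo(). returns the atom ok; had any line not matched, it would have thrown a badmatch error, just like Fred = "Bert". did in the console.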

More of the basic types in Erlang in a future post, but worth noting here that:

  • Variables always start with a capital letter;
  • An atom always starts with a lower case letter;
  • Variables can point to both atoms & strings (& other things, but we haven’t seen them yet).

I tend to think of atoms as kind of like powerful enums, but this comparison isn’t really fair. They’re so much more than that. Every function name, module name, true, false, etc… they’re all atoms. You don’t have to declare them up front, at the point where they are used, they exist, and they continue to exist from then forwards. (there are some gotchas around this, it’s not quite that simple, but we’ll worry about that in the future)

Anyhow, back to that module. Still in the erlang console write

c(hello_erlang).

This will compile your module and you should get this response:

{ok,hello_erlang}

Which gives us a good moment to introduce another important type in Erlang, the tuple. .NET had a tuple type introduced (IIRC) in version 4 of the framework, which allows you to write things like

var myTuple = new Tuple<string, int>("Fred", 42);

or even,

var myTuple = Tuple.Create("Fred", 42);

Well, erlang lets you simply write:

{"Fred",42}

for exactly the same effect. Tuples are an incredibly important thing in erlang & we’ll see a lot more of them, but for now I think it’s enough just to be aware that the response you just got by compiling your module was a tuple consisting of two atoms, and it tells you that the module was compiled ‘ok’. Which is nice.
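As a small taste of why tuples matter so much, here’s a sketch (module & function names are mine) showing how pattern matching pulls a tuple apart positionally, with no accessor methods needed:

```erlang
-module(tuple_demo).
-export([describe/1]).

%% The argument pattern {Name, Age} binds Name to the first element
%% of the tuple and Age to the second.
describe({Name, Age}) ->
    Name ++ " is " ++ integer_to_list(Age).
```

So tuple_demo:describe({"Fred", 42}). evaluates to "Fred is 42".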

So anyway, lets add some code to that module, and like all good devs (cough, cough) let’s start with a test.

Erlang ships with a JUnit like (so, NUnit like for those of us of a .NET persuasion) testing framework called EUnit (strange that isn’t it). To use EUnit we’re going to add some code that’s the equivalent of a using/import statement and then we’ll be able to write our test.

With the hello_erlang.erl file back open again in the editor of your choice (which I trust is emacs – though did I mention that there is Eclipse integration available for Erlang too? No? Well, move along, nothing to see here,…) add this line of code below the -module(hello_erlang). line.

-include_lib("eunit/include/eunit.hrl").

With that in place we can now write our first erlang function, here it is:

greeting_should_be_hello_erlang_test() ->
    ?assertEqual("Hello Erlang!", greeting()).

Few things to note here:

  • That the function name is an atom, and that the empty brackets signify that it expects no arguments to be supplied when it’s called – basically it looks a lot like a C# or Java method signature, but without the superfluous crap like access modifiers, or having to declare a return type.
  • That appending _test to the test name is mandatory. It’s how the test library will be able to recognise this function as one to execute as a test. There are no attributes/annotations available in Erlang, so functions can’t have metadata attached to them; this means that conventions such as this are necessary. Java people, you may remember this sort of thing from working with JUnit in pre-1.5 versions of Java. If I’m honest this is a bit disappointing to me, but really and truly I don’t think it matters much.
  • The -> bit is basically identical to => in C#. It denotes that a function definition is about to be provided. Java people, apologies, but if you’re going to use a dead language then this may look unfamiliar, get on the scala train & then come back :p
  • the ? tells us that assertEqual is a macro. More on them later, a lot later.
  • We expect that the function greeting() will be equal to the string “Hello Erlang!”. Note, I didn’t say that the result of the function should equal the string “Hello Erlang!” – especially as the greeting function takes no arguments, we should expect referential transparency!
  • No curly brace crap; indentation is by convention rather than significant, with punctuation doing the work instead.
  • We’ve embedded the test in the same file as the code we’re writing. We don’t have to do this, but I find it works quite well when doing TDD. Compiler directives can be used to ensure that the tests don’t make their way into the released versions of the compiled code.
So, now we need to create our greeting function (yes, I know I’m skipping the red in red, green, refactor). This is really trivial:

greeting() ->
    "Hello Erlang!".

With that added to your file (personally I’d put it between the module declaration & the include_lib statement), go back to the erlang console, recompile your code and then run your tests by writing:

hello_erlang:test().

This should result in a message that tells you your tests passed (all one of them). That colon is very similar to how we use a full stop to separate the name of a class from the name of one of its methods when coding in C# or Java. So where we would have written hello_erlang.test in one of those languages, here we use a colon. Simple.
Try this though to execute the greeting function:

hello_erlang:greeting().

You should get this message:

** exception error: undefined function hello_erlang:greeting/0

It fails because we haven’t exported the function, or in terms of Java/C# made it public. By default all functions are not exported (available outside the module that they’re declared in). The test function becomes available because including eunit.hrl automatically exports functions ending in _test, so we are not required to export that ourselves.

To export the function add this line of code after the module declaration statement:

-export([greeting/0]).
Couple of things to note here:
  • The square brackets signify a list [] (definitely more to be said about lists, but not in this post).
  • The /0 following on from greeting signifies that this is a function with 0 arguments expected (its arity). You saw it earlier in the error message that erlang gave when we tried to execute the function before we had exported it.
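Arity matters: two functions with the same name but different arities are entirely separate functions, and each must be exported separately. A sketch (arity_demo and greeting/1 are hypothetical names of mine, not part of the hello_erlang module):

```erlang
-module(arity_demo).
-export([greeting/0, greeting/1]).

%% greeting/0 and greeting/1 share a name but are distinct functions.
greeting() ->
    "Hello Erlang!".

greeting(Name) ->
    "Hello " ++ Name ++ "!".
```

So arity_demo:greeting(). gives "Hello Erlang!" while arity_demo:greeting("Fred"). gives "Hello Fred!".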
And for now, that’s it. I hope from this, if you’ve never seen erlang before, you’ve been able to get going. But you haven’t really seen anything of the beauty & power of the language, the OTP framework, or the other libraries & tools available for it (such as the incomparable WebMachine). In future posts I’ll try & cover these too.
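For reference, here is the whole of hello_erlang.erl as built up over this post, pulled together in one listing (laid out with the include near the top; other layouts work too):

```erlang
-module(hello_erlang).
-export([greeting/0]).

-include_lib("eunit/include/eunit.hrl").

%% The function under test: takes no arguments, always returns the
%% same string (referential transparency).
greeting() ->
    "Hello Erlang!".

%% The _test suffix is how EUnit recognises this as a test.
greeting_should_be_hello_erlang_test() ->
    ?assertEqual("Hello Erlang!", greeting()).
```

From the console, c(hello_erlang). compiles it and hello_erlang:test(). runs the test.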