Nuno Job ❦

Geek. Open-source enthusiast. Shaping the future of the node.js ☁ @nodejitsu. Founder @thenodefirm& curator @lxjs
This is the old blog, check the new one at nunojob.com

Github ~ Twitter ~ Lanyrd ~ Resume

Why SQL Sucks for NoSQL Unstructured Databases

As some of my readers know I have now worked in two document databases: IBM pureXML, a native XML database built on top of a relational engine (pun intended) that offers both relational (SQL/XML) and unstructured (XQuery) query languages, and MarkLogic, a database built from scratch on a new database paradigm (call it NoSQL if you like) that understands unstructured data and offers an unstructured query language (XQuery).

Another relevant tidbit of information is this emerging trend amongst NoSQL database vendors to implement SQL (or SQL-like interfaces). An example would be the recent push on Cassandra with CQL, or even the more mature Hadoop based SQL interfaces. I see this as NoSQL trying to grow enterprise which overall is a good thing.

I'm not going to argue on whether these NoSQL vendors are doing the right choice with SQL, or even to talk about the fact that enterprise is about more than just bolting on a SQL interface. I'm also not going to discuss why some data models lend themselves better to SQL than others, e.g. Cassandra vs. MongoDB (But if you want to discuss those topics just leave a comment).

In this post I'll focus on some lessons learned about mixing the worlds of relational and unstructured databases.

When the Two Worlds Collide

NoSQL is about No SQL. What this means to me is a shift of focus towards non-relational database alternatives that might even explore different interfaces to the database (and not caring about being politically correct). That is a good thing! Blindly accepting the suckyness of SQL for the sake of enterprise? Well, even if SQL is the right choice for your product, you still need to reason about the consequences and make sure things are well aligned between the two worlds. In other words, it means removing the "blindly" part and reducing the "suckyness" to a bearable minimum for your developers.

But be warned: things will get messy. SQL isn't pretty and it's about to collide with the awesome unstructured truck (Slightly biased)!

Calvin and Hobbes

Data Model

In relational you have:

  RowSet -> SQL -> RowSet

A RowSet is something like:

 RowSet -> Item+
 Item   -> INT | VARCHAR n | ...

I'm unaware of a data model for JSON so I'll talk about data model I'm fairly familiar with: the XPath Data Model:

 XDM -> XPath/XQuery -> XDM

And the XDM is something like:

 XDM        -> Item+
 Item       -> AtomicType | Tree
 AtomicType -> integer | string | ...
 ...

(Both these definitions are oversimplified but serve the purpose).

A thing that is different about a data model for document is that trees are not flat:

 {
   "namespace": "person-2.0",
   "comments": "This guy asked me for a dinosaur sticker. What a nutter!",
   "person": {
     "handle": "dscape",
     "comments": "Please do not send unsolicited mail."
   }
 }

So there's multiple interpretation to what this could mean:

 SELECT comments from PERSON where handle = "dscape"

What "comment" element is the query referring to? If you look at SQL/XML (which is a terrible, terrible thing) your query would be something like:

 SELECT XMLQuery('$person/comments')
 FROM PERSON
 WHERE XMLExists('$person/person/handle')

Which brings me to this obvious conclusion: Trees need a way to navigate. In XML that is XPath, in JSON maybe that will be JSONSelect, maybe something else. But you still need a standard way to navigate first.

Something that makes this challenge even more interesting is schema versioning and evolution. While this has been ignored for ages in relational world (with serious business implications due to downtime during those fun alter-table moments), it really, really, REALLY can't be ignored for documents. Think of Microsoft Word - how many different document versions do they support? Word 2003, 2005, etc..

Schema-less, Flexible, Unstructured: Pick your word but they all lend themselves to quick evolution of data formats. In this query we assume that handle is a child of person, and that the comments about me being an idiot are a direct descendent of the tree. This is bound to change. And SQL doesn't support versioning of documents, thus you will have to extend it so it does.

A true query language for unstructured data must be version aware. In XQuery we can express this query as something like:

 declare namespace p = "person-2.0" ;

 for $person in collection('person')
 let $comments-on-person := $person/p:comments
 where $person/p:handle = "dscape"
 return $comments-on-person

Frankenqueries by Example

Someone once referred to me (talking about SQL/XML) as those Frankenqueries. The term stuck to my head up until now. Let's explore that analogy a little further and look for the places where the organic parts and bolts come together.

Let's imagine two shopping lists, one for Joe and one for Mary

  marys-shopping.json
  { "fruit": {
    "apples": 2
  }, "apples": 5 }


  joes-shopping.json
  { "fruit": {
    "apples": 6,
    "oranges": 1
  } }

Now with my "make believe" SQL/JSON-ish extension I do:

  SELECT apples
  FROM LISTS

What does this return? Remember RowSet goes in, RowSet comes out?

  2, 5
  ---
  6

So, even though you are clearly asking for a list of quantities of apples, you get two RowSets instead of three, and one of the RowSets will have a list of quantities. If you instead decide to return three things, you had two RowSets come in and three RowSets come out. I'm no mathematician but that doesn't sound good.

Once again this is not a problem if you use something that can deal with unstructured information. You don't have this problem in javascript and certainly won't have it in XQuery. In both javascript and XQuery it's all organic. (or bolts if you prefer)

Conclusion: The awesome languages for unstructured data, unicorns and pixie-dust!

While XQuery is a great language for unstructured information my point here is not advocating for it's use. The point I'm trying to make is the need for a real language for unstructured data, whatever you (read: the developers) choose it to be.

But I do ask you (developers) not to accept the "suckyness of SQL" back. She's gone and you have this new hot date called NoSQL. Just give it some time and it will grow on you. Plus it's lots of fun writing javascript code that runs on databases: Don't let them take that away from you.

SQL for unstructured data will fail. Then PL-SQL for unstructured data will fail. So if a vendor is pushing something your direction don't settle for anything less than a full fledged programming language: you can write your full app in javascript and store it in a CouchApp, or you can write your full app in XQuery and store it in MarkLogic. And it should remain like that!

Here's a checklist of things to look for on a query language for unstructured information (feel free to suggest some more):

Navigation Language
Data Model
Regular expressions
Lambdas
High order functions
Functional flavor
Good string handling
Modules so you can build your own libraries
App Server aware: has functions that serve REST

You can choose to ignore this advice but you might end up feeling like a frustrated silverlight developer. And we, the guys that love to innovate in databases, will feel frustrated that the developers have chosen to accept the suckyness back!

See you at Open Source Bridge

If you want to talk more about this topic I would like to invite you to join me, J Chris Anderson (CouchDB) and Roger Bodamer (MongoDB) at Open Source Bridge in Portland this month. We will be hosting a panel-ish un-talk about data modeling in a session called No More Joins. So go on register and we will see you there!