Querying Complex Data

One of the flagship features of RAW is the ability to query complex data. By complex data we mean datasets that have complex structures, with JSON or XML data being the most common examples.

Traditional databases struggle to cope with complex data, and often require it to be "flattened" into tables and columns. This is cumbersome to do and often means some semantic information implicit in the structure is lost. More recent support - for JSON types for instance - helps somewhat, but it still falls short of providing enough query capabilities for complex data.

As we will shall see, RAW supports complex data with ease and with few SQL extensions.

Querying JSON documents

Let's start by a moderately simple JSON document.

The file at https://raw-tutorial.s3.amazonaws.com/trips.json contains the following data:

  {"reason": "Holidays",
   "origin": "GVA",
   "destination": "KEF",
   "dates": {"departure": "2016/02/27", "arrival": "2016/03/06"}},
  {"reason": "Work",
   "origin": "GVA",
   "destination": "SFO",
   "dates": {"departure": "2016/04/10", "arrival": "2016/04/17"}},
  {"reason": "Holidays",
   "origin": "GVA",
   "destination": "LCY",
   "dates": {"departure": "2016/05/22", "arrival": "2016/05/29"}}

The dates field contains a JSON object with two fields: departure and arrival.

These can be queried in RAW as in:

The . in dates.departure and dates.arrival is used to navigate inside the dates field.

Now let's query file at https://raw-tutorial.s3.amazonaws.com/sales.json, which contains the following data:

    {"country": "CH",
     "products": [
         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}]},
    {"country": "US",
     "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}]}

This is a list of two rows, in which the products field is a nested list of products. Each product has a category and a cost field.

Let's start by a simple RAW query:

Note that the data can be queried directly. The Jupyter client renders the nested output as shown above.

We can now shape it into a flat table-like shape, by "unnesting" the nested list of products:

We see sale.products appears in the FROM.

How does this work?

Let's take a step back and look at the FROM syntax in SQL.

In SQL, we can say SELECT row.column FROM table AS row. This means bind row to each row of table in turn; we can then refer to row.column in the projection.

When doing FROM READ("...") AS sale, sale.products AS product, if we focus on sale.products, we can think of it as if it were its own table: it is the table of products in each sale. In fact, if we refer back to the original data, we see that the first row of the file had products set to the "nested table":

         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}

... and the second row of the file had products set to the "nested table":

        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}

We can call these "nested tables" (or more precisely, "nested collections") in RAW.

Therefore, FROM READ("...") AS sale assigned each row of the file to the name sale. Then, for each sale, sale.products AS product assigns each row of the products nested table to the name product.

This operation is called "unnesting" in RAW and is commonly-used to process nested data.

Querying XML documents

Let's start by the following XML document available at https://raw-tutorial.s3.amazonaws.com/article.xml:

<?xml version="1.0" encoding="UTF-8"?>
  <title>This is an article.</title>
    <name title="Mr">John</name>
    <name title="Dr">Jane</name>

The first thing to notice is that this XML is not a collection: there is a top-level record article, which contains nested elements title, authors (a collection), and contents.

We can let RAW describe the structure using automatic schema inference:

The value of is_collection is set to False because the data is a record.

Therefore, if we would try to query this with SELECT, it will fail:

The error indicates the file is not a collection. We can still use SELECT if we use the AS keyword. But let's just run the READ without a SELECT.

It evaluates to a record.

This query starts to show the "true nature" of RAW.

RAW is a programming language that is "disguised" to look as SQL. In fact, SELECT is just a keyword used in RAW to query "collections of data". The example above, however, is a query over a non-collection, so we do not use SELECT.

In fact, other queries are possible in RAW, which do not produce collections.

The simplest example is perhaps:

The output of this query is a number: not a collection, not a table. Just the number 2.

Going back to the original XML, we can query the title or the contents by doing:

This query outputs a single string with the title.

We can do the same for authors, and this time we get a record with a field name.

The field name has two inner fields: @title and #text. These two are special field names generated by RAW. The @ is used a prefix for the XML attribute title, while #text is a special field that refers to the text of an XML element.

<name title="Mr">John</name>

Let's turn these author names into a table with a single RQL query.

The backticks are used to escape special characters in field names in RAW.

Next: Producing Complex Data