Comparison

RAW’s design allows it to address a variety of use cases ranging from traditional BI queries, to some types of data cleaning or ETL tasks, or to analytical queries. Many other systems address a subset of these use cases already, so it’s helpful to understand the differences in more detail.

Traditional BI

Traditional data warehouses, or their big data variants such as Presto or Impala, are specifically designed for traditional BI queries. These systems are characterised by strict ANSI SQL compatibility. They also follow a traditional database model where schemas are defined in advance. The result is a high-performance ANSI SQL compatible query engine, which can be easily plugged into a variety of other BI tools.

Schemaless BI

Systems like Drill go beyond traditional BI with the possibility to do “schema less” queries, i.e. the schema is only defined on read. This allows users to query data directly from files without data loading, but the common (and recommended) practice is to users to first “load the data” by converting it into a binary format (typically Parquet) to have acceptable performance. The performance often falls short of traditional BI engines but there is an added gain in flexibility and scalability to larger volumes of data.

These systems also typically allow for more complex structures than SQL’s regular tables: Drill for instance, provides some extensions for nested data structures that goes beyond what is available on SQL. The goal here is to accommodate JSON data structures. One consequence of this decision is that their SQL isn’t strictly ANSI SQL; this is required to accommodate extensions for nested data. One consequence is that system like Drill don’t provide as wider integration with other BI tools as ANSI SQL conforming systems.

Another example with a similar design space is SparkSQL. SparkSQL, however, differs from Drill in that it is tightly integrated with Spark. This allows users to “skim” data using SparkSQL before doing more advanced analytics directly in the Spark ecosystem.

RAW

At a high-level, RAW appears to the user as a schemaless BI system. There are, however, many important differences.

The first is related to the query language. Like schemaless BI systems, RAW isn’t strictly ANSI SQL compatible. This is required to accommodate a variety of language extensions for querying complex data structures. But unlike schemaless BI, RAW SQL extensions aren’t restricted to minor syntactic conveniences for “shallow nested data structures”. In fact, RAW implements a full functional language capable of expressing a variety of queries that are impossible to represent in other systems that have a relational algebra foundation. These language extensions, in turn, mean that RAW can perform various type of analytical queries directly in the query language where in systems like SparkSQL, one would fallback to Scala or Java code to perform them.

The second difference is related to the data model. The number of data types supported in either traditional or schemaless BIs is growing. However, given that these systems are grounded in relational algebra and relational operators, they remain limited in scope. The JSON extensions for “nested data” - which effectively exist as a separate set of “custom-made” functions - are a prime example in most systems. At the conceptual level, JSON and XML have many similarities in their data models, but in most systems users need to memorize different operations to work with each of these formats. In fact, in PostgreSQL, JSON is a “type”, which is somewhat of a misnomer: JSON is simply a way to format nested data. That’s because JSON and XML are “hardcoded” in most systems as extensions rather than being part of the underlying data model. In RAW, querying JSON and XML follows the exact same syntax, principles, features and capabilities, make it much easier to use in practice. Moreover, RAW supports multidimensional arrays or map types, which are “out-of-scope” for most other systems: which means that if your data contains multidimensional arrays or it is best modeled by a map (e.g. you are consuming data from a key-value store), you are out of luck.

Finally, there is performance. RAW can read data directly from files and external sources but this would be too slow in practice. Therefore RAW will autonomously decide when and how to cache/convert data into more efficient formats, and in more efficient storage. This is tightly integrated to the design and query language: for instance, a query in RAW can specify how “up to date” the data should be. Since RAW knows how data is being used and how fresh it must be, it can make a reasoned decision on how to optimize the data accesses and the cache creation. For instance, large tables may be best cached in Parquet; large maps may be best cached in a key-value store; and arrays in a MDA-format.

In fact, this advantage is tightly coupled with RAW’s rich data model. Since RAW understands many data models, it can decide to cache the intermediary data of a query plan: for instance, the temporary “map” created during the execution of a GROUP BY could be cached in a key-value store for later reuse.

Finally, RAW isn’t tightly bound to one particular execution engine. Drill, Presto, Impala are built for the Hadoop ecosystem. SparkSQL requires Spark. RAW can accommodate a variety of distributed infrastructures, or even operate on a single node. This architecture means that RAW can be used over Hadoop, or over Spark, or even provide an independent database frontend over classical distributed file systems.

Streaming vs Batch Queries

Although RAW seems to be primarily geared to batch queries, it supports a wide range of stream-like queries due to its caching and query analysis algorithms. For instance, given a directory of XML files, the user is interested in extracting the data that matches a certain condition. If new files are regularly added to the directory, RAW will ensure that its caches do not rescan previously processed files, effectively ensuring a stream-like behaviour in that only new files will be processed, and the query results merged with previously cached results.