Introduction

RAW is a query engine that lets users query their data in a SQL-like language without any prior processing. Several defining features distinguish RAW from other query engines:

  • RAW has a rich data model that goes well beyond the scope of typical query engines. For instance, RAW copes with hierarchical data more comprehensively than other systems. It also supports multidimensional arrays and map types, data structures that databases do not usually support. In practice, this means RAW can “talk to your data” directly, without needing to transform it into “tabular” or “table-like” structures.

  • RAW provides a rich, SQL-like, high-level declarative language. This means users with SQL experience can get started with RAW quickly. In addition, it is usually possible to do most ETL tasks directly in the RAW language, without falling back on Python scripts, Java/Scala code, or external tools to handle complex tasks. In fact, in some scenarios, RAW scripts have been sufficient to handle all required data processing steps, from extraction, through transformation, to producing business reports.

  • RAW handles performance-related data administration tasks autonomously. In other systems, users must decide which indexes to create, which data to replicate, and in which format (Parquet, Avro, etc.). In RAW, the system makes these decisions autonomously, based on usage patterns and needs.
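The first point above, working on hierarchical data directly instead of flattening it into table-like structures, can be illustrated with a small sketch. This is plain Python, not RAW syntax, and the sample orders, field names and helper functions are all hypothetical. It only contrasts traversing nested records as they are with first reshaping them into rows.

```python
import json

# Hypothetical sample data: each order nests a list of line items.
SAMPLE = """
[
  {"id": 1, "customer": "alice",
   "items": [{"sku": "a", "price": 10.0}, {"sku": "b", "price": 5.5}]},
  {"id": 2, "customer": "bob", "items": [{"sku": "a", "price": 10.0}]},
  {"id": 3, "customer": "carol", "items": []}
]
"""


def order_totals(orders):
    """Total per order, computed by traversing the nested structure directly."""
    return {o["id"]: sum(i["price"] for i in o["items"]) for o in orders}


def flatten(orders):
    """Table-like alternative: one row per (order, item) pair.

    Orders without items produce no rows, so they silently disappear
    unless the flattening step handles them specially.
    """
    return [
        {"order_id": o["id"], "customer": o["customer"],
         "sku": i["sku"], "price": i["price"]}
        for o in orders
        for i in o["items"]
    ]


orders = json.loads(SAMPLE)
totals = order_totals(orders)  # keeps order 3, with a total of 0
rows = flatten(orders)         # order 3 has no rows at all
```

The flattened form needs an extra transformation step and loses the empty order, which is the kind of reshaping work a hierarchical data model is meant to avoid.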

RAW’s main features are:

  • Queries data in-place, without requiring data loading or schema creation;

  • Supports complex structured data, including hierarchical data and multidimensional arrays;

  • Provides multiple extensions to SQL to support additional operations ranging from data cleaning to log parsing;

  • Supports multiple input locations, including HDFS, HTTP, Amazon S3, Dropbox and relational database systems;

  • Supports multiple input formats, including CSV, JSON, HJSON, XML, Microsoft Excel and log files;

  • Supports multiple output formats, including JSON, HJSON and Parquet, among others;

  • Autonomously caches and optimizes data and queries, based on usage patterns and without DBA intervention.

Target Use Cases

RAW is designed for analytical processing and not for online transactional processing.

Moreover, RAW is designed primarily to analyze data at source: directly from a database, file system or data lake. In these scenarios, the “source systems” continue to be the primary repositories of data; RAW fetches data from these systems as needed and creates transient caches. It is expected that the “source systems” continue to be available and host the data.
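The relationship between a source system and a transient cache can be sketched as follows. This is an illustrative Python model only, not RAW's implementation: the `TransientCache` class, the time-based expiry policy and the `fetch` callback are assumptions made for the example. The source text says only that caches are transient and that the source remains the primary repository.

```python
import time


class TransientCache:
    """Read-through cache: the source system stays the authority.

    Entries expire after `ttl_seconds` (an assumed policy) and the whole
    cache can be dropped at any time without losing data, because every
    value can be fetched again from the source.
    """

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable that reads from the source system
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # fresh cached copy
        value = self._fetch(key)  # missing or expired: go back to the source
        self._entries[key] = (now, value)
        return value

    def clear(self):
        """Drop all cached copies; the source still holds the data."""
        self._entries.clear()
```

For example, `TransientCache(fetch=read_from_source)` with a hypothetical `read_from_source` function would call that function on the first `get` for a key, and again only after the entry expires or the cache is cleared.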

Although it is designed to analyze data at source, RAW has also been extended with basic support for creating tables and for inserting and deleting data directly in RAW. In these scenarios, RAW owns and hosts the data on a permanent basis. This is important for ephemeral data sources (e.g. streaming data), for using RAW as a long-term data archive, and for some performance-sensitive use cases.