RAW can be deployed in multiple configurations; for instance, its internal caching layer may use a local file system, HDFS, or S3. This guide assumes a deployment of RAW on a Spark cluster with HDFS as the caching layer.

The following diagram describes the main components of RAW (shown in red) along with the other services that RAW interacts with.


RAW consists of four main components:

  • “RAW Client”, for instance the Command-Line Client, or the Python API;

  • “RAW Executor”, which executes the RAW query;

  • “RAW Credentials”, which is responsible for holding access credentials to source systems;

  • “RAW Storage”, which is responsible for tracking cached data.
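To make the division of responsibilities concrete, the four components can be sketched as minimal Python classes. This is a hypothetical, in-process model only: all class, method, and path names below are invented for illustration and are not the actual RAW API; the real executor runs queries on a Spark cluster rather than in a single function call.

```python
# Hypothetical sketch of the four RAW components; names are illustrative.

class RawCredentials:
    """Holds access credentials to source systems."""
    def __init__(self):
        self._creds = {}

    def register(self, source, credential):
        self._creds[source] = credential

    def validate(self, source):
        return source in self._creds


class RawStorage:
    """Tracks bookkeeping metadata for data cached in HDFS."""
    def __init__(self):
        self._cache_index = {}

    def record_cache(self, query_id, hdfs_path):
        self._cache_index[query_id] = hdfs_path

    def lookup(self, query_id):
        return self._cache_index.get(query_id)


class RawExecutor:
    """Executes a RAW query, validating credentials in the process."""
    def __init__(self, credentials, storage):
        self._credentials = credentials
        self._storage = storage

    def execute(self, query_id, source, query):
        if not self._credentials.validate(source):
            raise PermissionError(f"no credentials registered for {source}")
        # Stands in for submitting the query to the Spark cluster.
        result = f"results of {query!r} against {source}"
        # Cached data would live in HDFS; only the metadata is tracked here.
        self._storage.record_cache(query_id, f"hdfs:///raw/cache/{query_id}")
        return result


class RawClient:
    """User entry point, e.g. the Command-Line Client or the Python API."""
    def __init__(self, executor, credentials):
        self._executor = executor
        self._credentials = credentials

    def query(self, query_id, source, credential, text):
        self._credentials.register(source, credential)
        return self._executor.execute(query_id, source, text)


# Example wiring of the components.
credentials = RawCredentials()
storage = RawStorage()
executor = RawExecutor(credentials, storage)
client = RawClient(executor, credentials)
result = client.query("q1", "s3://example-bucket", "secret-token", "SELECT 1")
```

Note that the client talks to both the credentials service and the executor, while the executor consults the credentials service and records cache metadata in storage, matching the component roles listed above.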

Life of a Query

A user submits a query to RAW using the RAW Client. The query then proceeds as follows:

  1. The “RAW Client” registers any required source credentials with the “RAW Credentials” service;

  2. The “RAW Client” submits the query to the “RAW Executor” via a REST API;

  3. The “RAW Executor” prepares the query for execution, validating credentials in the process;

  4. The query is submitted for execution to the Spark cluster;

  5. The query may revalidate credentials during execution;

  6. The query contacts external sources to retrieve data as needed;

  7. The query may cache data in HDFS; the bookkeeping metadata is kept in the “RAW Storage”;

  8. Results and/or logs may be collected back into the “RAW Executor”;

  9. Results and/or logs are sent back to the “RAW Client”.
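Step 2 above, the client-to-executor REST exchange, can be sketched with Python's standard library. The endpoint path (`/query`) and JSON payload shape are invented for illustration; RAW's actual REST interface is not specified here. The executor is stood in for by a stub HTTP server running in a background thread.

```python
# Hypothetical sketch of the RAW Client submitting a query to the
# RAW Executor over REST. Endpoint and payload shape are invented.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ExecutorStub(BaseHTTPRequestHandler):
    """Stands in for the RAW Executor's REST endpoint."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        # The real executor would prepare the query, validate credentials,
        # and submit it to the Spark cluster; here we just acknowledge it.
        reply = json.dumps({"query": body["query"], "status": "accepted"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass  # silence per-request logging


def submit_query(port, query):
    """Client side: POST the query text and return the executor's reply."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/query",
        data=json.dumps({"query": query}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Start the stub executor on an ephemeral port and submit one query.
server = HTTPServer(("127.0.0.1", 0), ExecutorStub)
threading.Thread(target=server.serve_forever, daemon=True).start()
reply = submit_query(server.server_address[1], "SELECT * FROM logs")
server.shutdown()
```

In a real deployment, subsequent steps (Spark execution, credential revalidation, HDFS caching, and result collection) happen behind this single submission call before results and logs flow back to the client.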