In a typical scenario, analyzing data in RAW is a two-step process:
First, we need to ensure that RAW can retrieve data from the source system. This requires adding any necessary access credentials to RAW. This step is only required once, and only for source systems that require credentials.
Once credentials are registered, you can write queries, views, or packages in RAW, which actually perform the desired data analysis.
In this scenario, the source systems remain the primary repositories of data; RAW fetches data from them as needed and may create transient caches for performance reasons. The source systems are expected to remain available and to continue hosting the data.
RAW provides language constructs in the RQL query language to refer to each source system directly, and to specify how “fresh” data should be. For instance:
READ("dropbox://data.json", cache := "interval 5 minutes")
The query above reads a JSON file from Dropbox. This data will typically be cached in RAW, since reading from Dropbox is an expensive operation. Once the data is cached in RAW, its cache is valid for 5 minutes: if the same source is referenced within a 5-minute interval, it will usually be served from RAW’s cache (unless other queries forced this cache to be evicted). A query after the 5-minute period will force RAW to retrieve a “fresher” version of data from Dropbox.
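The freshness rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RAW's actual caching code: the `FreshnessCache` class, its `read` and `evict` methods, and the injectable `clock` are all hypothetical names introduced for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FreshnessCache:
    """Toy cache: an entry is reused only while it is younger than the
    maximum age requested by the current read (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # url -> (time the data was fetched, the data itself)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def read(self, url: str, fetch: Callable[[str], Any],
             max_age_seconds: float) -> Any:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None:
            fetched_at, data = entry
            if now - fetched_at <= max_age_seconds:
                return data  # fresh enough for this query: serve from cache
        # Entry is missing (never read, or evicted) or too old: go to the source.
        data = fetch(url)
        self._entries[url] = (now, data)
        return data

    def evict(self, url: str) -> None:
        """Drop an entry, e.g. because other queries needed the space."""
        self._entries.pop(url, None)
```

With a 300-second maximum age, two reads of the same URL two minutes apart trigger a single fetch, while a read after more than five minutes, or after an eviction, fetches again.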
In other scenarios, for instance when source systems such as streaming sources host data only on an ephemeral basis, it is possible to ingest data directly into RAW. This is done using tables. RAW owns and hosts the data in a table on a permanent basis. Once data is inserted into a table, you can likewise write queries, views, or packages that perform the desired analysis simply by referencing the table by name.
RAW reads data from the source system as required by each data analysis. Some data sources are secure and their data can only be read by authorized users or machines. Therefore, for these secure data sources, it is first necessary to add their credentials in RAW. For data sources that are not secured or otherwise publicly available, no access credentials are required.
Adding access credentials is illustrated in the usage guide for the various RAW clients.
Once access credentials are set up, if necessary, it is time to analyze data.
In RAW, users analyze data using queries. Queries allow users to choose, filter, join, aggregate or otherwise transform data. In RAW, queries read data directly from the original data sources.
Queries that need to be executed often can be given a name, and then re-run by name. This is called a view.
RAW has support for “virtual” and “materialized” views, similar to classical database systems.
Virtual views are simply a way to refer to a query by name. However, in RAW, the results of a virtual view may end up being cached by RAW's caching system. Cached virtual views have similar performance characteristics to materialized views, except that RAW may create and remove the caches transparently to the user.
Materialized views have the additional property that the query results are persisted on disk, so that future accesses are faster. However, in RAW, due to features such as “data freshness”, materialized views may require the materialization to occur during query execution. This can happen if the materialized view has never been used before, or if the last materialization is “too old” for the freshness required by the query.
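As a rough sketch of that behavior, the Python function below materializes a view on demand: it reuses a result persisted on disk when one exists and is recent enough, and otherwise runs the query and persists the new result. The function name `read_materialized`, the JSON file used as storage, and the `max_age_seconds` parameter are assumptions made for this example; they do not describe how RAW stores materialized views.

```python
import json
import os
import time
from typing import Any, Callable, Optional


def read_materialized(path: str,
                      run_query: Callable[[], Any],
                      max_age_seconds: Optional[float] = None,
                      now: Optional[float] = None) -> Any:
    """Return the view's result, re-materializing it to `path` when needed.

    Hypothetical helper: `max_age_seconds=None` means any existing
    materialization is acceptable; `now` can be passed in for testing.
    """
    now = time.time() if now is None else now
    if os.path.exists(path):
        age = now - os.path.getmtime(path)
        if max_age_seconds is None or age <= max_age_seconds:
            # Fast path: a recent enough materialization is already on disk.
            with open(path, "r", encoding="utf-8") as f:
                return json.load(f)
    # Never materialized, or too old for the requested freshness:
    # the materialization happens as part of this read.
    result = run_query()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result
```

The first read of a view pays the full cost of the query; later reads within the freshness window only read the persisted file.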
Refer to Views for additional information.
Oftentimes it is convenient to create sets of queries that relate to a given concept. This is called a package.
A package, which has no direct equivalent in classical database systems, consists of a set of queries that are grouped together. For instance, all queries regarding departments (the list of departments, the average salary by department, the employees of a department, etc.) can be grouped into a package called “departments”.
Refer to Packages for additional information.
Tables store collections of data in RAW on a permanent basis. RAW owns and hosts this data. It is not “cached” but rather stored permanently, or until it is deleted by the user.
Operations on RAW tables use an optimistic concurrency model.
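For readers unfamiliar with the term, the sketch below shows the general idea of optimistic concurrency in Python: writers do not take locks, but a write only succeeds if the row is still at the version the writer originally read. It illustrates the general technique under assumed names (`VersionedTable`, `ConflictError`) and says nothing about how RAW implements it internally.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


class ConflictError(Exception):
    """Raised when a row changed after the caller read it."""


@dataclass
class VersionedTable:
    """Toy in-memory table where every row carries a version number."""
    rows: Dict[str, Tuple[int, Any]] = field(default_factory=dict)

    def insert(self, key: str, value: Any) -> int:
        self.rows[key] = (1, value)
        return 1

    def read(self, key: str) -> Tuple[int, Any]:
        # Returns (version, value); the caller keeps the version for its write.
        return self.rows[key]

    def write(self, key: str, expected_version: int, value: Any) -> int:
        current_version, _ = self.rows[key]
        if current_version != expected_version:
            # Someone else wrote in between: reject instead of overwriting.
            raise ConflictError(
                f"{key}: expected v{expected_version}, found v{current_version}")
        self.rows[key] = (current_version + 1, value)
        return current_version + 1
```

A writer that receives `ConflictError` would typically re-read the row, reapply its change, and try again.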
In addition, it is possible to create indexes in tables, for performance-sensitive operations.
Refer to Tables for additional information.
In RAW, each query specifies the desired output format. That is, the same query results can be produced as a CSV file, a JSON document, or some other output format.
This is exposed to the user by an HTTP(S) REST call which allows the user to specify the requested output type; the response to the REST call is an HTTP stream whose body contains the data in the requested format.
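The separation between query results and output format can be pictured with a small Python sketch that renders the same rows either as CSV or as JSON. The `render` function and the format names `"csv"` and `"json"` are hypothetical; the example does not show RAW's REST API, whose endpoints and parameters are described in the client usage guides.

```python
import csv
import io
import json
from typing import Dict, List


def render(rows: List[Dict[str, object]], output_format: str) -> str:
    """Serialize the same result rows in the requested format (illustrative)."""
    if output_format == "json":
        return json.dumps(rows)
    if output_format == "csv":
        buf = io.StringIO()
        # Column names are taken from the first row; an empty result has none.
        fieldnames = list(rows[0].keys()) if rows else []
        writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported output format: {output_format}")
```

The caller chooses the format at request time, and the data being serialized is the same in every case.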
In practice, this means it is easy to integrate RAW with other systems. For instance, let’s assume you want to build a search service. You use RAW to prepare data, performing all the data “extraction”, transformation and preparation in RAW scripts. You then produce a view in RAW with the desired code. Then, to ingest this data into e.g. Solr, you ask RAW to execute the view with an output as HJSON, which can be directly consumed by Solr.