Caching

Queries in RAW read data from the original source.

However, retrieving data from each source repeatedly could be slow or place too much load on the source system.

For this reason, RAW creates caches of the data. These caches are kept by RAW in its internal storage, which is cluster-wide and can grow to be very large.

Caches are never created pre-emptively: they are always created during query execution, as a "side-effect" of running a query. These caches can be of:

- the original source data;
- intermediate query results;
- generated code.

Caching the original data is clearly advantageous whenever possible. Intermediate query results and generated code are cached opportunistically, as they may benefit "future similar queries".

Caches are maintained by RAW and do not require user intervention.

(The next notebook will introduce "materialized views", which give users explicit control over cache creation.)

When a query runs, users need guarantees on how fresh the data is. Reading stale data from caches could lead to queries producing output based on outdated data.

On the other hand, to efficiently cache data, RAW needs to know how often it is used.

In RAW, both these concepts are implemented by a single "hint". This hint is the "cache duration" of a source, i.e. how "fresh" its data must be. The cache duration hint is always part of a query: if it is not specified by the user, a system-wide default setting is used.

In the query above, we explicitly set the cache duration to 10 seconds. This means that if the data available in the RAW caches is 10 seconds old or younger, it may be used. But if the data is older than 10 seconds, RAW must fetch it fresh from the original data source.
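
One way to picture this freshness check is the sketch below. This is not RAW's implementation: the in-memory cache, the `read` and `fetch` functions, and the default duration are hypothetical names used only to illustrate the cache-duration logic.

```python
import time

# Hypothetical in-memory cache: source URL -> (timestamp, data).
_cache = {}

DEFAULT_CACHE_DURATION = 300  # illustrative system-wide default, in seconds


def read(source_url, cache_duration=None, fetch=None):
    """Return data for source_url, reusing cached data if it is fresh enough.

    cache_duration is the maximum acceptable age of cached data, in seconds.
    If None, the system-wide default applies. `fetch` is the function that
    actually retrieves data from the original source.
    """
    max_age = DEFAULT_CACHE_DURATION if cache_duration is None else cache_duration
    now = time.time()

    entry = _cache.get(source_url)
    if entry is not None:
        cached_at, data = entry
        # Cached data may be reused only if it is younger than the hint.
        if now - cached_at <= max_age:
            return data

    # Cache miss or stale entry: fetch fresh data and cache it as a side-effect.
    data = fetch(source_url)
    _cache[source_url] = (now, data)
    return data
```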

If you repeatedly run the query above, you may see a difference in speed, since the query needs to issue a remote HTTP call only when the cached data is too old. (This becomes more evident with "slow" sources or when reading larger amounts of data.)

If cache is not specified explicitly in the READ command, a system default is used.
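
Continuing the sketch above, usage with and without an explicit hint might look like this; the URL and the `fetch_http` helper are again purely illustrative.

```python
import urllib.request


def fetch_http(url):
    # Plain HTTP fetch standing in for the "original source" in this sketch.
    with urllib.request.urlopen(url) as response:
        return response.read()


# Explicit hint: cached data may be reused if it is at most 10 seconds old.
data = read("https://example.com/data.json", cache_duration=10, fetch=fetch_http)

# No hint: the system-wide default (DEFAULT_CACHE_DURATION above) applies.
data = read("https://example.com/data.json", fetch=fetch_http)
```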

Next: Materialized Views