22. Best Practices

Answers to some frequently asked questions about best practices when using RAW.

When should I use views vs materialized views vs tables?

Views are a good choice when cache durations are short (i.e. the data changes quickly) or during the early phases of a project, when a lot of experimentation is still being done and forcing the creation of materialized views may not make sense. Because views can be virtual, RAW will cache opportunistically, and this may well be sufficient as an extra performance boost.

Materialized views give better performance guarantees and, as a rule of thumb, are the safer option. The exception is short cache durations, where views are the better fit.
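The freshness trade-off between the two can be sketched outside RAW. The example below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in engine, not RAW, and because SQLite has no materialized views, a `CREATE TABLE ... AS SELECT` snapshot emulates one. The names `events`, `totals_view`, `totals_snapshot` and `total_for` are hypothetical.

```python
import sqlite3

# Stand-in engine: SQLite, not RAW. SQLite has no materialized views,
# so a CREATE TABLE ... AS SELECT snapshot emulates one here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", 1), ("a", 2), ("b", 5)],
)

# Virtual view: the aggregation is recomputed on every query.
conn.execute(
    "CREATE VIEW totals_view AS "
    "SELECT kind, SUM(amount) AS total FROM events GROUP BY kind"
)

# Emulated materialized view: the aggregation is computed once and stored.
conn.execute(
    "CREATE TABLE totals_snapshot AS "
    "SELECT kind, SUM(amount) AS total FROM events GROUP BY kind"
)

# The source data changes after both objects were created.
conn.execute("INSERT INTO events VALUES ('a', 10)")


def total_for(relation, kind):
    """Return the stored or computed total for one kind (hypothetical helper)."""
    row = conn.execute(
        f"SELECT total FROM {relation} WHERE kind = ?", (kind,)
    ).fetchone()
    return row[0]


view_total = total_for("totals_view", "a")          # sees the new row
snapshot_total = total_for("totals_snapshot", "a")  # still the old result
print(view_total, snapshot_total)
```

The view reflects the new row immediately, while the snapshot keeps serving the precomputed result until it is rebuilt. That is the reason fast-changing data favours views, and stable data favours materialized views.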

When should I use RQL vs SQL?

RQL is required for “ETL”-ish tasks, i.e. the steps of data exploration. All views, materialized views and packages are defined in RQL.

SQL, however, is best used over established views and, in particular, over materialized views and tables. In these scenarios, SQL can actually lead to better query times than RQL, because SQL's semantics are simpler than RQL's and therefore better suited to cost-based optimization.

Should I rely on inference or specify the schema manually?

Inference is helpful and should be used particularly in the early phases of data exploration.

However, once queries are stable, we recommend specifying the schema manually. This prevents accidental errors if the source data were to change and a new schema - still “compatible” with the previous one - were detected.
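To make the risk concrete, here is a minimal Python sketch of how a "compatible" schema change can slip through inference but be caught by an explicit schema. It does not use RAW's API: `EXPECTED_SCHEMA`, `infer_schema` and `schema_mismatches` are hypothetical names, and the int-to-float widening rule is an assumption about what a "compatible" change might look like.

```python
# Hypothetical illustration; none of these names come from RAW.
EXPECTED_SCHEMA = {"id": int, "price": float, "name": str}


def infer_schema(records):
    """Infer a field -> type mapping, widening int to float when both occur."""
    schema = {}
    for record in records:
        for field, value in record.items():
            seen = schema.get(field)
            current = type(value)
            if seen is None or seen is current:
                schema[field] = current
            elif {seen, current} == {int, float}:
                # Assumed "compatible" change: silently widen to float.
                schema[field] = float
            else:
                schema[field] = object
    return schema


def schema_mismatches(records, expected):
    """Return the sorted fields whose inferred type differs from the expected one."""
    inferred = infer_schema(records)
    return sorted(
        field
        for field in set(expected) | set(inferred)
        if expected.get(field) is not inferred.get(field)
    )


old_batch = [{"id": 1, "price": 9.5, "name": "pen"}]
# The source changed: one id is now fractional.
new_batch = [
    {"id": 1, "price": 9.5, "name": "pen"},
    {"id": 2.5, "price": 3.0, "name": "ink"},
]

print(schema_mismatches(old_batch, EXPECTED_SCHEMA))
print(schema_mismatches(new_batch, EXPECTED_SCHEMA))
```

With inference alone, the second batch would simply yield a widened schema and downstream queries would keep running on a different type. Checking against the manually specified schema surfaces the `id` change as an explicit mismatch.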

How do I run large-scale queries with good performance?

Use SQL over materialized views or tables.

How do I run queries that output a lot of data?

RAW is currently best suited to returning small results to the user, e.g. for ingestion into a Jupyter notebook.

Future plans include the possibility for queries to directly write their output - potentially in parallel - to an external location. This will be required for queries that output large sets of data.

Thank you for completing our introductory tutorial.

Feedback is welcome!

Send us a note here.
