10. Materialized Views - Keeping Up To Date¶
Materialized views are automatically refreshed after their cache has expired. In this notebook, we illustrate how to efficiently define a materialized view to be “the last week of data” of a very large RDMBS table.
An RDBMS table contains historical information about taxis trips. Data is added daily to the table.
%%rql SELECT * FROM READ_PGSQL("psql01", "public", "taxis") ORDER BY tpep_pickup_datetime DESC LIMIT 10
|4||2018-10-06 23:59:57||2018-10-07 00:13:16||1||3.67||1||N||163||74||1||13.0||0.5||0.5||2.86||0.0||0.3||17.16|
|4||2018-10-06 23:59:47||2018-10-07 00:25:14||1||6.02||1||N||151||211||2||21.0||0.5||0.5||0.0||0.0||0.3||22.3|
|4||2018-10-06 23:59:07||2018-10-07 00:15:51||1||3.23||1||N||158||48||2||14.0||0.5||0.5||0.0||0.0||0.3||15.3|
|4||2018-10-06 23:58:13||2018-10-07 00:09:01||1||2.1||1||N||238||141||2||9.5||0.5||0.5||0.0||0.0||0.3||10.8|
|4||2018-10-06 23:58:01||2018-10-07 00:04:17||1||1.37||1||N||107||229||1||7.0||0.5||0.5||1.66||0.0||0.3||9.96|
|4||2018-10-06 23:57:49||2018-10-07 00:13:21||1||2.65||1||N||142||107||1||12.0||0.5||0.5||2.66||0.0||0.3||15.96|
|4||2018-10-06 23:57:34||2018-10-07 00:09:27||1||2.85||1||N||45||189||2||12.0||0.5||0.5||0.0||0.0||0.3||13.3|
|4||2018-10-06 23:56:39||2018-10-07 00:04:41||1||1.33||1||N||164||68||1||7.0||0.5||0.5||1.66||0.0||0.3||9.96|
|4||2018-10-06 23:56:21||2018-10-07 00:13:29||1||3.21||1||N||114||237||1||14.0||0.5||0.5||3.83||0.0||0.3||19.13|
|4||2018-10-06 23:56:17||2018-10-07 00:00:00||1||0.91||1||N||236||239||2||5.0||0.5||0.5||0.0||0.0||0.3||6.3|
We would like to work against this table. More specifically, we’d like to analyze the most recent trips only (a week of data).
Materializing the full table would be an option. However, since we are interested in recent rows only and need frequent updates, a materialized view which evaluates to the most recent data only would be cheaper to create, and faster to query.
In the table, one of the columns permits to identify the recent rows. It’s a timestamp which value is the taxi trip pickup time. It is called
tpep_pickup_datetime. The maximum value found in that column is therefore the time of the most recent trip.
%%rql SELECT MAX(tpep_pickup_datetime) FROM taxis_trips
Trips that have been registered recently can be queried by filtering rows which
tpep_pickup_datetime is recent enough compared to the maximum value found in the column.
A materialized view which first computes
MAX(tpep_pickup_datetime) and filters the most recent rows is what we need: * It contains only the rows we’re interested in (seven days), * Its content will automatically be updated after the
cache parameter (three days) has expired.
%%materialized_view taxis_7_days taxis_trips := SELECT * FROM READ_PGSQL("psql01", "public", "taxis", cache := INTERVAL "3 DAYS"); // Time of the most recent trip max_pickup_time := SELECT MAX(tpep_pickup_datetime) FROM taxis_trips; // Recent trips obtained by comparing them to the most recent one SELECT * FROM taxis_trips WHERE tpep_pickup_datetime > max_pickup_time - INTERVAL "7 DAYS"
Materialized view "taxis_7_days" created
The materialized view is defined. Its content (last seven days) is saved to storage when queried for the first time. Since a cache value of three days is specified, the materialized view is reused for three days.
%%rql SELECT day, AVG(trip_distance) AS distance, AVG(tip_amount) AS tip FROM taxis_7_days GROUP BY CAST(tpep_pickup_datetime AS DATE) AS day ORDER BY day
Three days later, one of our queries runs after the cache duration has expired. The materialized view is then recomputed automatically. In that moment, it has been three days the RDBMS table had new rows inserted. The materialized view execution will compute
MAX(tpep_pickup_datetime) over these more recent rows, and as expected, newer records will be returned when filtering the last seven days.