10. Materialized Views - Keeping Up To Date

Materialized views are automatically refreshed after their cache has expired. In this notebook, we illustrate how to efficiently define a materialized view to be “the last week of data” of a very large RDMBS table.

%load_ext raw_magic

An RDBMS table contains historical information about taxis trips. Data is added daily to the table.

FROM READ_PGSQL("psql01", "public", "taxis")
ORDER BY tpep_pickup_datetime DESC
42018-10-06 23:59:572018-10-07 00:13:1613.671N16374113.
42018-10-06 23:59:472018-10-07 00:25:1416.021N151211221.
42018-10-06 23:59:072018-10-07 00:15:5113.231N15848214.
42018-10-06 23:58:132018-10-07 00:09:0112.11N23814129.
42018-10-06 23:58:012018-10-07 00:04:1711.371N10722917.
42018-10-06 23:57:492018-10-07 00:13:2112.651N142107112.
42018-10-06 23:57:342018-10-07 00:09:2712.851N45189212.
42018-10-06 23:56:392018-10-07 00:04:4111.331N1646817.
42018-10-06 23:56:212018-10-07 00:13:2913.211N114237114.
42018-10-06 23:56:172018-10-07 00:00:0010.911N23623925.

We would like to work against this table. More specifically, we’d like to analyze the most recent trips only (a week of data).

Materializing the full table would be an option. However, since we are interested in recent rows only and need frequent updates, a materialized view which evaluates to the most recent data only would be cheaper to create, and faster to query.

In the table, one of the columns permits to identify the recent rows. It’s a timestamp which value is the taxi trip pickup time. It is called tpep_pickup_datetime. The maximum value found in that column is therefore the time of the most recent trip.

SELECT MAX(tpep_pickup_datetime)
FROM taxis_trips
2018-10-06 23:59:57

Trips that have been registered recently can be queried by filtering rows which tpep_pickup_datetime is recent enough compared to the maximum value found in the column.

A materialized view which first computes MAX(tpep_pickup_datetime) and filters the most recent rows is what we need: * It contains only the rows we’re interested in (seven days), * Its content will automatically be updated after the cache parameter (three days) has expired.

%%materialized_view taxis_7_days
taxis_trips := SELECT *
    FROM READ_PGSQL("psql01", "public", "taxis", cache := INTERVAL "3 DAYS");

// Time of the most recent trip
max_pickup_time :=
    SELECT MAX(tpep_pickup_datetime)
    FROM taxis_trips;

// Recent trips obtained by comparing them to the most recent one
FROM taxis_trips
WHERE tpep_pickup_datetime > max_pickup_time - INTERVAL "7 DAYS"
Materialized view "taxis_7_days" created

The materialized view is defined. Its content (last seven days) is saved to storage when queried for the first time. Since a cache value of three days is specified, the materialized view is reused for three days.

SELECT day, AVG(trip_distance) AS distance, AVG(tip_amount) AS tip
FROM taxis_7_days
GROUP BY CAST(tpep_pickup_datetime AS DATE) AS day

Three days later, one of our queries runs after the cache duration has expired. The materialized view is then recomputed automatically. In that moment, it has been three days the RDBMS table had new rows inserted. The materialized view execution will compute MAX(tpep_pickup_datetime) over these more recent rows, and as expected, newer records will be returned when filtering the last seven days.