Skip to main content

RQL Concepts

A declarative query language

RQL is a programming language that associates concepts of both SQL and functional programming. The language syntax is close to SQL. Familiar SQL keywords like SELECT, GROUP BY, ORDER BY permit the user to specify an SQL-like query against a dataset. At the same time, the language offers functional programming constructs (function declarations, variables) that enable scripting capabilities.

An example RQL program
olympic_dataset := READ_CSV("s3://bucket/olympics.csv");

how_many_olympic_games(location: string) := {
SELECT COUNT(*)
FROM olympic_dataset
WHERE city = location
};

cities := READ_JSON("s3://bucket/cities.json");

SELECT city, how_many_olympic_games(city)
FROM city IN cities

Data model (collections, records, primitive types)

RQL's data model isn't limited to tables of rows with primitive types like most SQL systems. The following JSON file contains an array of JSON objects, in which field products is an array of JSON objects. That dataset has a complex structure and can be processed as a table by an RQL query.

https://raw-tutorial.s3.amazonaws.com/sales.json
[
{"country": "CH",
"products": [
{"category": "Keyboard", "cost": 50},
{"category": "Keyboard", "cost": 70},
{"category": "Monitor", "cost": 450}]},
{"country": "US",
"products": [
{"category": "Keyboard", "cost": 20},
{"category": "Monitor", "cost": 200}]}
]
SELECT *
FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json")
WHERE country = "CH"

Such input dataset that contains nested collections, types as a collection of records which one the fields itself types as a collection:

collection(
record(
country: string,
products: collection(
record(category: string, cost: int)
)
)
)

A collection that is present in each individual row is no different from any other collection. It is a table embedded in each record. Queries are allowed on them.

Here is an example that filters records with more than 100 products, and returns their average cost, using COUNT and AVG aggregations. Both queries apply to the products nested collection of each record of the JSON file.

SELECT country, (SELECT AVG(cost) FROM products) AS avg_cost
FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json")
WHERE (SELECT COUNT(*) FROM products) >= 100

For convenience, RQL defines aggregation keywords that can be applied to collections.

SELECT country, CAVG(SELECT cost FROM products) AS avg_cost
FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json")
WHERE CCOUNT(products) >= 100

Expression types (SELECT and others)

In SQL, queries are always expressed in the form of a SELECT that technically returns a collection of rows. The expression below types in RQL as a collection: collection(record(country: string, avg_cost: int null))

SELECT country, (SELECT AVG(cost) FROM products) AS avg_cost
FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json")

Because RQL allows mixing both SQL queries and regular scripting constructs, SELECT queries that perform aggregation and return one item, do not type as collections.

SELECT COUNT(*) FROM ... // types as long
SELECT MAX(price) FROM ... // with price typing as float, this types as float null

These specific typing rules resulting in non-collections, match functional programming counterparts of these aggregations.

sales := READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json");
n_entries := // a long
SELECT COUNT(*)
FROM sales;
avg_n_items := // a long null
SELECT AVG(SELECT COUNT(*) FROM products)
FROM sales;
big := // a long
SELECT COUNT(*)
FROM sales
WHERE CCOUNT(products) >= avg_n_items;
// big and n_entries used as regular numbers in a computation
ratio := CAST(big AS float) / n_entries;

SELECT queries that perform an aggregation, type as primitives (the type of the aggregation). When several SELECT aggregations are performed, the query types as a record, in which each field types as the corresponding aggregation.

SELECT MIN(cost) AS minCost, MAX(cost) AS maxCost, AVG(cost) AS avgCost
FROM products

types as

record(minCost: int null, maxCost: int null, avgCost: int null)

When aggregations aren't given explicit column names in the SELECT, RQL automatically names the resulting record fields:

SELECT MIN(cost), MAX(cost), AVG(cost)
FROM products

types as

record(_1: int null, _2: int null, _3: int null)

GROUP BY

Like regular SQL, RQL has a GROUP BY keyword that computes aggregations by key.

Say a dataset of cities is made available, that associates a number of cities to their country and their number of inhabitants.

city|country|population
Tokyo|Japan|39105000
Jakarta|Indonesia|35362000
Delhi|India|31870000
Manila|Philippines|23971000
São Paulo|Brazil|22495000
...

The following RQL query returns the largest and smallest population of cities in each country:

SELECT country, MIN(population), MAX(population)
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY country

That RQL query has the familiar shape of a regular SQL GROUP BY.

PARTITION

In RQL, GROUP BY allows to specify more than plain aggregations next to the aggregating key. The matching rows themselves are made available to the query, by means of the PARTITION keyword.

When projected from a GROUP BY query, the PARTITION keyword evaluates to the nested collection of rows that matched the key.

SELECT country, PARTITION
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY country
[
{
"country": "Afghanistan",
"partition": [
{"city": "Kandahār", "country": "Afghanistan", "population": 614254},
{"city": "Jalālābād", "country": "Afghanistan", "population": 263312},
{"city": "Tāluqān", "country": "Afghanistan", "population": 263800},
{"city": "Kabul", "country": "Afghanistan", "population": 4273156},
{"city": "Kunduz", "country": "Afghanistan", "population": 259809},
{"city": "Mazār-e Sharīf", "country": "Afghanistan", "population": 469247},
...
]
},
...
{
"country": "Armenia",
"partition": [
{"city": "Gyumri", "country": "Armenia", "population": 121976},
{"city": "Vanadzor", "country": "Armenia", "population": 90525},
{"city": "Abovyan", "country": "Armenia", "population": 44400},
{"city": "Kapan", "country": "Armenia", "population": 43190},
{"city": "Armavir", "country": "Armenia", "population": 38635},
{"city": "Yerevan", "country": "Armenia", "population": 1075800},
{"city": "Charentsavan", "country": "Armenia", "population": 25039},
{"city": "Artashat", "country": "Armenia", "population": 21300},
{"city": "Hrazdan", "country": "Armenia", "population": 52808},
...
]
},
...
]

Since it is itself a collection, PARTITION can be queried with RQL constructs. The query below returns the list of countries, associated with their three biggest cities:

SELECT country, (SELECT city, population
FROM PARTITION
ORDER BY population DESC LIMIT 3) AS large_cities
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY country
    {
"country": "Afghanistan",
[
{"city": "Kabul", "population": 4273156},
{"city": "Kandahār", "population": 614254},
{"city": "Herāt", "population": 556205}
]
},
...
{
"country": "Armenia", "large_cities": [
{"city": "Yerevan", "population": 1075800},
{"city": "Gyumri", "population": 121976},
{"city": "Vanadzor", "population": 90525}
]
},
...
]

The presence of a custom query against PARTITION after a GROUP BY doesn't prevent computing regular SQL aggregations at the same time.

SELECT country, (SELECT city, population
FROM PARTITION
ORDER BY population DESC LIMIT 3) as large_cities,
MIN(population) as minPop
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY country

Here's the record obtained for Switzerland.

[
...
{
"country": "Switzerland",
"large_cities": [
{"city": "Zürich", "population": 434008},
{"city": "Geneva", "population": 201818},
{"city": "Basel", "population": 177827}
],
"minPop": 5026
}
...
]

Key expression and alias

Any arbitrary expression can be used as a GROUP BY key. As an example, the key can be the first letter of the city's name: SUBSTR(city, 1, 1). When using a complex expression as a grouping key, copying the expression in projection won't be accepted. The row and its fields aren't in scope after a GROUP BY.

attention

This query doesn't typecheck: city is not declared. The expression can't be projected.

SELECT SUBSTR(city, 1, 1), COUNT(*) as N
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY SUBSTR(city, 1, 1)

Using the grouping key expression is possible if it is aliased in the GROUP BY clause:

SELECT firstLetter, COUNT(*) as N
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY SUBSTR(city, 1, 1) AS firstLetter
[
{"firstLetter": "F", "N": 1012},
{"firstLetter": "M", "N": 3263},
{"firstLetter": "T", "N": 2157},
{"firstLetter": "Ä", "N": 3},
...
]

The same applies to the HAVING clause, that permits to filter rows after GROUP BY has been applied:

attention

This query doesn't typecheck: city is not declared. The expression can't be used in HAVING.

SELECT ...
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY SUBSTR(city, 1, 1)
HAVING SUBSTR(city, 1, 1) IN ["X", "Y", "Z"]

This query declares and uses an alias. It is accepted.

SELECT firstLetter, COUNT(*)
FROM READ_CSV("s3://bucket/cities.csv")
GROUP BY SUBSTR(city, 1, 1) AS firstLetter
HAVING firstLetter IN ["X", "Y", "Z"]

Arbitrarily nested queries

A number of examples demonstrate how RQL permits querying nested collections, by running a regular nested query on them. Nested queries aren't limited to nested collections. RQL expressions can include arbitrarily nested subqueries.

To begin here is a regular query that browses the cities dataset and returns a collection containing the three biggest cities of a given country. If it would be entered with another country name, it would return the three largest cities of that country.

SELECT *
FROM READ_CSV("s3://bucket/cities.csv")
WHERE country = "Switzerland"
ORDER BY population DESC
LIMIT 3

That query evaluates to a collection of records. RQL allows that query to appear nested in another. Say for example a short dataset of Olympic games locations lists the Olympic games from 1896 to today, and reports the organizer's country name and the event starting date. The following query filters the country names of recent Olympic events, and associates each one to a short nested collection of its three biggest cities, using the earlier query.

SELECT olympic.country, (
// the query, now nested
SELECT *
FROM READ_CSV("s3://bucket/cities.csv")
WHERE country = olympic.country // filtering the country name
ORDER BY population DESC
LIMIT 3)
FROM READ_CSV("s3://bucket/olympic_games.csv") olympic
WHERE olympic.start_date >= DATE '1996-01-01'

NULL values

In SQL, unknown values are evaluated to NULL. That's why in most cases, the presence of a NULL value in an expression turns that expression to NULL itself. NULL in RQL has the same semantic:

  • Expressions like x + NULL, LENGTH(x), CAST(x AS INT) all evaluate to NULL when x is NULL.
  • x BETWEEN y AND z evaluates to NULL if any x, y or z IS NULL.
  • x OR TRUE evaluates to TRUE regardless of the value of x, including when x is NULL.
  • x AND FALSE evaluates to FALSE regardless of the value of x, including when x is NULL.

Like in SQL an expression that evaluates to NULL is considered as false when used as a WHERE predicate: WHERE price > 100 filters out a row that has a null price.

RQL supports the SQL ISNULL keyword that replaces a value that is null by another one: ISNULL(price, 0) evaluates to 0 when price is null, price otherwise.

RQL has a rich data model that supports complex objects like records and collections. Such objects can be null too. Following the semantic that a NULL value is unknown, many operations on NULL records and collections evaluate to NULL.

  • r.name, that extracts the value of the name field of a record r, evaluates to NULL when r is NULL.
  • SELECT name, price FROM data that processes the data collection, evaluates to NULL when data is NULL.
  • x UNION y that concatenates two collections x and y, evaluates to a NULL collection when x or y is NULL.

NULL arguments may modify the evaluation of user-defined RQL functions.

In the function below, that computes an aggregation over rows between two dates, parameters are specified as date, therefore not nullable. The function returns a non nullable collection of records.

// types as collection(record(country: string, sales: float null))
sales_per_country(start_date: date, end_date: date) := {
SELECT country, SUM(sales) AS sales
FROM READ_CSV(...)
WHERE sale_date BETWEEN start_date AND end_date
GROUP BY COUNTRY
}

When the function is called with arguments that are not nullable, it applies as expected. When called with a nullable argument, the function call is protected, and evaluates to NULL:

maxDate := SELECT MAX(item_sale_date) // might evaluate to `NULL` if no row matches "shoes"
FROM some_data
WHERE item_type = "shoes";

// If maxDate is NULL, evaluates to a NULL collection
sales_per_country(DATE "2000-01-01", maxDate)

In the implementation below, parameters are this time explicitly specified as nullable. If any of start_date or end_date is null, the function is applied without the protection mentioned earlier. The NULL-related logic now applies at the level of its body. If one of the arguments is NULL, the BETWEEN predicate evaluates to NULL. That predicate is used in a WHERE clause. The WHERE interprets it as false. With this implementation, a NULL argument leads to an empty GROUP BY result. is empty.

sales_per_country(start_date: date null, end_date: date null) := {
SELECT country, SUM(sales) AS sales
FROM READ_CSV(...)
WHERE sale_date BETWEEN start_date AND end_date // false if start or end date is null
GROUP BY COUNTRY
}

RQL vs. SQL

RQL is purposedly made similar to SQL. Here is what to keep in mind regarding their similarity:

  • Both are declarative languages. RQL is an extension of SQL with functional programming constructs,
  • RQL's collections of records are equivalent to SQL's tables. RQL's data model is however richer since nested collections (nested tables) are supported and can be queried like top-level ones.
  • RQL supports aggregation keywords that come from SQL. SELECT queries that perform aggregations however type as primitive or records. They don't evaluate to a collection of one item.
  • RQL's GROUP BY has the same syntax as in SQL. For each identified key, it puts in scope the input PARTITION, the subset of the input that matches the key. This permits arbitrary operations on the matching rows.
  • Nested queries are allowed in RQL, at any level.
  • RQL NULL values have the same semantic as in SQL. The semantic is extended to RQL's records, collections, and function calls.