5. Producing Complex Data

In the previous notebook we saw how to query complex data and started exploring RAW’s data model.

In this notebook we continue this exploration by showing how RAW queries can also output complex data structures, which will be particularly useful when we see how to export RAW results into formats like JSON or XML.

[8]:
%load_ext raw_magic
The raw_magic extension is already loaded. To reload it, use:
  %reload_ext raw_magic

Collections in the output field of a SELECT

We start by the following query:

[9]:
%%rql

SELECT *
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale
[9]:
countryproducts
categorycost
CHKeyboard50
Keyboard70
Monitor450
USKeyboard20
Monitor200

Recall that products is a “nested table”.

Let’s analyze the following query:

[13]:
%%rql

SELECT sale.country, (SELECT p.cost FROM sale.products AS p) AS products_cost
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale
[13]:
countryproducts_cost
CH50
70
450
US20
200

This query returns two rows: one for CH, another for US.

The first column is the country, and the second column is a list of the cost of the products in that country.

How does this work?

The inner SELECT contains (SELECT p.cost FROM sale.products AS p). We can think of this as a normal query, over a table called sale.products. THe output of that query is a table with the cost of each product.

When we compose it in a single query:

SELECT sale.country, (SELECT p.cost FROM sale.products AS p) AS products_cost
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale

… then the result is a table, where the second column contains another table.

If we were to represent the output as JSON, it would look like:

[
  {"country": "CH",
   "products_cost": [50, 70, 450]},
  {"country": "US",
   "products_cost": [20, 200]}
]

To further demonstrate that SELECTs are just operations over collections of data, let’s add a filter to the inner SELECT:

[17]:
%%rql

SELECT sale.country, (SELECT p.cost FROM sale.products AS p WHERE p.cost > 60) AS products_cost_over_60
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale
[17]:
countryproducts_cost_over_60
CH70
450
US200

This query filtered the products in the inner SELECT for those that cost > 60.

We can even do aggregations:

[16]:
%%rql

SELECT sale.country, (SELECT COUNT(*) FROM sale.products AS p WHERE p.cost > 60) AS number_products_cost_over_60
FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json") AS sale
[16]:
countrynumber_products_cost_over_60
CH2
US1

This query counts the number of products in each country that cost over 60.

Extensions to GROUP BY

We start by a traditional aggregation in SQL.

[47]:
%%rql

SELECT Country, COUNT(*)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2
[47]:
Country_2
Afghanistan21
Albania1

This query lists the number of airports per country.

In RAW, however, GROUP BY produces “groups” that be queried.

When the GROUP BY keyword is used, the * is bound to the group.

To query the entire “group” for a given country - i.e. the airports in each country - we can do:

[48]:
%%rql

SELECT Country, *
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2
[48]:
Country_2
AirportIDNameCityCountryIATA_FAAICAOLatitudeLongitudeAltitudeTimezoneDSTTZ
Afghanistan2048HeratHeratAfghanistanHEAOAHR34.21001762.228332064.5UAsia/Kabul
2049JalalabadJalalabadAfghanistanJAAOAJL34.39984270.49862518144.5UAsia/Kabul
2050Kabul IntlKabulAfghanistanKBLOAKB34.56585369.21232858774.5UAsia/Kabul
2051KandaharKandaharAfghanistanKDHOAKN31.50575665.84782233374.5UAsia/Kabul
2052MaimanaMaimamaAfghanistanMMZOAMN35.93078964.76091727434.5UAsia/Kabul
2053Mazar I SharifMazar-i-sharifAfghanistanMZROAMS36.70691467.20967812844.5UAsia/Kabul
2054ShindandShindandAfghanistanOASD33.39133162.26097537734.5UAsia/Kabul
2055SheberghanSheberghanAfghanistanOASG36.75078365.91316410534.5UAsia/Kabul
2056KonduzKunduzAfghanistanUNDOAUZ36.66511168.91083314574.5UAsia/Kabul
5922Faizabad AirportFaizabadAfghanistanFBDOAFZ37.121170.518138724.5UAsia/Kabul
7036Bagram AFBKabulAfghanistanBPMOAIX34.564669.155448954.5UAsia/Kabul
7499Tarin Kowt AirportTarin KowtAfghanistanTIIOATN32.60527865.86416735004.5UAsia/Kabul
7500Zaranj AirportZaranjAfghanistanZAJOAZJ30.96916761.86694415814.5UAsia/Kabul
7501Chaghcharan AirportChaghcharanAfghanistanCCNOACC34.52666765.27166773834.5UAsia/Kabul
7868Camp Bastion Camp BastionAfghanistanOAZI31.86555664.19527828084.5NAsia/Kabul
8127FOB SalernoKhostAfghanistanKHTOAKS33.333469.95237564.5NAsia/Kabul
8145FOB SharanaSharanAfghanistanN33.1277721568.836984774004.5NAsia/Kabul
8146SharonaSharonaAfghanistanAZ3OASA33.127721568.836984774004.5NAsia/Kabul
8267FOB ShankShankAfghanistanOASH33.92136969.0781366144.5NAsia/Kabul
8773Bost AirportLashkar GahAfghanistanBSTOABT31.55888964.36416724644.5UAsia/Kabul
8825Bamyan AirportBamyanAfghanistanBINOABN34.81666767.81666725504.5NAsia/Kabul
Albania1190Tirana RinasTiranaAlbaniaTIALATI41.41474219.7205611261.0EEurope/Tirane

THE * is a “nested table” containing all rows in the group defined by the GROUP BY clause.

In this example the * is all the airports in a given Country, since the query does GROUP BY Country.

Since * is a table, we can query it as normally:

[50]:
%%rql

SELECT Country, (SELECT Name, City FROM *)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2
[50]:
Country_2
NameCity
AfghanistanHeratHerat
JalalabadJalalabad
Kabul IntlKabul
KandaharKandahar
MaimanaMaimama
Mazar I SharifMazar-i-sharif
ShindandShindand
SheberghanSheberghan
KonduzKunduz
Faizabad AirportFaizabad
Bagram AFBKabul
Tarin Kowt AirportTarin Kowt
Zaranj AirportZaranj
Chaghcharan AirportChaghcharan
Camp Bastion Camp Bastion
FOB SalernoKhost
FOB SharanaSharan
SharonaSharona
FOB ShankShank
Bost AirportLashkar Gah
Bamyan AirportBamyan
AlbaniaTirana RinasTirana

… or even …

[55]:
%%rql

SELECT Country, (SELECT City, COUNT(*) FROM * GROUP BY City)
FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")
GROUP BY Country
LIMIT 2
[55]:
Country_2
City_2
AfghanistanShank1
Camp Bastion1
Jalalabad1
Chaghcharan1
Tarin Kowt1
Sheberghan1
Kunduz1
Kabul2
Shindand1
Sharan1
Khost1
Faizabad1
Zaranj1
Mazar-i-sharif1
Herat1
Sharona1
Kandahar1
Maimama1
Lashkar Gah1
Bamyan1
AlbaniaTirana1

This groups the airports by Country, and then by City.

The COUNT(*) in the inner SELECT refers to the groups created by GROUP BY City.

Top-Level Collections

Let’s look in more detail at the output of the following queries:

[68]:
%%rql

SELECT country AS name FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")
[68]:
name
CH
US

This query returns a list of countries. Each row in the output has the column name.

If we were to visualize the output as JSON, it would be:

[
  {"name": "CH"},
  {"name": "US"}
]

Now the following query:

[70]:
%%rql

SELECT country FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")
[70]:
string
CH
US

… appears similar but note that AS alias is not included.

The output is different: each row is in fact a string. There is not record.

If we were to visualize the output as JSON, it would be:

["CH", "US"]

We can confirm this by asking the output type of the query, with the RAW Jupyter magic %%query_validate.

[78]:
%%query_validate

SELECT country AS name FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")
collection(record(name: string))
[79]:
%%query_validate

SELECT country FROM READ("https://raw-tutorial.s3.amazonaws.com/sales.json")
collection(string)

Note that the first example returns collection(record(name: string)), which is RAW’s type representation for a collection of records, each with a single field name of type string.

The second returns collection(string), which is RAW’s type representation for a collection of strings.

Top-Level Records

The syntax (field1: "One", field2: 1) is used to create a record with two fields: field1, a string with value "One", and field2, a integer with value 1.

Collections and Records can be nested in RAW, so the following is a valid query:

[63]:
%%rql

(
    Countries: (SELECT DISTINCT Country FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")),
    Number_Of_Airports: (SELECT COUNT(*) FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv"))
)
[63]:
CountriesNumber_Of_Airports
Afghanistan8107
Albania
Angola
Antarctica
Australia
Barbados
Belarus
Benin
Bermuda
Bhutan
Bolivia
Brazil
British Indian Ocean Territory
British Virgin Islands
Cambodia
Cameroon
Canada
Chad
Colombia
Comoros
Cook Islands
Cote d'Ivoire
Cyprus
Denmark
Djibouti
Dominica
Dominican Republic
East Timor
Equatorial Guinea
Eritrea
Fiji
Finland
French Polynesia
Georgia
Gibraltar
Greenland
Grenada
Guam
Guernsey
Honduras
Hungary
Iceland
Indonesia
Iran
Isle of Man
Israel
Jamaica
Japan
Jersey
Jordan
Kazakhstan
Kenya
Korea
Kyrgyzstan
Lesotho
Liberia
Luxembourg
Madagascar
Marshall Islands
Micronesia
Moldova
Mongolia
Montenegro
Morocco
Mozambique
Nepal
Netherlands
Netherlands Antilles
New Caledonia
Norfolk Island
Norway
Panama
Papua New Guinea
Paraguay
Peru
Poland
Portugal
Puerto Rico
Reunion
Saint Helena
Saint Kitts and Nevis
Saint Vincent and the Grenadines
Saudi Arabia
Serbia
Singapore
Slovakia
Somalia
South Sudan
Suriname
Sweden
Syria
Taiwan
Tajikistan
Tanzania
Trinidad and Tobago
Turkey
Turks and Caicos Islands
Uganda
United Kingdom
Uruguay
Uzbekistan
Vanuatu
Vietnam
Wallis and Futuna
Western Sahara
Yemen
Zambia
Algeria
American Samoa
Anguilla
Antigua and Barbuda
Argentina
Armenia
Aruba
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Belgium
Belize
Bosnia and Herzegovina
Botswana
Brunei
Bulgaria
Burkina Faso
Burma
Burundi
Cape Verde
Cayman Islands
Central African Republic
Chile
China
Christmas Island
Cocos (Keeling) Islands
Congo (Brazzaville)
Congo (Kinshasa)
Costa Rica
Croatia
Cuba
Czech Republic
Ecuador
Egypt
El Salvador
Estonia
Ethiopia
Falkland Islands
Faroe Islands
France
French Guiana
Gabon
Gambia
Germany
Ghana
Greece
Guadeloupe
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Hong Kong
India
Iraq
Ireland
Italy
Johnston Atoll
Kiribati
Kuwait
Laos
Latvia
Lebanon
Libya
Lithuania
Macau
Macedonia
Malawi
Malaysia
Maldives
Mali
Malta
Martinique
Mauritania
Mauritius
Mayotte
Mexico
Midway Islands
Monaco
Montserrat
Myanmar
Namibia
Nauru
New Zealand
Nicaragua
Niger
Nigeria
Niue
North Korea
Northern Mariana Islands
Oman
Pakistan
Palau
Palestine
Philippines
Qatar
Romania
Russia
Rwanda
Saint Lucia
Saint Pierre and Miquelon
Samoa
Sao Tome and Principe
Senegal
Seychelles
Sierra Leone
Slovenia
Solomon Islands
South Africa
South Georgia and the Islands
South Korea
Spain
Sri Lanka
Sudan
Svalbard
Swaziland
Switzerland
Thailand
Togo
Tonga
Tunisia
Turkmenistan
Tuvalu
Ukraine
United Arab Emirates
United States
Venezuela
Virgin Islands
Wake Island
West Bank
Zimbabwe

If we again use %%query_validate to see the output type:

[80]:
%%query_validate


(
    Countries: (SELECT DISTINCT Country FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv")),
    Number_Of_Airports: (SELECT COUNT(*) FROM READ("https://raw-tutorial.s3.amazonaws.com/airports.csv"))
)
record(Countries: collection(string), Number_Of_Airports: long)

Therefore we confirm the output of this query is a record with two fields: Countries, a collection of strings, and Number_of_Airports, a long.