6. Advanced Data Discovery

In the previous notebooks we began to look at the data types available in RAW.

This notebook looks into these in more detail in the context of data discovery.

[10]:
%load_ext raw_magic

Types in RAW

There are two families of data types in RAW: nested types and primitive types.

The nested types are collection and record. They are called nested types because they take inner types and can nest arbitrarily.

The primitive types are strings, numbers and temporal types.

The full list of types is available in the official documentation at http://docs.raw-labs.com.

For instance: collection(record(a: int, b: collection(string))) is a type that represents a collection of records. Each record has two fields: field a is an int (aka. an integer number) and field b is a collection of strings.

Data discovery

In the previous examples, DESCRIBE was used to describe RAW’s automatic inference outcome.

For instance:

[11]:
%%rql

DESCRIBE("https://raw-tutorial.s3.amazonaws.com/sales.json")
[11]:
formatcommenttypepropertiesis_collectioncolumns
namevaluecol_namecol_typenullable
jsonencoding iso-8859-2 (confidence: 44%)collection(record(`country`: string,`products`: collection(record(`category`: string,`cost`: int))))emptyTruecountrystringFalse
productscollection(record(`category`: string,`cost`: int))False

We can look at the described type in more detail.

[12]:
%%rql

DESCRIBE("https://raw-tutorial.s3.amazonaws.com/sales.json").`type`
[12]:
string
collection(record(`country`: string,`products`: collection(record(`category`: string,`cost`: int))))

When reading a file, we have used READ.

However, there are more specific commands to use. For instance READ_JSON can be used if we know or expect the file to be a JSON file.

[14]:
%%rql

SELECT * FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json")
[14]:
countryproducts
categorycost
CHKeyboard50
Keyboard70
Monitor450
USKeyboard20
Monitor200

Specific commands take format-specific arguments.

For instance, in READ_JSON we can specify the encoding:

[22]:
%%rql

SELECT *
FROM READ_JSON("https://raw-tutorial.s3.amazonaws.com/sales.json", encoding := "utf-8")
[22]:
countryproducts
categorycost
CHKeyboard50
Keyboard70
Monitor450
USKeyboard20
Monitor200

This forces a specific encoding to be used. This may be necessary if RAW’s encoding detection fails to determine the exact encoding of the data.

Moreover, we can also specify the type.

This may be necessary if RAW’s structure detection fails, or if we want to specify a more precise type.

[21]:
%%rql

SELECT *
FROM READ_JSON[collection(record(`country`: string,`products`: collection(record(`category`: string,`cost`: long))))]
("https://raw-tutorial.s3.amazonaws.com/sales.json", encoding := "utf-8")
[21]:
countryproducts
categorycost
CHKeyboard50
Keyboard70
Monitor450
USKeyboard20
Monitor200

In the example above, we changed cost from being an int to a long.

Sampling

RAW “samples” the dataset to determine the schema of data.

In some cases, the default sample size may be too small to determine the exact type of the data.

For instance: * all sampled “rows” of a CSV column could be integer numbers, while other rows may contain decimals; * or all sampled values are defined, while later on the data may contain “null” (i.e. undefined) values.

In this case, a query that runs over the entire data may fail with incompatible type.

When this happens, the exact type can be specfied in the READ command, or the sample size can be adjusted in the READ command, as it is an optional argument. (In general, for production use, it is recommended to specify the type, to ensure “changes” are detected with failures.)

Refer to docs.raw-labs.com for details.