Advanced Data Discovery

In the previous notebooks we began to look at the data types available in RAW.

This notebook looks into these in more detail in the context of data discovery.

Types in RAW

There are two families of data types in RAW: nested types and primitive types.

The nested types are collection and record. They are called nested types because they take inner types and can nest arbitrarily.

The primitive types are strings, numbers and temporal types.

The full list of types is available in the official documentation at docs.raw-labs.com.

For instance: collection(record(a: int, b: collection(string))) is a type that represents a collection of records. Each record has two fields: field a is an int (aka. an integer number) and field b is a collection of strings.

Data discovery

In the previous examples, DESCRIBE was used to describe RAW's automatic inference outcome.

For instance:

We can look at the described type in more detail.

When reading a file, we have used READ.

However, there are more specific commands to use. For instance READ_JSON can be used if we know or expect the file to be a JSON file.

Specific commands take format-specific arguments.

For instance, in READ_JSON we can specify the encoding:

This forces a specific encoding to be used. This may be necessary if RAW's encoding detection fails to determine the exact encoding of the data.

Moreover, we can also specify the type.

This may be necessary if RAW's structure detection fails, or if we want to specify a more precise type.

In the example above, we changed cost from being an int to a long.

Sampling

RAW "samples" the dataset to determine the schema of data.

In some cases, the default sample size may be too small to determine the exact type of the data.

For instance:

In this case, a query that runs over the entire data may fail with incompatible type.

When this happens, the exact type can be specfied in the READ command, or the sample size can be adjusted in the READ command, as it is an optional argument. (In general, for production use, it is recommended to specify the type, to ensure "changes" are detected with failures.)

Refer to docs.raw-labs.com for details.

Next: SQL Compatibility