Skip to main content

Discovering and exploring datasets

Opening scratchpad

Scratchpad is part of RAW's extension to VS Code. Its a piece of software where you can run interactive queries against RAW's server. Its meant to be used as a sandbox where you can test code, probe datasets or just to get accustomed with RQL syntax.

This guide will show some examples of queries that can be used to explore datasets using scratchpad.

To open it first click on the RAW extension icon on the left pane the window.

Then click on the scratchpad button on the top right of the file explorer window.

And that's it!

To execute a query you can click on the play button on the top left corner of the scratchpad. The output will be shown in json in the left panel of VS Code.

Listing S3 or Dropbox folders

To list contents of folders you can use the LS and LL functions.

  • LS list folders or wildcards returning the list of urls.
  • LL is similar to LS but returns the list of urls with size, blocks, and last modification timestamp.

Given the following folder in a S3 bucket.

s3://raw-tutorial/read_many_demo/
metadata.json
person1.json
person2.json
person3.json
read_me.txt
logs/
machine_log1.log
machine_log2.log
machine_log3.log
machine_log4.log

Lets look at the following queries and the corresponding output:

LS
LS("s3://raw-tutorial/read_many_demo/")
Output
[
"s3://raw-tutorial/read_many_demo/metadata.json",
"s3://raw-tutorial/read_many_demo/person1.json",
"s3://raw-tutorial/read_many_demo/person2.json",
"s3://raw-tutorial/read_many_demo/person3.json",
"s3://raw-tutorial/read_many_demo/read_me.txt",
"s3://raw-tutorial/read_many_demo/logs"
]

Using a wildcard to list the folder and sub-folders:

LS with recursive wildcards
ls("s3://raw-tutorial/read_many_demo/**/*") 
Output
[
"s3://raw-tutorial/read_many_demo/metadata.json",
"s3://raw-tutorial/read_many_demo/person1.json",
"s3://raw-tutorial/read_many_demo/person2.json",
"s3://raw-tutorial/read_many_demo/person3.json",
"s3://raw-tutorial/read_many_demo/read_me.txt",
"s3://raw-tutorial/read_many_demo/logs",
"s3://raw-tutorial/read_many_demo/logs/machine_log1.log",
"s3://raw-tutorial/read_many_demo/logs/machine_log2.log",
"s3://raw-tutorial/read_many_demo/logs/machine_log3.log",
"s3://raw-tutorial/read_many_demo/logs/machine_log4.log"
]

For more information about wildcards please got to the Wildcard reference guide.

Listing database servers

Similar to LS we have LS_SCHEMAS and LS_tables to be used with relational database servers.

  • LS_SCHEMAS lists all schemas of a database.
  • LS_TABLES list all tables of a schema.

For example, after registering PostgreSQL database credentials with the name "my-pg-server". You could run the following queries to list all schemas:

LS_SCHEMAS
ls_schemas("pgsql://my-pg-server")
Output
[
"pgsql:my-pg-server/information_schema",
"pgsql:my-pg-server/pg_catalog",
"pgsql:my-pg-server/public",
"pgsql:my-pg-server/rawdemo",
"pgsql:my-pg-server/rdbmstest",
"pgsql:my-pg-server/tpch1",
"pgsql:my-pg-server/tpch_tiny"
]

And the following query to list the tables of one the available schemas:

LS_TABLES
ls_tables("pgsql://my-pg-server/rdbmstest")
Output
[
"pgsql:my-pg-server/rdbmstest/tbl1",
"pgsql:my-pg-server/rdbmstest/tbl2",
"pgsql:my-pg-server/rdbmstest/tbl3",
"pgsql:my-pg-server/rdbmstest/tbl4",
"pgsql:my-pg-server/rdbmstest/test_types"
"pgsql:my-pg-server/rdbmstest/test_types_bigserial1_seq",
"pgsql:my-pg-server/rdbmstest/test_types_serial1_seq",
"pgsql:my-pg-server/rdbmstest/test_types_smallserial1_seq",
"pgsql:my-pg-server/rdbmstest/skippable_types",
"pgsql:my-pg-server/rdbmstest/sql_aggregations_test_data",
"pgsql:my-pg-server/rdbmstest/sql_aggregations_test_schedule",
"pgsql:my-pg-server/rdbmstest/sql_null_semantic_test",
]

DESCRIBE function

DESCRIBE is a builtin function that returns the result of RAW's schema detection for a specified url. This function returns a record with the following fields:

NameTypeDescription
formatstringFormat of the location e.g.: "csv", "json", "relational table", etc.
descriptionstringDescription of the data format of the location e.g.:
“csv with encoding utf-8”,
“oracle table url=’oracle:server1/schema/table’”
typestringRAW type of the location e.g.: “collection( record( a: string nullable, b: int))”
propertiescollection( record(name: string,value: string))List of key-value pairs of extra properties of the location, e.g.:
[( name: “delimiter”, value: “,”)
( name: “has_header”, value: “true”)]
is_collectionboolTrue if the top level is a collection, false otherwise
columnscollection( record( col_name: string nullable, type: string, nullable: bool))Description of the type of the columns of this location e.g.:
[( col_name: “a”, type: “string”, nullable: true),
( col_name: “b”, ype: “int”, nullable: false)]
sampledbooleanFalse if the file was read completely during inference, True if it was sampled.

For example, given te following file:

s3://raw-tutorial/read_many_demo/person1.json
{
"name": "john",
"age": 34,
"interests": ["books", "football", "finance"],
"occupation": "sales",
"salary": 15.5
}

The following query calling describe on this file would return the following output:

Describe
DESCRIBE("s3://raw-tutorial/read_many_demo/person1.json")
Output
{
"format": "json",
"comment": "encoding iso-8859-1 (confidence: 49%)",
"type": "record(`name`: string, `age`: int, `interests`: collection(string), `occupation`: string, `salary`: double)",
"properties": [],
"is_collection": false,
"columns": [
{
"col_name": "name",
"col_type": "string",
"nullable": false
},
{
"col_name": "age",
"col_type": "int",
"nullable": false
},
{
"col_name": "interests",
"col_type": "collection(string)",
"nullable": false
},
{
"col_name": "occupation",
"col_type": "string",
"nullable": false
},
{
"col_name": "salary",
"col_type": "double",
"nullable": false
}
],
"sampled": false
}

Example of describe on a table of a registered PostgreSQL database.

Describe on a PostgreSQL table
describe("pgsql:my-pg-server/rdbmstest/tbl1")
Output
{
"format": "relational table",
"comment": "",
"type": "collection(record(`a`: int, `b`: int nullable, `c`: double, `d`: double nullable, `x`: string, `y`: string nullable))",
"properties": [],
"is_collection": true,
"columns": [
{
"col_name": "a",
"col_type": "int",
"nullable": false
},
{
"col_name": "b",
"col_type": "int",
"nullable": true
},
{
"col_name": "c",
"col_type": "double",
"nullable": false
},
{
"col_name": "d",
"col_type": "double",
"nullable": true
},
{
"col_name": "x",
"col_type": "string",
"nullable": false
},
{
"col_name": "y",
"col_type": "string",
"nullable": true
}
],
"sampled": false
}