Data Discovery

A normal part of data analysis is discovering data and understanding schemas.

Our goal in RAW is to ease this process and make it an integral part of data analysis, avoiding to the extent possible the need for separate "Extract-Transform-Load" (ETL) processes.

When given a new dataset, a simple way to understand its structure is to use DESCRIBE.

The output of DESCRIBE includes information on the format and structure of the data.

DESCRIBE tells us that this data is a collection, which means we can query it with SELECT as shown before. (RAW can also query data that are not collections, as will be shown later in this tutorial.)
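To make the idea of "format and structure" more concrete, here is a minimal Python sketch that infers a rough type description from a small JSON document. It is not RAW's DESCRIBE and does not reproduce its output; the function name `describe`, the type labels, and the sample data are all assumptions made for illustration only.

```python
import json


def describe(value):
    """Return a rough, human-readable type description of a parsed JSON value."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        # Simplification: describe a collection by its first element only.
        inner = describe(value[0]) if value else "unknown"
        return f"collection({inner})"
    if isinstance(value, dict):
        fields = ", ".join(f"{k}: {describe(v)}" for k, v in value.items())
        return f"record({fields})"
    raise TypeError(f"unsupported value: {value!r}")


# Hypothetical sample data, not taken from the tutorial's datasets.
sample = '[{"name": "a", "price": 1.5}, {"name": "b", "price": 2.0}]'
schema = describe(json.loads(sample))
print(schema)
```

A description that starts with `collection(` is the kind of result that tells us the data can be iterated over record by record, which is what makes it a natural target for SELECT.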

Another common way to get an idea of the content of a dataset is to read it directly using SELECT.

In this case, however, it is always wise to set LIMIT to a low number to avoid reading large amounts of data unnecessarily.
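The reasoning behind a low LIMIT can be sketched in Python: when records are produced lazily, asking for only the first few means the rest of the source is never read. This is only an analogy under stated assumptions; `read_records` and `preview` are hypothetical helpers and say nothing about how RAW executes LIMIT internally.

```python
from itertools import islice


def read_records():
    """Hypothetical lazy source that could yield an unbounded number of records."""
    n = 0
    while True:
        yield {"id": n}
        n += 1


def preview(records, limit=5):
    """Consume at most `limit` records from an iterable and return them as a list."""
    return list(islice(records, limit))


# Only three records are ever produced, even though the source never ends.
first = preview(read_records(), limit=3)
print(first)
```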

For instance, given a new file located at https://raw-tutorial.s3.amazonaws.com/trips.json, a RAW user may well start by running DESCRIBE and then SELECT.

Note that this JSON file includes a nested structure in the field dates. Nested structures will be discussed later in the tutorial.

Sometimes, it's helpful to list data on a storage system accessible by RAW.

In this case, we will search for data on an S3 bucket. Before we can read data from S3, we need to "register the bucket"; this will be discussed later in the tutorial.

LS lists the contents of the bucket and returns a URL for each available file. This is useful for discovering datasets.

Wildcards can also be used to restrict the listing to matching files.
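As a rough illustration of what wildcard matching over a listing looks like, the following Python sketch filters a list of URLs with a shell-style pattern. The bucket name, the file names, and the `match` helper are hypothetical, and the matching rules are those of Python's `fnmatch` module, which may differ from the wildcard semantics RAW applies.

```python
from fnmatch import fnmatchcase

# Hypothetical listing, standing in for the URLs a bucket listing might return.
listing = [
    "s3://example-bucket/data/trips.json",
    "s3://example-bucket/data/airports.csv",
    "s3://example-bucket/data/trips-2019.json",
    "s3://example-bucket/logs/app.log",
]


def match(urls, pattern):
    """Keep only the URLs that match a shell-style wildcard pattern.

    Note: in fnmatch, `*` also matches `/`, so it can span path segments.
    """
    return [u for u in urls if fnmatchcase(u, pattern)]


json_files = match(listing, "s3://example-bucket/data/*.json")
print(json_files)
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.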
