2. Data Discovery

A normal part of data analysis is discovering data and understanding schemas.

Our goal in RAW is to ease this process and make it an integral part of data analysis, avoiding to the extent possible the need for separate “Extract-Transform-Load” (ETL) processes.

[2]:
%load_ext raw_magic

When given a new dataset, a simple way to understand its structure is to use DESCRIBE.

[3]:
%%rql

DESCRIBE("https://raw-tutorial.s3.amazonaws.com/airports.csv")
[3]:
formatcommenttypepropertiesis_collectioncolumns
namevaluecol_namecol_typenullable
csvencoding utf-8 (confidence: 100%)collection(record(`AirportID`: int,`Name`: string,`City`: string,`Country`: string,`IATA_FAA`: string,`ICAO`: string,`Latitude`: double,`Longitude`: double,`Altitude`: int,`Timezone`: double,`DST`: string,`TZ`: string))has_headertrueTrueAirportIDintFalse
nans[]NamestringFalse
multiLine_fieldsfalseCitystringFalse
skip1CountrystringFalse
nulls[""]IATA_FAAstringFalse
quote"ICAOstringFalse
escape\LatitudedoubleFalse
delimiter,LongitudedoubleFalse
AltitudeintFalse
TimezonedoubleFalse
DSTstringFalse
TZstringFalse

The output of DESCRIBE includes information on the format and structure of the data.

DESCRIBE tells us this data is a collection. It means we can query it with SELECT as shown before. (RAW is capable of query data that are not collections as will be shown later in this tutorial.)

Another very typical way to get an idea of the content of a dataset, is to directly read it using SELECT.

In this case, however, it is always wise to set LIMIT to a low number to avoid reading large amounts of data unnecessarily.

For instance, given a new file located at https://raw-tutorial.s3.amazonaws.com/trips.json a RAW user may well start by doing DESCRIBE then SELECT.

[5]:
%%rql

DESCRIBE("https://raw-tutorial.s3.amazonaws.com/trips.json")
[5]:
formatcommenttypepropertiesis_collectioncolumns
namevaluecol_namecol_typenullable
jsonencoding iso-8859-1 (confidence: 58%)collection(record(`reason`: string,`origin`: string,`destination`: string,`dates`: record(`departure`: string,`arrival`: string)))emptyTruereasonstringFalse
originstringFalse
destinationstringFalse
datesrecord(`departure`: string,`arrival`: string)False
[6]:
%%rql

SELECT * FROM READ("https://raw-tutorial.s3.amazonaws.com/trips.json") LIMIT 2
[6]:
reasonorigindestinationdates
departurearrival
HolidaysGVAKEF2016/02/272016/03/06
WorkGVASFO2016/04/102016/04/17

Note that this JSON file includes a nested structure on field dates. These will be discussed further in the tutorial.

Sometimes, it’s helpful to list data on a storage system accessible by RAW.

In this case, we will search for data on an S3 bucket. Before we can read data from S3, we need to “register the bucket”; this will be discussed later in the tutorial.

[11]:
%buckets_register raw-tutorial
Registered s3://raw-tutorial
[14]:
%%rql

SELECT * FROM LS("s3://raw-tutorial/") LIMIT 5
[14]:
string
s3://raw-tutorial//CDR.txt
s3://raw-tutorial//Lokad_Orders.csv
s3://raw-tutorial//airports.csv
s3://raw-tutorial//files-with-errors.hjson
s3://raw-tutorial//machine_log.log

LS lists the contents of the bucket and returns URLs for each file available. This is useful to discover datasets.

Wildcards can also be used:

[17]:
%%rql

SELECT * FROM LS("s3://raw-tutorial/air*.csv")
[17]:
string
s3://raw-tutorial//airports.csv