I/O¶

RAW can read and write data directly from external locations and in multiple formats.

Reading input data consists of specifying the location of the data and, where applicable, the format of the data and any properties required by the format (e.g. the delimiter used to separate columns in a CSV file). In many cases, however, RAW can automatically detect the format and structure of the data: in that case, the user only needs to specify the location of the data.

To speed up future queries, RAW caches frequently-used data in a manner that is transparent to the user. Reading input data is tightly integrated with how RAW builds and maintains its data caches. Whenever a user reads data, the user specifies the desired “validity”, i.e. how recent the data should be. For instance, when reading data from S3, the user may state that any data less than 1 hour old is suitable for a given query. If RAW finds a cache in its own internal system that is 1 hour old or less - fulfilling the desired validity - it may avoid reaching out to S3 again to read the data. This often results in very significant performance improvements.

Moreover, RAW’s internal caches may be written in a format better suited to the specifics of the data and how it has been used in the past. This format will typically differ from the original format the data is stored in.

Locations¶

Locations are specified in a RAW-specific URL format. The locations supported are:

Location        URL Syntax

S3 [1]          s3://<bucket name>/<path>   (assumes default system region)
                s3://<region name>@<bucket name>/<path>

HTTP(S) [2]     http://example.org/data/

Dropbox [3]     dropbox:<path>

HDFS            hdfs://<path>
                hdfs://<host>/<path>
                hdfs://<host>:<port>/<path>

Hive            hive://<table name>

Local file [4]  file:<path>

[1] Requires AWS keys registered in the “Credentials Service”.

[2] Optionally, for secure websites, HTTP authentication credentials may be registered in the “Credentials Service”.

[3] Requires a Dropbox token registered in the “Credentials Service”.

[4] Only allowed if local file access is enabled in the server (disabled by default).

TEST_ACCESS¶

Checks whether a location is accessible to RAW, including whether any required credentials are correctly configured and whether data exists at the location.

Syntax:

TEST_ACCESS(<location>)


For example:

TEST_ACCESS("dropbox:///file.json") // true if accessible; false otherwise.


Caches¶

The user can specify the validity of the data when reading. The validity represents how recent the data should be. This time period is specified as an interval.

For instance, if the user accepts data that is up to one hour old:

SELECT * FROM READ("http://example.org/some.data", cache := interval "1 hour")


If the query above were executed twice in sequence on a fresh RAW installation, RAW would reach out to the server the first time, cache the data (transparently to the user, and likely in a different format from the original), and serve the second execution of the same query from its local caches.

If the user requires the most recent data, the interval should be set to zero or to a short period. This forces RAW to ignore any caches and reach out to the original source for the most recent data in every re-execution of the query:

READ("http://example.org/some.data", cache := interval "0 seconds")


If data of any age is acceptable - either because the user knows the data does not change, or because older data is acceptable - then:

READ("http://example.org/some.data", cache := null)


If cache is not specified by the user, the default system settings are used.

Under some circumstances, RAW may create more than one cache for the same source data. This can happen under conditions involving concurrency, or due to optimization decisions where it is best to have the same data cached in two distinct formats. In those scenarios, and assuming a request whose validity can be satisfied by more than one cache, there are no guarantees as to which cache version will be chosen. In particular, the most recent cache may not be chosen: any cached version that satisfies the validity can be used.

Error Handling¶

Users can specify a retry strategy in case of failures accessing the data. Retry strategies are particularly useful for long-running queries, where transient failures in remote sources can be handled gracefully. The number of retries must be a number greater than or equal to zero. The retry interval is specified as an interval.

For instance:

READ("http://example.org/unstable_server", retries := 2, retry_interval := interval "30 seconds")


This will attempt to access the URL. In case of connection or other problems (e.g. missing credentials), RAW will retry 2 times (for a total of 3 attempts) with an interval of 30 seconds between attempts. If the location still cannot be reached after 2 retries, the query fails with a runtime error.

The variants of the readers prefixed with TRY_, such as TRY_READ for READ, return null after the retries are exhausted instead of failing with a runtime error. Moreover, in the case of TRY_READ_MANY on file system locations, if individual files in the location cannot be read - due to invalid permissions or other errors - those files are skipped as if they were not part of the directory.
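For instance, a query can tolerate an unreachable endpoint by returning null once retries are exhausted (reusing the hypothetical unstable server from the example above):

```
// Returns null - instead of failing with a runtime error - if the
// server is still unreachable after 2 retries.
TRY_READ("http://example.org/unstable_server", retries := 2, retry_interval := interval "30 seconds")
```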

If retries or retry_interval are not specified by the user, the default system settings are used.

Detecting Formats¶

For most formats, RAW also has the ability to automatically detect the structure of the data. The detection happens during query validation, transparently to the user. However, it imposes a performance penalty, since the source has to be accessed and, typically, sampled. This penalty is small in most common cases and quickly amortized by RAW’s internal caches.

For instance, assume a file “data.json” with the following content:

[
{ "name": "John", "age": 39 },
{ "name": "Jane", "age": 37 }
]


The following query will automatically detect the structure of the data and then read the JSON file:

READ_JSON("dropbox://data.json")


The following query will read the JSON file without detecting the structure of the data, since it was specified by the user:

READ_JSON[collection(record(name: string, age: int))]("dropbox://data.json")


In fact, the first query is typically translated internally by RAW to a query similar to the second query, with some additional properties such as the text encoding.

For file-based data, there is also the possibility to detect the format directly, as shown in the following example:

READ("dropbox://data.json")


In this example the format JSON is also detected by the system, along with the type of data in the JSON file.

The locations that include paths may include wildcards through use of a special syntax in the URL.

The following table describes the wildcard conventions:

Wildcard    Description

?           Matches a single character.

*           Matches any sequence of characters.

**          Recurses through subdirectories.

For instance, given the following local directories:

/data/file1.csv
/data/file2.csv
/data/file1.json
/data/jsons/file1.json
/data/jsons/file2.json
/data/jsons/not-json.csv


The location file:///data/*.csv will match /data/file1.csv and /data/file2.csv.

The location file:///data/**/*.json will match /data/file1.json, /data/jsons/file1.json and /data/jsons/file2.json, because it walks through all subdirectories recursively but only matches *.json.

LS¶

Lists the contents of a file system location. Fails with a runtime error if the location is not accessible or is not a file system.

Syntax

LS(location)


For example:

LS("s3://bucket1/path1/**/*.json")


This lists all JSON files in all subdirectories of ‘path1’ in ‘bucket1’.

Automatic Inference¶

The following readers use RAW’s inference to automatically detect the format of the data, as well as its schema and any other properties required to read the file.

READ reads a single file, detecting its format, schema, and any other properties required to read it.

Syntax:

READ(location, sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional sample_size specifies the number of “objects” to sample within the file. The definition of an object depends on the file format. For instance, for CSV and Excel an object is a row; for JSON and XML, an object is an element.

If a large sample_size is used, the detection takes more time to complete but has a higher chance of detecting the correct type. If a small sample_size is used, the detection is faster but may not return precise results. Suppose a column contains integers up to row number 10000, but at row 10001 the number is too large to be an integer and is instead a long. In this case, using sample_size 10000 results in the wrong type being inferred, since RAW expects an integer when a long is required to read the entire table. This triggers a runtime error.

Similarly, suppose a column is non-nullable but, at a certain row, a null value is present. If the detection stopped before encountering the null value, it will erroneously infer the data as having non-nullable values only (which leads to faster execution times). This again triggers a runtime error.

To force the detection to read as many rows as possible, set sample_size to -1.
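For example, to let inference scan as much of the file as possible, at the cost of slower detection (reusing the Dropbox location from the examples above):

```
// sample_size := -1 forces the detection to read as many rows as possible.
READ("dropbox://data.json", sample_size := -1)
```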

Syntax:

TRY_READ(location, sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ behaves similarly to READ but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY reads many files - directories or wildcards - detecting the unified format, schema, and any other properties required to read the data.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax:

READ_MANY(location, sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional sample_size behaves similarly to its counterpart in READ.

The optional sample_files gives an indication of how many files to sample. To force the detection to read as many files as possible, set sample_files to -1.
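As a sketch, reusing the example directory layout from the wildcards section, the following reads all JSON files under /data/jsons and samples every file during inference:

```
// sample_files := -1 forces the detection to sample as many files as possible.
READ_MANY("file:///data/jsons/*.json", sample_files := -1)
```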

Syntax:

TRY_READ_MANY(location, sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ_MANY behaves similarly to READ_MANY but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

DESCRIBE¶

Describes how RAW inferred data from a location. Fails with a runtime error if the location is not accessible or is not a database table or a file.

Syntax:

DESCRIBE(location)


The output is a record with the following fields:

description (string)
Description of the data format of the location, e.g. "csv with encoding utf-8", "oracle table url='oracle:server1/schema/table'".

type (string)
Complete output type of the location, e.g. "collection(record(a: string nullable, b: int))".

properties (collection(record(name: string, value: string)))
List of key-value pairs of extra properties of the location, e.g. [(name: "delimiter", value: ",")].

is_collection (bool)
True if the top level is a collection, false otherwise.

columns (collection(record(col_name: string nullable, type: string, nullable: bool)))
Description of the type of the columns of this location, e.g. [(col_name: "a", type: "string", nullable: true), (col_name: "b", type: "int", nullable: false)].

Note

For data types which are not records or collections of records, the columns field will have only one row, with the col_name field set to null.

For example, assume a file “data.csv” with the following content:

"name", "age"
"John", 39
"Jane", 37


the following query:

DESCRIBE("file://data.csv")


will yield the following result:

RECORD(
description: "csv with encoding utf-8 (confidence: 60%)",
type: "collection(record(name: string, age: int))",
is_collection: true,
columns: [
(col_name: "name", type: "string", nullable: false),
(col_name: "age", type: "int", nullable: false)
],
properties: [
(name: "delimiter", value: ",")
]
)


LS_TABLES¶

Lists all tables in a schema of a registered RDBMS server. Fails with a runtime error if:

• the server is not registered;

• schema is null for servers that use schemas, e.g. Oracle, PostgreSQL;

• schema is not null for servers that do not support schemas, e.g. MySQL.

Syntax:

LS_TABLES(schema_url)


For example:

LS_TABLES("oracle://ora_server/some_schema") // ["oracle:ora_server/some_schema/table1", ...]
LS_TABLES("mysql:mysql_server") // ["mysql:mysql_server/table1", ...]


Note

For database URLs the double slash ‘//’ is optional, so ‘mysql:server1’ is equivalent to ‘mysql://server1’.

LS_SCHEMAS¶

Lists all schemas from a registered RDBMS server. Fails with a runtime error if the server is not registered or does not support schemas.

Syntax:

LS_SCHEMAS(server_url)


For example:

LS_SCHEMAS("oracle://ora_server") // ["oracle:ora_server/schema1", ...]


JSON¶

All JSON data types map to RAW types directly with the exception of a JSON array, which becomes a RAW collection.

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

READ_JSON reads JSON data from a location.

Syntax when using inference:

READ_JSON(location, encoding := null,
sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_JSON[<type>](location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ_JSON behaves similarly to READ_JSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_JSON reads many JSON files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_JSON(location, encoding := null,
sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_MANY_JSON[<type>](location, encoding := "utf-8",
sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ_MANY_JSON behaves similarly to READ_MANY_JSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

HJSON¶

HJSON is a variation of JSON, commonly used in Hadoop, where each line contains a JSON object. Therefore, HJSON data is always a collection of JSON objects, where each object must be of a compatible type.

Example of HJSON data:

{"id": 1, "description": "First row", "extra": 1.2}
{"id": 2, "description": "Second row", "extra": null}


The type of this data is collection(record(id: int, description: string, extra: double)).

When reading HJSON data without inference, the user is only required to pass the inner type - in the example above, record(id: int, description: string, extra: double) - since HJSON data is always a collection.
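Given the sample above, a read without inference might look like this (the location is hypothetical):

```
// Only the inner (row) type is passed; the outer collection is implied.
READ_HJSON[record(id: int, description: string, extra: double)]("hdfs://rows.hjson")
```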

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

READ_HJSON reads HJSON data from a location.

Syntax when using inference:

READ_HJSON(location, encoding := null,
sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_HJSON[<type>](location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_HJSON behaves similarly to READ_HJSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_HJSON reads many HJSON files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_HJSON(location, encoding := null,
sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_MANY_HJSON[<type>](location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_MANY_HJSON behaves similarly to READ_MANY_HJSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

XML¶

RAW uses the following conventions to parse XML documents:

XML Document -> Corresponding RQL value

<obj/>
-> null

<obj>
<a>hello</a>
<b>1234</b>
</obj>
-> record(a: "hello", b: 1234)

<obj>
<a>hello</a>
<a>world</a>
</obj>
-> record(a: ["hello", "world"])

<obj name="value" />
-> record(@name: "value")

<obj>
<a>hello</a>
Some text here
</obj>
-> record(
a: "hello",
#text: "Some text here")

<obj name="value" id="123">
<a>hello</a>
<a>world</a>
<b>1234</b>
<c></c>
Some text here
</obj>
-> record(
@name: "value",
@id: 123,
a: ["hello", "world"],
b: 1234,
c: null,
#text: "Some text here")
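As a sketch of these conventions, a file containing the document with the <a> and <b> child elements shown above could be read without inference as follows (the file location is hypothetical):

```
// The <a> and <b> elements map to the record fields a and b.
READ_XML[record(a: string, b: int)]("file:///obj.xml")
```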

Multiple variants of reading XML are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

READ_XML reads XML data from a location.

Syntax when using inference:

READ_XML(location, encoding := null,
sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_XML[<type>](location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ_XML behaves similarly to READ_XML but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_XML reads many XML files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_XML(location, encoding := null,
sample_size := 10000, sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

Syntax without inference:

READ_MANY_XML[<type>](location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


TRY_READ_MANY_XML behaves similarly to READ_MANY_XML but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

CSV¶

Reads rows from CSV data. CSVs are always collections, where each row must be of a compatible type.

CSV columns must be either numbers, strings, or temporals.

Example of CSV data:

first_name,last_name,age
Jeff,Dean,48
Jeff,Bezos,50
John,Carmack,45


The type of this data is collection(record(first_name: string, last_name: string, age: int)).

When reading CSV data without inference, the user is only required to pass the inner type - in the example above record(first_name: string, last_name: string, age: int) - since CSVs are always collections.

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

READ_CSV reads CSV data from a location.

Syntax when using inference:

READ_CSV(location, encoding := null,
has_header := null,
delimiters := [","],
nulls := [],
nans := [],
skip := 0,
escape := "\\",
quote := "\"",
sample_size := 10000,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)

• The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to HTTP Requests.

• The optional argument has_header specifies whether the data has a header or not, e.g. true or false; if the field is not specified, the inference will try to detect the presence of a header.

• The optional argument skip specifies the number of lines to skip in the data; if the field is not specified it will default to 0 (read and infer from the start of the file).

• The optional argument delimiters specifies a candidate list of delimiters, e.g. [",", "|"]; if the field is not specified, the inference will try to detect the delimiter.

• The optional argument nulls specifies a candidate list of strings to interpret as null, e.g. ["NA"]; if the field is not specified, the inference will not detect nulls.

• The optional argument nans specifies a candidate list of strings to interpret as NaN, e.g. ["nan"]; if the field is not specified, the inference will not detect NaNs.

• The optional argument quote specifies the quote character in the data, e.g. "\""; if the field is not specified, the inference will try to detect the quote character; if the value is set to null, no quote character will be used.

• The optional argument escape specifies an escape character to use while parsing the CSV file, e.g. "\\"; the default value is \; if the value is set to null or the empty string, no escape character will be used.
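For instance, a hypothetical CSV file whose delimiter is not known in advance, and which uses "NA" for missing values, could be read with:

```
// Inference picks "," or "|" as the delimiter and maps "NA" to null.
READ_CSV("file:///data/file1.csv", delimiters := [",", "|"], nulls := ["NA"])
```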

Syntax without inference:

READ_CSV[<type>](location, encoding := "utf-8",
skip := 0,
delimiter := ",",
nulls := [""],
nans := [],
time_format := "HH:mm[:ss[.SSS]]",
date_format := "yyyy-MM-dd",
timestamp_format := "yyyy-M-d'T'HH:mm[:ss[.SSS]]",
escape := "\\",
quote := "\"",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)

• The <type> is the type of each row of data.

• The optional argument skip specifies the number of lines to skip in the data.

• The optional argument delimiter specifies the default delimiter to use.

• The optional argument time_format specifies the format to use while parsing time fields.

• The optional argument date_format specifies the format to use while parsing date fields.

• The optional argument timestamp_format specifies the format to use while parsing timestamp fields.

• The optional argument escape specifies an escape character to use while parsing the CSV file.

• The optional argument quote specifies the quote character in the data.
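Putting it together for the sample data shown earlier (the file location is hypothetical; skip := 1 skips the header row, since the explicit type already carries the column names):

```
READ_CSV[record(first_name: string, last_name: string, age: int)](
    "file:///people.csv", skip := 1)
```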

TRY_READ_CSV behaves similarly to READ_CSV but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_CSV reads many CSV files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_CSV(location, encoding := null,
delimiters := [","],
nulls := [],
nans := [],
sample_size := 10000, sample_files := 5,
escape := "\\",
quote := "\"",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MANY_CSV[<type>](location, encoding := "utf-8",
skip := 0,
delimiter := ",",
nulls := [],
nans := [],
time_format := "HH:mm[:ss[.SSS]]",
date_format := "yyyy-MM-dd",
timestamp_format := "yyyy-M-d'T'HH:mm[:ss[.SSS]]",
escape := "\\",
quote := "\"",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_MANY_CSV behaves similarly to READ_MANY_CSV but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

TEXT¶

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged.

READ_LINES reads lines of text data from a location.

Syntax:

READ_LINES(location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; refer to HTTP Requests.
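For example (the file location is hypothetical):

```
// Returns a collection of strings, one per line of the file.
READ_LINES("file:///data/log.txt")
```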

TRY_READ_LINES behaves similarly to READ_LINES but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_LINES reads many files of text data - directories or wildcards - from a file system location, returning all lines UNIONed into a single large collection.

Syntax:

READ_MANY_LINES(location, encoding := "utf-8",
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The optional argument encoding specifies the encoding of the data; refer to HTTP Requests.

TRY_READ_MANY_LINES behaves similarly to READ_MANY_LINES but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Excel¶

Reads Excel tables. The supported Excel types are strings, booleans, doubles and timestamps.

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

READ_EXCEL reads an Excel table from a location.

Syntax when using inference:

READ_EXCEL(location, sheet := null,
has_header := null,
at := null,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


• The optional argument sheet specifies the Excel sheet to use; if set to null, the first sheet will be read.

• The optional argument has_header specifies whether the data has a header, e.g. true or false; if the field is not specified, the inference will try to detect the presence of a header.

• The optional argument at specifies the beginning and end of the table in Excel coordinates, e.g. "A1:D5"; if the field is not specified, the inference will try to detect the beginning and end of the table.
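For example, to read a fixed range from a named sheet (the file location is hypothetical; "Sheet1" and "A1:D5" follow the examples in this section):

```
READ_EXCEL("file:///report.xlsx", sheet := "Sheet1", at := "A1:D5")
```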

Syntax without inference:

READ_EXCEL[<type>](location, sheet := "Sheet1",
x0 := 0,
y0 := 0,
y1 := null,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

At this point, the Excel syntax without inference is cumbersome and error-prone. It will be changed in an upcoming release. Please use the syntax with inference instead for the time being.

TRY_READ_EXCEL behaves similarly to READ_EXCEL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_EXCEL reads many Excel files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_EXCEL(location, sheet := null,
has_header := null,
at := null,
sample_files := 5,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


• The optional argument sheet specifies the Excel sheet to use; if set to null, the first sheet will be read.

• The optional argument has_header specifies whether the data has a header, e.g. true or false; if the field is not specified, the inference will try to detect the presence of a header.

• The optional argument at specifies the beginning and end of the table in Excel coordinates, e.g. "A1:D5"; if the field is not specified, the inference will try to detect the beginning and end of the table.

Syntax without inference:

READ_MANY_EXCEL[<type>](location, sheet := "Sheet1",
x0 := 0,
y0 := 0,
y1 := null,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

At this point, the Excel syntax without inference is cumbersome and error-prone. It will be changed in an upcoming release. Please use the syntax with inference instead for the time being.

TRY_READ_MANY_EXCEL behaves similarly to READ_MANY_EXCEL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Parquet¶

READ_PARQUET reads Parquet data from a location.

Syntax when using inference:

READ_PARQUET(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_PARQUET[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_PARQUET behaves similarly to READ_PARQUET but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_PARQUET reads many Parquet files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

This command filters hidden and CRC files according to Hadoop conventions.

Syntax when using inference:

READ_MANY_PARQUET(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MANY_PARQUET[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_MANY_PARQUET behaves similarly to READ_MANY_PARQUET but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Avro¶

Reads Avro data from a location.

Syntax when using inference:

READ_AVRO(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_AVRO[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_AVRO behaves similarly to READ_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Reads many Avro files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

This command filters hidden and CRC files according to Hadoop conventions.

Syntax when using inference:

READ_MANY_AVRO(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MANY_AVRO[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_MANY_AVRO behaves similarly to READ_MANY_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

ORC¶

Reads ORC data from a location.

Syntax when using inference:

READ_ORC(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_ORC[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_ORC behaves similarly to READ_ORC but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Reads many ORC files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

This command filters hidden and CRC files according to Hadoop conventions.

Syntax when using inference:

READ_MANY_ORC(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MANY_ORC[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

Not currently supported.

TRY_READ_MANY_ORC behaves similarly to READ_MANY_ORC but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Parquet-Avro¶

The supported PARQUET-AVRO types are: strings, booleans, ints, longs, doubles, floats, dates, times, timestamps, records, arrays and nulls.

Note

• Times and timestamps are supported only with millisecond precision; Parquet files with microsecond precision will fail to read.

• Union types are supported only between null and another type, and are mapped to a nullable type in RAW.

• Enums, maps, decimal, fixed and duration types are not supported.

Avro arrays will be converted to RAW collections.
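As a sketch of this mapping, an Avro schema with a union of null and int and an array of strings would correspond to a RAW row type such as (field names illustrative):

record(name: string, age: int nullable, tags: collection(string))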

Multiple variants are supported, depending on:

• whether connection or other problems trigger a runtime error or return null;

• whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

• whether inference is used or the type and format properties are specified.

Reads PARQUET-AVRO data from a location.

Syntax when using inference:

READ_PARQUET_AVRO(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_PARQUET_AVRO[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_PARQUET_AVRO behaves similarly to READ_PARQUET_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
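For example, reading from a hypothetical S3 location with an explicit row type (bucket, path and field names are illustrative):

READ_PARQUET_AVRO[record(name: string, age: int)]("s3://my-bucket/users.parquet")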

Reads many PARQUET-AVRO files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_PARQUET_AVRO(location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MANY_PARQUET_AVRO[<type>](location,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_MANY_PARQUET_AVRO behaves similarly to READ_MANY_PARQUET_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
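For example, merging all files matching a wildcard under a hypothetical HDFS directory:

READ_MANY_PARQUET_AVRO("hdfs://host/data/events/*.parquet")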

MySQL¶

The MySQL types supported by RAW are strings, booleans, numbers and temporals.

Credentials to access MySQL must first be stored in the “Credentials Service”.

Reads a MySQL table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_MYSQL(database, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_MYSQL[<type>](database, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_MYSQL behaves similarly to READ_MYSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
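For example, assuming credentials registered under the name "mysql-server" and an illustrative table name:

READ_MYSQL("mysql-server", "customers")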

QUERY_MYSQL¶

Queries the MySQL database directly using MySQL’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_MYSQL(database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

QUERY_MYSQL[<type>](database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


For example:

QUERY_MYSQL("mysql-server", "select a, b, c from table1")


TRY_QUERY_MYSQL¶

TRY_QUERY_MYSQL behaves similarly to QUERY_MYSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

PostgreSQL¶

The PostgreSQL types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access PostgreSQL must first be stored in the “Credentials Service”.

Reads a PostgreSQL table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_PGSQL(database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_PGSQL[<type>](database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_PGSQL behaves similarly to READ_PGSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
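For example, assuming credentials registered under the name "pgsql-server", reading an illustrative table from PostgreSQL’s default public schema:

READ_PGSQL("pgsql-server", "public", "customers")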

QUERY_PGSQL¶

Queries the PostgreSQL database directly using PostgreSQL’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_PGSQL(database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

QUERY_PGSQL[<type>](database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


For example:

QUERY_PGSQL("pgsql-server", "select a, b, c from table1")


TRY_QUERY_PGSQL¶

TRY_QUERY_PGSQL behaves similarly to QUERY_PGSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

SQL Server¶

The Microsoft SQL Server types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access SQL Server must first be stored in the “Credentials Service”.

Reads a SQL Server table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_SQLSERVER(database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_SQLSERVER[<type>](database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_SQLSERVER behaves similarly to READ_SQLSERVER but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
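For example, assuming credentials registered under the name "sqlserver-server", reading an illustrative table from SQL Server’s default dbo schema:

READ_SQLSERVER("sqlserver-server", "dbo", "customers")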

QUERY_SQLSERVER¶

Queries the SQL Server database directly using SQL Server’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_SQLSERVER(database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

QUERY_SQLSERVER[<type>](database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


For example:

QUERY_SQLSERVER("sqlserver-server", "select a, b, c from table1")


TRY_QUERY_SQLSERVER¶

TRY_QUERY_SQLSERVER behaves similarly to QUERY_SQLSERVER but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Oracle¶

The Oracle types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access Oracle must first be stored in the “Credentials Service”.

Reads an Oracle table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_ORACLE(database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_ORACLE[<type>](database, schema, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

TRY_READ_ORACLE behaves similarly to READ_ORACLE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
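For example, assuming credentials registered under the name "oracle-server" (schema and table names are illustrative):

READ_ORACLE("oracle-server", "HR", "EMPLOYEES")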

QUERY_ORACLE¶

Queries the Oracle database directly using Oracle’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_ORACLE(database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

QUERY_ORACLE[<type>](database, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


For example:

QUERY_ORACLE("oracle-server", "select a, b, c from table1")


TRY_QUERY_ORACLE¶

TRY_QUERY_ORACLE behaves similarly to QUERY_ORACLE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Teradata¶

The Teradata types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access Teradata must first be stored in the “Credentials Service”.

Reads a Teradata table from a location.

The server argument corresponds to the server name registered in the “Credentials Service”.

Syntax when using inference:

READ_TERADATA(server, database, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_TERADATA[<type>](server, database, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.
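For example, assuming credentials registered under the name "teradata-server" (database and table names are illustrative):

READ_TERADATA("teradata-server", "sales", "customers")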

QUERY_TERADATA¶

Queries the Teradata database directly using Teradata’s SQL syntax.

The server argument corresponds to the server name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_TERADATA(server, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

QUERY_TERADATA[<type>](server, query,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


For example:

QUERY_TERADATA("teradata-server", "select a, b, c from table1")


TRY_QUERY_TERADATA¶

TRY_QUERY_TERADATA behaves similarly to QUERY_TERADATA but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Sqlite¶

The SQLite types supported by RAW are strings, booleans, numbers and temporals.

Reads a SQLite table from an input stream location.

Syntax when using inference:

READ_SQLITE(location, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


Syntax without inference:

READ_SQLITE[<type>](location, table,
cache := <default interval>,
retries := <default int>, retry_interval := <default interval>)


The <type> is the type of each row of data.

Attention

At this point, only local file locations (e.g. as in file://...) are allowed for SQLite databases.

TRY_READ_SQLITE behaves similarly to READ_SQLITE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.
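For example, reading an illustrative table from a hypothetical local SQLite file:

READ_SQLITE("file:/tmp/mydb.sqlite", "customers")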

HTTP Requests¶

HTTP¶

Makes an HTTP request. The return type is record(status: int, data: string), with the HTTP status and the data decoded as a string.

The expected_status argument specifies the list of valid/expected statuses for this request; the query fails if the returned status is not among them. To make an HTTP call without this check, set the argument to null.

HTTP(url: string,
method: string := "get",
encoding: string := "utf-8",
args: collection(tuple(string, string)) := [],
headers: collection(tuple(string, string)) := [],
body: string nullable := null,
expected_status: collection(int) nullable := [200, 201, 202, 203, 204, 205, 206]
)


For example:

request := HTTP("http://www.server.users",
method := "POST",
args := [("name", "john")],
headers := [("Authorization", "Bearer token"), ("Accept", "application/json")]
);
select * from parse_json[collection(record(name: string, age: int))](request.data)


HTTP_BINARY¶

Like HTTP but the body and return data are binary.

HTTP_BINARY(url: string,
method: string := "get",
args: collection(tuple(string, string)) := [],
headers: collection(tuple(string, string)) := [],
body: binary nullable := null,
expected_status: collection(int) nullable := [200, 201, 202, 203, 204, 205, 206]
)


For example:

request := HTTP_BINARY("http://www.server.users",
method := "POST",
args := [("name", "john")],
headers := [("Authorization", "Bearer token"), ("Accept", "application/json")]
);
decoded_data := decode_string(request.data, "utf-8");
select * from parse_json[collection(record(name: string, age: int))](decoded_data)


Supported Encodings¶

Textual data can be encoded in one of the following charsets:

"UTF-8", "UTF-16", "UTF-16LE", "UTF-16BE", "ISO-8859-1", "ISO-8859-2", "ISO-8859-9", "Windows-1252".

Detecting encodings is difficult and, in some situations, ambiguous.

If you run into encoding-related issues, the recommendation is to ensure the data is stored in a single encoding; the “safest” choice in practice is "utf-8".
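For example, to read a response from a server known to serve Latin-1 text (URL illustrative), the encoding can be set explicitly on the HTTP call:

HTTP("http://www.server.report", encoding := "ISO-8859-1")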