I/O

RAW can read data from, and write data to, external locations in multiple formats.

Reading input data consists of specifying the location of the data and, where applicable, the format of the data and any properties required by the format (e.g. the delimiter used to separate columns in a CSV file). In many cases, however, RAW can automatically detect the format and structure of the data; in this case, the user need only specify the location of the data.

In order to speed up future queries, RAW caches frequently-used data in a manner that is transparent to the user. Reading input data is tightly integrated with how RAW builds and maintains its data caches. Whenever a user reads data, the user specifies the desired “validity”, i.e. how recent the data should be. For instance, when reading data from S3, the user may state that any data less than 1 hour old is suitable for a given query. If RAW finds a cache in its own internal system that is 1 hour old or less - fulfilling the desired validity - it may avoid reaching out to S3 again to read the data. This often results in very significant performance improvements.

Moreover, RAW’s internal caches may be written in a format better suited to the specifics of the data and how it has been used in the past. This format will typically differ from the original format the data is stored in.

Locations

Locations are specified in a RAW-specific URL format. The locations supported are:

Location         URL Syntax
---------------  -----------------------------------------------------------
S3 [1]           s3://<bucket name>/<path>   (assumes default system region)
                 s3://<region name>@<bucket name>/<path>
HTTP(S) [2]      http://example.org/data/
Dropbox [3]      dropbox:<path>
HDFS             hdfs://<path>
                 hdfs://<host>/<path>
                 hdfs://<host>:<port>/<path>
Hive             hive://<table name>
Local file [4]   file:<path>

[1] Requires AWS keys registered in the “Credentials Service”.

[2] Optionally, for secure websites, HTTP authentication credentials may be registered in the “Credentials Service”.

[3] Requires a Dropbox token registered in the “Credentials Service”.

[4] Only allowed if local file access is enabled in the server (disabled by default).

TEST_ACCESS

Checks whether a location is accessible to RAW, including whether any required credentials are correctly configured and whether data exists at the location.

Syntax:

TEST_ACCESS(<location>)

For example:

TEST_ACCESS("dropbox:///file.json") // true if accessible; false otherwise.

Caches

The user can specify the validity of the data when reading. The validity represents how recent the data should be. This time period is specified as an interval.

For instance, if the user accepts data that is up to one hour old:

SELECT * FROM READ("http://example.org/some.data", cache := interval "1 hour")

If the query above were executed twice in sequence on a fresh RAW installation, RAW would reach out to the server the first time, cache the data (transparently to the user, and likely in a different format from the original one), and serve the second execution of the same query from its local caches.

If the user requires the most recent data, the interval should be set to zero or to a very short period. This forces RAW to ignore any caches (or any caches older than the period) and reach out to the original source for the most recent data on every re-execution of the query:

READ("http://example.org/some.data", cache := interval "0 seconds")

If data of any age is acceptable - for instance because the user knows the data does not change - then:

READ("http://example.org/some.data", cache := null)

If cache is not specified by the user, the default system settings are used.

Under some circumstances, RAW may create more than one cache for the same source data. This can happen due to concurrency, or due to optimization decisions where it is best to have the same data cached in two distinct formats. In those scenarios, and assuming a request whose validity can be respected by more than one cache, there are no guarantees over which cache version will be chosen. In particular, a more recent cache may not be chosen: any cached version that respects the validity can be chosen.

Error Handling

Users can specify a retry strategy in case of failures accessing the data. Retry strategies are particularly useful for long-running queries, where transient failures in remote sources can be handled gracefully. The number of retries must be a number greater than or equal to zero. The retry interval is an interval.

For instance:

READ("http://example.org/unstable_server", retries := 2, retry_interval := interval "30 seconds")

This will attempt to access the URL. In case of connection or other problems (e.g. missing credentials), RAW will retry 2 times (for a total of 3 attempts) with an interval of 30 seconds between attempts. If the location still cannot be reached after 2 retries, the query fails with a runtime error.

The variants of the readers prefixed with TRY_, such as TRY_READ for READ, return null after the retries are exhausted instead of failing with a runtime error. Moreover, in the case of TRY_READ_MANY of file system locations, if individual files in the location cannot be read - due to invalid permissions or other errors - those files are skipped as if they were not part of the directory.
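
For instance, the TRY_ variant of the retry example above returns null instead of failing:

TRY_READ("http://example.org/unstable_server", retries := 2, retry_interval := interval "30 seconds") // null if still unreachable after retries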

If retries or retry_interval are not specified by the user, the default system settings are used.

Detecting Formats

For most formats, RAW also has the ability to automatically detect the structure of the data. The detection of the structure happens during query validation, transparently to the user. However, it imposes a performance penalty, given that the source has to be accessed and, typically, sampled. This performance penalty, however, is small in most common cases and quickly amortized by the internal RAW caches.

For instance, assume a file “data.json” with the following content:

[
   { "name": "John", "age": 39 },
   { "name": "Jane", "age": 37 }
]

The following query will automatically detect the structure of the data and then read the JSON file:

READ_JSON("dropbox://data.json")

The following query will read the JSON file without detecting the structure of the data, since it was specified by the user:

READ_JSON[collection(record(name: string, age: int))]("dropbox://data.json")

In fact, the first query is typically translated internally by RAW to a query similar to the second query, with some additional properties such as the text encoding.

For file-based data, there is also the possibility to detect the format directly, as shown in the following example:

READ("dropbox://data.json")

In this example the format JSON is also detected by the system, along with the type of data in the JSON file.

Reading Directories or Wildcards

Locations that include paths may contain wildcards, expressed through a special syntax in the URL.

The following table describes the wildcard conventions:

Wildcard   Description
--------   -----------------------------------
?          Matches a single character.
*          Matches any sequence of characters.
**         Recurses through subdirectories.

For instance, given the following local directories:

/data/file1.csv
/data/file2.csv
/data/file1.json
/data/jsons/file1.json
/data/jsons/file2.json
/data/jsons/not-json.csv

The location file:///data/*.csv will match /data/file1.csv and /data/file2.csv.

The location file:///data/**/*.json will match /data/file1.json, /data/jsons/file1.json and /data/jsons/file2.json, because it walks through all subdirectories recursively but only matches *.json.

LS

Lists the contents of a file system location. Fails with a runtime error if the location is not accessible or is not a file system.

Syntax:

LS(location)

For example:

LS("s3://bucket1/path1/**/*.json")

This lists all JSON files in all subdirectories of ‘path1’ in ‘bucket1’.

TRY_LS

Similar to LS but returns null if the location is not accessible, instead of failing with a runtime error.

Automatic Inference

The following readers use RAW’s inference to automatically detect the format of the data, as well as its schema and any other properties required to read the file.

If the inference fails, use the format-specific reader instead: e.g. use READ_CSV instead of READ.

READ

Reads a single file, detecting its format, schema, and any other properties required to read the file.

Syntax:

READ(location, sample_size := 10000,
     cache := <default interval>,
     retries := <default int>, retry_interval := <default interval>)

The optional sample_size specifies the number of “objects” to sample within the file. The definition of object depends on the file format. For instance, for CSV and Excel an object is a row. For JSON and XML, an object is an element.

If a large sample_size is used, the detection takes longer to complete but has a higher chance of detecting the correct type. If a small sample_size is used, the detection is faster but may not return precise results. Suppose a column contains integers up to row 10000, but at row 10001 the number is too large for an integer and requires a long. In this case, using sample_size 10000 results in the wrong type being inferred: RAW expects an integer when a long is required to read the entire table. This triggers a runtime error.

Similarly, suppose a column is non-nullable but, at a certain row, a null value is present. If the detection stops before encountering the null value, it will erroneously infer the data as having non-nullable values only (which leads to faster execution times). This again triggers a runtime error.

To force the detection to read as many rows as possible, set sample_size to -1.
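
For instance, to force the detection to sample every row of a file (the path is hypothetical):

READ("file:///data/large_table.csv", sample_size := -1)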

TRY_READ

TRY_READ(location, sample_size := 10000,
         cache := <default interval>,
         retries := <default int>, retry_interval := <default interval>)

TRY_READ behaves similarly to READ but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY

Reads many files - directories or wildcards - detecting the unified format, schema, and any other properties required to read the data.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax:

READ_MANY(location, sample_size := 10000, sample_files := 5,
          cache := <default interval>,
          retries := <default int>, retry_interval := <default interval>)

The optional sample_size behaves similarly to its counterpart in READ.

The optional sample_files gives an indication of how many files to sample. To force the detection to read as many files as possible, set sample_files to -1.
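
For instance, to sample every matching file during detection (the bucket and path are hypothetical):

READ_MANY("s3://bucket1/path1/*.json", sample_files := -1)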

TRY_READ_MANY

Syntax:

TRY_READ_MANY(location, sample_size := 10000, sample_files := 5,
              cache := <default interval>,
              retries := <default int>, retry_interval := <default interval>)

TRY_READ_MANY behaves similarly to READ_MANY but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

DESCRIBE

Describes how RAW inferred data from a location. Fails with a runtime error if the location is not accessible, or is neither a database table nor a file.

Syntax:

DESCRIBE(location)

The output is a record with the following fields:

description (string)
    Description of the data format of the location,
    e.g. "csv with encoding utf-8",
    "oracle table url='oracle:server1/schema/table'".

type (string)
    Complete output type of the location,
    e.g. "collection(record(a: string nullable, b: int))".

properties (collection(record(name: string, value: string)))
    List of key-value pairs of extra properties of the location,
    e.g. [(name: "delimiter", value: ","),
          (name: "has_header", value: "true")].

is_collection (bool)
    True if the top level is a collection, false otherwise.

columns (collection(record(col_name: string nullable, type: string, nullable: bool)))
    Description of the type of the columns of this location,
    e.g. [(col_name: "a", type: "string", nullable: true),
          (col_name: "b", type: "int", nullable: false)].

Note

For data types which are not records or collections of records, the columns field will have only one row, with the col_name field set to `null`.

For example, assume a file “data.csv” with the following content:

"name", "age"
"John", 39
"Jane", 37

the following query:

DESCRIBE("file://data.csv")

will yield the following result:

RECORD(
  description: "csv with encoding utf-8 (confidence: 60%)",
  `type`: "collection(record(`name`: string, `age`: int))",
  is_collection: true,
  columns: [
    (col_name: "name", `type`: "string", nullable: false),
    (col_name: "age", `type`: "int", nullable: false)
  ],
  properties: [
    (name: "delimiter", value: ","),
    (name: "has_header", value: "true")
  ]
)

LS_TABLES

Lists all tables in a schema of a registered RDBMS server. Fails with a runtime error if:

  • the server is not registered

  • schema is null for servers that use schemas, e.g. Oracle, PostgreSQL.

  • schema is not null for servers that do not support schemas, e.g. MySQL.

Syntax:

LS_TABLES(schema_url)

For example:

LS_TABLES("oracle://ora_server/some_schema") // ["oracle:ora_server/some_schema/table1", ...]
LS_TABLES("mysql:mysql_server") // ["mysql:mysql_server/table1", ...]

Note

For database URLs the double slash ‘//’ is optional, so ‘mysql:server1’ is equivalent to ‘mysql://server1’.

LS_SCHEMAS

Lists all schemas of a registered RDBMS server. Fails with a runtime error if the server is not registered or does not support schemas.

Syntax:

LS_SCHEMAS(server_url)

For example:

LS_SCHEMAS("oracle://ora_server") // ["oracle:ora_server/schema1", ...]

JSON

All JSON data types map to RAW types directly with the exception of a JSON array, which becomes a RAW collection.

Multiple variants are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_JSON

Reads JSON data from a location.

Syntax when using inference:

READ_JSON(location, encoding := null,
          sample_size := 10000,
          cache := <default interval>,
          retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_JSON[<type>](location, encoding := "utf-8",
                  cache := <default interval>,
                  retries := <default int>, retry_interval := <default interval>)

TRY_READ_JSON

TRY_READ_JSON behaves similarly to READ_JSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_JSON

Reads many JSON files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_JSON(location, encoding := null,
               sample_size := 10000, sample_files := 5,
               cache := <default interval>,
               retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_MANY_JSON[<type>](location, encoding := "utf-8",
                       sample_size := 10000, sample_files := 5,
                       cache := <default interval>,
                       retries := <default int>, retry_interval := <default interval>)
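
For instance, to read all JSON files under a path using inference (the bucket and path are hypothetical):

READ_MANY_JSON("s3://bucket1/path1/**/*.json")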

TRY_READ_MANY_JSON

TRY_READ_MANY_JSON behaves similarly to READ_MANY_JSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

HJSON

HJSON is a variation of JSON commonly used in Hadoop where each line contains a JSON object. Therefore, HJSON data is always a collection of JSON objects, where each JSON object must be of a compatible type.

Example of HJSON data:

{"id": 1, "description": "First row", "extra": 1.2}
{"id": 2, "description": "Second row", "extra": null}

The type of this data is collection(record(id: int, description: string, extra: double)).

When reading HJSON data without inference, the user is only required to pass the inner type - in the example above record(id: int, description: string, extra: double) - since HJSON data are always collections.
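
For instance, the example data above could be read without inference as follows (the location is hypothetical):

READ_HJSON[record(id: int, description: string, extra: double)]("hdfs://data/rows.hjson")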

Multiple variants are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_HJSON

Reads HJSON data from a location.

Syntax when using inference:

READ_HJSON(location, encoding := null,
           sample_size := 10000,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_HJSON[<type>](location, encoding := "utf-8",
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

TRY_READ_HJSON

TRY_READ_HJSON behaves similarly to READ_HJSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_HJSON

Reads many HJSON files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_HJSON(location, encoding := null,
                sample_size := 10000, sample_files := 5,
                cache := <default interval>,
                retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_MANY_HJSON[<type>](location, encoding := "utf-8",
                        cache := <default interval>,
                        retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

TRY_READ_MANY_HJSON

TRY_READ_MANY_HJSON behaves similarly to READ_MANY_HJSON but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

XML

RAW uses the following conventions to parse XML documents:

<obj/>
    -> null

<obj>
  <a>hello</a>
  <b>1234</b>
</obj>
    -> record(a: "hello", b: 1234)

<obj>
  <a>hello</a>
  <a>world</a>
</obj>
    -> record(a: ["hello", "world"])

<obj name="value" />
    -> record(`@name`: "value")

<obj>
  <a>hello</a>
  Some text here
</obj>
    -> record(a: "hello", `#text`: "Some text here")

<obj name="value" id="123">
  <a>hello</a>
  <a>world</a>
  <b>1234</b>
  <c></c>
  Some text here
</obj>
    -> record(
         `@name`: "value",
         `@id`: 123,
         a: ["hello", "world"],
         b: 1234,
         c: null,
         `#text`: "Some text here")
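
For instance, assuming a hypothetical Dropbox file "obj.xml" containing <obj><a>hello</a><b>1234</b></obj>, reading it yields a record following the conventions above:

READ_XML("dropbox://obj.xml") // record(a: "hello", b: 1234)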

Multiple variants of reading XML are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_XML

Reads XML data from a location.

Syntax when using inference:

READ_XML(location, encoding := null,
         sample_size := 10000,
         cache := <default interval>,
         retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_XML[<type>](location, encoding := "utf-8",
                 cache := <default interval>,
                 retries := <default int>, retry_interval := <default interval>)

TRY_READ_XML

TRY_READ_XML behaves similarly to READ_XML but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_XML

Reads many XML files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_XML(location, encoding := null,
              sample_size := 10000, sample_files := 5,
              cache := <default interval>,
              retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

Syntax without inference:

READ_MANY_XML[<type>](location, encoding := "utf-8",
                      cache := <default interval>,
                      retries := <default int>, retry_interval := <default interval>)

TRY_READ_MANY_XML

TRY_READ_MANY_XML behaves similarly to READ_MANY_XML but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

CSV

Reads rows from CSV data. CSVs are always collections, where each row must be of a compatible type.

CSV columns must be either numbers, strings, or temporals.

Example of CSV data:

first_name,last_name,age
Jeff,Dean,48
Jeff,Bezos,50
John,Carmack,45

The type of this data is collection(record(first_name: string, last_name: string, age: int)).

When reading CSV data without inference, the user is only required to pass the inner type - in the example above record(first_name: string, last_name: string, age: int) - since CSVs are always collections.
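
For instance, the example data above could be read without inference as follows (the path is hypothetical; skip := 1 skips the header line):

READ_CSV[record(first_name: string, last_name: string, age: int)]("file:///data/people.csv", skip := 1)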

Multiple variants are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_CSV

Reads CSV data from a location.

Syntax when using inference:

READ_CSV(location, encoding := null,
         has_header := null,
         delimiters := [","],
         nulls := [],
         nans := [],
         skip := 0,
         escape := "\\",
         quote := "\"",
         sample_size := 10000,
         cache := <default interval>,
         retries := <default int>, retry_interval := <default interval>)
  • The optional argument encoding specifies the encoding of the data; if the encoding is set to null, it will be determined automatically; otherwise, refer to Supported Encodings.

  • The optional argument has_header specifies whether the data has a header or not, e.g. true or false; if the field is not specified, the inference will try to detect the presence of a header.

  • The optional argument skip specifies the number of lines to skip in the data; if the field is not specified it will default to 0 (read and infer from the start of the file).

  • The optional argument delimiters specifies a candidate list of delimiters, e.g. [",", "|"]; if the field is not specified, the inference will try to detect the delimiter.

  • The optional argument nulls specifies a candidate list of strings to interpret as null, e.g. ["NA"]; if the field is not specified, the inference will not detect nulls.

  • The optional argument nans specifies a candidate list of strings to interpret as NaN, e.g. ["nan"]; if the field is not specified, the inference will not detect NaNs.

  • The optional argument quote specifies the quote character in the data, e.g. "\""; if the field is not specified, the inference will try to detect the quote character; if the value is set to null, no quote character will be used.

  • The optional argument escape specifies an escape character to use while parsing the CSV file, e.g. "\\"; the default value is \; if the value is set to null or to an empty string, no escape character will be used.

Syntax without inference:

READ_CSV[<type>](location, encoding := "utf-8",
                 skip := 0,
                 delimiter := ",",
                 nulls := [],
                 nans := [],
                 time_format := "HH:mm[:ss[.SSS]]",
                 date_format := "yyyy-MM-dd",
                 timestamp_format := "yyyy-M-d'T'HH:mm[:ss[.SSS]]",
                 escape := "\\",
                 quote := "\"",
                 cache := <default interval>,
                 retries := <default int>, retry_interval := <default interval>)
  • The <type> is the type of each row of data.

  • The optional argument skip specifies the number of lines to skip in the data.

  • The optional argument delimiter specifies the default delimiter to use.

  • The optional argument time_format specifies the format to use while parsing time fields.

  • The optional argument date_format specifies the format to use while parsing date fields.

  • The optional argument timestamp_format specifies the format to use while parsing timestamp fields.

  • The optional argument escape specifies an escape character to use while parsing the CSV file.

  • The optional argument quote specifies the quote character in the data.

TRY_READ_CSV

TRY_READ_CSV behaves similarly to READ_CSV but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_CSV

Reads many CSV files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_CSV(location, encoding := null,
              has_header := null,
              delimiters := [","],
              nulls := [],
              nans := [],
              sample_size := 10000, sample_files := 5,
              escape := "\\",
              quote := "\"",
              cache := <default interval>,
              retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_MANY_CSV[<type>](location, encoding := "utf-8",
                      skip := 0,
                      delimiter := ",",
                      nulls := [],
                      nans := [],
                      time_format := "HH:mm[:ss[.SSS]]",
                      date_format := "yyyy-MM-dd",
                      timestamp_format := "yyyy-M-d'T'HH:mm[:ss[.SSS]]",
                      escape := "\\",
                      quote := "\"",
                      cache := <default interval>,
                      retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

TRY_READ_MANY_CSV

TRY_READ_MANY_CSV behaves similarly to READ_MANY_CSV but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

TEXT

Reads lines of text data.

Multiple variants are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged.

READ_LINES

Reads lines of text data from a location.

Syntax:

READ_LINES(location, encoding := "utf-8",
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; refer to Supported Encodings.
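
For instance (the path is hypothetical):

READ_LINES("file:///logs/server.log") // collection of strings, one per line of the file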

TRY_READ_LINES

TRY_READ_LINES behaves similarly to READ_LINES but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_LINES

Reads many files of text data - directories or wildcards - from a file system location. All lines are UNIONed into a single large collection.

Syntax:

READ_MANY_LINES(location, encoding := "utf-8",
                cache := <default interval>,
                retries := <default int>, retry_interval := <default interval>)

The optional argument encoding specifies the encoding of the data; refer to Supported Encodings.

TRY_READ_MANY_LINES

TRY_READ_MANY_LINES behaves similarly to READ_MANY_LINES but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Excel

Reads Excel tables. The supported Excel types are strings, booleans, doubles and timestamps.

Multiple variants are supported, depending on:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_EXCEL

Reads an Excel table from a location.

Syntax when using inference:

READ_EXCEL(location, sheet := null,
           has_header := null,
           at := null,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

The optional argument sheet specifies the Excel sheet to use; if set to null, the first sheet will be read. The optional argument has_header specifies whether the data has a header or not, e.g. true or false; if the field is not specified, the inference will try to detect the presence of a header. The optional argument at specifies the beginning and end of the table in Excel coordinates, e.g. "A1:D5"; if the field is not specified, the inference will try to detect the beginning and end of the table.
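
For instance, to read a fixed range from a specific sheet (the file is hypothetical):

READ_EXCEL("dropbox://sales.xlsx", sheet := "Sheet1", at := "A1:D100")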

Syntax without inference:

READ_EXCEL[<type>](location, sheet := "Sheet1",
                   x0 := 0,
                   y0 := 0,
                   y1 := null,
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

Attention

At this point, the Excel syntax without inference is cumbersome and error-prone. It will be changed in an upcoming release. Please use the syntax with inference instead for the time being.

TRY_READ_EXCEL

TRY_READ_EXCEL behaves similarly to READ_EXCEL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_EXCEL

Reads many Excel files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

Syntax when using inference:

READ_MANY_EXCEL(location, sheet := null,
                has_header := null,
                at := null,
                sample_files := 5,
                cache := <default interval>,
                retries := <default int>, retry_interval := <default interval>)

The optional argument sheet specifies the Excel sheet to use; if set to null, the first sheet will be read. The optional argument has_header specifies whether the data has a header or not, e.g. true or false; if the argument is not specified, the inference will try to detect the presence of a header. The optional argument at specifies the beginning and end of the table in Excel coordinates, e.g. "A1:D5"; if the argument is not specified, the inference will try to detect the beginning and end of the table.
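
For example, the following call (with a hypothetical wildcard location) reads and merges all matching Excel files, stating explicitly that the data has a header:

READ_MANY_EXCEL("file:/data/reports/*.xlsx", has_header := true)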

Syntax without inference:

READ_MANY_EXCEL[<type>](location, sheet := "Sheet1",
                        x0 := 0,
                        y0 := 0,
                        y1 := null,
                        cache := <default interval>,
                        retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

Attention

At this point, the Excel syntax without inference is cumbersome and error-prone to use. It will be changed in an upcoming release. Please use the syntax with inference instead for the time being.

TRY_READ_MANY_EXCEL

TRY_READ_MANY_EXCEL behaves similarly to READ_MANY_EXCEL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Parquet-Avro

The supported PARQUET-AVRO types are: strings, booleans, ints, longs, doubles, floats, dates, times, timestamps, records, arrays and nulls.

Note

  • Times and timestamps are supported only with millisecond precision; Parquet files with microsecond precision will fail to read.

  • Union types are supported only between null and one other type, which is mapped to a nullable type in RAW.

  • Enum, map, decimal, fixed and duration types are not supported.

Avro arrays will be converted to RAW collections.

Multiple variants are supported:

  • whether connection or other problems trigger a runtime error or return null;

  • whether a single location is read or, in the case of file systems, directories or wildcards are read and their contents merged;

  • whether inference is used or the type and format properties are specified.

READ_PARQUET_AVRO

Reads PARQUET-AVRO data from a location.

Syntax when using inference:

READ_PARQUET_AVRO(location,
                  cache := <default interval>,
                  retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_PARQUET_AVRO[<type>](location,
                          cache := <default interval>,
                          retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
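
For example, assuming a Parquet file at a hypothetical S3 location, inference detects the row type:

READ_PARQUET_AVRO("s3://my-bucket/events.parquet")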

TRY_READ_PARQUET_AVRO

TRY_READ_PARQUET_AVRO behaves similarly to READ_PARQUET_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

READ_MANY_PARQUET_AVRO

Reads many PARQUET-AVRO files - directories or wildcards - from a file system location.

Each file in a directory or wildcard match must be of compatible type. If each file contains a collection, then all files are UNIONed into a single large collection. Otherwise, all files are concatenated into a single large collection.

READ_MANY_PARQUET_AVRO(location,
                       cache := <default interval>,
                       retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_MANY_PARQUET_AVRO[<type>](location,
                               cache := <default interval>,
                               retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
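
For example, merging all Parquet files matching a hypothetical wildcard:

READ_MANY_PARQUET_AVRO("file:/data/events/*.parquet")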

TRY_READ_MANY_PARQUET_AVRO

TRY_READ_MANY_PARQUET_AVRO behaves similarly to READ_MANY_PARQUET_AVRO but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

HDF5

Reads HDF5 datasets and attributes. The supported HDF5 types for datasets are multi-dimensional arrays of: bytes, shorts, integers, longs, strings, booleans, doubles, floats and compound types.

To read HDF5 metadata and attributes use the function LIST_H5; see the section below for more details.

Note

Currently enum, bitfield, opaque, variable-length and reference types are not supported.

READ_H5

Reads a dataset from an HDF5 file.

Syntax:

READ_H5[<type>](<location>, <dataset name>,
                cache := <default interval>,
                retries := <default int>, retry_interval := <default interval>)

For example:

READ_H5[array(double, x, y)]("file:/file.h5", "group1/dataset")

Only local files and array types are supported in READ_H5.

TRY_READ_H5

TRY_READ_H5 behaves similarly to READ_H5 but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Attention

TRY_READ_H5 is not implemented yet.

READ_MANY_H5

Reads the same dataset from many HDF5 files - directories or wildcards - from a file system location.

Syntax:

READ_MANY_H5[<type>](<location>, <dataset name>,
                     cache := <default interval>,
                     retries := <default int>, retry_interval := <default interval>)

For example:

READ_MANY_H5[array(double, x, y)]("file://folder/*.h5", "group1/dataset")

TRY_READ_MANY_H5

TRY_READ_MANY_H5 behaves similarly to READ_MANY_H5 but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Attention

TRY_READ_MANY_H5 is not implemented yet.

LIST_H5

Reads all metadata from an HDF5 file or wildcard of HDF5 files.

Syntax:

LIST_H5(<location>)

For example:

LIST_H5("file://files/*.h5")

The output type of this function is:

collection(record(
  file: string,                               // URL of the file, e.g. "file://folder/file.h5"
  objects: collection(<object metadata type>) // See the object metadata table for more details
))

object metadata type

Field       Type                                     Description
name        string                                   Path of the object in the HDF5 file, e.g. "group/ds1"
type        string                                   "dataset", "group" or "unknown"
properties  collection(key: string, value: string)   Properties of the current object
attributes  collection(<attribute type>)             List of HDF5 attributes of the current object; see the attribute type table

List of properties of groups:

  • links: names of the children of a group, e.g. ["dataset1", "group2"]

  • nlinks: number of children of a group, e.g. 2

List of properties of datasets:

  • cell type: the HDF5 inner type, e.g. float64

  • raw type: the full RAW type, e.g. array(double, x, y, z)

  • cell size: the element size in bytes

  • rank: the number of dimensions of the dataset

  • dimensions: the size of each dimension, e.g. [100, 200, 300]
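
The output of LIST_H5 can be queried like any other collection. As a sketch (assuming RAW's SELECT syntax over nested collections, with a hypothetical wildcard location), the following lists every dataset in the matched files:

SELECT f.file, o.name
FROM LIST_H5("file://files/*.h5") f, f.objects o
WHERE o.type = "dataset"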

attribute type

Field      Type     Description
name       string   Name of the attribute
innerType  string   Cell type of the attribute, e.g. "double"
rank       int      Number of dimensions of the attribute
fullType   string   Full type of the attribute, e.g. "collection(double)"
value      string   JSON string of the data, e.g. "[1.0, 20.0]"

MySQL

Reads MySQL tables.

The MySQL types supported by RAW are strings, booleans, numbers and temporals.

Credentials to access MySQL must first be stored in the “Credentials Service”.

READ_MYSQL

Reads a MySQL table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_MYSQL(database, table,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_MYSQL[<type>](database, table,
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
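
For example, reading a hypothetical table customers from the database registered as "mysql-server":

READ_MYSQL("mysql-server", "customers")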

TRY_READ_MYSQL

TRY_READ_MYSQL behaves similarly to READ_MYSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

QUERY_MYSQL

Queries the MySQL database directly using MySQL’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_MYSQL(database, query,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

QUERY_MYSQL[<type>](database, query,
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

For example:

QUERY_MYSQL("mysql-server", "select a, b, c from table1")

TRY_QUERY_MYSQL

TRY_QUERY_MYSQL behaves similarly to QUERY_MYSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

PostgreSQL

Reads PostgreSQL tables.

The PostgreSQL types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access PostgreSQL must first be stored in the “Credentials Service”.

READ_PGSQL

Reads a PostgreSQL table from a location.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_PGSQL(database, schema, table,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_PGSQL[<type>](database, schema, table,
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
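
For example, reading a hypothetical table customers from the public schema of the database registered as "pgsql-server":

READ_PGSQL("pgsql-server", "public", "customers")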

TRY_READ_PGSQL

TRY_READ_PGSQL behaves similarly to READ_PGSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

QUERY_PGSQL

Queries the PostgreSQL database directly using PostgreSQL’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_PGSQL(database, query,
           cache := <default interval>,
           retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

QUERY_PGSQL[<type>](database, query,
                   cache := <default interval>,
                   retries := <default int>, retry_interval := <default interval>)

For example:

QUERY_PGSQL("pgsql-server", "select a, b, c from table1")

TRY_QUERY_PGSQL

TRY_QUERY_PGSQL behaves similarly to QUERY_PGSQL but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

SQL Server

Reads Microsoft SQL Server tables.

The Microsoft SQL Server types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access SQL Server must first be stored in the “Credentials Service”.

READ_SQLSERVER

Reads a SQL Server table.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_SQLSERVER(database, schema, table,
               cache := <default interval>,
               retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_SQLSERVER[<type>](database, schema, table,
                       cache := <default interval>,
                       retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
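
For example, reading a hypothetical table customers from the dbo schema of the database registered as "sqlserver-server":

READ_SQLSERVER("sqlserver-server", "dbo", "customers")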

TRY_READ_SQLSERVER

TRY_READ_SQLSERVER behaves similarly to READ_SQLSERVER but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

QUERY_SQLSERVER

Queries the SQL Server database directly using SQL Server’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_SQLSERVER(database, query,
                cache := <default interval>,
                retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

QUERY_SQLSERVER[<type>](database, query,
                        cache := <default interval>,
                        retries := <default int>, retry_interval := <default interval>)

For example:

QUERY_SQLSERVER("sqlserver-server", "select a, b, c from table1")

TRY_QUERY_SQLSERVER

TRY_QUERY_SQLSERVER behaves similarly to QUERY_SQLSERVER but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Oracle

Reads Oracle tables.

The Oracle types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access Oracle must first be stored in the “Credentials Service”.

READ_ORACLE

Reads an Oracle table.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

READ_ORACLE(database, schema, table,
            cache := <default interval>,
            retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_ORACLE[<type>](database, schema, table,
                    cache := <default interval>,
                    retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
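
For example, reading a hypothetical table CUSTOMERS from a hypothetical schema SALES in the database registered as "oracle-server":

READ_ORACLE("oracle-server", "SALES", "CUSTOMERS")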

TRY_READ_ORACLE

TRY_READ_ORACLE behaves similarly to READ_ORACLE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

QUERY_ORACLE

Queries the Oracle database directly using Oracle’s SQL syntax.

The database argument corresponds to the database name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_ORACLE(database, query,
             cache := <default interval>,
             retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

QUERY_ORACLE[<type>](database, query,
                     cache := <default interval>,
                     retries := <default int>, retry_interval := <default interval>)

For example:

QUERY_ORACLE("oracle-server", "select a, b, c from table1")

TRY_QUERY_ORACLE

TRY_QUERY_ORACLE behaves similarly to QUERY_ORACLE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Teradata

Reads Teradata tables.

The Teradata types supported by RAW are strings, booleans, numbers, temporals and binary.

Credentials to access Teradata must first be stored in the “Credentials Service”.

READ_TERADATA

Reads a Teradata table.

The server argument corresponds to the server name registered in the “Credentials Service”.

Syntax when using inference:

READ_TERADATA(server, database, table,
               cache := <default interval>,
               retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_TERADATA[<type>](server, database, table,
                       cache := <default interval>,
                       retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
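
For example, reading a hypothetical table customers from a hypothetical database sales on the server registered as "teradata-server":

READ_TERADATA("teradata-server", "sales", "customers")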

TRY_READ_TERADATA

TRY_READ_TERADATA behaves similarly to READ_TERADATA but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

QUERY_TERADATA

Queries the Teradata database directly using Teradata’s SQL syntax.

The server argument corresponds to the server name registered in the “Credentials Service”.

Syntax when using inference:

QUERY_TERADATA(server, query,
               cache := <default interval>,
               retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

QUERY_TERADATA[<type>](server, query,
                       cache := <default interval>,
                       retries := <default int>, retry_interval := <default interval>)

For example:

QUERY_TERADATA("teradata-server", "select a, b, c from table1")

TRY_QUERY_TERADATA

TRY_QUERY_TERADATA behaves similarly to QUERY_TERADATA but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

SQLite

Reads SQLite tables from a location.

The SQLite types supported by RAW are strings, booleans, numbers and temporals.

READ_SQLITE

Reads a SQLite table from an input stream location.

Syntax when using inference:

READ_SQLITE(location, table,
            cache := <default interval>,
            retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_SQLITE[<type>](location, table,
                    cache := <default interval>,
                    retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.

Attention

At this point, only local file locations (e.g. file://...) are allowed for SQLite databases.
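
For example, reading a hypothetical table users from a local SQLite database file:

READ_SQLITE("file:/data/app.db", "users")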

TRY_READ_SQLITE

TRY_READ_SQLITE behaves similarly to READ_SQLITE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Hive

Reads tables registered in the Hive Metastore. The Hive Metastore types supported by RAW are strings, booleans, numbers and temporals.

READ_HIVE

Reads a Hive table.

Syntax when using inference:

READ_HIVE(database, table,
            cache := <default interval>,
            retries := <default int>, retry_interval := <default interval>)

Syntax without inference:

READ_HIVE[<type>](database, table,
                    cache := <default interval>,
                    retries := <default int>, retry_interval := <default interval>)

The <type> is the type of each row of data.
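
For example, reading a hypothetical table customers from the default Hive database:

READ_HIVE("default", "customers")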

TRY_READ_HIVE

TRY_READ_HIVE behaves similarly to READ_HIVE but in case of connection or other problems returns null instead of failing with a runtime error after retries are exhausted.

Supported Encodings

Textual data can be encoded in one of the following charsets:

"UTF-8", "UTF-16", "UTF-16LE", "UTF-16BE", "ISO-8859-1", "ISO-8859-2", "ISO-8859-9", "Windows-1252".

Detecting encodings is difficult and, in some situations, ambiguous.

If you run into encoding-related issues, the recommendation is to ensure the data is stored in a single encoding; the "safest" choice in practice is "UTF-8".