Skip to main content

I/O

RAW can read and write data directly from external locations and in multiple formats.

Reading input data consists on specifying the location of the data and, where applicable, the format of the data and any properties required by the format (e.g. the delimiter used to separate columns in a CSV file). In many cases, however, RAW can automatically detect the format and structure of the data: in this case, all the user is obliged to specify is the location of the data.

In order to speed up future queries RAW will cache frequently-used data in a manner that is transparent to the user. Reading input data is tightly integrated with how RAW builds and maintains its data caches. Whenever a user is reading data, the user specifies the “validity” desired, i.e. how recent the data should be. For instance, when reading data from S3, the user may state that any data less than 1 hour old is suitable for a given query. If RAW finds a cache in its own internal system that is 1 hour or less old - fulfilling the desired validity - it may avoid reaching out to S3 again to read the data. This often results in very significant performance improvements.

Moveover, RAW’s internal caches may be written in a format that is more optimal to the specificity of the data and how it has been used in the past. This format will typically be different from the original format the data is stored in.

Locations

Locations are specified in a RAW-specific URL format. The locations supported are:

LocationURL Syntax
S3s3://<bucket name>/<path> (assumes default system region)s3://<region name>@<bucket name>/<path>
HTTP(S)http://example.org/data/
Dropboxdropbox://<credential name>/<path>
HDFShdfs://<path>``hdfs://<host>/<path>``hdfs://<host>:<port>/<path>
Hivehive://<table name>
Local filefile:<path>
Relational databasesmysql:<db name>/<table name>``pgsql:<db name>/<schema name>/<table name>``oracle:<db name>/<schema name>/<table name>``sqlserver:<db name>/<schema name>/<table name>``teradata:<db name>/<schema name>/<table name>``sqlite:<path>/<table name>
GitHub public datagithub://<user or organization name>/<repository> (assumes default branch)github://<user or organization name>/<repository>[<branch>]
  • Requires AWS keys registered in the “Credentials Service”.
  • Optionally, for secure websites, HTTP authentication credentials may be registered in the “Credentials Service”.
  • Requires Dropbox token registered in the “Credentials Service”.
  • Only allowed if local file access is enabled in the server (disabled by default).
  • Requires database credentials registered in the “Credentials Service”. These are identified by <db name>.

HTTP Options

Read functions (e.g. READ, READ_CSV, …) accept optional arguments to customize the HTTP request.

List of HTTP arguments:

nametypeexamplecomment
http_methodstringGETMethod for the HTTP request: GET, POST, PUT, DELETE, PATCH or OPTIONS. The default value is “GET”.
http_bodybinaryBinary data to send as the body of the request. Cannot be used with http_body_string.
http_body_stringstring“my body data”String data to send as the body of the request. Cannot be used with http_body_binary.
http_headerscollection( tuple(string, string) )[ (“Accept”, “text/csv”),
(“Authorization”, “Bearer my-token”)]
HTTP headers to pass.
http_argscollection( tuple(string, string) )[ (“name”, “john doe”),
(“search-date”, “2022-01-23”)]
Query parameters arguments of the URL.
http_expected_statuscollection(int)[200, 201, 202]List of valid HTTP status. If not defined defaults to [200, 201, 202, 206]
http_auth_cred_namestring"my-http-credential"If you have registered some HTTP credentials in the credentials server, you can assign the name of the credential to this parameter to use it
http_tokenstring“my-token”Bearer token to be passed as the Authorization header of the request.
http_client_idstring“my-client-id”Client ID to be used for the client credentials OAuth flow. Requires http_client_secret and http_token_url.
http_client_secretstring“my-client-secret”Client Secret to be used for the client credentials OAuth flow. Requires http_client_id and either http_auth_provider or http_token_url.
http_auth_providerstring"auth0"Provider for client ID client secret OAuth flow. Requires http_client_id and http_client_secret.
Cannot be used at the same time as http_token_url
http_token_urlstringhttps://api.twitter.com/oauth2/token”URL to be used for the client credentials OAuth flow. Requires http_client_id and http_client_secret.
Cannot be used at the same time as http_auth_provider
http_renew_token_use_basic_authbooltrueUse basic auth for the client credentials OAuth flow. Requires http_client_id, http_client_secret and http_token_url
http_usernamestring“john-doe”Username to be used for basic authentication. Requires http_password.
http_passwordstring“my-password”Password to be used for basic authentication. Requires http_username.
http_auth_optionscollection( tuple(string, string))[ ("audience", "https://audience"), ("base_url", "https://raw-test2")]Extra options for the auth token exchange flow.

Caches

The user can specify the validity of the data when reading. The validity represents how recent the data should be. This time period is specified as an interval.

For instance, if the user accepts data that is up to one hour old:

SELECT * FROM READ("http://example.org/some.data", cache := interval "1 hour")

If the query above would be executed twice in sequence from a fresh RAW installation, RAW would reach out to the server the first time, cache the data (transparently to the user, likely in a different format from the original one), and serve the second execution of the same query from local caches.

If the user requires the most recent data, the interval should be set to zero or to a short period. This forces RAW to ignore any caches and reach out to the original source for the most recent data in every re-execution of the query:

READ("http://example.org/some.data", cache := interval "0 seconds")

If any validity is acceptable - either because the user knows the data doesn’t change, or any older data will be acceptable - then:

READ("http://example.org/some.data", cache := null)

If cache is not specified by the user, the default system settings are used.

Under some circumstances, RAW may create more than one cache for the same source data. This could happen under conditions involving concurrency, or optimization decisions where its best to have the same data cached under two distinct formats. In those scenarios, and assuming a request whose validity can be respected by more than one cache, there are no guarantees over which cache version will be chosen. In particular, a more recent cache may not be chosen: any cached version that respects the validity can be chosen.

Error Handling

Users can specify the a retry strategy in case of failures accessing the data. Retry strategies are particularly useful for long-running queries, when transient failures in remote sources can be handled gracefully. The number of retries must be a number greater or equal to zero. The retry interval is an interval.

For instance:

READ("http://example.org/unstable_server", retries := 2, retry_interval := interval "30 seconds")

This will attempt to access the URL. In case of connection or other problems (e.g. credentials missing), RAW will retry 2 times (for a total of 3 attempts) with an interval of 30 seconds between attempts. If after 2 retries the location cannot be reached, fails with a runtime error.

The variants of the readers prefixed with TRY_, such as TRY_READ for READ, return null after the retries are exhausted instead of failing with a runtime error. Moreover, in the case of TRY_READ_MAY of a file system locations, if individual files in the location can’t be read - due to invalid permissions or other errors - the individual files are skipped as if they were not part of the directory.

If retries or retry_interval are not specified by the user, the default system settings are used.

Credentials Service

Credentials Functions

Detecting Formats

For most formats, RAW also has the ability to automatically detect the structure of the data. The detection of the structure of the data happens during query validation transparently to the user. However, it imposes a performance penalty given that the source has to be accessed and, typically, sampled. This performance penalty however, is small in most common cases and quickly amortized by the internal RAW caches.

For instance, assume a file “data.json” with the following content:

[
{ "name": "John", "age": 39 },
{ "name": "Jane", "age": 37 }
]

The following query will automatically detect the structure of the data and then read the JSON file:

READ_JSON("dropbox://data.json")

The following query will read the JSON file without detecting the structure of the data, since it was specified by the user:

READ_JSON[collection(record(name: string, age: int))]("dropbox://data.json")

In fact, the first query is typically translated internally by RAW to a query similar to the second query, with some additional properties such as the text encoding.

For file-based data, there is also the possibility to detect the format directly, as shown in the following example:

READ("dropbox://data.json")

In this example the format JSON is also detected by the system, along with the type of data in the JSON file.

Reading Directories or Wildcards

The locations that include paths may include wildcards through use of a special syntax in the URL.

The following table describes the wildcard conventions:

Description
?Matches a single character.
*Matches any sequence of characters.
**Recurses through subdirectories.

For instance, given the following local directories:

/data/file1.csv
/data/file2.csv
/data/file1.json
/data/jsons/file1.json
/data/jsons/file2.json
/data/jsons/not-json.csv

The location file:///data/*.csv will match /data/file1.csv and /data/file2.csv.

The location file:///data//**/*.json will match /data/file1.json, /data/jsons/file1.json and /data/jsons/file2.json because it walks through all subdirectories recursively but only matches *.json.

Operations

Automatic Inference

The following readers use the inference of RAW to automatically detect the format of the data, as well as its schema and any other properties required to read the file.

If the inference fails, use the format-specific inference instead: e.g. use READ_CSV instead of READ.

Supported Encodings

Textual data can be encoded in one of the following charsets:

"UTF-8", "UTF-16", "UTF-16LE", "UTF-16BE", "ISO-8859-1", "ISO-8859-2", "ISO-8859-9", "Windows-1252".

Detecting encodings is difficult and in some situations, ambiguous.

If you run into encoding-related issues, the recommendation is to ensure the data is stored in a single encoding; the “safest” choice in practice is "utf-8".