Ingest data from Google Cloud Storage

Syntax

CREATE SOURCE [ IF NOT EXISTS ] source_name
schema_definition
[INCLUDE { file | offset | payload } [AS <column_name>]]
WITH (
   connector = 'gcs',
   connector_parameter = 'value', ...
)
FORMAT data_format ENCODE data_encode (
   without_header = 'true' | 'false',
   delimiter = 'delimiter'
);

schema_definition:

(
   column_name data_type [ PRIMARY KEY ], ...
   [ PRIMARY KEY ( column_name, ... ) ]
)

Connector parameters

Field	Notes
gcs.bucket_name	Required. The name of the bucket the data source is stored in.
gcs.credential	Required unless application default credentials (ADC) are available. Base64-encoded credential key obtained from the GCS service account key JSON file. To get this JSON file, refer to the GCS documentation. To encode it in base64, run `cat ~/Downloads/rwc-byoc-test-464bdd851bce.json | base64 -b 0 | pbcopy` and paste the output as the value for this parameter. If this field is not specified, ADC is used.
gcs.service_account	Optional. The service account of the target GCS source. If gcs.credential or ADC is not specified, the credentials will be derived from the service account.
match_pattern	Conditional. Selects the object keys in the bucket that match the given pattern. Standard Unix-style glob syntax is supported. A typical pattern has the form `prefix/*.suffix`. For example, `your_directory/*.parquet` matches all Parquet files under `your_directory/`. If `match_pattern` does not contain `/`, the scan starts from the bucket root.
compression_format	Optional. This field specifies the compression format of the file being read. You can define `compression_format` in the CREATE TABLE statement. When set to gzip or gz, the file reader reads all files with the `.gz` suffix. When set to None or not defined, the file reader will automatically read and decompress `.gz` and `.gzip` files.
refresh.interval.sec	Optional. The interval, in seconds, between file-listing operations. It determines how long it can take for new files to be discovered. Default: 60.
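
As a rough sketch of how these parameters combine, the statement below reads gzip-compressed NDJSON files under one directory and lists the bucket every 30 seconds. The table name, columns, directory, and credential are placeholders, and passing `refresh.interval.sec` as a quoted string is an assumption based on the `connector_parameter = 'value'` form shown in the syntax.

```sql
-- Hypothetical table; replace the bucket, credential, and pattern with your own.
CREATE TABLE gcs_logs (
    id int,
    message varchar
)
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx',              -- base64-encoded service account key
    match_pattern = 'logs/*.ndjson.gz',    -- glob over object keys under logs/
    compression_format = 'gzip',           -- read files with the .gz suffix
    refresh.interval.sec = '30'            -- list files every 30 seconds (assumed quoting)
) FORMAT PLAIN ENCODE JSON;
```

Leaving out `compression_format` should also work here, since the reader decompresses `.gz` and `.gzip` files automatically when the field is not defined.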

Other parameters

Field	Notes
data_format	Supported data format: PLAIN.
data_encode	Supported data encodes: CSV, JSON, PARQUET.
without_header	Applies to CSV encode only. Indicates whether the files lack a header row: set it to 'true' if the first line is data, or 'false' if the first line is a header. Accepted values: 'true', 'false'. Default: 'true'.
delimiter	How RisingWave splits contents. For JSON encode, the delimiter is `\n`; for CSV encode, the delimiter can be one of `,`, `;`, `E'\t'`.

Additional columns

Field	Notes
file	Optional. The column contains the name of the file that the current record comes from.
offset	Optional. The column contains the byte offset (record offset for Parquet files) at which the current message begins.
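
A minimal sketch of how these columns can be used, assuming a hypothetical table named `events` and the placeholder bucket and credential used elsewhere on this page. The `AS` aliases follow the `INCLUDE ... [AS <column_name>]` form from the syntax section, so the queries do not depend on the default column names.

```sql
-- Hypothetical table that records where each row came from.
CREATE TABLE events (
    id int,
    name varchar
)
INCLUDE file AS source_file
INCLUDE offset AS source_offset
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx'
) FORMAT PLAIN ENCODE JSON;

-- Count ingested records per source file.
SELECT source_file, count(*) AS records
FROM events
GROUP BY source_file;
```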

Loading order of GCS files

The GCS connector does not guarantee that files are read sequentially. For example, suppose RisingWave reads file F1 up to offset O1 and then crashes. After RisingWave rebuilds the task queue, the next task is not guaranteed to resume reading file F1.

Read Parquet files from GCS

Added in v2.3.0.

You can use the table function file_scan() to read Parquet files from GCS, either a single file or a directory of Parquet files.

Function signature

file_scan ('parquet', 'gcs', credential, file_location_or_directory)

When reading a directory of Parquet files, the schema will be based on the first Parquet file listed. Please ensure that all Parquet files in the directory have the same schema.
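
The calls below sketch both usages of the signature above. The `gcs://bucket/path` location format, the trailing slash on the directory, and the object paths are assumptions for illustration; `'xxxxx'` stands in for the credential.

```sql
-- Read a single Parquet file (location format is an assumption).
SELECT *
FROM file_scan(
    'parquet',
    'gcs',
    'xxxxx',
    'gcs://example-bucket/path/to/file.parquet'
);

-- Read every Parquet file in a directory; all files should share one schema.
SELECT count(*)
FROM file_scan(
    'parquet',
    'gcs',
    'xxxxx',
    'gcs://example-bucket/path/to/directory/'
);
```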

Examples

Here are examples of connecting RisingWave to a GCS source to read data from individual streams.

The examples below cover the CSV, JSON, and PARQUET encodings, in that order.

CREATE TABLE t(
    id int,
    name varchar,
    age int,
    primary key(id)
)
INCLUDE file as file_name
INCLUDE offset -- default column name is `_rw_gcs_offset`
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx'
) FORMAT PLAIN ENCODE CSV (
    without_header = 'true',
    delimiter = ',' -- set delimiter = E'\t' for tab-separated files
);

CREATE TABLE t(
    id int,
    name TEXT,
    age int,
    mark int
)
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx',
    match_pattern = '%Ring%*.ndjson'
) FORMAT PLAIN ENCODE JSON;

Use the payload keyword to ingest JSON data when you are unsure of the exact schema beforehand. Instead of defining specific column names and types at the very beginning, you can load all JSON data first and then prune and filter the data during runtime. Check the example below:

CREATE TABLE table_include_payload (v1 int, v2 varchar)
INCLUDE payload
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx'
) FORMAT PLAIN ENCODE JSON;
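
Once the data is loaded, the payload can be pruned and filtered at query time. The sketch below assumes the payload column is exposed as a JSONB value, so that the `->>` operator applies. The table name, the `raw_payload` alias, and the `v3` field are hypothetical.

```sql
-- Hypothetical table that keeps the full JSON record alongside two typed columns.
CREATE TABLE raw_events (v1 int, v2 varchar)
INCLUDE payload AS raw_payload
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx'
) FORMAT PLAIN ENCODE JSON;

-- Extract a field that was not declared in the schema (assumes JSONB payload).
SELECT v1, raw_payload ->> 'v3' AS v3
FROM raw_events
WHERE raw_payload ->> 'v3' IS NOT NULL;
```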

CREATE TABLE t(
    id int,
    name varchar,
    age int
)
WITH (
    connector = 'gcs',
    gcs.bucket_name = 'example-bucket',
    gcs.credential = 'xxxxx',
    match_pattern = '*.parquet'
) FORMAT PLAIN ENCODE PARQUET;
