The pipeline way: use Iceberg as sources/sinks
This approach is ideal when you have existing Iceberg tables created by systems like Spark, Flink, or batch jobs, and you want RisingWave to read from or write to them as part of a larger data ecosystem.
Use cases:
- Existing data lakes with Iceberg tables managed by other systems.
- Multi-system architectures where multiple applications need to read/write the same Iceberg tables.
- Integration into existing data workflows and pipelines.
Key capabilities:
- Read from Iceberg: Ingest data from existing Iceberg tables into RisingWave for stream processing.
- Write to Iceberg: Stream processed results from RisingWave into existing Iceberg tables.
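As a minimal sketch, connecting RisingWave to an existing Iceberg table as a source and as a sink might look like the following. The bucket path, database, and table names are illustrative placeholders, and the exact connector parameters depend on your catalog setup and RisingWave version:

```sql
-- Ingest data from an existing Iceberg table (connection details are illustrative).
CREATE SOURCE iceberg_orders
WITH (
    connector = 'iceberg',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'demo_db',
    table.name = 'orders'
);

-- Stream processed results from RisingWave into an existing Iceberg table.
CREATE SINK orders_summary_sink FROM orders_summary
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3://my-bucket/warehouse',
    database.name = 'demo_db',
    table.name = 'orders_summary'
);
```

In this pattern, RisingWave is one participant among several: the Iceberg tables remain owned and cataloged outside RisingWave, and other engines can continue reading and writing them.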
The database way: create and manage Iceberg tables natively
Choose this approach when you want RisingWave to be the primary owner of your Iceberg tables. RisingWave handles table creation, schema management, and the complete lifecycle while storing data in the standard Iceberg format.
Key benefits:
- Simplified architecture: No external catalog setup required with the hosted catalog option.
- Streaming-first: Direct path from streaming sources to Iceberg format.
- Native management: Tables work like any other RisingWave table for queries and operations.
- Ecosystem compatibility: Standard Iceberg tables readable by Spark, Trino, Flink, etc.
Key capabilities:
- Iceberg table engine: Create tables using ENGINE = iceberg to store data natively in the Iceberg format.
- Hosted Iceberg catalog: Use RisingWave's built-in catalog service to eliminate external catalog setup.
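As a sketch of the database way, a natively managed Iceberg table can be created and used like any other RisingWave table. The schema below is illustrative, not from this page:

```sql
-- Create a RisingWave-managed table stored in the Iceberg format.
CREATE TABLE user_events (
    user_id BIGINT,
    event_type VARCHAR,
    event_time TIMESTAMP,
    PRIMARY KEY (user_id, event_time)
) ENGINE = iceberg;

-- The table then works like any other RisingWave table for queries and DML.
INSERT INTO user_events VALUES (1, 'login', '2024-01-01 00:00:00');
SELECT * FROM user_events;
```

Because the data lands in standard Iceberg format on object storage, external engines such as Spark or Trino can read the same table through its catalog.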
Understanding RisingWave’s Iceberg integration
Storage architecture
It’s important to understand that RisingWave’s own internal storage system (Hummock) also uses object storage (like S3) to persist data, but it uses a row-based format optimized for RisingWave’s internal operations. When working with Iceberg, you are storing or accessing data in the columnar Iceberg format on object storage, which is designed for analytical workloads and ecosystem interoperability.
Advanced features
Both approaches support advanced Iceberg features:
- Time travel: Query historical snapshots of your data.
- Schema evolution: Handle changing table schemas over time.
- Partitioning: Optimize query performance with table partitioning.
- Multiple storage backends: S3, Google Cloud Storage, Azure Blob Storage.
- Various catalog types: Hosted, JDBC, AWS Glue, REST, Storage, Hive, Snowflake.
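For example, time travel against an Iceberg table can be sketched as a point-in-time query. The table name and timestamp are illustrative, and the exact syntax may vary by RisingWave version:

```sql
-- Query a historical snapshot of an Iceberg table (illustrative).
SELECT *
FROM iceberg_orders
FOR SYSTEM_TIME AS OF '2024-01-01 00:00:00';
```

This reads the data as of the Iceberg snapshot that was current at the given timestamp, rather than the latest table state.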