
What is streaming ETL?

Streaming ETL (Extract, Transform, Load) is the continuous process of extracting data from source systems, transforming it in real time, and loading the results into downstream destinations — without waiting for batch schedules. Unlike traditional batch ETL that runs on hourly or daily intervals, streaming ETL processes each event as it arrives, keeping downstream systems always up to date.

Streaming ETL vs. batch ETL

|  | Batch ETL | Streaming ETL |
|---|---|---|
| Processing frequency | Scheduled (hourly, daily) | Continuous (event-by-event) |
| Data freshness | Minutes to hours old | Seconds old |
| Latency | High | Low |
| Complexity | Simpler (bounded datasets) | More complex (unbounded streams, ordering, late data) |
| Tools | Spark, Airflow, dbt | RisingWave, Flink, Kafka Streams + Connect |
| Best for | Historical reporting, large backfills | Real-time analytics, monitoring, operational systems |
Streaming ETL does not replace batch ETL entirely — batch processing remains valuable for historical backfills, complex cross-dataset joins, and scenarios where real-time freshness is not required. Many organizations use both, with streaming ETL handling the real-time layer and batch ETL handling historical processing.

A streaming ETL pipeline

A typical streaming ETL pipeline has three stages:

1. Extract

Continuously ingest data from source systems:
  • Database CDC: Capture row-level changes from PostgreSQL, MySQL, SQL Server, or MongoDB.
  • Message brokers: Consume events from Kafka, Pulsar, Kinesis, or MQTT.
  • Object storage: Read files from S3, GCS, or Azure Blob.
  • Webhooks: Receive push-based events from SaaS applications.
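As a sketch of the extract stage, the statement below ingests JSON events from a Kafka topic in RisingWave SQL. The broker address, topic name, and column list are illustrative, not part of any real deployment:

```sql
-- Extract: consume JSON events from a Kafka topic
-- (broker address, topic, and schema are illustrative)
CREATE SOURCE user_events (
  user_id INT,
  event_type VARCHAR,
  event_time TIMESTAMP
) WITH (
  connector = 'kafka',
  topic = 'user-events',
  properties.bootstrap.server = 'broker:9092'
) FORMAT PLAIN ENCODE JSON;
```

A `CREATE SOURCE` only registers the stream; the data is not persisted in RisingWave until a materialized view or table consumes it.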

2. Transform

Apply real-time transformations to the data:
  • Filtering: Remove irrelevant or invalid records.
  • Enrichment: Join streaming data with reference tables or other streams.
  • Aggregation: Compute running totals, averages, counts, or percentiles.
  • Deduplication: Remove duplicate events.
  • Schema mapping: Rename, cast, or restructure fields for the target system.
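For example, deduplication can be expressed as a materialized view using the Top-N pattern (`ROW_NUMBER` with a rank filter). The `raw_events` source and its columns are assumed for illustration:

```sql
-- Transform: keep only the earliest occurrence of each event_id
-- (assumes a `raw_events` source with these columns)
CREATE MATERIALIZED VIEW deduped_events AS
SELECT event_id, user_id, event_type, event_time
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY event_id
      ORDER BY event_time
    ) AS rn
  FROM raw_events
) AS t
WHERE rn = 1;
```

Because the view is incrementally maintained, late-arriving duplicates are filtered out as they stream in, with no batch dedup job required.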

3. Load

Deliver transformed results to downstream systems:
  • Data warehouses: Snowflake, BigQuery, ClickHouse.
  • Data lakes: Apache Iceberg, Delta Lake, S3 (Parquet/JSON).
  • Databases: PostgreSQL, MySQL, Elasticsearch, Redis.
  • Message brokers: Kafka, Pulsar (for further downstream consumption).
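A minimal sketch of the load stage: a `CREATE SINK` that streams a materialized view into PostgreSQL over JDBC. The view name `order_totals`, the connection URL, and the credentials are placeholders:

```sql
-- Load: continuously upsert results into a PostgreSQL table
-- (view name, URL, and credentials are placeholders)
CREATE SINK order_totals_to_pg FROM order_totals
WITH (
  connector = 'jdbc',
  jdbc.url = 'jdbc:postgresql://pg.example.com:5432/analytics?user=etl&password=secret',
  table.name = 'order_totals',
  type = 'upsert',
  primary_key = 'order_id'
);
```

With `type = 'upsert'`, the sink emits inserts, updates, and deletes as the upstream view changes, keeping the target table in sync rather than appending duplicates.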

Streaming ETL with RisingWave

RisingWave enables streaming ETL entirely in SQL — no Java code, no workflow orchestrators, no separate transformation and loading tools.
-- Extract: Ingest CDC changes from PostgreSQL
CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT,
  product_id INT,
  amount DECIMAL,
  status VARCHAR,
  created_at TIMESTAMP
) WITH (
  connector = 'postgres-cdc',
  hostname = 'db.example.com',
  port = '5432',
  username = 'repl_user',
  password = '<your_password>',
  database.name = 'production',
  schema.name = 'public',
  table.name = 'orders'
);

-- Transform: Aggregate daily revenue per product with enrichment
-- (assumes a `products` reference table, ingested the same way as `orders`)
CREATE MATERIALIZED VIEW daily_product_revenue AS
SELECT
  p.product_name,
  p.category,
  DATE_TRUNC('day', o.created_at) AS order_date,
  SUM(o.amount) AS total_revenue,
  COUNT(*) AS order_count
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.status = 'completed'
GROUP BY p.product_name, p.category, DATE_TRUNC('day', o.created_at);

-- Load: Sink results to Iceberg for analytics
CREATE SINK product_revenue_to_iceberg FROM daily_product_revenue
WITH (
  connector = 'iceberg',
  type = 'upsert',
  primary_key = 'product_name, category, order_date',
  catalog.type = 'rest',
  catalog.uri = 'http://iceberg-catalog:8181',
  database.name = 'analytics',
  table.name = 'daily_product_revenue'
);

Why RisingWave for streaming ETL?

  • All-in-one system: Extract, transform, and load in a single SQL-based platform — no Kafka Connect, no Airflow, no Flink.
  • Cascading materialized views: Build multi-layered transformation pipelines where each layer updates incrementally.
  • Native CDC: Connect directly to source databases without Debezium or Kafka.
  • Multiple destinations: Sink to Iceberg, Snowflake, PostgreSQL, ClickHouse, Kafka, S3, and many more — simultaneously.
  • SQL-native: Any SQL-proficient team member can build and maintain streaming ETL pipelines.
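To illustrate cascading materialized views, the sketch below builds an hourly rollup on top of a per-minute view over the `orders` table defined earlier; both layers update incrementally as new orders arrive:

```sql
-- Layer 1: per-minute order counts over the `orders` CDC table above
CREATE MATERIALIZED VIEW orders_per_minute AS
SELECT
  date_trunc('minute', created_at) AS minute,
  COUNT(*) AS order_count
FROM orders
GROUP BY date_trunc('minute', created_at);

-- Layer 2: hourly rollup defined on top of layer 1, not on raw data
CREATE MATERIALIZED VIEW orders_per_hour AS
SELECT
  date_trunc('hour', minute) AS hour,
  SUM(order_count) AS order_count
FROM orders_per_minute
GROUP BY date_trunc('hour', minute);
```

Each downstream layer consumes only the incremental changes of the layer beneath it, so adding a rollup does not reprocess historical data.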

Common streaming ETL use cases

  • Real-time data warehouse ingestion: Keep analytics tables fresh with continuous updates instead of nightly batch loads.
  • CDC-based replication: Replicate and transform database changes across systems in real time.
  • Event stream enrichment: Join clickstream or IoT data with reference tables before loading into analytics systems.
  • Data lake ingestion: Continuously write transformed data into Iceberg or Delta Lake tables with automatic compaction.
  • Cross-system synchronization: Keep multiple downstream systems (search, cache, analytics) in sync with source-of-truth databases.