Continuous streaming ingestion
What it is: Real-time, continuous data ingestion from streaming sources that automatically updates as new data arrives. When to use: For real-time analytics, event-driven applications, live dashboards, and when you need immediate data freshness.Example: Kafka streaming ingestion
Example: Database CDC (Change Data Capture)
Example: Message queues (MQTT, NATS, Pulsar)
One-time batch ingestion
What it is: Loading data once from external sources like databases, data lakes, or files. When to use: For initial data loads, historical data import, or when you need to load static datasets.Example: Batch load from a database
Thepostgres-cdc
connector can be used to perform a one-time snapshot of a PostgreSQL table. For other databases, such as MySQL, you can use the corresponding CDC connector and set snapshot.mode
to initial_only
.
Example: Load from cloud storage (S3, GCS, Azure)
Example: Load from a data lake (Iceberg)
Periodic ingestion with external orchestration
What it is: RisingWave doesn’t have a built-in scheduler, but you can achieve periodic ingestion using external orchestration tools like Cron or Airflow. When to use: For scheduled data updates, daily/hourly batch processing, or when you need precise control over ingestion timing.Example: Incremental loading pattern
A common pattern is to use a control table to track the last load time and only ingest new data.Other ingestion methods
Direct data insertion
You can always insert data directly into a standard table using theINSERT
statement.
Test data generation
For development and testing, you can use the built-indatagen
connector to generate mock data streams.
Ingestion method support matrix
Data Source | Continuous Streaming | One-Time Batch | Periodic | Notes |
---|---|---|---|---|
Apache Kafka | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
Redpanda | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
Apache Pulsar | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
AWS Kinesis | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
Google Pub/Sub | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
NATS JetStream | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
MQTT | ✅ | ❌ | ⚠️ | Streaming only; periodic via external tools |
PostgreSQL CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
MySQL CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
SQL Server CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
MongoDB CDC | ✅ | ✅ | ⚠️ | CDC for streaming; direct connection for batch |
AWS S3 | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
Google Cloud Storage | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
Azure Blob | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
Apache Iceberg | ❌ | ✅ | ⚠️ | Batch only; periodic via external tools |
Datagen | ✅ | ❌ | ❌ | Test data generation only |
Direct INSERT | ❌ | ✅ | ⚠️ | Manual insertion; periodic via external tools |
- ✅ Natively Supported: Built-in support for this ingestion method.
- ❌ Not Supported: This ingestion method is not available for this source.
- ⚠️ External Tools Required: Requires external orchestration tools (e.g., Cron, Airflow).
Best practices
Choose the right method
- Streaming: Use for real-time requirements and continuous data flows.
- Batch: Use for historical data, large one-time loads, or static datasets.
- Periodic: Use for scheduled updates with external orchestration tools.
Performance considerations
- Streaming ingestion offers the best real-time performance.
- Batch loading is efficient for large datasets.
- Use materialized views to pre-compute and store results for fast querying.
Data consistency
- CDC provides high-fidelity replication of database changes.
- For message queues, understand the delivery guarantees (e.g., at-least-once) of your system.
- Use transactions for atomic operations when inserting data manually.
- Monitor data quality and set up alerts.
Monitoring and operations
- Monitor streaming lag for real-time sources to ensure data freshness.
- Track batch job success and failure rates.
- Set up alerts for data quality issues.
- Use RisingWave’s system tables and dashboards for monitoring.