Overview
Analyzing the messages posted on social media is imperative, as it allows businesses to make more strategic decisions by understanding the values of their consumers and competitors. A streaming system is helpful here because it enables companies to stay constantly up to date on trending topics. To keep track of topics, social media platforms such as Twitter use hashtags to indicate what a post is about. The number of times a hashtag is used is a measure of audience engagement: if a hashtag is used often, it suggests that a particular topic is popular, and if we track how often a hashtag is used over time, we can determine whether audience engagement is increasing or decreasing. In this tutorial, you will learn how to extract valuable insights from text data using RisingWave. We have set up a demo cluster for this tutorial, so you can easily try it out.
Prerequisites
- Ensure you have Docker and Docker Compose installed in your environment. Note that Docker Compose is included in Docker Desktop for Windows and macOS. If you use Docker Desktop, ensure that it is running before launching the demo cluster.
- Ensure that the PostgreSQL interactive terminal, `psql`, is installed in your environment. For detailed instructions, see Download PostgreSQL.
Step 1: Launch the demo cluster
In the demo cluster, we packaged RisingWave and a workload generator. The workload generator starts generating random traffic and feeding it into Kafka as soon as the cluster is started. First, clone the risingwave repository to your environment, then navigate to the `integration_tests/twitter` directory and start the demo cluster from the docker compose file.
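The commands below are a minimal sketch of these steps, assuming the repository lives at the public risingwavelabs/risingwave GitHub location:

```shell
# Clone the RisingWave repository (assumed public GitHub location).
git clone https://github.com/risingwavelabs/risingwave.git

# Enter the Twitter demo directory and start the cluster in the background.
cd risingwave/integration_tests/twitter
docker compose up -d
```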
COMMAND NOT FOUND?
The default command-line syntax in Compose V2 starts with `docker compose`. See details in the Docker docs. If you're using Compose V1, use `docker-compose` instead.
Step 2: Connect RisingWave to data streams
This tutorial will use RisingWave to consume data streams and perform data analysis. Tweets will be used as sample data so we can query the most popular hashtags on a given day to keep track of trending topics. Below are the schemas for tweets and Twitter users. In the `tweet` schema, `text` contains the content of a tweet, and `created_at` contains the date and time when a tweet was posted. Hashtags will be extracted from `text`.
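First, connect to RisingWave with `psql` (the demo cluster exposes RisingWave on its default port, database, and user, e.g. `psql -h localhost -p 4566 -d dev -U root`). Then create a source over the Kafka topic so RisingWave can consume the tweet stream. The statement below is a sketch: the topic name, broker address, and exact field list are assumptions based on the demo setup and may need adjusting for your environment.

```sql
-- Sketch of a Kafka source for the tweet stream. The topic name,
-- broker address, and field list are assumptions from the demo setup.
CREATE SOURCE twitter (
    data STRUCT <
        created_at TIMESTAMP WITH TIME ZONE,
        id VARCHAR,
        text VARCHAR,
        lang VARCHAR
    >,
    author STRUCT <
        created_at TIMESTAMP WITH TIME ZONE,
        id VARCHAR,
        name VARCHAR,
        username VARCHAR
    >
) WITH (
    connector = 'kafka',
    topic = 'twitter',
    properties.bootstrap.server = 'message_queue:29092',
    scan.startup.mode = 'earliest'
) FORMAT PLAIN ENCODE JSON;
```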
Step 3: Define a materialized view and analyze data
This tutorial will create a materialized view that tracks how often each hashtag is used daily. To do so, start by extracting all the hashtags used within a tweet by using the `regexp_matches` function. For instance, if given the following tweet:

Struggling with the high cost of scaling? Fret not! #RisingWave cares about performance and cost-efficiency. We use a tiered architecture that fully utilizes the #cloud resources to give the users fine-grained control over cost and performance.

The `regexp_matches` function will find all the text in the tweet that matches the RegEx pattern `#\w+`. This extracts all the hashtags from the tweet and stores them in an array.
Next, the `unnest` function will separate each item in the array into separate rows.
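As a quick, hypothetical illustration of these two functions working together, a stand-alone query over a string literal (rather than the source) might look like this:

```sql
-- Hypothetical stand-alone example: extract hashtags from a literal string.
-- regexp_matches collects every match of '#\w+'; unnest turns the matches
-- into one row per hashtag.
SELECT
    unnest(regexp_matches(
        '#RisingWave cares about performance and fully utilizes the #cloud',
        '#\w+',
        'g'
    )) AS hashtag;
-- Expected result: two rows, '#RisingWave' and '#cloud'.
```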
Assign each tweet to a one-day window with the `tumble` time window function, which produces a `window_start` column, and lastly group the data by `hashtag` and `window_start` to count how many times each hashtag was used daily.
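Putting the pieces together, a materialized view along the following lines implements this logic. This is a sketch based on the demo: the view and column names (`hot_hashtags`, `hashtag_occurrences`) are assumptions you can rename freely.

```sql
-- Sketch: extract hashtags from each tweet, bucket tweets into one-day
-- tumbling windows, and count occurrences of each hashtag per window.
CREATE MATERIALIZED VIEW hot_hashtags AS
WITH tags AS (
    SELECT
        unnest(regexp_matches((data).text, '#\w+', 'g')) AS hashtag,
        (data).created_at AS created_at
    FROM twitter
)
SELECT
    hashtag,
    COUNT(*) AS hashtag_occurrences,
    window_start
FROM TUMBLE(tags, created_at, INTERVAL '1 day')
GROUP BY
    hashtag,
    window_start;
```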
Step 4: Query the results
We can query the ten most frequently used hashtags.
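Assuming the `hot_hashtags` view sketched above, the query might look like this:

```sql
-- Ten most frequently used hashtags across all daily windows
-- (names follow the hot_hashtags sketch above).
SELECT hashtag, hashtag_occurrences, window_start
FROM hot_hashtags
ORDER BY hashtag_occurrences DESC
LIMIT 10;
```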
Summary
In this tutorial, we learned:
- How to define a nested table with RisingWave.
- How to extract character combinations from a string using regular expressions.