RisingWave supports vector indexes to enable efficient similarity search operations. Vector indexes are specialized data structures that optimize queries involving vector distance calculations.

Creating vector indexes

Use the CREATE INDEX command with vector-specific syntax to create vector indexes. For more details, see CREATE INDEX.
Syntax
CREATE INDEX index_name ON table_name 
USING { FLAT | HNSW } (vector_column | expression) 
[ INCLUDE ( include_column [, ...] ) ]
[ WITH ( option = value [, ...] ) ];

Index types

Before creating a vector index, first create a sample table items to use in the examples below. Currently, vector indexes can only be created on append-only inputs, such as append-only tables or materialized views, so the table must be declared as append-only:
CREATE TABLE items (id INT PRIMARY KEY, name STRING, embedding VECTOR(128)) APPEND ONLY;
RisingWave supports two index types:
  • FLAT index: Provides exact results by comparing the query vector against all stored vectors.
    -- Create a FLAT vector index
    CREATE INDEX idx_embedding ON items 
    USING FLAT (embedding) 
    INCLUDE (name) 
    WITH (distance_type = 'l2');
    
  • HNSW index: Hierarchical Navigable Small World (HNSW) index that provides approximate nearest neighbor search with better performance for large datasets.
    -- Create an HNSW vector index
    CREATE INDEX idx_embedding_hnsw ON items 
    USING HNSW (embedding) 
    INCLUDE (name) 
    WITH (
        distance_type = 'inner_product', 
        m = 32, 
        ef_construction = 40, 
        max_level = 5
    );
    
For HNSW indexes, you can also tune the query-time parameter ef_search by setting the session variable batch_hnsw_ef_search (default: 40).
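For example, to widen the search at query time (a larger ef_search generally improves recall at the cost of latency; the value 100 here is just an illustration):

```sql
-- Increase ef_search for the current session
SET batch_hnsw_ef_search = 100;
```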

Parameters

| Parameter | Description | Valid for |
| --- | --- | --- |
| `distance_type` | Distance metric to use: `l2`, `cosine`, `l1`, or `inner_product` | FLAT, HNSW |
| `m` | Optional. Maximum number of connections per node | HNSW |
| `ef_construction` | Optional. Size of the dynamic candidate list during construction | HNSW |
| `max_level` | Optional. Maximum level of the HNSW graph | HNSW |

Vector distance operators

RisingWave provides specialized operators for calculating vector distances:
| Operator | Function | Description |
| --- | --- | --- |
| `<->` | `l2_distance()` | Euclidean (L2) distance |
| `<=>` | `cosine_distance()` | Cosine distance |
| `<+>` | `l1_distance()` | Manhattan (L1) distance |
| `<#>` | Negative inner product | Negative inner product distance |
Use vector distance operators with ORDER BY and LIMIT to perform similarity search:
-- Find the 5 most similar items using L2 distance
SELECT * FROM items 
ORDER BY embedding <-> '[3,1,2]' 
LIMIT 5;

-- Find similar items using cosine distance
SELECT id, name FROM items 
ORDER BY embedding <=> '[0.5, 0.3, 0.2]' 
LIMIT 10;
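For reference, the four distance metrics above correspond to their standard mathematical definitions. The following Python sketch shows those definitions (it illustrates the math only, not RisingWave's implementation):

```python
import math

def l2_distance(a, b):
    # Euclidean (L2) distance: sqrt(sum((a_i - b_i)^2))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    # Manhattan (L1) distance: sum(|a_i - b_i|)
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # Cosine distance: 1 - (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def neg_inner_product(a, b):
    # Negative inner product: -(a . b); smaller means more similar
    return -sum(x * y for x, y in zip(a, b))
```

Note that for the negative inner product, more similar vectors produce smaller (more negative) values, which is why it sorts correctly under ORDER BY ... LIMIT N.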

Vector indexes on function expressions

You can create vector indexes on function expressions instead of raw columns. This allows you to avoid storing a separate vector column, saving storage and reducing maintenance costs.
  1. Create the table to include the input column
CREATE TABLE items (
    id INT PRIMARY KEY,
    description STRING
    -- embedding column is optional if using function expression
);
Normally, an embedding column would store the embedding generated from the description column. If you instead create the vector index on a function expression over the description column, you don't have to store the raw embedding in the table.
  2. Define the user-defined function (UDF)
CREATE FUNCTION get_embedding(string) RETURNS VECTOR(128) LANGUAGE SQL AS $$
SELECT openai_embedding('{"model": <EMBEDDING_MODEL_NAME>, "api_key": <API_KEY>}'::jsonb, $1)::vector(128);
$$;
  3. Create the vector index on the function expression
CREATE INDEX idx_embedding_func ON items
USING FLAT (get_embedding(description))
INCLUDE(description)
WITH (distance_type = 'l2');
In this example, get_embedding(description) is used as the index expression. This approach avoids materializing a separate vector column in the table, which reduces storage costs and keeps the table schema simpler.

Streaming vector index lookup

Added in v2.7.0.
After creating a vector index, you can use array subqueries to define streaming jobs that look up the vector index, and expand the input with an additional array column containing the Top-N nearest rows. The query syntax is as follows:
Syntax
SELECT 
    <input_table_col1>, 
    <input_table_col2>, 
    ..., 
    array(
        SELECT row(<col1>, <col2>, .., <col_n>)
        FROM <indexed_table> FOR SYSTEM_TIME AS OF PROCTIME()
        ORDER BY <input_embedding_col> <=> <indexed_table_embedding_col>
        LIMIT N
    ) AS top_n_related
FROM <input>;
This query expands the input with an extra column. The data type of the expanded column is:
struct<col1, col2, ..., col_n>[]
You can use this query to create a materialized view, a sink, or a sink into a table, thereby defining a streaming job that performs vector index lookups. When creating such streaming jobs, FOR SYSTEM_TIME AS OF must be set to PROCTIME(); see the example in "Vector index lookup in streaming sinks" below. The same syntax is also supported in ad-hoc queries. In that case, FOR SYSTEM_TIME AS OF is optional, and you may specify a different system time, similar to how time travel queries are defined.
  • Currently, vector index lookups are supported only for append-only inputs.
  • All columns returned by the lookup from the indexed table must be included in the vector index.
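As a sketch, an ad-hoc lookup against the items table from the earlier examples (whose index includes the name column) might look like the following, assuming a hypothetical input table queries(query_id INT, embedding VECTOR(128)):

```sql
-- Ad-hoc Top-3 lookup; FOR SYSTEM_TIME AS OF is optional in ad-hoc queries
SELECT
    q.query_id,
    array(
        SELECT row(i.name)
        FROM items AS i
        ORDER BY q.embedding <-> i.embedding
        LIMIT 3
    ) AS top_3_items
FROM queries AS q;
```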

Examples

-- Create table with vector data
CREATE TABLE products (
    id INT PRIMARY KEY,
    name STRING,
    description STRING,
    embedding vector(128)
) APPEND ONLY;

-- Create vector index
CREATE INDEX idx_embedding ON products
USING HNSW (embedding)
WITH (distance_type = 'cosine');

-- Insert sample data
INSERT INTO products (id, name, description, embedding) VALUES
(1, 'Product A', 'Description for Product A', '[0.1, 0.2, ...]'),
(2, 'Product B', 'Description for Product B', '[0.3, 0.4, ...]');

-- Find similar products
SELECT id, name
FROM products
ORDER BY embedding <=> '[0.2, 0.3, ...]'
LIMIT 5;

Using cosine distance type

The form of the similarity query depends on how the vector index was created:
  • If the vector index is built on a raw embedding column, use the raw column in your ORDER BY clause.
-- Query on the raw embedding column
SELECT * FROM items
ORDER BY embedding <=> '[0.5, 0.5, 0.0]' 
LIMIT 3;
  • If the vector index is built using a function expression, use the same function expression in your ORDER BY clause.
-- Query on a function expression
SELECT * FROM items
ORDER BY get_embedding(description) <=> '[1.0, 2.0, 3.0]'
LIMIT 3;

Vector index lookup in streaming sinks

Suppose you have two append-only tables, items and events:
CREATE TABLE items (
    id INT PRIMARY KEY,
    text STRING,
    embedding VECTOR(3)
) APPEND ONLY;

CREATE TABLE events (
    event_id INT PRIMARY KEY,
    time TIMESTAMP,
    embedding VECTOR(3)
);
Create a vector index on the items table:
CREATE INDEX raw_column_index
ON items USING HNSW (embedding)
INCLUDE (text)
WITH (distance_type = 'l2');
You can then create a sink, either into a table or to an external system, that uses the embedding from the events table to look up the vector index on items:
CREATE SINK related_events AS
SELECT
    event_id,
    array(
        SELECT row(text)
        FROM items FOR SYSTEM_TIME AS OF PROCTIME()
        ORDER BY events.embedding <-> items.embedding
        LIMIT 3
    ) AS related_info,
    time
FROM events
WITH (...);
For each row in events, this query expands the output with a new column. The row's embedding is used to query the vector index, retrieve the Top-N nearest text values, and store the result in an array column. These Top-N nearest texts can then be processed further in downstream scenarios such as real-time RAG.