In this lesson, you will learn how to get started with ScyllaDB Vector Search: how to store, index, and query vectors efficiently so that you can build high-performance vector search applications.
You can use ScyllaDB Vector Search to build a wide range of AI-driven and semantic search applications, including:
- Recommendation systems
- Text, image, and audio similarity search
- Retrieval-Augmented Generation (RAG) pipelines
- Semantic search engines
- Semantic caching layers
- And more…
These use cases rely on efficient vector similarity search, where ScyllaDB provides low-latency, high-throughput performance at scale.
The following sections walk through the key steps for starting your first Vector Search project with ScyllaDB Cloud.
Vector Search cluster
Currently, you create a Vector Search-enabled ScyllaDB Cloud cluster through the ScyllaDB Cloud API, following the setup instructions in the official documentation. You’ll need a ScyllaDB Cloud API token to authenticate your API requests. Once you have the token, you can create a new cluster with Vector Search support enabled.
See the documentation for more details on spinning up a Vector Search cluster.
Keyspace
Currently, Vector Search in ScyllaDB requires tablets to be disabled. You can do this by setting the tablets option to false when creating your keyspace.
CREATE KEYSPACE vector_keyspace
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
AND tablets = {'enabled': false};
Vector column
Vector Search can only work on columns that use the vector data type.
embedding VECTOR<FLOAT, 3>
In this example:
- embedding is the vector column used for vector search
- FLOAT defines the element type
- 3 specifies the vector dimension, which must match the output dimension of the embedding model you use in your project
Let’s see an example of how to create this column in practice:
CREATE TABLE vector_keyspace.items (
    id UUID PRIMARY KEY,
    name TEXT,
    embedding VECTOR<FLOAT, 3>
);
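If you would rather run the keyspace and table statements from application code than from cqlsh, the following is a minimal sketch in Python. It assumes `session` is an already connected session object from the ScyllaDB/Cassandra Python driver, which exposes an `execute()` method. The helper name `create_schema` is hypothetical, the `IF NOT EXISTS` clauses are added here so the script can be rerun, and connection setup is omitted because it depends on your cluster.

```python
# CQL statements based on the examples in this lesson.
# IF NOT EXISTS is an addition so the script is safe to run more than once.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS vector_keyspace
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
AND tablets = {'enabled': false}
"""

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS vector_keyspace.items (
    id UUID PRIMARY KEY,
    name TEXT,
    embedding VECTOR<FLOAT, 3>
)
"""

# The keyspace must exist before the table, so order matters.
SCHEMA_STATEMENTS = [CREATE_KEYSPACE, CREATE_TABLE]


def create_schema(session):
    """Execute each schema statement in order; return how many were run.

    `session` is assumed to behave like a driver session: it only needs
    an execute(statement) method.
    """
    for statement in SCHEMA_STATEMENTS:
        session.execute(statement)
    return len(SCHEMA_STATEMENTS)
```

With a real cluster, `session` would typically come from the driver's `Cluster(...).connect()` call, configured with your own contact points and credentials.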
Vector Search index
To make Vector Search efficient in ScyllaDB, you create a special index on the vector column. This index performs approximate nearest-neighbor (ANN) search, which finds the vectors closest to a query vector without comparing it against every row. Here’s how you create a vector index:
CREATE CUSTOM INDEX IF NOT EXISTS embedding_ann_idx
ON vector_keyspace.items (embedding)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
When you create a vector index, ScyllaDB automatically enables CDC (Change Data Capture) for that table. CDC is required for vector search to function correctly.
Notice how you also define the similarity function. ScyllaDB Vector Search supports three similarity functions:
- 'DOT_PRODUCT'
- 'COSINE_SIMILARITY'
- 'EUCLIDEAN_DISTANCE'
Use the similarity function that matches your use case and your embedding model. For example, in LLM projects, it’s recommended to use the cosine similarity or the dot product function.
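To build intuition for how the three functions differ, here is a small sketch of their textbook definitions in plain Python. These are the standard mathematical formulas, not ScyllaDB's internal implementation, and the sample vectors are made up for illustration.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the normalized vectors; depends only on direction."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points; smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0, 0.0]
b = [2.0, 0.0, 0.0]  # same direction as a, twice as long
c = [0.0, 1.0, 0.0]  # orthogonal to a

print(dot_product(a, b))         # 2.0
print(cosine_similarity(a, b))   # 1.0
print(euclidean_distance(a, b))  # 1.0
print(cosine_similarity(a, c))   # 0.0
```

Note that `a` and `b` are a perfect match under cosine similarity even though they are one unit apart in Euclidean distance. If your embedding model produces unit-length vectors, cosine similarity and dot product give the same ranking.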
Insert vectors into the table
You can insert vector data into ScyllaDB using a standard INSERT statement. The vector column accepts a list of floating-point numbers whose length matches the defined dimension.
INSERT INTO vector_keyspace.items (id, name, embedding)
VALUES (
    uuid(),
    'Example item',
    [0.1, 0.2, 0.3]
);
Make sure that the dimension of the vector you are inserting matches the column type.
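One way to catch a dimension mismatch early is to validate embeddings in application code before sending them to the database. The sketch below is a hypothetical helper, not part of any ScyllaDB driver. It assumes the 3-dimensional column from this lesson and generates the UUID on the client with Python's `uuid.uuid4()` instead of the server-side `uuid()` function.

```python
import uuid

EMBEDDING_DIM = 3  # must match VECTOR<FLOAT, 3> in the table definition


def validate_embedding(vector, dim=EMBEDDING_DIM):
    """Return the vector as a list of floats, or raise ValueError
    if its length does not match the column dimension."""
    values = [float(v) for v in vector]
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    return values


def make_insert_row(name, vector):
    """Build the (id, name, embedding) values for one INSERT."""
    return (uuid.uuid4(), name, validate_embedding(vector))


row = make_insert_row("Example item", [0.1, 0.2, 0.3])
print(row[1], row[2])  # Example item [0.1, 0.2, 0.3]
```

Calling `validate_embedding([0.1, 0.2])` raises a `ValueError`, so the problem surfaces in your application rather than as a rejected write.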
Run Vector Search query
To query your embeddings using Vector Search, use the ANN OF operator in a SELECT statement. This operator performs an ANN search using the vector index you created earlier.
Here’s an example query:
SELECT id, name
FROM vector_keyspace.items
ORDER BY embedding ANN OF [0.1, 0.2, 0.2]
LIMIT 1;
Both the ORDER BY and LIMIT clauses are required in a Vector Search query. The ORDER BY clause ranks rows by their similarity to the query vector, and the LIMIT clause specifies how many of the most similar items to return.
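Conceptually, the query asks for the `LIMIT` rows whose embeddings are most similar to the query vector. The sketch below computes that result exactly with a brute-force scan in plain Python, using the dot product to mirror the index example. It is only an illustration of the ranking: an ANN index returns approximate results without scanning every row, and the two extra items are made-up sample data.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_items(query, items, limit):
    """Exact top-`limit` items by dot product, highest score first."""
    ranked = sorted(
        items,
        key=lambda item: dot_product(query, item["embedding"]),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical table contents; only the first row appears in this lesson.
items = [
    {"name": "Example item", "embedding": [0.1, 0.2, 0.3]},
    {"name": "Other item", "embedding": [-0.3, 0.1, -0.2]},
    {"name": "Third item", "embedding": [0.05, 0.1, 0.1]},
]

result = nearest_items([0.1, 0.2, 0.2], items, limit=1)
print([item["name"] for item in result])  # ['Example item']
```

A full scan like this costs time proportional to the number of rows per query, which is why a vector index is needed at scale.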
The ANN OF operator syntax in ScyllaDB is fully compatible with Apache Cassandra.