RAG

OpenMotoko includes a built-in RAG pipeline for ingesting documents and retrieving relevant context at query time.

await ragPipeline.ingest(text, {
  source: 'project-readme',
  metadata: { repo: 'openmotoko' },
})

The ingestion process:

  1. Chunk the input text into paragraphs
  2. Embed each chunk using local hash-based embeddings (384 dimensions)
  3. Store chunks in the rag_documents table with content, source, chunk index, metadata, and embedding
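The docs do not say which hashing scheme produces the embeddings in step 2. As a rough illustration only, the sketch below uses the generic feature-hashing trick: each token is hashed with FNV-1a into one of 384 buckets, the top hash bit picks a sign, and the vector is L2-normalized. The names `fnv1a` and `hashEmbed` and the tokenizer are assumptions, not OpenMotoko's API.

```typescript
// Hypothetical sketch of a hash-based embedding; not OpenMotoko's actual code.
const DIMENSIONS = 384; // dimension count stated in the docs

// FNV-1a 32-bit hash (a well-known non-cryptographic hash).
function fnv1a(token: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < token.length; i++) {
    hash ^= token.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

function hashEmbed(text: string, dims: number = DIMENSIONS): Float32Array {
  const vector = new Float32Array(dims);
  // Assumed tokenizer: lowercase alphanumeric runs.
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const token of tokens) {
    const h = fnv1a(token);
    const bucket = h % dims;                // which dimension the token lands in
    const sign = h >>> 31 === 1 ? -1 : 1;   // top bit decides the sign
    vector[bucket] += sign;
  }
  // L2-normalize so that dot products between embeddings are comparable.
  let sumSquares = 0;
  for (let i = 0; i < dims; i++) sumSquares += vector[i] * vector[i];
  const norm = Math.sqrt(sumSquares);
  if (norm > 0) {
    for (let i = 0; i < dims; i++) vector[i] /= norm;
  }
  return vector;
}
```

An embedding like this is deterministic and needs no model download, which fits the "local" description, but it only captures token overlap, not meaning.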

The chunker splits text by paragraphs with configurable parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| Chunk size | 512 tokens | Target tokens per chunk |
| Overlap | 64 words | Overlap between adjacent chunks |

Chunks preserve paragraph boundaries where possible for better semantic coherence.
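To make the chunking behavior concrete, here is a minimal sketch of a paragraph-preserving chunker with overlap. It is an illustration under stated assumptions: `chunkByParagraph`, `ChunkOptions`, and `countWords` are hypothetical names, tokens are approximated by whitespace-separated words, and a paragraph larger than the target size is kept whole rather than split.

```typescript
// Hypothetical sketch; names and token counting are assumptions.
interface ChunkOptions {
  chunkSize?: number; // target size per chunk (tokens approximated as words)
  overlap?: number;   // words carried over from the previous chunk
}

function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

function chunkByParagraph(text: string, options: ChunkOptions = {}): string[] {
  const chunkSize = options.chunkSize ?? 512;
  const overlap = options.overlap ?? 64;

  // Paragraphs are separated by one or more blank lines.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentSize = 0;

  for (const paragraph of paragraphs) {
    const size = countWords(paragraph);
    // Close the current chunk when adding this paragraph would exceed the target.
    if (current.length > 0 && currentSize + size > chunkSize) {
      const chunk = current.join('\n\n');
      chunks.push(chunk);
      // Seed the next chunk with the trailing words of the one just closed.
      const tail = overlap > 0 ? chunk.split(/\s+/).slice(-overlap).join(' ') : '';
      current = tail ? [tail] : [];
      currentSize = countWords(tail);
    }
    current.push(paragraph);
    currentSize += size;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```

Because chunks are only closed at paragraph boundaries, a chunk can run over the target size; a stricter chunker would also split long paragraphs at sentence boundaries.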

const results = await ragPipeline.search('How do I deploy?', {
  limit: 10,
  minScore: 0.05,
  sources: ['docs'],
  hybridAlpha: 0.7,
})

Search combines two retrieval strategies:

Vector search uses BRE scoring (dot product with magnitude penalty) on the stored embeddings.

BM25 search uses a full BM25 implementation with parameters k1=1.2 and b=0.75 for keyword-based retrieval.
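For readers unfamiliar with BM25, the sketch below scores a query against a small in-memory corpus using the documented k1=1.2 and b=0.75. It is not OpenMotoko's implementation: the tokenizer, the function name `bm25Scores`, and the choice of the Lucene-style IDF, ln(1 + (N - n + 0.5) / (n + 0.5)), are assumptions.

```typescript
// Hypothetical BM25 sketch using the parameters given in the docs.
const K1 = 1.2;
const B = 0.75;

// Assumed tokenizer: lowercase alphanumeric runs.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function bm25Scores(query: string, documents: string[]): number[] {
  const docs = documents.map(tokenize);
  const total = docs.length;
  if (total === 0) return [];
  const avgLength = docs.reduce((sum, d) => sum + d.length, 0) / total;

  // Document frequency: number of documents containing each term.
  const docFreq = new Map<string, number>();
  for (const doc of docs) {
    for (const term of Array.from(new Set(doc))) {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = Array.from(new Set(tokenize(query)));
  return docs.map((doc) => {
    // Term frequency within this document.
    const termFreq = new Map<string, number>();
    for (const term of doc) termFreq.set(term, (termFreq.get(term) ?? 0) + 1);

    let score = 0;
    for (const term of queryTerms) {
      const f = termFreq.get(term) ?? 0;
      if (f === 0) continue;
      const n = docFreq.get(term) ?? 0;
      const idf = Math.log(1 + (total - n + 0.5) / (n + 0.5));
      // b controls how strongly long documents are penalized.
      const lengthNorm = 1 - B + B * (doc.length / avgLength);
      score += (idf * (f * (K1 + 1))) / (f + K1 * lengthNorm);
    }
    return score;
  });
}
```

Raw BM25 scores are unbounded, so a real pipeline typically rescales them before blending with vector scores; the docs do not say how OpenMotoko handles this.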

The final score is a weighted blend:

score = (hybridAlpha * vectorScore) + ((1 - hybridAlpha) * bm25Score)

The default hybridAlpha is 0.7, giving 70% weight to semantic similarity and 30% to keyword matching.
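As a worked example of the blend, the helper below applies the formula directly. `hybridScore` is a hypothetical name, and the check that hybridAlpha lies between 0 and 1 is an assumption about valid input rather than documented behavior. It also assumes both input scores are on a comparable scale.

```typescript
// Hypothetical helper mirroring the documented formula.
function hybridScore(
  vectorScore: number,
  bm25Score: number,
  hybridAlpha: number = 0.7,
): number {
  // Assumed validation: alpha is treated as a weight in [0, 1].
  if (hybridAlpha < 0 || hybridAlpha > 1) {
    throw new RangeError('hybridAlpha must be between 0 and 1');
  }
  return hybridAlpha * vectorScore + (1 - hybridAlpha) * bm25Score;
}

// 0.7 * 0.8 + 0.3 * 0.4, approximately 0.68
const blended = hybridScore(0.8, 0.4);
// With alpha = 0 only the BM25 score counts: 0.4
const keywordOnly = hybridScore(0.8, 0.4, 0);
// With alpha = 1 only the vector score counts: 0.8
const vectorOnly = hybridScore(0.8, 0.4, 1);
```

Lowering hybridAlpha is useful for queries that hinge on exact identifiers or error strings, where keyword matches matter more than semantic similarity.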

Retrieved chunks are ranked by hybrid score and injected into the system prompt before the LLM call. Each result includes:

| Field | Description |
| --- | --- |
| content | The chunk text |
| source | Source identifier |
| score | Combined hybrid score |
| matchType | vector, bm25, or hybrid |
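
The docs do not show the template used to inject results into the system prompt. The sketch below is one plausible layout, given only as an illustration: `SearchResult` mirrors the documented result fields, while `buildSystemPrompt` and the "Relevant context" format are assumptions.

```typescript
// Result shape taken from the documented fields.
type MatchType = 'vector' | 'bm25' | 'hybrid';

interface SearchResult {
  content: string;
  source: string;
  score: number;
  matchType: MatchType;
}

// Hypothetical prompt layout; OpenMotoko's actual template is not documented here.
function buildSystemPrompt(basePrompt: string, results: SearchResult[]): string {
  if (results.length === 0) return basePrompt;

  // Rank by hybrid score, highest first, without mutating the input array.
  const ranked = results.slice().sort((a, b) => b.score - a.score);

  const context = ranked
    .map((r, i) => `[${i + 1}] (${r.source}, score ${r.score.toFixed(2)})\n${r.content}`)
    .join('\n\n');

  return `${basePrompt}\n\nRelevant context:\n${context}`;
}
```

Numbering the chunks and including the source makes it easier for the model to cite where an answer came from.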
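
The search call accepts these options:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| limit | number | 10 | Max results |
| minScore | number | 0.05 | Minimum score threshold |
| sources | string[] | (all) | Filter by source |
| hybridAlpha | number | 0.7 | Vector vs BM25 weight |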

Documents are stored in the rag_documents table:

| Column | Type | Description |
| --- | --- | --- |
| id | text | Unique ID (nanoid) |
| content | text | Chunk text |
| source | text | Source identifier |
| chunkIndex | integer | Position in original document |
| metadata | text | JSON metadata |
| embedding | blob | 384-dim float vector |
| tokenCount | integer | Token count of the chunk |
| createdAt | integer | Unix ms timestamp |
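
The schema above could be mirrored in application code roughly as follows. This is a sketch under assumptions: the docs do not specify the byte layout of the embedding blob, so the code assumes raw little-endian 32-bit floats (384 values, 1536 bytes), and `RagDocumentRow`, `encodeEmbedding`, and `decodeEmbedding` are hypothetical names.

```typescript
// Hypothetical row type mirroring the documented columns.
interface RagDocumentRow {
  id: string;            // nanoid
  content: string;
  source: string;
  chunkIndex: number;
  metadata: string;      // JSON-encoded string
  embedding: Uint8Array; // blob; layout assumed to be little-endian float32
  tokenCount: number;
  createdAt: number;     // Unix timestamp in milliseconds
}

// Serialize a float vector into bytes for the blob column.
function encodeEmbedding(vector: Float32Array): Uint8Array {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < vector.length; i++) {
    view.setFloat32(i * 4, vector[i], true); // true = little-endian
  }
  return bytes;
}

// Read the blob back into a float vector.
function decodeEmbedding(blob: Uint8Array): Float32Array {
  if (blob.byteLength % 4 !== 0) {
    throw new RangeError('embedding blob length must be a multiple of 4 bytes');
  }
  const view = new DataView(blob.buffer, blob.byteOffset, blob.byteLength);
  const vector = new Float32Array(blob.byteLength / 4);
  for (let i = 0; i < vector.length; i++) {
    vector[i] = view.getFloat32(i * 4, true);
  }
  return vector;
}
```

Using DataView with an explicit endianness keeps the stored bytes portable across machines, at the cost of a copy on each read.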