TSemanticIndex - Semantic Search over DBF/SQL
Source: source/classes/tsemanticindex.prg
TSemanticIndex brings meaning-based search to ordinary
FiveWin data apps. Instead of LIKE '%word%', it finds the records
whose text is closest in meaning to a query — e.g. searching
"unhappy about late delivery" surfaces a note that says "the order arrived two
weeks late and the customer is furious", even with no shared words.
It works by turning text into embedding vectors and ranking records by cosine similarity. It is the first of the "FWAI" utility classes — see the PyTorch-lite roadmap.
How it works
Vectors are stored unit-normalized (length 1), so the cosine similarity of two vectors is simply their dot product. Scores run from 1.0 (same direction, i.e. closest meaning) down to a theoretical minimum of -1.0.
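To make the normalization step concrete, here is a minimal Harbour sketch of it. UnitNormalize() and Dot() are hypothetical helper names chosen for illustration; they are not claimed to be the functions TSemanticIndex uses internally.

```harbour
procedure Main()
   local a := UnitNormalize( { 3, 4 } )   // -> { 0.6, 0.8 }
   local b := UnitNormalize( { 4, 3 } )   // -> { 0.8, 0.6 }

   ? Dot( a, a )   // about 1.00: same direction
   ? Dot( a, b )   // about 0.96: close, but not identical
return

// Hypothetical helper: scale a vector to length 1.
function UnitNormalize( aVec )
   local nSum := 0, x, i
   local aOut := Array( Len( aVec ) )

   for each x in aVec
      nSum += x * x
   next
   // Guard against the zero vector, which has no direction
   nSum := iif( nSum > 0, Sqrt( nSum ), 1 )
   for i := 1 to Len( aVec )
      aOut[ i ] := aVec[ i ] / nSum
   next
return aOut

// Hypothetical helper: dot product; equals cosine similarity
// when both inputs are already unit-length.
function Dot( a, b )
   local nDot := 0, i

   for i := 1 to Min( Len( a ), Len( b ) )
      nDot += a[ i ] * b[ i ]
   next
return nDot
```

Normalizing once at insert time means each query costs one multiply-add per dimension per record, with no square roots in the search loop.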
Backend-agnostic
New( bEmbed ) takes a codeblock { |cText| -> vector },
so the embeddings can come from any source — the HuggingFace Inference API,
a local model, or your own function. To use HuggingFace MiniLM through
TEmbeddings:
```harbour
oEmb := TEmbeddings():New()   // needs HF_API_KEY (384-dim MiniLM)
oIdx := TSemanticIndex():New( { |c| oEmb:GetEmbeddings( c ) } )
```
For private/offline data, plug a local embedding model behind the same codeblock (see the roadmap's local-vs-cloud guidance) — the index code does not change.
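Because the index only ever sees a codeblock, a throwaway offline embedder is enough to exercise the wiring without an API key. The sketch below assumes a hypothetical ToyEmbed() that hashes words into a fixed number of buckets. The name, the bucket count and the hashing scheme are illustrative assumptions, and the result reflects word overlap only, not meaning, so it suits tests rather than real search.

```harbour
procedure Main()
   // Same shape as the TEmbeddings codeblock: text in, numeric array out
   local bEmbed := { |cText| ToyEmbed( cText, 16 ) }
   local aVec   := Eval( bEmbed, "late delivery late refund" )

   ? Len( aVec )   // 16: one slot per bucket
return

// Hypothetical stand-in embedder: bag of words hashed into nDim buckets.
function ToyEmbed( cText, nDim )
   local aVec := AFill( Array( nDim ), 0 )
   local cWord, nBucket

   for each cWord in hb_ATokens( Lower( cText ), " " )
      if ! Empty( cWord )
         // CRC32 of the word picks a bucket in 1..nDim
         nBucket := ( Abs( hb_CRC32( cWord ) ) % nDim ) + 1
         aVec[ nBucket ] += 1
      endif
   next
return aVec
```

The same bEmbed could then be passed to TSemanticIndex():New( bEmbed ) in place of the HuggingFace codeblock, leaving the rest of the program unchanged.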
Methods
| Method | Description |
|---|---|
| New( bEmbed ) | Create an index using the given embedding codeblock. |
| Add( cText, uId ) | Embed cText and store it under id uId. |
| AddVector( aVec, uId ) | Store a precomputed vector under uId (skips embedding). |
| IndexDbf( cField, bId ) | Walk the current work area, indexing cField; id defaults to RecNo() (or Eval(bId)). |
| Search( cQuery, nTop ) | Return { { uId, nScore }, ... } for the nTop best matches, best first. |
| Size() | Number of indexed records. |
| Save( cFile ) / Load( cFile ) | Persist / restore the index (ids + vectors). |
Example: search customer notes
```harbour
USE customers
oEmb := TEmbeddings():New()
oIdx := TSemanticIndex():New( { |c| oEmb:GetEmbeddings( c ) } )
oIdx:IndexDbf( "NOTES" )          // one row per record, id = RecNo()
oIdx:Save( "customers.idx" )      // build once, reuse later
aHits := oIdx:Search( "unhappy about late delivery", 10 )
for each h in aHits
   customers->( dbGoto( h[ 1 ] ) )   // h[1] = RecNo, h[2] = score
   ? customers->NAME, h[ 2 ]
next
```
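On later runs the saved file can be reloaded instead of re-embedding every record. The following is a sketch under assumptions: it uses only the methods listed in the table above, assumes Load() can be called on a freshly created index, and does not check a return value because none is documented here. It needs FiveWin's TEmbeddings and TSemanticIndex classes linked in.

```harbour
procedure Main()
   local oEmb  := TEmbeddings():New()
   local oIdx  := TSemanticIndex():New( { |c| oEmb:GetEmbeddings( c ) } )
   local cFile := "customers.idx"

   USE customers
   if File( cFile )
      oIdx:Load( cFile )           // reuse the vectors saved earlier
   else
      oIdx:IndexDbf( "NOTES" )     // first run: embed every record
      oIdx:Save( cFile )
   endif
   ? oIdx:Size(), "records indexed"
return
```

The embedding codeblock is still required after a reload, since each query text has to be embedded before it can be compared with the stored vectors.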
Notes
- Build the index once and Save it; reload on startup. Re-embed only changed records.
- Cloud embeddings (TEmbeddings/HF) are easiest to start with; for sensitive data or high volume, a local embedding model avoids per-call cost and keeps data on the machine.
- Cosine search here is a linear scan in Harbour, which is fine for thousands of records. For very large sets, batch the vectors into an FW_Tensor matrix and rank with a single matrix multiply.
- The embedder codeblock must return a numeric array; Add ignores records whose embedding fails (e.g. an API error).
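The linear scan mentioned in the notes amounts to one dot product per stored vector followed by a sort. The sketch below illustrates that idea on toy two-dimensional unit vectors. RankByDot() and its parameters are hypothetical, and this is not presented as the actual body of Search().

```harbour
procedure Main()
   // Three unit vectors keyed by record id (toy data)
   local aIds  := { 101, 102, 103 }
   local aVecs := { { 1, 0 }, { 0.6, 0.8 }, { 0, 1 } }
   local aHit

   // Prints 102 first (about 0.96), then 101 (about 0.80)
   for each aHit in RankByDot( { 0.8, 0.6 }, aIds, aVecs, 2 )
      ? aHit[ 1 ], aHit[ 2 ]
   next
return

// Hypothetical helper: score every stored vector, sort, keep the best nTop.
function RankByDot( aQuery, aIds, aVecs, nTop )
   local aHits := {}, i, j, nDot

   for i := 1 to Len( aVecs )
      nDot := 0
      for j := 1 to Len( aQuery )
         nDot += aQuery[ j ] * aVecs[ i ][ j ]
      next
      AAdd( aHits, { aIds[ i ], nDot } )
   next
   ASort( aHits,,, { |x, y| x[ 2 ] > y[ 2 ] } )   // highest score first
   ASize( aHits, Min( nTop, Len( aHits ) ) )
return aHits
```

The cost grows with the number of records times the vector dimension, which is why the scan is comfortable for thousands of records and why a batched matrix multiply becomes attractive for much larger sets.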