"PyTorch-lite in FWH" — Roadmap
Source notes: docs/pytorch-lite-roadmap.md
This roadmap covers the effort to bring real transformer power into FiveWin/Harbour apps by reusing HuggingFace resources. It explains why each step is needed and the order in which the steps land. It is a living plan; see the Transformer and HFTokenizer pages for what already ships.
Premise
Harbour-the-language is not inferior to Python for this. PyTorch
is a thin Python layer over a C++/CUDA core. FWH already ships C extensions
(source/function/matrixes.c). The gap to close is that C core,
not the language.
Three tracks
Keep these distinct to set honest expectations:
- Track A: local from-scratch Transformer. Toy/educational scale, CPU, small d_model. Niche: tiny embeddable/offline models, classification on small data, teaching. Will not match PyTorch.
- Track B: HuggingFace as remote power (Inference API). Already partial: tembeddings.prg, tollama.prg, chatgpt.prg, gemini*.prg. Vastly more capable today, no local training.
- Track C: bind llama.cpp (libllama) via FFI. Highest real value for running a real LLM locally in an FWH app today. llama.cpp already does the hard part (quantized weights, SIMD, GPU, KV cache, mmap, GGUF); a TLlamaCpp class would just call its C API (llama_load_model_from_file, llama_decode, sampling), the same pattern as the hbcurl binding, giving real Llama/Mistral/Qwen/Phi inference on CPU/GPU with 4-bit quantization, no training.
The phases below raise Track A's ceiling. Track C is a separate, complementary effort (an FFI binding, not a from-scratch core) and is the pragmatic path if the goal is production LLM power in FWH apps now.
Why not make our Transformer weight-compatible with llama.cpp models
Low value, high cost. LLaMA/Mistral use RoPE (not learned
positions), RMSNorm (not LayerNorm), SwiGLU (not
GELU/ReLU) and GQA (grouped-query attention) — different from
GPT-2 and from our class. Running their weights would mean a per-architecture
rewrite plus GGUF block dequantization. Binding libllama (Track C)
gets the same capability for a fraction of the effort.
Lessons from llama.cpp
Fold these into Track A so FW_Tensor scales:
| Lesson | What it is | Where it applies |
|---|---|---|
| Quantization (Q4_K/Q8_0) | weights in 4–8-bit blocks + per-block scale | 4-bit = 1/8 the memory of float32; how 7B fits in 4 GB. FW_Tensor should support quantized blocks + dequant-in-matmul |
| mmap weights | memory-map the file; OS pages on demand | FWT_LoadSafe currently freads; mmap = "load" 500 MB instantly, lazy paging |
| KV cache | cache K,V per layer across generation steps | our Generate recomputes the whole window for each token (O(n²) attention work per token); with a KV cache each step processes only the new token, so attention work drops to O(n) per token |
| ggml graph + backends | op graph + pluggable CPU/CUDA/Metal backend + arena allocator | architecture target for FW_Tensor (aligns with Phase 2/4) |
| SIMD + threaded + blocked matmul | cache-aware multicore kernels | our matmul is a triple loop → Phase 2 (BLAS) |
| GGUF single-file format | metadata + tensors + quant in one mmap'd file | a GGUF reader (like our safetensors reader) opens the quantized-model ecosystem |
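The quantization row can be illustrated with a simplified 8-bit block format. This is a sketch loosely modeled on the idea behind Q8_0 (32 values sharing one scale); the real ggml layout stores the scale as fp16, and the names QBlock8, qblock8_quantize and qblock8_dot are hypothetical, not part of FW_Tensor or llama.cpp.

```c
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32  /* values per block (assumption for this sketch) */

/* One quantized block: 32 signed 8-bit values sharing one float scale. */
typedef struct {
    float  scale;
    int8_t q[QBLOCK];
} QBlock8;

/* Quantize 32 floats: scale = max|x| / 127, q = round(x / scale). */
void qblock8_quantize(const float *x, QBlock8 *b)
{
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    b->scale = amax / 127.0f;
    float inv = b->scale > 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float v = x[i] * inv;
        /* round half away from zero without needing libm */
        b->q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Dot product of a quantized weight row with a float activation vector.
   Weights are dequantized on the fly, one block at a time, so the full
   float32 row never exists in memory ("dequant-in-matmul"). */
float qblock8_dot(const QBlock8 *w, const float *x, size_t nblocks)
{
    float sum = 0.0f;
    for (size_t b = 0; b < nblocks; b++) {
        float acc = 0.0f;
        for (int i = 0; i < QBLOCK; i++)
            acc += (float)w[b].q[i] * x[b * QBLOCK + i];
        sum += acc * w[b].scale;   /* apply the block scale once */
    }
    return sum;
}
```

On typical platforms each block here takes 36 bytes for 32 weights, against 128 bytes as float32. A 4-bit format packs two values per byte and shrinks this further, at the cost of more rounding error.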
GPT-2 weight-loading spike (findings)
Sample samples/ai/gpt2spike.prg proved pure Harbour can read a
real GPT-2 model.safetensors: it parsed the header (160 tensors,
137,022,720 params), validated shapes (wte[50257,768],
wpe[1024,768]), and decoded a real float32 weight — fetching only
the 16 KB header via an HTTP range request, not the 500 MB file.
Blocker found: 137M params as Harbour nested arrays
(array of array of double, ~16–24 bytes/element) is ~2–3 GB
RAM and slow. The current matrix backend does not scale — hence Phase 1.
Real GPT-2 also needs an exact architecture match (pre-norm, GELU, learned positional embeddings, combined QKV projection with bias, weight-tied lm_head).
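The header-only fetch works because of how a safetensors file is laid out. Below is a minimal sketch of the byte-level steps, assuming the published safetensors layout (8-byte length, JSON header, raw data); the helper names are illustrative and are not taken from gpt2spike.prg.

```c
#include <stdint.h>
#include <string.h>

/* A safetensors file starts with an 8-byte little-endian unsigned integer N,
   followed by N bytes of JSON describing every tensor (dtype, shape,
   data_offsets), followed by the raw tensor bytes. */
uint64_t st_header_len(const unsigned char first8[8])
{
    uint64_t n = 0;
    for (int i = 7; i >= 0; i--)
        n = (n << 8) | first8[i];
    return n;
}

/* Decode one little-endian IEEE-754 float32 from raw bytes. */
float st_read_f32(const unsigned char b[4])
{
    uint32_t bits = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                    ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    float f;
    memcpy(&f, &bits, sizeof f);   /* bit copy, avoids aliasing problems */
    return f;
}

/* Absolute file offset of a tensor's first byte, given the header length
   and the tensor's data_offsets[0] from the JSON. Offsets in the JSON are
   relative to the end of the header. */
uint64_t st_tensor_file_offset(uint64_t header_len, uint64_t data_begin)
{
    return 8 + header_len + data_begin;
}
```

A client can request bytes 0-7 first, then bytes 8 to 8 + N - 1 for the JSON header, and later only the byte ranges of the tensors it actually needs.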
What FWH needs, by priority — and why
| # | Need | Why it is necessary |
|---|---|---|
| 1 | Flat C-backed tensor (FW_Tensor) | A matrix is currently an array-of-arrays-of-double; 137M params that way ≈ 2–3 GB and cache-hostile. The same data as a flat float32 buffer is ~548 MB (~523 MiB), contiguous, and fread-able straight from safetensors. Unblocks everything. |
| 2 | BLAS/SIMD matmul | hb_MatrixMultiply is a naive triple loop. A transformer is mostly large matmuls; tuned BLAS uses SIMD, blocking and all cores for 10–100× — the line between "seconds per token" and "minutes per token". |
| 3 | Autograd (tape-based) | PyTorch's killer feature. Today every backward is hand-derived per layer — error-prone and a wall against new architectures. A tape gives gradients from the forward pass for free. Needed for training, not inference. |
| 4 | float32 / bf16 / fp16 | We compute in 64-bit double. Smaller dtypes halve/quarter memory and bandwidth — and bandwidth, not flops, is the usual CPU bottleneck. |
| 5 | GPU backend (cuBLAS/OpenCL) | CPUs top out at a few hundred GFLOP/s; GPUs reach 10–100 TFLOP/s. Where big-model power and training-at-scale live. Optional for small/medium inference. |
| 6 | Kernel zoo | Naive attention materializes the full seq×seq matrix (O(n²) memory); fused/flash kernels avoid that. A broad correct kernel set lets you build many models, not one. |
| 7 | Model IO (safetensors, GGUF, ONNX) | How you reuse the world's pretrained weights instead of training from scratch. safetensors reading is proven; GGUF opens the quantized/llama.cpp ecosystem; ONNX opens cross-framework models. |
| 8 | Ecosystem (tokenizers, datasets, Hub) | A model is useless without its exact tokenizer (bit-exact ids), data, and a way to fetch weights. HFTokenizer closes the tokenizer half. |
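The memory argument in row 1 is plain arithmetic over a contiguous buffer. The sketch below uses a hypothetical FlatTensor struct (this page does not specify the actual FW_Tensor layout) to show row-major indexing and the size calculation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flat tensor: one contiguous float32 buffer, row-major. */
typedef struct {
    size_t rows, cols;
    float *data;
} FlatTensor;

/* Bytes needed to hold n_elems elements of a given element size. */
uint64_t tensor_bytes(uint64_t n_elems, uint64_t bytes_per_elem)
{
    return n_elems * bytes_per_elem;
}

/* Allocate a zero-filled rows x cols tensor; returns 0 on failure. */
int flat_tensor_alloc(FlatTensor *t, size_t rows, size_t cols)
{
    t->rows = rows;
    t->cols = cols;
    t->data = (float *)calloc(rows * cols, sizeof(float));
    return t->data != NULL;
}

/* Element (r, c) lives at index r * cols + c: no per-row pointer chase. */
float *flat_tensor_at(FlatTensor *t, size_t r, size_t c)
{
    return &t->data[r * t->cols + c];
}

/* Release the buffer and clear the pointer. */
void flat_tensor_free(FlatTensor *t)
{
    free(t->data);
    t->data = NULL;
}
```

For the 137,022,720 parameters found in the spike, tensor_bytes(137022720, 4) is 548,090,880 bytes (about 523 MiB), in line with the figure in row 1. At 16 to 24 bytes per element the same count needs roughly 2.2 to 3.3 GB.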
Pragmatic phases
| Phase | Deliverable | Notes |
|---|---|---|
| 1 | FW_Tensor flat C type + ops on it | Removes the memory blocker (foundation shipped: tensor, MatMul, safetensors loader) |
| 2 | BLAS backend for matmul | 10–100× speedup |
| 3 | Inference-only GPT2Model that loads safetensors | THE headline demo: real pretrained GPT-2 generating text in FWH/CPU. ~1–5 tok/s with BLAS (Phase 2). Hard prerequisite: Phase 1 only. |
| 4 (opt) | Tape autograd for training; GPU backend | Toward training real models |
Key insight: GPT-2 inference needs no autograd — only Phase 1 + Phase 3. That is the most direct path to a headline feature.
Why this order
- Phase 1 first because it is a hard dependency of everything else — you cannot hold, load, or efficiently compute on real weights while data lives in nested arrays. It also pays off immediately (loading safetensors).
- Phase 2 before 3 because a correct-but-slow forward (naive matmul) may be too slow to demo; BLAS makes it usable. Phase 3 can start on naive matmul and speed up once Phase 2 lands.
- Phase 3 before 4 because inference delivers the headline result with the least machinery (no autograd, no GPU). Training is strictly more work and can follow.
- Each phase is independently shippable and leaves the library in a working state.
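To show what Phase 2 changes relative to the naive loop, here is a sketch of a naive triple-loop matmul next to a cache-blocked version. It is not BLAS and makes no performance claim; tuned BLAS libraries add SIMD and threading on top of blocking. The function names and the tile size are assumptions for illustration.

```c
#include <stddef.h>

/* Naive triple loop: C[m x n] = A[m x k] * B[k x n], all row-major. */
void matmul_naive(const float *A, const float *B, float *C,
                  size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t p = 0; p < k; p++)
                sum += A[i * k + p] * B[p * n + j];  /* B walked by column */
            C[i * n + j] = sum;
        }
}

#define TILE 32  /* tile edge; an assumption, tune per cache size */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Same product, computed tile by tile so the working set stays in cache.
   The inner loops run i-p-j so B and C are walked along rows. */
void matmul_blocked(const float *A, const float *B, float *C,
                    size_t m, size_t k, size_t n)
{
    for (size_t i = 0; i < m * n; i++) C[i] = 0.0f;
    for (size_t i0 = 0; i0 < m; i0 += TILE)
        for (size_t p0 = 0; p0 < k; p0 += TILE)
            for (size_t j0 = 0; j0 < n; j0 += TILE)
                for (size_t i = i0; i < min_sz(i0 + TILE, m); i++)
                    for (size_t p = p0; p < min_sz(p0 + TILE, k); p++) {
                        float a = A[i * k + p];
                        for (size_t j = j0; j < min_sz(j0 + TILE, n); j++)
                            C[i * n + j] += a * B[p * n + j];
                    }
}
```

Both functions produce the same result, so a Phase 3 forward pass written against one signature could switch to a faster backend later without changing the model code.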
Concrete HuggingFace utilities for FWH apps (the "FWAI" objective)
Objective: make FWH a first-class AI app platform — any HuggingFace-hosted capability usable from Harbour in a few lines, via three interchangeable backends: remote (Inference API, breadth today), local (FW_Tensor / llama.cpp, privacy + offline) and from-scratch (educational). FiveWin apps are DBF/SQL + xBrowse business apps, so the high-value utilities are the ones that plug AI into that data.
| Utility | What it does in an FWH app | Backend | Value |
|---|---|---|---|
| Semantic search over DBF/SQL | find customers/invoices/products by meaning, not LIKE | embeddings (API tembeddings / local) + cosine | ★★★ |
| RAG / doc Q&A | answer natural-language questions over company PDFs/manuals | embeddings + retrieval + LLM | ★★★ |
| Text classification | auto-categorize tickets, emails, expenses; intent routing | API / small local transformer | ★★★ |
| NER (extraction) | pull names/dates/amounts/tax-ids from free text → DB fields | API / local | ★★★ |
| Speech-to-text (Whisper) | dictation into GETs, transcribe calls/voice notes | API / whisper.cpp binding | ★★★ |
| Summarization | long notes/reports/emails → summary field | API / llama.cpp | ★★ |
| Translation | multilingual invoices / UI / content | API | ★★ |
| Sentiment | score reviews / customer feedback | API / local | ★★ |
| OCR / image→text | scan receipts/IDs → text → DB | API / local VLM | ★★ |
| In-app chat assistant | query the app's data in natural language; help | llama.cpp local / API | ★★★ |
| GET autocomplete | next-word / field prediction | local GPT-2 (feasible now) | ★★ |
| Data normalization | fix addresses, dedupe, standardize | LLM API/local | ★★ |
First five to build (max value / least effort)
- TSemanticIndex: index a DBF/SQL column into embeddings; :Search(cText) returns records by cosine similarity. The killer feature for data apps; reuses tembeddings.
- TChatAgent: chat over the app's data (simple function-calling: the LLM asks for queries, the app returns rows). Backend llama.cpp (local) or API.
- THFTask: thin base + :Classify / :NER / :Summarize / :Translate / :Sentiment over the Inference API (each ~20 lines of curl, like tembeddings). Five utilities at once.
- TWhisperCpp: binding to whisper.cpp for offline dictation → GET text.
- GET autocomplete with local GPT-2: everything needed already exists.
Local vs cloud API (e.g. TDeepSeek) — when to use each
Not either/or. A cloud API (TDeepSeek, OpenAI, Gemini…) wins on raw capability; local wins on the deployment context. Pick the backend per task.
Where local (FW_Tensor / GPT-2 / llama.cpp) wins:
- Privacy / data sovereignty — data never leaves the machine (GDPR, medical, legal, financial, defense). A cloud API ships your data to a third party.
- Offline — factories, ships, isolated/air-gapped networks.
- Zero per-call cost — embeddings over 1M DBF rows, autocomplete on every keystroke: free locally; metered + rate-limited via API.
- Latency: local classify/embed skips the network round-trip entirely; a per-keystroke budget (<50 ms) is hard to meet through a remote API.
- No dependency / longevity — APIs change, deprecate, raise prices, go down, geo-block. A local model is yours forever and ships inside the .exe.
- Determinism / auditability: fixed weights plus fixed sampling settings give reproducible output, which regulated industries often require.
- Embeddable / no account — AI in a desktop app with zero setup, no per-customer API key, no signup.
- Customization — train tiny domain models on the customer's own data (Track A).
Where the cloud API wins (be honest):
- Raw capability — DeepSeek-V3/R1 ≫ GPT-2 or anything you run locally on CPU. Hard reasoning / high-quality chat → cloud wins decisively.
- Zero infrastructure — no 500 MB model, no RAM/GPU footprint.
- Always the latest model, maintenance-free.
Rule of thumb: cloud for heavy reasoning / quality chat on non-sensitive, online data; local for sensitive data, offline, high-volume/cheap (semantic search over a whole DB), per-keystroke latency, embedded zero-setup, regulated/auditable.
The key insight — hybrid. Local complements the cloud, it does not replace it. Example RAG: retrieval with local embeddings (cheap, private, over hundreds of thousands of records), then generate the answer with the cloud API (quality). The interchangeable-backend design lets the developer choose per task, and the app is never locked to one provider.
Status
- ✓ HFTokenizer (GPT-2 BPE), bit-exact — in lib
- ✓ Tokenizer → Transformer bridge (compact vocab) + TFWLanguageModel wrapper
- ✓ GPT-2 safetensors read + float32 decode spike
- ✓ Track A: train on real Shakespeare text (overfit a line; loss drops on a slice)
- ■ Phase 1 foundation: FW_Tensor (flat float32, MatMul, safetensors loader). Remaining: full tensor op set (add, softmax, layernorm, GELU), strides/views.
- □ Phase 2: BLAS
- □ Phase 3: GPT2Model inference
- □ Phase 4: autograd / GPU
- □ Track A: top-p / nucleus + repetition-penalty sampling
Next / pending work
Ordered by value; everything builds on what already ships (in lib, hb32).
- TLlamaCpp (Track C): FFI binding to llama.cpp for real local LLMs (Llama/Mistral/Qwen/Phi), CPU/GPU, 4-bit quantized, single self-contained exe (no Ollama install). Uses the proven optional-module pattern from TWhisperCpp; plugs into TChatAgent as bChat.
- GET autocomplete with local GPT-2: last of the first-five FWAI utilities; everything needed exists.
- Phase 2: BLAS backend for FWT_MatMul (10–100×), plus a KV cache in GPT2Model:Generate (currently O(n²)).
- More FWAI utilities: OCR / image→text, a RAG pipeline (TSemanticIndex retrieval + LLM answer), data normalization.
- Quantization + GGUF reader for FW_Tensor (llama.cpp lessons); mmap weights in FWT_LoadSafe.
- Build all library variants: this effort rebuilt only hb32; before a release, rebuild every compiler target so all have the new AI classes.
Reusable optional-binding pattern (for any heavy native dep —
llama.cpp, whisper.cpp, GPU libs): guarded C wrapper kept out of fwhc.hbp;
PRG class in the FWH lib calling the natives by name (no link dependency); an
IsAvailable() check; the dependency is added only by the app that wants
it. Powerful AI stays optional and zero-cost for every other FWH user.
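The "call by name, no link dependency" idea can be shown with a small language-neutral analog in C. This is only a sketch of the pattern: Harbour resolves functions through its own dynamic symbol table, which this code does not reproduce, and every name here (native_register, native_find, is_available, call_or_default, demo_double) is hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef int (*NativeFn)(int);

typedef struct { const char *name; NativeFn fn; } NativeEntry;

#define MAX_NATIVES 16
static NativeEntry g_natives[MAX_NATIVES];
static size_t g_count = 0;

/* Called only by the optional wrapper module, if the app links it. */
int native_register(const char *name, NativeFn fn)
{
    if (g_count >= MAX_NATIVES) return 0;
    g_natives[g_count].name = name;
    g_natives[g_count].fn = fn;
    g_count++;
    return 1;
}

/* Look a native up by name; NULL means the optional module is absent. */
NativeFn native_find(const char *name)
{
    for (size_t i = 0; i < g_count; i++)
        if (strcmp(g_natives[i].name, name) == 0) return g_natives[i].fn;
    return NULL;
}

/* The class-side availability check: a lookup, not a symbol reference. */
int is_available(const char *name) { return native_find(name) != NULL; }

/* Call by name, returning a fallback when the dependency is missing. */
int call_or_default(const char *name, int arg, int fallback)
{
    NativeFn fn = native_find(name);
    return fn ? fn(arg) : fallback;
}

/* Stand-in for a function the optional wrapper would provide. */
int demo_double(int x) { return x * 2; }
```

The library-side code only ever refers to the native by its string name, so an app that never registers the wrapper still links and runs; the availability check simply reports false.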