"PyTorch-lite in FWH" — Roadmap

Source notes: docs/pytorch-lite-roadmap.md

This roadmap covers the effort to bring real transformer power into FiveWin/Harbour apps by reusing HuggingFace resources. It explains why each step is needed and the order in which the steps land. It is a living plan; see the Transformer and HFTokenizer pages for what already ships.

Premise

Harbour-the-language is not inferior to Python for this. PyTorch is a thin Python layer over a C++/CUDA core. FWH already ships C extensions (source/function/matrixes.c). The gap to close is that C core, not the language.

Three tracks

Keep these distinct to set honest expectations:

Track A — local from-scratch Transformer. Toy/educational scale, CPU, small d_model. Niche: tiny embeddable/offline models, classification on small data, teaching. Will not match PyTorch.
Track B — HuggingFace as remote power (Inference API). Already partial: tembeddings.prg, tollama.prg, chatgpt.prg, gemini*.prg. Vastly more capable today, no local training.
Track C — bind llama.cpp (libllama) via FFI. Highest real value for running a real LLM locally in an FWH app today. llama.cpp already does the hard part (quantized weights, SIMD, GPU, KV cache, mmap, GGUF); a TLlamaCpp class would just call its C API (llama_load_model_from_file, llama_decode, sampling) — the same pattern as the hbcurl binding — giving real Llama/Mistral/Qwen/Phi inference on CPU/GPU with 4-bit quantization, no training.

The phases below raise Track A's ceiling. Track C is a separate, complementary effort (an FFI binding, not a from-scratch core) and is the pragmatic path if the goal is production LLM power in FWH apps now.

Why not make our Transformer weight-compatible with llama.cpp models

Low value, high cost. LLaMA/Mistral use RoPE (not learned positions), RMSNorm (not LayerNorm), SwiGLU (not GELU/ReLU) and GQA (grouped-query attention) — different from GPT-2 and from our class. Running their weights would mean a per-architecture rewrite plus GGUF block dequantization. Binding libllama (Track C) gets the same capability for a fraction of the effort.
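One of these differences, GQA, can be made concrete with a few lines of C. This is a minimal sketch with a hypothetical helper name (nothing here is FWH or llama.cpp API): in GPT-2 style multi-head attention every query head has its own K/V head, while in GQA a group of consecutive query heads shares one.

```c
#include <assert.h>

/* Illustrative helper (hypothetical name): which K/V head serves query head
 * q_head when n_q_heads query heads share n_kv_heads K/V heads.
 * n_kv_heads == n_q_heads is plain multi-head attention (GPT-2 style). */
int gqa_kv_head(int q_head, int n_q_heads, int n_kv_heads)
{
    assert(n_kv_heads > 0 && n_q_heads % n_kv_heads == 0);
    assert(q_head >= 0 && q_head < n_q_heads);
    int group = n_q_heads / n_kv_heads;   /* query heads per K/V head */
    return q_head / group;
}
```

With 32 query heads and 8 K/V heads, query heads 4 to 7 all read K/V head 1. A class that assumes one K/V head per query head has no place to express that sharing, which is part of the per-architecture rewrite mentioned above.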

Lessons from llama.cpp

Fold these into Track A so FW_Tensor scales:

Lesson	What it is	Where it applies
Quantization (Q4_K/Q8_0)	weights in 4–8-bit blocks + per-block scale	4-bit = 1/8 the memory of float32; how 7B fits in 4 GB. `FW_Tensor` should support quantized blocks + dequant-in-matmul
mmap weights	memory-map the file; OS pages on demand	`FWT_LoadSafe` currently `fread`s; mmap = "load" 500 MB instantly, lazy paging
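KV cache	cache K,V per layer across generation steps	our `Generate` recomputes the whole window for each token (O(n²) attention work per token); with a KV cache each new token passes through the layers once and attends over the cached K,V (O(n) per token)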
ggml graph + backends	op graph + pluggable CPU/CUDA/Metal backend + arena allocator	architecture target for `FW_Tensor` (aligns with Phase 2/4)
SIMD + threaded + blocked matmul	cache-aware multicore kernels	our matmul is a triple loop → Phase 2 (BLAS)
GGUF single-file format	metadata + tensors + quant in one mmap'd file	a GGUF reader (like our safetensors reader) opens the quantized-model ecosystem
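The quantization row can be sketched as follows. This is a simplified 8-bit block in the spirit of Q8_0, not the actual GGUF layout: the type and function names are hypothetical, and the scale is kept as a float32 here for clarity, whereas real GGUF blocks store it more compactly.

```c
#include <assert.h>
#include <stdint.h>

#define QBLOCK 32   /* values per block; ggml's Q8_0 also uses 32 */

/* Simplified 8-bit block: one scale plus 32 signed bytes (assumed layout). */
typedef struct {
    float  scale;
    int8_t q[QBLOCK];
} q8_block;

/* Quantize 32 floats: scale = max|x| / 127, q = round(x / scale). */
void q8_quantize(const float *x, q8_block *out)
{
    assert(x != 0 && out != 0);
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = out->scale > 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float v = x[i] * inv;
        /* round to nearest; the cast truncates toward zero */
        out->q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Dot product of one quantized weight block with float activations.
 * The scale is applied once per block, so dequantization is folded into
 * the multiply instead of expanding the weights back to float32 first. */
float q8_dot(const q8_block *w, const float *x)
{
    float acc = 0.0f;
    for (int i = 0; i < QBLOCK; i++)
        acc += (float)w->q[i] * x[i];
    return acc * w->scale;
}
```

On typical platforms this block takes 36 bytes for 32 weights, against 128 bytes as float32. A 4-bit variant would pack two values per byte on the same principle.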

GPT-2 weight-loading spike (findings)

Sample samples/ai/gpt2spike.prg proved pure Harbour can read a real GPT-2 model.safetensors: it parsed the header (160 tensors, 137,022,720 params), validated shapes (wte[50257,768], wpe[1024,768]), and decoded a real float32 weight — fetching only the 16 KB header via an HTTP range request, not the 500 MB file.
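The spike itself is Harbour code, but the byte-level steps it relies on are easy to show in C. A minimal sketch, with hypothetical function names: a safetensors file begins with an 8-byte little-endian length N, followed by N bytes of JSON describing every tensor, so the header can be fetched without touching the weight data.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read the 8-byte little-endian header length at the start of the file. */
uint64_t st_header_len(const unsigned char first8[8])
{
    uint64_t n = 0;
    for (int i = 7; i >= 0; i--)
        n = (n << 8) | first8[i];
    return n;
}

/* Last byte (inclusive) to request so the response holds the length prefix
 * plus the whole JSON header, e.g. for "Range: bytes=0-<last>". */
uint64_t st_header_last_byte(uint64_t header_len)
{
    assert(header_len > 0);
    return 8 + header_len - 1;
}

/* Decode one little-endian IEEE-754 float32 weight from raw tensor bytes. */
float st_read_f32(const unsigned char b[4])
{
    uint32_t bits = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                    ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    float f;
    memcpy(&f, &bits, sizeof f);   /* bit-cast without aliasing problems */
    return f;
}
```

One way to use this is a first range request for bytes 0-7, then a second for the header proper; whether the spike used one request or two is not stated in these notes.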

Blocker found: holding 137M params as Harbour nested arrays (array of array of double, ~16–24 bytes/element) takes ~2–3 GB of RAM and is slow. The current matrix backend does not scale, hence Phase 1. Running real GPT-2 also requires matching its architecture exactly (pre-norm, GELU, learned positional embeddings, combined QKV with bias, weight-tied lm_head).
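The memory figures follow directly from the parameter count. A small check in C (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

#define GPT2_PARAMS 137022720ULL   /* parameter count reported by the spike */

/* Bytes needed to hold n_params values at bytes_per_elem each. */
uint64_t weights_bytes(uint64_t n_params, uint64_t bytes_per_elem)
{
    assert(bytes_per_elem > 0);
    return n_params * bytes_per_elem;
}
```

At 16 and 24 bytes per element this gives 2,192,363,520 and 3,288,545,280 bytes, the 2–3 GB range above. At 4 bytes per element (flat float32) it gives 548,090,880 bytes, about 523 MiB.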

What FWH needs, by priority — and why

#	Need	Why it is necessary
1	Flat C-backed tensor (`FW_Tensor`)	A matrix is currently an array-of-arrays-of-double; 137M params that way ≈ 2–3 GB and cache-hostile. The same data as a flat float32 buffer is ~548 MB (≈523 MiB), contiguous, and `fread`-able straight from safetensors. Unblocks everything.
2	BLAS/SIMD matmul	`hb_MatrixMultiply` is a naive triple loop. A transformer is mostly large matmuls; tuned BLAS uses SIMD, blocking and all cores for 10–100× — the line between "seconds per token" and "minutes per token".
3	Autograd (tape-based)	PyTorch's killer feature. Today every backward is hand-derived per layer — error-prone and a wall against new architectures. A tape gives gradients from the forward pass for free. Needed for training, not inference.
4	float32 / bf16 / fp16	We compute in 64-bit double. Smaller dtypes halve/quarter memory and bandwidth — and bandwidth, not flops, is the usual CPU bottleneck.
5	GPU backend (cuBLAS/OpenCL)	CPUs top out at a few hundred GFLOP/s; GPUs reach 10–100 TFLOP/s. Where big-model power and training-at-scale live. Optional for small/medium inference.
6	Kernel zoo	Naive attention materializes the full seq×seq matrix (O(n²) memory); fused/flash kernels avoid that. A broad correct kernel set lets you build many models, not one.
7	Model IO (safetensors, GGUF, ONNX)	How you reuse the world's pretrained weights instead of training from scratch. safetensors reading is proven; GGUF opens the quantized/llama.cpp ecosystem; ONNX opens cross-framework models.
8	Ecosystem (tokenizers, datasets, Hub)	A model is useless without its exact tokenizer (bit-exact ids), data, and a way to fetch weights. `HFTokenizer` closes the tokenizer half.
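To illustrate need 1, here is a minimal sketch of a flat row-major tensor in C. The struct and function names are hypothetical and the real `FW_Tensor` layout may differ; the point is only that the whole matrix is one contiguous float32 allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout for illustration; the real FW_Tensor may differ. */
typedef struct {
    size_t rows, cols;
    float *data;          /* rows * cols contiguous float32 values, row-major */
} tensor2d;

/* Allocate a zero-filled rows x cols tensor; returns 0 on success. */
int tensor2d_init(tensor2d *t, size_t rows, size_t cols)
{
    t->rows = rows;
    t->cols = cols;
    t->data = calloc(rows * cols, sizeof(float));
    return t->data ? 0 : -1;
}

/* Element (r, c) lives at offset r * cols + c: one multiply-add,
 * no per-row pointer to follow. */
float *tensor2d_at(tensor2d *t, size_t r, size_t c)
{
    assert(r < t->rows && c < t->cols);
    return &t->data[r * t->cols + c];
}

void tensor2d_free(tensor2d *t)
{
    free(t->data);
    t->data = NULL;
}
```

Because `data` is a single buffer, a float32 tensor from a safetensors file can be read into it with one `fread` of `rows * cols * sizeof(float)` bytes, assuming a little-endian host.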

Pragmatic phases

Phase	Deliverable	Notes
1	`FW_Tensor` flat C type + ops on it	Removes the memory blocker (foundation shipped: tensor, MatMul, safetensors loader)
2	BLAS backend for matmul	10–100× speedup
3	`GPT2Model` inference-only loading safetensors	THE headline demo: real pretrained GPT-2 generating text in FWH/CPU. Strictly requires only Phases 1 and 3; the ~1–5 tok/s estimate assumes the Phase 2 BLAS backend.
4 (opt)	Tape autograd for training; GPU backend	Toward training real models

Key insight: GPT-2 inference needs no autograd — only Phase 1 + Phase 3. That is the most direct path to a headline feature.

Why this order

Phase 1 first because it is a hard dependency of everything else — you cannot hold, load, or efficiently compute on real weights while data lives in nested arrays. It also pays off immediately (loading safetensors).
Phase 2 before 3 because a correct-but-slow forward (naive matmul) may be too slow to demo; BLAS makes it usable. Phase 3 can start on naive matmul and speed up once Phase 2 lands.
Phase 3 before 4 because inference delivers the headline result with the least machinery (no autograd, no GPU). Training is strictly more work and can follow.
Each phase is independently shippable and leaves the library in a working state.
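For reference, this is what "naive matmul" means on flat buffers, next to a reordered variant that is friendlier to the cache. It is a sketch with hypothetical function names, not the source of `hb_MatrixMultiply` or `FWT_MatMul`, and the reordering is only a cheap first step, not a substitute for a tuned BLAS.

```c
#include <assert.h>
#include <stddef.h>

/* C[m x n] = A[m x k] * B[k x n], all row-major flat float32 buffers.
 * Textbook i-j-k order: the inner loop walks B down a column (stride n). */
void matmul_naive(const float *a, const float *b, float *c,
                  size_t m, size_t k, size_t n)
{
    assert(a != 0 && b != 0 && c != 0);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
}

/* Same arithmetic in i-k-j order: the inner loop reads B and updates C
 * along contiguous rows. */
void matmul_ikj(const float *a, const float *b, float *c,
                size_t m, size_t k, size_t n)
{
    assert(a != 0 && b != 0 && c != 0);
    for (size_t i = 0; i < m * n; i++)
        c[i] = 0.0f;
    for (size_t i = 0; i < m; i++)
        for (size_t p = 0; p < k; p++) {
            float aip = a[i * k + p];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aip * b[p * n + j];
        }
}
```

Both functions produce the same result, so a Phase 3 forward pass written against one signature can swap in a faster kernel later without other changes.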

Concrete HuggingFace utilities for FWH apps (the "FWAI" objective)

Objective: make FWH a first-class AI app platform — any HuggingFace-hosted capability usable from Harbour in a few lines, via three interchangeable backends: remote (Inference API, breadth today), local (FW_Tensor / llama.cpp, privacy + offline) and from-scratch (educational). FiveWin apps are DBF/SQL + xBrowse business apps, so the high-value utilities are the ones that plug AI into that data.

Utility	What it does in an FWH app	Backend	Value
Semantic search over DBF/SQL	find customers/invoices/products by meaning, not LIKE	embeddings (API `tembeddings` / local) + cosine	★★★
RAG / doc Q&A	answer natural-language questions over company PDFs/manuals	embeddings + retrieval + LLM	★★★
Text classification	auto-categorize tickets, emails, expenses; intent routing	API / small local transformer	★★★
NER (extraction)	pull names/dates/amounts/tax-ids from free text → DB fields	API / local	★★★
Speech-to-text (Whisper)	dictation into GETs, transcribe calls/voice notes	API / whisper.cpp binding	★★★
Summarization	long notes/reports/emails → summary field	API / llama.cpp	★★
Translation	multilingual invoices / UI / content	API	★★
Sentiment	score reviews / customer feedback	API / local	★★
OCR / image→text	scan receipts/IDs → text → DB	API / local VLM	★★
In-app chat assistant	query the app's data in natural language; help	llama.cpp local / API	★★★
GET autocomplete	next-word / field prediction	local GPT-2 (feasible now)	★★
Data normalization	fix addresses, dedupe, standardize	LLM API/local	★★

First five to build (max value / least effort)

TSemanticIndex — index a DBF/SQL column into embeddings; :Search(cText) returns records by cosine similarity. The killer feature for data apps; reuses tembeddings.
TChatAgent — chat over the app's data (simple function-calling: the LLM asks for queries, the app returns rows). Backend llama.cpp (local) or API.
THFTask — thin base + :Classify / :NER / :Summarize / :Translate / :Sentiment over the Inference API (each ~20 lines of curl, like tembeddings). Five utilities at once.
TWhisperCpp — binding to whisper.cpp for offline dictation → GET text.
GET autocomplete with local GPT-2 — everything needed already exists.
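The core of a `:Search` by cosine similarity fits in a few lines. The sketch below is in C with hypothetical names and is not the TSemanticIndex implementation. It assumes every stored embedding and the query are already L2-normalized, in which case cosine similarity reduces to a dot product.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two dim-length vectors. For unit-length embeddings this
 * equals their cosine similarity. */
float vec_dot(const float *a, const float *b, size_t dim)
{
    float acc = 0.0f;
    for (size_t i = 0; i < dim; i++)
        acc += a[i] * b[i];
    return acc;
}

/* Return the index of the stored row most similar to the query.
 * rows: n_rows * dim floats, row-major, one embedding per record.
 * Assumption: rows and query are already normalized to unit length. */
size_t best_match(const float *rows, size_t n_rows, size_t dim,
                  const float *query)
{
    assert(n_rows > 0);
    size_t best = 0;
    float best_score = vec_dot(rows, query, dim);
    for (size_t r = 1; r < n_rows; r++) {
        float s = vec_dot(rows + r * dim, query, dim);
        if (s > best_score) {
            best_score = s;
            best = r;
        }
    }
    return best;
}
```

The returned index would map back to a record number in the DBF or a key in the SQL table. A real index would return the top few matches instead of one, and the linear scan is adequate only up to a moderate number of rows.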

Local vs cloud API (e.g. TDeepSeek) — when to use each

Not either/or. A cloud API (TDeepSeek, OpenAI, Gemini…) wins on raw capability; local wins on the deployment context. Pick the backend per task.

Where local (FW_Tensor / GPT-2 / llama.cpp) wins:

Privacy / data sovereignty — data never leaves the machine (GDPR, medical, legal, financial, defense). A cloud API ships your data to a third party.
Offline — factories, ships, isolated/air-gapped networks.
Zero per-call cost — embeddings over 1M DBF rows, autocomplete on every keystroke: free locally; metered + rate-limited via API.
Latency — local classify/embed is instant; per-keystroke (<50 ms) is impossible across a network round-trip.
No dependency / longevity — APIs change, deprecate, raise prices, go down, geo-block. A local model is yours forever and ships inside the .exe.
Determinism / auditability — fixed weights = reproducible output; regulated industries need this.
Embeddable / no account — AI in a desktop app with zero setup, no per-customer API key, no signup.
Customization — train tiny domain models on the customer's own data (Track A).

Where the cloud API wins (be honest):

Raw capability — DeepSeek-V3/R1 ≫ GPT-2 or anything you run locally on CPU. Hard reasoning / high-quality chat → cloud wins decisively.
Zero infrastructure — no 500 MB model, no RAM/GPU footprint.
Always the latest model, maintenance-free.

Rule of thumb: cloud for heavy reasoning / quality chat on non-sensitive, online data; local for sensitive data, offline, high-volume/cheap (semantic search over a whole DB), per-keystroke latency, embedded zero-setup, regulated/auditable.

The key insight — hybrid. Local complements the cloud, it does not replace it. Example RAG: retrieval with local embeddings (cheap, private, over hundreds of thousands of records), then generate the answer with the cloud API (quality). The interchangeable-backend design lets the developer choose per task, and the app is never locked to one provider.
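The per-task choice can be written down as a small selection function. This is only a sketch of the rule of thumb in C: the type and field names are hypothetical, and two details are assumptions of the sketch, namely that hard constraints override quality and that tasks with no special requirement default to local.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BACKEND_LOCAL, BACKEND_CLOUD } backend;

/* Hypothetical task description; field names are illustrative. */
typedef struct {
    bool sensitive_data;    /* data must not leave the machine */
    bool offline;           /* no network available */
    bool per_keystroke;     /* needs sub-50 ms latency */
    bool heavy_reasoning;   /* hard reasoning or high-quality chat */
} task_profile;

backend choose_backend(const task_profile *t)
{
    assert(t != 0);
    /* Hard constraints first: these rule the cloud out regardless of quality
     * (assumption of this sketch). */
    if (t->sensitive_data || t->offline || t->per_keystroke)
        return BACKEND_LOCAL;
    /* Otherwise quality decides; default to local (also an assumption). */
    return t->heavy_reasoning ? BACKEND_CLOUD : BACKEND_LOCAL;
}
```

A hybrid pipeline such as the RAG example simply calls this once per stage: the retrieval stage is described as high-volume and private, the answer stage as heavy reasoning.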

Status

✓ HFTokenizer (GPT-2 BPE), bit-exact — in lib
✓ Tokenizer → Transformer bridge (compact vocab) + TFWLanguageModel wrapper
✓ GPT-2 safetensors read + float32 decode spike
✓ Track A: train on real Shakespeare text (overfit a line; loss drops on a slice)
■ Phase 1 foundation: FW_Tensor (flat float32, MatMul, safetensors loader). Remaining: full tensor op set (add, softmax, layernorm, GELU), strides/views.
□ Phase 2: BLAS
□ Phase 3: GPT2Model inference
□ Phase 4: autograd / GPU
□ Track A: top-p / nucleus + repetition-penalty sampling

Next / pending work

Ordered by value; everything builds on what already ships (in lib, hb32).

TLlamaCpp (Track C) — FFI binding to llama.cpp for real local LLMs (Llama/Mistral/Qwen/Phi), CPU/GPU, 4-bit quantized, single self-contained exe (no Ollama install). Uses the proven optional-module pattern from TWhisperCpp; plugs into TChatAgent as bChat.
GET autocomplete with local GPT-2 — last of the first-five FWAI utilities; everything needed exists.
Phase 2 — BLAS backend for FWT_MatMul (10–100×), plus a KV cache in GPT2Model:Generate (currently O(n²)).
More FWAI utilities — OCR / image→text, a RAG pipeline (TSemanticIndex retrieval + LLM answer), data normalization.
Quantization + GGUF reader for FW_Tensor (llama.cpp lessons); mmap weights in FWT_LoadSafe.
Build all library variants — this effort rebuilt only hb32; before a release, rebuild every compiler target so all have the new AI classes.
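The KV cache mentioned under Phase 2 is a simple data structure. Below is a minimal sketch in C for one attention head of one layer; all names and sizes are hypothetical, and scaling, softmax and the weighted sum over V are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KV_MAX_TOKENS 64   /* illustrative window; GPT-2's real window is 1024 */
#define KV_HEAD_DIM    4   /* illustrative head size */

/* Keys and values for one attention head of one layer. */
typedef struct {
    float  k[KV_MAX_TOKENS][KV_HEAD_DIM];
    float  v[KV_MAX_TOKENS][KV_HEAD_DIM];
    size_t len;            /* tokens cached so far */
} kv_cache;

void kv_reset(kv_cache *c) { c->len = 0; }

/* Store K and V for the newest token only; earlier rows are reused as-is.
 * Returns 0 on success, -1 when the window is full. */
int kv_append(kv_cache *c, const float *k, const float *v)
{
    if (c->len == KV_MAX_TOKENS)
        return -1;
    memcpy(c->k[c->len], k, sizeof c->k[0]);
    memcpy(c->v[c->len], v, sizeof c->v[0]);
    c->len++;
    return 0;
}

/* Raw attention scores of the new token's query against every cached key:
 * len dot products per step instead of rebuilding all len x len scores. */
void kv_scores(const kv_cache *c, const float *q, float *out)
{
    assert(c->len > 0);
    for (size_t t = 0; t < c->len; t++) {
        float acc = 0.0f;
        for (size_t d = 0; d < KV_HEAD_DIM; d++)
            acc += q[d] * c->k[t][d];
        out[t] = acc;
    }
}
```

During generation, each step computes Q, K and V for the newest token only, appends K and V, and scores the new query against the cache. The cost is memory: one such cache per head per layer, sized to the context window.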

Reusable optional-binding pattern (for any heavy native dep — llama.cpp, whisper.cpp, GPU libs): guarded C wrapper kept out of fwhc.hbp; PRG class in the FWH lib calling the natives by name (no link dependency); an IsAvailable() check; the dependency is added only by the app that wants it. Powerful AI stays optional and zero-cost for every other FWH user.
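The mechanism behind this pattern can be modeled in plain C. The sketch below is a simplified stand-in for a runtime symbol lookup, not Harbour's actual symbol table or any FWH code, and every name in it is hypothetical: the optional module registers its functions by name, and the caller resolves them at run time, so nothing fails to link when the module is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*native_fn)(int);

typedef struct {
    const char *name;
    native_fn   fn;
} native_entry;

#define MAX_NATIVES 16
static native_entry g_natives[MAX_NATIVES];
static size_t       g_native_count;

/* Called by the optional module when (and only when) an app includes it. */
int native_register(const char *name, native_fn fn)
{
    assert(name != NULL && fn != NULL);
    if (g_native_count == MAX_NATIVES)
        return -1;
    g_natives[g_native_count].name = name;
    g_natives[g_native_count].fn = fn;
    g_native_count++;
    return 0;
}

/* Look a native up by name at run time; NULL means "module not present". */
native_fn native_find(const char *name)
{
    for (size_t i = 0; i < g_native_count; i++)
        if (strcmp(g_natives[i].name, name) == 0)
            return g_natives[i].fn;
    return NULL;
}

/* What a class-level IsAvailable() check could boil down to. */
int native_is_available(const char *name)
{
    return native_find(name) != NULL;
}

/* Stand-in for a wrapper function that would live in the optional module. */
int demo_native_double(int x) { return 2 * x; }
```

An app that never registers the module sees `native_is_available()` return 0 and can disable the feature gracefully, which is the zero-cost behavior described above.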