"PyTorch-lite in FWH" — Roadmap

Source notes: docs/pytorch-lite-roadmap.md

This roadmap covers the effort to bring real transformer power into FiveWin/Harbour (FWH) apps by reusing HuggingFace resources. It explains why each step is needed and the order in which the steps land. It is a living plan; see the Transformer and HFTokenizer pages for what already ships.

Premise

Harbour-the-language is not inferior to Python for this. PyTorch is a thin Python layer over a C++/CUDA core. FWH already ships C extensions (source/function/matrixes.c). The gap to close is that C core, not the language.

Three tracks

Keep these distinct to set honest expectations:

The phases below raise Track A's ceiling. Track C is a separate, complementary effort (an FFI binding, not a from-scratch core) and is the pragmatic path if the goal is production LLM power in FWH apps now.

Why not make our Transformer weight-compatible with llama.cpp models

Low value, high cost. LLaMA/Mistral use RoPE (not learned positions), RMSNorm (not LayerNorm), SwiGLU (not GELU/ReLU) and GQA (grouped-query attention) — different from GPT-2 and from our class. Running their weights would mean a per-architecture rewrite plus GGUF block dequantization. Binding libllama (Track C) gets the same capability for a fraction of the effort.
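
To make one of those differences concrete, here is a minimal sketch of LayerNorm (GPT-2 style) next to RMSNorm (LLaMA style). It is illustrative Python, not FWH code; the function names and the eps value are assumptions chosen for the example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the std, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root-mean-square; no mean subtraction, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * g for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln = layer_norm(x, ones, zeros)   # centred around zero
rn = rms_norm(x, ones)            # same sign pattern as x, unit RMS
```

The two functions take different parameters (RMSNorm has no beta) and give different outputs for the same input, so weights trained for one cannot simply be dropped into the other. The same holds for RoPE, SwiGLU and GQA.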

Lessons from llama.cpp

Fold these into Track A so FW_Tensor scales:

Lesson | What it is | Where it applies
Quantization (Q4_K/Q8_0) | weights in 4–8-bit blocks + per-block scale | 4-bit = 1/8 the memory of float32; how 7B fits in 4 GB. FW_Tensor should support quantized blocks + dequant-in-matmul
mmap weights | memory-map the file; OS pages on demand | FWT_LoadSafe currently freads; mmap = "load" 500 MB instantly, lazy paging
KV cache | cache K,V per layer across generation steps | our Generate recomputes the whole window each token (O(n²) attention per token); with a KV cache only the new token runs through each layer (O(n) attention per token)
ggml graph + backends | op graph + pluggable CPU/CUDA/Metal backend + arena allocator | architecture target for FW_Tensor (aligns with Phase 2/4)
SIMD + threaded + blocked matmul | cache-aware multicore kernels | our matmul is a triple loop → Phase 2 (BLAS)
GGUF single-file format | metadata + tensors + quant in one mmap'd file | a GGUF reader (like our safetensors reader) opens the quantized-model ecosystem
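
The quantization row can be sketched as follows. This is a simplified 8-bit block scheme loosely modelled on Q8_0 (32 values per block, one scale per block). It is illustrative Python, not the GGUF byte layout and not FW_Tensor code: the scale is kept as a Python float here, and all names are assumptions.

```python
import math

BLOCK = 32  # values per block (assumed block size for this sketch)

def quantize_q8(values):
    """Split into blocks; each block becomes (scale, list of ints in -127..127)."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        quants = [max(-127, min(127, round(v / scale))) for v in chunk]
        blocks.append((scale, quants))
    return blocks

def dequantize_q8(blocks):
    """Expand every block back to floats: value = scale * quant."""
    return [scale * q for scale, quants in blocks for q in quants]

def dot_q8(blocks, x):
    """Dot product that dequantizes on the fly, one block at a time."""
    total = 0.0
    for b, (scale, quants) in enumerate(blocks):
        acc = 0.0
        for j, q in enumerate(quants):
            acc += q * x[b * BLOCK + j]
        total += scale * acc  # apply the block scale once per block
    return total

weights = [0.1 * math.sin(i) for i in range(64)]
blocks = quantize_q8(weights)
restored = dequantize_q8(blocks)
activations = [0.5] * 64
approx = dot_q8(blocks, activations)
```

The point of dot_q8 is that the float weights never need to exist in memory as a whole: the per-block scale is applied to an integer accumulation, which is what "dequant-in-matmul" refers to.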

GPT-2 weight-loading spike (findings)

The sample samples/ai/gpt2spike.prg proved that pure Harbour can read a real GPT-2 model.safetensors: it parsed the header (160 tensors, 137,022,720 params), validated shapes (wte[50257,768], wpe[1024,768]), and decoded a real float32 weight, fetching only the 16 KB header via an HTTP range request rather than the 500 MB file.
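
For readers unfamiliar with the format, the sketch below shows the same steps on a tiny in-memory file: read the 8-byte little-endian header length, parse the JSON header, count parameters from the shapes, and decode one float32 by offset. It is illustrative Python, not the code in gpt2spike.prg; the helper names and the toy tensor are assumptions.

```python
import json
import struct

def parse_header(blob):
    """Return (tensor table, offset where the tensor data begins)."""
    (n,) = struct.unpack_from("<Q", blob, 0)       # u64 little-endian header size
    header = json.loads(blob[8:8 + n].decode("utf-8"))
    header.pop("__metadata__", None)               # optional free-form metadata
    return header, 8 + n

def count_params(header):
    """Sum the element counts implied by every tensor shape."""
    total = 0
    for info in header.values():
        size = 1
        for dim in info["shape"]:
            size *= dim
        total += size
    return total

def read_f32(blob, header, data_start, name, index):
    """Decode one float32 element of a tensor without loading the rest."""
    info = header[name]
    assert info["dtype"] == "F32"
    begin, _end = info["data_offsets"]             # relative to data_start
    (value,) = struct.unpack_from("<f", blob, data_start + begin + 4 * index)
    return value

# Build a toy file holding one 2x3 float32 tensor (not real GPT-2 data).
values = [0.5, -1.25, 2.0, 3.5, -4.0, 8.0]
payload = struct.pack("<6f", *values)
meta = {"wte": {"dtype": "F32", "shape": [2, 3], "data_offsets": [0, len(payload)]}}
meta_bytes = json.dumps(meta).encode("utf-8")
blob = struct.pack("<Q", len(meta_bytes)) + meta_bytes + payload

header, data_start = parse_header(blob)
n_params = count_params(header)
sample = read_f32(blob, header, data_start, "wte", 4)
```

Because the header sits at the very start of the file and carries every shape and offset, a range request for the first few kilobytes is enough to inspect a remote model.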

Blocker found: 137M params stored as Harbour nested arrays (array of array of double, ~16–24 bytes/element) take ~2–3 GB of RAM and are slow to traverse. The current matrix backend does not scale, hence Phase 1. Real GPT-2 also needs an exact architecture match (pre-norm, GELU, learned positional embeddings, combined QKV with bias, weight-tied lm_head).
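
The arithmetic behind that estimate, as a short check. The per-element costs are the figures quoted above; the helper names are only for this example.

```python
PARAMS = 137_022_720  # tensor elements reported by the spike

def gib(n_bytes):
    return n_bytes / 2 ** 30

def mib(n_bytes):
    return n_bytes / 2 ** 20

nested_low = PARAMS * 16    # nested arrays, low estimate per element
nested_high = PARAMS * 24   # nested arrays, high estimate per element
flat_f32 = PARAMS * 4       # one contiguous float32 buffer

print("nested arrays: %.2f to %.2f GiB" % (gib(nested_low), gib(nested_high)))
print("flat float32:  %.0f MiB" % mib(flat_f32))
```

The flat float32 buffer is one quarter to one sixth the size of the nested-array form, before counting the cache locality it gains from being contiguous.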

What FWH needs, by priority — and why

# | Need | Why it is necessary
1 | Flat C-backed tensor (FW_Tensor) | A matrix is currently an array-of-arrays-of-double; 137M params that way ≈ 2–3 GB and cache-hostile. The same data as a flat float32 buffer is ~523 MB, contiguous, and fread-able straight from safetensors. Unblocks everything.
2 | BLAS/SIMD matmul | hb_MatrixMultiply is a naive triple loop. A transformer is mostly large matmuls; tuned BLAS uses SIMD, blocking and all cores for 10–100×: the line between "seconds per token" and "minutes per token".
3 | Autograd (tape-based) | PyTorch's killer feature. Today every backward is hand-derived per layer, which is error-prone and a wall against new architectures. A tape gives gradients from the forward pass for free. Needed for training, not inference.
4 | float32 / bf16 / fp16 | We compute in 64-bit double. Smaller dtypes halve/quarter memory and bandwidth; bandwidth, not flops, is the usual CPU bottleneck.
5 | GPU backend (cuBLAS/OpenCL) | CPUs top out at a few hundred GFLOP/s; GPUs reach 10–100 TFLOP/s. Where big-model power and training-at-scale live. Optional for small/medium inference.
6 | Kernel zoo | Naive attention materializes the full seq×seq matrix (O(n²) memory); fused/flash kernels avoid that. A broad correct kernel set lets you build many models, not one.
7 | Model IO (safetensors, GGUF, ONNX) | How you reuse the world's pretrained weights instead of training from scratch. safetensors reading is proven; GGUF opens the quantized/llama.cpp ecosystem; ONNX opens cross-framework models.
8 | Ecosystem (tokenizers, datasets, Hub) | A model is useless without its exact tokenizer (bit-exact ids), data, and a way to fetch weights. HFTokenizer closes the tokenizer half.
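
Row 3 is the least obvious of these, so here is a minimal sketch of what "tape-based" means: every forward operation records its inputs and local gradients on a tape, and the backward pass replays the tape in reverse. It is illustrative scalar Python, not a proposed FW_Tensor design; the class and function names are assumptions.

```python
class Tape:
    def __init__(self):
        # One entry per forward op: (output Var, [(input Var, local gradient), ...])
        self.ops = []

class Var:
    def __init__(self, value, tape):
        self.value = value
        self.grad = 0.0
        self.tape = tape

    def _record(self, value, parents):
        out = Var(value, self.tape)
        self.tape.ops.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])

def backward(tape, loss):
    """Walk the tape in reverse, pushing each output gradient to its inputs."""
    loss.grad = 1.0
    for out, parents in reversed(tape.ops):
        for parent, local in parents:
            parent.grad += local * out.grad

tape = Tape()
w, x, b = Var(3.0, tape), Var(2.0, tape), Var(1.0, tape)
y = w * x + b      # forward pass only; nothing is derived by hand
loss = y * y
backward(tape, loss)
```

No backward formula was written for the expression itself; only the two primitive ops know their local gradients. Adding a new layer type then means adding primitives, not re-deriving a whole backward pass.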

Pragmatic phases

Phase | Deliverable | Notes
1 | FW_Tensor flat C type + ops on it | Removes the memory blocker (foundation shipped: tensor, MatMul, safetensors loader)
2 | BLAS backend for matmul | 10–100× speedup
3 | GPT2Model inference-only loading safetensors | THE headline demo: real pretrained GPT-2 generating text in FWH/CPU. ~1–5 tok/s with BLAS. Needs only Phase 1+3.
4 (opt) | Tape autograd for training; GPU backend | Toward training real models

Key insight: GPT-2 inference needs no autograd — only Phase 1 + Phase 3. That is the most direct path to a headline feature.

Why this order

Concrete HuggingFace utilities for FWH apps (the "FWAI" objective)

Objective: make FWH a first-class AI app platform — any HuggingFace-hosted capability usable from Harbour in a few lines, via three interchangeable backends: remote (Inference API, breadth today), local (FW_Tensor / llama.cpp, privacy + offline) and from-scratch (educational). FiveWin apps are DBF/SQL + xBrowse business apps, so the high-value utilities are the ones that plug AI into that data.

Utility | What it does in an FWH app | Backend | Value
Semantic search over DBF/SQL | find customers/invoices/products by meaning, not LIKE | embeddings (API tembeddings / local) + cosine | ★★★
RAG / doc Q&A | answer natural-language questions over company PDFs/manuals | embeddings + retrieval + LLM | ★★★
Text classification | auto-categorize tickets, emails, expenses; intent routing | API / small local transformer | ★★★
NER (extraction) | pull names/dates/amounts/tax-ids from free text → DB fields | API / local | ★★★
Speech-to-text (Whisper) | dictation into GETs, transcribe calls/voice notes | API / whisper.cpp binding | ★★★
Summarization | long notes/reports/emails → summary field | API / llama.cpp | ★★
Translation | multilingual invoices / UI / content | API | ★★
Sentiment | score reviews / customer feedback | API / local | ★★
OCR / image→text | scan receipts/IDs → text → DB | API / local VLM | ★★
In-app chat assistant | query the app's data in natural language; help | llama.cpp local / API | ★★★
GET autocomplete | next-word / field prediction | local GPT-2 (feasible now) | ★★
Data normalization | fix addresses, dedupe, standardize | LLM API/local | ★★

First five to build (max value / least effort)

  1. TSemanticIndex — index a DBF/SQL column into embeddings; :Search(cText) returns records by cosine similarity. The killer feature for data apps; reuses tembeddings.
  2. TChatAgent — chat over the app's data (simple function-calling: the LLM asks for queries, the app returns rows). Backend llama.cpp (local) or API.
  3. THFTask — thin base + :Classify / :NER / :Summarize / :Translate / :Sentiment over the Inference API (each ~20 lines of curl, like tembeddings). Five utilities at once.
  4. TWhisperCpp — binding to whisper.cpp for offline dictation → GET text.
  5. GET autocomplete with local GPT-2 — everything needed already exists.
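
The shape of item 1 can be sketched in a few lines: store one embedding per record, then rank records by cosine similarity to the query embedding. This is illustrative Python, not the TSemanticIndex class. The word-count "embedding" is a stand-in for a real embedding model, and every name below is an assumption.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticIndex:
    def __init__(self, embed=toy_embed):
        self.embed = embed   # swap in an API or local model here
        self.rows = []       # (record id, embedding)

    def add(self, rec_id, text):
        self.rows.append((rec_id, self.embed(text)))

    def search(self, query, top_k=3):
        q = self.embed(query)
        scored = [(cosine(q, vec), rec_id) for rec_id, vec in self.rows]
        scored.sort(reverse=True)   # best score first
        return [rec_id for _score, rec_id in scored[:top_k]]

index = SemanticIndex()
index.add(1, "overdue invoice for office chairs")
index.add(2, "customer asked about delivery delay")
index.add(3, "invoice paid in full")
hits = index.search("office chairs invoice", top_k=2)
```

With a real embedding model in place of toy_embed, the same ranking step matches by meaning, not by shared words; the indexing and search code does not change, which is why the embedding backend can stay interchangeable.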

Local vs cloud API (e.g. TDeepSeek) — when to use each

Not either/or. A cloud API (TDeepSeek, OpenAI, Gemini…) wins on raw capability; local wins on the deployment context. Pick the backend per task.

Where local (FW_Tensor / GPT-2 / llama.cpp) wins:

Where the cloud API wins (be honest):

Rule of thumb: cloud for heavy reasoning / quality chat on non-sensitive, online data; local for sensitive data, offline, high-volume/cheap (semantic search over a whole DB), per-keystroke latency, embedded zero-setup, regulated/auditable.

The key insight is hybrid use: local complements the cloud; it does not replace it. Example (RAG): retrieve with local embeddings (cheap, private, over hundreds of thousands of records), then generate the answer with the cloud API (quality). The interchangeable-backend design lets the developer choose per task, and the app is never locked to one provider.
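
A minimal sketch of that hybrid split, assuming the generator is passed in as a plain callable so that a cloud API or a local model can be swapped without touching the retrieval step. It is illustrative Python; the word-overlap retrieval and the recording backend are placeholders, and no network call is made.

```python
def retrieve(query, docs, top_k=2):
    """Local retrieval stand-in: rank documents by words shared with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def answer(query, docs, generate):
    """RAG: retrieve locally, then send only the selected context to the generator."""
    context = retrieve(query, docs)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return generate(prompt)

# Placeholder backend that records what it was sent; a real app would
# call a cloud API or a local model here.
seen_prompts = []

def recording_backend(prompt):
    seen_prompts.append(prompt)
    return "stub answer"

docs = [
    "Invoice 1001 is overdue by 30 days",
    "The warehouse closes at 18:00",
    "Invoice 1002 was paid on time",
]
reply = answer("which invoice is overdue", docs, recording_backend)
```

Only the retrieved lines reach the generator, so the bulk of the data stays local even when the generator is a cloud service.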

Status

Next / pending work

Ordered by value; everything builds on what already ships (in lib, hb32).

  1. TLlamaCpp (Track C) — FFI binding to llama.cpp for real local LLMs (Llama/Mistral/Qwen/Phi), CPU/GPU, 4-bit quantized, single self-contained exe (no Ollama install). Uses the proven optional-module pattern from TWhisperCpp; plugs into TChatAgent as bChat.
  2. GET autocomplete with local GPT-2 — last of the first-five FWAI utilities; everything needed exists.
  3. Phase 2 — BLAS backend for FWT_MatMul (10–100×), plus a KV cache in GPT2Model:Generate (currently O(n²)).
  4. More FWAI utilities — OCR / image→text, a RAG pipeline (TSemanticIndex retrieval + LLM answer), data normalization.
  5. Quantization + GGUF reader for FW_Tensor (llama.cpp lessons); mmap weights in FWT_LoadSafe.
  6. Build all library variants — this effort rebuilt only hb32; before a release, rebuild every compiler target so all have the new AI classes.
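
The KV cache mentioned in item 3 can be sketched for a single attention head: each step projects only the new token, appends its key and value to the cache, and attends over the cache. This is illustrative Python with toy 2×2 weights, not GPT2Model:Generate; all names and numbers are assumptions.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention of one query over every position seen so far."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

class CachedHead:
    def __init__(self, wq, wk, wv):
        self.wq, self.wk, self.wv = wq, wk, wv
        self.keys, self.values = [], []   # the KV cache

    def step(self, x):
        """Process ONE new token: project it, extend the cache, attend."""
        self.keys.append(matvec(self.wk, x))
        self.values.append(matvec(self.wv, x))
        return attend(matvec(self.wq, x), self.keys, self.values)

def last_output_no_cache(tokens, wq, wk, wv):
    """Cache-free step: re-project every token in the window each time."""
    keys = [matvec(wk, t) for t in tokens]
    values = [matvec(wv, t) for t in tokens]
    return attend(matvec(wq, tokens[-1]), keys, values)

WQ = [[1.0, 0.0], [0.0, 1.0]]
WK = [[0.5, 0.1], [-0.2, 0.3]]
WV = [[1.0, 2.0], [3.0, 4.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

head = CachedHead(WQ, WK, WV)
cached = [head.step(t) for t in tokens]
full = last_output_no_cache(tokens, WQ, WK, WV)
```

Both paths give the same output for the newest token, but the cached path does one projection per step where the cache-free path does one per token in the window. In a multi-layer model the saving applies at every layer.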

Reusable optional-binding pattern (for any heavy native dep — llama.cpp, whisper.cpp, GPU libs): guarded C wrapper kept out of fwhc.hbp; PRG class in the FWH lib calling the natives by name (no link dependency); an IsAvailable() check; the dependency is added only by the app that wants it. Powerful AI stays optional and zero-cost for every other FWH user.
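
The pattern translates to any language with run-time symbol lookup. As an illustration only (Python with ctypes, not the FWH C wrapper), the sketch below resolves a native library by name at run time, exposes an availability check, and fails with a clear error only when a caller actually uses the missing dependency. The class name and the library file name are hypothetical.

```python
import ctypes

class OptionalLib:
    """Resolve a native library by file name at run time; never a hard dependency."""

    def __init__(self, filename):
        self.filename = filename
        try:
            self._lib = ctypes.CDLL(filename)
        except OSError:
            self._lib = None   # library not installed: stay usable, just unavailable

    def is_available(self):
        return self._lib is not None

    def call(self, func_name, *args):
        """Look the function up by name only when it is actually needed."""
        if self._lib is None:
            raise RuntimeError("optional library %r is not installed" % self.filename)
        return getattr(self._lib, func_name)(*args)

# Hypothetical file name, chosen so that it does not exist on the machine.
missing = OptionalLib("libfwh_no_such_dependency_xyz.so")
if missing.is_available():
    status = "native backend loaded"
else:
    status = "native backend absent; feature disabled"
```

Code that never touches the optional class pays nothing for it, and code that does can branch on is_available() before offering the feature.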