GPT2Model - Pretrained GPT-2 inference in FWH
Source: source/classes/gpt2model.prg (built on FW_Tensor, source/function/fwtensor.c)
GPT2Model runs the real pretrained GPT-2 entirely in
FWH/Harbour on the CPU. It loads the actual GPT-2 weights from a HuggingFace
model.safetensors file and executes the full forward pass on the flat
FW_Tensor backend: token and positional embeddings, pre-norm, multi-head
causal attention, GELU MLP, residuals, final norm and a weight-tied output head.
It is Phase 3 of the PyTorch-lite roadmap and has been verified numerically
against a reference: the prompt "The capital of France is" yields " the"
(token 262) as the next token.
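As an illustration of the GELU step in the MLP, here is a minimal scalar sketch of the tanh-approximation GELU. It rests on the assumption that this is the variant the GPT-2 checkpoint expects; `GeluTanh` is a hypothetical name, and the actual FW_Tensor kernel may be written differently.

```harbour
// Sketch only: tanh-approximation GELU applied to a single number.
// GeluTanh is a hypothetical helper, not part of GPT2Model or FW_Tensor.
FUNCTION GeluTanh( x )

   // z = sqrt( 2 / pi ) * ( x + 0.044715 * x^3 )
   LOCAL z := Sqrt( 2 / 3.141592653589793 ) * ( x + 0.044715 * x * x * x )
   // tanh( z ) written as 1 - 2 / ( e^(2z) + 1 ), which stays finite for large |z|
   LOCAL t := 1 - 2 / ( Exp( 2 * z ) + 1 )

RETURN 0.5 * x * ( 1 + t )
```

For large positive inputs the result approaches x, and for large negative inputs it approaches 0, which is the expected shape of the activation.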
Getting the weights
Download GPT-2 small from the HuggingFace Hub (it is not shipped with FWH):
https://huggingface.co/gpt2/resolve/main/model.safetensors (~500 MB,
124M params) plus tokenizer.json for
HFTokenizer. Weights are read one tensor
at a time via FWT_LoadSafe (no full-file load), so peak memory is roughly the
float32 weights themselves (124M parameters × 4 bytes ≈ 500 MB), which fits in a 32-bit process.
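To see why tensor-by-tensor loading is possible, it helps to look at the safetensors layout: the file starts with an 8-byte little-endian header length, followed by a JSON header that describes every tensor. The sketch below reads only that header and lists the tensor names. It is an illustration using core Harbour functions, not the code inside FWT_LoadSafe; `SafeTensorNames` is a hypothetical helper.

```harbour
#include "fileio.ch"

// Sketch only: list the entries of a .safetensors JSON header.
// SafeTensorNames is a hypothetical helper, not part of FWH.
FUNCTION SafeTensorNames( cFile )

   LOCAL nH := FOpen( cFile, FO_READ )
   LOCAL cLen := Space( 8 )
   LOCAL cJson, nLen, hHdr
   LOCAL aNames := {}

   IF nH == F_ERROR
      RETURN aNames
   ENDIF

   // The first 8 bytes hold the header length (unsigned 64-bit, little-endian).
   // Bin2L() reads only the low 32 bits, which assumes a header under 2 GB.
   IF FRead( nH, @cLen, 8 ) == 8
      nLen := Bin2L( cLen )
      IF nLen > 0
         cJson := Space( nLen )
         IF FRead( nH, @cJson, nLen ) == nLen
            hHdr := hb_jsonDecode( cJson )
            IF hb_IsHash( hHdr )
               // Keys are tensor names; an optional "__metadata__" key may also appear
               aNames := hb_HKeys( hHdr )
            ENDIF
         ENDIF
      ENDIF
   ENDIF

   FClose( nH )

RETURN aNames
```

A possible use is `AEval( SafeTensorNames( "model.safetensors" ), { |c| QOut( c ) } )` to check which tensors a downloaded file contains before loading it.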
Methods
| Method | Description |
|---|---|
| New( n_layer, n_head, n_embd, vocab, n_ctx ) | Configure the architecture. Defaults to GPT-2 small (12, 12, 768, 50257, 1024). |
| LoadSafetensors( cFile ) | Parse the safetensors header and load every weight tensor into FW_Tensors. |
| Forward( aTok ) | Run the forward pass for a token sequence (1-based gather ids). Returns logits [seq, vocab]. |
| Generate( aTok, nNew, nTemp, nTopK ) | Autoregressive generation (greedy / temperature / top-k). Returns the full 1-based id sequence. |
Ids are 1-based for gathering (the HuggingFace 0-based id + 1); convert back with
id - 1 before decoding.
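The conversion is a plain element-wise shift. A minimal sketch, assuming the ids arrive as a Harbour array of numbers; `ShiftIds` is a hypothetical helper and not part of GPT2Model or HFTokenizer.

```harbour
// Sketch only: return a copy of aIds with nDelta added to every id.
// ShiftIds is a hypothetical helper, not part of FWH.
FUNCTION ShiftIds( aIds, nDelta )

   LOCAL aOut := Array( Len( aIds ) )
   LOCAL i

   FOR i := 1 TO Len( aIds )
      aOut[ i ] := aIds[ i ] + nDelta
   NEXT

RETURN aOut
```

With it, `ShiftIds( aIds, 1 )` turns 0-based HuggingFace ids into gather ids, and `ShiftIds( aOut, -1 )` turns generated ids back before decoding.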
Example
oTok := HFTokenizer():New( "gpt2.tokenizer.json" )
oM := GPT2Model():New() // GPT-2 small defaults
oM:LoadSafetensors( "model.safetensors" )
aIds := oTok:Encode( "The capital of France is" ) // 0-based HF ids
aTok := {} ; AEval( aIds, { |n| AAdd( aTok, n + 1 ) } )
aOut := oM:Generate( aTok, 8, 0.0001, 1 ) // greedy, 8 tokens
aHF := {} ; AEval( aOut, { |n| AAdd( aHF, n - 1 ) } )
? oTok:Decode( aHF ) // "The capital of France is the capital of the French Republic, and"
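The Generate call above passes a tiny temperature with a top-k of 1, which amounts to greedy decoding. The sketch below shows one common way to combine temperature and top-k when picking a token from a single row of logits. It is an assumption about the general technique, not the code inside Generate; `SampleTopK` is a hypothetical helper, it expects a non-empty plain Harbour array of numbers, and sorting the whole vocabulary with a codeblock is slow and used here only for clarity.

```harbour
// Sketch only: pick a 1-based token id from one row of logits.
// SampleTopK is a hypothetical helper, not part of GPT2Model.
FUNCTION SampleTopK( aLogits, nTemp, nTopK )

   LOCAL nLen := Len( aLogits )
   LOCAL aIdx := Array( nLen ), aW
   LOCAL i, nMax, nR
   LOCAL nSum := 0, nAcc := 0

   FOR i := 1 TO nLen
      aIdx[ i ] := i
   NEXT

   // Order the ids by descending logit and keep the nTopK best
   ASort( aIdx,,, { | a, b | aLogits[ a ] > aLogits[ b ] } )
   nTopK := Max( 1, Min( nTopK, nLen ) )
   ASize( aIdx, nTopK )

   IF nTopK == 1 .OR. nTemp <= 0
      RETURN aIdx[ 1 ]   // greedy: the single best id
   ENDIF

   // Softmax weights over the kept logits, scaled by the temperature
   nMax := aLogits[ aIdx[ 1 ] ]
   aW := Array( nTopK )
   FOR i := 1 TO nTopK
      aW[ i ] := Exp( ( aLogits[ aIdx[ i ] ] - nMax ) / nTemp )
      nSum += aW[ i ]
   NEXT

   // Draw from the cumulative distribution
   nR := hb_Random() * nSum
   FOR i := 1 TO nTopK
      nAcc += aW[ i ]
      IF nR < nAcc
         RETURN aIdx[ i ]
      ENDIF
   NEXT

RETURN aIdx[ nTopK ]
```

For example, `SampleTopK( { 0.1, 2.5, -1.0 }, 1.0, 1 )` returns 2, the position of the largest logit. Subtracting the maximum logit before calling Exp() keeps the weights in a safe numeric range.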
Notes
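- Inference only (no training), so no autograd is needed.
- Speed is bounded by the naive matmul; a BLAS backend (roadmap Phase 2) would speed up longer generations. There is no KV cache yet, so each step re-runs the forward pass over the whole window and the number of token positions processed grows quadratically with the generated length (O(n²)).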
- For larger/quantized models, the pragmatic route is binding llama.cpp (roadmap Track C) rather than scaling this from-scratch path.
- Test/demo: samples/ai/gpt2test.prg.
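The quadratic growth mentioned in the notes can be made concrete by counting token positions. The sketch below assumes each generation step feeds the whole sequence so far through the model and ignores the n_ctx limit; `PositionsNoCache` is a hypothetical helper used only for this arithmetic.

```harbour
// Sketch only: total token positions processed when every step
// re-runs the full sequence (no KV cache). Ignores the n_ctx limit.
FUNCTION PositionsNoCache( nPrompt, nNew )

   LOCAL nTotal := 0
   LOCAL k

   FOR k := 0 TO nNew - 1
      nTotal += nPrompt + k   // step k sees the prompt plus k generated tokens
   NEXT

RETURN nTotal
```

For a 5-token prompt and 8 new tokens this gives 5 + 6 + ... + 12 = 68 positions, whereas a cached implementation would process each position only once.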