GPT2Model - Pretrained GPT-2 inference in FWH

Source: source/classes/gpt2model.prg (built on FW_Tensor, source/function/fwtensor.c)

GPT2Model runs the real pretrained GPT-2 entirely in FWH/Harbour on the CPU. It loads the original GPT-2 weights from a HuggingFace model.safetensors file and executes the full forward pass on the flat FW_Tensor backend: token and positional embeddings, pre-norm LayerNorm, multi-head causal attention, GELU MLP, residual connections, a final LayerNorm, and a weight-tied output head. It is Phase 3 of the PyTorch-lite roadmap and has been verified numerically against a reference: for the prompt "The capital of France is", the predicted next token is " the" (HuggingFace token id 262).

Getting the weights

Download GPT-2 small from the HuggingFace Hub (it is not shipped with FWH): https://huggingface.co/gpt2/resolve/main/model.safetensors (~500 MB, 124M parameters), plus tokenizer.json for HFTokenizer. Weights are read one tensor at a time via FWT_LoadSafe rather than by loading the whole file at once, so peak memory is roughly the float32 weights themselves (~500 MB), which fits in a 32-bit process.
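
Reading one tensor at a time is possible because a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header that gives each tensor's dtype, shape and byte range relative to the end of the header. The sketch below shows that layout only. It is not the FWT_LoadSafe implementation, and SafeHeader and demo.safetensors are hypothetical names used for illustration.

```harbour
#include "fileio.ch"

// Hypothetical helper, NOT the FWT_LoadSafe implementation.
// Returns the decoded JSON header as a hash and stores in nDataStart
// (pass by reference) the file offset where the tensor data begins.
FUNCTION SafeHeader( cFile, nDataStart )
   LOCAL nH := FOpen( cFile, FO_READ )
   LOCAL cLen := Space( 8 ), cJson, nLen, hHdr := NIL

   IF nH == F_ERROR
      RETURN NIL
   ENDIF
   IF FRead( nH, @cLen, 8 ) == 8
      // The length is a little-endian unsigned 64-bit integer. Headers are
      // small, so this sketch decodes only the low 32 bits.
      nLen  := Bin2L( hb_BLeft( cLen, 4 ) )
      cJson := Space( nLen )
      IF FRead( nH, @cJson, nLen ) == nLen
         hb_jsonDecode( cJson, @hHdr )
         nDataStart := 8 + nLen
      ENDIF
   ENDIF
   FClose( nH )
RETURN hHdr

PROCEDURE Main()
   LOCAL cJson := '{"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}'
   LOCAL hHdr, hInfo, cName, nStart := 0

   // Write a tiny demo file: 8-byte length, JSON header, 8 bytes of data.
   hb_MemoWrit( "demo.safetensors", ;
      L2Bin( hb_BLen( cJson ) ) + Replicate( Chr( 0 ), 4 ) + ;
      cJson + Replicate( Chr( 0 ), 8 ) )

   hHdr := SafeHeader( "demo.safetensors", @nStart )
   FOR EACH cName IN hb_HKeys( hHdr )
      hInfo := hHdr[ cName ]
      // Skip entries without a byte range (e.g. the optional __metadata__ key).
      IF HB_ISHASH( hInfo ) .AND. hb_HHasKey( hInfo, "data_offsets" )
         // name, dtype, absolute file offset, size in bytes
         ? cName, hInfo[ "dtype" ], ;
           nStart + hInfo[ "data_offsets" ][ 1 ], ;
           hInfo[ "data_offsets" ][ 2 ] - hInfo[ "data_offsets" ][ 1 ]
      ENDIF
   NEXT
RETURN
```

With the offset and size known, a loader can FSeek to a single tensor and read only its bytes, so only one tensor's raw data needs to be buffered on top of the tensors already loaded.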

Methods

Method	Description
`New( n_layer, n_head, n_embd, vocab, n_ctx )`	Configure the architecture. Defaults to GPT-2 small (12, 12, 768, 50257, 1024).
`LoadSafetensors( cFile )`	Parse the safetensors header and load every weight tensor into FW_Tensors.
`Forward( aTok )`	Run the forward pass for a token sequence (1-based gather ids). Returns logits `[seq, vocab]`.
`Generate( aTok, nNew, nTemp, nTopK )`	Autoregressive generation (greedy / temperature / top-k). Returns the full 1-based id sequence.

Ids are 1-based for gathering (the HuggingFace 0-based id + 1); convert back with id - 1 before decoding.
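
The 124M parameter figure and the ~500 MB float32 footprint follow from the default configuration passed to New(). The sketch below does that arithmetic, assuming the standard GPT-2 layer layout (fused QKV projection, 4x MLP expansion, a bias on every linear layer, and an output head tied to the token embedding). GPT2ParamCount is a hypothetical helper, not a method of GPT2Model.

```harbour
// Hypothetical helper: count GPT-2 parameters for a configuration.
// n_head is not needed because the heads only split n_embd.
FUNCTION GPT2ParamCount( n_layer, n_embd, vocab, n_ctx )
   // Token and positional embedding tables
   LOCAL nEmbed := vocab * n_embd + n_ctx * n_embd
   // One LayerNorm: gain + bias
   LOCAL nNorm  := 2 * n_embd
   // Attention: fused QKV projection plus output projection, with biases
   LOCAL nQkv   := n_embd * 3 * n_embd + 3 * n_embd
   LOCAL nProj  := n_embd * n_embd + n_embd
   // MLP: up projection to 4 * n_embd and back down, with biases
   LOCAL nUp    := n_embd * 4 * n_embd + 4 * n_embd
   LOCAL nDown  := 4 * n_embd * n_embd + n_embd
   // Each block has two LayerNorms; one final LayerNorm follows the stack.
   // The weight-tied output head adds no parameters of its own.
RETURN nEmbed + n_layer * ( 2 * nNorm + nQkv + nProj + nUp + nDown ) + nNorm

PROCEDURE Main()
   LOCAL nParams := GPT2ParamCount( 12, 768, 50257, 1024 )

   ? hb_ntos( nParams )        // 124439808 parameters
   ? hb_ntos( nParams * 4 )    // 497759232 bytes as float32
RETURN
```

At 4 bytes per float32 weight this comes to about 498 MB, which matches the ~500 MB quoted for the loaded model.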

Example

oTok := HFTokenizer():New( "gpt2.tokenizer.json" )
oM   := GPT2Model():New()                       // GPT-2 small defaults
oM:LoadSafetensors( "model.safetensors" )

aIds := oTok:Encode( "The capital of France is" )   // 0-based HF ids
aTok := {} ; AEval( aIds, { |n| AAdd( aTok, n + 1 ) } )

aOut := oM:Generate( aTok, 8, 0.0001, 1 )            // greedy, 8 tokens
aHF  := {} ; AEval( aOut, { |n| AAdd( aHF, n - 1 ) } )
? oTok:Decode( aHF )    // "The capital of France is the capital of the French Republic, and"
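
The Generate call above passes nTemp = 0.0001 and nTopK = 1 to get greedy decoding. The sketch below shows how temperature and top-k selection are commonly combined over one row of logits, and why those two values reduce to an argmax. It is an illustration under that assumption, not the code in gpt2model.prg, and SampleTopK is a hypothetical name.

```harbour
// Hypothetical sampler, NOT taken from gpt2model.prg.
// aLogits: one row of logits. Returns the 1-based index of the chosen token.
FUNCTION SampleTopK( aLogits, nTemp, nTopK )
   LOCAL aCand := {}, i, nMax, nArg, nSum := 0, nPick, nAcc := 0

   FOR i := 1 TO Len( aLogits )
      AAdd( aCand, { aLogits[ i ], i } )
   NEXT

   // Keep only the nTopK highest logits (at least one).
   ASort( aCand,,, { |a, b| a[ 1 ] > b[ 1 ] } )
   ASize( aCand, Max( 1, Min( nTopK, Len( aCand ) ) ) )

   // Softmax over the survivors at temperature nTemp. The maximum is
   // subtracted for stability and very negative exponents are treated as 0.
   nMax := aCand[ 1 ][ 1 ]
   FOR i := 1 TO Len( aCand )
      nArg := ( aCand[ i ][ 1 ] - nMax ) / nTemp
      aCand[ i ][ 1 ] := iif( nArg < -700, 0, Exp( nArg ) )
      nSum += aCand[ i ][ 1 ]
   NEXT

   // Draw one candidate in proportion to its weight.
   nPick := hb_Random() * nSum
   FOR i := 1 TO Len( aCand )
      nAcc += aCand[ i ][ 1 ]
      IF nPick < nAcc
         RETURN aCand[ i ][ 2 ]
      ENDIF
   NEXT
RETURN aCand[ 1 ][ 2 ]   // fallback for rounding at the upper edge

PROCEDURE Main()
   // With nTopK = 1 only the largest logit survives, so the result is its index.
   ? hb_ntos( SampleTopK( { 1.0, 3.5, 0.2 }, 0.0001, 1 ) )   // 2
RETURN
```

A near-zero temperature alone has the same effect: every candidate other than the maximum gets a weight that underflows to zero, so the draw always lands on the top logit.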

Notes

Inference only (no training), so no autograd is needed.
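Speed is bounded by the naive matmul; a BLAS backend (roadmap Phase 2) would make longer generations fast. There is no KV cache yet, so each step re-runs the forward pass over the whole window, and the attention cost of a step grows as O(n²) in the current sequence length n.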
For larger/quantized models, the pragmatic route is binding llama.cpp (roadmap Track C) rather than scaling this from-scratch path.
Test/demo: samples/ai/gpt2test.prg.
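
To make the cost of the missing KV cache concrete, the sketch below counts how many token positions pass through the model during generation, with and without a cache. It assumes the sequence stays within n_ctx and that a cache would process the prompt once and then one new token per step. Both function names are hypothetical, and the cached variant does not exist in GPT2Model.

```harbour
// Hypothetical cost model, not part of GPT2Model.
// Positions processed when every step re-runs the whole sequence.
FUNCTION PositionsNoCache( nPrompt, nNew )
   LOCAL nTotal := 0, i

   FOR i := 0 TO nNew - 1
      nTotal += nPrompt + i   // step i sees the prompt plus i generated tokens
   NEXT
RETURN nTotal

// Positions processed if keys and values were cached between steps.
FUNCTION PositionsWithCache( nPrompt, nNew )
RETURN nPrompt + ( nNew - 1 )   // prompt once, then one token per later step

PROCEDURE Main()
   // The 5-token prompt and 8 new tokens from the example above
   ? hb_ntos( PositionsNoCache( 5, 8 ) )     // 68
   ? hb_ntos( PositionsWithCache( 5, 8 ) )   // 12
RETURN
```

The gap widens quickly with length, since the uncached total grows quadratically in the number of generated tokens while the cached total grows linearly.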