HFTokenizer - HuggingFace BPE Tokenizer
Source: source/classes/hftokenizer.prg
HFTokenizer is a pure Harbour implementation of the HuggingFace
byte-level Byte-Pair-Encoding (BPE) tokenizer used by GPT-2 and
many derived models. It loads a standard tokenizer.json file
(downloaded from the HuggingFace Hub) and reproduces GPT-2 tokenization
bit-for-bit: the same text yields the exact same token ids.
This bridges the FiveWin Transformer class with real-world vocabularies, so a model can be trained or run on real text instead of toy token ids.
Tokenization Pipeline
Encode: text → pre-tokenize (GPT-2 regex) → bytes → Unicode (byte_to_unicode) → BPE merges (by rank) → token ids
Decode: token ids → ids → tokens → Unicode → bytes → original text
Getting a tokenizer.json
Download a tokenizer from the HuggingFace Hub. For GPT-2:
https://huggingface.co/gpt2/resolve/main/tokenizer.json
The file contains a model object with a vocab map
(token → id) and a merges list (ordered BPE merge rules).
GPT-2's file is about 1.35 MB with 50,257 tokens and 50,000 merges.
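To get a feel for the file layout described above, the following minimal sketch reads the JSON directly with Harbour's `hb_jsonDecode()` and prints the sizes of the two structures. It does not use HFTokenizer at all; the file name is a placeholder for wherever you saved the download, and the lookup of `hello` is only an illustration.

```harbour
// Minimal sketch: inspect model.vocab and model.merges of a tokenizer.json.
// "gpt2.tokenizer.json" is a placeholder file name (an assumption).
PROCEDURE Main()

   LOCAL cJson := hb_MemoRead( "gpt2.tokenizer.json" )
   LOCAL hRoot, hModel

   IF Empty( cJson )
      ? "tokenizer.json not found or empty"
      RETURN
   ENDIF

   hRoot := hb_jsonDecode( cJson )
   IF ! HB_ISHASH( hRoot ) .OR. ! hb_HHasKey( hRoot, "model" )
      ? "unexpected JSON layout"
      RETURN
   ENDIF

   hModel := hRoot[ "model" ]
   ? "vocab entries:", Len( hModel[ "vocab" ] )     // token -> id map
   ? "merge rules  :", Len( hModel[ "merges" ] )    // ordered merge list
   ? "id of 'hello':", hb_HGetDef( hModel[ "vocab" ], "hello", -1 )

   RETURN
```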
Methods
| Method | Description |
|---|---|
| New( cFile ) | Build the byte → unicode maps and, if cFile is given, load that tokenizer.json. |
| Load( cFile ) | Parse a tokenizer.json: read model.vocab and model.merges (string or pair form both accepted). Returns NIL on a malformed file. |
| Encode( cText ) | Tokenize text into an array of (0-based, raw HuggingFace) token ids. |
| Decode( aIds ) | Reconstruct the original text from an array of token ids. |
| Size() | Number of tokens in the loaded vocabulary. |
Example
#include "FiveWin.ch"

FUNCTION Main()

   LOCAL oTok, aIds, cBack

   oTok := HFTokenizer():New( "gpt2.tokenizer.json" )
   ? oTok:Size()                  // 50257

   aIds := oTok:Encode( "hello world" )
   ? hb_ValToExp( aIds )          // { 31373, 995 }

   cBack := oTok:Decode( aIds )
   ? cBack                        // hello world

   RETURN NIL
How It Works
1. Byte → Unicode map
GPT-2's byte_to_unicode maps every one of the 256 byte values to a
printable Unicode character, so no byte is ever unprintable or whitespace inside
the BPE step. The space byte (0x20) famously becomes U+0120
(displayed as Ġ). The class builds this map once in
New() and the exact inverse for Decode().
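The mapping rule itself is short. The sketch below follows the published GPT-2 `bytes_to_unicode` recipe: bytes that are already printable keep their own code point, and the remaining bytes are assigned U+0100 upwards in ascending byte order. The function name `ByteToUnicode` and the array layout are illustrative assumptions, not the class's internals.

```harbour
// Sketch of GPT-2's byte -> Unicode table. Returns a 256-element array
// where aMap[ nByte + 1 ] is the code point assigned to that byte.
FUNCTION ByteToUnicode()

   LOCAL aMap := Array( 256 )
   LOCAL nByte, nNext := 256

   FOR nByte := 0 TO 255
      IF ( nByte >= 33 .AND. nByte <= 126 ) .OR. ;
         ( nByte >= 161 .AND. nByte <= 172 ) .OR. ;
         ( nByte >= 174 .AND. nByte <= 255 )
         // printable byte: keeps its own code point
         aMap[ nByte + 1 ] := nByte
      ELSE
         // control, space and similar bytes: remapped from U+0100 upwards
         aMap[ nByte + 1 ] := nNext
         nNext++
      ENDIF
   NEXT

   RETURN aMap

PROCEDURE Main()

   LOCAL aMap := ByteToUnicode()

   ? hb_NumToHex( aMap[ 0x20 + 1 ] )   // 120, i.e. the space byte maps to U+0120
   ? aMap[ Asc( "A" ) + 1 ] == Asc( "A" )   // .T., printable ASCII is unchanged

   RETURN
```

The inverse table needed for decoding can be built by walking this array once and storing byte values keyed by code point.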
2. Pre-tokenizer
Before BPE, text is split into pieces following the GPT-2 regular expression,
which keeps contractions ('s 't 're 've 'm 'll 'd) together and
attaches a single leading space to the following word, number or symbol group
(the well-known " ?\p{L}+" behaviour). The implementation covers
ASCII letters, digits, symbols and whitespace runs; any non-ASCII byte is handled
transparently by the byte-level mapping.
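One way to approximate this split for ASCII text is a single regular expression passed to `hb_regexAll()`. The pattern below is an ASCII-only stand-in for the GPT-2 expression (with `[A-Za-z]` and `[0-9]` in place of `\p{L}` and `\p{N}`); it assumes a Harbour build with the bundled PCRE engine, since it relies on a negative lookahead. The class may well implement the split differently, for example with a hand-written scanner.

```harbour
// ASCII-only approximation of the GPT-2 pre-tokenizer (an assumption,
// not the class's actual code). Requires PCRE-backed regex support.
PROCEDURE Main()

   LOCAL cPat := "'s|'t|'re|'ve|'m|'ll|'d" + ;
                 "| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+" + ;
                 "|\s+(?!\S)|\s+"
   LOCAL aPieces, cPiece

   // nGetMatch = 1 with lOnlyMatch = .T. returns the full matches as strings
   aPieces := hb_regexAll( cPat, "I'm here,  ok 42", .T., .F., 0, 1, .T. )

   FOR EACH cPiece IN aPieces
      ? "[" + cPiece + "]"
   NEXT

   RETURN
```

Under PCRE semantics the sample string should split into `I`, `'m`, ` here`, `,`, a single space, ` ok` and ` 42`: of the two spaces before `ok`, the first becomes its own piece and the second attaches to the word.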
3. BPE merges
Each pre-token is converted to its mapped Unicode characters and then merged
greedily: at every step the adjacent symbol pair with the lowest merge rank
(its position in model.merges) is joined, until no known pair
remains. The resulting tokens are looked up in model.vocab to produce
ids.
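The merge loop can be sketched in a few lines. The version below merges the leftmost occurrence of the lowest-ranked pair on each pass, which is a simplification chosen for readability; the rank table in `Main()` is a made-up toy, not real GPT-2 data, and `BpeMerge` is a hypothetical name.

```harbour
// Greedy BPE over one pre-token. aSyms is an array of symbol strings;
// hRanks maps "left right" to its merge rank (lower ranks merge first).
FUNCTION BpeMerge( aSyms, hRanks )

   LOCAL nI, nBest, nBestRank, nRank

   DO WHILE Len( aSyms ) > 1
      nBest := 0
      nBestRank := -1
      FOR nI := 1 TO Len( aSyms ) - 1
         nRank := hb_HGetDef( hRanks, aSyms[ nI ] + " " + aSyms[ nI + 1 ], -1 )
         IF nRank >= 0 .AND. ( nBest == 0 .OR. nRank < nBestRank )
            nBest := nI
            nBestRank := nRank
         ENDIF
      NEXT
      IF nBest == 0
         EXIT                             // no known pair remains
      ENDIF
      aSyms[ nBest ] += aSyms[ nBest + 1 ]
      hb_ADel( aSyms, nBest + 1, .T. )    // drop the right half and shrink
   ENDDO

   RETURN aSyms

PROCEDURE Main()

   // Toy ranks for illustration only
   LOCAL hRanks := { "l l" => 0, "h e" => 1, "ll o" => 2, "he llo" => 3 }

   // Merges proceed l+l, h+e, ll+o, he+llo, leaving the single symbol "hello"
   ? hb_ValToExp( BpeMerge( { "h", "e", "l", "l", "o" }, hRanks ) )

   RETURN
```

Keying the rank table by `"left right"` mirrors the legacy string form of model.merges, so a loader can fill it with the list index of each rule.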
Validation
The tokenizer is verified against the real GPT-2 tokenizer.json.
Known reference ids match exactly:
| Text | Token ids |
|---|---|
| "hello world" | { 31373, 995 } |
| "Hello world" | { 15496, 995 } |
A full Decode( Encode( text ) ) == text round-trip holds for sentences
with punctuation, contractions, digits, multiple spaces and newlines. The headless
test harness is samples/ai/hftoktest.prg.
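A round-trip check of this kind needs only the documented `Encode` and `Decode` methods. The snippet below is a minimal sketch, not the contents of hftoktest.prg; the sample strings and the file name are arbitrary choices.

```harbour
#include "FiveWin.ch"

// Minimal round-trip check: Decode( Encode( text ) ) must equal text.
FUNCTION Main()

   LOCAL oTok := HFTokenizer():New( "gpt2.tokenizer.json" )
   LOCAL aTexts := { "hello world", ;
                     "I'm here,  it's 42!", ;
                     "line one" + Chr( 10 ) + "line two" }
   LOCAL cText

   FOR EACH cText IN aTexts
      ? iif( oTok:Decode( oTok:Encode( cText ) ) == cText, "OK  ", "FAIL" ), ;
        hb_ValToExp( cText )
   NEXT

   RETURN NIL
```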
Bridging to the Transformer
HFTokenizer is the first step of an ongoing effort to integrate the FWH Transformer with the HuggingFace ecosystem, so FiveWin apps can reuse real-world AI resources — tokenizers today, datasets and (where the architecture allows) pretrained weights next.
The GPT-2 vocabulary has 50,257 tokens — far more than a small from-scratch model needs. The bridge therefore builds a compact vocabulary containing only the tokens that appear in your corpus, mapping HuggingFace ids to dense 1-based local ids:
- Tokenize the corpus with HFTokenizer:Encode.
- Collect the unique token ids into a compact 1-based vocabulary plus token strings.
- Create the Transformer with that vocabulary and train it (ForwardSeq/BackwardSeq).
- Generate local ids, map them back to HuggingFace ids, and Decode to text.
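The id remapping at the heart of these steps can be sketched as follows. `CompactVocab` is a hypothetical helper, not part of the class: it assigns dense 1-based local ids in order of first appearance and keeps the reverse table needed to get back to HuggingFace ids before calling `Decode`. Collecting the token strings is left out.

```harbour
// Build a compact 1-based vocabulary from raw HuggingFace ids.
// Returns { aLocalIds, aLocalToHF }.
FUNCTION CompactVocab( aHFIds )

   LOCAL hToLocal := { => }     // HuggingFace id -> local id
   LOCAL aLocalToHF := {}       // local id -> HuggingFace id
   LOCAL aLocalIds := {}
   LOCAL nId

   FOR EACH nId IN aHFIds
      IF ! hb_HHasKey( hToLocal, nId )
         AAdd( aLocalToHF, nId )
         hToLocal[ nId ] := Len( aLocalToHF )
      ENDIF
      AAdd( aLocalIds, hToLocal[ nId ] )
   NEXT

   RETURN { aLocalIds, aLocalToHF }

PROCEDURE Main()

   // Ids taken from the validation table: "hello world" + "Hello world"
   LOCAL aRes := CompactVocab( { 31373, 995, 15496, 995 } )

   ? hb_ValToExp( aRes[ 1 ] )   // local ids 1, 2, 3, 2
   ? hb_ValToExp( aRes[ 2 ] )   // HuggingFace ids 31373, 995, 15496

   RETURN
```

After generation, each local id `n` maps back through `aRes[ 2 ][ n ]`, and the resulting array can be passed to `Decode`.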
End-to-end sample: samples/ai/hflmtest.prg.
Integration Roadmap
| Step | Status |
|---|---|
| HuggingFace byte-level BPE tokenizer (GPT-2) | Done |
| Tokenizer → Transformer bridge (compact vocabulary) | Done |
| One-line language-model wrapper class | In progress |
| Training on real corpora (Shakespeare, TinyStories) | Planned |
| Better decoding (top-p / nucleus sampling, repetition penalty) | Planned |
| Loading pretrained weights where the architecture allows | Exploring |
Notes
- Encode returns raw (0-based) HuggingFace ids. When feeding a 1-based FiveWin Transformer vocabulary, add 1 (or map through your own vocab).
- Both merge formats are accepted: legacy "tokA tokB" strings and the newer [ "tokA", "tokB" ] pairs.
- The tokenizer.json is a downloadable resource and is not shipped with FiveWin; download it once from the Hub.
- Special/added tokens (e.g. <|endoftext|>) are present in the vocab and resolve like any other token when they appear verbatim in the input.