HFTokenizer - HuggingFace BPE Tokenizer

Source: source/classes/hftokenizer.prg

HFTokenizer is a pure Harbour implementation of the HuggingFace byte-level Byte-Pair-Encoding (BPE) tokenizer used by GPT-2 and many derived models. It loads a standard tokenizer.json file (downloaded from the HuggingFace Hub) and reproduces GPT-2 tokenization bit-for-bit: the same text yields the exact same token ids.

The tokenizer connects the FiveWin Transformer class to real-world vocabularies, so a model can be trained or run on real text instead of toy token ids.

Tokenization Pipeline

Encode: Raw text -> Pre-tokenize (GPT-2 regex) -> Bytes to Unicode (byte_to_unicode) -> BPE merges (by rank) -> Token ids

Decode: Token ids -> ids to tokens -> Unicode to bytes -> Original text

Getting a tokenizer.json

Download a tokenizer from the HuggingFace Hub. For GPT-2:

https://huggingface.co/gpt2/resolve/main/tokenizer.json

The file contains a model object with a vocab map (token → id) and a merges list (ordered BPE merge rules). GPT-2's file is about 1.35 MB with 50,257 tokens and 50,000 merges.
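As a quick sanity check after downloading, the file can be inspected with Harbour's built-in JSON decoder. This is a minimal sketch, not the class's own loader: it assumes the file was saved as gpt2.tokenizer.json in the current directory (the name used in the example in this document) and relies only on the model.vocab and model.merges keys described above.

```harbour
// Sketch: inspect a downloaded tokenizer.json with Harbour's JSON decoder.
// Assumption: the file is saved as "gpt2.tokenizer.json" in the current directory.
FUNCTION Main()
   LOCAL cJson := hb_MemoRead( "gpt2.tokenizer.json" )
   LOCAL hRoot, hModel

   IF Empty( cJson )
      ? "File not found or empty"
      RETURN NIL
   ENDIF

   // Decode into hRoot (passed by reference)
   hb_jsonDecode( cJson, @hRoot )

   IF ! HB_ISHASH( hRoot ) .OR. ! hb_HHasKey( hRoot, "model" )
      ? "Not a tokenizer.json: missing 'model' object"
      RETURN NIL
   ENDIF

   hModel := hRoot[ "model" ]
   ? "vocab entries :", Len( hModel[ "vocab" ] )
   ? "merge rules   :", Len( hModel[ "merges" ] )

RETURN NIL
```

For the GPT-2 file, the two counts should agree with the figures quoted above.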

Methods

Method	Description
`New( cFile )`	Build the byte → unicode maps and, if `cFile` is given, load that `tokenizer.json`.
`Load( cFile )`	Parse a `tokenizer.json`: read `model.vocab` and `model.merges` (string or pair form both accepted). Returns `NIL` on a malformed file.
`Encode( cText )`	Tokenize text into an array of (0-based, raw HuggingFace) token ids.
`Decode( aIds )`	Reconstruct the original text from an array of token ids.
`Size()`	Number of tokens in the loaded vocabulary.

Example

#include "FiveWin.ch"

FUNCTION Main()
   LOCAL oTok, aIds, cBack

   oTok := HFTokenizer():New( "gpt2.tokenizer.json" )

   ? oTok:Size()                       // 50257

   aIds := oTok:Encode( "hello world" )
   ? hb_ValToExp( aIds )               // { 31373, 995 }

   cBack := oTok:Decode( aIds )
   ? cBack                             // hello world

RETURN NIL

How It Works

1. Byte → Unicode map

GPT-2's byte_to_unicode maps every one of the 256 byte values to a printable Unicode character, so no byte is ever unprintable or whitespace inside the BPE step. The space byte (0x20) famously becomes U+0120 (displayed as Ġ). The class builds this map once in New() and the exact inverse for Decode().
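The table can be rebuilt in a few lines. The sketch below follows the published GPT-2 byte_to_unicode rule (printable bytes keep their own code point, the rest are assigned U+0100 upward in byte order); the function name ByteToUnicode and the array layout are illustrative assumptions, not necessarily how the class stores its map.

```harbour
// Sketch of GPT-2's byte_to_unicode table.
// aMap[ nByte + 1 ] holds the Unicode code point assigned to byte nByte.
// ByteToUnicode() is a hypothetical helper name used for illustration.
FUNCTION ByteToUnicode()
   LOCAL aMap := Array( 256 )
   LOCAL nByte, nExtra := 0

   FOR nByte := 0 TO 255
      IF ( nByte >= 33  .AND. nByte <= 126 ) .OR. ;
         ( nByte >= 161 .AND. nByte <= 172 ) .OR. ;
         ( nByte >= 174 .AND. nByte <= 255 )
         aMap[ nByte + 1 ] := nByte          // already printable: keep as is
      ELSE
         aMap[ nByte + 1 ] := 256 + nExtra   // remap to U+0100 and up
         nExtra++
      ENDIF
   NEXT

RETURN aMap

FUNCTION Main()
   LOCAL aMap := ByteToUnicode()

   ? aMap[ 0x20 + 1 ]                  // 288, i.e. U+0120 for the space byte
   ? aMap[ Asc( "A" ) + 1 ]            // 65, printable ASCII maps to itself
   ? hb_UTF8Chr( aMap[ 0x20 + 1 ] )    // UTF-8 string for that code point

RETURN NIL
```

The inverse map needed by Decode() is the same table read the other way: code point back to byte.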

2. Pre-tokenizer

Before BPE, text is split into pieces following the GPT-2 regular expression, which keeps contractions ('s 't 're 've 'm 'll 'd) together and attaches a single leading space to the following word, number or symbol group (the well-known " ?\p{L}+" behaviour). The implementation covers ASCII letters, digits, symbols and whitespace runs; any non-ASCII byte is handled transparently by the byte-level mapping.
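To make the splitting rule concrete, here is an ASCII-only approximation of the GPT-2 pattern written with Harbour's regex functions. It is a sketch under assumptions, not the class's scanner: PreTokenize is a hypothetical name, Harbour is assumed to be built with PCRE (needed for the lookahead), and non-ASCII bytes simply fall into the symbol class here.

```harbour
// Sketch: ASCII-only approximation of the GPT-2 pre-tokenizer pattern.
// Assumption: Harbour is built with PCRE (needed for the (?!\S) lookahead).
// PreTokenize() is a hypothetical helper, not part of HFTokenizer.
FUNCTION PreTokenize( cText )
   LOCAL cPattern := "'s|'t|'re|'ve|'m|'ll|'d" + ;
                     "| ?[A-Za-z]+" + ;
                     "| ?[0-9]+" + ;
                     "| ?[^\sA-Za-z0-9]+" + ;
                     "|\s+(?!\S)|\s+"

   // nGetMatch = 1 and lOnlyMatch = .T. request the whole-match strings only
RETURN hb_regexAll( cPattern, cText, .T., .F., , 1, .T. )

FUNCTION Main()
   LOCAL aPieces := PreTokenize( "Hello world, it's 42!" )
   LOCAL cPiece

   IF HB_ISARRAY( aPieces )
      FOR EACH cPiece IN aPieces
         ? "[" + cPiece + "]"      // brackets make leading spaces visible
      NEXT
   ENDIF

RETURN NIL
```

Under the GPT-2 rule, this sentence splits into Hello, " world", a comma, " it", 's, " 42" and an exclamation mark, with each leading space staying attached to the piece that follows it.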

3. BPE merges

Each pre-token is converted to its mapped Unicode characters and then merged greedily: at every step the adjacent symbol pair with the lowest merge rank (its position in model.merges) is joined, until no known pair remains. The resulting tokens are looked up in model.vocab to produce ids.
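The merge loop can be sketched on a toy rule set. The ranks below are made up for illustration (they are not GPT-2's), BpeMerge is a hypothetical name, and the sketch merges one occurrence of the best pair per pass, which is a simplification of what production tokenizers do.

```harbour
// Sketch of greedy BPE merging by rank.
// hRanks maps "left right" to its rank (position in the merges list).
// BpeMerge() and the toy ranks are illustrative assumptions.
FUNCTION BpeMerge( aSym, hRanks )
   LOCAL nI, nRank, nBest, nBestRank

   DO WHILE Len( aSym ) > 1
      nBest     := 0
      nBestRank := 0
      // Find the adjacent pair with the lowest rank
      FOR nI := 1 TO Len( aSym ) - 1
         nRank := hb_HGetDef( hRanks, aSym[ nI ] + " " + aSym[ nI + 1 ], -1 )
         IF nRank >= 0 .AND. ( nBest == 0 .OR. nRank < nBestRank )
            nBest     := nI
            nBestRank := nRank
         ENDIF
      NEXT
      IF nBest == 0
         EXIT                                // no known pair remains
      ENDIF
      aSym[ nBest ] += aSym[ nBest + 1 ]     // join the pair
      hb_ADel( aSym, nBest + 1, .T. )        // drop the right half, shrink array
   ENDDO

RETURN aSym

FUNCTION Main()
   LOCAL hRanks := { "l l" => 0, "h e" => 1, "he ll" => 2, "hell o" => 3 }
   LOCAL aOut   := BpeMerge( { "h", "e", "l", "l", "o" }, hRanks )

   ? Len( aOut ), aOut[ 1 ]                  // 1 symbol left: hello

RETURN NIL
```

The merge order here is l+l, then h+e, then he+ll, then hell+o: the lowest-ranked pair wins at each step regardless of its position in the word.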

Validation

The tokenizer is verified against the real GPT-2 tokenizer.json. Known reference ids match exactly:

Text	Token ids
`"hello world"`	`{ 31373, 995 }`
`"Hello world"`	`{ 15496, 995 }`

A full Decode( Encode( text ) ) == text round-trip holds for sentences with punctuation, contractions, digits, multiple spaces and newlines. The headless test harness is samples/ai/hftoktest.prg.
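A round-trip check of this kind can be written with only the methods listed in this document. The sketch below is not the contents of hftoktest.prg; the sample strings are arbitrary, and it assumes gpt2.tokenizer.json is in the current directory.

```harbour
#include "FiveWin.ch"

// Sketch: round-trip check using the documented Encode()/Decode() methods.
// Assumption: "gpt2.tokenizer.json" has been downloaded to the current directory.
FUNCTION Main()
   LOCAL oTok := HFTokenizer():New( "gpt2.tokenizer.json" )
   LOCAL cText

   FOR EACH cText IN { "It's 2 o'clock.", "a  b" + Chr( 10 ) + "c" }
      // == is an exact string comparison in Harbour
      ? iif( oTok:Decode( oTok:Encode( cText ) ) == cText, "OK  ", "FAIL" ), ;
        hb_ValToExp( cText )
   NEXT

RETURN NIL
```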

Bridging to the Transformer

HFTokenizer is the first step of an ongoing effort to integrate the FWH Transformer with the HuggingFace ecosystem, so FiveWin apps can reuse real-world AI resources — tokenizers today, datasets and (where the architecture allows) pretrained weights next.

The GPT-2 vocabulary has 50,257 tokens — far more than a small from-scratch model needs. The bridge therefore builds a compact vocabulary containing only the tokens that appear in your corpus, mapping HuggingFace ids to dense 1-based local ids:

1. Tokenize the corpus with HFTokenizer:Encode.
2. Collect the unique token ids into a compact 1-based vocabulary plus token strings.
3. Create the Transformer with that vocabulary and train it (ForwardSeq / BackwardSeq).
4. Generate local ids, map them back to HuggingFace ids, and Decode to text.
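The id remapping in these steps can be sketched as follows. BuildCompactVocab is a hypothetical helper written for illustration; the actual bridge in samples/ai/hflmtest.prg may be organised differently. The ids in Main are the reference ids for "hello world" from the Validation section, with the first one repeated.

```harbour
// Sketch: map raw HuggingFace ids to dense 1-based local ids and back.
// BuildCompactVocab() is a hypothetical helper, not part of HFTokenizer.
// hToLocal : HuggingFace id -> local id   (filled in place)
// aToHf    : local id -> HuggingFace id   (filled in place)
FUNCTION BuildCompactVocab( aHfIds, hToLocal, aToHf )
   LOCAL nId, aLocal := {}

   FOR EACH nId IN aHfIds
      IF ! hb_HHasKey( hToLocal, nId )
         AAdd( aToHf, nId )                  // first time seen: next local id
         hToLocal[ nId ] := Len( aToHf )
      ENDIF
      AAdd( aLocal, hToLocal[ nId ] )
   NEXT

RETURN aLocal

FUNCTION Main()
   LOCAL hToLocal := { => }, aToHf := {}
   LOCAL aLocal := BuildCompactVocab( { 31373, 995, 31373 }, hToLocal, aToHf )

   ? aLocal[ 1 ], aLocal[ 2 ], aLocal[ 3 ]   // 1 2 1
   ? Len( aToHf )                            // 2 distinct tokens
   ? aToHf[ aLocal[ 2 ] ]                    // 995, back to the HuggingFace id

RETURN NIL
```

After generation, passing each local id through aToHf gives the array that Decode expects.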

End-to-end sample: samples/ai/hflmtest.prg.

Integration Roadmap

Step	Status
HuggingFace byte-level BPE tokenizer (GPT-2)	Done
Tokenizer → Transformer bridge (compact vocabulary)	Done
One-line language-model wrapper class	In progress
Training on real corpora (Shakespeare, TinyStories)	Planned
Better decoding (top-p / nucleus sampling, repetition penalty)	Planned
Loading pretrained weights where the architecture allows	Exploring

Notes

Encode returns raw (0-based) HuggingFace ids. When feeding a 1-based FiveWin Transformer vocabulary, add 1 (or map through your own vocab).
Both merge formats are accepted: legacy "tokA tokB" strings and the newer [ "tokA", "tokB" ] pairs.
The tokenizer.json is a downloadable resource and is not shipped with FiveWin; download it once from the Hub.
Special/added tokens (e.g. <|endoftext|>) are present in the vocab and resolve like any other token when they appear verbatim in the input.