Transformer Tutorial
The TTransformer component implements the full transformer architecture described in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017). It allows you to load pre-trained model weights, run inference, fine-tune on custom data, and even train small transformers from scratch — all from Harbour xBase code.
Complete, runnable examples are in the samples/projects/transformer/ folder,
whose seven sample projects cover every aspect of the transformer component.
Sample Projects
| # | Project | Description |
|---|---|---|
| 1 | Attention Vis | Visualize self-attention weights for a given input sentence. |
| 2 | Text Generator | Load a pre-trained model and generate text from a prompt. |
| 3 | Train from Scratch | Train a tiny transformer on a toy dataset (copy task). |
| 4 | Tokenizer Explorer | Inspect tokenization, token IDs, and vocabulary. |
| 5 | Sentiment Analyzer | Fine-tune a transformer for positive/negative classification. |
| 6 | Translator | Sequence-to-sequence translation (e.g., English to Spanish). |
| 7 | Attention Walkthrough | Step-by-step walkthrough of the "Attention Is All You Need" architecture. |
Transformer Architecture
Properties
| Property | Type | Default | Description |
|---|---|---|---|
| nLayers | Numeric | 6 | Number of encoder/decoder layers in the transformer stack. |
| nHeads | Numeric | 8 | Number of attention heads per layer. Must divide nEmbedDim evenly. |
| nEmbedDim | Numeric | 512 | Embedding dimension (d_model). Core width of the transformer. |
| nFFDim | Numeric | 2048 | Feed-forward network inner dimension (typically 4x nEmbedDim). |
| nVocabSize | Numeric | 50257 | Vocabulary size. Must match the tokenizer's vocabulary. |
| nMaxSeqLen | Numeric | 512 | Maximum input/output sequence length in tokens. |
| nDropout | Numeric | 0.1 | Dropout rate applied to attention and FFN outputs (0 = disabled). |
| cWeightsFile | String | "" | Path to pre-trained weights file (GGUF format). Set before loading. |
| lCausal | Logical | .T. | Use causal (decoder) mask. Set to .F. for encoder-only or bidirectional mode. |
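As a quick sanity check of the defaults above (plain Python for illustration, not part of the Harbour component): the per-head width is d_model divided by the number of heads, which is why nHeads must divide nEmbedDim evenly.

```python
# Illustrative check with the default property values from the table above.
n_embed_dim = 512   # nEmbedDim default (d_model)
n_heads = 8         # nHeads default

# The constraint the component enforces: nHeads must divide nEmbedDim evenly.
assert n_embed_dim % n_heads == 0, "nHeads must divide nEmbedDim evenly"

d_head = n_embed_dim // n_heads  # dimensions each attention head operates on
print(d_head)  # 64
```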
Events
| Event | Parameters | Description |
|---|---|---|
OnAttention | aWeights, nLayer, nHead | Fired during forward pass with attention weight matrices. Use for visualization and analysis. |
OnGenerate | cToken, nStep | Fired for each token generated during inference. Enables streaming output. |
OnTrainStep | nStep, nLoss, nBatch | Fired after each training step with the current loss value. Use for progress display. |
OnLoss | nLoss | Fired when the loss value is computed during forward pass. Use for monitoring and early stopping. |
Example 1: Load a Pre-Trained Model
```
#include "hbbuilder.ch"

function Main()
   local oTransformer, cOutput

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b-chat.Q4_K_M.gguf" ;
      CONTEXT 2048 ;
      GPU_LAYERS 0

   if .not. oTransformer:lLoaded
      ? "Failed to load model:", oTransformer:cError
      return nil
   endif

   // Generate text from a prompt
   cOutput := oTransformer:Generate( ;
      "Once upon a time", ;  // prompt
      128 )                  // max tokens

   ? cOutput

return nil
```
Example 2: Streaming Token Generation
```
static function StreamExample()
   local oTransformer

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/phi3-mini.Q4_K_M.gguf"

   // Hook into the OnGenerate event for streaming.
   // QQOut() prints without a leading newline, so tokens
   // appear as one continuous stream of text.
   oTransformer:OnGenerate := { |cToken, nStep| QQOut( cToken ) }

   oTransformer:GenerateStream( ;
      "Explain the transformer architecture", ;
      256 )

return nil
```
Example 3: Attention Visualization
```
static function AttentionVis()
   local oTransformer
   local aAttentionMaps := {}

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b.Q4_K_M.gguf"

   // Capture attention weights via the OnAttention event
   oTransformer:OnAttention := ;
      { |aW, nL, nH| AAdd( aAttentionMaps, { nL, nH, aW } ) }

   // Run a forward pass
   oTransformer:Forward( "The cat sat on the mat" )

   // aAttentionMaps now contains one weight matrix per
   // layer and head - render them as heatmaps
   RenderAttentionHeatmap( aAttentionMaps )

return nil
```
Example 4: Training from Scratch
```
static function TrainExample()
   local oTransformer
   local aInputs  := { "hello", "world", "test" }
   local aTargets := { "hello", "world", "test" }  // copy task

   DEFINE TRANSFORMER oTransformer ;
      LAYERS 2 ;
      HEADS 4 ;
      EMBED_DIM 128 ;
      FF_DIM 256 ;
      VOCAB_SIZE 1000 ;
      MAX_SEQ_LEN 32 ;
      DROPOUT 0.1

   // Monitor training progress
   oTransformer:OnTrainStep := ;
      { |nStep, nLoss, nBatch| QOut( "Step:", nStep, "Loss:", nLoss ) }

   // Codeblocks cannot contain statements, so use iif() instead of if/endif
   oTransformer:OnLoss := ;
      { |nLoss| iif( nLoss < 0.01, QOut( "Converged!" ), nil ) }

   // Train for 1000 steps
   oTransformer:Train( aInputs, aTargets, ;
      1000, ;   // steps
      0.001 )   // learning rate

return nil
```
Architecture Details
Attention Mechanism
Each attention head computes the standard scaled dot-product attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
The multi-head mechanism runs nHeads parallel attention computations and
concatenates the results, then applies a linear projection.
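The formula above can be sketched in a few lines of plain Python (stdlib only, independent of the Harbour component) for a single head; function names here are illustrative, not part of the component's API:

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # (n x k) @ (k x m) -> (n x m); matrices are lists of rows
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention(q, k, v):
    # softmax(Q * K^T / sqrt(d_k)) * V for one attention head
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return matmul(weights, v)

# Each output row is a convex (weighted-average) mix of the value rows
out = attention([[1.0, 0.0], [0.0, 1.0]],   # Q
                [[1.0, 0.0], [0.0, 1.0]],   # K
                [[1.0, 2.0], [3.0, 4.0]])   # V
```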
Causal Masking
When lCausal is .T. (the default), the transformer applies a
causal (triangular) mask that prevents each position from attending to future positions.
This is essential for autoregressive text generation. Set lCausal := .F. for
bidirectional tasks like classification or masked language modeling.
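A minimal Python sketch of the idea (illustrative only; the component applies an equivalent mask internally): masked-out score entries are set to negative infinity before the softmax, which gives them attention weight zero.

```python
def causal_mask(n):
    # mask[i][j] is True where position i may attend to position j (j <= i)
    return [[j <= i for j in range(n)] for i in range(n)]

def mask_scores(scores, mask):
    # Masked-out entries become -inf, so softmax assigns them weight 0
    return [[s if keep else float("-inf")
             for s, keep in zip(srow, mrow)]
            for srow, mrow in zip(scores, mask)]

mask = causal_mask(3)
# [[True, False, False],
#  [True, True,  False],
#  [True, True,  True ]]
```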
Feed-Forward Network
Each layer contains a position-wise feed-forward network:
FFN(x) = max(0, x * W1 + b1) * W2 + b2
where the inner dimension is nFFDim (typically 4x the embedding dimension).
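The FFN formula above, sketched for one position vector in plain Python (illustrative, not the component's implementation; weights are lists of rows):

```python
def ffn(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x*W1 + b1) * W2 + b2, applied to one position's vector
    h = [max(0.0, sum(xi * wi for xi, wi in zip(x, col)) + b)
         for col, b in zip(zip(*w1), b1)]        # ReLU(x*W1 + b1), length d_ff
    return [sum(hi * wi for hi, wi in zip(h, col)) + b
            for col, b in zip(zip(*w2), b2)]     # h*W2 + b2, back to d_model

# Tiny example: d_model = 2, inner dimension = 2
out = ffn([1.0, -1.0],
          [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],   # W1, b1 (identity, zero bias)
          [[2.0, 0.0], [0.0, 2.0]], [0.5, 0.5])   # W2, b2
# out == [2.5, 0.5]: ReLU zeroes the -1.0 component, then scale and shift
```

Note that the same weights are applied at every position, which is what "position-wise" means.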
Performance Tips
For inference with pre-trained models, set GPU_LAYERS to a value greater than 0 if
a GPU is available. This offloads computation to the GPU and significantly speeds up
generation. For CPU-only machines, a quantized model (Q4_K_M) with nLayers = 2–4
provides a good balance of quality and speed.
Open any of the 7 sample projects in samples/projects/transformer/ to
see complete, runnable implementations of each pattern described in this tutorial.