Transformer Tutorial

The TTransformer component implements the full transformer architecture described in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017). It allows you to load pre-trained model weights, run inference, fine-tune on custom data, and even train small transformers from scratch — all from Harbour xBase code.

Didactic Examples

The samples/projects/transformer/ folder contains seven complete, runnable projects covering every aspect of the transformer component.

Sample Projects

#  Project                Description
1  Attention Vis          Visualize self-attention weights for a given input sentence.
2  Text Generator         Load a pre-trained model and generate text from a prompt.
3  Train from Scratch     Train a tiny transformer on a toy dataset (copy task).
4  Tokenizer Explorer     Inspect tokenization, token IDs, and vocabulary.
5  Sentiment Analyzer     Fine-tune a transformer for positive/negative classification.
6  Translator             Sequence-to-sequence translation (e.g., English to Spanish).
7  Attention Walkthrough  Step-by-step walkthrough of the "Attention Is All You Need" architecture.

Transformer Architecture

graph LR
   A["Input Text"] --> B["Tokenizer"]
   B --> C["Embedding +\nPositional Encoding"]
   C --> D["Multi-Head\nSelf-Attention"]
   D --> E["Feed-Forward\nNetwork"]
   E --> F["Add & Norm\n(Residual)"]
   F --> G["Repeat N times\n(nLayers)"]
   G --> H["Linear +\nSoftmax"]
   H --> I["Output\nProbabilities"]
   style D fill:#58a6ff,stroke:#388bfd,color:#0d1117
   style E fill:#58a6ff,stroke:#388bfd,color:#0d1117
   style G fill:#d2a8ff,stroke:#bc8cff,color:#0d1117

Properties

Property      Type     Default  Description
nLayers       Numeric  6        Number of encoder/decoder layers in the transformer stack.
nHeads        Numeric  8        Number of attention heads per layer. Must divide nEmbedDim evenly.
nEmbedDim     Numeric  512      Embedding dimension (d_model). Core width of the transformer.
nFFDim        Numeric  2048     Feed-forward network inner dimension (typically 4x nEmbedDim).
nVocabSize    Numeric  50257    Vocabulary size. Must match the tokenizer's vocabulary.
nMaxSeqLen    Numeric  512      Maximum input/output sequence length in tokens.
nDropout      Numeric  0.1      Dropout rate applied to attention and FFN outputs (0 = disabled).
cWeightsFile  String   ""       Path to pre-trained weights file (GGUF format). Set before loading.
lCausal       Logical  .T.      Use causal (decoder) mask. Set to .F. for encoder-only or bidirectional mode.
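Assuming the properties in the table above are writable directly on the object before the model is loaded (an assumption; the DEFINE TRANSFORMER clauses in the examples below are the documented route), a configuration sketch might look like:

   local oTransformer

   DEFINE TRANSFORMER oTransformer

   // Hypothetical direct property assignment, using the names above
   oTransformer:nLayers      := 4
   oTransformer:nHeads       := 8
   oTransformer:nEmbedDim    := 256       // nHeads (8) divides nEmbedDim (256) evenly
   oTransformer:lCausal      := .F.       // encoder-only / bidirectional mode
   oTransformer:cWeightsFile := "models/encoder.gguf"  // illustrative path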

Events

Event        Parameters               Description
OnAttention  aWeights, nLayer, nHead  Fired during the forward pass with attention weight matrices. Use for visualization and analysis.
OnGenerate   cToken, nStep            Fired for each token generated during inference. Enables streaming output.
OnTrainStep  nStep, nLoss, nBatch     Fired after each training step with the current loss value. Use for progress display.
OnLoss       nLoss                    Fired when the loss value is computed during a forward pass. Use for monitoring and early stopping.

Example 1: Load a Pre-Trained Model

#include "hbbuilder.ch"

function Main()

   local oTransformer, cOutput

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b-chat.Q4_K_M.gguf" ;
      CONTEXT 2048 ;
      GPU_LAYERS 0

   if .not. oTransformer:lLoaded
      ? "Failed to load model:", oTransformer:cError
      return nil
   endif

   // Generate text from a prompt
   cOutput := oTransformer:Generate( ;
      "Once upon a time", ;  // prompt
      128 )                  // max tokens

   ? cOutput

return nil

Example 2: Streaming Token Generation

static function StreamExample()

   local oTransformer

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/phi3-mini.Q4_K_M.gguf"

   // Hook into the OnGenerate event for streaming
   oTransformer:OnGenerate := { |cToken, nStep| ;
      QQOut( cToken ) }  // QQOut() avoids the newline QOut() prepends, so tokens flow inline

   oTransformer:GenerateStream( ;
      "Explain the transformer architecture", ;
      256 )

return nil

Example 3: Attention Visualization

static function AttentionVis()

   local oTransformer, oForm, oChart
   local aAttentionMaps := {}

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b.Q4_K_M.gguf"

   // Capture attention weights via OnAttention event
   oTransformer:OnAttention := ;
      { |aW, nL, nH| AAdd( aAttentionMaps, ;
         { nL, nH, aW } ) }

   // Run a forward pass
   oTransformer:Forward( "The cat sat on the mat" )

   // Now aAttentionMaps contains weight matrices
   // for each layer and head - render as heatmap
   RenderAttentionHeatmap( aAttentionMaps )

return nil

Example 4: Training from Scratch

static function TrainExample()

   local oTransformer
   local aInputs  := { "hello", "world", "test" }
   local aTargets := { "hello", "world", "test" }  // copy task

   DEFINE TRANSFORMER oTransformer ;
      LAYERS 2 ;
      HEADS 4 ;
      EMBED_DIM 128 ;
      FF_DIM 256 ;
      VOCAB_SIZE 1000 ;
      MAX_SEQ_LEN 32 ;
      DROPOUT 0.1

   // Monitor training progress
   oTransformer:OnTrainStep := ;
      { |nStep, nLoss, nBatch| ;
         QOut( "Step:", nStep, "Loss:", nLoss ) }

   oTransformer:OnLoss := ;
      { |nLoss| iif( nLoss < 0.01, ;
         QOut( "Converged!" ), nil ) }  // code blocks take expressions, so use iif() rather than if/endif

   // Train for 1000 steps
   oTransformer:Train( aInputs, aTargets, ;
      1000,  // steps
      0.001 ) // learning rate

return nil

Architecture Details

Attention Mechanism

Each attention head computes the standard scaled dot-product attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The multi-head mechanism runs nHeads parallel attention computations and concatenates the results, then applies a linear projection.
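As a didactic aid only (this is not the component's internal implementation), the formula above can be computed by hand in Harbour on small matrices. MatMul, Transpose, and SoftmaxRows are helper names invented for this sketch:

   // Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
   static function Attention( aQ, aK, aV, nDk )
      local aScores := MatMul( aQ, Transpose( aK ) )
      local i, j
      for i := 1 to Len( aScores )             // scale by sqrt(d_k)
         for j := 1 to Len( aScores[ i ] )
            aScores[ i ][ j ] /= Sqrt( nDk )
         next
      next
      return MatMul( SoftmaxRows( aScores ), aV )

   static function SoftmaxRows( aM )           // softmax over each row in place
      local aRow, nSum, i
      for each aRow in aM
         nSum := 0
         for i := 1 to Len( aRow )
            aRow[ i ] := Exp( aRow[ i ] )
            nSum += aRow[ i ]
         next
         for i := 1 to Len( aRow )
            aRow[ i ] /= nSum
         next
      next
      return aM

   static function MatMul( aA, aB )            // naive matrix product
      local aC := {}, i, j, k, nSum
      for i := 1 to Len( aA )
         AAdd( aC, Array( Len( aB[ 1 ] ) ) )
         for j := 1 to Len( aB[ 1 ] )
            nSum := 0
            for k := 1 to Len( aB )
               nSum += aA[ i ][ k ] * aB[ k ][ j ]
            next
            aC[ i ][ j ] := nSum
         next
      next
      return aC

   static function Transpose( aM )
      local aT := {}, i, j
      for j := 1 to Len( aM[ 1 ] )
         AAdd( aT, Array( Len( aM ) ) )
         for i := 1 to Len( aM )
            aT[ j ][ i ] := aM[ i ][ j ]
         next
      next
      return aT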

Causal Masking

When lCausal is .T. (the default), the transformer applies a causal (triangular) mask that prevents each position from attending to future positions. This is essential for autoregressive text generation. Set lCausal := .F. for bidirectional tasks like classification or masked language modeling.
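Conceptually, the triangular mask adds a large negative value to every future position before the softmax, driving its weight to zero. A didactic Harbour sketch (not part of the component API):

   static function CausalMask( nSeqLen )
      local aMask := {}, i, j
      for i := 1 to nSeqLen
         AAdd( aMask, Array( nSeqLen ) )
         for j := 1 to nSeqLen
            // position i may attend to positions 1..i only
            aMask[ i ][ j ] := iif( j <= i, 0, -1e9 )  // -1e9 stands in for -infinity
         next
      next
      return aMask

For a 3-token sequence this yields:

   0  -1e9  -1e9
   0    0   -1e9
   0    0     0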

Feed-Forward Network

Each layer contains a position-wise feed-forward network:

FFN(x) = max(0, x * W1 + b1) * W2 + b2

where the inner dimension is nFFDim (typically 4x the embedding dimension).
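For a single position vector, the two projections and the ReLU reduce to plain loops. A self-contained didactic sketch (invented helper, not the component's internals; weights are nested Harbour arrays):

   // FFN(x) = max(0, x * W1 + b1) * W2 + b2, for one position vector aX
   static function FFN( aX, aW1, aB1, aW2, aB2 )
      local aH := Array( Len( aB1 ) ), aY := Array( Len( aB2 ) )
      local i, j, nSum
      for j := 1 to Len( aB1 )                 // first projection + ReLU
         nSum := aB1[ j ]
         for i := 1 to Len( aX )
            nSum += aX[ i ] * aW1[ i ][ j ]
         next
         aH[ j ] := Max( nSum, 0 )             // max(0, .) is the ReLU
      next
      for j := 1 to Len( aB2 )                 // second projection back to d_model
         nSum := aB2[ j ]
         for i := 1 to Len( aH )
            nSum += aH[ i ] * aW2[ i ][ j ]
         next
         aY[ j ] := nSum
      next
      return aY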

Performance Tips

For inference with pre-trained models, set GPU_LAYERS to a value > 0 if you have a GPU. This offloads computation to the GPU and significantly speeds up generation. On CPU-only machines, a quantized model (Q4_K_M) keeps inference responsive; for models you train from scratch, nLayers = 2–4 provides a good balance of quality and speed.
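Following the DEFINE TRANSFORMER syntax from Example 1, GPU offloading might be enabled like this (32 is an illustrative layer count; tune it to your GPU's memory):

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b-chat.Q4_K_M.gguf" ;
      CONTEXT 2048 ;
      GPU_LAYERS 32   // offload 32 layers to the GPU; 0 keeps everything on the CPU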

Explore the Samples

Open any of the 7 sample projects in samples/projects/transformer/ to see complete, runnable implementations of each pattern described in this tutorial.
