Transformer Tutorial

The TTransformer component implements the full transformer architecture described in the seminal paper "Attention Is All You Need" (Vaswani et al., 2017). It allows you to load pre-trained model weights, run inference, fine-tune on custom data, and even train small transformers from scratch — all from Harbour xBase code.

Didactic Examples

The samples/projects/transformer/ folder contains seven complete, runnable projects covering every aspect of the transformer component.

Sample Projects

#  Project                Description
1  Attention Vis          Visualize self-attention weights for a given input sentence.
2  Text Generator         Load a pre-trained model and generate text from a prompt.
3  Train from Scratch     Train a tiny transformer on a toy dataset (copy task).
4  Tokenizer Explorer     Inspect tokenization, token IDs, and vocabulary.
5  Sentiment Analyzer     Fine-tune a transformer for positive/negative classification.
6  Translator             Sequence-to-sequence translation (e.g., English to Spanish).
7  Attention Walkthrough  Step-by-step walkthrough of the "Attention Is All You Need" architecture.

Transformer Architecture

graph LR
   A["Input Text"] --> B["Tokenizer"]
   B --> C["Embedding +\nPositional Encoding"]
   C --> D["Multi-Head\nSelf-Attention"]
   D --> E["Feed-Forward\nNetwork"]
   E --> F["Add & Norm\n(Residual)"]
   F --> G["Repeat N times\n(nLayers)"]
   G --> H["Linear +\nSoftmax"]
   H --> I["Output\nProbabilities"]
   style D fill:#58a6ff,stroke:#388bfd,color:#0d1117
   style E fill:#58a6ff,stroke:#388bfd,color:#0d1117
   style G fill:#d2a8ff,stroke:#bc8cff,color:#0d1117

Properties

Property      Type     Default  Description
nLayers       Numeric  6        Number of encoder/decoder layers in the transformer stack.
nHeads        Numeric  8        Number of attention heads per layer. Must divide nEmbedDim evenly.
nEmbedDim     Numeric  512      Embedding dimension (d_model). Core width of the transformer.
nFFDim        Numeric  2048     Feed-forward network inner dimension (typically 4x nEmbedDim).
nVocabSize    Numeric  50257    Vocabulary size. Must match the tokenizer's vocabulary.
nMaxSeqLen    Numeric  512      Maximum input/output sequence length in tokens.
nDropout      Numeric  0.1      Dropout rate applied to attention and FFN outputs (0 = disabled).
cWeightsFile  String   ""       Path to pre-trained weights file (GGUF format). Set before loading.
lCausal       Logical  .T.      Use causal (decoder) mask. Set to .F. for encoder-only or bidirectional mode.
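Assuming the properties in the table above are writable directly on the object before the model is loaded (an assumption; the DEFINE TRANSFORMER clauses in the examples below are the documented route), a configuration sketch might look like:

   local oTransformer

   DEFINE TRANSFORMER oTransformer

   // Hypothetical direct property assignment, using the names above
   oTransformer:nLayers      := 4
   oTransformer:nHeads       := 8
   oTransformer:nEmbedDim    := 256       // nHeads (8) divides nEmbedDim (256) evenly
   oTransformer:lCausal      := .F.       // encoder-only / bidirectional mode
   oTransformer:cWeightsFile := "models/encoder.gguf"  // illustrative path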

Events

Event        Parameters               Description
OnAttention  aWeights, nLayer, nHead  Fired during the forward pass with attention weight matrices. Use for visualization and analysis.
OnGenerate   cToken, nStep            Fired for each token generated during inference. Enables streaming output.
OnTrainStep  nStep, nLoss, nBatch     Fired after each training step with the current loss value. Use for progress display.
OnLoss       nLoss                    Fired when the loss value is computed during a forward pass. Use for monitoring and early stopping.

Example 1: Load a Pre-Trained Model

#include "hbbuilder.ch"

function Main()

   local oTransformer, cOutput

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b-chat.Q4_K_M.gguf" ;
      CONTEXT 2048 ;
      GPU_LAYERS 0

   if .not. oTransformer:lLoaded
      ? "Failed to load model:", oTransformer:cError
      return nil
   endif

   // Generate text from a prompt
   cOutput := oTransformer:Generate( ;
      "Once upon a time", ;  // prompt
      128 )                  // max tokens

   ? cOutput

return nil

Example 2: Streaming Token Generation

static function StreamExample()

   local oTransformer

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/phi3-mini.Q4_K_M.gguf"

   // Hook into the OnGenerate event for streaming
   oTransformer:OnGenerate := { |cToken, nStep| ;
      QQOut( cToken ) }  // QQOut() avoids the newline QOut() prepends, so tokens flow inline

   oTransformer:GenerateStream( ;
      "Explain the transformer architecture", ;
      256 )

return nil

Example 3: Attention Visualization

static function AttentionVis()

   local oTransformer, oForm, oChart
   local aAttentionMaps := {}

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b.Q4_K_M.gguf"

   // Capture attention weights via OnAttention event
   oTransformer:OnAttention := ;
      { |aW, nL, nH| AAdd( aAttentionMaps, ;
         { nL, nH, aW } ) }

   // Run a forward pass
   oTransformer:Forward( "The cat sat on the mat" )

   // Now aAttentionMaps contains weight matrices
   // for each layer and head - render as heatmap
   RenderAttentionHeatmap( aAttentionMaps )

return nil

Example 4: Training from Scratch

static function TrainExample()

   local oTransformer
   local aInputs  := { "hello", "world", "test" }
   local aTargets := { "hello", "world", "test" }  // copy task

   DEFINE TRANSFORMER oTransformer ;
      LAYERS 2 ;
      HEADS 4 ;
      EMBED_DIM 128 ;
      FF_DIM 256 ;
      VOCAB_SIZE 1000 ;
      MAX_SEQ_LEN 32 ;
      DROPOUT 0.1

   // Monitor training progress
   oTransformer:OnTrainStep := ;
      { |nStep, nLoss, nBatch| ;
         QOut( "Step:", nStep, "Loss:", nLoss ) }

   oTransformer:OnLoss := ;
      { |nLoss| iif( nLoss < 0.01, ;
         QOut( "Converged!" ), nil ) }  // code blocks take expressions, so use iif() rather than if/endif

   // Train for 1000 steps
   oTransformer:Train( aInputs, aTargets, ;
      1000,  // steps
      0.001 ) // learning rate

return nil

Architecture Details

Attention Mechanism

Each attention head computes the standard scaled dot-product attention:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

The multi-head mechanism runs nHeads parallel attention computations and concatenates the results, then applies a linear projection.
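As a didactic aid only (this is not the component's internal implementation), the formula above can be computed by hand in Harbour on small matrices. MatMul, Transpose, and SoftmaxRows are helper names invented for this sketch:

   // Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
   static function Attention( aQ, aK, aV, nDk )
      local aScores := MatMul( aQ, Transpose( aK ) )
      local i, j
      for i := 1 to Len( aScores )             // scale by sqrt(d_k)
         for j := 1 to Len( aScores[ i ] )
            aScores[ i ][ j ] /= Sqrt( nDk )
         next
      next
      return MatMul( SoftmaxRows( aScores ), aV )

   static function SoftmaxRows( aM )           // softmax over each row in place
      local aRow, nSum, i
      for each aRow in aM
         nSum := 0
         for i := 1 to Len( aRow )
            aRow[ i ] := Exp( aRow[ i ] )
            nSum += aRow[ i ]
         next
         for i := 1 to Len( aRow )
            aRow[ i ] /= nSum
         next
      next
      return aM

   static function MatMul( aA, aB )            // naive matrix product
      local aC := {}, i, j, k, nSum
      for i := 1 to Len( aA )
         AAdd( aC, Array( Len( aB[ 1 ] ) ) )
         for j := 1 to Len( aB[ 1 ] )
            nSum := 0
            for k := 1 to Len( aB )
               nSum += aA[ i ][ k ] * aB[ k ][ j ]
            next
            aC[ i ][ j ] := nSum
         next
      next
      return aC

   static function Transpose( aM )
      local aT := {}, i, j
      for j := 1 to Len( aM[ 1 ] )
         AAdd( aT, Array( Len( aM ) ) )
         for i := 1 to Len( aM )
            aT[ j ][ i ] := aM[ i ][ j ]
         next
      next
      return aT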

Causal Masking

When lCausal is .T. (the default), the transformer applies a causal (triangular) mask that prevents each position from attending to future positions. This is essential for autoregressive text generation. Set lCausal := .F. for bidirectional tasks like classification or masked language modeling.
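Conceptually, the triangular mask adds a large negative value to every future position before the softmax, driving its weight to zero. A didactic Harbour sketch (not part of the component API):

   static function CausalMask( nSeqLen )
      local aMask := {}, i, j
      for i := 1 to nSeqLen
         AAdd( aMask, Array( nSeqLen ) )
         for j := 1 to nSeqLen
            // position i may attend to positions 1..i only
            aMask[ i ][ j ] := iif( j <= i, 0, -1e9 )  // -1e9 stands in for -infinity
         next
      next
      return aMask

For a 3-token sequence this yields:

   0  -1e9  -1e9
   0    0   -1e9
   0    0     0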

Feed-Forward Network

Each layer contains a position-wise feed-forward network:

FFN(x) = max(0, x * W1 + b1) * W2 + b2

where the inner dimension is nFFDim (typically 4x the embedding dimension).
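For a single position vector, the two projections and the ReLU reduce to plain loops. A self-contained didactic sketch (invented helper, not the component's internals; weights are nested Harbour arrays):

   // FFN(x) = max(0, x * W1 + b1) * W2 + b2, for one position vector aX
   static function FFN( aX, aW1, aB1, aW2, aB2 )
      local aH := Array( Len( aB1 ) ), aY := Array( Len( aB2 ) )
      local i, j, nSum
      for j := 1 to Len( aB1 )                 // first projection + ReLU
         nSum := aB1[ j ]
         for i := 1 to Len( aX )
            nSum += aX[ i ] * aW1[ i ][ j ]
         next
         aH[ j ] := Max( nSum, 0 )             // max(0, .) is the ReLU
      next
      for j := 1 to Len( aB2 )                 // second projection back to d_model
         nSum := aB2[ j ]
         for i := 1 to Len( aH )
            nSum += aH[ i ] * aW2[ i ][ j ]
         next
         aY[ j ] := nSum
      next
      return aY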

Performance Tips

For inference with pre-trained models, set GPU_LAYERS to a value > 0 if you have a GPU. This offloads computation to the GPU and significantly speeds up generation. On CPU-only machines, a quantized model (Q4_K_M) keeps inference responsive; for models you train from scratch, nLayers = 2–4 provides a good balance of quality and speed.
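Following the DEFINE TRANSFORMER syntax from Example 1, GPU offloading might be enabled like this (32 is an illustrative layer count; tune it to your GPU's memory):

   DEFINE TRANSFORMER oTransformer ;
      MODEL "models/tinyllama-1.1b-chat.Q4_K_M.gguf" ;
      CONTEXT 2048 ;
      GPU_LAYERS 32   // offload 32 layers to the GPU; 0 keeps everything on the CPU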

Explore the Samples

Open any of the 7 sample projects in samples/projects/transformer/ to see complete, runnable implementations of each pattern described in this tutorial.
