Transformer Neural Network

Source: source/classes/transformer.prg, source/function/matrixes.c, source/function/matrices.prg

FiveWin provides a pure Harbour/C implementation of an encoder-only Transformer for sequence classification and language modeling. It includes multi-head self-attention, feed-forward networks, layer normalization, and Adam and SGD optimizers. The heavy linear algebra runs in C extensions defined in matrixes.c.

Architecture

flowchart LR
    Input[Input Tokens] --> Embed[Embedding Lookup]
    Embed --> Pos[Positional Encoding]
    Pos --> MHA[Multi-Head Attention]
    MHA --> Add1[+ Residual]
    Add1 --> LN1[Layer Norm]
    LN1 --> FF[Feed Forward]
    FF --> Add2[+ Residual]
    Add2 --> LN2[Layer Norm]
    LN2 --> Pool[Avg Pooling]
    Pool --> Proj[W_vocab Projection]
    Proj --> SM[Softmax]
    SM --> Output[Output Probabilities]

Classes

Transformer

The top-level class that orchestrates embeddings, positional encoding, stacked layers, and output projection.

Method	Description
`New( num_layers, d_model, n_heads, vocab_size, max_seq_len, hEmbeddings, aVocab )`	Create a Transformer with the given architecture. Initializes all sub-layers and embeddings.
`Forward( src_indices, pos_indices )`	Run a forward pass: embedding lookup, positional encoding, N transformer blocks, average pooling, and softmax over vocabulary.
`Backward( d_output, targets, src_indices, pos_indices )`	Backpropagate gradients through all layers and accumulate weight gradients.
`zero_grad()`	Reset all accumulated gradients to zero.

MultiHeadAttention

Computes scaled dot-product attention over multiple heads. Projects inputs into Query, Key, Value spaces via learned matrices, computes attention scores, and concatenates head outputs.

FeedForward

A position-wise two-layer MLP with ReLU activation that operates on each position independently: FF(x) = W2 * ReLU(W1 * x + b1) + b2.
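
The formula above can be sketched in C for one position vector. This is an illustration only, not the FiveWin code: the name feed_forward, the hidden width d_ff, and the row-major weight layout (W1 is d_ff x d_model, W2 is d_model x d_ff) are assumptions.

```c
#include <stddef.h>

/* Position-wise feed-forward for one position (illustrative sketch):
 *   y = W2 * ReLU(W1 * x + b1) + b2
 * x, b2, y have d_model entries; b1 and hidden have d_ff entries.
 * W1 is row-major [d_ff x d_model], W2 is row-major [d_model x d_ff]. */
void feed_forward(const double *x,
                  const double *W1, const double *b1,
                  const double *W2, const double *b2,
                  double *hidden, double *y,
                  size_t d_model, size_t d_ff)
{
    /* First layer followed by ReLU. */
    for (size_t h = 0; h < d_ff; h++) {
        double acc = b1[h];
        for (size_t k = 0; k < d_model; k++)
            acc += W1[h * d_model + k] * x[k];
        hidden[h] = acc > 0.0 ? acc : 0.0;
    }

    /* Second layer, no activation. */
    for (size_t o = 0; o < d_model; o++) {
        double acc = b2[o];
        for (size_t h = 0; h < d_ff; h++)
            acc += W2[o * d_ff + h] * hidden[h];
        y[o] = acc;
    }
}
```

Because the same weights are applied to each position separately, a full layer would simply call this once per sequence position.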

LayerNorm

Applies layer normalization across the feature dimension: y = (x - mean) / sqrt(var + eps) * gamma + beta.

Optimizer_Adam

Adaptive Moment Estimation optimizer. Maintains first (m) and second (v) moment estimates with bias correction. Uses the C-accelerated hb_AdamUpdate() function for efficient parameter updates.

Optimizer_SGD

Stochastic Gradient Descent with optional momentum for simpler training scenarios.
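
A minimal C sketch of SGD with classical momentum is shown below. The name sgd_momentum_update and the velocity formulation are assumptions; the document does not specify which momentum variant Optimizer_SGD uses. Setting momentum to 0 reduces the update to plain SGD.

```c
#include <stddef.h>

/* One SGD step with classical momentum over n parameters
 * (illustrative sketch). velocity must start at zero and persists
 * between calls. */
void sgd_momentum_update(double *w, const double *g, double *velocity,
                         size_t n, double lr, double momentum)
{
    for (size_t i = 0; i < n; i++) {
        /* Decay the previous velocity and add the new gradient step. */
        velocity[i] = momentum * velocity[i] - lr * g[i];
        w[i] += velocity[i];
    }
}
```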

Matrix Operations (C Extensions)

The matrixes.c file provides high-performance linear algebra exposed to Harbour as HB_FUNC wrappers:

Function	Description
`hb_MatrixMultiply( A, B )`	Standard matrix multiplication (MxN * NxP = MxP)
`hb_Softmax( logits )`	Row-wise softmax with numerical stability (max subtraction)
`hb_ReLU( M )`	Rectified Linear Unit: max(0, x) element-wise
`hb_AdamUpdate( W, dW, m, v, t, lr, beta1, beta2, eps )`	Adam optimizer parameter update step with bias correction
`hb_MatrixTranspose( M )`	Transpose matrix (swap rows and columns)
`hb_Matrix3DAdd( A, B )`	Element-wise addition for 3D tensors (batch x seq x dim)
`hb_Matrix3DAvg( X )`	Average pooling over the sequence dimension
`hb_Matrix3DExpand( M, n )`	Expand 2D gradient to 3D for backpropagation through sequence

FW_Matrix Class

The FW_Matrix class in matrices.prg provides a high-level wrapper for matrix manipulation with Harbour operator overloading:

oM1 := FW_Matrix():New( { { 1, 2 }, { 3, 4 } } )
oM2 := FW_Matrix():New( { { 5, 6 }, { 7, 8 } } )
oResult := oM1 * oM2   // operator overloading via OPERATOR keyword
oResult:View()         // display in TXBrowse grid

Training Pipeline

A typical training loop follows these steps:

Forward pass: call oTransformer:Forward( src_indices, pos_indices ) to get probability distributions.
Loss computation: compute cross-entropy loss between predictions and target tokens.
Backward pass: call oTransformer:Backward( d_output, targets, src_indices, pos_indices ) to propagate gradients.
Optimizer step: use Optimizer_Adam:Step( oTransformer ) or Optimizer_SGD:Step( oTransformer ) to update weights.
Zero gradients: call oTransformer:zero_grad() before the next iteration.

Example: Sentiment Classification

The sample samples/ai/transf2.prg extends the Transformer with a classification head for binary sentiment analysis (Positive/Negative):

#include "FiveWin.ch"

CLASS CustomTransformer INHERIT Transformer
   DATA W_proj, b_class, dW_proj, db_class, last_probs

   METHOD New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
               hEmbeddings, aVocab, num_classes )
   METHOD ForwardClassification( src_indices, pos_indices )
   METHOD BackwardClassification( d_loss, train_src, train_pos )
   METHOD UpdateWeights( learningRate )
ENDCLASS

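METHOD New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
            hEmbeddings, aVocab, num_classes ) CLASS CustomTransformer
   // Build the base Transformer first, then add the classification head
   ::Super:New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
                hEmbeddings, aVocab )
   // The head maps the vocab-sized output of Forward() to num_classes logits
   ::W_proj   := hb_MatrixRandom( vocab_size, num_classes )
   ::b_class  := hb_MatrixZero( 1, num_classes )
   ::dW_proj  := hb_MatrixZero( vocab_size, num_classes )
   ::db_class := hb_MatrixZero( 1, num_classes )
RETURN Self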

// Classification forward: parent forward + W_proj projection + softmax
METHOD ForwardClassification( src_indices, pos_indices ) CLASS CustomTransformer
   // Harbour requires LOCAL declarations before any executable statement
   LOCAL probs, logits
   probs := ::Forward( src_indices, pos_indices )
   ::last_probs := probs   // kept for the backward pass
   logits := hb_MatrixMultiply( probs, ::W_proj )
   // Add the per-class bias to every row (two classes in this sample)
   AEval( logits, { |row| row[ 1 ] += ::b_class[ 1 ][ 1 ] } )
   AEval( logits, { |row| row[ 2 ] += ::b_class[ 1 ][ 2 ] } )
RETURN hb_Softmax( logits )

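// Training loop
// Arguments: 1 layer, d_model = 6, 2 heads, vocab_size = 8, max_seq_len = 5,
// hEmbeddings = NIL, aVocab = vocab, num_classes = 2
oTransformer := CustomTransformer():New( 1, 6, 2, 8, 5, NIL, vocab, 2 )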
FOR e := 1 TO 50
   FOR i := 1 TO Len( trainSrc )
      pred := oTransformer:ForwardClassification( { trainSrc[ i ] }, ;
                                                   { trainPos[ i ] } )
      loss := CrossEntropyLoss( pred, { trainLabels[ i ] } )
      oTransformer:BackwardClassification( loss[ 2 ], ;
                                           { trainSrc[ i ] }, { trainPos[ i ] } )
      oTransformer:UpdateWeights( 0.01 )
   NEXT
NEXT

Train the model on short phrases such as "buen dia" (Positive) and "malo trabajo" (Negative), using a vocabulary of 8 tokens and 50 training epochs, as in the loop above.

Notes

Sinusoidal positional encoding is pre-computed at initialization for efficiency.
Residual connections use hb_Matrix3DAdd (C-accelerated) to avoid Harbour loop overhead.
Softmax in C subtracts row max before exponentiation to prevent floating-point overflow.
Visual training demos: samples/ai/transformer.prg (GUI) and samples/ai/transf1.prg (interactive).

For a GPT stack built purely from equations with generic autodiff, see TTransformerTL and the Tensor Logic tutorial.