Transformer Neural Network
Source: source/classes/transformer.prg, source/function/matrixes.c, source/function/matrices.prg
FiveWin provides a pure Harbour/C implementation of an encoder-only Transformer architecture
for sequence classification and language modeling. The implementation includes multi-head
self-attention, feed-forward networks, layer normalization, and Adam/SGD optimizers. All
heavy linear algebra is accelerated via C extensions in matrixes.c.
Architecture
Classes
Transformer
The top-level class that orchestrates embeddings, positional encoding, stacked layers, and output projection.
| Method | Description |
|---|---|
| `New( num_layers, d_model, n_heads, vocab_size, max_seq_len, hEmbeddings, aVocab )` | Create a Transformer with the given architecture. Initializes all sub-layers and embeddings. |
| `Forward( src_indices, pos_indices )` | Run a forward pass: embedding lookup, positional encoding, N transformer blocks, average pooling, and softmax over the vocabulary. |
| `Backward( d_output, targets, src_indices, pos_indices )` | Backpropagate gradients through all layers and accumulate weight gradients. |
| `zero_grad()` | Reset all accumulated gradients to zero. |
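
The average-pooling step in `Forward` collapses the sequence axis into a single vector before the output projection. The following is a minimal C sketch of that idea, assuming a row-major `[seq_len x d_model]` buffer; the function name `mean_pool_seq` and its signature are hypothetical and are not taken from matrixes.c.

```c
#include <stddef.h>

/* Illustrative sketch, not FiveWin code.
   Mean-pool a row-major [seq_len x d_model] matrix over the sequence axis.
   out must have room for d_model values. */
void mean_pool_seq(const double *x, size_t seq_len, size_t d_model, double *out)
{
    for (size_t j = 0; j < d_model; j++) {
        double sum = 0.0;
        for (size_t t = 0; t < seq_len; t++)
            sum += x[t * d_model + j];
        /* Guard against an empty sequence rather than dividing by zero */
        out[j] = seq_len ? sum / (double)seq_len : 0.0;
    }
}
```

Pooling this way gives one fixed-size vector per input sequence regardless of its length, which is what lets a single output projection follow it.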
MultiHeadAttention
Computes scaled dot-product attention over multiple heads. Projects inputs into Query, Key, Value spaces via learned matrices, computes attention scores, and concatenates head outputs.
FeedForward
A position-wise two-layer MLP with ReLU activation that operates on each position independently:
FF(x) = W2 * ReLU(W1 * x + b1) + b2.
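
The formula above can be sketched in C for one position vector. This is an illustration only: the name `feed_forward`, the row-major weight layout, and the hidden width `d_ff` are assumptions, not details confirmed by the FiveWin sources.

```c
#include <stddef.h>

/* Illustrative sketch, not FiveWin code.
   FF(x) = W2 * ReLU(W1 * x + b1) + b2 for a single position.
   x      : [d_model]
   w1, b1 : [d_ff x d_model] row-major, [d_ff]
   w2, b2 : [d_model x d_ff] row-major, [d_model]
   hidden : caller-provided scratch buffer of d_ff doubles
   out    : [d_model] */
void feed_forward(const double *x,
                  const double *w1, const double *b1,
                  const double *w2, const double *b2,
                  size_t d_model, size_t d_ff,
                  double *hidden, double *out)
{
    /* First layer followed by ReLU */
    for (size_t h = 0; h < d_ff; h++) {
        double z = b1[h];
        for (size_t j = 0; j < d_model; j++)
            z += w1[h * d_model + j] * x[j];
        hidden[h] = z > 0.0 ? z : 0.0;
    }
    /* Second layer, no activation */
    for (size_t i = 0; i < d_model; i++) {
        double y = b2[i];
        for (size_t h = 0; h < d_ff; h++)
            y += w2[i * d_ff + h] * hidden[h];
        out[i] = y;
    }
}
```

Because the same weights are applied to every position, a full layer simply calls this once per row of the sequence.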
LayerNorm
Applies layer normalization across the feature dimension: y = (x - mean) / sqrt(var + eps) * gamma + beta.
Optimizer_Adam
Adaptive Moment Estimation optimizer. Maintains first (m) and second (v) moment estimates with bias correction.
Uses the C-accelerated hb_AdamUpdate() function for efficient parameter updates.
Optimizer_SGD
Stochastic Gradient Descent with optional momentum for simpler training scenarios.
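
One common formulation of SGD with momentum is sketched below in C. The exact variant used by `Optimizer_SGD` is not documented here, so the velocity convention and the name `sgd_momentum_step` should be read as assumptions.

```c
#include <stddef.h>

/* Illustrative sketch, not FiveWin code.
   Classical momentum: vel = momentum * vel - lr * g; w += vel.
   With momentum == 0 this reduces to plain SGD. */
void sgd_momentum_step(double *w, const double *g, double *vel, size_t n,
                       double lr, double momentum)
{
    for (size_t i = 0; i < n; i++) {
        vel[i] = momentum * vel[i] - lr * g[i];
        w[i] += vel[i];
    }
}
```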
Matrix Operations (C Extensions)
The matrixes.c file provides high-performance linear algebra exposed to Harbour as HB_FUNC wrappers:
| Function | Description |
|---|---|
| `hb_MatrixMultiply( A, B )` | Standard matrix multiplication (MxN * NxP = MxP) |
| `hb_Softmax( logits )` | Row-wise softmax with numerical stability (max subtraction) |
| `hb_ReLU( M )` | Rectified Linear Unit: max(0, x) element-wise |
| `hb_AdamUpdate( W, dW, m, v, t, lr, beta1, beta2, eps )` | Adam optimizer parameter update step with bias correction |
| `hb_MatrixTranspose( M )` | Transpose matrix (swap rows and columns) |
| `hb_Matrix3DAdd( A, B )` | Element-wise addition for 3D tensors (batch x seq x dim) |
| `hb_Matrix3DAvg( X )` | Average pooling over the sequence dimension |
| `hb_Matrix3DExpand( M, n )` | Expand 2D gradient to 3D for backpropagation through sequence |
FW_Matrix Class
The FW_Matrix class in matrices.prg provides a high-level wrapper for matrix manipulation
with Harbour operator overloading:
oM1 := FW_Matrix():New( { { 1, 2 }, { 3, 4 } } )
oM2 := FW_Matrix():New( { { 5, 6 }, { 7, 8 } } )
oResult := oM1 * oM2 // operator overloading via OPERATOR keyword
oResult:View() // display in TXBrowse grid
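
The product computed by the `*` operator in the snippet above is ordinary MxN * NxP matrix multiplication. A minimal C sketch of that operation follows; `mat_mul` is a hypothetical helper on flat row-major arrays and does not reproduce the Harbour array handling in matrixes.c.

```c
#include <stddef.h>

/* Illustrative sketch, not FiveWin code.
   c = a * b, where a is [m x n], b is [n x p], c is [m x p],
   all stored row-major. c must not alias a or b. */
void mat_mul(const double *a, const double *b,
             size_t m, size_t n, size_t p, double *c)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < p; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++)
                acc += a[i * n + k] * b[k * p + j];
            c[i * p + j] = acc;
        }
    }
}
```

For the two 2x2 matrices used above, `{ { 1, 2 }, { 3, 4 } }` times `{ { 5, 6 }, { 7, 8 } }`, the product is `{ { 19, 22 }, { 43, 50 } }`.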
Training Pipeline
A typical training loop follows these steps:
- Forward pass: call `oTransformer:Forward( src_indices, pos_indices )` to get probability distributions.
- Loss computation: compute cross-entropy loss between predictions and target tokens.
- Backward pass: call `oTransformer:Backward( d_output, targets, src_indices, pos_indices )` to propagate gradients.
- Optimizer step: use `Optimizer_Adam:Step( oTransformer )` or `Optimizer_SGD:Step( oTransformer )` to update weights.
- Zero gradients: call `oTransformer:zero_grad()` before the next iteration.
Example: Sentiment Classification
The sample samples/ai/transf2.prg extends the Transformer with a classification head
for binary sentiment analysis (Positive/Negative):
#include "FiveWin.ch"

CLASS CustomTransformer INHERIT Transformer
   DATA W_proj, b_class, dW_proj, db_class, last_probs
   METHOD New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
               hEmbeddings, aVocab, num_classes )
   METHOD ForwardClassification( src_indices, pos_indices )
   METHOD BackwardClassification( d_loss, train_src, train_pos )
   METHOD UpdateWeights( learningRate )
ENDCLASS

// Constructor: build the base Transformer, then add the classification head.
// The parameters are named explicitly so they can be referenced in the body.
METHOD New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
            hEmbeddings, aVocab, num_classes ) CLASS CustomTransformer
   ::Super:New( num_layers, d_model, n_heads, vocab_size, max_seq_len, ;
                hEmbeddings, aVocab )
   ::W_proj   := hb_MatrixRandom( vocab_size, num_classes )
   ::b_class  := hb_MatrixZero( 1, num_classes )
   ::dW_proj  := hb_MatrixZero( vocab_size, num_classes )
   ::db_class := hb_MatrixZero( 1, num_classes )
RETURN Self
// Classification forward: parent forward + W_proj projection + softmax
METHOD ForwardClassification( src_indices, pos_indices ) CLASS CustomTransformer
   // LOCAL declarations must come before any executable statement
   LOCAL probs  := ::Forward( src_indices, pos_indices )
   LOCAL logits := hb_MatrixMultiply( probs, ::W_proj )
   ::last_probs := probs
   // Add the per-class bias to every row (two classes)
   AEval( logits, { |row, i| row[ 1 ] += ::b_class[ 1 ][ 1 ] } )
   AEval( logits, { |row, i| row[ 2 ] += ::b_class[ 1 ][ 2 ] } )
RETURN hb_Softmax( logits )
// Training loop
oTransformer := CustomTransformer():New( 1, 6, 2, 8, 5, NIL, vocab, 2 )
FOR e := 1 TO 50
   FOR i := 1 TO Len( trainSrc )
      pred := oTransformer:ForwardClassification( { trainSrc[ i ] }, ;
                                                  { trainPos[ i ] } )
      loss := CrossEntropyLoss( pred, { trainLabels[ i ] } )
      oTransformer:BackwardClassification( loss[ 2 ], ;
                                           { trainSrc[ i ] }, { trainPos[ i ] } )
      oTransformer:UpdateWeights( 0.01 )
   NEXT
NEXT
Train with phrases like "buen dia" (Positive) and "malo trabajo" (Negative) using a
vocabulary of 8 tokens and 50 training epochs.
Notes
- Sinusoidal positional encoding is pre-computed at initialization for efficiency.
- Residual connections use `hb_Matrix3DAdd` (C-accelerated) to avoid Harbour loop overhead.
- Softmax in C subtracts row max before exponentiation to prevent floating-point overflow.
- Visual training demos: `samples/ai/transformer.prg` (GUI) and `samples/ai/transf1.prg` (interactive).