Behind the Moves

For my birthday, my wife bought me a magnificent wooden Backgammon set. It’s a particularly fascinating game. Unlike chess, which is perfectly deterministic, Backgammon distinguishes itself by its use of chance. Not to randomly determine the winner, but rather to put the players’ skills to the test against the whims of fate. With every roll of the dice, the game configuration evolves, constantly suspended in a state of uncertainty. The equation is therefore complex. Rest assured, though: a good Backgammon player, even with terrible luck on the dice, will almost always emerge victorious against an average player blessed with excellent rolls. Personally, I find myself somewhere in the middle: mediocre and unlucky…

As I started working on understanding various tactics and trying to grasp the probabilities inherent in the game mechanics, I began programming a simple ASCII Backgammon board editor. The initial goal was just to illustrate the articles I planned to write about my own progress.

Once the editor was finished, I thought, “Well, since the board representation is there, why not program the computer to play against me?” Initially, I just coded the basic rules for moving the checkers. A few days later, I figured that by integrating some more elaborate principles, I’d have a more seasoned opponent against whom I could measure my mediocrity. So, I armed it with principles gleaned from an excellent 1973 book: “Beginning Backgammon” by Tim Holland. Curiously, this wasn’t too difficult: by adjusting two dynamic variables (reward and penalty) based on the game situation (isolated blots, building primes, hitting opponent’s blots…), the computer’s capabilities genuinely improved. The level of play started approaching that of the apps you can download on your phone.

Chatting about this with a few colleagues, the idea of training a real AI model quickly surfaced. By literally changing just a few lines of code to enable self-play, I launched the computer against itself in an epic journey of 165,000 games. All were meticulously recorded in an SQLite database (now weighing in at 3 GB), offering a future model over 6 million moves to analyze.

Then came the moment to dive into the actual learning process. I’ll be honest: apart from training simple categorization models (often limited to binary answers), I had no idea how to conceptualize such training. So, I leaned heavily on Claude.ai and Gemini (Google AI Studio) to help me with this task.

What Makes Backgammon Challenging for AI?

Backgammon presents unique challenges for artificial intelligence:

  • It combines skill and luck through dice rolls
  • It requires both tactical moves and long-term strategy
  • The number of possible game states is astronomically large
  • Players must adapt to constantly changing board situations

Unlike chess or Go, where every move is deterministic, Backgammon’s dice introduce probability and risk management into every decision, making it a great AI challenge.

Step 1: Creating a Minimalist Model

Just to see… and to avoid being completely overwhelmed, I asked these AIs to write a minimal reinforcement learning model. I provided them with the database structure and not much else. After a few clicks (okay, I exaggerate… probably a few hundred rounds of discussion, testing, and debugging), I had a functional script. The model didn’t take long to train – about ten minutes, if I recall correctly. A small model without grand ambitions, but one that knew how to play Backgammon. Well, it knew how to move its pieces legally and was ultimately a pretty good sport, in the sense that it never got angry watching me thoroughly beat it in every game.

Step 2: Designing the Advanced Training Pipeline

I then returned to my AI friends to improve the training. This is where I truly started feeling out of my depth. Through extensive discussions about Backgammon tactics, the AIs converged on a finely tuned, bespoke training protocol (the pipeline). Here’s what I managed to understand about its workings (without being capable of coding even 1/1000th of it…). My role in this design was mostly to offer very theoretical ideas which, echoing already established techniques, were rapidly transcribed into Python. Here’s an overview of the pipeline:

  1. Supervised Learning (Pretraining)

    • The Idea: Learn by imitating a “teacher” or existing examples. Instead of learning through trial-and-error (like in RL), we show the model game situations (state) and the “correct” move to play (target policy) or the final outcome of the game (target value) drawn from expert games or previous model versions.
    • How? Load data (e.g., from my 6M+ move database). Each data point is a triplet: (Game State, Move Played, Final Game Outcome). Train the model to:
      • Predict the move: The policy head learns to output a high probability for the move actually played in that state in the database (using policy_loss, often Cross-Entropy). Success is measured by accuracy (percentage of times the predicted most likely move is the correct one).
      • Predict the outcome: The value head learns to predict whether the game was won (+1) or lost (-1) from that state (using value_loss, often Huber Loss or MSE). A minimal sketch of this two-headed loss appears right after this list.
    • Why? Often used as pretraining. It gives the model an initial understanding of the game (legal moves, which positions generally lead to wins/losses) before letting it learn on its own via RL. This speeds up the RL process.
  2. Reinforcement Learning (RL)

    • The Idea: Learn through trial-and-error and delayed rewards. The agent plays against itself or an opponent, tries moves, and receives “rewards” (or “penalties”) based on the game’s outcome or intermediate goals (bearing off a checker, hitting the opponent). It adjusts its policy to maximize total long-term rewards.
    • How?
      • Experience Generation: The agent plays games (advanced_self_play, _generate_...). For each move, it records: state (s), action chosen (a), policy used (π(a|s)), immediate reward (r), and next state (s’).
      • Learning (Update): Use collected experiences to improve the policy and value function. This is where PPO and GAE come in: GAE estimates the advantage of each action (A), and PPO updates the policy to favor advantageous actions and updates the value function to better predict future returns (more on that later).
    • Why? Allows the model to potentially discover strategies better than those in the initial data and adapt to the long-term consequences of its actions.
  3. Curriculum Learning

    • The Idea: Start with simple problems and gradually increase the difficulty, like a student in school.
    • How (in my code)?
      • Supervised: CURRICULUM_THRESHOLDS and stages in _get_curriculum_weights adjust how the loss is calculated. Early on (stage 0), the focus is on learning the policy (what moves to play); later, the weight of the value (is this a good position?) increases. (An illustrative weight schedule appears after this list.)
      • Structured Reinforcement: The train_with_structured_curriculum function implements a curriculum based on game phases:
        • Phase 1 (Endgame): Agent only plays endgames (_generate_bearing_off...). Simpler, only bearing-off decisions.
        • Phase 2 (Simplified): Games with fewer checkers (_generate_simplified...). Less tactical complexity.
        • Phase 3 (Full): Complete games (advanced_self_play...). The hardest problem.
    • Why? Helps the model learn faster and more stably by not immediately confronting it with the full complexity of the game.
  4. Data Augmentation

    • The Idea: Artificially create more training data from existing data to make the model more robust and help it generalize better.
    • How (augmented_supervised_training)? Take an existing example (state, policy, value) and create a new version by adding slight “noise” (random variations) to the state (numeric features, not the checker positions themselves, to avoid invalidating the board) and/or to the target policy. (A small noise sketch follows the list.)
    • Why? Exposing the model to slight variations of the same situations helps it avoid merely “memorizing” exact examples and react better to slightly different situations in actual play.
  5. Elo Rating Evaluation

    • The Idea: Measure the relative strength of different model versions (or against other players/models) by simulating matches, inspired by the chess rating system.
    • How (evaluate_elo_verbose)? The current model plays a number of games (EVAL_GAMES) against one or more opponents (previous versions, a noisy version). Based on the results (wins, losses, draws) and the initial Elo difference, each player’s Elo is adjusted. Winning against a stronger opponent yields a large Elo gain; losing to a weaker one causes a significant drop. The K-factor (ELO_K_FACTOR) controls the magnitude of adjustments. (The standard update formula is sketched after this list.)
    • Why? Provides an objective (though potentially noisy) measure of the model’s playing strength progression over time, more direct than just looking at loss curves.
  6. Arbitrator

    • The Idea: A part of the code that knows the fundamental rules of the game, can verify if a model’s proposed move is valid, and can detect situations of stalemate or lack of progress.
    • How (BackgammonArbitrator)?
      • validate_move: Checks if an action (move) is legal given the dice roll and board position (e.g., can I land here? Is the die available?). (A deliberately simplified version of this check is sketched after this list.)
      • check_for_progress: Looks at the recent history of positions to see if the game is stagnating (no checkers borne off, very similar positions).
    • Why?
      • Validation: Prevents the model (especially during evaluation or if using a basic action selection logic) from making illegal moves.
      • Stagnation: Allows ending games stuck in infinite loops or dead-end situations, saving computation time.
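
To make the supervised step a bit more concrete, here is a minimal sketch of the kind of two-headed loss described above, assuming a PyTorch model that outputs policy logits over an action space plus a scalar value. The function and argument names are mine, not the actual pipeline’s.

```python
import torch.nn.functional as F

def supervised_loss(policy_logits, value_pred, target_move, target_outcome,
                    policy_weight=1.0, value_weight=0.5):
    """Combined loss for a batch of (state, move played, final outcome) triplets.

    policy_logits:  (batch, n_actions) raw scores from the policy head
    value_pred:     (batch,) predicted outcome in [-1, 1] from the value head
    target_move:    (batch,) index of the move actually played (long tensor)
    target_outcome: (batch,) +1.0 for a win, -1.0 for a loss (float tensor)
    """
    # Policy head: cross-entropy against the move recorded in the database.
    policy_loss = F.cross_entropy(policy_logits, target_move)

    # Value head: Huber loss against the final game outcome (MSE also works).
    value_loss = F.huber_loss(value_pred, target_outcome)

    # Accuracy: how often the most likely predicted move is the one played.
    accuracy = (policy_logits.argmax(dim=1) == target_move).float().mean()

    return policy_weight * policy_loss + value_weight * value_loss, accuracy
```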
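
The supervised curriculum can be pictured as a simple weight schedule feeding into that loss. The thresholds and numbers below are purely illustrative; my actual CURRICULUM_THRESHOLDS and stage logic differ.

```python
# Illustrative only: the real CURRICULUM_THRESHOLDS and stage logic differ.
CURRICULUM_THRESHOLDS = [50_000, 150_000]   # hypothetical training-step boundaries

def _get_curriculum_weights(global_step):
    """Return (policy_weight, value_weight) for the supervised loss.

    Stage 0: focus on imitating moves (policy).
    Later stages: give the value head ("is this a good position?") more weight.
    """
    if global_step < CURRICULUM_THRESHOLDS[0]:    # stage 0
        return 1.0, 0.2
    if global_step < CURRICULUM_THRESHOLDS[1]:    # stage 1
        return 1.0, 0.5
    return 1.0, 1.0                               # final stage
```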
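
The augmentation step looks roughly like this, sketched here with NumPy arrays; augmented_supervised_training works on encoded feature vectors, and the parameter names (noise_scale, rng) are invented for the example.

```python
import numpy as np

def augment_example(state, policy, value, noise_scale=0.01, rng=None):
    """Return a slightly perturbed copy of one training example.

    Noise goes on the numeric state features and on the target policy
    (which is then re-normalized); checker positions themselves are left
    alone so the board stays valid.
    """
    rng = rng or np.random.default_rng()
    noisy_state = state + rng.normal(0.0, noise_scale, size=state.shape)
    noisy_policy = policy + rng.normal(0.0, noise_scale, size=policy.shape)
    noisy_policy = np.clip(noisy_policy, 0.0, None)
    noisy_policy /= noisy_policy.sum() + 1e-8   # keep it a valid distribution
    return noisy_state, noisy_policy, value
```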
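
The Elo update itself is the standard chess formula; a minimal version looks like this (k plays the role of ELO_K_FACTOR; the function names are mine).

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one game.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a draw.
    k plays the role of ELO_K_FACTOR: larger k means bigger swings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b
```

Applied over EVAL_GAMES games against frozen earlier versions, this is what produces the rating curve the pipeline tracks.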
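
Finally, here is a deliberately simplified picture of the kind of test validate_move performs. This is not the real BackgammonArbitrator: it uses a toy board representation (24 signed checker counts, with the player to move heading toward point 0) and ignores the bar, bearing off, and doubles.

```python
def validate_move(board, dice, src, die):
    """Toy legality check for moving one checker.

    board: list of 24 ints; >0 = checkers of the player to move
           (who moves toward point 0), <0 = opponent checkers.
    dice:  die values still available this turn.
    src:   0-based point the checker starts from.
    die:   die value we want to spend.

    Deliberately ignores the bar, bearing off and doubles.
    """
    if die not in dice:
        return False                 # that die has already been used
    if not (0 <= src < 24) or board[src] <= 0:
        return False                 # no checker of ours on that point
    dst = src - die
    if dst < 0:
        return False                 # bearing off is not handled in this sketch
    if board[dst] <= -2:
        return False                 # point blocked by two or more opponent checkers
    return True
```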

Deeper Dive: PPO and GAE (How the RL Magic Happens)

Imagine learning Backgammon. You make a move. Was it good? Hard to tell immediately. It might lead to a great position in 5 turns, or seem good now but cost you the game later. RL aims to learn a Policy (how to choose a move) that maximizes long-term Reward (winning).

  1. PPO (Proximal Policy Optimization) - Learning Better, Cautiously

    • The Problem: When updating our strategy (policy), changing it too drastically based on a few recent games can be risky. What worked in those few games might be terrible overall.
    • PPO’s Idea: Update the policy prudently. “I want to improve, but not change too much from my current strategy (the one I just used to play).”
    • How (Simplified): The agent plays using its current old_policy. It assesses if actions taken were better or worse than expected (using the Advantage, see GAE). It calculates a ratio: probability(action | new_policy) / probability(action | old_policy). PPO then “clips” this ratio. If the new policy wants to make a good action much more likely (large ratio), PPO limits the increase. If it wants to make a bad action much less likely (small ratio), PPO limits the decrease. This “limit” is controlled by epsilon.
    • Goal: By taking small, careful steps, PPO avoids breaking a somewhat working policy, ensuring stable, gradual improvement. It’s robust and popular.
  2. GAE (Generalized Advantage Estimation) - Judging a Move’s Quality

    • The Problem: How good was that move on turn 5 if the reward (+1 win/-1 loss) only comes at turn 40?
    • GAE’s Idea: Estimate the Advantage of taking a specific action in a state: “How much better (or worse) was this action compared to the average action I could have taken?”
    • How (Simplified): We need a Value Function (V) that estimates the expected future reward from a state s. For each step (state s -> reward r -> state s'), calculate the TD Error: delta = r + gamma * V(s') - V(s). This measures how reality (r + gamma * V(s')) differed from prediction (V(s)). GAE then looks at a weighted combination of future deltas: delta_t + (gamma*lambda)*delta_{t+1} + (gamma*lambda)^2*delta_{t+2} + .... lambda controls how much future errors matter.
    • Goal: GAE provides a more stable, less noisy estimate of Advantage than simpler methods by balancing immediate feedback with longer-term consequences. PPO uses this Advantage estimate to update the policy. (A combined GAE + PPO sketch follows.)
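
Putting the two ideas together, here is a compact sketch of GAE advantages feeding into PPO’s clipped objective, assuming PyTorch tensors. The hyperparameter names (gamma, lam, epsilon) mirror the description above rather than my actual code.

```python
import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finished game.

    rewards: (T,) immediate rewards r_t
    values:  (T+1,) value estimates V(s_0)...V(s_T); the last entry is the
             value of the state after the final move (0 if the game is over).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = torch.tensor(0.0)
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae        # discounted sum of future deltas
        advantages[t] = gae
    returns = advantages + values[:-1]         # training targets for the value head
    return advantages, returns

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped PPO objective for the actions actually taken during self-play."""
    ratio = torch.exp(new_log_probs - old_log_probs)    # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the more pessimistic objective, negated because optimizers minimize.
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```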

In Short: The pipeline combines:

  • Supervised Learning: Initial knowledge base.
  • Augmentation: Robustness for that base.
  • RL (PPO+GAE): Self-improvement beyond initial data.
  • Curriculum: Easier learning curve.
  • Elo: Measuring real strength.
  • Arbitrator: Rule enforcement and game termination.

First Training Run & The Facepalm Moment

Teaching a Computer to Play

The project ended up organized like this:

backgammon/
├── environment/
│   ├── __init__.py
│   ├── game.py              # Backgammon game environment
│   └── arbitrator.py        # Backgammon arbitrator
├── models/
│   ├── __init__.py
│   ├── backgammon_model.py  # Improved Backgammon model class
│   └── version_manager.py   # Model version manager
├── training/
│   ├── __init__.py
│   ├── agent.py             # Advanced Backgammon RL agent
│   ├── supervised.py        # Supervised training methods
│   └── reinforcement.py     # Reinforcement learning methods
├── utils/
│   ├── __init__.py
│   ├── encoders.py          # State encoding functions
│   ├── evaluators.py        # Elo system for evaluation
│   └── visualization.py     # Visualization helpers
└── main.py                  # Main script

I launched my first “victorious” training run early in the evening, expecting to find a fully baked model in the morning. To my surprise, it finished cooking in just 3 hours! I immediately integrated it into my ASCII interface and started testing, fully expecting to get crushed. Ultimately, while it was a bit more adept than the previous version and sometimes even delightfully aggressive, it proved incapable of adapting its play to the different phases of a Backgammon game. I logged several games I played against it and discussed with the AIs its inability to weave any real strategy. While they were more nuanced in their diagnosis than I was, they agreed that adjustments to the training variables were necessary.

It was at that moment I remembered something stupid: during the testing phases, I had temporarily reduced its database usage to just 0.1%… What an idiot!

I’ve now rectified that oversight and relaunched the calculation. This time, it’s definitely not going to finish overnight… but hopefully, in a few days, I’ll have an opponent capable of putting me in my place. Wish me luck!