Beyond Supervision
Backgammon is a captivating blend of skill and luck, making it a classic testbed for Artificial Intelligence. While Reinforcement Learning (RL) giants like TD-Gammon have shown incredible prowess, my recent journey has explored alternative paths, highlighting both the challenges of pure RL and the potential of hybrid approaches. This post details that journey, starting from a solid supervised learning foundation and navigating the turbulent waters of RL before settling on a promising evolutionary strategy.
The Solid Foundation: Refining Heuristics and Supervised Learning
Before diving into complex learning algorithms, many AI approaches start with encoding human knowledge. In Backgammon, this often takes the form of a heuristic evaluation function. Simply put, a heuristic is a set of rules or calculations based on expert understanding that assigns a score to a board position, estimating how advantageous it is for a player. It considers factors like pip count (race progress), blots (vulnerable checkers), points made (blocking opponent), checkers on the bar, home board strength, and prime structures (sequences of made points).
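To make that concrete, here is a minimal sketch of what such a weighted heuristic might look like. The feature names and weight values below are illustrative placeholders, not the actual numbers from my evaluator.

```python
# Illustrative sketch of a weighted heuristic evaluation (not the actual
# implementation): each positional feature gets a hand-tuned weight and the
# score is their weighted sum from the perspective of the player to move.
def evaluate_position(features: dict, weights: dict) -> float:
    """features and weights share keys such as 'pip_lead', 'blots', 'points_made'."""
    return sum(weights[name] * value for name, value in features.items())

example_weights = {
    "pip_lead": 1.0,          # race advantage in pips
    "blots": -8.0,            # exposed checkers are penalized
    "points_made": 5.0,       # made points block the opponent
    "checkers_on_bar": -12.0, # checkers stuck on the bar
    "home_board_points": 4.0, # home board strength
    "prime_length": 6.0,      # length of consecutive made points
}
```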
My initial efforts focused on:
- Reworking the Heuristic: I refined the logic and weighting of different factors (pip count, blot safety, point value, prime building, anchor value, etc.) within my heuristic evaluation function, creating different sets of weights (HeuristicWeights) tailored for the opening, midgame, and endgame phases. The goal was to create a stronger baseline understanding of positional value.
- Higher Quality Training Data: Using this improved heuristic (let’s call it “Heuristic Benchmark” or “Champion 1”) to play against itself, I generated a large database of game states, moves, and outcomes. This self-play data, guided by the better heuristic, aimed to be of higher quality than randomly sampled games.
- Supervised Learning: I then trained a neural network (ImprovedBackgammonModel, using TensorFlow/Keras) in a supervised manner using this generated database. The network learned to predict the heuristic’s evaluation score and the best move (according to the heuristic) for given positions. This resulted in a competent model (“NN Champion” or “Champion 2”) achieving a respectable Elo rating around the 1500 mark – a solid intermediate player.
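For readers unfamiliar with this kind of setup, here is a rough sketch of what the supervised stage could look like in Keras. The layer sizes, input encoding, and move-space size are assumptions for illustration, not the real ImprovedBackgammonModel architecture.

```python
import tensorflow as tf

# Placeholder sketch of the supervised stage; sizes are illustrative only.
NUM_FEATURES = 198   # assumed board-encoding size
NUM_MOVES = 128      # assumed size of the move-index space

def build_model() -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(NUM_FEATURES,))
    x = tf.keras.layers.Dense(256, activation="relu")(inputs)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    # Regression head for the heuristic evaluation score.
    value = tf.keras.layers.Dense(1, name="value")(x)
    # Classification head for the heuristic's preferred move.
    policy = tf.keras.layers.Dense(NUM_MOVES, activation="softmax", name="policy")(x)
    model = tf.keras.Model(inputs, [value, policy])
    model.compile(
        optimizer="adam",
        loss={"value": "mse", "policy": "sparse_categorical_crossentropy"},
    )
    return model

# Training on the self-play database would then look roughly like:
# model.fit(states, {"value": heuristic_scores, "policy": best_move_indices},
#           batch_size=256, epochs=10)
```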
The Rocky Road of Reinforcement Learning
With a decent supervised model, the natural next step seemed to be Reinforcement Learning, specifically using Proximal Policy Optimization (PPO) with self-play, aiming to surpass the heuristic’s limitations and potentially discover stronger strategies. However, I quickly ran into several significant hurdles:
- Computation Time: RL self-play is slow. Unlike supervised learning where batches of data are processed efficiently on a GPU, RL requires simulating thousands of game steps sequentially for each iteration to generate experiences. On my local machine, this meant training times stretched dramatically compared to the supervised phase.
- Resource Bottlenecks: The process was often fragile. Generating experiences, calculating advantages (GAE), and performing PPO updates involves significant data manipulation and computation, leading to occasional memory issues or unexpected slowdowns.
- Lack of Progress (Stagnation/Regression): This was the most frustrating part. Despite the theoretical power of RL, the Elo rating of my agent stubbornly refused to climb significantly beyond the ~1500 achieved via supervised learning. Monitoring the training revealed the agent was often stuck on a performance plateau, or worse, the Elo would decrease over iterations (as seen in the Elo graph showing a negative trend in one run). The policy loss remained flat near zero, while the entropy steadily decreased, indicating the agent was becoming very confident in a suboptimal strategy without actually improving its win rate.
- Hyperparameter Hell: PPO and the RL setup have numerous sensitive hyperparameters: learning rate, clipping epsilon, value function coefficient, entropy coefficient, exploration rates (initial, decay, min), GAE lambda, PPO epochs per update… Finding the right combination felt like searching for a needle in a haystack and required deep understanding and extensive experimentation, often exceeding my grasp.
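To give a feel for the size of that search space, here is a hypothetical configuration collecting the knobs listed above. The values are common starting points from the PPO literature rather than anything I settled on.

```python
# Hypothetical PPO configuration illustrating the knobs listed above;
# values are common defaults from the literature, not tuned ones.
ppo_config = {
    "learning_rate": 3e-4,
    "clip_epsilon": 0.2,        # PPO clipping range
    "value_coef": 0.5,          # weight of the value-function loss
    "entropy_coef": 0.01,       # exploration bonus
    "gae_lambda": 0.95,         # advantage estimation smoothing
    "gamma": 0.99,              # discount factor
    "ppo_epochs": 4,            # gradient passes per batch of experience
    "exploration_initial": 1.0,
    "exploration_decay": 0.995,
    "exploration_min": 0.05,
}
```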
The Siren Call of Auto-Tuning (Optuna)
To combat the hyperparameter challenge, I considered using automated tuning libraries like Optuna. Optuna works by defining an “objective function” that runs a training trial with a set of hyperparameters suggested by Optuna’s algorithms (like Bayesian optimization). It then returns a performance metric (e.g., final Elo after X iterations). Optuna runs many such trials, exploring the hyperparameter space to find the best combination.
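In code, such an objective could look roughly like the sketch below; run_rl_training is a hypothetical stand-in for my actual training loop, and the parameter ranges are illustrative.

```python
import optuna

# Sketch of an Optuna study. run_rl_training is a hypothetical stand-in for
# the PPO self-play loop; it would return the Elo reached after N iterations.
def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "clip_epsilon": trial.suggest_float("clip_epsilon", 0.1, 0.3),
        "entropy_coef": trial.suggest_float("entropy_coef", 1e-4, 5e-2, log=True),
        "gae_lambda": trial.suggest_float("gae_lambda", 0.9, 0.99),
    }
    return run_rl_training(params, iterations=30)  # hypothetical helper

# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=100)
```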
However, the roadblock here was again computation time. Each Optuna trial would need to run a significant number of RL iterations (e.g., 30-50) to get a meaningful Elo estimate. Given that a single RL run was already slow, running potentially hundreds of trials via Optuna quickly became impractical with my available resources and patience.
A New Hope: Hybrid Evolutionary Approach
Facing these challenges, I decided to pivot to a different strategy inspired by evolutionary algorithms and the successes of simpler self-play systems (like HC-Gammon mentioned in the literature), but keeping my strong heuristic and NN components. The core idea is variation, competitive evaluation, and selection.
Here’s the 3-step process I implemented, running over multiple “generations”:
Variation (Creating Challengers):
- Start with the current best NN model (the “NN Champion”).
- Create a small number (e.g., 3) of “challenger” models by making slight modifications to the NN Champion. Currently, I’m doing this by adding a tiny amount of random noise to the champion’s neural network weights (see the sketch after this list). (An alternative, perhaps for later, would be to run 1-2 iterations of the PPO update for each challenger – a “micro-RL” step).
- Each challenger is saved as a separate .keras model file.
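A minimal sketch of this variation step, assuming the champion is a standard Keras model saved on disk; the noise scale and file names are illustrative.

```python
import numpy as np
import tensorflow as tf

# Sketch of the variation step: perturb the champion's weights with small
# Gaussian noise to create each challenger, then save them to disk.
def make_challengers(champion_path: str, n_challengers: int = 3,
                     noise_std: float = 0.01) -> list[str]:
    paths = []
    for i in range(n_challengers):
        model = tf.keras.models.load_model(champion_path)
        noisy_weights = [w + np.random.normal(0.0, noise_std, size=w.shape)
                         for w in model.get_weights()]
        model.set_weights(noisy_weights)
        path = f"challenger_{i}.keras"   # illustrative file name
        model.save(path)
        paths.append(path)
    return paths
```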
Evaluation (Competition Time!):
- This is the key step. For each generation, define a fixed evaluation pool. This pool must include:
- The NN Champion itself (the parent of the challengers).
- The Heuristic Benchmark (my refined heuristic AI, acting as a stable reference point).
- (Optional but good): One or two strong NN models from previous generations (retrieved via the VersionManager).
- Each Challenger NN then plays a set number of games (e.g., 20-30) against every member of this evaluation pool. An Elo rating is calculated for each challenger based on its performance in these matches.
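Conceptually, the rating update is the standard Elo formula. The sketch below shows one way the round-robin could be wired up; play_game is a hypothetical stand-in for the actual match code, and K = 32 is a common default.

```python
# Standard Elo expectation and update, applied after each game.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    exp_a = expected_score(rating_a, rating_b)
    return (rating_a + k * (score_a - exp_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - exp_a)))

# Hypothetical round-robin: pool members are objects with an .elo attribute
# (NN Champion, heuristic benchmark, past champions).
def evaluate_challenger(challenger, pool, games_per_opponent: int = 20,
                        start_elo: float = 1500.0) -> float:
    elo = start_elo
    for opponent in pool:
        for _ in range(games_per_opponent):
            score = play_game(challenger, opponent)  # hypothetical: 1.0 win, 0.0 loss
            elo, opponent.elo = update_elo(elo, opponent.elo, score)
    return elo
```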
Selection (Survival of the Fittest NN):
- Compare the final Elo ratings achieved by all the challenger NNs in the evaluation step.
- The NN model (either one of the challengers or potentially the original NN Champion if no challenger performed significantly better) with the highest Elo is selected as the NN Champion for the next generation.
- This selected champion model is copied to best_model.keras and its details (generation number, Elo, performance stats) are logged. The temporary challenger files are cleaned up.
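A sketch of how this selection step might look in code; the logging call is a placeholder for the VersionManager, and the file names follow the ones mentioned above.

```python
import os
import shutil

# Sketch of the selection step: keep whichever model scored the highest Elo
# (a challenger or the incumbent champion), promote it, and clean up the rest.
def select_champion(champion_path: str, champion_elo: float,
                    challenger_results: dict) -> float:
    best_path, best_elo = champion_path, champion_elo
    for path, elo in challenger_results.items():
        if elo > best_elo:
            best_path, best_elo = path, elo
    if best_path != champion_path:          # a challenger beat the incumbent
        shutil.copy(best_path, champion_path)
    # version_manager.log(best_path, best_elo)   # placeholder for real logging
    for path in challenger_results:          # remove temporary challenger files
        os.remove(path)
    return best_elo

# Usage (hypothetical): select_champion("best_model.keras", 1500.0,
#                                       {"challenger_0.keras": 1492.0, ...})
```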
Visualizing the Loop:
Variation (create challengers) → Evaluation (round-robin vs. the pool, Elo) → Selection (promote the highest-Elo model as the new NN Champion) → repeat for the next generation.
Early Results & Future Steps
Implementing this evolutionary loop yielded immediately encouraging results. After just a few generations, the NN Champion’s Elo rating climbed past 1504, surpassing the plateau reached by both the initial supervised training and the struggling RL attempts.
Why might this work better in my current situation?
- Direct Selection Pressure: Only variations that demonstrably win more (reflected in Elo) survive. Regression is actively prevented.
- Stable Benchmark: Competing against the fixed heuristic ensures the NN doesn’t “forget” basic principles or drift into strange, exploitable strategies. It has to be good overall.
- Avoiding RL Pitfalls: It sidesteps the complexities of tuning PPO’s internal mechanisms and the potential for it to get stuck due to poor exploration or noisy advantage signals, especially with limited compute.
My next steps are to let this evolutionary process run for more generations and monitor the Elo progression. If it eventually plateaus, I might implement the “micro-RL” variation step instead of simple noise, potentially allowing for more directed improvements.
Conclusion
While sophisticated RL algorithms hold immense power, they aren’t always the most practical solution, especially when facing resource constraints or difficult tuning challenges. My journey showed that a strong supervised starting point combined with a simpler, evolutionary approach focused on direct competitive selection (crucially, including a fixed heuristic benchmark) can be a viable and effective path forward. It highlights the value of hybrid methods and adapting strategies based on empirical results and available resources in the cool and geekish domain of Backgammon AI.