Six AI paradigms, one game engine, from scratch

The IILM AHL game-development course required building AI agent behaviour in a real-time engine and demonstrating it. The path of least resistance was obvious: Unity's Asset Store has pre-built AI packages for pathfinding, navigation meshes, and steering behaviours. Drop them in, configure a few inspector properties, run the demo, pass the assessment.
I had been building things in game engines since 2009 and had always treated the AI layer as a black box. I knew how to connect the existing systems; I did not know how any of them actually worked. If I used Asset Store plugins here, that would still be true after the course. So the constraint I set was simple: every algorithm implemented from scratch — no Asset Store AI packages, no Unity NavMesh for the pathfinding agent, no pre-built steering behaviours. The cost was time. The stake was whether I would walk out of the course with genuine models for how these algorithms work, or just configuration familiarity.
The choice produced five agents across six paradigms, all written in C#, all understood from the Bellman update forward.
The Q-Learning grid agent was the foundation. Implementing the update rule that follows from the Bellman equation, Q(s,a) ← Q(s,a) + α(r + γ · max_a' Q(s',a') − Q(s,a)), makes the tradeoffs legible in a way that reading the equation never does. You watch the Q-table converge over episodes. You see what happens when the learning rate α is too high and the table oscillates rather than settles. The Q-table persists across episode resets via DontDestroyOnLoad on the parent GameObject: agent memory across episodes is not free, and Unity makes you be explicit about it.
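As an illustration of that update rule, here is a minimal Python sketch (the original agent was written in C# inside Unity). The corridor environment, the function names, and the values of alpha, gamma, and epsilon are assumptions chosen for the example, not the course project's actual grid or settings.

```python
import random

def q_update(q, state, action, reward, next_state,
             alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular Q-learning update in place and return the new value."""
    # A terminal state has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]

def train_corridor(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Hypothetical 1-D corridor: action 0 moves left, action 1 moves right.

    The only reward (1.0) is given for reaching the right-most state.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != n_states - 1:
            # Epsilon-greedy: explore sometimes, otherwise take the best action.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(0, state - 1) if action == 0 else state + 1
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            q_update(q, state, action, reward, next_state,
                     alpha, gamma, terminal=done)
            state = next_state
    return q
```

Calling `train_corridor()` returns a table in which moving right scores higher than moving left in every non-terminal state. Raising `alpha` towards 1.0 is one way to experiment with how the step size changes the way the values settle.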
The A* pathfinding agent was where the "no NavMesh" constraint bit hardest. NavMesh would have taken twenty minutes. Building the graph from collider positions at runtime (min-heap priority queue, open and closed sets, Euclidean heuristic, dynamic re-routing when obstacles move) took days. I am glad it did. Priority queues are not abstract when you are debugging why the agent routes around a wall it should not be avoiding, because the heuristic is overestimating the cost of the direct path.
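The core loop can be sketched in a few lines of Python. This is a simplified illustration, not the Unity implementation: it assumes a 4-connected grid of 0 (free) and 1 (blocked) cells rather than a graph built from collider positions, and the function name `astar` is hypothetical. Python's `heapq` plays the role of the min-heap.

```python
import heapq
import math

def astar(grid, start, goal):
    """Return a shortest 4-connected path as a list of (row, col), or None."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Straight-line distance never exceeds the true grid distance,
        # so it never overestimates on a unit-cost grid.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]  # entries are (f, g, cell)
    g_cost = {start: 0.0}
    came_from = {}
    closed = set()

    while open_heap:
        _, g, current = heapq.heappop(open_heap)
        if current in closed:
            continue  # stale heap entry for an already expanded cell
        if current == goal:
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        closed.add(current)
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if grid[nr][nc] == 1:
                continue
            new_g = g + 1.0
            if new_g < g_cost.get((nr, nc), math.inf):
                g_cost[(nr, nc)] = new_g
                came_from[(nr, nc)] = current
                heapq.heappush(
                    open_heap, (new_g + heuristic((nr, nc)), new_g, (nr, nc))
                )
    return None
```

In this simplified setting, re-routing around a moved obstacle amounts to updating the grid and calling `astar` again from the agent's current cell. Multiplying the heuristic by a factor greater than 1 is one way to reproduce the overestimation bug described above.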
The waypoint and steering agent explored a different paradigm: no search, just behavioural forces — seek, flee, arrive, pursue — composited into a velocity vector each frame. Simple in principle, surprisingly powerful in practice when you understand why the steering forces compose rather than conflict.
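A rough Python sketch of how such forces can be composed each frame is shown below. It assumes 2-D tuples for vectors and a simple weighted sum for blending; the helper names, weights, and limits are illustrative and not taken from the original C# agent. Pursue is omitted for brevity.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def _with_length(v, length):
    """Rescale v to the given length; a zero vector stays zero."""
    mag = math.hypot(v[0], v[1])
    if mag == 0.0:
        return (0.0, 0.0)
    return (v[0] * length / mag, v[1] * length / mag)

def _clamp(v, max_len):
    return v if math.hypot(v[0], v[1]) <= max_len else _with_length(v, max_len)

def seek(pos, vel, target, max_speed):
    """Steer towards the target at full speed."""
    desired = _with_length(_sub(target, pos), max_speed)
    return _sub(desired, vel)

def flee(pos, vel, threat, max_speed):
    """Steer directly away from the threat at full speed."""
    desired = _with_length(_sub(pos, threat), max_speed)
    return _sub(desired, vel)

def arrive(pos, vel, target, max_speed, slow_radius):
    """Like seek, but the desired speed shrinks inside slow_radius."""
    offset = _sub(target, pos)
    distance = math.hypot(offset[0], offset[1])
    speed = max_speed * min(1.0, distance / slow_radius)
    return _sub(_with_length(offset, speed), vel)

def step(pos, vel, weighted_forces, max_force, max_speed, dt):
    """Blend (force, weight) pairs into one acceleration and integrate."""
    fx = sum(w * f[0] for f, w in weighted_forces)
    fy = sum(w * f[1] for f, w in weighted_forces)
    accel = _clamp((fx, fy), max_force)
    vel = _clamp((vel[0] + accel[0] * dt, vel[1] + accel[1] * dt), max_speed)
    pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
    return pos, vel
```

Because every behaviour returns a correction relative to the current velocity, the forces can be weighted and summed before a single clamp, which is one reason they tend to compose rather than conflict.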
The PPO neural network agent was trained using Unity ML-Agents with custom observation spaces — fan-pattern raycasts, speed, angular velocity — and a reward signal I designed to incentivise useful behaviour rather than just forward progress. That reward design problem is harder than the algorithm itself.
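To make the observation design concrete, here is a small Python sketch of how a fan of ray directions and a normalised observation vector might be assembled. It is not the ML-Agents API: the function names, the ray count, and the normalisation ranges are all assumptions for illustration, and in Unity the hit distances would come from the engine's own raycasts.

```python
import math

def fan_ray_directions(forward_deg, spread_deg, ray_count):
    """Unit vectors spread evenly across spread_deg, centred on forward_deg."""
    if ray_count < 1:
        raise ValueError("ray_count must be at least 1")
    if ray_count == 1:
        angles = [forward_deg]
    else:
        start = forward_deg - spread_deg / 2.0
        step = spread_deg / (ray_count - 1)
        angles = [start + i * step for i in range(ray_count)]
    return [(math.cos(math.radians(a)), math.sin(math.radians(a)))
            for a in angles]

def _clamp(value, low, high):
    return max(low, min(high, value))

def build_observation(ray_distances, max_ray_length,
                      speed, max_speed,
                      angular_velocity, max_angular_velocity):
    """Flatten sensor readings into one list with bounded, comparable values."""
    # Ray distances are capped at the ray length and scaled to [0, 1].
    obs = [min(d, max_ray_length) / max_ray_length for d in ray_distances]
    # Speed and angular velocity are scaled to [-1, 1].
    obs.append(_clamp(speed / max_speed, -1.0, 1.0))
    obs.append(_clamp(angular_velocity / max_angular_velocity, -1.0, 1.0))
    return obs
```

Keeping every input in a similar range is a common choice when feeding mixed sensor readings to a neural network policy.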
The hybrid IL+PPO self-driving car was the most complex and the most directly connected to how I think about AI training today.
Training the car agent from random initialisation with pure reinforcement learning was not viable on the hardware I had. The agent must randomly stumble onto the track before it can discover that staying on it is rewarded. On a mid-range laptop — the same machine that had made Blender renders painful two years earlier — unconstrained training would have taken prohibitive time.
The solution was Imitation Learning before RL. I drove the car myself to produce demonstration data. Behavioural Cloning pre-trained the network on those recordings, giving it a warm-start policy that already roughly knew how to navigate the track. PPO then fine-tuned that policy with a custom reward signal — positive for forward progress, negative for boundary violations and collisions — generalising the agent to track sections it had never seen in the demonstrations. The trained .onnx policy runs in C# at game time via Unity's Sentis runtime, with no Python dependency in the final build.
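The shape of that reward signal can be sketched as a per-step function. This Python version is only an illustration: the parameter names, the weights, and the choice to ignore backward movement rather than penalise it are assumptions, not the values used in the actual training run.

```python
def step_reward(progress_delta, off_track, collided,
                progress_weight=1.0,
                boundary_penalty=0.5,
                collision_penalty=1.0):
    """Per-step reward for a hypothetical track-driving agent.

    progress_delta: distance gained along the track centreline this step.
    off_track:      True if the car crossed a track boundary this step.
    collided:       True if the car hit an obstacle this step.
    """
    # Reward forward progress only; reversing earns nothing in this sketch.
    reward = progress_weight * max(progress_delta, 0.0)
    if off_track:
        reward -= boundary_penalty
    if collided:
        reward -= collision_penalty
    return reward

def episode_return(steps, gamma=0.99):
    """Discounted sum of step rewards for a list of (delta, off, hit) tuples."""
    total = 0.0
    discount = 1.0
    for progress_delta, off_track, collided in steps:
        total += discount * step_reward(progress_delta, off_track, collided)
        discount *= gamma
    return total
```

The relative size of the penalties against the progress weight is the part that usually needs tuning: if the penalties are too small the agent may cut corners, and if they are too large it may learn to barely move.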
I think about this pipeline often when working on fine-tuning problems now. A model that starts from a warm prior converges faster and more stably than one starting from noise. In LLM terms, this is the difference between RLHF on a base model versus RLHF on an instruction-tuned checkpoint — the instruction-tuned prior gives the RL phase something sensible to improve rather than something random to navigate out of. The game agent made that intuition concrete before I had the vocabulary to describe it in ML terms.
Every algorithm — Bellman updates, A* graph construction, PPO hyperparameter tuning, observation space design — was implemented and understood directly. This is the same principle I apply everywhere: the Asset Store version would have told me nothing. The from-scratch version told me what each algorithm actually costs and what it actually does.
In 2009, I wrote a car game in 3D RAD where the AI was a hardcoded set of waypoints the NPC car followed in a loop. In 2023, I trained a self-driving car agent with reinforcement learning that generalises to unseen track sections. The vehicle is the same. The gap between those two implementations is the entire story of how I learn: curiosity about why something works, a refusal to accept the black-box version, and enough time to close the gap through building. That arc is one of the clearest through-lines in everything I have shipped since.