A hybrid AI system combining A* path planning and Deep Q-Network (DQN) reinforcement learning to train a fighter jet agent to navigate complex environments and evade missile threats — built in Unity with ML-Agents.
This project tackles the problem of autonomous aerial navigation under threat — a challenge sitting at the intersection of classical AI, deep reinforcement learning, and real-time simulation. A Lockheed SR-71 agent must traverse a maze of walls, reach a distant target, and dynamically switch to a learned evasion policy the moment a missile launcher detects it.
The core research question: can a hybrid architecture outperform a single-algorithm approach by delegating subtasks to the method best suited to each?
The answer, demonstrated through 5M+ training steps across 24 parallelized environments, is yes.
A companion research paper accompanies this project: Mastering Fighter Jet Survival Tactics through RL-Driven Path Planning
- Hybrid A*/DQN Architecture: A* handles deterministic maze navigation; DQN handles stochastic missile evasion. The agent switches between them in real time based on threat detection.
- PID-Controlled Flight Dynamics: A custom PID controller drives smooth yaw correction, decoupling locomotion from decision-making and producing physically plausible movement.
- Proximity-Triggered Behavioral Switching: `RocketLauncherBehavior` uses sphere-cast detection to transition the agent's behavior policy on the fly, without resetting the episode.
- Massively Parallel Training: 24 concurrent environment instances run in a single Unity scene, dramatically accelerating sample collection.
- Real-Time Statistics Tracking: A `StatsManager` singleton monitors per-episode outcomes (success, wall collision, timeout) and broadcasts the live success rate to the Unity console.
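The per-episode bookkeeping attributed to `StatsManager` could look roughly like the Python sketch below. The class and method names are hypothetical and are not taken from the project's C# source; only the three outcome categories come from the feature list.

```python
from collections import Counter

# Outcome categories named in this README.
OUTCOMES = ("success", "wall_collision", "timeout")


class EpisodeStats:
    """Hypothetical stand-in for StatsManager: counts outcomes, reports success rate."""

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome):
        """Register the outcome of one finished episode."""
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[outcome] += 1

    @property
    def episodes(self):
        """Total number of finished episodes recorded so far."""
        return sum(self.counts.values())

    def success_rate(self):
        """Fraction of finished episodes that reached the target (0.0 if none yet)."""
        return self.counts["success"] / self.episodes if self.episodes else 0.0
```

With 24 environments reporting into one tracker, the success rate is an aggregate over all instances rather than a per-environment figure.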
┌─────────────────────────────────────────────────────┐
│ AircraftAgent │
│ │
│ ┌─────────────┐ ┌──────────────────────┐ │
│ │ A* Planner │◄─────►│ PathFinder / Grid │ │
│ │ (Armed) │ │ Manager │ │
│ └──────┬──────┘ └──────────────────────┘ │
│ │ │
│ │ [missile detected] │
│ ▼ │
│ ┌─────────────┐ ┌──────────────────────┐ │
│ │ DQN Policy │◄─────►│ ML-Agents Inference │ │
│ │ (DQNActive)│ │ (ONNX model) │ │
│ └──────┬──────┘ └──────────────────────┘ │
│ │ │
│ │ [missile out of range] │
│ └──────────────► A* resumes │
│ │
│ ┌───────────────────────────────────┐ │
│ │ PID Controller (yaw axis) │ │
│ │ Kp=0.05 Ki=0 Kd=0.002 │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
The environment is divided into three zones:
- Bottom — Aircraft spawn region (randomized X position, low Z)
- Middle — Missile/rocket launcher with a configurable detection radius (white wireframe sphere)
- Top — Target zone (green platform); reaching it yields +100 reward
Purple walls form a procedural obstacle maze the aircraft must navigate. Wall collisions incur a −1 − dist/10 penalty.
The terrain is discretized into a binary occupancy grid. Each cell stores a 0 (free) or 1 (obstacle). The PathFinder queries this grid to compute the optimal waypoint sequence from the agent's current position to the target.
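As a minimal sketch of this discretization, the Python below rasterizes wall footprints into a 0/1 grid. The cell size, the world origin at (0, 0), and the representation of walls as axis-aligned rectangles are all assumptions made for illustration; the project's grid manager may work differently.

```python
import math


def build_occupancy_grid(width, depth, cell_size, walls):
    """Return grid[row][col] with 1 for cells overlapped by a wall, else 0.

    walls: list of axis-aligned rectangles (x_min, z_min, x_max, z_max)
    in world units, with the world origin at the grid's corner (assumed).
    """
    cols = math.ceil(width / cell_size)
    rows = math.ceil(depth / cell_size)
    grid = [[0] * cols for _ in range(rows)]
    for x_min, z_min, x_max, z_max in walls:
        # Index range of the cells the rectangle overlaps, clipped to the grid.
        c0 = max(0, int(x_min // cell_size))
        c1 = min(cols - 1, math.ceil(x_max / cell_size) - 1)
        r0 = max(0, int(z_min // cell_size))
        r1 = min(rows - 1, math.ceil(z_max / cell_size) - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid


def world_to_cell(x, z, cell_size):
    """Map a world-space XZ position to (row, col) grid indices."""
    return int(z // cell_size), int(x // cell_size)
```

Rows index the Z axis and columns the X axis here, which matches the XZ-plane projection but is otherwise an arbitrary choice.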
Purple highlighted cells are obstacle-occupied nodes. The red and blue markers indicate the start and goal positions passed into the A* search at each step.
A* runs on the discrete 2D grid projected onto the XZ plane. The PathFinder singleton exposes a GetNextPoint(from, to, out nextWaypoint) API that the agent calls inside a coroutine loop. The aircraft follows each waypoint using physics forces, with the PID controller correcting heading error each FixedUpdate.
The heuristic is standard Euclidean distance, and the grid is rebuilt whenever obstacles move, keeping the path current throughout the episode.
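The search itself can be illustrated with a compact A* over a 0/1 grid using the Euclidean heuristic. This is a sketch, not the project's `PathFinder`: the 4-connected neighborhood, unit step cost, and the `next_waypoint` helper (which only loosely mirrors the idea behind `GetNextPoint`) are assumptions.

```python
import heapq
import math


def astar(grid, start, goal):
    """Shortest 4-connected path on a 0/1 grid; returns [(row, col), ...] or None."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Euclidean distance to the goal: admissible for unit-cost grid moves.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f, g, cell)
    g_cost = {start: 0}
    parent = {start: None}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > g_cost[cell]:
            continue  # stale heap entry; a cheaper route was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < g_cost.get((nr, nc), math.inf):
                    g_cost[(nr, nc)] = ng
                    parent[(nr, nc)] = cell
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None


def next_waypoint(grid, start, goal):
    """Hypothetical helper: the first cell after start on the path, or None."""
    path = astar(grid, start, goal)
    return path[1] if path and len(path) > 1 else None
```

Replanning from the agent's current cell on every call, as `next_waypoint` does, is the simplest way to keep the path valid after the grid is rebuilt, at the cost of repeating the search.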
When RocketLauncherBehavior detects the aircraft within its detection radius, it calls SwitchBehavior(BehaviorType.DQN). This sets the agent's ML-Agents behavior type to InferenceOnly and activates the trained ONNX model.
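The switching rule reduces to a small two-state machine. The Python sketch below substitutes a plain distance check for the sphere-cast described in this README, and the `Mode`, `select_mode`, and `BehaviorSwitch` names are hypothetical.

```python
import math
from enum import Enum


class Mode(Enum):
    ASTAR = "A*"   # path-following mode
    DQN = "DQN"    # learned evasion mode


def select_mode(agent_xz, launcher_xz, detection_radius):
    """Return DQN while the agent is inside the launcher's radius, else A*."""
    distance = math.dist(agent_xz, launcher_xz)
    return Mode.DQN if distance <= detection_radius else Mode.ASTAR


class BehaviorSwitch:
    """Tracks the active mode across steps without resetting any episode state."""

    def __init__(self):
        self.mode = Mode.ASTAR
        self.transitions = 0

    def update(self, agent_xz, launcher_xz, detection_radius):
        new_mode = select_mode(agent_xz, launcher_xz, detection_radius)
        if new_mode is not self.mode:
            self.mode = new_mode
            self.transitions += 1
        return self.mode
```

A single threshold like this can flip modes rapidly when the agent hovers near the boundary; using a slightly larger exit radius than entry radius is one common remedy, though this README does not say whether the project does so.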
Observation Space (14-dimensional):
| Feature | Dimensions |
|---|---|
| Agent local position | 3 |
| Agent local rotation (Euler) | 3 |
| Target local position | 3 |
| Obstacle (missile) local position | 3 |
| Distance to target | 1 |
| Distance to obstacle | 1 |
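To make the 14-dimensional layout concrete, the sketch below concatenates the features in the order the table lists them. The ordering, the absence of normalization, and the function name are assumptions; the project's observation collection code may differ.

```python
import math


def build_observation(agent_pos, agent_euler, target_pos, missile_pos):
    """Concatenate the tabulated features into a flat 14-element list."""
    obs = []
    obs.extend(agent_pos)      # 3: agent local position
    obs.extend(agent_euler)    # 3: agent local rotation (Euler angles)
    obs.extend(target_pos)     # 3: target local position
    obs.extend(missile_pos)    # 3: obstacle (missile) local position
    obs.append(math.dist(agent_pos, target_pos))   # 1: distance to target
    obs.append(math.dist(agent_pos, missile_pos))  # 1: distance to obstacle
    if len(obs) != 14:
        raise ValueError(f"expected 14 features, got {len(obs)}")
    return obs
```

The two distances are redundant with the position vectors, but supplying them directly spares the network from having to learn the norm.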
Action Space:
- 2 continuous actions: `(ΔX, ΔZ)`, the lateral and forward movement deltas. `ΔZ` is clamped to `[0.1, 1.0]` to prevent backward flight.
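The clamp can be sketched in a few lines of Python. The `[0.1, 1.0]` range for ΔZ comes from the text above; the `[-1.0, 1.0]` range for ΔX and the function names are assumptions.

```python
def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def process_action(raw_dx, raw_dz):
    """Map raw policy outputs to movement deltas."""
    dx = clamp(raw_dx, -1.0, 1.0)  # assumed lateral range
    dz = clamp(raw_dz, 0.1, 1.0)   # forward delta never drops below 0.1
    return dx, dz
```

Because the lower bound on ΔZ is positive, the aircraft always advances at least slightly each step, so it cannot evade by hovering in place.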
Reward Structure:
| Event | Reward |
|---|---|
| Reaching target | +100 |
| Wall collision | −1 − dist/10 |
| Timeout (>200s) | −1 − dist/100 |
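Expressed as a function, the reward table looks like the sketch below. The README does not define `dist`; it is treated here as a distance passed in by the caller (plausibly the remaining distance to the target, which is an assumption), and the event labels are hypothetical.

```python
def terminal_reward(event, dist):
    """Reward for an episode-ending event, following the table's three rows."""
    if event == "target":
        return 100.0
    if event == "wall":
        return -1.0 - dist / 10.0
    if event == "timeout":
        return -1.0 - dist / 100.0
    raise ValueError(f"unknown event: {event}")
```

For the same `dist`, the distance-dependent part of the wall penalty is ten times that of the timeout penalty, so crashing far from the goal is punished more heavily than running out of time there.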
A custom PID class handles yaw stabilization. At each FixedUpdate, the signed angular error between the aircraft's current heading and the target waypoint direction is fed into the controller:
output = Kp·e + Ki·∫e·dt + Kd·(de/dt)
This produces a smooth, physically coherent torque applied on the Y-axis, preventing the oscillation that raw proportional steering produces at high angular errors. Tuned values: Kp = 0.05, Ki = 0, Kd = 0.002.
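A minimal discrete-time version of this controller is sketched below in Python. It is not the project's `PID.cs`: the class layout, the zero derivative on the first sample, and the angle-wrapping helper are assumptions, and the 0.02 s step in the usage note is Unity's default fixed timestep rather than a value stated in this README.

```python
class PID:
    """Discrete PID: output = Kp*e + Ki*sum(e*dt) + Kd*(e - e_prev)/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        # No derivative term on the first sample, since there is no history yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def signed_heading_error(current_deg, target_deg):
    """Smallest signed angle from current to target heading, in [-180, 180)."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0
```

With the tuned gains, `PID(0.05, 0.0, 0.002).update(signed_heading_error(350.0, 10.0), 0.02)` feeds a 20 degree error into the controller. On later steps the derivative term opposes a shrinking error, which is what damps the overshoot.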
Training used 24 concurrent environment instances within a single Unity scene, with each sub-environment containing a full independent copy of the agent, target, rocket launcher, and walls. Red platforms indicate in-progress episodes; green indicates a successful target reach. The parallelism provides high-throughput sample diversity, which is critical for learning generalizable evasion behaviors across random spawn configurations.
Training ran for 5 million steps using PPO (Proximal Policy Optimization) via the ML-Agents trainer.
| Metric | Trend |
|---|---|
| Cumulative Reward | Rises from ~20 → converges near 100 |
| Episode Length | Drops sharply after 1M steps, stabilizes ~20 |
| Policy Loss | Oscillates within a tight band (~0.118–0.125) |
| Value Loss | Drops dramatically from ~650 → ~75 |
The sharp episode-length reduction corresponds to the agent learning to reach the target quickly rather than exploring — a hallmark of reward saturation under PPO.
| Layer | Technology |
|---|---|
| Simulation Engine | Unity 2022.3.11f1 |
| RL Framework | Unity ML-Agents (PPO, DQN inference) |
| Language | C# |
| Training Monitor | TensorBoard |
| Model Format | ONNX (runtime inference) |
| Rendering | Universal Render Pipeline (URP) |
Requirements:
- Unity Editor 2022.3.11f1
- Unity ML-Agents package (see `Packages/manifest.json`)
Run the project:
- Clone the repository and open the project in Unity 2022.3.11f1.
- Open the `AStar&DQN` scene. This is the final scene, containing both path planning and DQN inference.
- Press Play. The agent will navigate using A* and switch to the trained DQN model upon missile detection.
Training from scratch:
- Install the ML-Agents Python package: `pip install mlagents`
- Open the `TrainingScene`.
- Run: `mlagents-learn config/DodgeMissile.yaml --run-id=run1`
- Press Play in the Unity Editor.
Note: A* uses physics forces for locomotion; DQN uses direct translation. This is an intentional design split that reflects each method's control requirements.
RF Project V0.5/
├── Assets/
│ ├── Scripts/
│ │ ├── AircraftAgent.cs # Core agent — A*/DQN switching, observations, rewards
│ │ ├── RocketLauncherBehavior.cs # Threat detection and missile launch logic
│ │ ├── PID.cs # PID controller for yaw stabilization
│ │ └── StatsManager.cs # Episode statistics tracker (success rate)
│ ├── Materials/ # URP materials for reward visualization
│ └── Settings/ # URP renderer and pipeline settings
├── Figures/ # Visualizations used in the research paper
├── Packages/ # Unity package manifest
├── ProjectSettings/
└── Mastering_Fighter_Jet_Survival_Tactics_through_RL_Driven_Path_Planning.pdf
This project is documented in a full research report covering problem formulation, related work, methodology, experimental setup, and results analysis.
Built with Unity ML-Agents · Reinforcement Learning · Classical AI





