Mastering Fighter Jet Survival Tactics through RL-Driven Path Planning

A hybrid AI system combining A* path planning and Deep Q-Network (DQN) reinforcement learning to train a fighter jet agent to navigate complex environments and evade missile threats — built in Unity with ML-Agents.

Overview

This project tackles the problem of autonomous aerial navigation under threat — a challenge sitting at the intersection of classical AI, deep reinforcement learning, and real-time simulation. A Lockheed SR-71 agent must traverse a maze of walls, reach a distant target, and dynamically switch to a learned evasion policy the moment a missile launcher detects it.

The core research question: can a hybrid architecture outperform a single-algorithm approach by delegating subtasks to the method best suited to each?

The answer, demonstrated through 5M+ training steps across 24 parallelized environments, is yes.

A companion research paper is included with this project: Mastering Fighter Jet Survival Tactics through RL-Driven Path Planning

Key Features

Hybrid A*/DQN Architecture — A* handles deterministic maze navigation; DQN handles stochastic missile evasion. The agent switches between them in real time based on threat detection.
PID-Controlled Flight Dynamics — A custom PID controller drives smooth yaw correction, decoupling locomotion from decision-making and producing physically plausible movement.
Proximity-Triggered Behavioral Switching — The RocketLauncherBehavior uses sphere-cast detection to transition the agent's behavior policy on the fly, without resetting the episode.
Massively Parallel Training — 24 concurrent environment instances running in a single Unity scene, dramatically accelerating sample collection.
Real-Time Statistics Tracking — A StatsManager singleton monitors per-episode outcomes (success, wall collision, timeout) and broadcasts live success rate to the Unity console.
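
The outcome bookkeeping described for StatsManager can be illustrated with a small, language-neutral sketch. This is not the project's C# singleton: the class name StatsTracker, the outcome strings, and the method names are all assumptions made for illustration.

```python
from collections import Counter


class StatsTracker:
    """Counts episode outcomes and reports the running success rate.

    Hypothetical stand-in for the StatsManager described above.
    """

    # Outcome labels are assumed names for the three outcomes the README lists.
    OUTCOMES = ("success", "wall_collision", "timeout")

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome):
        """Register one finished episode and return the updated success rate."""
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[outcome] += 1
        return self.success_rate()

    def success_rate(self):
        """Fraction of recorded episodes that ended in success (0.0 if none)."""
        total = sum(self.counts.values())
        return self.counts["success"] / total if total else 0.0
```

In a Unity setting the returned rate would be what gets logged to the console after each episode.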

System Architecture

┌─────────────────────────────────────────────────────┐
│                   AircraftAgent                     │
│                                                     │
│   ┌─────────────┐       ┌──────────────────────┐   │
│   │  A* Planner │◄─────►│  PathFinder / Grid   │   │
│   │  (Armed)    │       │  Manager             │   │
│   └──────┬──────┘       └──────────────────────┘   │
│          │                                          │
│          │  [missile detected]                      │
│          ▼                                          │
│   ┌─────────────┐       ┌──────────────────────┐   │
│   │  DQN Policy │◄─────►│  ML-Agents Inference │   │
│   │  (DQNActive)│       │  (ONNX model)        │   │
│   └──────┬──────┘       └──────────────────────┘   │
│          │                                          │
│          │  [missile out of range]                  │
│          └──────────────► A* resumes               │
│                                                     │
│   ┌───────────────────────────────────┐            │
│   │  PID Controller (yaw axis)        │            │
│   │  Kp=0.05  Ki=0  Kd=0.002         │            │
│   └───────────────────────────────────┘            │
└─────────────────────────────────────────────────────┘
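
The switching logic in the diagram can be read as a two-state machine. The sketch below is a minimal Python illustration of that reading, not the AircraftAgent code itself; the Mode enum and the function names are hypothetical, and only the state labels (Armed, DQNActive) come from the diagram.

```python
from enum import Enum


class Mode(Enum):
    ARMED = "Armed"            # A* waypoint following
    DQN_ACTIVE = "DQNActive"   # learned evasion policy


def select_mode(missile_in_range):
    """Pick the controller for this step from the threat flag alone."""
    return Mode.DQN_ACTIVE if missile_in_range else Mode.ARMED


def trace_modes(detections):
    """Return the mode chosen at each step for a sequence of threat flags."""
    return [select_mode(flag).value for flag in detections]
```

For example, `trace_modes([False, True, True, False])` walks through A*, two steps of evasion, and back to A*, all within one episode.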

Environment Design

Top-Down Layout

The environment is divided into three zones:

Bottom — Aircraft spawn region (randomized X position, low Z)
Middle — Missile/rocket launcher with a configurable detection radius (white wireframe sphere)
Top — Target zone (green platform); reaching it yields +100 reward

Purple walls form a procedural obstacle maze the aircraft must navigate. Each wall collision yields a reward of −1 − dist/10, so the penalty grows with dist.

A* Grid

The terrain is discretized into a binary occupancy grid. Each cell stores a 0 (free) or 1 (obstacle). The PathFinder queries this grid to compute the optimal waypoint sequence from the agent's current position to the target.

Purple highlighted cells are obstacle-occupied nodes. The red and blue markers indicate the start and goal positions passed into the A* search at each step.

Algorithms

A* Path Planning

A* runs on the discrete 2D grid projected onto the XZ plane. The PathFinder singleton exposes a GetNextPoint(from, to, out nextWaypoint) API that the agent calls inside a coroutine loop. The aircraft follows each waypoint using physics forces, with the PID controller correcting heading error each FixedUpdate.

The heuristic is standard Euclidean distance, and the grid is rebuilt whenever obstacles move, keeping the path current throughout the episode.
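
As a rough illustration of the search described above, here is a minimal A* over a 0/1 occupancy grid with a Euclidean heuristic. It is a sketch under stated assumptions, not the project's PathFinder: 4-connected moves with unit cost are assumed, cells are (row, col) tuples, and next_waypoint is a hypothetical helper that only loosely mirrors GetNextPoint.

```python
import heapq
import math


def astar(grid, start, goal):
    """Shortest 4-connected path on a 0/1 grid, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Straight-line distance; never overestimates a 4-connected path.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(heuristic(start), 0.0, start)]
    g_cost = {start: 0.0}
    parent = {start: None}
    closed = set()
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell in closed:
            continue
        closed.add(cell)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue  # outside the grid or an obstacle cell
            new_g = g + 1.0
            if new_g < g_cost.get((nr, nc), math.inf):
                g_cost[(nr, nc)] = new_g
                parent[(nr, nc)] = cell
                heapq.heappush(open_heap, (new_g + heuristic((nr, nc)), new_g, (nr, nc)))
    return None


def next_waypoint(grid, start, goal):
    """First cell to move to along the path, or None if there is no next step."""
    path = astar(grid, start, goal)
    if path is None or len(path) < 2:
        return None
    return path[1]
```

Re-running the search from the agent's current cell each time the grid changes is one simple way to keep the path current, at the cost of repeated full searches.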

Deep Q-Network (DQN) — "DodgeMissile"

When RocketLauncherBehavior detects the aircraft within its detection radius, it calls SwitchBehavior(BehaviorType.DQN), which sets the ML-Agents BehaviorType to InferenceOnly and activates the trained ONNX model.
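
The detection step can be sketched as a distance test against the launcher's radius that fires a callback only when the aircraft crosses the boundary. This is a plain-Python approximation, not Unity's physics query or the actual RocketLauncherBehavior; LauncherSketch, tick, on_enter, and on_exit are hypothetical names.

```python
import math


class LauncherSketch:
    """Edge-triggered proximity check around a fixed launcher position."""

    def __init__(self, position, radius, on_enter, on_exit):
        self.position = position
        self.radius = radius
        self.on_enter = on_enter  # e.g. switch the agent to the evasion policy
        self.on_exit = on_exit    # e.g. hand control back to A*
        self._inside = False

    def tick(self, aircraft_pos):
        """Check the aircraft once; fire a callback only on a boundary crossing."""
        inside = math.dist(self.position, aircraft_pos) <= self.radius
        if inside and not self._inside:
            self.on_enter()
        elif not inside and self._inside:
            self.on_exit()
        self._inside = inside
        return inside
```

Firing only on transitions avoids re-issuing the behavior switch every physics step while the aircraft stays inside the sphere.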

Observation Space (14-dimensional):

Feature	Dimensions
Agent local position	3
Agent local rotation (Euler)	3
Target local position	3
Obstacle (missile) local position	3
Distance to target	1
Distance to obstacle	1
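
The table above adds up to 3 + 3 + 3 + 3 + 1 + 1 = 14 values. A minimal sketch of how such a vector could be assembled follows; the function name, argument order, and lack of normalization are assumptions, since the README does not specify them.

```python
import math


def build_observation(agent_pos, agent_euler, target_pos, missile_pos):
    """Concatenate the six features from the table into one 14-value list."""
    obs = [
        *agent_pos,                          # agent local position (3)
        *agent_euler,                        # agent local rotation, Euler angles (3)
        *target_pos,                         # target local position (3)
        *missile_pos,                        # missile local position (3)
        math.dist(agent_pos, target_pos),    # distance to target (1)
        math.dist(agent_pos, missile_pos),   # distance to missile (1)
    ]
    if len(obs) != 14:
        raise ValueError("each position/rotation argument must have 3 components")
    return obs
```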

Action Space:

2 continuous actions: (ΔX, ΔZ) — lateral and forward movement deltas.
ΔZ is clamped to [0.1, 1.0] to prevent backward flight.
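
The clamping rule can be sketched as below. Only the [0.1, 1.0] range for ΔZ comes from the text; the [-1, 1] range for ΔX and the function name process_action are assumptions made for illustration.

```python
from typing import Tuple


def process_action(raw_dx: float, raw_dz: float) -> Tuple[float, float]:
    """Clamp the two continuous outputs before they are applied as movement."""
    dx = max(-1.0, min(1.0, raw_dx))  # lateral delta; range assumed
    dz = max(0.1, min(1.0, raw_dz))   # forward delta; lower bound keeps the jet moving forward
    return dx, dz
```

With this rule, a raw output of (0.5, -1.0) becomes (0.5, 0.1): the lateral command passes through while the backward command is replaced by the minimum forward speed.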

Reward Structure:

Event	Reward
Reaching target	+100
Wall collision	−1 − dist/10
Timeout (>200s)	−1 − dist/100
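
The reward table translates directly into a small function. In this sketch the event labels and the function name are hypothetical, and dist is assumed to be the aircraft's remaining distance to the target at the moment the episode ends, which the README does not state explicitly.

```python
def terminal_reward(event: str, dist: float) -> float:
    """Reward for an episode-ending event, following the table above."""
    if event == "target":
        return 100.0
    if event == "wall":
        return -1.0 - dist / 10.0
    if event == "timeout":
        return -1.0 - dist / 100.0
    raise ValueError(f"unknown event: {event}")
```

Under that assumption, crashing 20 units from the target gives -3.0, while timing out 50 units away gives -1.5, so a collision is penalized more heavily per unit of remaining distance than a timeout.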

PID Controller

A custom PID class handles yaw stabilization. At each FixedUpdate, the signed angular error between the aircraft's current heading and the target waypoint direction is fed into the controller:

output = Kp·e + Ki·∫e·dt + Kd·(de/dt)

This produces a smooth, physically coherent torque applied on the Y-axis, preventing the oscillation that raw proportional steering produces at high angular errors. Tuned values: Kp = 0.05, Ki = 0, Kd = 0.002.
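
The control law above can be sketched in a few lines. This is an illustrative Python version, not the project's PID.cs; the class layout, the zero derivative on the first sample, and the signed_heading_error helper are assumptions. Note that with Ki = 0 the controller reduces to a PD controller.

```python
class PID:
    """Discrete PID: output = Kp*e + Ki*integral(e dt) + Kd*(de/dt)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Advance one time step of length dt and return the control output."""
        self.integral += error * dt
        # No derivative on the first sample, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def signed_heading_error(current_deg, target_deg):
    """Shortest signed angle from current to target heading, in [-180, 180)."""
    return (target_deg - current_deg + 180.0) % 360.0 - 180.0
```

Wrapping the error into [-180, 180) matters near the 0/360 boundary: a heading of 350 degrees with a target of 10 degrees yields an error of +20 rather than -340, so the controller turns the short way round.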

Training

Parallel Environment Setup

Training used 24 concurrent environment instances within a single Unity scene, with each sub-environment containing a full independent copy of the agent, target, rocket launcher, and walls. Red platforms indicate in-progress episodes; green indicates a successful target reach. The parallelism provides high-throughput sample diversity, which is critical for learning generalizable evasion behaviors across random spawn configurations.

Results

Training ran for 5 million steps using PPO (Proximal Policy Optimization) via the ML-Agents trainer.

Metric	Trend
Cumulative Reward	Rises from ~20 → converges near 100
Episode Length	Drops sharply after 1M steps, stabilizes ~20
Policy Loss	Oscillates within a tight band (~0.118–0.125)
Value Loss	Drops dramatically from ~650 → ~75

The sharp episode-length reduction corresponds to the agent learning to reach the target quickly rather than exploring, which is consistent with the cumulative reward converging near the +100 maximum.

Tech Stack

Layer	Technology
Simulation Engine	Unity 2022.3.11f1
RL Framework	Unity ML-Agents (PPO, DQN inference)
Language	C#
Training Monitor	TensorBoard
Model Format	ONNX (runtime inference)
Rendering	Universal Render Pipeline (URP)

Getting Started

Requirements:

Unity Editor 2022.3.11f1
Unity ML-Agents package (see Packages/manifest.json)

Run the project:

1. Clone the repository and open the project in Unity 2022.3.11f1.
2. Open the AStar&DQN scene. This is the final scene containing both path planning and DQN inference.
3. Press Play. The agent will navigate using A* and switch to the trained DQN model upon missile detection.

Training from scratch:

1. Install the ML-Agents Python package: pip install mlagents
2. Open the TrainingScene.
3. Run: mlagents-learn config/DodgeMissile.yaml --run-id=run1
4. Press Play in the Unity Editor.

Note: A* uses physics forces for locomotion; DQN uses direct translation. This is an intentional design split that reflects each method's control requirements.

Project Structure

RF Project V0.5/
├── Assets/
│   ├── Scripts/
│   │   ├── AircraftAgent.cs        # Core agent — A*/DQN switching, observations, rewards
│   │   ├── RocketLauncherBehavior.cs  # Threat detection and missile launch logic
│   │   ├── PID.cs                  # PID controller for yaw stabilization
│   │   └── StatsManager.cs         # Episode statistics tracker (success rate)
│   ├── Materials/                  # URP materials for reward visualization
│   └── Settings/                   # URP renderer and pipeline settings
├── Figures/                        # Visualizations used in the research paper
├── Packages/                       # Unity package manifest
├── ProjectSettings/
└── Mastering_Fighter_Jet_Survival_Tactics_through_RL_Driven_Path_Planning.pdf

Research Paper

This project is documented in a full research report covering problem formulation, related work, methodology, experimental setup, and results analysis.

Read the paper (PDF)

Built with Unity ML-Agents · Reinforcement Learning · Classical AI
