Alpsource/Mastering-Fighter-Jet-Survival-Tactics

Mastering Fighter Jet Survival Tactics through RL-Driven Path Planning

A hybrid AI system combining A* path planning and Deep Q-Network (DQN) reinforcement learning to train a fighter jet agent to navigate complex environments and evade missile threats — built in Unity with ML-Agents.


Overview

This project tackles the problem of autonomous aerial navigation under threat — a challenge sitting at the intersection of classical AI, deep reinforcement learning, and real-time simulation. A Lockheed SR-71 agent must traverse a maze of walls, reach a distant target, and dynamically switch to a learned evasion policy the moment a missile launcher detects it.

The core research question: can a hybrid architecture outperform a single-algorithm approach by delegating subtasks to the method best suited to each?

The answer, demonstrated through 5M+ training steps across 24 parallelized environments, is yes.

A companion research paper documents this project: Mastering Fighter Jet Survival Tactics through RL-Driven Path Planning


Key Features

  • Hybrid A*/DQN Architecture — A* handles deterministic maze navigation; DQN handles stochastic missile evasion. The agent switches between them in real time based on threat detection.
  • PID-Controlled Flight Dynamics — A custom PID controller drives smooth yaw correction, decoupling locomotion from decision-making and producing physically plausible movement.
  • Proximity-Triggered Behavioral Switching — The RocketLauncherBehavior uses sphere-cast detection to transition the agent's behavior policy on the fly, without resetting the episode.
  • Massively Parallel Training — 24 concurrent environment instances running in a single Unity scene, dramatically accelerating sample collection.
  • Real-Time Statistics Tracking — A StatsManager singleton monitors per-episode outcomes (success, wall collision, timeout) and broadcasts live success rate to the Unity console.
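
The statistics tracking in the last bullet can be illustrated with a minimal sketch. It is written in Python rather than the project's C#, and the names `Outcome` and `StatsTracker` are hypothetical; the actual StatsManager may be structured differently.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """The three per-episode outcomes named in the feature list."""
    SUCCESS = "success"
    WALL_COLLISION = "wall_collision"
    TIMEOUT = "timeout"


class StatsTracker:
    """Counts episode outcomes and reports the running success rate."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, outcome: Outcome) -> None:
        # Called once at the end of every episode.
        self.counts[outcome] += 1

    @property
    def episodes(self) -> int:
        return sum(self.counts.values())

    @property
    def success_rate(self) -> float:
        # Defined as 0.0 before any episode has finished.
        if self.episodes == 0:
            return 0.0
        return self.counts[Outcome.SUCCESS] / self.episodes
```

In Unity, the equivalent object would typically be a singleton that logs `success_rate` to the console after each `record` call.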

System Architecture

┌─────────────────────────────────────────────────────┐
│                   AircraftAgent                     │
│                                                     │
│   ┌─────────────┐       ┌──────────────────────┐   │
│   │  A* Planner │◄─────►│  PathFinder / Grid   │   │
│   │  (Armed)    │       │  Manager             │   │
│   └──────┬──────┘       └──────────────────────┘   │
│          │                                          │
│          │  [missile detected]                      │
│          ▼                                          │
│   ┌─────────────┐       ┌──────────────────────┐   │
│   │  DQN Policy │◄─────►│  ML-Agents Inference │   │
│   │  (DQNActive)│       │  (ONNX model)        │   │
│   └──────┬──────┘       └──────────────────────┘   │
│          │                                          │
│          │  [missile out of range]                  │
│          └──────────────► A* resumes               │
│                                                     │
│   ┌───────────────────────────────────┐            │
│   │  PID Controller (yaw axis)        │            │
│   │  Kp=0.05  Ki=0  Kd=0.002         │            │
│   └───────────────────────────────────┘            │
└─────────────────────────────────────────────────────┘
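
The switching logic in the diagram can be sketched as a two-state machine. This is an illustrative Python sketch, not the project's C# code: the function name `next_mode` is hypothetical, and it assumes that a single detection radius around the launcher governs both the switch to the evasion policy and the return to A*.

```python
import math
from enum import Enum, auto


class Mode(Enum):
    ARMED = auto()       # A* waypoint following
    DQN_ACTIVE = auto()  # learned evasion policy


def next_mode(agent_xz, launcher_xz, detection_radius):
    """Return the control mode for this step from the agent-launcher distance."""
    dx = agent_xz[0] - launcher_xz[0]
    dz = agent_xz[1] - launcher_xz[1]
    in_range = math.hypot(dx, dz) <= detection_radius
    # Inside the radius the learned policy drives; outside, A* resumes.
    return Mode.DQN_ACTIVE if in_range else Mode.ARMED
```

Because the mode is recomputed every step from the current positions, the agent can change policy mid-episode without a reset.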

Environment Design

Top-Down Layout

Figures: Training Environment Layout; Full Map Overview

The environment is divided into three zones:

  • Bottom — Aircraft spawn region (randomized X position, low Z)
  • Middle — Missile/rocket launcher with a configurable detection radius (white wireframe sphere)
  • Top — Target zone (green platform); reaching it yields +100 reward

Purple walls form a procedural obstacle maze the aircraft must navigate. A wall collision yields a reward of −1 − dist/10.

A* Grid

Figure: A* Grid, Perspective View

The terrain is discretized into a binary occupancy grid. Each cell stores a 0 (free) or 1 (obstacle). The PathFinder queries this grid to compute the optimal waypoint sequence from the agent's current position to the target.

Figure: A* Grid, Close-Up with Binary Values

Purple highlighted cells are obstacle-occupied nodes. The red and blue markers indicate the start and goal positions passed into the A* search at each step.
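
The discretization described in this section can be sketched as follows. This is a Python illustration rather than the project's C#; `world_to_cell`, `build_grid`, the grid origin, and the cell size are all hypothetical names and parameters chosen for the example.

```python
import math


def world_to_cell(x, z, origin_x, origin_z, cell_size):
    """Map a world-space XZ position to integer (row, col) grid indices."""
    col = math.floor((x - origin_x) / cell_size)
    row = math.floor((z - origin_z) / cell_size)
    return row, col


def build_grid(rows, cols, obstacle_cells):
    """Return a rows x cols occupancy grid: 0 = free, 1 = obstacle."""
    grid = [[0] * cols for _ in range(rows)]
    for row, col in obstacle_cells:
        grid[row][col] = 1
    return grid
```

For example, `build_grid(2, 3, [(0, 1), (1, 2)])` marks two cells as walls and leaves the other four free.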


Algorithms

A* Path Planning

A* runs on the discrete 2D grid projected onto the XZ plane. The PathFinder singleton exposes a GetNextPoint(from, to, out nextWaypoint) API that the agent calls inside a coroutine loop. The aircraft follows each waypoint using physics forces, with the PID controller correcting heading error each FixedUpdate.

The heuristic is standard Euclidean distance, and the grid is rebuilt whenever obstacles move, keeping the path current throughout the episode.
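
A minimal sketch of A* with a Euclidean heuristic on a binary occupancy grid is shown below. It is written in Python rather than the project's C#, assumes 4-connected movement with unit step cost (the README does not state the connectivity), and uses the hypothetical `get_next_point` as a loose analogue of GetNextPoint.

```python
import heapq
import math


def astar(grid, start, goal):
    """Return the list of (row, col) cells from start to goal, or [] if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Euclidean distance to the goal, measured in cells.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    open_heap = [(h(start), start)]
    came_from = {}
    g_cost = {start: 0.0}
    closed = set()
    while open_heap:
        _, current = heapq.heappop(open_heap)
        if current == goal:
            # Walk the parent links back to the start.
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if current in closed:
            continue
        closed.add(current)
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue  # outside the grid or blocked by an obstacle
            tentative = g_cost[current] + 1.0
            if tentative < g_cost.get((nr, nc), math.inf):
                g_cost[(nr, nc)] = tentative
                came_from[(nr, nc)] = current
                heapq.heappush(open_heap, (tentative + h((nr, nc)), (nr, nc)))
    return []


def get_next_point(grid, start, goal):
    """Return the next waypoint cell after start, or None if there is none."""
    path = astar(grid, start, goal)
    return path[1] if len(path) > 1 else None
```

Replanning from the agent's current cell on every call, as `get_next_point` does, is one simple way to keep the path valid after the grid is rebuilt.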

Deep Q-Network (DQN) — "DodgeMissile"

When RocketLauncherBehavior detects the aircraft within its detection radius, it calls SwitchBehavior(BehaviorType.DQN), which sets the ML-Agents BehaviorType to InferenceOnly and activates the trained ONNX model.

Observation Space (14-dimensional):

Feature                            Dimensions
Agent local position               3
Agent local rotation (Euler)       3
Target local position              3
Obstacle (missile) local position  3
Distance to target                 1
Distance to obstacle               1
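
The observation layout listed above can be assembled as in this sketch. It is Python rather than the project's C#, the function name `collect_observations` is hypothetical, and it assumes both distances are Euclidean distances measured from the agent's local position.

```python
import math


def collect_observations(agent_pos, agent_euler, target_pos, missile_pos):
    """Assemble the 14-value observation vector in the listed order."""
    obs = []
    obs.extend(agent_pos)      # 3: agent local position
    obs.extend(agent_euler)    # 3: agent local rotation (Euler angles)
    obs.extend(target_pos)     # 3: target local position
    obs.extend(missile_pos)    # 3: obstacle (missile) local position
    obs.append(math.dist(agent_pos, target_pos))   # 1: distance to target
    obs.append(math.dist(agent_pos, missile_pos))  # 1: distance to obstacle
    return obs
```

The component counts sum to 3 + 3 + 3 + 3 + 1 + 1 = 14, matching the stated dimensionality.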

Action Space:

  • 2 continuous actions: (ΔX, ΔZ) — lateral and forward movement deltas.
  • ΔZ is clamped to [0.1, 1.0] to prevent backward flight.
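
The forward-motion clamp can be sketched in a few lines. This is a Python illustration with a hypothetical function name, not the project's C# code.

```python
def apply_action(action, dz_min=0.1, dz_max=1.0):
    """Return (dx, dz) with the forward delta clamped to [dz_min, dz_max]."""
    dx, dz = action
    # A lower bound above zero means the aircraft always moves forward.
    dz = max(dz_min, min(dz_max, dz))
    return dx, dz
```

For instance, a policy output of `(-0.4, -1.0)` becomes `(-0.4, 0.1)`: the lateral delta passes through unchanged while the backward request is replaced by the minimum forward step.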

Reward Structure:

Event              Reward
Reaching target    +100
Wall collision     −1 − dist/10
Timeout (>200s)    −1 − dist/100
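
The terminal rewards above can be written as a single function. This Python sketch uses hypothetical event names, and because the README does not define `dist`, it is treated here simply as an input (plausibly the remaining distance to the target, but that is an assumption).

```python
def terminal_reward(event, dist):
    """Return the end-of-episode reward for the given event."""
    if event == "target":
        return 100.0
    if event == "wall":
        return -1.0 - dist / 10.0
    if event == "timeout":
        return -1.0 - dist / 100.0
    raise ValueError(f"unknown event: {event!r}")
```

With `dist = 20`, a wall collision gives −3.0 while a timeout gives −1.2, so at equal distance a collision is penalized more heavily than running out of time.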

PID Controller

A custom PID class handles yaw stabilization. At each FixedUpdate, the signed angular error between the aircraft's current heading and the target waypoint direction is fed into the controller:

output = Kp·e + Ki·∫e·dt + Kd·(de/dt)

The output is applied as a torque about the Y-axis. The derivative term damps the oscillation that proportional-only steering produces at large angular errors. Tuned values: Kp = 0.05, Ki = 0, Kd = 0.002. With Ki = 0 the integral term is inactive, so the controller effectively behaves as a PD controller.
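
A discrete-time version of the formula above might look like the following sketch. It is Python rather than the project's C#, the class and helper are illustrative rather than a copy of PID.cs, and the derivative is approximated by a backward difference with a zero derivative on the first call (an assumption).

```python
class PID:
    """Discrete PID: output = Kp*e + Ki*sum(e*dt) + Kd*(e - e_prev)/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        self.integral += error * dt
        # No previous sample on the first call, so the derivative is taken as 0.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def signed_angle_deg(current_heading, target_heading):
    """Signed heading error (target - current) wrapped into [-180, 180) degrees."""
    return (target_heading - current_heading + 180.0) % 360.0 - 180.0
```

Each physics step would compute `signed_angle_deg(...)`, pass it to `update` with the fixed timestep, and apply the result as yaw torque. Wrapping the error keeps the aircraft turning the short way round, for example +20° rather than −340° when going from a heading of 350° to 10°.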


Training

Parallel Environment Setup

Figure: 24 Parallel Training Environments in Unity

Training used 24 concurrent environment instances within a single Unity scene, with each sub-environment containing a full independent copy of the agent, target, rocket launcher, and walls. Red platforms indicate in-progress episodes; green indicates a successful target reach. Running the instances in parallel raises sample throughput and exposes the policy to many random spawn configurations at once, which helps the learned evasion behavior generalize.

Results

Figure: Training Metrics (TensorBoard)

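Training ran for 5 million steps using PPO (Proximal Policy Optimization) via the ML-Agents trainer. The ML-Agents trainer does not ship a DQN implementation, so the evasion policy labeled "DQN" in this README appears to be the PPO-trained DodgeMissile model.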

Metric             Trend
Cumulative Reward  Rises from ~20 → converges near 100
Episode Length     Drops sharply after 1M steps, stabilizes ~20
Policy Loss        Oscillates within a tight band (~0.118–0.125)
Value Loss         Drops dramatically from ~650 → ~75

The sharp episode-length reduction corresponds to the agent learning to reach the target quickly rather than exploring, which is consistent with the cumulative reward converging near the +100 target reward.


Tech Stack

Layer              Technology
Simulation Engine  Unity 2022.3.11f1
RL Framework       Unity ML-Agents (PPO, DQN inference)
Language           C#
Training Monitor   TensorBoard
Model Format       ONNX (runtime inference)
Rendering          Universal Render Pipeline (URP)

Getting Started

Requirements:

  • Unity Editor 2022.3.11f1
  • Unity ML-Agents package (see Packages/manifest.json)

Run the project:

  1. Clone the repository and open the project in Unity 2022.3.11f1.
  2. Open the AStar&DQN scene — this is the final scene containing both path planning and DQN inference.
  3. Press Play. The agent will navigate using A* and switch to the trained DQN model upon missile detection.

Training from scratch:

  1. Install the ML-Agents Python package: pip install mlagents
  2. Open the TrainingScene.
  3. Run: mlagents-learn config/DodgeMissile.yaml --run-id=run1
  4. Press Play in the Unity Editor.

Note: A* uses physics forces for locomotion; DQN uses direct translation. This is an intentional design split that reflects each method's control requirements.


Project Structure

RF Project V0.5/
├── Assets/
│   ├── Scripts/
│   │   ├── AircraftAgent.cs        # Core agent — A*/DQN switching, observations, rewards
│   │   ├── RocketLauncherBehavior.cs  # Threat detection and missile launch logic
│   │   ├── PID.cs                  # PID controller for yaw stabilization
│   │   └── StatsManager.cs         # Episode statistics tracker (success rate)
│   ├── Materials/                  # URP materials for reward visualization
│   └── Settings/                   # URP renderer and pipeline settings
├── Figures/                        # Visualizations used in the research paper
├── Packages/                       # Unity package manifest
├── ProjectSettings/
└── Mastering_Fighter_Jet_Survival_Tactics_through_RL_Driven_Path_Planning.pdf

Research Paper

This project is documented in a full research report covering problem formulation, related work, methodology, experimental setup, and results analysis.

Read the paper (PDF)


Built with Unity ML-Agents · Reinforcement Learning · Classical AI
