Deep Reinforcement Learning for Sim-to-Real Policy Transfer of VTOL-UAVs Offshore Docking Operations

Ali M. Ali, Aryaman Gupta, Hashim A. Hashim — 2024 · arXiv

Added May 28, 2026

sim-to-real hierarchical-policy domain-randomization UAV fallback-controller safety

TL;DR

Hierarchical DRL (model-based approach phase + PPO landing phase) for autonomous VTOL-UAV docking on wave-disturbed offshore platforms, aimed at sim-to-real transfer.

Summary

Decomposes offshore UAV landing into a model-based approach phase and a DRL landing phase, and uses per-episode JONSWAP-spectrum domain randomization for sim-to-real generalization. PPO outperforms the DQN variants, with 0.327 m/s impact velocity versus 0.820 m/s for DQN, and converges in fewer than 200 episodes.

Key contributions

— Hierarchical decomposition: model-based approach phase + offline-trained PPO landing phase reduces training time and improves success rate.
— JONSWAP spectrum domain randomization per episode as stochastic wave disturbance model for sim-to-real policy generalization.
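
The per-episode wave randomization can be sketched as follows. This is a minimal illustration, not the authors' code: the sampling ranges for significant wave height `hs` and peak period `tp`, the peak-enhancement factor `gamma=3.3`, the frequency grid, and the function names are all assumptions. The elevation is synthesized by a direct random-phase sum of cosines (a discrete inverse Fourier synthesis) rather than an FFT call.

```python
import numpy as np


def jonswap_spectrum(omega, hs, tp, gamma=3.3):
    """One-sided JONSWAP spectral density S(omega) in m^2*s/rad.

    omega: angular frequencies in rad/s (must be > 0)
    hs:    significant wave height in m
    tp:    peak period in s
    """
    omega = np.asarray(omega, dtype=np.float64)
    omega_p = 2.0 * np.pi / tp
    # Spectral width differs on either side of the peak.
    sigma = np.where(omega <= omega_p, 0.07, 0.09)
    # Pierson-Moskowitz base spectrum written in terms of hs and omega_p.
    pm = (5.0 / 16.0) * hs**2 * omega_p**4 * omega ** (-5.0) \
        * np.exp(-1.25 * (omega / omega_p) ** (-4.0))
    # Peak enhancement and the usual approximate normalizing factor.
    peak = gamma ** np.exp(-0.5 * ((omega - omega_p) / (sigma * omega_p)) ** 2)
    return (1.0 - 0.287 * np.log(gamma)) * pm * peak


def sample_wave_episode(rng, n_steps=600, dt=0.05, n_freq=200):
    """Draw one random sea state and return its deck-height time series.

    The hs/tp ranges below are illustrative assumptions, not paper values.
    """
    hs = rng.uniform(0.5, 2.5)    # assumed range, m
    tp = rng.uniform(5.0, 10.0)   # assumed range, s
    omega = np.linspace(0.2, 3.0, n_freq)
    d_omega = omega[1] - omega[0]
    # Component amplitudes from the spectrum, with independent random phases.
    amp = np.sqrt(2.0 * jonswap_spectrum(omega, hs, tp) * d_omega)
    phase = rng.uniform(0.0, 2.0 * np.pi, n_freq)
    t = np.arange(n_steps) * dt
    eta = (amp[:, None] * np.cos(omega[:, None] * t[None, :] + phase[:, None])).sum(axis=0)
    return t, eta, hs, tp
```

Calling `sample_wave_episode(np.random.default_rng(seed))` at each environment reset would give every training episode a different sea state.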

Novelty

Unlike prior DRL UAV docking work (Hwangbo et al., Koch et al.) that targets static platforms, this paper combines phase decomposition with per-episode JONSWAP wave randomization for moving offshore stations.

Methods

— PPO actor-critic with GAE for continuous thrust control in landing phase
— JONSWAP spectrum + inverse Fourier transform for per-episode randomized wave generation
— DQN, Double DQN, and Dueling DQN with experience replay as discrete-action baselines
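
For reference, generalized advantage estimation (GAE) as commonly paired with PPO can be written in a few lines. This is the standard backward recursion, not code from the paper; `gamma`, `lam`, and the function name `compute_gae` are illustrative assumptions.

```python
import numpy as np


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE(lambda).

    rewards, values, dones: per-step sequences of equal length
    last_value: critic estimate for the state after the final step
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    advantages = np.zeros_like(rewards)
    next_value = last_value
    running = 0.0
    # Walk the rollout backwards, accumulating discounted TD residuals.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Critic regression targets.
    returns = advantages + values
    return advantages, returns
```

With `lam=0` the estimate reduces to the one-step TD error, and with `lam=1` it becomes the discounted Monte Carlo return minus the value baseline.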

Strengths

— DQN vs Double DQN vs Dueling DQN vs PPO comparison with identical network architectures enables fair algorithm benchmarking.
— JONSWAP per-episode randomization is physically grounded domain randomization rather than generic noise injection.

Weaknesses

— No real-world hardware experiments; sim-to-real transfer claim is unvalidated beyond numerical simulation.
— 1D vertical dynamics only (z-axis); full 6-DOF landing with lateral disturbances not addressed.
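
To make the 1D restriction concrete, a point-mass vertical model might look like the sketch below. The paper's exact dynamics, mass, and integrator are not given in these notes, so `step_vertical`, `impact_speed`, and all parameter values here are hypothetical.

```python
G = 9.81  # gravitational acceleration, m/s^2


def step_vertical(z, vz, thrust, mass, dt):
    """Advance a point-mass UAV one step along z (semi-implicit Euler).

    Only altitude z and vertical speed vz are modeled; roll, pitch, yaw,
    and lateral motion are ignored, which is the limitation noted above.
    """
    az = thrust / mass - G
    vz_next = vz + az * dt
    z_next = z + vz_next * dt
    return z_next, vz_next


def impact_speed(vz_uav, vz_deck):
    """Touchdown speed relative to the heaving deck, in m/s."""
    return abs(vz_uav - vz_deck)
```

In such a model the wave only enters through the deck height and deck velocity, so lateral drift and platform tilt cannot affect the outcome.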

Future work

— Incorporate visual feedback from onboard camera into landing phase policy.
— Validate learned policy on physical VTOL-UAV hardware over water.

Key insights

— Phase decomposition (model-based approach + DRL landing) cuts training complexity without sacrificing policy quality.
— Per-episode JONSWAP randomization is presented as sufficient for generalized sim-to-real policy transfer in maritime UAV docking, although the evidence so far is simulation-only.

My thoughts

Initial Reaction

Connections

Implementation Notes

Open Questions

Related papers

Extracted by claude-sonnet-4-6.