Reinforcement Learning for Orbital Transfers at the 2026 AI Winter School (Brown University)

January 9, 2026 6 min read

Hands-on workshop framing orbital transfer as a controlled two-body dynamics problem: Hohmann benchmark, Gymnasium environment, PPO policies, reward shaping, and diagnostics for chatter and delta-v efficiency.

At the 2026 AI Winter School, hosted by the Center for the Fundamental Physics of the Universe at Brown University, I led a 2.5-hour hands-on workshop on reinforcement learning for orbital transfers.

The workshop was not about claiming that RL is the right way to solve a textbook astrodynamics problem. The point was to use a problem with known mechanics and a known analytic solution as a controlled environment for learning the applied RL workflow: define the state, action, reward, and diagnostics; train a policy; compare it against ground truth; then explain the failure modes.

Control Problem

The notebook used nondimensional two-body dynamics: unit gravitational parameter, an initial circular orbit at radius 1, and a target circular orbit at radius 1.6. I call those radii $r_{1}$ and $r_{2}$ below. The model omitted drag, finite-duration thrust, J2 perturbations, third bodies, attitude dynamics, and mass depletion. Control was a tangential impulse applied once per simulation step.

Here the arrowed r denotes the spacecraft position vector; the plain r denotes its scalar radius. The state evolves under central gravity:

\begin{aligned}\frac{d\vec{r}}{dt} &= \vec{v} \\ \frac{d\vec{v}}{dt} &= -\frac{\mu\vec{r}}{r^3} \\ r &= |\vec{r}|\end{aligned}

That stripped-down model is still useful because the relevant orbital invariants are visible. For a circular target orbit, the target specific energy and angular momentum are known:

\begin{aligned}E^* &= -\frac{\mu}{2r_2} \\ L^* &= \sqrt{\mu r_2}\end{aligned}

The RL environment did not need to know the absolute orbital angle. The observation vector used normalized radius, radial velocity, tangential velocity, angular-momentum error, energy error, and previous action. Removing angle makes the policy rotationally symmetric: the same local orbital state should produce the same control decision anywhere around the planet.

Hohmann Benchmark

Before training PPO, the notebook computed the Hohmann transfer. For circular, coplanar orbits with two impulsive burns, the transfer semi-major axis is:

a_T = \frac{r_1 + r_2}{2}

The two burns and transfer time are:

\begin{aligned}\Delta v_1 &= \sqrt{\mu\left(\frac{2}{r_1} - \frac{1}{a_T}\right)} - \sqrt{\frac{\mu}{r_1}} \\ \Delta v_2 &= \sqrt{\frac{\mu}{r_2}} - \sqrt{\mu\left(\frac{2}{r_2} - \frac{1}{a_T}\right)} \\ T &= \pi\sqrt{\frac{a_T^3}{\mu}}\end{aligned}

That closed-form solution gave the workshop a real yardstick: not just “did the agent reach the target,” but how much Δv it spent, how many burns it used, whether it circularized, and whether the trajectory matched the expected burn-coast-burn structure.

Hohmann transfer trajectory: the transfer ellipse touching the inner start orbit and the outer target orbit

The verification plots made the baseline concrete: the trajectory should be an ellipse touching r1 and r2, the radius history should rise from the inner orbit to the outer orbit after the second burn, and cumulative Δv should match the analytic total.

Verification plots for the simulated Hohmann transfer: radius versus time rising to the target, two thrust impulses showing the burn-coast-burn structure, and cumulative delta-v matching the ideal total

One useful teaching detail appeared immediately: even the analytic plan does not land exactly on r2 in a finite-timestep simulator if the second burn fires on the first step at or after the computed transfer time. The total Δv still matches theory, but the final orbit has a small residual timing error. That separates continuous-time theory, numerical integration, and policy behavior before RL enters the discussion.

RL Formulation

The Gymnasium environment held the physics fixed and varied the control interface:

Component	Implementation
Observation	Normalized `r`, `vr`, `vt`, target angular-momentum error, target energy error, previous action
Discrete action	`coast`, full prograde impulse, full retrograde impulse
Continuous action	throttle in `[-1, 1]`, mapped to a signed tangential Δv impulse
Success criteria	tolerances on $\| r - r_{2} \|$ , $\| v_{r} \|$ , and $\| L - L^{*} \|$
Failure criteria	crash/escape radius or episode timeout

The dense reward used a combined energy/angular-momentum error:

\mathrm{err} = \frac{|E - E^*|}{|E^*|} + \frac{|L - L^*|}{|L^*|}

with shaping approximately proportional to:

\mathrm{err}_{previous} - \gamma\,\mathrm{err}_{current}

Then the environment subtracted fuel and ignition/switching penalties, added a one-time success bonus on first entry into the tolerance region, and added a holding reward for staying there. PPO was trained with observation/reward normalization during training, frozen normalization statistics during evaluation, and deterministic policy rollout for diagnostics.

What the Diagnostics Caught

The important lesson was that endpoint success is too weak a metric. A policy can enter the success region and still be a poor transfer.

The notebook compared policies using trajectory, radius history, radial velocity, thrust impulses, cumulative Δv, number of burns or active-thrust steps, closest-to-target statistics, and the mission report against the Hohmann ideal.

The failure modes were instructive:

Discrete control: small fixed impulses can reach the target, but often with many prograde/retrograde corrections. The orbit may satisfy the tolerance band while wasting Δv.
Continuous control: throttle control is more expressive, but it can learn micro-thrusting: almost continuous small corrections that keep the error low while hiding poor fuel efficiency.
Tolerance exploitation: a policy can appear to “beat” the ideal by stopping inside loose tolerances on a slightly elliptical orbit. That is not a better transfer; it is a reminder that the metric defines the game.
Final-state ambiguity: final radius alone is misleading for eccentric orbits. Closest approach, radial-velocity history, angular momentum, and thrust history are needed to interpret what the policy actually learned.

Experiment Loop

The final notebook section let participants edit a ModeConfig and rerun training. The knobs were not decorative; each one changes the control problem:

dv_mag: control authority per step
fuel_cost_penalty: cost of using Δv
ignition_penalty: cost of turning on or changing thrust
reward_shaping_scale: strength of dense energy/angular-momentum shaping
training_timesteps, learning_rate, and ent_coef: PPO optimization and exploration behavior

The practical standard was:

train a policy
inspect trajectory, thrust, and Δv
compare against Hohmann
explain the failure mode
change one parameter or design choice
rerun

That loop is the point of using RL in a problem with analytic ground truth. The goal is not to celebrate a learned policy for reaching the target. The goal is to make the policy’s behavior legible enough that failures are evidence for the next experiment.

Materials

Workshop Recording

Slides

Slides (PDF)

Code

RL for Orbital Transfers Notebook

Event Page

For the full schedule and recordings across all modules:

2026 AI Winter School (Indico)

Category
AI and Machine Learning 8