Differential dynamic programming

Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne^[1] and subsequently analysed in Jacobson and Mayne's eponymous book.^[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.^[3]^[4]

Finite-horizon discrete-time problems

The dynamics

𝐱_{i + 1} = 𝐟 (𝐱_{i}, 𝐮_{i}) \qquad (1)

describe the evolution of the state $𝐱$ given the control $𝐮$ from time $i$ to time $i + 1$. The total cost $J_{0}$ is the sum of running costs $ℓ$ and final cost $ℓ_{f}$, incurred when starting from state $𝐱$ and applying the control sequence $𝐔 \equiv \{𝐮_{0}, 𝐮_{1}, \dots, 𝐮_{N - 1}\}$ until the horizon is reached:

J_{0} (𝐱, 𝐔) = \sum_{i = 0}^{N - 1} ℓ (𝐱_{i}, 𝐮_{i}) + ℓ_{f} (𝐱_{N}),

where $𝐱_{0} \equiv 𝐱$, and the $𝐱_{i}$ for $i > 0$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence $𝐔^{*} (𝐱) \equiv \operatorname{argmin}_{𝐔} J_{0} (𝐱, 𝐔)$. Trajectory optimization means finding $𝐔^{*} (𝐱)$ for a particular $𝐱_{0}$, rather than for all possible initial states.
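The definitions above can be illustrated with a minimal Python sketch. It assumes a scalar state and control, and the particular dynamics and cost functions are hypothetical choices made only for illustration; they do not come from the article.

```python
def rollout(f, x0, controls):
    """Apply Eq. 1 repeatedly and return the states [x_0, x_1, ..., x_N]."""
    states = [x0]
    for u in controls:
        states.append(f(states[-1], u))
    return states


def total_cost(f, running_cost, final_cost, x0, controls):
    """J_0(x, U): sum of the running costs plus the final cost at x_N."""
    states = rollout(f, x0, controls)
    # zip stops after N pairs, so x_N only enters through the final cost.
    running = sum(running_cost(x, u) for x, u in zip(states, controls))
    return running + final_cost(states[-1])


# Hypothetical scalar problem (an assumption, not taken from the article).
f = lambda x, u: x + 0.5 * u
running_cost = lambda x, u: x * x + u * u
final_cost = lambda x: 10.0 * x * x

states = rollout(f, 1.0, [-1.0, 0.0])                            # [1.0, 0.5, 0.5]
J0 = total_cost(f, running_cost, final_cost, 1.0, [-1.0, 0.0])   # 4.75
```

Here the horizon is $N = 2$: the running costs contribute $2 + 0.25$ and the final cost contributes $2.5$.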

Dynamic programming

Let $𝐔_{i}$ be the partial control sequence $𝐔_{i} \equiv \{𝐮_{i}, 𝐮_{i + 1}, \dots, 𝐮_{N - 1}\}$ and define the cost-to-go $J_{i}$ as the partial sum of costs from $i$ to $N$:

J_{i} (𝐱, 𝐔_{i}) = \sum_{j = i}^{N - 1} ℓ (𝐱_{j}, 𝐮_{j}) + ℓ_{f} (𝐱_{N}) .

The optimal cost-to-go or value function at time $i$ is the cost-to-go given the minimizing control sequence:

V (𝐱, i) \equiv \min_{𝐔_{i}} J_{i} (𝐱, 𝐔_{i}) .

Setting $V (𝐱, N) \equiv ℓ_{f} (𝐱_{N})$ , the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

V (𝐱, i) = \min_{𝐮} [ℓ (𝐱, 𝐮) + V (𝐟 (𝐱, 𝐮), i + 1)] . \qquad (2)

This is the Bellman equation.
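As a worked illustration of the backward recursion, consider a scalar linear-quadratic problem, for which the minimization in the Bellman equation has a closed form. The sketch below assumes dynamics $f(x, u) = a x + b u$, running cost $q x^{2} + r u^{2}$ and final cost $q_{f} x^{2}$; these are hypothetical choices, not part of the article. Under these assumptions the value function stays quadratic, $V(x, i) = p_{i} x^{2}$, and each step minimizes over a single control.

```python
def bellman_backward(a, b, q, r, q_f, N):
    """Scalar linear-quadratic Bellman recursion.

    Returns the coefficients p[i] of V(x, i) = p[i] * x**2 and the
    feedback gains g[i] of the minimizing control u = g[i] * x.
    """
    p = [0.0] * (N + 1)
    gains = [0.0] * N
    p[N] = q_f  # V(x, N) is the final cost
    for i in range(N - 1, -1, -1):
        p_next = p[i + 1]
        # Setting d/du [q x^2 + r u^2 + p_next (a x + b u)^2] = 0 gives u = g x.
        g = -(p_next * a * b) / (r + p_next * b * b)
        gains[i] = g
        # Substitute u = g x back into the bracket of the Bellman equation.
        p[i] = q + r * g * g + p_next * (a + b * g) ** 2
    return p, gains


# Hypothetical parameters, chosen only for illustration.
p, gains = bellman_backward(a=1.0, b=0.5, q=1.0, r=1.0, q_f=10.0, N=3)
```

Applying the controls $u_{i} = g_{i} x_{i}$ from any initial state $x_{0}$ then incurs the total cost $p_{0} x_{0}^{2}$.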

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

ℓ (𝐱, 𝐮) + V (𝐟 (𝐱, 𝐮), i + 1)

is the argument of the $\min [\cdot]$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(𝐱, 𝐮)$ pair:

\begin{aligned} Q (δ 𝐱, δ 𝐮) \equiv {} & ℓ (𝐱 + δ 𝐱, 𝐮 + δ 𝐮) + V (𝐟 (𝐱 + δ 𝐱, 𝐮 + δ 𝐮), i + 1) \\ & - ℓ (𝐱, 𝐮) - V (𝐟 (𝐱, 𝐮), i + 1) \end{aligned}

and expand to second order

\approx \frac{1}{2} \begin{bmatrix} 1 \\ δ 𝐱 \\ δ 𝐮 \end{bmatrix}^{𝖳} \begin{bmatrix} 0 & Q_{𝐱}^{𝖳} & Q_{𝐮}^{𝖳} \\ Q_{𝐱} & Q_{𝐱 𝐱} & Q_{𝐱 𝐮} \\ Q_{𝐮} & Q_{𝐮 𝐱} & Q_{𝐮 𝐮} \end{bmatrix} \begin{bmatrix} 1 \\ δ 𝐱 \\ δ 𝐮 \end{bmatrix} \qquad (3)

The $Q$ notation used here is a variant of the notation of Morimoto where subscripts denote differentiation in denominator layout.^[5] Dropping the index $i$ for readability, and using primes to denote the next time step, $V' \equiv V (i + 1)$, the expansion coefficients are

\begin{aligned} Q_{𝐱} & = ℓ_{𝐱} + 𝐟_{𝐱}^{𝖳} V'_{𝐱} \\ Q_{𝐮} & = ℓ_{𝐮} + 𝐟_{𝐮}^{𝖳} V'_{𝐱} \\ Q_{𝐱 𝐱} & = ℓ_{𝐱 𝐱} + 𝐟_{𝐱}^{𝖳} V'_{𝐱 𝐱} 𝐟_{𝐱} + V'_{𝐱} \cdot 𝐟_{𝐱 𝐱} \\ Q_{𝐮 𝐮} & = ℓ_{𝐮 𝐮} + 𝐟_{𝐮}^{𝖳} V'_{𝐱 𝐱} 𝐟_{𝐮} + V'_{𝐱} \cdot 𝐟_{𝐮 𝐮} \\ Q_{𝐮 𝐱} & = ℓ_{𝐮 𝐱} + 𝐟_{𝐮}^{𝖳} V'_{𝐱 𝐱} 𝐟_{𝐱} + V'_{𝐱} \cdot 𝐟_{𝐮 𝐱} . \end{aligned}

The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation (3) with respect to $δ 𝐮$ we have

δ 𝐮^{*} = \underset{δ 𝐮}{\operatorname{argmin}} \, Q (δ 𝐱, δ 𝐮) = - Q_{𝐮 𝐮}^{- 1} (Q_{𝐮} + Q_{𝐮 𝐱} δ 𝐱), \qquad (4)

giving an open-loop term $𝐤 = - Q_{𝐮 𝐮}^{- 1} Q_{𝐮}$ and a feedback gain term $𝐊 = - Q_{𝐮 𝐮}^{- 1} Q_{𝐮 𝐱}$ . Plugging the result back into (3), we now have a quadratic model of the value at time $i$ :

\begin{aligned} Δ V (i) & = - \frac{1}{2} Q_{𝐮}^{𝖳} Q_{𝐮 𝐮}^{- 1} Q_{𝐮} \\ V_{𝐱} (i) & = Q_{𝐱} - Q_{𝐱 𝐮} Q_{𝐮 𝐮}^{- 1} Q_{𝐮} \\ V_{𝐱 𝐱} (i) & = Q_{𝐱 𝐱} - Q_{𝐱 𝐮} Q_{𝐮 𝐮}^{- 1} Q_{𝐮 𝐱} . \end{aligned}
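A single step of this recursion can be sketched in Python for the scalar case, where transposes and matrix inverses reduce to ordinary products and divisions. The function name `backward_step` and its argument names are hypothetical; the derivative values are assumed to be supplied by the caller, evaluated at the nominal $(𝐱, 𝐮)$ pair.

```python
def backward_step(lx, lu, lxx, luu, lux,
                  fx, fu, fxx, fuu, fux,
                  Vx_next, Vxx_next):
    """One scalar backward-pass step: expansion coefficients, gains, value model."""
    # Expansion coefficients of Q (scalar versions of the formulas above).
    Qx = lx + fx * Vx_next
    Qu = lu + fu * Vx_next
    Qxx = lxx + fx * Vxx_next * fx + Vx_next * fxx
    Quu = luu + fu * Vxx_next * fu + Vx_next * fuu
    Qux = lux + fu * Vxx_next * fx + Vx_next * fux
    # Open-loop term and feedback gain from Eq. 4.
    k = -Qu / Quu
    K = -Qux / Quu
    # Quadratic model of the value at the current time step.
    dV = -0.5 * Qu * Qu / Quu
    Vx = Qx - Qux * Qu / Quu
    Vxx = Qxx - Qux * Qux / Quu
    return k, K, dV, Vx, Vxx


# Hypothetical example: f = x + 0.5 u, l = x^2 + u^2, V' = 10 x^2,
# expanded around x = 2, u = 0 (so the next state is also 2).
k, K, dV, Vx, Vxx = backward_step(
    lx=4.0, lu=0.0, lxx=2.0, luu=2.0, lux=0.0,
    fx=1.0, fu=0.5, fxx=0.0, fuu=0.0, fux=0.0,
    Vx_next=40.0, Vxx_next=20.0,
)
```

In this example $Q_{𝐮 𝐮} = 7$ and $Q_{𝐮} = 20$, so the open-loop term is $𝐤 = -20/7$ and the predicted cost change is $Δ V = -200/7$. The sketch divides by $Q_{𝐮 𝐮}$ without checking its sign, so it is only meaningful when $Q_{𝐮 𝐮} > 0$.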

Recursively computing the local quadratic models of $V (i)$ and the control modifications $\{𝐤 (i), 𝐊 (i)\}$, from $i = N - 1$ down to $i = 0$, constitutes the backward pass. As above, the value function is initialized with $V (𝐱, N) \equiv ℓ_{f} (𝐱_{N})$. Once the backward pass is completed, a forward pass computes a new trajectory:

\begin{aligned} \hat{𝐱} (0) & = 𝐱 (0) \\ \hat{𝐮} (i) & = 𝐮 (i) + 𝐤 (i) + 𝐊 (i) (\hat{𝐱} (i) - 𝐱 (i)) \\ \hat{𝐱} (i + 1) & = 𝐟 (\hat{𝐱} (i), \hat{𝐮} (i)) \end{aligned}

The backward passes and forward passes are iterated until convergence. If the Hessians $Q_{𝐱 𝐱}, Q_{𝐮 𝐮}, Q_{𝐮 𝐱}, Q_{𝐱 𝐮}$ are replaced by their Gauss-Newton approximation, the method reduces to the iterative Linear Quadratic Regulator (iLQR).^[6]
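The iteration of backward and forward passes can be sketched as follows for a scalar state and control. This is a minimal illustration under stated assumptions rather than a reference implementation: the function `ddp`, the derivative callbacks and the example problem are all hypothetical, full steps are always taken, and a fixed iteration count stands in for a convergence test.

```python
def ddp(x0, controls, f, f_derivs, l_derivs, lf_derivs, iterations=10):
    """Scalar DDP sketch: full steps, no regularization, no line search."""
    N = len(controls)
    us = list(controls)
    xs = [x0]
    for u in us:                      # nominal trajectory from Eq. 1
        xs.append(f(xs[-1], u))
    for _ in range(iterations):
        # Backward pass: start from the derivatives of the final cost.
        Vx, Vxx = lf_derivs(xs[N])
        k, K = [0.0] * N, [0.0] * N
        for i in range(N - 1, -1, -1):
            fx, fu, fxx, fuu, fux = f_derivs(xs[i], us[i])
            lx, lu, lxx, luu, lux = l_derivs(xs[i], us[i])
            Qx = lx + fx * Vx
            Qu = lu + fu * Vx
            Qxx = lxx + fx * Vxx * fx + Vx * fxx
            Quu = luu + fu * Vxx * fu + Vx * fuu
            Qux = lux + fu * Vxx * fx + Vx * fux
            k[i] = -Qu / Quu          # assumes Quu > 0
            K[i] = -Qux / Quu
            Vx = Qx - Qux * Qu / Quu
            Vxx = Qxx - Qux * Qux / Quu
        # Forward pass: roll out the updated controls with feedback.
        new_xs, new_us = [x0], []
        for i in range(N):
            u = us[i] + k[i] + K[i] * (new_xs[i] - xs[i])
            new_us.append(u)
            new_xs.append(f(new_xs[i], u))
        xs, us = new_xs, new_us
    return xs, us


# Hypothetical linear-quadratic test problem (not from the article).
f = lambda x, u: x + 0.5 * u
f_derivs = lambda x, u: (1.0, 0.5, 0.0, 0.0, 0.0)        # fx, fu, fxx, fuu, fux
l_derivs = lambda x, u: (2.0 * x, 2.0 * u, 2.0, 2.0, 0.0)  # l = x^2 + u^2
lf_derivs = lambda x: (20.0 * x, 20.0)                   # l_f = 10 x^2

xs, us = ddp(2.0, [0.0, 0.0, 0.0], f, f_derivs, l_derivs, lf_derivs, iterations=1)
```

For a linear-quadratic problem such as this one the quadratic model is exact, so a single iteration already yields the optimal controls. Nonlinear problems generally need several iterations, and the unguarded division by `Quu` can fail when that quantity is not positive.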

Regularization and line-search

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.^[7]^[8] Regularization in the DDP context means ensuring that the $Q_{𝐮 𝐮}$ matrix in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $𝐤$ by some $0 < α < 1$ .
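Both safeguards can be sketched for the scalar case. The helpers below are hypothetical and show only one possible scheme: a shift added to $Q_{𝐮 𝐮}$ and grown until the result exceeds a small positive floor, and a backtracking search that tries the full step first and then smaller values of $α$. The parameter names, the default values and the simple cost-decrease acceptance test are all assumptions.

```python
def regularized_gains(Qu, Quu, Qux, floor=1e-3, mu0=1e-6, growth=10.0):
    """Shift Quu until it is at least `floor`, then compute the gains of Eq. 4."""
    shift = 0.0
    while Quu + shift < floor:
        shift = mu0 if shift == 0.0 else shift * growth
    Quu_reg = Quu + shift
    return -Qu / Quu_reg, -Qux / Quu_reg, shift


def forward_pass(f, xs, us, k, K, alpha):
    """Forward pass with the open-loop term k scaled by alpha."""
    new_xs, new_us = [xs[0]], []
    for i in range(len(us)):
        u = us[i] + alpha * k[i] + K[i] * (new_xs[i] - xs[i])
        new_us.append(u)
        new_xs.append(f(new_xs[i], u))
    return new_xs, new_us


def line_search(f, cost, xs, us, k, K, alphas=(1.0, 0.5, 0.25, 0.125)):
    """Return the first candidate trajectory that lowers the total cost."""
    old_cost = cost(xs, us)
    for alpha in alphas:
        new_xs, new_us = forward_pass(f, xs, us, k, K, alpha)
        if cost(new_xs, new_us) < old_cost:
            return new_xs, new_us, alpha
    return xs, us, 0.0  # no improvement found: keep the nominal trajectory


# Hypothetical one-step example in which the full step overshoots.
f = lambda x, u: x + u
cost = lambda xs, us: us[0] ** 4
new_xs, new_us, alpha = line_search(f, cost, [0.0, 1.0], [1.0], k=[-2.5], K=[0.0])
```

In this example the full step moves the control from $1$ to $-1.5$ and raises the cost, while $α = 0.5$ gives $-0.25$ and is accepted. Practical implementations often compare the actual cost reduction with the reduction predicted by $Δ V$ instead of accepting any decrease.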

Monte Carlo version

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.^[9]^[10]^[11] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.

Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.^[12] This creates a link between differential dynamic programming and path integral control,^[13] which is a framework of stochastic optimal control.

Constrained problems

Interior Point Differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.^[14]

External links

The open-source software framework acados provides an efficient and embeddable implementation of DDP.
