Tree-Guided Diffusion Planner

1Department of Computer Science & Engineering, 2Interdisciplinary Program of AI
Seoul National University


Tree-guided Diffusion Planner (TDP) is a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. It addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals.


(1) Parent Branching: diverse parent trajectories are produced via fixed-potential particle guidance [1] to encourage broad exploration.

(2) Sub-Tree Expansion: sub-trajectories are locally refined through fast conditional denoising guided by task objectives.

TDP consistently outperforms state-of-the-art zero-shot planning approaches across a wide range of guidance functions, involving non-convex objectives, non-differentiable constraints, and multi-reward structures.


Method

Parent Branching

Unlike conventional gradient guidance methods that pull samples toward high-reward regions, particle guidance (PG) [1] introduces repulsive forces that push samples apart in the data space. This yields broad coverage of dynamically feasible trajectories, independent of the task objective. A single denoising step for parent branching is denoted as:

$$ \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] \leftarrow \mathbf{SD}(\mathcal{J}, \boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{i})) $$

$$ \boldsymbol{\mu}^{i-1}_{\text{control}} \leftarrow \boldsymbol{\mu}^{i}_{\text{control}} + \alpha_p \Sigma^i \nabla \Phi(\boldsymbol{\mu}^{i}_{\text{control}}), \quad \boldsymbol{\mu}^{i-1}_{\text{obs}} \leftarrow \boldsymbol{\mu}^{i}_{\text{obs}} + \alpha_g \Sigma^i \nabla \mathcal{J}(\boldsymbol{\mu}^{i}_{\text{obs}}) $$

$$ \boldsymbol{\mu}^{i-1} \leftarrow \left[\boldsymbol{\mu}^{i-1}_{\text{control}},\; \boldsymbol{\mu}^{i-1}_{\text{obs}}\right], \qquad \boldsymbol{\tau}^{i-1} \sim \mathcal{N}(\boldsymbol{\mu}^{i-1}, \Sigma^i) $$

where $\boldsymbol{\mu}^i_{\text{control}}$ and $\boldsymbol{\mu}^i_{\text{obs}}$ denote the control and observation components of the predicted mean of the denoising trajectory at timestep $i$, and $(\alpha_p, \alpha_g)$ are the guidance strengths for particle guidance and gradient guidance, respectively. $\mathbf{SD}(\mathcal{J}, \cdot)$ is the state decomposition function that autonomously partitions the state vector into control and observation components based on the test-time task $\mathcal{J}$, while $\Phi(\cdot)$ denotes the radial basis function (RBF) kernel potential in particle guidance.
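For concreteness, one such denoising step can be sketched in NumPy as below. This is a minimal sketch, not the released implementation: `denoiser_mean`, `reward_grad`, the index-based state decomposition, the scalar variance `sigma_i`, and the choice of a negative summed RBF kernel as the repulsive potential $\Phi$ are all illustrative assumptions.

```python
# Minimal sketch of one parent-branching denoising step (illustrative only).
import numpy as np

def rbf_repulsion_grad(mu_control, bandwidth=1.0):
    """Gradient of a negative-sum RBF potential: pushes particles apart."""
    n = mu_control.shape[0]
    flat = mu_control.reshape(n, -1)
    diff = flat[:, None, :] - flat[None, :, :]            # (n, n, d)
    sqdist = (diff ** 2).sum(-1)                          # (n, n)
    k = np.exp(-sqdist / (2.0 * bandwidth ** 2))          # RBF kernel values
    grad = (diff * k[..., None]).sum(axis=1) / bandwidth ** 2
    return grad.reshape(mu_control.shape)

def parent_branching_step(tau, denoiser_mean, reward_grad, control_idx, obs_idx,
                          sigma_i, alpha_p=0.1, alpha_g=0.1, rng=None):
    """One denoising step for a batch of parent trajectories.

    tau:           (n, T, D) noisy trajectories at timestep i
    denoiser_mean: callable tau -> predicted mean mu_theta(tau), shape (n, T, D)
    reward_grad:   callable mu_obs -> gradient of the task objective J
    control_idx / obs_idx: state-decomposition indices along the last axis
    """
    rng = rng or np.random.default_rng(0)
    mu = denoiser_mean(tau)
    mu_control, mu_obs = mu[..., control_idx], mu[..., obs_idx]

    # Particle guidance on the control part: repulsive, encourages diversity.
    mu_control = mu_control + alpha_p * sigma_i * rbf_repulsion_grad(mu_control)
    # Gradient guidance on the observation part: pulls toward high reward.
    mu_obs = mu_obs + alpha_g * sigma_i * reward_grad(mu_obs)

    mu_next = np.empty_like(mu)
    mu_next[..., control_idx], mu_next[..., obs_idx] = mu_control, mu_obs
    # Sample the next (less noisy) batch of trajectories.
    return mu_next + np.sqrt(sigma_i) * rng.standard_normal(mu.shape)
```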

Sub-Tree Expansion

For each parent trajectory, a random branch site is selected, and a child trajectory is generated by denoising from a partially noised version of the parent, refining it with gradient guidance signals. Sub-tree expansion proceeds as:

$$ \boldsymbol{\tau}_{\text{child}}^{N_f} \sim q_{N_f}(\boldsymbol{\tau}_{\text{parent}}, \boldsymbol{C}) \quad \text{where} \;\; \boldsymbol{C}=\{ \boldsymbol{s}_k \}_{k=0}^{b} \;\; \text{and} \;\; b\sim \mathrm{Uniform}\left(0, T_{\text{pred}}\right) $$

where $\boldsymbol{C}$ denotes the parent trajectory prefix, $q_{N_f}$ is the partial forward noising distribution with $N_f$ denoising steps, and $\boldsymbol{\tau}_{\text{child}}^{N_f}$ is the partially noised trajectory from which the child trajectory is denoised during sub-tree expansion.
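A minimal sketch of this step follows, assuming the prefix $\boldsymbol{C}$ is enforced inpainting-style by re-imposing it after each denoising step; `guided_denoise_step` and the cumulative noise schedule `alphas_cumprod` are hypothetical placeholders standing in for the pretrained diffusion model and its schedule.

```python
# Illustrative sketch of sub-tree expansion for a single parent trajectory.
import numpy as np

def expand_subtree(tau_parent, guided_denoise_step, alphas_cumprod, n_f, rng=None):
    """Generate a child by partially noising the parent and re-denoising it,
    keeping a random prefix of the parent fixed as the condition C."""
    rng = rng or np.random.default_rng(0)
    T_pred = tau_parent.shape[0]

    # Pick a random branch site b and freeze the parent prefix s_0..s_b.
    b = rng.integers(0, T_pred)
    prefix = tau_parent[: b + 1].copy()

    # Partial forward noising q_{N_f}: jump N_f steps up the noise schedule.
    a_bar = alphas_cumprod[n_f]
    tau_child = (np.sqrt(a_bar) * tau_parent
                 + np.sqrt(1.0 - a_bar) * rng.standard_normal(tau_parent.shape))

    # Fast conditional denoising: only N_f steps, re-imposing the prefix
    # after every step (inpainting-style conditioning on C).
    for i in reversed(range(n_f)):
        tau_child[: b + 1] = prefix
        tau_child = guided_denoise_step(tau_child, i)
    tau_child[: b + 1] = prefix
    return tau_child
```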

The full algorithm of TDP is provided in the Full Algorithm section below.

Maze2D Gold-picking

The Maze2D gold-picking task is a planning problem with a test-time non-differentiable constraint: the agent must generate a feasible trajectory that satisfies an initial state, a final goal state, and an intermediate goal state (the gold position).

Two Gold-picking tasks in Maze2D-Large [3].

Gradient-based guidance typically requires choosing a guidance strength $\alpha$ that balances adherence to the guide signal against trajectory fidelity. However, $\alpha$ is highly task-dependent, and exhaustively tuning it for each task introduces significant overhead during evaluation. On the Maze2D gold-picking task, the MCSS (Monte-Carlo Sampling with Selection) baseline exhibits $\alpha$-dependent performance, whereas TDP remains robust across varying values of the guidance strength $\alpha$. $\alpha_0$ denotes the guidance strength used in the main paper.


Pick-and-Where-to-Place ($\texttt{PnWP}$)

$\texttt{PnWP}$ with Kuka robot arm [4].

We introduce a non-convex exploration task in a robot-arm manipulation environment. The agent must infer a suitable placement location from the reward distribution and plan pick-and-place actions. Since $x^*_{local}$ has a wide peak and $x^*_{global}$ has a narrow peak, agents easily get stuck in the local optimum unless the planner sufficiently explores the trajectory space. Mono-level guided sampling methods (i.e., TAT [2], MCSS) tend to converge to the local optimum, often stacking all blocks at $x^*_{local}$.
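To make the reward structure concrete, here is a toy 1-D placement reward with a wide, lower peak at $x^*_{local}$ and a narrow, higher peak at $x^*_{global}$; all constants are invented for illustration and are not the task's actual parameters.

```python
# Toy non-convex placement reward: a wide local peak and a narrow global peak.
import numpy as np

def placement_reward(x, x_local=-0.5, x_global=0.7,
                     h_local=0.6, h_global=1.0,
                     w_local=0.30, w_global=0.05):
    local_peak = h_local * np.exp(-((x - x_local) ** 2) / (2 * w_local ** 2))
    global_peak = h_global * np.exp(-((x - x_global) ** 2) / (2 * w_global ** 2))
    return local_peak + global_peak

# A gradient-only planner started near x_local climbs the wide peak and stops;
# reaching x_global requires exploring away from the local basin.
xs = np.linspace(-1.5, 1.5, 601)
print("argmax:", xs[np.argmax(placement_reward(xs))])   # close to x_global
```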

TDP is a bi-level search framework.

TAT
(Highest-weighted trajectory)
MCSS
(Highest-scoring trajectory)
TDP

AntMaze Multi-goal Exploration

Multi-goal Exploration in AntMaze-Large [3].

We introduce a multi-reward exploration task in the AntMaze environment. The diffusion planner predicts the next 64 steps (highlighted in bright on the map) using a combined Gaussian reward signal derived from multiple goals. Goals must be visited in descending order of priority, with higher-priority goals represented by stronger and narrower Gaussians.

For example, as illustrated in the figure above, the first goal the agent visits is $g_2$ at $t = t_3$. If the agent subsequently visits $g_1$, $g_4$, and $g_3$ after $t=t_3$, it reaches all four goals ($g_2 \rightarrow g_1 \rightarrow g_4 \rightarrow g_3$). However, two precedence rules ($g_1 \rightarrow g_2$ and $g_3 \rightarrow g_4$) are violated, since $g_2$ is reached before $g_1$ and $g_4$ before $g_3$, while the remaining four rules ($g_1 \rightarrow g_3$, $g_1 \rightarrow g_4$, $g_2 \rightarrow g_3$, and $g_2 \rightarrow g_4$) are satisfied. In this scenario, the agent achieves a goal completion score of 4/4 but only 4/6 priority sequence match accuracy. The maximum accuracy of 6/6 is obtained only when all goals are visited in the correct prioritized order (i.e., $g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$).
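A small sketch of how the two scores in this example can be computed, assuming the priority sequence match accuracy is the fraction of the six pairwise precedence rules that are satisfied (the paper's exact scoring code may differ):

```python
# Goal completion and priority sequence match accuracy for a visit order.
# Goal names are ordered by priority, g1 highest.
from itertools import combinations

def completion_and_precedence(visit_order, goals=("g1", "g2", "g3", "g4")):
    visited = [g for g in visit_order if g in goals]
    completion = len(set(visited)) / len(goals)

    rules = list(combinations(goals, 2))          # (g1,g2), (g1,g3), ... 6 rules
    pos = {g: i for i, g in enumerate(visited)}
    satisfied = sum(
        1 for hi, lo in rules
        if hi in pos and lo in pos and pos[hi] < pos[lo]
    )
    return completion, satisfied / len(rules)

# The worked example: visits g2, g1, g4, g3 -> 4/4 goals, 4/6 precedence rules.
print(completion_and_precedence(["g2", "g1", "g4", "g3"]))   # (1.0, 0.666...)
```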

TDP achieves more goals, with higher sequence accuracy.

MCSS
(X)
MCSS
($g_2 \rightarrow g_3 \rightarrow $ X)
TDP
($g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$)

TDP completes tasks in fewer timesteps.

MCSS
($g_1 \rightarrow g_4 \rightarrow g_3 \rightarrow g_2$, slow)
TDP
($g_1 \rightarrow g_4 \rightarrow g_3 \rightarrow g_2$, fast)

In these videos, COMPLETE indicates that the agent has visited all four goals, while SUCCESS requires visiting all four goals as well as reaching them in the correct prioritized order.

Full Algorithm

References

[1]: Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi S. Jaakkola. Particle guidance: non-i.i.d. diverse sampling with diffusion models, 2024. URL https://arxiv.org/abs/2310.13102.

[2]: Lang Feng, Pengjie Gu, Bo An, and Gang Pan. Resisting stochastic risks in diffusion planners with the trajectory aggregation tree, 2024. URL https://arxiv.org/abs/2405.17879.

[3]: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219.

[4]: Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. PDDLStream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning, 2020. URL https://arxiv.org/abs/1802.08705.