Unlike conventional gradient guidance methods that pull samples toward high-reward regions, particle guidance (PG) [1] introduces repulsive forces that push samples apart in the data space. This leads to broad coverage of dynamically feasible trajectories independent of task objectives. A single denoising step for parent branching is denoted as:
$$ \left[\boldsymbol{\mu}^{i}_{\text{control}},\; \boldsymbol{\mu}^{i}_{\text{obs}}\right] \leftarrow \mathbf{SD}(\mathcal{J}, \boldsymbol{\mu}_{\theta}(\boldsymbol{\tau}^{i})) $$ $$ \boldsymbol{\mu}^{i-1}_{\text{control}} \leftarrow \boldsymbol{\mu}^{i}_{\text{control}} + \alpha_p \Sigma^i \nabla \Phi(\boldsymbol{\mu}^{i}_{\text{control}}), \quad \boldsymbol{\mu}^{i-1}_{\text{obs}} \leftarrow \boldsymbol{\mu}^{i}_{\text{obs}} + \alpha_g \Sigma^i \nabla \mathcal{J}(\boldsymbol{\mu}^{i}_{\text{obs}}) $$ $$ \boldsymbol{\mu}^{i-1} \leftarrow \left[\boldsymbol{\mu}^{i-1}_{\text{control}},\; \boldsymbol{\mu}^{i-1}_{\text{obs}}\right] $$ $$ \boldsymbol{\tau}^{i-1} \sim \mathcal{N}(\boldsymbol{\mu}^{i-1}, \Sigma^i) $$ where $\boldsymbol{\mu}^i_{\text{control}}$ and $\boldsymbol{\mu}^i_{\text{obs}}$ denote the control and observation components of the predicted mean of the denoising trajectory at timestep $i$, and $(\alpha_p, \alpha_g)$ are the guidance strengths for particle guidance and gradient guidance, respectively. $\mathbf{SD}(\mathcal{J}, \cdot)$ is the state decomposition function that autonomously partitions the state vector into control and observation components based on the test-time task $\mathcal{J}$, while $\Phi(\cdot)$ denotes the radial basis function (RBF) kernel in particle guidance.
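To make the parent-branching update concrete, the following is a minimal PyTorch sketch of a single guided denoising step. The flattened particle shapes, the boolean `control_mask` standing in for $\mathbf{SD}(\mathcal{J},\cdot)$, the RBF bandwidth, and the treatment of $\Sigma^i$ as a scalar are illustrative assumptions, not the actual TDP implementation.

```python
import torch

def rbf_potential(x, bandwidth=1.0):
    # Sum of pairwise RBF kernel values between particles; its gradient gives the
    # particle-guidance direction (the repulsion sign convention is assumed to be
    # absorbed into Phi / alpha_p, following the update above).
    diff = x.unsqueeze(0) - x.unsqueeze(1)   # (n, n, d)
    sq_dists = diff.pow(2).sum(-1)           # (n, n)
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2)).sum()

def parent_branching_step(mu, sigma, control_mask, J, alpha_p=0.1, alpha_g=0.1):
    """One guided denoising step for parent branching (illustrative shapes).

    mu:           (n_particles, dim) predicted mean mu_theta(tau^i)
    sigma:        scalar variance Sigma^i at the current denoising step
    control_mask: (dim,) boolean mask standing in for SD(J, .); True marks
                  control dimensions, False marks observation dimensions
    J:            differentiable task objective mapping (n, d_obs) -> (n,)
    """
    mu = mu.detach().clone()
    mu_ctrl = mu[:, control_mask].detach().requires_grad_(True)
    mu_obs = mu[:, ~control_mask].detach().requires_grad_(True)

    # Particle guidance on the control component (diversity between particles).
    (grad_phi,) = torch.autograd.grad(rbf_potential(mu_ctrl), mu_ctrl)
    # Gradient guidance on the observation component (task reward).
    (grad_J,) = torch.autograd.grad(J(mu_obs).sum(), mu_obs)

    mu[:, control_mask] = mu_ctrl.detach() + alpha_p * sigma * grad_phi
    mu[:, ~control_mask] = mu_obs.detach() + alpha_g * sigma * grad_J

    # Sample tau^{i-1} ~ N(mu^{i-1}, Sigma^i).
    return mu + (sigma ** 0.5) * torch.randn_like(mu)
```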
For each parent trajectory, a random branch site is selected, and a child trajectory is generated by denoising from a partially noised version of the parent trajectory, refining the parent with gradient guidance signals. Sub-tree expansion proceeds as:
$$ \boldsymbol{\tau}_{\text{child}}^{N_f} \sim q_{N_f}(\boldsymbol{\tau}_{\text{parent}}, \boldsymbol{C}), \quad \text{where} \; \boldsymbol{C}=\{ \boldsymbol{s}_k \}_{k=0}^{b} \; \text{and} \; b\sim \text{Uniform}\left(0, T_{\text{pred}}\right) $$
where $\boldsymbol{C}$ denotes the parent trajectory prefix, $q_{N_f}$ is the partial forward noising distribution to noise level $N_f$, and $\boldsymbol{\tau}_{\text{child}}^{N_f}$ is the partially noised trajectory from which the child trajectory is denoised during sub-tree expansion.
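A minimal sketch of this partial forward noising is shown below, assuming a DDPM-style cumulative noise schedule (`alphas_cumprod`) and inpainting-style conditioning on the parent prefix; the variable names and the exact conditioning mechanism are assumptions for illustration.

```python
import torch

def subtree_expansion(tau_parent, N_f, alphas_cumprod, T_pred):
    """Spawn a child by partially re-noising a parent trajectory.

    tau_parent:     (T_pred, dim) denoised parent trajectory
    N_f:            noise level from which the child will be re-denoised
    alphas_cumprod: (N,) cumulative products of the noise schedule
    """
    # Random branch site b ~ Uniform(0, T_pred); keep the prefix C = {s_k}_{k<=b}.
    b = torch.randint(0, T_pred, (1,)).item()
    prefix = tau_parent[: b + 1].clone()

    # Forward-noise the parent to level N_f: tau_child^{N_f} ~ q_{N_f}(tau_parent, C).
    a_bar = alphas_cumprod[N_f]
    noise = torch.randn_like(tau_parent)
    tau_child = a_bar.sqrt() * tau_parent + (1.0 - a_bar).sqrt() * noise

    # Keep the parent prefix fixed as conditioning while the rest is re-denoised.
    tau_child[: b + 1] = prefix
    return tau_child, prefix
```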
The full algorithm of TDP is provided in Algorithm.
The Maze2D gold-picking task is a planning problem with a non-differentiable test-time constraint: the agent must generate a feasible trajectory that satisfies an initial state, a final goal state, and an intermediate goal state (the gold position).
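To illustrate why this constraint is non-differentiable, a sketch of an indicator-style check is given below; the radius threshold and trajectory format are assumptions.

```python
import numpy as np

def passes_gold(traj_xy, gold_pos, radius=0.5):
    """Test-time constraint check: does the trajectory come within `radius`
    of the gold position at any timestep? The 0/1 outcome provides no useful
    gradient, so it can only drive sample selection or branching, not
    gradient guidance."""
    dists = np.linalg.norm(np.asarray(traj_xy) - np.asarray(gold_pos), axis=1)
    return bool((dists < radius).any())

# Illustrative usage with a straight-line trajectory through the maze.
traj = np.stack([np.linspace(0.0, 4.0, 50), np.linspace(0.0, 4.0, 50)], axis=1)
print(passes_gold(traj, gold_pos=[2.0, 2.0]))  # True
```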
Gradient-based guidance typically requires selecting a guidance strength $\alpha$ to balance adherence to the guide signal and trajectory fidelity.
However, $\alpha$ is highly task-dependent, and exhaustive tuning across tasks introduces significant overhead during evaluation.
On the Maze2D gold-picking task, the MCSS (Monte-Carlo Sampling with Selection) baseline exhibits $\alpha$-dependent performance, whereas TDP remains robust across varying values of the guidance strength $\alpha$.
$\alpha_0$ denotes the guidance strength used in the main paper.
We introduce a non-convex exploration task in a robot arm manipulation environment. The agent must infer a suitable placement location based on the reward distribution and plan pick-and-place actions. Since $x^*_{local}$ has a wide peak and $x^*_{global}$ has a narrow peak, agents easily get stuck in the local optimum unless the planner sufficiently explores the trajectory space. Mono-level guided sampling methods (i.e., TAT [2] and MCSS) tend to converge to the local optimum, often stacking all blocks at $x^*_{local}$.
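A 1-D caricature of this reward landscape, with assumed positions and widths, illustrates why purely gradient-guided samples tend to climb the wide local peak:

```python
import numpy as np

# Assumed positions/widths: a wide local peak at X_LOCAL, a narrow but higher
# global peak at X_GLOBAL (illustrative values only).
X_LOCAL, X_GLOBAL = 0.3, 0.8

def placement_reward(x):
    wide_local    = 1.0 * np.exp(-(x - X_LOCAL) ** 2 / (2 * 0.15 ** 2))
    narrow_global = 1.5 * np.exp(-(x - X_GLOBAL) ** 2 / (2 * 0.03 ** 2))
    return wide_local + narrow_global

# Gradient ascent from most starting points converges to the wide peak because
# the narrow global peak has near-zero gradient almost everywhere, so a planner
# must explore the trajectory space broadly to find it.
xs = np.linspace(0.0, 1.0, 1001)
print(xs[np.argmax(placement_reward(xs))])  # ~0.8, the global optimum
```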
We introduce a multi-reward exploration task in the AntMaze environment. The diffusion planner predicts the next 64 steps (highlighted in bright colors on the map) using a combined Gaussian reward signal derived from multiple goals. Goals must be visited in descending order of priority, with higher-priority goals represented by stronger and narrower Gaussian functions.
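A minimal sketch of such a combined reward signal, with hypothetical goal positions, weights, and widths (higher priority = stronger, narrower peak):

```python
import numpy as np

def combined_gaussian_reward(pos, goals, weights, sigmas):
    """Combined reward at position `pos` from K goals; goals are ordered by
    descending priority, so weights decrease and sigmas increase with index.

    pos:     (2,) agent position
    goals:   (K, 2) goal positions
    weights: (K,) peak heights
    sigmas:  (K,) peak widths
    """
    sq_dists = np.sum((np.asarray(goals) - np.asarray(pos)) ** 2, axis=1)
    return float(np.sum(np.asarray(weights)
                        * np.exp(-sq_dists / (2 * np.asarray(sigmas) ** 2))))

# Hypothetical configuration: g1 strongest/narrowest, g4 weakest/widest.
goals = np.array([[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0]])
print(combined_gaussian_reward([1.1, 0.9], goals,
                               weights=[4.0, 3.0, 2.0, 1.0],
                               sigmas=[0.3, 0.5, 0.7, 0.9]))
```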
For example, as illustrated in the figure above, the first goal the agent visits is $g_2$ at $t = t_3$. If the agent subsequently visits $g_1$, $g_4$, and $g_3$ after $t=t_3$, it successfully reaches all four goals ($g_2 \rightarrow g_1 \rightarrow g_4 \rightarrow g_3$). However, two precedence rules ($g_1 \rightarrow g_2$ and $g_3 \rightarrow g_4$) are violated, while the remaining four ($g_2 \rightarrow g_4$, $g_2 \rightarrow g_3$, $g_1 \rightarrow g_4$, and $g_1 \rightarrow g_3$) are satisfied. In this scenario, the agent achieves a goal completion score of 4/4 but only 4/6 priority sequence match accuracy. The maximum accuracy of 6/6 is obtained only when all goals are visited in the correct prioritized order (i.e., $g_1 \rightarrow g_2 \rightarrow g_3 \rightarrow g_4$).
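The scoring in this example can be reproduced with a short pairwise check (a sketch; only the metric names above are assumed):

```python
from itertools import combinations

def goal_metrics(visit_order, priority_order):
    """Goal completion and priority-sequence match accuracy.

    visit_order:    goals in the order the agent reached them
    priority_order: goals listed from highest to lowest priority
    """
    completed = len(set(visit_order) & set(priority_order))
    rank = {g: i for i, g in enumerate(visit_order)}
    rules = list(combinations(priority_order, 2))  # 6 precedence rules for 4 goals
    satisfied = sum(1 for hi, lo in rules
                    if hi in rank and lo in rank and rank[hi] < rank[lo])
    return completed, len(priority_order), satisfied, len(rules)

# The scenario above: visiting g2 -> g1 -> g4 -> g3 gives 4/4 goals but 4/6 rules.
print(goal_metrics(["g2", "g1", "g4", "g3"], ["g1", "g2", "g3", "g4"]))
# (4, 4, 4, 6)
```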
In these videos, COMPLETE indicates that the agent has visited all four goals, while SUCCESS requires visiting all four goals as well as reaching them in the correct prioritized order.
[1]: Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi S. Jaakkola. Particle guidance: non-i.i.d. diverse sampling with diffusion models, 2024. URL https://arxiv.org/abs/2310.13102.
[2]: Lang Feng, Pengjie Gu, Bo An, and Gang Pan. Resisting stochastic risks in diffusion planners with the trajectory aggregation tree, 2024. URL https://arxiv.org/abs/2405.17879.
[3]: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning, 2021. URL https://arxiv.org/abs/2004.07219.
[4]: Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning, 2020. URL https://arxiv.org/abs/1802.08705.