Task and Motion Planning (TAMP) integrates high-level task planning with low-level motion feasibility, but existing methods are costly in long-horizon problems due to excessive motion sampling. While LLMs provide commonsense priors, they lack 3D spatial reasoning and cannot ensure geometric or dynamic feasibility. We propose a kinodynamic TAMP framework based on a hybrid state tree that uniformly represents symbolic and numeric states during planning, enabling task and motion decisions to be made jointly. Kinodynamic constraints embedded in the TAMP problem are verified by an off-the-shelf motion planner and physics simulator, and a VLM guides the exploration of a TAMP solution and backtracks the search based on visual renderings of the states. Experiments in simulated domains and in the real world show relative increases of 32.14% to 1166.67% in average success rate compared to traditional and LLM-based TAMP planners, as well as reduced planning time on complex problems, with ablations further highlighting the benefits of VLM guidance.
An overview of our kinodynamic TAMP pipeline. We employ a physics simulator as the transition model to verify both geometric and dynamic constraints, and leverage a VLM to guide the search.
Given a problem PDDL and a domain PDDL, a top-k symbolic planner first generates diverse symbolic task plans. We then construct a discrete state graph G that represents a reduced skeleton space, allowing us to explore alternative task-level decisions without restarting the symbolic planner when motion refinement fails.
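As a rough illustration of how several symbolic plans can be merged into one discrete state graph, the sketch below replays each plan from the initial state and records every transition it visits. The `EFFECTS` table, the action names, and `build_state_graph` are hypothetical and not taken from the paper; preconditions are not checked, and the plans are hard-coded rather than produced by a top-k planner.

```python
from collections import defaultdict

# Hypothetical STRIPS-like effects: action -> (facts added, facts deleted).
EFFECTS = {
    "pick(a)": ({"holding(a)"}, {"on_table(a)"}),
    "pick(b)": ({"holding(b)"}, {"on_table(b)"}),
    "place(a)": ({"at_goal(a)"}, {"holding(a)"}),
    "place(b)": ({"at_goal(b)"}, {"holding(b)"}),
}

def apply_action(state, action):
    """Return the symbolic successor state (preconditions are not checked)."""
    add, delete = EFFECTS[action]
    return frozenset((state - delete) | add)

def build_state_graph(init, plans):
    """Merge plans into a graph: state -> {action: next_state}."""
    graph = defaultdict(dict)
    for plan in plans:
        state = frozenset(init)
        for action in plan:
            nxt = apply_action(state, action)
            graph[state][action] = nxt  # states shared across plans are merged
            state = nxt
    return dict(graph)

init = {"on_table(a)", "on_table(b)"}
plans = [  # stand-in for the output of a top-k symbolic planner
    ["pick(a)", "place(a)", "pick(b)", "place(b)"],
    ["pick(b)", "place(b)", "pick(a)", "place(a)"],
]
G = build_state_graph(init, plans)
```

Because identical symbolic states collapse into a single node, the initial state here ends up with two outgoing actions, so a search can switch to the alternative ordering without calling the symbolic planner again.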
Guided by G, we expand a hybrid state tree T, where each edge is refined through motion planning and validated via physics simulation. We further leverage a VLM to guide child node selection during search, enabling a greedy BFS-like exploration biased by the VLM.
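A minimal sketch of one expansion step of such a tree is shown below. It assumes that each candidate action is first refined into a feasible child and that a scoring function then picks the preferred child; the paper does not specify this ordering here. `Node`, `expand`, `refine`, and `score` are hypothetical names, with `refine` standing in for motion planning plus physics simulation and `score` standing in for the VLM.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    symbolic: frozenset               # symbolic facts
    numeric: dict                     # numeric state, e.g. object poses
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

def expand(node, candidates, refine, score):
    """Refine each candidate action, keep feasible children, return the best one.

    refine(node, action) returns (symbolic, numeric) or None if infeasible.
    score(child) returns a number; higher means more promising.
    """
    for action in candidates:
        result = refine(node, action)
        if result is not None:
            symbolic, numeric = result
            node.children.append(Node(symbolic, numeric, parent=node))
    if not node.children:
        return None                   # expansion failed
    return max(node.children, key=score)

# Toy usage: "bad" is infeasible, and the scorer prefers the "stack" child.
def toy_refine(node, action):
    if action == "bad":
        return None
    return frozenset({action}), {"a": (1.0, 0.0)}

root = Node(frozenset({"on_table(a)"}), {"a": (0.0, 0.0)})
best = expand(root, ["bad", "stack", "push"], toy_refine,
              score=lambda n: 1.0 if "stack" in n.symbolic else 0.0)
```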
If a node ht fails to expand, we retry random sampling up to K times. If expansion remains unsuccessful, we prompt the VLM to predict a backtracking node hr, from which the expansion resumes. The VLM is provided with simulator-rendered images of the current node ht, the goal state, a JSON-encoded representation of the hybrid state tree expanded so far, and constraint-violation feedback from previous expansion attempts.
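The retry-then-backtrack logic can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `try_expand` and `pick_backtrack` are hypothetical callbacks standing in for sampling-based expansion and the VLM query, and the sketch reads "retry up to K times" as one initial attempt plus at most K retries.

```python
import random

def expand_with_retries(node, try_expand, pick_backtrack, K=3, rng=None):
    """Attempt expansion, retrying with fresh samples before backtracking.

    try_expand(node, rng) returns (child, violation); child is None on failure.
    pick_backtrack(node, feedback) returns the node to resume from.
    """
    rng = rng or random.Random(0)
    feedback = []                      # constraint violations seen so far
    for _ in range(1 + K):             # one initial attempt plus K retries
        child, violation = try_expand(node, rng)
        if child is not None:
            return "expanded", child, feedback
        feedback.append(violation)
    return "backtrack", pick_backtrack(node, feedback), feedback

# Toy usage: expansion always fails, so the search backtracks to "root".
always_fail = lambda node, rng: (None, "collision")
status, target, log = expand_with_retries(
    "h_t", always_fail, lambda node, fb: "root", K=2)
```

Passing the accumulated violations to the backtracking callback mirrors the constraint-violation feedback described above; the rendered images and the JSON-encoded tree are omitted from this sketch.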
[Table 1] reports the average success rates (%) and planning times (s) of all baselines for 3 ≤ n ≤ 6, where n denotes the number of target objects. Planning times are averaged over successful trials only. For methods with a 0% success rate, the planning time is reported as “Timeout”.
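The reporting convention described above can be made concrete with a short sketch. The `summarize` helper and the trial data are hypothetical and serve only to illustrate the convention.

```python
def summarize(trials):
    """Summarize trials given as (success, planning_time_s) pairs.

    Returns (success_rate_percent, time), where time is the mean planning
    time over successful trials, or "Timeout" if no trial succeeded.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    success_times = [t for ok, t in trials if ok]
    rate = 100.0 * len(success_times) / len(trials)
    if not success_times:
        return rate, "Timeout"
    return rate, sum(success_times) / len(success_times)

# Illustrative data only: the failed trial does not affect the mean time.
rate, mean_time = summarize([(True, 10.0), (False, 60.0), (True, 20.0)])
```

Averaging over successes only means a method with few successes can show a low planning time, so the two columns should be read together.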
Our approach outperforms PDDLStream, a domain-independent traditional TAMP baseline, as well as LLM3, an LLM-based TAMP baseline, while jointly considering both geometric and dynamic constraints.
Robotic demonstration of our TAMP planner in the Blocksworld domain (n = 6) using dual UR5e manipulators. The initial configuration consists of six blocks stacked on the table (leftmost image), and the goal is to rearrange them into the stacking sequence shown in 14. Please refer to the video above for a real-time demonstration.
@article{kwon2025kinodynamic,
title={Kinodynamic Task and Motion Planning using VLM-guided and Interleaved Sampling},
author={Kwon, Minseo and Kim, Young J},
journal={arXiv preprint arXiv:2510.26139},
year={2025}
}