LLM Agent & Embodied VLA

Agent and Embodied Large Model Group

About Us

In recent years, large language models (LLMs), pre-trained at scale on massive cross-domain corpora, have gradually evolved into transferable, general-purpose knowledge bases and exhibit strong capabilities in semantic understanding, contextual reasoning, task decomposition, and multi-step planning. These capabilities give them significant potential for open-world problems, especially in scenarios that demand flexible responses and dynamic adjustment.


To fully realize the decision-making potential of large models in real-world open environments, we focus on the following research directions:


1. LLM Agent: treating the large model as the "brain", we equip it with goal understanding, task decomposition, tool use, and experiential memory, enabling it to make proactive decisions and adjust dynamically in complex tasks (a minimal agent-loop sketch follows this list).


2. Embodied VLA: we embed large models in physical environments by coupling vision-language-action (VLA) models with robot control systems, endowing them with environment perception and physical interaction capabilities. The model not only "knows what to do", but can also "see the environment change", "move to a target location", and "grasp a target object", continually refining its behavior policy through real-time interaction with the environment.
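
The LLM Agent direction can be summarized as a simple perceive-plan-act loop. The sketch below is a minimal, self-contained illustration of that loop; the LLM call is stubbed out and all tool names are hypothetical, so it is a conceptual example rather than any of our released systems.

```python
# A minimal, illustrative LLM-agent loop (hypothetical names; not a specific
# framework of ours). The LLM call is stubbed so the sketch runs on its own;
# in practice it would query a real model endpoint.
import json
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Stub for an LLM call; a real agent would query a model here."""
    return json.dumps({"tool": "search", "args": {"query": "today's weather"}})

# Tool registry: the "hands" the agent can call.
TOOLS: Dict[str, Callable[..., str]] = {
    "search": lambda query: f"(search results for: {query})",
}

def run_agent(goal: str, max_steps: int = 3) -> List[str]:
    memory: List[str] = []                 # experience memory across steps
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Memory: {memory}\n"
            'Decide the next tool call as JSON {"tool": ..., "args": ...}.'
        )
        decision = json.loads(call_llm(prompt))        # goal understanding + planning
        tool, args = decision["tool"], decision["args"]
        observation = TOOLS[tool](**args)              # tool use
        memory.append(f"{tool}({args}) -> {observation}")  # store experience
    return memory

if __name__ == "__main__":
    print(run_agent("Report today's weather"))
```

In the Embodied VLA setting, the same loop is grounded in the physical world: the tool registry becomes a set of perception and control primitives executed on a robot.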

Research Directions

LLM Agent

  • LLM Tool Use
  • GUI Agent
  • Multi-Agent Collaboration

Embodied VLA

  • Reinforcement Learning Training (RL for VLA)
  • Spatial Understanding and Perception

Recent Work

LLM Agent

CoBel-World Framework
CoBel-World: Harnessing LLM Reasoning to Build a Collaborative Belief World for Optimizing Embodied Multi-Agent Collaboration
Under Review [CCF-A] Paper
Effective real-world multi-agent collaboration requires not only accurate planning but also the ability to reason about collaborators' intents -- a crucial capability for avoiding miscoordination and redundant communication in partially observable environments. Due to their strong planning and reasoning capabilities, large language models (LLMs) have emerged as promising autonomous agents for collaborative task solving. However, existing collaboration frameworks for LLMs overlook their reasoning potential for dynamic intent inference, and thus produce inconsistent plans and redundant communication, reducing collaboration efficiency. To bridge this gap, we propose CoBel-World, a novel framework that equips LLM agents with a collaborative belief world -- an internal representation jointly modeling the physical environment and collaborators' mental states. CoBel-World enables agents to parse open-world task knowledge into structured beliefs via a symbolic belief language, and perform zero-shot Bayesian-style belief updates through LLM reasoning. This allows agents to proactively detect potential miscoordination (e.g., conflicting plans) and communicate adaptively. Evaluated on challenging embodied benchmarks (i.e., TDW-MAT and C-WAH), CoBel-World significantly reduces communication costs by 22-60% and improves task completion efficiency by 4-28% compared to the strongest baseline. Our results show that explicit, intent-aware belief modeling is essential for efficient and human-like collaboration in LLM-based multi-agent systems.
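
As a rough illustration of the belief-modeling idea above (not CoBel-World's actual belief language or implementation), the toy sketch below keeps a structured belief dictionary about a collaborator and asks a stubbed LLM to revise it after each observation; all keys and values are invented for the example.

```python
# Toy illustration of maintaining structured beliefs about the environment and a
# partner, with revisions delegated to an LLM. All names are hypothetical; this
# is not CoBel-World's released code.
import json

def call_llm(prompt: str) -> str:
    """Stub; a real system would query an LLM to reason over the observation."""
    return json.dumps({"partner_intent": "fetch_apple",
                       "object_locations": {"apple": "kitchen"}})

def update_beliefs(beliefs: dict, observation: str) -> dict:
    prompt = (
        "Current beliefs (JSON): " + json.dumps(beliefs) + "\n"
        "New observation: " + observation + "\n"
        "Return revised beliefs as JSON with keys 'partner_intent' and 'object_locations'."
    )
    revised = json.loads(call_llm(prompt))   # Bayesian-style revision via LLM reasoning
    beliefs.update(revised)
    return beliefs

beliefs = {"partner_intent": "unknown", "object_locations": {}}
beliefs = update_beliefs(beliefs, "Partner walked toward the kitchen holding a basket.")
# A planner could now skip redundant messages: if the inferred intent does not
# conflict with its own plan, no communication is needed.
print(beliefs)
```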
CATP-LLM Framework
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
ICCV 2025 [CCF-A] Paper Code
Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g., vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g., execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their benefits in terms of task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, to facilitate efficient concurrent tool execution and cost reduction, we design a tool planning language to enhance the LLM for creating multi-branch non-sequential plans. Moreover, we propose a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. In the absence of public cost-related datasets, we further present OpenCATP, the first dataset for cost-aware planning, which comprises 11,100 evaluation samples from diverse tasks. Extensive experiments show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with an average improvement of 1.5%-93.9% in terms of plan quality.
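
To make the cost-aware planning setting concrete, the sketch below models a plan as a small DAG of tool calls and scores it by task reward minus weighted latency, where independent branches run concurrently. The data structure, numbers, and scoring rule are illustrative assumptions, not the OpenCATP format or CATP-LLM's planning language.

```python
# Illustrative cost-aware tool plan: tools are nodes of a DAG, edges express data
# dependencies, and a plan is scored by reward minus weighted latency.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ToolNode:
    name: str
    exec_cost: float                                   # e.g., expected execution time (s)
    deps: List[str] = field(default_factory=list)      # tools whose outputs this node consumes

def plan_score(nodes: Dict[str, ToolNode], task_reward: float, cost_weight: float = 0.1) -> float:
    """Performance-cost trade-off: higher is better."""
    # Independent branches can run concurrently, so latency is bounded by the
    # longest dependency chain rather than the sum of all costs.
    def chain_cost(name: str) -> float:
        node = nodes[name]
        return node.exec_cost + max((chain_cost(d) for d in node.deps), default=0.0)
    latency = max(chain_cost(n) for n in nodes)
    return task_reward - cost_weight * latency

plan = {
    "detect":  ToolNode("detect",  exec_cost=0.8),
    "caption": ToolNode("caption", exec_cost=1.2),                  # runs in parallel with "detect"
    "answer":  ToolNode("answer",  exec_cost=0.5, deps=["detect", "caption"]),
}
print(plan_score(plan, task_reward=1.0))   # 1.0 - 0.1 * (1.2 + 0.5) = 0.83
```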

End-to-End Decision Making and Control with Large Models

Trailblazer Framework
Large Language Models as Generalist Policies for Network Optimization
In Submission Paper
Designing control policies to ensure robust network services is essential to modern digital infrastructure. However, the dominant paradigm for network optimization relies on designing specialist policies based on handcrafted rules or deep learning models, leading to poor generalization across diverse tasks and environments. In contrast, large language models (LLMs), pretrained on Internet-scale corpora, provide a rich and unified knowledge base that encodes fundamental networking principles. Combined with their emergent ability to generalize to unseen scenarios, LLMs offer a transformative foundation for generalist network policies that can generalize across diverse tasks and environments with minimal adaptation. In this paper, we present Trailblazer, the first systematic framework to realize such a generalist policy for networking. Trailblazer incorporates a network alignment scheme to ground the LLM in specific networking tasks, and an adaptive policy collaboration mechanism that offloads simple control cases from the LLM to a lightweight policy for computational efficiency. Through extensive simulations and large-scale real-world online evaluation on Douyin (the Chinese version of TikTok), Trailblazer, powered by a single LLM, demonstrates stronger cross-task and cross-environment generalization than conventional specialist policies. Our results validate LLMs as the foundation for generalist network policies, and position Trailblazer as a first step toward a generalist-driven paradigm that enables strong generalization with minimal effort in policy design.
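
The adaptive policy collaboration idea can be pictured as a confidence-gated router: a cheap policy handles routine cases and escalates ambiguous ones to the LLM. The sketch below illustrates this with an adaptive-bitrate example; the rule, threshold, and numbers are invented for illustration and do not reflect Trailblazer's actual mechanism.

```python
# Toy sketch of confidence-gated offloading between a lightweight policy and an LLM.
def lightweight_policy(state: dict) -> tuple:
    """Cheap rule/model; returns (action, confidence)."""
    if state["buffer_s"] > 10.0:          # plenty of buffer: safe to raise bitrate
        return "increase_bitrate", 0.95
    if state["buffer_s"] < 2.0:           # near rebuffering: clearly lower bitrate
        return "decrease_bitrate", 0.9
    return "keep_bitrate", 0.5            # ambiguous region

def llm_policy(state: dict) -> str:
    """Stub for the LLM-based generalist policy (expensive, called sparingly)."""
    return "keep_bitrate"

def act(state: dict, conf_threshold: float = 0.8) -> str:
    action, conf = lightweight_policy(state)
    return action if conf >= conf_threshold else llm_policy(state)

print(act({"buffer_s": 12.0}))   # handled cheaply -> increase_bitrate
print(act({"buffer_s": 5.0}))    # escalated to the LLM -> keep_bitrate
```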
NetLLM Framework
NetLLM: Adapting Large Language Models for Networking
SIGCOMM 2024 [CCF-A] (140+ Citations, 180+ GitHub Stars) Paper Code
Many networking tasks now employ deep learning (DL) to solve complex prediction and optimization problems. However, the current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep neural networks (DNNs) for different networking tasks. Besides, DNNs tend to achieve poor generalization performance on unseen data distributions/environments. Motivated by the recent success of large language models (LLMs), this work studies LLM adaptation for networking to explore a more sustainable design philosophy. With its powerful pre-trained knowledge, the LLM is a promising foundation model for achieving "one model for all tasks" with even better performance and stronger generalization. In pursuit of this vision, we present NetLLM, the first framework that provides a coherent design to harness the powerful capabilities of LLMs with low effort to solve networking problems. Specifically, NetLLM empowers the LLM to effectively process multimodal data in networking and efficiently generate task-specific answers. Besides, NetLLM drastically reduces the costs of fine-tuning the LLM to acquire domain knowledge for networking. Across three networking-related use cases -- viewport prediction, adaptive bitrate streaming, and cluster job scheduling -- we showcase that the NetLLM-adapted LLM significantly outperforms state-of-the-art algorithms.
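
A common recipe behind this kind of adaptation is to keep the pretrained backbone frozen and train only small input encoders and task heads. The PyTorch sketch below illustrates that recipe for a bitrate-selection head; the module sizes and shapes are assumptions for the example and are not taken from NetLLM's released code.

```python
# Minimal sketch of the "frozen backbone + small adapters" recipe: an encoder maps
# networking state into the backbone's embedding space, a head maps hidden states
# to an answer (a bitrate level); only encoder and head would be trained.
import torch
import torch.nn as nn

HIDDEN = 64        # stand-in for the LLM's hidden size
NUM_BITRATES = 6   # discrete bitrate levels for adaptive streaming

backbone = nn.TransformerEncoder(                     # stands in for a pretrained LLM
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True), num_layers=2)
for p in backbone.parameters():
    p.requires_grad = False                           # keep the "LLM" frozen

encoder = nn.Linear(10, HIDDEN)                       # 10 networking features -> token embedding
head = nn.Linear(HIDDEN, NUM_BITRATES)                # task-specific answer head

def predict(network_state: torch.Tensor) -> torch.Tensor:
    tokens = encoder(network_state)                   # (batch, seq, HIDDEN)
    hidden = backbone(tokens)
    return head(hidden[:, -1])                        # logits over bitrate levels

logits = predict(torch.randn(1, 8, 10))               # batch of 1, sequence of 8 measurements
print(logits.shape)                                   # torch.Size([1, 6])
```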

Join Us

Prof. Zhi Wang

Tsinghua Shenzhen International Graduate School

Homepage: https://www.mmlab.top

Email: wangzhi@sz.tsinghua.edu.cn

Duo Wu

Ph.D. Student

Homepage: https://duowuyms.github.io

Email: wu-d24@mails.tsinghua.edu.cn