Transcrib3D: 3D Referring Expression Resolution through Large Language Models

¹Toyota Technological Institute at Chicago, ²Toyota Research Institute. *Equal contribution.

Transcrib3D reasons and acts on complex 3D referring expressions with real robots.

Abstract

If robots are to work effectively alongside people, they must be able to interpret natural language references to objects in their 3D environment. Understanding 3D referring expressions is challenging---it requires the ability to both parse the 3D structure of the scene and correctly ground free-form language in the presence of distraction and clutter. We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models (LLMs). Using text as the unifying medium allows Transcrib3D to sidestep multi-modal representation learning, which would require massive amounts of annotated 3D data. As a demonstration of its effectiveness, Transcrib3D achieves state-of-the-art results on 3D referring benchmarks, a significant leap in performance over previous multi-modality baselines. To improve upon zero-shot performance and facilitate local deployment on edge computers and robots, we propose fine-tuning with self-correction, which trains smaller models whose performance approaches that of large models on this task. We deploy our method on a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
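As a concrete illustration of the "text as the unifying medium" idea, the sketch below converts the output of a 3D detector into a plain-text scene transcript that an LLM can reason over. All names here (`Detection`, `transcribe_scene`) and the exact line format are hypothetical illustrations, not the paper's actual code.

```python
# Minimal sketch: transcribe 3D detections into text for an LLM.
# The Detection fields assume a detector that outputs labeled, colored
# 3D bounding boxes from the colored point cloud.
from dataclasses import dataclass

@dataclass
class Detection:
    obj_id: int            # index assigned by the 3D detector
    label: str             # category name, e.g. "chair"
    color: str             # dominant color from the colored point cloud
    center: tuple          # (x, y, z) box center in meters
    extent: tuple          # (dx, dy, dz) box size in meters

def transcribe_scene(detections: list[Detection]) -> str:
    """Render detector output as one line of text per object."""
    lines = []
    for d in detections:
        lines.append(
            f"obj_{d.obj_id}: a {d.color} {d.label}, "
            f"center={d.center}, size={d.extent}"
        )
    return "\n".join(lines)

# Example:
# transcribe_scene([Detection(0, "chair", "brown", (1.2, 0.5, 0.4), (0.6, 0.6, 0.9))])
# -> "obj_0: a brown chair, center=(1.2, 0.5, 0.4), size=(0.6, 0.6, 0.9)"
```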

Framework


The overall Transcrib3D framework takes as input a colored point cloud and a referring expression (in green), and outputs the ID or bounding box of the referent object. To resolve the referring expression "the chair in the corner of the room, between the white and yellow desks", the framework must locate the chair in the green box, while all other chairs, shown in red boxes, are distractors.
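Building on the transcription sketch above, the following shows one plausible form of the resolution step: pack the scene transcript and referring expression into a prompt, query an LLM (here via the OpenAI chat-completions API), and parse the returned object ID. The prompt wording, model choice, and ANSWER-line parsing are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch of the end-to-end resolution step (assumptions as noted above).
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def resolve_referent(scene_transcript: str, referring_expression: str) -> int:
    """Ask the LLM which detected object the expression refers to."""
    prompt = (
        "You are given a 3D scene described as a list of detected objects:\n"
        f"{scene_transcript}\n\n"
        f'Which object is "{referring_expression}"? '
        "Reason step by step, then answer on a single line: ANSWER: obj_<id>"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative choice of model
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    match = re.search(r"ANSWER:\s*obj_(\d+)", text)
    if match is None:
        raise ValueError(f"Could not parse referent ID from: {text!r}")
    return int(match.group(1))
```

Because the interface is text in both directions, the same call pattern works zero-shot with a large hosted model or with a smaller model fine-tuned for local deployment, as described in the abstract.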

Experiments


Comparison with Code as Policies

First example; second example.

Unlike other frameworks (e.g., Code as Policies) that rely on off-the-shelf open-vocabulary object detectors, which typically struggle to understand complex referring expressions, our method correctly identifies the referred objects even under complex referring expressions.