Tools Fail: Detecting Silent Errors in Faulty Tools (2024)

Jimin Sun1* So Yeon Min2 Yingshan Chang2 Yonatan Bisk2
1CohereAI  2Carnegie Mellon University
jimin@cohere.com

Abstract

Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.



1 Introduction

Tools offer a convenient way to augment capabilities beyond text-based reasoning, from executing code to incorporating recent data through web search, and even facilitating multimodal interactions. While the term "tool" is often interpreted to mean offloading specific deterministic functions to external APIs, as tasks grow more complex, the definition is expanding to include learned modules such as translators and object detectors, as well as heuristics-based policies like search algorithms and robotic skills. LLMs themselves are also being used as tools, particularly as task planners in robotics, chained with vision models and robot policies to perform navigation and manipulation (Ahn et al., 2022; Huang et al., 2022a, b; Liang et al., 2022; Singh et al., 2022a; Li et al., 2023; Xu et al., 2023; Zeng et al., 2023).

*Work done while at Carnegie Mellon University.

As tools take on more responsibilities, assessing and ensuring their reliability becomes crucial; a failure in one tool can trigger a cascade of errors, leading to complete task failure. Recent studies have suggested recovery mechanisms, such as correcting inputs based on API error messages (Pan et al., 2023a; Zhang et al., 2023; Chen et al., 2023b; Pan et al., 2023b). However, most methods rely on two underlying assumptions: that accurate inputs guarantee flawless outputs, and that errors are accompanied by explicit signals. Yet real-world scenarios challenge these premises, as failures often arise from unpredictable environmental dynamics and inherent inaccuracies of the tools themselves.


This paper introduces a taxonomy to categorize sources of tool-related errors and recovery methods. We shed light on the often overlooked case: “tool-based” failures. As opposed to input-based errors which are often accompanied by error messages, most tool failures are “silent.” This poses unique reasoning challenges for the LLM, which must actively 1. detect the failure, 2. infer the source, and 3. plan recovery strategies. In this paper, we focus on the first step, detection, as it is the prerequisite for downstream fault assignment and recovery.

We investigate tool errors in two distinct settings (Fig. 1): a controlled environment where an LLM solves arithmetic problems using a broken calculator, and a more natural "broken" tool setting involving a multimodal instruction-following agent. We investigate whether LLMs can detect incorrect tool outputs without explicit error signals, and observe a tendency to overtrust tools. Motivated by how humans detect tool failures based on internal expectations of correct outputs, we devise three in-context interventions, and find that LLMs can learn to doubt tools and detect mistakes. Following the taxonomy, we further examine how much and what type of deviation is necessary to trigger the LLM's recognition of the tool error in each setting.

2 Related Work

Tools

Text-based tools help compensate for LLMs' relative weakness in world knowledge and computational precision (Lewis et al., 2020; Parisi et al., 2022; Gao et al., 2023; Schick et al., 2023; Yao et al., 2023). Multimodal tools allow LLMs to receive inputs from other modalities and generate grounded answers (Gupta and Kembhavi, 2023; Wu et al., 2023; Yang et al., 2023; Zeng et al., 2023). Outputs of vision-language models (Radford et al., 2021), object detectors, OCR models, and speech-to-text APIs (Zeng et al., 2023) have been added to the LLM's prompt, enabling zero-shot inference on multimodal tasks.

Agents

Research on LLM agents spans multi-step tasks in gaming (Wang et al., 2023a; Wu et al., 2024), web navigation (Qin et al., 2023; Shinn et al., 2023; Yao et al., 2023), and code generation (Shinn et al., 2023; Yao et al., 2023). Most work focuses on the selection and utilization of tools (Wang et al., 2023a; Qin et al., 2023; Wu et al., 2024), and on enhancing reasoning through self-evaluation and feedback (Shinn et al., 2023; Wang et al., 2023a; Chen et al., 2023a; Xu et al., 2023; Madaan et al., 2024).

Adapting LLMs to tool-use

Existing works use in-context learning (ICL) (Lu et al., 2023; Shen et al., 2024), finetuning (Schick et al., 2023), and trial-and-error (Wang et al., 2024) to adapt LLMs to tool-use. However, the focus has been on adapting to "newer" tools from demonstrations or documentation, and the question of tool reliability and recovering from "unreliable" tools has not been actively investigated. While malfunctioning APIs are preemptively filtered out in API-centric environments (Qin et al., 2023), strategies for addressing ineffective learned tools, as in games (Wang et al., 2023a; Wu et al., 2024) or multimodal tasks (Zeng et al., 2022), have been less explored. Overall, existing approaches tend to amalgamate various tool failure modes under the umbrella term "reasoning," focusing primarily on the most salient aspect of failure within their specific domain. In contrast, we distinctly identify and thoroughly analyze errors related to tool arguments, the tools themselves, and the alignment with environmental dynamics.

3 Background

Notation

We outline a typical tool-use scenario in Fig. 1a with the following notation:

  • x: task input
  • i: tool input
  • ŷ: predicted task output
  • o: tool output
  • c: context information
  • t_θ: tool

The LLM first selects tools and constructs tool-specific arguments i from the task input x. Based on the tool result o, the final task prediction ŷ is made. Notably, the flexibility of LLMs as an interface allows inputs to be enriched with context information c throughout the task; c may include task specifics, API docstrings, external feedback such as error messages, or even previous action trajectories in interactive tasks.
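To make the notation concrete, the following minimal Python sketch traces this flow. The helper names (call_llm, calculator) and the prompt strings are illustrative assumptions, not the paper's implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (hypothetical)."""
    raise NotImplementedError

def calculator(i: str) -> str:
    """A deterministic tool t_theta; here, arithmetic expression evaluation."""
    return str(eval(i))  # eval is acceptable only in this controlled sketch

def solve_with_tool(x: str, c: str) -> str:
    # The LLM constructs the tool-specific input i from the task input x and context c.
    i = call_llm(f"{c}\nTask: {x}\nWrite the expression to evaluate:")
    o = calculator(i)  # tool output o
    # The final prediction y_hat conditions on the task input, context, and tool output.
    y_hat = call_llm(f"{c}\nTask: {x}\nTool output: {o}\nAnswer:")
    return y_hat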

Additionally, we denote the oracle values of the input, output, and context as i*, o*, and c*. The tool input i and output o may contain inaccuracies since they are essentially outputs of preceding LLM/tool calls. Fig. 1b demonstrates a scenario where i contains a mistake (15 x 58 should be 15 * 58). The context c can also be incomplete or noisy, as it is an approximation of the real world. Moreover, the tool t_θ can be suboptimal in multiple dimensions. For deterministic APIs, a suboptimal tool may have been chosen by the LLM (Schick et al., 2023). For learned tools, the tool itself is an inherently imperfect parameterized model, hence the parameterization t_θ.

Defining Error

The suboptimality of i, c, and t_θ manifests as suboptimal tool outputs o that deviate from o*. The deviation can be critical and explicit, leading to error messages as in Fig. 1b, or weakly wrong like the object detector output in Fig. 1d. In fact, the severity of a tool error depends on how critically the mistake impacts downstream task performance. In Fig. 1d, the object detector misidentifying the Tomato as an Apple is crucial to the task, whereas mistaking objects like Bread would not hinder the task as much. As the high-level goal is task success rather than perfect tool utilization, it is important to rectify critical mistakes, whereas harmless mistakes can be disregarded.

To formalize this notion of "task-critical" tool-use mistakes, we introduce an error threshold ε to define a range of tool outputs that are not "critically" wrong. Intervention is only necessary when the deviation between the tool output and the oracle, d(o, o*), is larger than ε, thereby degrading the performance/quality of the final task output ŷ.

d(o, o*) > ε  ⟹  s_task(ŷ | o) < s_task(ŷ | o*)    (1)

where s_task ≔ task performance metric
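As a minimal sketch, the intervention rule of Eq. (1) can be read as a simple predicate; the deviation function d and the threshold epsilon are task-specific choices left open by the definition.

def needs_intervention(o, o_star, d, epsilon: float) -> bool:
    # Flag the tool output only when its deviation from the oracle is task-critical.
    return d(o, o_star) > epsilon

# Example with a numeric output and absolute difference as d:
assert needs_intervention(205, 25, lambda a, b: abs(a - b), epsilon=1.0)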

This is analogous to how humans approach errors; the goal is not a perfect world model but to accomplish a task.As long as we can grab the apple, we do not need to know its exact shape or coordinates.

4 Error sources

The tool output o is accurate if and only if:

  1. The tool inputs i are accurate.

  2. The context c is correct and sufficient.

  3. The tool t_θ makes correct predictions.

Formally, obtaining an output o whose deviation d(o, o*) is smaller than ε requires each component to stay within its own error bound:

d(o, o*) < ε  ⟸  d(i, i*) < ε_i  ∧  d(c, c*) < ε_c  ∧  d(t_θ, t_θ*) < ε_t    (2)

where the three conditions bound the tool input, the context, and the tool's correctness, respectively.

If any condition above is not met, output errors will lead to task failure. The following sections discuss each condition, and a table of corresponding real-world error scenarios is presented in App. A.

4.1 Input: d(i, i*) > ε_i

Imperfect tool inputs often result from incorrect outputs from a prior tool, like errors in LLM-generated code or noisy images. For deterministic tools (e.g., code interpreters), most errors are due to tool inputs, and malformed inputs typically trigger an error message. However, well-formed inputs with incorrect content (e.g., ambiguous queries for search APIs) can produce erroneous outputs that inadvertently propagate through subsequent steps.

4.2 Context: d(c, c*) > ε_c

Partial observability of the surrounding environment can be another source of tool error, resulting in a lack of context for a tool to function properly. This is often inevitable early in the planning trajectory in interactive task settings. For example, an embodied agent may need to explore hidden objects in closed receptacles through trial-and-error, in order to obtain enough information for the task.

4.3 Tool: d(t_θ, t_θ*) > ε_t

Tools themselves can make mistakes, even when the input or context is perfect. This situation is especially prominent as learnable tools are becoming more widely adopted in practice. LLMs are prone to generating factually incorrect statements even when reference documents are provided through context (Krishna et al., 2024). Search APIs might fail not because of the input query's clarity, but due to an imperfect database/dense retrieval method. The tool's precision can also contribute to failure – heuristic-based search/manipulation robot policies can fall apart when they lack the precision needed to address the complexity of real-world scenarios.

Due to the absence of explicit error signals, tool-based errors require the tool-using model to reason over indirect cues. In easier cases, errors can be recognized based on well-calibrated confidence scores. Much harder cases, however, arise when a tool confidently produces errors. In such scenarios, a broader context may help identify these hidden errors. Multiple tools presenting conflicting evidence (e.g., a fact verification tool vs. a search API), disagreement between different modalities (Lee et al., 2021), or prediction inconsistencies over multiple trials (Kadavath et al., 2022; Wang et al., 2023c) or timesteps (Chaplot et al., 2020), may help surface potential limitations of the tool.
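As an illustration of two such indirect cues, the sketch below flags a tool output when an available confidence score is low or when repeated calls disagree; the threshold values are illustrative assumptions only.

from collections import Counter

def flag_low_confidence(confidence: float, threshold: float = 0.5) -> bool:
    # Usable only when the tool exposes a (reasonably calibrated) confidence score.
    return confidence < threshold

def flag_inconsistency(outputs: list, min_agreement: float = 0.6) -> bool:
    # Reject when no single answer is produced by a clear majority of repeated trials.
    if not outputs:
        return True
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs) < min_agreement

print(flag_inconsistency(["25", "205", "25", "-25", "21"]))  # True: the trials disagree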

5 Recovery behaviors

Next, we categorize current recovery methods from previous literature into two groups: Refine and Replace, and advocate for meta-cognitive reasoning.

5.1 Refine: i → i*, c → c*

Recovering from tool failures often involves refining the tool input. This is particularly effective when the failure is followed by explicit feedback signals that indicate "what" to fix. Inputs can be rewritten guided by API error messages and human/LLM feedback (Madaan et al., 2023; Shinn et al., 2023; Wang et al., 2023b). In the planning literature (e.g., TAMP; Garrett et al., 2021; Ding et al., 2023), this is referred to as "closed-loop planning," where plans are continuously updated by new observations, task progress, or clarification questions (Huang et al., 2022b; Singh et al., 2022a; Song et al., 2022). Augmenting the context based on increased observability changes the input's interpretation. Refine methods are well-suited to LLMs as they can flexibly accept varying lengths of text-based feedback. In contrast, corrections to other modalities (e.g., image lighting or non-verbal communication) remain open challenges for VLMs.
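A minimal sketch of this Refine behavior, assuming a tool that raises explicit error messages and a hypothetical call_llm helper for rewriting the input:

def refine_loop(call_llm, tool, x: str, max_retries: int = 3):
    i = call_llm(f"Task: {x}\nWrite the tool input:")
    for _ in range(max_retries):
        try:
            return tool(i)  # success: return the tool output o
        except Exception as err:  # explicit feedback signal (e.g., API error message)
            i = call_llm(
                f"Task: {x}\nPrevious tool input: {i}\n"
                f"Error message: {err}\nRewrite the tool input:"
            )
    return tool(i)  # final attempt; may still raise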

5.2 Replace: t_θ → t_θ*

When errors originate from the tool itself, the aim is to move t_θ closer to t_θ*, aligning it more closely with the final task. Mitigation strategies vary based on how easily the tool can be fixed at inference time. For LLMs, in-context examples are used to elicit specific task capabilities from more generic reasoning abilities, a method further enhanced by retrieving samples that are more pertinent to the specific test example (Rubin et al., 2022; Song et al., 2022). Ensembles over multiple predictions also offer a non-invasive way to improve tool performance (Anil et al., 2023; Wang et al., 2023c; Chen et al., 2024). Test-time adaptation methods (Wang et al., 2021) can be useful, though their application requires access to the tool's internal parameters. The aforementioned strategies focus on improving the tool's performance in isolation, which may not translate to better task performance: in Fig. 1d, better ImageNet performance does not guarantee detecting the Tomato. Understanding the interplay between tools and task performance remains an open question of system dynamics and credit assignment.

When improving the tool is not viable or when adjustments are insufficient, the best strategy can be to switch to a different tool. Research on assistance-seeking agents implicitly models this behavior, with agents identifying when to delegate the action to a human/oracle (Singh et al., 2022b; Xie et al., 2022). In NLP, Krishna et al. (2024) introduce a fact-checking tool that edits unsupported claims in LLM-generated summaries, advocating for the strategic use of alternative tools to ensure quality and reliability.
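The sketch below illustrates two Replace-style behaviors under simple assumptions: a non-invasive ensemble over repeated tool predictions, and switching to a fallback tool (or a human/oracle) when the primary output is judged unreliable. All names here are illustrative.

from collections import Counter

def ensemble_predict(tool, i, n: int = 5):
    # Majority vote over repeated (possibly stochastic) tool calls.
    votes = Counter(tool(i) for _ in range(n))
    return votes.most_common(1)[0][0]

def predict_with_fallback(primary_tool, fallback_tool, i, is_reliable):
    o = primary_tool(i)
    # Delegate to an alternative tool when the primary output fails a reliability check.
    return o if is_reliable(o) else fallback_tool(i)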

5.3 LLMs as a Meta-Reasoner: ε_i, ε_c, ε_t ↑

For humans, the tools we employ are not perfect. Tools are allowed to err because humans can fix incorrect outputs – card numbers misrecognized by an OCR system are corrected ad hoc by the user. Similarly, imbuing LLMs with the ability to recognize and handle errors flexibly allows tools to make mistakes, effectively increasing the permissible error thresholds of the tool components ε_i, ε_c, ε_t in Eq. 2. An LLM's meta-cognitive ability to reason over uncertainty and realize the limits of its knowledge has received some attention (Kadavath et al., 2022; Kuhn et al., 2023). The next step is to jointly reason over its own uncertainty/knowledge and that of another tool or agent, a challenge that compounds in multi-tool or multi-LM settings. Existing recovery methods that presuppose the cause and tweak a single knob may not yield overall improvement unless the limitations of the right variables are resolved.

In summary, we identify three challenges:

  1. Failure Detection: Recognizing failures and assessing their severity – is d(o, o*) > ε?

  2. Fault Assignment: Identifying which tool caused the error (in multi-tool settings), along with the exact source – i, c, or t_θ?

  3. Recovery Planning: Selecting the most effective recovery strategy – Refine vs. Replace?

Explicit error signals (though rare) can obviate all three problems. Silent tool errors are the opposite case: even detection is not straightforward, although the problem is pervasive. In this work, we delve into "silent" tool errors, a relatively overlooked area in tool-error research, focusing on the foremost problem: error detection.

6 A broken calculator

Humans use tools with a rough expectation of what correct results should look like, allowing them to spot outputs that are obviously wrong. For example, when multiplying 120 by 131, we can expect a result on the order of 10,000 that ends in zero, even if we do not know the exact answer. If the tool makes arithmetic mistakes, can LLMs also detect faulty outputs?

6.1 Task setting

We devise a controlled setting where an LLM answers simple math problems with an external tool, a calculator. In this case, the calculator is broken and returns incorrect outputs.

First, we programmatically generate 300 equations that involve two random operators from {+, −, ×} and three random integers (e.g., 9 × (20 + 7)). The equations have three levels of difficulty, determined by the range that the integers are sampled from: easy [−20, 20], medium [−100, 100], and hard [−1000, 1000]. We give the incorrect tool output to the model, and test whether models are able to recognize the error. We compare five models: GPT-3.5, GPT-4, Command-R, Command-R+, and Gemini-1.5.
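A minimal sketch of this data generation, with the parenthesization and the 100-per-difficulty split as assumptions (the paper specifies only the operator set, three random integers, and the sampling ranges):

import random

RANGES = {"easy": 20, "medium": 100, "hard": 1000}

def make_equation(difficulty: str):
    lo, hi = -RANGES[difficulty], RANGES[difficulty]
    a, b, c = (random.randint(lo, hi) for _ in range(3))
    op1, op2 = random.choice("+-*"), random.choice("+-*")
    expr = f"({a} {op1} {b}) {op2} {c}"  # e.g., "(9 + 20) * 7"
    return expr, eval(expr)              # expression and its ground-truth answer

equations = [make_equation(d) for d in RANGES for _ in range(100)]  # 300 equations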

Figure 2: The tool-use prompt for the calculator setting. In the broken-tool setting, the correct result (25) is replaced with a perturbed value (e.g., 21, 205, or -25).

# Task
What is the answer to: (2 + 3) * 5?
Refer to the tool output below.

# Calculator API
result = (2 + 3) * 5
result
25 # broken tool setting -> 21 / 205 / -25

# Format
Return your answer in this format:
Thought: Your reasoning process
Answer: ...

# Answer

6.2 Preliminary experiments

We begin by estimating the models’ capabilities to solve math problems on their own, to better understand the downstream implication of having a credible/broken calculator in the loop. Specifically, we query the LLM with five different prompts – three non-tool and two tool-use prompts.

Non-tool setting

The non-tool settings serve as a proxy to gauge the model’s task capability, providing a basis to compare the effects of incorporating tools with varying levels of credibility. We ask the model to solve the math problems on its own, with three different prompting methods:

  1. Direct: Asking the equation directly (e.g., "What is the answer to (2+3)*5?")

  2. Chain-of-Thought (CoT): Asking the model to explain its reasoning step-by-step prior to answering.

  3. CoT Few-Shot: In addition to reasoning, the model is provided five in-context examples.

Tool-use setting

We assume two types of calculators – Correct and Broken. Fig. 2 shows the tool-use prompt, where the model is asked to answer the question referring to the tool output (bold). For the Correct tool, the ground-truth answer is provided as the tool result. For the Broken tool, we give a perturbed answer using one of the following three perturbations (a brief sketch follows the list):

  1. Digit replacement: One digit is replaced with a different number (e.g., 25 → 21)

  2. Magnitude shift: Digits are inserted or removed, resulting in magnitude shifts in the range 10⁻² to 10³ (e.g., 25 → 205)

  3. Sign inversion: The sign is flipped, changing positive numbers to negative and negative numbers to positive (e.g., 25 → −25)
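A minimal sketch of these perturbations; the exact digit position, inserted digits, and shift sampling are assumptions, since the paper specifies only the categories and example transformations:

import random

def digit_replacement(ans: int) -> int:
    s = str(abs(ans))
    pos = random.randrange(len(s))
    new = random.choice([d for d in "0123456789" if d != s[pos]])
    return int(s[:pos] + new + s[pos + 1:]) * (1 if ans >= 0 else -1)

def magnitude_shift(ans: int) -> int:
    # Approximates digit insertion/removal by shifting the magnitude by 10^-2 to 10^3.
    shift = random.choice([-2, -1, 1, 2, 3])
    return int(ans * 10 ** shift)

def sign_inversion(ans: int) -> int:
    return -ans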

Inspired by Wei et al. (2022) and Yao et al. (2023), we specify a "Thought" section to encourage the model to generate its reasoning prior to answering.

Results

We report the preliminary experiment results in App. B and Fig. 3. When the tool is broken, accuracy drops drastically for all perturbation categories, with the exception of Sign Inversion on GPT-4 and Gemini-1.5. With broken tools, performance drops far below the best no-tool setting, by up to 47 points. We find that models tend to overtrust tools, copying the incorrect output (with hallucinated justification) rather than ignoring the tool in favor of their own answer.


6.3 In-context intervention strategies

Humans leverage various contextual cues like prior tool failures to calibrate the level of trust associated with their tools. Further, AI chatbots include disclaimers like "The model can make mistakes" to ensure answers are scrutinized. Can LLMs also leverage such information effectively?

We test three types of contextual cues that can raise awareness of potential tool mistakes: a simple disclaimer, prediction confidence scores, and a checklist of criteria to look out for. For each method, we evaluate the prediction accuracy on both perturbed and non-perturbed tool outputs, in zero-shot (ZST), chain-of-thought (CoT), and few-shot (FST) settings. The four prompt variants are described below.

Oblivious (Obl.) does not include any indication that the tool can make errors (Fig. 2).

Disclaimer (Disc.) includes a simple disclaimer: “The tool can sometimes give incorrect answers. Please verify the correctness of the tool output.”

Confidence (Conf.) includes the confidence score of the tool's prediction, in addition to the disclaimer. Since the calculator is not a probabilistic model, we devise a score in [0, 1] based on the string edit distance between the ground truth and the perturbed output. For learned tools, model confidence can be used instead.
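A minimal sketch of such an edit-distance-based score; the normalization by the longer string is an assumption, as the paper only states that the score lies in [0, 1] and is derived from the string edit distance.

def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def tool_confidence(ground_truth: str, output: str) -> float:
    dist = levenshtein(ground_truth, output)
    return 1.0 - dist / max(len(ground_truth), len(output), 1)

print(tool_confidence("25", "205"))  # ~0.67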

Checklist (Check.) is motivated by the heuristics that humans use; it includes a list of criteria for checking the tool output, based on the perturbations. For the math task, the checklist consists of:

  1. Is the positive or negative sign correct?

  2. Is the magnitude of the number correct?

  3. Is the last digit correct?

  4. Are all the digits correct?

Table 1: Answer accuracy (%) under the broken (perturbed) calculator, for each prompt variant under ZST, CoT, and CoT+FST prompting.

                     ZST                          CoT                          CoT+FST
Model         Obl.  Disc.  Conf.  Check.   Obl.  Disc.  Conf.  Check.   Obl.  Disc.  Conf.  Check.
GPT-3.5        23    53     44     46       46    81     79     80       87    89     86     84
GPT-4          76    82     85     85       86    89     89     91       90    91     88     89
Command-R      16    14     16     14       29    42     44     47       11    23     53     46
Command-R+     57    76     79     81       60    84     82     76       71    82     86     78
Gemini-1.5     84    90     76     87       93    95     95     90       94    94     94     94
Results

Table 1 shows how effectively each method helps the LLM notice and correct mistakes. For most models, even a simple disclaimer prevents naively believing perturbed answers, boosting accuracy by up to 30 points. Like humans, LLMs can better detect mistakes when provided the context that tools can be wrong. Chain-of-thought prompting and in-context examples further help models recover performance, nearly to the best no-tool scores.

7 Detecting tool-based mistakes

The results in §6 suggest that it is challenging for LLMs to simultaneously detect and override faulty outputs, even for capabilities they perform decently without tools. Thus, we next narrow the LLM's responsibility to "detecting" mistakes: we reformulate the calculator setting into a binary Accept/Reject task (Fig. 6), and balance the 300 perturbed equations from §6.2 with 300 correct samples to account for false positives.
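A minimal sketch of this detection-task construction and of parsing the Accept/Reject responses; the output-parsing convention assumes the "Evaluation: Accept/Reject" format shown in Fig. 6.

import random

def build_detection_set(perturbed, correct):
    # perturbed/correct: lists of (equation, tool_output) pairs.
    data = [(eq, out, "Reject") for eq, out in perturbed] + \
           [(eq, out, "Accept") for eq, out in correct]
    random.shuffle(data)
    return data

def parse_evaluation(model_output: str) -> str:
    for line in model_output.splitlines():
        if line.lower().startswith("evaluation:"):
            return "Reject" if "reject" in line.lower() else "Accept"
    return "Accept"  # unparseable outputs default to trusting the tool (an assumption)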

Table 2: Accept/Reject detection accuracy (%) on the binary tool-evaluation task.

                     ZST                          CoT
Model         Obl.  Disc.  Conf.  Check.   Obl.  Disc.  Conf.  Check.
GPT-3.5        79    86     86     83       70    67     83     75
GPT-4          92    95     94     91       96    97     96     94
Command-R      62    64     67     60       59    68     80     71
Command-R+     83    89     87     77       73    78     81     77
Gemini-1.5     92    94     94     96       95    96     96     89

Results

The models are often able to identify the incorrect outputs (Table 2) despite not being able to produce the correct answer, even in conditions where they would have answered correctly had no tool been present. Smaller models (GPT-3.5, Command-R) are more sensitive to in-context information: in the Oblivious setting, most of their errors stem from overtrusting the tool, while with in-context interventions their predictions skew heavily towards rejecting outputs, leading to high false-positive rates. In contrast, both error types occur at similar rates for the larger models.


Surprisingly, CoT does not always improve performance over zero-shot. We find that the majority of CoT errors come from the model falsely rejecting correct outputs, caused by failures to faithfully copy the original equation's terms in its reasoning steps. Incorrect reasoning is observed more frequently in the CoT setting, contradicting Table 1 where CoT outperformed zero-shot. While more investigation is needed, we speculate that the effectiveness of CoT might depend on task complexity, because in the Detection+CoT setting the model is burdened to simultaneously 1. solve the equation and 2. spot mistakes. A two-step process where the LLM first generates its answer, then compares its own answer to the tool output in a second call, may alleviate this issue; we leave this to future work.

7.1 When are mistakes easier to detect?

For humans, whether a mistake is detected might depend on the type of mistake (blatant vs. subtle), the difficulty of the original question, or the answerer's task proficiency. Are some mistakes, past a certain level of deviation, just more obvious than others? Does the nature of the question matter? Or does it relate to the model's internal knowledge – do you need to "know" the answer to detect errors? In Fig. 4, we analyze the models' rejection rate on the perturbed outputs with respect to six features:

  • Numeric Difference: The absolute difference between the correct and perturbed answer.

  • Symbolic Difference: The string edit (Levenshtein) distance. Smaller symbolic deviations are expected to be less noticeable. Symbolic difference only loosely correlates with numeric difference (ρ = 0.49); compare, for example, 123 → −123 vs. 123 → 119.

  • Perturbation Type: Digit replacement, Magnitude shift, and Sign inversion from §6.2. We separate last-digit replacement, as it is easier for humans to detect than other digit positions by mental math.

  • Magnitude in Equation: Equations are binned into three difficulty levels (§6.1), based on the magnitude of the numbers involved in the equation. Relatedly, LLMs have been shown to find larger numbers harder to reason over (Nogueira et al., 2021; Lee et al., 2023; An et al., 2023; Duan and Shi, 2024).

  • Answer Magnitude: The magnitude of the correct answer, in log scale (log₁₀ |x|). Similar to the above, but provides more fine-grained measurements.

  • Perceived Difficulty: Inferred from the model's ability to answer the equation in §6.2. The categories are: the model (1) answered correctly with a "Direct" prompt, (2) required CoT or few-shot examples, or (3) gets the equation wrong even after applying these methods. The number of samples per bin varies depending on the model.

Numeric/Symbolic Difference and Perturbation Type attribute the rejection rate to the error's "wrongness." Magnitude is associated with the question itself, and Perceived Difficulty targets the model's internal knowledge.
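For concreteness, a sketch of how the per-sample features could be computed is shown below; the remaining two features (Magnitude in Equation, Perceived Difficulty) depend on the equation's operands and on the model's earlier answers, and reusing the levenshtein helper from §6.3 is an assumption.

import math

def analysis_features(correct: int, perturbed: int, perturbation_type: str, levenshtein):
    return {
        "numeric_difference": abs(correct - perturbed),
        "symbolic_difference": levenshtein(str(correct), str(perturbed)),
        "perturbation_type": perturbation_type,
        "answer_magnitude": math.log10(abs(correct)) if correct != 0 else 0.0,
    }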

7.2 Analysis

  • Numeric vs. Symbolic: Unlike numeric difference, symbolic deviations appear highly correlated with rejection rates. This aligns with literature arguing that LLMs are not performing arithmetic "reasoning," but memorizing strings (Chang and Bisk, 2024).

  • Perturbation Types: For humans, Sign Inversion and Last Digit are likely the easiest to spot. LLMs also find some perturbation types more obvious than others – Sign Inversion for GPT-4 and Gemini, Magnitude for Command-R and GPT-3.5, and Last Digit Replacement for Command-R+. Most models find last-digit replacements easier to spot than replacements of other digits. Sensitivity is likely attributable to differing number representations/tokenization (Nogueira et al., 2021; Liu and Low, 2023).

  • Large Numbers: Models struggle with large values in both Numbers in Equation and Answer Magnitude. Equations with large numbers can still be easy depending on the operations involved; for instance, (1000 − 998) × 2 = 4 is easier than 10 × 11 × 12 = 1320. Notably, the rejection rate for answers larger than 10⁶ drops sharply for all models.

  • Perceived Difficulty: Problems that are more easily answered by the model are also more easily detected when exposed to errors. While this might raise a question about the utility of imperfect tools, we find that the larger models (GPT-4, Gemini-1.5-Pro, Command-R+) can "detect" the mistake for the majority of questions, even ones they were not able to answer correctly. This sheds light on the feasibility of using LLMs as tool planners that evaluate the credibility of tools and reroute functions accordingly to alternative tools. Smaller models, however, overtrust the tool and allow errors to pass.

8 Natural tool errors: ALFRED


We now consider a setting where tool-based errors occur more naturally via ALFRED (Shridhar et al., 2020), an embodied instruction-following benchmark. Because the task involves language understanding, perception, spatial reasoning, and action planning, a common approach is to incorporate multiple specialized modules (Blukis et al., 2022; Min et al., 2022), as opposed to end-to-end training.

The multiple modules, or tools, collaborating with each other in ALFRED offer a unique opportunity to study the robustness of LLMs to various tool errors. As in Fig. 1d, the object detector's mistakes are silently passed on to subsequent tools, leading to error cascades in the action planner. In such scenarios, LLMs that can detect tool errors help improve the system's robustness, by correcting some obvious semantic anomalies (Elhafsi et al., 2023) or delegating operations to other tools or humans.

In this section, we investigate whether LLMs can detect these realistic, multimodal tool errors arising from individual modules used in the FILM architecture (Min et al., 2022). Specifically, we test the LLM's fault detection capability on two distinct tools – the object detector and the action planner. (Object detection uses a finetuned MaskRCNN model; action planning is done by the Fast Marching Method (Sethian, 1996), a heuristic-based algorithm.)

8.1 Multimodal tool-error detection dataset

We create a classification task where the model Accepts/Rejects the tool output based on the current context. For the action planner, the model has to assess the feasibility of the predicted action, and reject actions that are about to fail (e.g., MoveAhead while facing an obstacle, Fig. 5). For the object detector, the LLM evaluates the correctness of the detection results with respect to the image, and rejects outputs that misidentify important task objects. We note that imperfect outputs can still be labeled as "Accept" if they contain only task-irrelevant errors.

We collect agent trajectories from the ALFRED validation set, with actions and API responses indicating whether each action succeeded. For the object detector, we gather RGB images with detection results and ground-truth semantic information. We provide detailed statistics of each dataset in App. C.1.
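As a rough sketch of how such labels could be derived (the field and set names here are assumptions, not FILM's actual data format): action-planner samples take their label from the recorded API success response, and object-detector samples are rejected only when a task-relevant object is missing from, or mislabeled in, the detections.

def label_action_sample(step: dict) -> str:
    # Accept the planner's predicted action if the simulator/API reported success.
    return "Accept" if step["api_success"] else "Reject"

def label_detection_sample(detected_labels: set, ground_truth_labels: set, task_objects: set) -> str:
    # Reject when a task-relevant object present in the ground truth is not among
    # the detected labels (a mislabeled object also fails this check);
    # task-irrelevant detection errors are ignored.
    missed = (task_objects & ground_truth_labels) - detected_labels
    return "Reject" if missed else "Accept"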

Table 3: Tool-error detection F1 scores on ALFRED for the Action Planner and Object Detector.

                           ZST                          CoT
Tool             VLM       Obl.  Disc.  Conf.  Check.   Obl.  Disc.  Conf.  Check.
Action Planner   GPT-4o     43    42     40     44       57    55     52     60
                 Gemini     49    55     50     63       64    64     62     65
Object Detector  GPT-4o     68    68     66     67       68    69     66     68
                 Gemini     60    60     56     62       67    66     65     66

8.2 Experimental setting

Models

We test tool-evaluation accuracy with the two strongest closed-source Vision-Language Models: GPT-4o and Gemini-1.5-Pro-latest. As in the calculator setting, we evaluate models in Zero-Shot (ZST) and Chain-of-Thought (CoT) settings. The prompt includes the task state (e.g., current subgoal, steps taken), tool docstrings (e.g., possible actions, object categories), and the current tool output. We provide example prompts in the Appendix: Action Planner (C.2), Object Detector (C.3).

8.3 Results

Models are able to reach 60-70 F1 with raised awareness through ICL and CoT prompting (Tab. 3). In particular, specifying the potential failure modes in the Checklist prompt is effective for evaluating the action planner, where the error modes are more diverse than for the object detector. In contrast, giving raw confidence scores is not as helpful, as it demands additional interpretation. As these results are all zero-shot evaluations, we expect further improvements in few-shot or finetuning scenarios. Details of the Action Planner and Object Detector, along with analysis, are presented in Appendix C.

9 Conclusion

We characterize the trust dynamics of modern LLMs with respect to tool usage. By establishing an extensive taxonomy of tool-related errors and recovery strategies, we identify fundamental challenges associated with integrating learned tools. Our experiments span both synthetic and natural tool failures, and affirm current LLMs' ability to identify silent tool failures. This work paves the way for future research on harnessing LLMs as sophisticated tool-reasoners.

10 Limitations

This study, while comprehensive in its scope, has certain limitations regarding the diversity and breadth of the models and datasets used. First, for the calculator experiments, we employ five LLMs, mostly closed-source. Including smaller, open-source models, and models specifically fine-tuned for tool-use, would have offered more insight into the models' tool-trusting behavior. In the experiments involving embodied agents, we limited our focus to two API-based Vision-Language Models (VLMs). Incorporating smaller, open-source VLMs would have offered opportunities to explore the models' internal workings, revealing additional nuances in how models handle unreliable tools.

Second, the action planner and object detector datasets we constructed from ALFRED trajectories are fairly small – Action Planner (490) and Object Detector (214) samples. In terms of diversity, running multiple models/agents in addition to FILM would have enabled collecting a wider array of failure modes. Moreover, an action's success or failure is highly dependent on the affordances provided by the AI2-THOR framework, which may not accurately reflect real-world scenarios. For example, a 'Put' action might fail due to the system perceiving a surface as cluttered, even when there is visibly sufficient space available. A dataset encompassing a wider variety of scenarios and higher diversity would potentially provide deeper insights into the practical applications and limitations of current AI systems in navigating real-world environments.

References

  • Ahn etal. (2022)Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, RosarioJauregui Ruano, Kyle Jeffrey, Sally Jesmonth, NikhilJayant Joshi, RyanC. Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, DiegoM Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, F.Xia, Ted Xiao, Peng Xu, Sichun Xu, and Mengyuan Yan. 2022.Do as i can, not as i say: Grounding language in robotic affordances.In Conference on Robot Learning.
  • An etal. (2023)Jisu An, Junseok Lee, and Gahgene Gweon. 2023.Does chatgpt comprehend the place value in numbers when solving math word problems.In Proceedings of the Workshop” Towards the Future of AI-augmented Human Tutoring in Math Learning” co-located with The 24th International Conference on Artificial Intelligence in Education (AIED 2023), Tokyo, Japan, volume 3491, pages 49–58.
  • Anil etal. (2023)Gemini Team GoogleRohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM. Dai, Anja Hauth, etal. 2023.Gemini: A family of highly capable multimodal models.ArXiv, abs/2312.11805.
  • Blukis etal. (2022)Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, and Yoav Artzi. 2022.A persistent spatial semantic representation for high-level natural language instruction execution.In Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 706–717. PMLR.
  • Chang and Bisk (2024)Yingshan Chang and Yonatan Bisk. 2024.Language models need inductive biases to count inductively.arXiv preprint arXiv:2405.20131.
  • Chaplot etal. (2020)DevendraSingh Chaplot, Helen Jiang, Saurabh Gupta, and Abhinav Gupta. 2020.Semantic curiosity for active visual learning.In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part VI, volume 12351 of Lecture Notes in Computer Science, pages 309–326. Springer.
  • Chen etal. (2024)Lingjiao Chen, JaredQuincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. 2024.Are more llm calls all you need? towards scaling laws of compound inference systems.Preprint, arXiv:2403.02419.
  • Chen etal. (2023a)Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, etal. 2023a.Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents.arXiv preprint arXiv:2308.10848.
  • Chen etal. (2023b)Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023b.Teaching large language models to self-debug.arXiv preprint arXiv:2304.05128.
  • Ding etal. (2023)Yan Ding, Xiaohan Zhang, Chris Paxton, and Shiqi Zhang. 2023.Task and motion planning with large language models for object rearrangement.In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2086–2092. IEEE.
  • Duan and Shi (2024)Shaoxiong Duan and Yining Shi. 2024.From interpolation to extrapolation: Complete length generalization for arithmetic transformers.
  • Elhafsi etal. (2023)Amine Elhafsi, Rohan Sinha, Christopher Agia, Edward Schmerling, Issa AD Nesnas, and Marco Pavone. 2023.Semantic anomaly detection with large language models.Auton. Robots, 47(8):1035–1055.
  • Gao etal. (2023)Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023.Pal: program-aided language models.In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
  • Garrett etal. (2021)CaelanReed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, LesliePack Kaelbling, and Tomás Lozano-Pérez. 2021.Integrated task and motion planning.Annual review of control, robotics, and autonomous systems, 4:265–293.
  • Gupta and Kembhavi (2023)Tanmay Gupta and Aniruddha Kembhavi. 2023.Visual programming: Compositional visual reasoning without training.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14953–14962.
  • Huang etal. (2022a)Wenlong Huang, P.Abbeel, Deepak Pathak, and Igor Mordatch. 2022a.Language models as zero-shot planners: Extracting actionable knowledge for embodied agents.ArXiv, abs/2201.07207.
  • Huang etal. (2022b)Wenlong Huang, F.Xia, Ted Xiao, Harris Chan, Jacky Liang, PeterR. Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 2022b.Inner monologue: Embodied reasoning through planning with language models.In Conference on Robot Learning.
  • Kadavath etal. (2022)Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zachary Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, John Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, TomB. Brown, Jack Clark, Nicholas Joseph, Benjamin Mann, Sam McCandlish, Christopher Olah, and Jared Kaplan. 2022.Language models (mostly) know what they know.ArXiv, abs/2207.05221.
  • Krishna etal. (2024)Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, ByronC Wallace, ZacharyC Lipton, and JeffreyP Bigham. 2024.Genaudit: Fixing factual errors in language model outputs with evidence.arXiv preprint arXiv:2402.12566.
  • Kuhn etal. (2023)Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.In The Eleventh International Conference on Learning Representations.
  • Lee etal. (2021)MichelleA. Lee, Matthew Tan, Yuke Zhu, and Jeannette Bohg. 2021.Detect, reject, correct: Crossmodal compensation of corrupted sensors.In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 909–916.
  • Lee etal. (2023)Nayoung Lee, Kartik Sreenivasan, JasonD Lee, Kangwook Lee, and Dimitris Papailiopoulos. 2023.Teaching arithmetic to small transformers.arXiv preprint arXiv:2307.03381.
  • Lewis etal. (2020)Patrick S.H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020.Retrieval-augmented generation for knowledge-intensive NLP tasks.In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Li etal. (2023)Boyi Li, Philipp Wu, Pieter Abbeel, and Jitendra Malik. 2023.Interactive task planning with language models.ArXiv, abs/2310.10645.
  • Liang etal. (2022)Jacky Liang, Wenlong Huang, F.Xia, Peng Xu, Karol Hausman, Brian Ichter, PeterR. Florence, and Andy Zeng. 2022.Code as policies: Language model programs for embodied control.2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500.
  • Liu and Low (2023)Tiedong Liu and Bryan KianHsiang Low. 2023.Goat: Fine-tuned llama outperforms gpt-4 on arithmetic tasks.arXiv preprint arXiv:2305.14201.
  • Lu etal. (2023)Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, YingNian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023.Chameleon: Plug-and-play compositional reasoning with large language models.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Madaan etal. (2023)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, BodhisattwaPrasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023.Self-refine: Iterative refinement with self-feedback.In Advances in Neural Information Processing Systems, volume36, pages 46534–46594. Curran Associates, Inc.
  • Madaan etal. (2024)Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, etal. 2024.Self-refine: Iterative refinement with self-feedback.Advances in Neural Information Processing Systems, 36.
  • Min etal. (2022)SoYeon Min, DevendraSingh Chaplot, PradeepKumar Ravikumar, Yonatan Bisk, and Ruslan Salakhutdinov. 2022.FILM: following instructions in language with modular methods.In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
  • Nogueira etal. (2021)Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2021.Investigating the limitations of transformers with simple arithmetic tasks.arXiv preprint arXiv:2102.13019.
  • Pan etal. (2023a)Liangming Pan, Alon Albalak, Xinyi Wang, and WilliamYang Wang. 2023a.Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning.arXiv preprint arXiv:2305.12295.
  • Pan etal. (2023b)Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and WilliamYang Wang. 2023b.Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies.arXiv preprint arXiv:2308.03188.
  • Parisi etal. (2022)Aaron Parisi, Yao Zhao, and Noah Fiedel. 2022.Talm: Tool augmented language models.arXiv preprint arXiv:2205.12255.
  • Qin etal. (2023)Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, etal. 2023.Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021.Learning transferable visual models from natural language supervision.CoRR, abs/2103.00020.
  • Rubin etal. (2022)Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022.Learning to retrieve prompts for in-context learning.In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.
  • Schick etal. (2023)Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023.Toolformer: Language models can teach themselves to use tools.In Thirty-seventh Conference on Neural Information Processing Systems.
  • Sethian (1996)JA Sethian. 1996.A fast marching level set method for monotonically advancing fronts.Proceedings of the National Academy of Sciences, 93(4):1591–1595.
  • Shen etal. (2024)Yongliang Shen, Kaitao Song, XuTan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2024.Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36.
  • Shinn etal. (2023)Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023.Reflexion: language agents with verbal reinforcement learning.In Neural Information Processing Systems.
  • Shridhar etal. (2020)Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. 2020.ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks.In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Singh etal. (2022a)Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. 2022a.Progprompt: Generating situated robot task plans using large language models.2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530.
  • Singh etal. (2022b)KunalPratap Singh, Luca Weihs, Alvaro Herrasti, Jonghyun Choi, Aniruddha Kembhavi, and Roozbeh Mottaghi. 2022b.Ask4help: Learning to leverage an expert for embodied tasks.Advances in Neural Information Processing Systems, 35:16221–16232.
  • Song etal. (2022)ChanHee Song, Jiaman Wu, Clay Washington, BrianM. Sadler, Wei-Lun Chao, and YuSu. 2022.Llm-planner: Few-shot grounded planning for embodied agents with large language models.2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2986–2997.
  • Wang etal. (2024)Boshi Wang, Hao Fang, Jason Eisner, Benjamin VanDurme, and YuSu. 2024.Llms in the imaginarium: tool learning through simulated trial and error.arXiv preprint arXiv:2403.04746.
  • Wang etal. (2021)Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. 2021.Tent: Fully test-time adaptation by entropy minimization.In International Conference on Learning Representations.
  • Wang etal. (2023a)Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023a.Voyager: An open-ended embodied agent with large language models.arXiv preprint arXiv:2305.16291.
  • Wang etal. (2023b)Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. 2023b.Mint: Evaluating llms in multi-turn interaction with tools and language feedback.Preprint, arXiv:2309.10691.
  • Wang etal. (2023c)Xuezhi Wang, Jason Wei, Dale Schuurmans, QuocV. Le, EdH. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023c.Self-consistency improves chain of thought reasoning in language models.In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
  • Wei etal. (2022)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, EdH. Chi, QuocV Le, and Denny Zhou. 2022.Chain of thought prompting elicits reasoning in large language models.In Advances in Neural Information Processing Systems.
  • Wu etal. (2023)Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023.Visual chatgpt: Talking, drawing and editing with visual foundation models.Preprint, arXiv:2303.04671.
  • Wu etal. (2024)Yue Wu, SoYeon Min, Shrimai Prabhumoye, Yonatan Bisk, RussR Salakhutdinov, Amos Azaria, TomM Mitchell, and Yuanzhi Li. 2024.Spring: Studying papers and reasoning to play games.Advances in Neural Information Processing Systems, 36.
  • Xie etal. (2022)Annie Xie, Fahim Tajwar, Archit Sharma, and Chelsea Finn. 2022.When to ask for help: Proactive interventions in autonomous reinforcement learning.Advances in Neural Information Processing Systems, 35:16918–16930.
  • Xu etal. (2023)Mengdi Xu, Peide Huang, Wenhao Yu, Shiqi Liu, Xilun Zhang, Yaru Niu, Tingnan Zhang, Fei Xia, Jie Tan, and Ding Zhao. 2023.Creative robot tool use with large language models.Preprint, arXiv:2310.13065.
  • Yang etal. (2023)Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, CeLiu, Michael Zeng, and Lijuan Wang. 2023.Mm-react: Prompting chatgpt for multimodal reasoning and action.Preprint, arXiv:2303.11381.
  • Yao etal. (2023)Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, KarthikR Narasimhan, and Yuan Cao. 2023.React: Synergizing reasoning and acting in language models.In The Eleventh International Conference on Learning Representations.
  • Zeng etal. (2023)Andy Zeng, Maria Attarian, brian ichter, KrzysztofMarcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, MichaelS Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. 2023.Socratic models: Composing zero-shot multimodal reasoning with language.In The Eleventh International Conference on Learning Representations.
  • Zeng etal. (2022)Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, etal. 2022.Socratic models: Composing zero-shot multimodal reasoning with language.arXiv preprint arXiv:2204.00598.
  • Zhang etal. (2023)Kechi Zhang, Zhuo Li, Jia Li, GeLi, and Zhi Jin. 2023.Self-edit: Fault-aware code editor for code generation.In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada. Association for Computational Linguistics.

Appendix


Appendix A Overview of Tool Errors

In Table 4, we compile a list of tools that support various modalities, with corresponding real-world tool-error scenarios. We categorize specific error scenarios by their source of failure – the tool input, the tool itself, or context information.

Table 4. Each entry lists Modality | Capability | Tool, followed by example failures grouped by source of failure: tool input, tool itself, and context.

Text | Mathematical computation | Calculator, Code interpreter
  Tool input: API syntax error; incorrect content. Tool itself: NA. Context: NA.

Text | Code validation | Code interpreter
  Tool input: code syntax error; version updates (e.g., deprecated functions); incorrect content. Tool itself: NA. Context: NA.

Text | World knowledge | Search API
  Tool input: ambiguous query. Tool itself: incomplete DB; irrelevant results (e.g., different word sense).

Text | Task planning | LLM/VLM
  Tool input: prompt includes non-existent objects due to previous perception errors. Tool itself: API call failure; plan includes unsupported actions/objects; incorrect steps. Context: invalid plan due to partial observability (e.g., closed receptacles).

Image | Text recognition | OCR model
  Tool input: blurry/noisy image. Tool itself: parsing mistakes.

Image | Visual perception | Vision-language models (CLIP); semantic segmentation (Fast-RCNN); object detectors (M-DETR); depth estimators
  Tool input: camera noise; poor lighting; unknown object. Tool itself: detection failure; hallucination; wrong categories; bad segmentation mask; depth estimation errors.

Sensory | Pose estimation, map building | SLAM
  Tool input: sensor drift. Tool itself: algorithmic errors. Context: environmental interference (e.g., moving humans, key object change).

Audio | Auditory perception | Speech-to-text API (Socratic Models)
  Tool input: audio noise. Tool itself: recognition errors.

Action | Navigation | Path-planning algorithms (A*, Fast Marching Method)
  Tool itself: collision; circling with no progress. Context: change in obstacle locations.

Action | Manipulation | Skills
  Tool itself: grip failure.
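
To make the categories in Table 4 concrete, the snippet below sketches one way an agent could tag observed failures by source; the FailureSource enum and ToolError record are illustrative names of our own, not part of any released code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureSource(Enum):
    """Where a tool error originates, following Table 4."""
    TOOL_INPUT = auto()   # e.g., malformed API call, ambiguous query
    TOOL_ITSELF = auto()  # e.g., detection failure, incorrect plan step
    CONTEXT = auto()      # e.g., partial observability, moved obstacles


@dataclass
class ToolError:
    tool: str             # e.g., "object detector", "planner"
    source: FailureSource
    description: str


# Example: a detector hallucination is a failure of the tool itself.
err = ToolError("object detector", FailureSource.TOOL_ITSELF,
                "hallucinated 'Apple' not present in the scene")
```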

Appendix B Math problems

Table 5 reports the accuracy (%) of models on "answering" math equations, plotted in Figure 3. The numbers in parentheses indicate the relative gain/loss compared to each model's best no-tool setting. In short, Chain-of-Thought prompting improves arithmetic performance, and few-shot in-context examples enhance it further. Correct tool use yields the strongest results, supporting existing literature that employs reliable tools.

We share an example prompt for the Accept/Reject task in the calculator setting in Figure 6. It is comparable to Figure 2: the task inputs are identical, but the primary task in Figure 2 is to "answer" the equation rather than to "evaluate" the tool output.

Model | Direct | CoT | CoT-FS | Correct tool | Broken tool
GPT-3.5 | 61.0 | 79.7 | 85.3 | 98.7 (+13.4) | 22.7 (-62.6)
GPT-4 | 64.0 | 89.0 | 89.7 | 97.7 (+8.0) | 76.0 (-13.7)
Command-R | 34.3 | 52.3 | 63.3 | 86.3 (+23.0) | 16.0 (-47.3)
Command-R+ | 62.0 | 75.7 | 77.3 | 93.7 (+16.4) | 56.7 (-20.6)
Gemini-1.5 | 86.7 | 90.3 | 88.7 | 98.3 (+8.0) | 83.7 (-6.6)
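
The parenthesized deltas in Table 5 are each tool setting's accuracy minus the model's best no-tool accuracy. A minimal sketch of that bookkeeping, using the GPT-3.5 row as an example, is shown below (variable names are ours).

```python
# Accuracy per prompting setting for one model (values from Table 5, GPT-3.5 row).
no_tool = {"Direct": 61.0, "CoT": 79.7, "CoT-FS": 85.3}
tool = {"Correct tool": 98.7, "Broken tool": 22.7}

best_no_tool = max(no_tool.values())  # 85.3 for GPT-3.5

for setting, acc in tool.items():
    delta = acc - best_no_tool
    print(f"{setting}: {acc:.1f} ({delta:+.1f})")
# Correct tool: 98.7 (+13.4)
# Broken tool: 22.7 (-62.6)
```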

# Task
You are given the equation: (2 + 3) * 5. The task is to evaluate the result of the equation provided by the tool. Refer to the tool output below.

# Calculator API
result = (2 + 3) * 5
result
-25  # broken tool setting -> 21 / 205 / -25

# Format
Return your answer in this format:
Thought: Your reasoning process
Evaluation: Accept/Reject
...

# Answer
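
Because the prompt fixes the "Thought / Evaluation" response format, decisions can be scored with a small parser. The sketch below is one plausible way to extract the final Accept/Reject label; it is our illustration, not the paper's released evaluation code.

```python
import re


def parse_evaluation(response: str) -> str | None:
    """Extract the Accept/Reject decision from a model response."""
    match = re.search(r"Evaluation:\s*(Accept|Reject)", response, re.IGNORECASE)
    return match.group(1).capitalize() if match else None


response = (
    "Thought: (2 + 3) * 5 should be 25, but the tool returned -25.\n"
    "Evaluation: Reject"
)
assert parse_evaluation(response) == "Reject"
```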

Appendix C ALFRED

C.1 Dataset

[Figure 7: Histogram of action types (left) and task types (right) in the action planner evaluation dataset.]
[Figure 8: Object frequencies in the object detector evaluation dataset.]

For the dataset used for action planner evaluation, we plot the histogram of actions and task types in Figure 7. For Action Type (left), Pickup and Put are the most frequent actions, as most task types necessitate these actions for object interaction. Toggle and OpenClose are merged from the canonical actions ToggleOn+ToggleOff and OpenObject+CloseObject, respectively. We note that ToggleOff and CloseObject were always successful for the FILM agent, as these actions are attempted at the same location where the preceding precondition action (ToggleOn, OpenObject) succeeded. Merging the related actions helps balance the Accept/Reject label distribution per action category.
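
As a concrete illustration of this merging, the snippet below collapses the canonical ToggleOn/ToggleOff and OpenObject/CloseObject actions into the merged Toggle and OpenClose categories; the mapping is a sketch of the bookkeeping rather than the exact dataset code.

```python
# Map canonical ALFRED actions to merged categories for label balancing.
MERGE = {
    "ToggleOn": "Toggle", "ToggleOff": "Toggle",
    "OpenObject": "OpenClose", "CloseObject": "OpenClose",
}


def action_category(action: str) -> str:
    return MERGE.get(action, action)  # Pickup, Put, etc. stay unchanged


assert action_category("ToggleOff") == "Toggle"
assert action_category("Pickup") == "Pickup"
```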

Similarly, Figure 8 shows the object frequencies in the object detector evaluation dataset. Large receptacle objects like CounterTop and Cabinet are observed most frequently.

C.2 Action Planner Evaluation

Figure 9 shows an example prompt used for action planner evaluation. The prompt consists of general task instructions, a docstring explaining how the Planner API works, and the agent's task progress. In the Disclaimer setting, the prompt states that the planner can make mistakes. In the Confidence setting, a confidence score is provided alongside the predicted action, computed as the success rate of the past five actions. We additionally note that this confidence score may not always align well with tool success rates in this setting, which might be one reason why the Confidence prompt underperforms the Oblivious prompt in Table 3. The Checklist setting lists common failure modes of the planner-suggested action. The previous four actions and their success/failure are also presented. Our analysis of the LLM's reasoning steps shows that models are capable of inferring the robot's state from this information (e.g., [(Open, Fail), (MoveAhead, Success), (Open, Fail), (MoveAhead, Success)] -> Reasoning: ... the previous attempts suggest that the robot might have been trying to open the microwave from too far away).
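
Since the confidence score is defined as the success rate over the previous five action attempts, it can be computed in a few lines. The sketch below assumes the history is a list of (action, succeeded) pairs as shown in the prompt; the default value for an empty history is our own assumption.

```python
def planner_confidence(history, window=5):
    """Success rate of the last `window` action attempts, between 0 and 1."""
    recent = history[-window:]
    if not recent:
        return 1.0  # assumption: with no history yet, trust the planner
    return sum(ok for _, ok in recent) / len(recent)


history = [("Open", False), ("MoveAhead", True), ("Open", False),
           ("MoveAhead", True), ("MoveAhead", True)]
print(planner_confidence(history))  # 0.6
```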

In Figure 11, we further analyze the tool evaluation accuracy per action type. Actions require different preconditions to succeed. For instance, a successful Pickup demands that the target object is in the agent's view and within reachable distance, and that the agent's hand is empty. Thus, different actions require varying levels of spatial reasoning, object/scene detection, and task understanding to assess feasibility. Compared to interaction actions, which may require all of the aforementioned capabilities, navigation actions like MoveAhead might be expected to be the easiest to assess, as their feasibility mostly relies on spatial reasoning about obstacles. Surprisingly, we find that this is not the case: because evaluating MoveAhead depends solely on spatial information, it is in fact harder to evaluate than the interaction actions, since the model has fewer cues to draw on. For interaction actions, models were able to predict tool success based on the objects involved, which compensates for their limited spatial reasoning capability.
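
To make the precondition argument concrete, here is an illustrative sketch (not taken from the paper) of the kinds of checks an oracle would run for Pickup versus MoveAhead; the state fields and the reach threshold are hypothetical.

```python
def pickup_feasible(state) -> bool:
    # Needs object-level cues: visibility, reachability, and an empty hand.
    return (state["target_visible"]
            and state["target_distance"] <= 1.5   # hypothetical reach limit
            and state["held_object"] is None)


def move_ahead_feasible(state) -> bool:
    # Only a spatial cue: is the path ahead free of obstacles?
    return state["clear_ahead"]


state = {"target_visible": True, "target_distance": 0.8,
         "held_object": None, "clear_ahead": False}
print(pickup_feasible(state), move_ahead_feasible(state))  # True False
```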

A robot is working on household tasks in a simulator environment. The robot follows a series of low-level actions to accomplish the task. The robot uses an external tool, a low-level action planner, which predicts the next action to follow. The provided image is a first-person view from the robot's perspective. Refer to the tool-suggested action below and decide whether to accept or reject the tool output, based on your judgement of whether the action would succeed/fail.

The tool can sometimes give incorrect answers. Please cross-check the output based on the image and robot state, to verify the correctness and feasibility of the planner's output.

The tool's prediction confidence (between 0 and 1) is also provided, which may hint the correctness of the output. Confidence is based on previous action attempts and success/failure.

The following are some scenarios where the Planner action might fail.
1. Interaction actions might fail if the object is too far from you. In this case, you need to approach closer to the object.
2. Interaction actions might fail when you do not have a good view of the object.
3. If another object is in your path, MoveAhead will fail due to collision. In this case, you need to walk around the obstacle.
4. If a receptacle is occupied with another object, Put will fail.

# Tool: Planner API
The Planner API provides a function that takes the task_state, observed_state as input and returns the next suggested action. The action is computed based on the agent and target objects' locations, using the robot's internal spatial map.

## Task
possible_actions = ['MoveAhead', 'Open(Receptacle)', 'Close(Receptacle)', 'Pickup(Object)', 'Put(Object, Receptacle)', 'ToggleOn(Object)', 'ToggleOff(Object)', 'Slice(Object)']

## Robot state
task_state = {
    'task_description': "Pick up a pillow and turn a lamp on.",
    'completed_subgoals': [],
    'current_subgoal': "Pickup Pillow",
    'num_steps_taken': 56
}
print(observed_state)
Current room has: Bed, Pillow on a Bed, Cabinet, Drawer, Dresser, GarbageCan, Shelf, SideTable, Sofa, Pillow on a Sofa.
Previous action attempts: [(MoveAhead, Success), (MoveAhead, Success), (MoveAhead, Success), (MoveAhead, Success)]

## Planner output at current step
output = Planner(task_state, observed_state)
print(output)
Pickup(Pillow), 0.8

# Format
Return your answer in this format:
Tool output: [ACTION]
Thought: Your reasoning process
Evaluation: Accept/Reject
The evaluation is a single word indicating whether you accept or reject the tool output. Do not provide any reasoning in the evaluation. Provide your reasoning in the thought section.

# Answer

C.3 Object Detector Evaluation

Figure 10 shows an example prompt used for object detector evaluation. Similar to the action planner prompt in Figure 9, general instructions, the tool docstring, and the robot state are given. The robot state here additionally includes the remaining subgoals, as this helps determine which objects are task-relevant. For instance, while the current subgoal is ('Pickup', 'Apple'), correcting detection mistakes for 'Microwave' would be beneficial, as it is needed in future subgoals. For Oblivious, Disclaimer, and Checklist, the tool output is given in a nested dictionary format, where objects are binned into 'detected' and 'filtered' based on the detector threshold. For the Confidence setting, the detection results are provided in a single dictionary with objects and their respective raw confidence scores. The instruction mentions that objects with a score below 60 will be filtered out; based on the raw scores, the LLM has to infer whether specific objects will be kept or discarded.
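
The detected/filtered split is a simple threshold on the detector's raw scores; the sketch below reproduces that binning for the example scores in Figure 10, with the threshold of 60 taken from the prompt and everything else illustrative.

```python
THRESHOLD = 60  # objects scoring below this are ignored, per the prompt


def bin_detections(scores: dict[str, float]) -> dict[str, set[str]]:
    return {
        "detected": {obj for obj, s in scores.items() if s >= THRESHOLD},
        "filtered": {obj for obj, s in scores.items() if s < THRESHOLD},
    }


scores = {"Apple": 3.09, "Knife": 0.55, "CounterTop": 63.31, "DiningTable": 47.09}
print(bin_detections(scores))
# detected: {'CounterTop'}, filtered: {'Apple', 'Knife', 'DiningTable'} (set order may vary)
```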

In Figure 12, we plot the LLM's evaluation accuracy with respect to the number of mistakes made by the detector, which is one indication of the deviation $d(o, o^{*})$. As the number of detection mistakes increases, it indeed becomes easier for models to evaluate tool correctness. However, we find that models tend to reject even acceptable tool outputs whose mistakes are not crucial, with accuracy being extremely low when the number of mistakes is zero in both plots. The models seem to understand when the tool is wrong, but struggle to tell apart task-critical versus tolerable tool mistakes.
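
One concrete proxy for the deviation $d(o, o^{*})$ in this setting is to count detector mistakes as missed plus hallucinated objects relative to the ground-truth scene; the sketch below is our own illustration of that counting.

```python
def num_detection_mistakes(predicted: set[str], ground_truth: set[str]) -> int:
    """Missed objects plus hallucinated objects, a proxy for d(o, o*)."""
    missed = ground_truth - predicted        # objects the detector failed to report
    hallucinated = predicted - ground_truth  # objects reported but not in the scene
    return len(missed) + len(hallucinated)


predicted = {"CounterTop"}
ground_truth = {"CounterTop", "Apple", "Knife"}
print(num_detection_mistakes(predicted, ground_truth))  # 2
```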

A robot is working on household tasks in a simulator environment. The provided image is a first-person view from the robot's perspective. The robot uses an external tool, an object detector, to identify which objects are in the current scene. Refer to the tool output below and evaluate the correctness of the detector with respect to the provided image, and decide whether to accept or reject the tool output. If objects important to the task are ignored by the detector, the tool output should be rejected. Mistakes with regard to task-irrelevant objects are acceptable.

The tool can sometimes give incorrect answers. Please cross-check the output based on the image and robot state, to verify the correctness of the detector's output.

The tool's prediction confidence (between 0 and 100) is also provided, which may hint the correctness of the output. Keep in mind that objects with confidence scores below 60 will be ignored.

The following are common examples where the detector mistakes may hinder the robot's ability to accomplish the task. Consider these cases in your reasoning steps.
1. Missing task-relevant objects in the scene. In particular, small objects (e.g., keys, credit card) are prone to be missed.
2. Hallucinating task-relevant objects that are not in the scene. For example, objects that are similar in shape or color (e.g., apple vs tomato) may be mistaken.

# Tool: Object Detector API
The Detector API provides a function that takes the current_image as input and returns the list of objects detected in the image. The obj_categories and receptacles are predefined as below. The prediction consists of two parts: the predicted objects and the filtered objects. The filtered objects are object detections ignored as the detection confidence was lower than the threshold. Only the detected objects will be passed on.

Detector.obj_categories = ['AlarmClock', 'Apple', 'AppleSliced', 'BaseballBat', 'BasketBall', 'Book', 'Bowl', 'Box', 'Bread', 'BreadSliced', 'ButterKnife', 'CD', 'Candle', 'CellPhone', ... ]
Detector.receptacles = ['ArmChair', 'BathtubBasin', 'Bed', 'Cabinet', 'Cart', 'CoffeeMachine', 'CoffeeTable', 'CounterTop', 'Desk', 'DiningTable', 'Drawer', 'Dresser', 'Fridge', ... ]

## Robot state
task_state = {
    'task_description': "Place a cooked apple into the sink.",
    'completed_subgoals': [('Pickup', 'Apple')],
    'remaining_subgoals': [('Open', 'Microwave'), ('Put', 'Microwave'), ('Close', 'Microwave'), ('ToggleOn', 'Microwave'), ('ToggleOff', 'Microwave'), ('Open', 'Microwave'), ('Pickup', 'Apple'), ('Close', 'Microwave'), ('Put', 'SinkBasin')],
    'num_steps_taken': 235
}

## Detector output on current image
Detector(current_image)
# {'Apple': 3.09, 'Knife': 0.55, 'CounterTop': 63.31, 'DiningTable': 47.09} for Confidence
# other prompting methods:
{
    'detected': {'CounterTop'},
    'filtered': {'DiningTable', 'Apple', 'Knife'}
}

# Format
Return your answer in this format:
Thought: Your reasoning process on the provided information (image, task_state and tool_output)
Evaluation: Accept/Reject
The evaluation is a single word indicating whether you accept or reject the tool output. Do not provide any reasoning in the evaluation. Provide your reasoning in the thought section.

# Answer

[Figure 11: Tool evaluation accuracy per action type.]
[Figure 12: LLM evaluation accuracy versus the number of detector mistakes.]

