
Decoding AI Agents: Unveiling LLM Proficiency Through AgentBench Evaluation

In a publication dated August 7, 2023, on arXiv, Xiao Liu and team from institutions including Tsinghua University, The Ohio State University, and UC Berkeley present their findings on evaluating large language models (LLMs) using the newly introduced benchmark, AgentBench.


In the fast-growing world of AI, evaluating large language models (LLMs) as agents has become an important research topic. This recent paper dives into the subject, introducing a comprehensive assessment tool named AgentBench. This write-up gives a broad overview of the paper's findings and discussion.


The paper closely examined 25 LLMs using AgentBench, including commercial API-based models such as gpt-4 and gpt-3.5-turbo as well as freely accessible alternatives. The outcomes highlighted noteworthy disparities: while top-tier models like gpt-4 exhibited commendable performance as agents, substantial gaps separated them from their freely available counterparts.

LLMs often face these problems when acting as agents:


  1. Action Validity: Ensuring actions are correct and make sense in context.
  2. Action Diversity: Generating a varied range of actions rather than repeating the same one.
  3. Action Efficiency: Reaching the best result with as few actions as possible.
  4. Action Consistency: Taking similar actions in similar situations.
  5. Action Interpretability: Keeping actions easy for humans to understand.
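To make the first two criteria concrete, here is a minimal sketch of how one might score an agent's action trace for validity and diversity. The function name, the trace, and the thresholds are invented for illustration and are not taken from the paper.

```python
from collections import Counter

def action_trace_stats(actions, valid_actions):
    """Return simple validity and diversity measures for a list of agent actions."""
    total = len(actions)
    valid = sum(1 for a in actions if a in valid_actions)  # action validity
    distinct = len(set(actions))                            # action diversity
    counts = Counter(actions)
    return {
        "validity_rate": valid / total if total else 0.0,
        "diversity_rate": distinct / total if total else 0.0,
        "most_common": counts.most_common(1)[0][0] if actions else None,
    }

# A toy trace: three sensible shell actions and one invalid/destructive one.
stats = action_trace_stats(
    ["ls", "cat log.txt", "ls", "rm -rf /"],
    valid_actions={"ls", "cat log.txt"},
)
print(stats["validity_rate"])   # 0.75
print(stats["diversity_rate"])  # 0.75
```

An agent that repeats one valid action forever would score high on validity but low on diversity, which is exactly why the benchmark tracks both.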


AgentBench spans eight test environments:


  1. Operating System (OS)
  2. Database (DB)
  3. Knowledge Graph (KG)
  4. Digital Card Game (DCG)
  5. Lateral Thinking Puzzles (LTP)
  6. House-Holding (HH)
  7. Web Shopping (WS)
  8. Web Browsing (WB)
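All eight environments share the same basic shape: the agent reads a text observation, emits a text action, and receives the next observation plus a reward and a done flag. The sketch below illustrates that loop with an invented toy environment; it is not the actual AgentBench API, and every class and method name here is hypothetical.

```python
class TextEnv:
    """Toy text environment: the agent must output a stored secret word."""

    def __init__(self, answer, max_turns=5):
        self.answer = answer
        self.max_turns = max_turns
        self.turns = 0

    def reset(self):
        self.turns = 0
        return "Task: output the secret word."

    def step(self, action):
        self.turns += 1
        correct = action.strip() == self.answer
        done = correct or self.turns >= self.max_turns
        reward = 1.0 if correct else 0.0
        return f"Turn {self.turns}: got '{action}'", reward, done

def run_episode(env, agent):
    """Drive one observation/action loop until the environment signals done."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(agent(obs))
        total += reward
    return total

# A trivial "agent" that always answers correctly:
print(run_episode(TextEnv("open-sesame"), lambda obs: "open-sesame"))  # 1.0
```

In the real benchmark the agent would be an LLM call rather than a lambda, and the environment would be an OS shell, a database, a web page, and so on.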


From the tests, they found:


Top Performers: gpt-4 did the best, earning the top scores in six areas. gpt-3.5-turbo was close behind, outperforming gpt-4 in the KG and WB areas. claude-instant came third, doing well in the OS, DB, DCG, and HH areas.


Best Free Models: Openchat-8192-13b emerged as the top-performing free model, claiming the fourth overall position. It excelled in the LTP test and also in the OS and HH tests. Wizardlm-30b, another freely available model, secured the fifth position, demonstrating competence in the OS, DB, DCG, and WS tests.


Areas for improvement: Baichuan-13b-chat faced challenges, exhibiting weakness in six areas and displaying limited performance in the WS and LTP tests.


Performance Gap: A notable performance gap separated corporate-developed models from freely available ones. In the OS test, for example, corporate models succeeded 65% of the time, while free models succeeded only 9% of the time. In the KG test, corporate models scored 0.64 versus just 0.07 for free models.


Challenges: LLMs encountered different difficulties in different tests. The OS test, for example, posed difficulty in producing correct shell commands, while the DB test presented challenges in formulating complex database queries. Each test had its own set of problems for the LLMs.
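To give a feel for what the DB environment demands, here is an invented example of the kind of query an agent must produce. The schema, data, and question are illustrative only and are not taken from AgentBench.

```python
import sqlite3

# A toy table the environment might expose to the agent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "EU"), (2, 25.0, "US"), (3, 5.0, "EU")],
)

# Question the agent would see: "What is the total order amount per region?"
# A correct agent must emit a query such as:
agent_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
print(conn.execute(agent_sql).fetchall())  # [('EU', 15.0), ('US', 25.0)]
```

The hard part for the model is not running the query but generating valid, correctly structured SQL from a natural-language question, which is exactly where the paper reports weaker models falling down.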


Furthermore, the paper shared a toolkit based on an “API & Docker” design to help researchers evaluate their own LLMs as agents.
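The “API & Docker” design implies that each environment runs in a container and is driven over an API. The sketch below shows what one step of such a client could look like; the endpoint path, port, and payload schema are invented for illustration and are not the toolkit's actual interface.

```python
import json
import urllib.request

def build_step_request(base_url, session_id, action):
    """Build an HTTP request posting one agent action to a
    hypothetical containerized environment (invented schema)."""
    payload = json.dumps({"session": session_id, "action": action}).encode()
    return urllib.request.Request(
        f"{base_url}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_step_request("http://localhost:8000", "sess-1", "ls /tmp")
print(req.full_url)  # http://localhost:8000/step
# To run against a live container: urllib.request.urlopen(req)
```

Decoupling the environment (in a container) from the agent (behind an API) keeps evaluation reproducible and makes it safe to let a model execute shell commands or database queries.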


To sum up, the study shed light on the capabilities and limitations of LLMs across multiple domains. While gpt-4 exhibited remarkable overall performance, gpt-3.5-turbo trailed closely behind; Baichuan-13b-chat, on the other hand, encountered notable struggles. These results can help guide further research on making LLMs act as agents.

Reference: Xiao Liu et al., “AgentBench: Evaluating LLMs as Agents,” arXiv:2308.03688v1, 2023.
