LLM Evaluation

Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Existing works have been using decoding-free candidate selection methods to obtain candidate probability from initial output logits over vocabulary. Though these estimation methods are widely used, they are not systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches.

Jul 28, 2025

Orchestrating Tool Ecosystem of Drug Discovery with Intention-Aware LLM Agents

We introduce GenieAgent, a drug discovery agent that integrates a wide range of molecule design models and bridges the user intentions to concrete actions by navigating the large skill ecosystem. We also propose an evaluation framework simulating drug discovery conversations, based on real-world experiments. A large-scale assessment, validated by expert annotations, demonstrates that GenieAgent reliably meets the majority of molecular engineers' needs with high scientific accuracy and robustness.

Apr 15, 2025

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs consisting 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).

Jan 1, 2025

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

We introduce CliBench, a novel benchmark offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnosis from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions.

Oct 1, 2024

CLIMB: A Benchmark of Clinical Bias in Large Language Models

We introduce a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, unveil prevalent behaviors with both intrinsic and extrinsic bias. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

Jul 7, 2024

MIRAI: Evaluating LLM Agents for Event Forecasting

We introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles.

Jul 1, 2024