Preprint

Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Existing works use decoding-free candidate selection methods to obtain candidate probabilities from the initial output logits over the vocabulary. Although these estimation methods are widely used, they have not been systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches.
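One common estimation heuristic of this kind scores each candidate by aggregating the log-probabilities of its tokens under a single forward pass's output logits, skipping autoregressive decoding entirely. A minimal sketch, where the vocabulary mapping and the mean-log-probability aggregation are illustrative assumptions rather than the paper's specific method:

```python
import math

def candidate_scores(logits, vocab, candidates):
    """Score candidates from one set of output logits, without decoding.

    logits: raw logits over the vocabulary from a single forward pass.
    vocab: token -> vocabulary index mapping.
    candidates: each candidate is a list of its tokens.
    """
    # Convert logits to log-probabilities with a stable log-softmax.
    m = max(logits)
    z = sum(math.exp(l - m) for l in logits)
    logprobs = [l - m - math.log(z) for l in logits]
    # Aggregate per candidate: mean token log-probability (one of
    # several possible aggregations; first-token-only is another).
    scores = []
    for cand in candidates:
        idxs = [vocab[tok] for tok in cand]
        scores.append(sum(logprobs[i] for i in idxs) / len(idxs))
    return scores
```

The choice of aggregation (mean, sum, or first token only) is exactly the kind of design decision such an evaluation would compare across end tasks.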

Jul 28, 2025

GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

GIVE is a novel reasoning framework that integrates parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning on very sparse knowledge graphs. By leveraging external structured knowledge to inspire the LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to human problem-solving, rather than gold answer retrieval.

Jul 13, 2025

Orchestrating Tool Ecosystem of Drug Discovery with Intention-Aware LLM Agents

We introduce GenieAgent, a drug discovery agent that integrates a wide range of molecule design models and bridges user intentions to concrete actions by navigating a large skill ecosystem. We also propose an evaluation framework that simulates drug discovery conversations based on real-world experiments. A large-scale assessment, validated by expert annotations, demonstrates that GenieAgent reliably meets the majority of molecular engineers' needs with high scientific accuracy and robustness.

Apr 15, 2025

SpatialAgent: An Autonomous AI Agent for Spatial Biology

SpatialAgent integrates large language models with dynamic tool execution and adaptive reasoning. SpatialAgent spans the entire research pipeline, from experimental design to multimodal data analysis and hypothesis generation.

Apr 3, 2025

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

We introduce the first end-to-end, multi-agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements it with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations.

Apr 1, 2025

Entropy-Based Adaptive Weighting for Self-Training

We propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty.
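The weighting step can be sketched as follows: compute the entropy of the model's answer distribution for each example, normalize it, and pass it through a mapping whose tunable parameter controls how sharply weight concentrates on uncertain data. The power-law mapping below is an illustrative assumption, not necessarily EAST's exact function:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def east_weight(probs, t=2.0):
    """Map a model's answer distribution to a self-training weight.

    Higher entropy (more uncertainty) yields a higher weight; the
    tunable parameter t controls the sharpness of the mapping
    (illustrative form, assumed for this sketch).
    """
    h_max = math.log(len(probs))          # entropy of the uniform case
    h_norm = entropy(probs) / h_max if h_max > 0 else 0.0
    return h_norm ** (1.0 / t)            # in [0, 1]
```

With this mapping, an example where the model is fully confident receives weight 0, while an example where its answers are uniformly spread receives weight 1, so gradient updates concentrate on the uncertain data.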

Feb 25, 2025

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

We introduce MuirBench, a comprehensive benchmark that focuses on the robust multi-image understanding capabilities of multimodal LLMs, consisting of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).

Jan 1, 2025

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

We introduce CliBench, a novel benchmark offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnosis from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription.

Oct 1, 2024

Are Large Language Models Graph Algorithmic Reasoners?

We introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities.

Aug 25, 2024

CLIMB: A Benchmark of Clinical Bias in Large Language Models

We introduce a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, reveal prevalent intrinsic and extrinsic bias. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

Jul 7, 2024

MIRAI: Evaluating LLM Agents for Event Forecasting

We introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles.

Jul 1, 2024