Large Language Model

Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Existing works have been using decoding-free candidate selection methods to obtain candidate probability from initial output logits over vocabulary. Though these estimation methods are widely used, they are not systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches.

Jul 28, 2025

GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

GIVE is a novel reasoning framework that integrates the parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning processes on very sparse knowledge graphs. By leveraging the external structured knowledge to inspire LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to human problem-solving, rather than gold answer retrieval.

Jul 13, 2025

Orchestrating Tool Ecosystem of Drug Discovery with Intention-Aware LLM Agents

We introduce GenieAgent, a drug discovery agent that integrates a wide range of molecule design models and bridges the user intentions to concrete actions by navigating the large skill ecosystem. We also propose an evaluation framework simulating drug discovery conversations, based on real-world experiments. A large-scale assessment, validated by expert annotations, demonstrates that GenieAgent reliably meets the majority of molecular engineers' needs with high scientific accuracy and robustness.

Apr 15, 2025

SpatialAgent: An Autonomous AI Agent for Spatial Biology

SpatialAgent integrates large language models with dynamic tool execution and adaptive reasoning. SpatialAgent spans the entire research pipeline, from experimental design to multimodal data analysis and hypothesis generation.

Apr 3, 2025

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

We introduce the first end-to-end, multi-agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements it with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations.

Apr 1, 2025

Entropy-Based Adaptive Weighting for Self-Training

We propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty.

Feb 25, 2025

Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction

We introduce MERA, a clinical diagnosis prediction model that bridges pertaining natural language knowledge with medical practice. We apply hierarchical contrastive learning on a disease candidate ranking list to alleviate the large decision space issue. With concept memorization through fine-tuning, we bridge the natural language clinical knowledge with medical codes.

Jan 10, 2025

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs consisting 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).

Jan 1, 2025

MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design

We present a human-in-the-loop system that integrates advanced AI capabilities with expert oversight to accelerate the design of novel mechanical metamaterials. The system generate novel and scientifically sound hypotheses and synthesize 3D structures with high-quality.

Dec 20, 2024

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

We introduce CliBench, a novel benchmark offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnosis from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering and medication prescriptions.

Oct 1, 2024

GraphVis: Boosting LLMs with Visual Knowledge Graph Integration

GraphVis conserves the intricate graph structure through the visual modality to enhance the comprehension of KGs with the aid of Large Vision Language Models (LVLMs). Our approach incorporates a unique curriculum fine-tuning scheme which first instructs LVLMs to recognize basic graphical features from the images, and subsequently incorporates reasoning on QA tasks with the visual graphs.

Sep 20, 2024

Are Large-Language Models Graph Algorithmic Reasoners?

We introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities.

Aug 25, 2024

CLIMB: A Benchmark of Clinical Bias in Large Language Models

We introduce a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, unveil prevalent behaviors with both intrinsic and extrinsic bias. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

Jul 7, 2024

MIRAI: Evaluating LLM Agents for Event Forecasting

We introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles.

Jul 1, 2024

Improving Event Definition Following For Zero-Shot Event Detection

We aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions are the key for models to learn to follow event definitions while existing event extraction datasets focus on annotating many high-quality examples for a few event types. Our experiments verify our hypothesis.

May 10, 2024

STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

We propose STAR, a structure-to-text data generation method for complicated structure prediction tasks that first generates complicated event structures (Y) and then generates input passages (X), all with Large Language Models. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. We show that the data generated by STAR significantly improves the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data.

Feb 22, 2024