Generative Language Model

Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Existing works use decoding-free candidate selection methods to obtain candidate probabilities from the initial output logits over the vocabulary. Though these estimation methods are widely used, they have not been systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches.
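
A minimal sketch of one such estimator, assuming a standard Hugging Face causal LM (the model choice and mean-pooling aggregation are illustrative; the paper compares several scoring variants):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_candidates(prompt: str, candidates: list[str]) -> list[float]:
    """Score each candidate from a single forward pass, with no decoding."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        # Logits over the vocabulary at the next-token position only.
        logits = model(**inputs).logits[0, -1]
    log_probs = torch.log_softmax(logits, dim=-1)
    scores = []
    for cand in candidates:
        ids = tok(cand, add_special_tokens=False)["input_ids"]
        # One simple estimator: average the first-step log-probs of the
        # candidate's tokens (other aggregations are possible).
        scores.append(log_probs[ids].mean().item())
    return scores

print(score_candidates("The capital of France is", [" Paris", " London"]))
```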

Oct 3, 2024

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

We introduce CliBench, a novel benchmark offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. This benchmark not only covers diagnosis from a diverse range of medical cases across various specialties but also incorporates tasks of clinical significance: treatment procedure identification, lab test ordering, and medication prescription.

Oct 1, 2024

Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction

We introduce MERA, a clinical diagnosis prediction model that bridges pretrained natural language knowledge with medical practice. We apply hierarchical contrastive learning on a ranked list of disease candidates to alleviate the large decision space issue. With concept memorization through fine-tuning, we bridge natural language clinical knowledge with medical codes.
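
A simplified sketch of contrastive learning over a disease candidate ranking list (hypothetical; MERA's hierarchical scheme contrasts candidates at multiple levels of the medical code ontology, which is omitted here):

```python
import torch
import torch.nn.functional as F

def contrastive_ranking_loss(scores: torch.Tensor, gold_idx: int) -> torch.Tensor:
    """InfoNCE-style loss: push the gold disease above the other candidates.

    scores: (num_candidates,) model scores for each disease candidate.
    """
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_idx]))

# Toy usage: five candidate diseases, gold candidate at index 2.
scores = torch.randn(5, requires_grad=True)
loss = contrastive_ranking_loss(scores, gold_idx=2)
loss.backward()
```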

Sep 30, 2024

GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

GIVE is a novel reasoning framework that integrates parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning over very sparse knowledge graphs. By leveraging external structured knowledge to inspire the LLM to model the interconnections among relevant concepts, our method facilitates a more logical, step-wise reasoning approach akin to human problem-solving, rather than direct retrieval of gold answers.

Sep 29, 2024

GraphVis: Boosting LLMs with Visual Knowledge Graph Integration

GraphVis preserves intricate graph structure through the visual modality to enhance the comprehension of KGs with the aid of Large Vision Language Models (LVLMs). Our approach uses a curriculum fine-tuning scheme that first instructs LVLMs to recognize basic graphical features in the images, and subsequently trains them to reason over QA tasks with the visual graphs.

Sep 20, 2024

Are Large Language Models Graph Algorithmic Reasoners?

We introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities.

Aug 25, 2024

CLIMB: A Benchmark of Clinical Bias in Large Language Models

We introduce a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, reveal that both intrinsic and extrinsic bias are prevalent. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

Jul 7, 2024

MIRAI: Evaluating LLM Agents for Event Forecasting

We introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles.

Jul 1, 2024

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

We introduce MuirBench, a comprehensive benchmark focused on the robust multi-image understanding capabilities of multimodal LLMs, comprising 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).

Jun 13, 2024

Improving Event Definition Following For Zero-Shot Event Detection

We aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions is the key for models to learn to follow event definitions, whereas existing event extraction datasets focus on annotating many high-quality examples for a few event types. Our experiments verify this hypothesis.

May 10, 2024

Mitigating Bias for Question Answering Models by Tracking Bias Influence

We propose BMBI, an approach to mitigating the bias of multiple-choice QA models. Based on the intuition that a model tends to become more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. We then use the detected bias level as an optimization objective, forming a multi-task learning setting alongside the original QA task.
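
A hedged sketch of this multi-task setup (the influence measure and weighting below are assumptions, not the paper's exact formulation): the bias level of a query instance is estimated from how much it shifts the model's answer distribution on a probe instance, and that estimate joins the QA loss as an auxiliary objective.

```python
import torch
import torch.nn.functional as F

def bias_influence(probe_logits_before: torch.Tensor,
                   probe_logits_after: torch.Tensor) -> torch.Tensor:
    """Bias level of a query instance: how much learning from it shifts the
    model's answer distribution on a separate probe instance (KL divergence)."""
    p = F.log_softmax(probe_logits_after, dim=-1)
    q = F.softmax(probe_logits_before, dim=-1)
    return F.kl_div(p, q, reduction="sum")

def total_loss(qa_loss: torch.Tensor,
               bias_level: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    # Multi-task objective: original QA loss plus the bias-mitigation term.
    return qa_loss + lam * bias_level
```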

Mar 12, 2024

STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

We propose STAR, a structure-to-text data generation method for complicated structure prediction tasks: it first generates complicated event structures (Y) and then generates input passages (X), all with Large Language Models. We further reduce errors and improve data quality through self-reflective error identification and self-refinement with iterative revision. We show that the data generated by STAR significantly improves performance on low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data.
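
A rough sketch of this structure-first generation loop (the `llm` helper and the prompt strings are hypothetical placeholders, not the paper's prompts):

```python
def generate_example(event_type: str, llm, n_revisions: int = 2):
    """Generate one (passage X, structure Y) training pair with an LLM."""
    # Step 1: generate a complicated event structure Y.
    structure = llm(f"Generate an event structure for type: {event_type}")
    # Step 2: generate an input passage X conditioned on the structure.
    passage = llm(f"Write a passage expressing this structure: {structure}")
    # Step 3: self-reflective error identification + iterative revision.
    for _ in range(n_revisions):
        errors = llm(f"List mismatches between passage and structure:\n"
                     f"{passage}\n{structure}")
        if "none" in errors.lower():
            break
        passage = llm(f"Revise the passage to fix these errors: {errors}\n"
                      f"Passage: {passage}")
    return passage, structure
```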

Feb 22, 2024

DICE: Data-Efficient Clinical Event Extraction with Generative Models

We introduce DICE, a robust and data-efficient generative model for clinical event extraction, which specializes in clinical mention identification, and MACCROBAT-EE, the first clinical event extraction dataset with event argument annotation.

May 2, 2023

Parameter-Efficient Low-Resource Dialogue State Tracking by Prompt Tuning

We use soft prompt tokens to learn task properties, incorporate segment information, and reiterate the task before predicting values. Our method drastically reduces the number of parameters needed, to less than 0.5% of that of prior works, while achieving better low-resource dialogue state tracking performance.
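
A minimal soft-prompt sketch, assuming a frozen Hugging Face causal LM (the model choice and prompt length are illustrative; the paper's segment-information and task-reiteration details are omitted):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():        # freeze the LM; train only the prompt
    p.requires_grad = False

n_prompt, dim = 20, model.config.n_embd
soft_prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

def forward_with_prompt(input_ids: torch.Tensor):
    """Prepend the learnable soft prompt to the token embeddings."""
    tok_emb = model.get_input_embeddings()(input_ids)            # (B, T, D)
    prompt = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)          # (B, n+T, D)
    return model(inputs_embeds=inputs_embeds)

# Only the soft prompt's n_prompt * dim parameters are optimized.
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```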

Jan 20, 2023