Mingyu Derek Ma

Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

Mingyu Derek Ma^*, Yanna Ding^*, Zijie Huang, Jianxi Gao, Yizhou Sun, Wei Wang

ACL, 2025

Existing work uses decoding-free candidate selection methods to estimate candidate probabilities from the initial output logits over the vocabulary. Although these estimation methods are widely used, they have not been systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches.

PDF Code

GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

Jiashu He, Mingyu Derek Ma, Jinxuan Fan, Dan Roth, Wei Wang, Alejandro Ribeiro

ICML, 2025

GIVE is a novel reasoning framework that integrates parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning on very sparse knowledge graphs. By leveraging external structured knowledge to inspire the LLM to model the interconnections among relevant concepts, our method facilitates a more logical, step-wise reasoning approach akin to human problem-solving, rather than gold answer retrieval.

PDF Code

Orchestrating Tool Ecosystem of Drug Discovery with Intention-Aware LLM Agents

Mingyu Derek Ma, Karina Zadorozhny, Jesse Swanson, Nathan C. Frey, Keunwoo Choi, Maksim Eremeev, Sabrina J Mielke, Wenmo Sun, Melody Liu, Jonathan Wickes, Vladimir Gligorijevic, Richard Bonneau, Henri Dwyer, Kyunghyun Cho, Stephen Ra

ICLR Workshop on Towards Agentic AI for Science, 2025

We introduce GenieAgent, a drug discovery agent that integrates a wide range of molecule design models and maps user intentions to concrete actions by navigating a large skill ecosystem. We also propose an evaluation framework that simulates drug discovery conversations based on real-world experiments. A large-scale assessment, validated by expert annotations, demonstrates that GenieAgent reliably meets the majority of molecular engineers' needs with high scientific accuracy and robustness.

PDF

SpatialAgent: An Autonomous AI Agent for Spatial Biology

Hanchen Wang^*, Yichun He^*, Paula P. Coelho^*, Matthew Bucci^*, Abbas Nazir, Bob Chen, Linh Trinh, Serena Zhang, Kexin Huang, Vineethkrishna Chandrasekar, Douglas C. Chung, Minsheng Hao, Ana Carolina Leote, Yongju Lee, Bo Li, Tianyu Liu, Jin Liu, Romain Lopez, Tawaun Lucas, Mingyu Derek Ma, Nikita Makarov, Lisa McGinnis, Linna Peng, Stephen Ra, Gabriele Scalia, Avtar Singh, Liming Tao, Masatoshi Uehara, Chenyu Wang, Runmin Wei, Ryan Copping, Orit Rozenblatt-Rosen, Jure Leskovec, Aviv Regev

bioRxiv, 2025

SpatialAgent integrates large language models with dynamic tool execution and adaptive reasoning. SpatialAgent spans the entire research pipeline, from experimental design to multimodal data analysis and hypothesis generation.

PDF Code

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

Haoxuan Li, Mingyu Derek Ma, Jen-Tse Huang, Zhaotian Weng, Wei Wang, Jieyu Zhao

arXiv, 2025

We introduce the first end-to-end, multi-agent synergy framework, BIASINSPECTOR, designed for automatic bias detection in structured data based on specific user requirements. It first develops a multi-stage plan to analyze user-specified bias detection tasks and then implements it with a diverse and well-suited set of tools. It delivers detailed results that include explanations and visualizations.

PDF

Entropy-Based Adaptive Weighting for Self-Training

Xiaoxuan Wang, Yihe Deng, Mingyu Derek Ma, Wei Wang

arXiv, 2025

We propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. Specifically, EAST employs a mapping function with a tunable parameter that controls the sharpness of the weighting, assigning higher weights to data where the model exhibits greater uncertainty.

PDF

Memorize and Rank: Elevating Large Language Models for Clinical Diagnosis Prediction

Mingyu Derek Ma, Xiaoxuan Wang, Yijia Xiao, Anthony Cuturrufo, Vijay S Nori, Eran Halperin, Wei Wang

AAAI, 2025

We introduce MERA, a clinical diagnosis prediction model that bridges pretraining natural language knowledge with medical practice. We apply hierarchical contrastive learning on a disease candidate ranking list to alleviate the large decision space issue. With concept memorization through fine-tuning, we bridge natural language clinical knowledge with medical codes.

PDF

How Californians Tweet about Extreme Heat Events on Social Media: A Health Equity Perspective

Gomathi B. Sriperumbudur, Yihang Fan, Xiaozhen Liu, Mingyu Derek Ma, Wei Wang, Chen Li, Suellen Hopfer

Weather, Climate, and Society, 2025

We examined Twitter heat discourse between 2016 and 2022 across California and by HPI. From keyword-filtered tweets, we inductively identified eight heat discourse categories, listed in order of prevalence: perceived heat risk, life impact, coping, venting, heat warnings, community action, expressing relief, and climate change concern.

PDF DOI

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang^*, Xingyu Fu^*, James Y. Huang^†, Zekun Li^†, Qin Liu^†, Xiaogeng Liu^†, Mingyu Derek Ma^†, Nan Xu^†, Wenxuan Zhou^†, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zheng, Hoifung Poon, Muhao Chen

ICLR, 2025

We introduce MuirBench, a comprehensive benchmark that focuses on the robust multi-image understanding capabilities of multimodal LLMs. It consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).

PDF Dataset

MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design

Jingyuan Qi, Zian Jia, Minqian Liu, Wangzhi Zhan, Junkai Zhang, Xiaofei Wen, Jingru Gan, Jianpeng Chen, Qin Liu, Mingyu Derek Ma, Bangzheng Li, Haohui Wang, Adithya Kulkarni, Muhao Chen, Dawei Zhou, Ling Li, Wei Wang, Lifu Huang

NAACL Demonstrations, 2024

We present a human-in-the-loop system that integrates advanced AI capabilities with expert oversight to accelerate the design of novel mechanical metamaterials. The system generates novel and scientifically sound hypotheses and synthesizes high-quality 3D structures.

PDF Video

CliBench: A Multifaceted and Multigranular Evaluation of Large Language Models for Clinical Decision Making

Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, Wei Wang

arXiv, 2024

We introduce CliBench, a novel benchmark offering a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis. The benchmark not only covers diagnosis across a diverse range of medical cases and specialties but also incorporates clinically significant tasks: treatment procedure identification, lab test ordering, and medication prescription.

Project PDF Code Dataset Leaderboard

GraphVis: Boosting LLMs with Visual Knowledge Graph Integration

Yihe Deng, Chenchen Ye, Zijie Huang, Mingyu Derek Ma, Yiwen Kou, Wei Wang

NeurIPS, 2024

GraphVis conserves the intricate graph structure through the visual modality to enhance the comprehension of KGs with the aid of Large Vision Language Models (LVLMs). Our approach incorporates a unique curriculum fine-tuning scheme which first instructs LVLMs to recognize basic graphical features from the images, and subsequently incorporates reasoning on QA tasks with the visual graphs.

PDF

Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach

Yanchen Lin, Mingyu Derek Ma, Wenna Qin, Azure Zhou, Jiaao Chen, Weiyan Shi, Wei Wang, Diyi Yang

EMNLP, 2024

We propose a computational model to infer users' susceptibility levels given their activities. Since a user's susceptibility is a key indicator of their reposting behavior, we use supervision from observable sharing behavior to infer the underlying susceptibility tendency. Building upon this large-scale susceptibility labeling, we further conduct a comprehensive analysis of how different social factors relate to susceptibility.

PDF Cite DOI

Are Large-Language Models Graph Algorithmic Reasoners?

Alexander K. Taylor, Anthony Cuturrufo, Vishal Yathish, Mingyu Derek Ma, Wei Wang

arXiv, 2024

We introduce a novel benchmark designed to evaluate LLM performance on classical algorithmic reasoning tasks on explicit graphs. Our findings highlight the persistent challenges LLMs face in this domain and underscore the necessity for advanced prompting techniques and algorithmic instruction to enhance their graph reasoning abilities.

PDF

CLIMB: A Benchmark of Clinical Bias in Large Language Models

Yubo Zhang^*, Shudi Hou^*, Mingyu Derek Ma, Wei Wang, Muhao Chen, Jieyu Zhao

EMNLP Workshop on NLP for Positive Impact, 2024

We introduce a pioneering comprehensive benchmark to evaluate both intrinsic (within LLMs) and extrinsic (on downstream tasks) bias in LLMs for clinical decision tasks. Our experiments across popular and medically adapted LLMs, particularly from the Mistral and LLaMA families, unveil prevalent behaviors with both intrinsic and extrinsic bias. This work underscores the critical need to mitigate clinical bias and sets a new standard for future evaluations of LLMs' clinical bias.

PDF Code

MIRAI: Evaluating LLM Agents for Event Forecasting

Chenchen Ye^*, Ziniu Hu^*, Yihe Deng^*, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang

arXiv, 2024

We introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles.

Project PDF Code Dataset Demo Video Demo Notebook

Improving Event Definition Following For Zero-Shot Event Detection

Zefan Cai^*, Po-Nien Kung^*, Ashima Suvarna, Mingyu Derek Ma, Hritik Bansal, Baobao Chang, P. Jeffrey Brantingham, Wei Wang, Nanyun Peng

ACL, 2024

We aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions is the key for models to learn to follow event definitions, whereas existing event extraction datasets focus on annotating many high-quality examples for a few event types. Our experiments verify this hypothesis.

PDF Cite DOI

Mitigating Bias for Question Answering Models by Tracking Bias Influence

Mingyu Derek Ma, Jiun-Yu Kao, Arpit Gupta, Yu-Hsiang Lin, Wenbo Zhao, Tagyoung Chung, Wei Wang, Kai-Wei Chang, Nanyun Peng

NAACL, 2024

We propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model tends to become more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. We then use the detected bias level as an optimization objective, forming a multi-task learning setting alongside the original QA task.

PDF Cite Poster DOI

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen

NAACL, 2024

Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions among thousands of collected data instances and control model behavior through data poisoning. Through such instruction attacks, the attacker can achieve an attack success rate of over 90% across four commonly used NLP datasets and cause persistent backdoors that transfer easily to 15 diverse datasets in a zero-shot manner.

PDF Cite DOI

Instructional Fingerprinting of Large Language Models

Jiashu Xu, Fei Wang^*, Mingyu Derek Ma^*, Pang Wei Koh, Chaowei Xiao, Muhao Chen

NAACL, 2024

We present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. The model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 widely used LLMs show that this approach prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to the MIT License.

Project PDF Cite Code DOI

STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models

Mingyu Derek Ma, Xiaoxuan Wang, Po-Nien Kung, P. Jeffrey Brantingham, Nanyun Peng, Wei Wang

AAAI, 2024

We propose STAR, a structure-to-text data generation method for complicated structure prediction tasks that first generates complicated event structures (Y) and then generates input passages (X), all with Large Language Models. We further reduce errors and improve data quality through self-reflection error identification and self-refinement with iterative revision. We show that the data generated by STAR significantly improves the performance of low-resource event extraction and relation extraction tasks, even surpassing the effectiveness of human-curated data.

PDF Cite Code Poster DOI

MIDDAG: Where Does Our News Go? Investigating Information Diffusion via Community-Level Information Pathways

Mingyu Derek Ma, Alexander K. Taylor, Nuan Wen, Yanchen Lin, Po-Nien Kung, Wenna Qin, Shicheng Wen, Azure Zhou, Diyi Yang, Xuezhe Ma, Nanyun Peng, Wei Wang

AAAI Demonstrations, 2024

We present MIDDAG, an intuitive, interactive system that visualizes the information propagation paths on social media triggered by COVID-19-related news articles. The visualization is accompanied by comprehensive insights, including user/community susceptibility levels as well as the events and popular opinions raised by the crowd while propagating the information.

Project PDF Cite Code Poster Video DOI