Focus on what matters: Enhancing medical vision-language models with automatic attention alignment tuning

Aofei Chang (1); Le Huang (2); Alex James Boyd (2); Parminder Bhatia (2); Taha Kass-Hout (2); Cao Xiao (2*); Fenglong Ma (1*) | 1 – Pennsylvania State University, USA; 2 – GE HealthCare, USA.

May 1, 2025

Accepted at ACL 2025

Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A3Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A3Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A3MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A3Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
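To make the idea of aligning attention with a weak label concrete, here is a minimal sketch in plain Python. It is not the A3Tune objective or its head-selection rule, which this page does not specify. It assumes that each attention head yields a list of non-negative weights over image patches and that the weak label is a binary mask over the same patches. The function names `mass_in_region`, `alignment_loss`, and `select_heads`, along with the "highest in-region mass" ranking criterion, are hypothetical and chosen only for illustration.

```python
import math


def mass_in_region(attn, mask):
    """Fraction of one head's attention that lands on patches flagged by the weak label."""
    total = sum(attn)
    if total == 0:
        return 0.0
    return sum(a for a, m in zip(attn, mask) if m) / total


def alignment_loss(attn, mask, eps=1e-8):
    """Negative log of the in-region attention mass.

    Illustrative stand-in for an alignment objective (an assumption, not the
    paper's loss): it is 0 when all attention falls inside the labeled region
    and grows as attention drifts outside it. eps avoids log(0).
    """
    return -math.log(max(mass_in_region(attn, mask), eps))


def select_heads(head_attns, mask, k):
    """Indices of the k heads with the most in-region attention (hypothetical criterion)."""
    ranked = sorted(
        range(len(head_attns)),
        key=lambda h: mass_in_region(head_attns[h], mask),
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    mask = [0, 0, 1, 1]  # toy weak label over four patches
    heads = [
        [0.7, 0.1, 0.1, 0.1],      # mostly outside the region
        [0.1, 0.1, 0.4, 0.4],      # mostly inside the region
        [0.25, 0.25, 0.25, 0.25],  # uniform
    ]
    chosen = select_heads(heads, mask, k=2)
    print("selected heads:", chosen)
    for h in chosen:
        print(h, round(alignment_loss(heads[h], mask), 4))
```

In a real fine-tuning setup the loss would be computed on differentiable attention tensors and added to the task loss for the selected heads only; the list-based version above only shows the arithmetic.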

Read the paper.
