Abstract
Visual Question Answering (VQA) models suffer from a language bias problem, relying excessively on textual correlations. This study proposes Plausible Counterfactual Data Generation (PCDG), a counterfactual data generation method that uses Grad-CAM-based visual importance to replace salient objects in a contextually appropriate manner. By synthesizing samples that are more contextually relevant than those produced by existing augmentation methods, PCDG effectively strengthens visual-language alignment. In experiments on the VQA-CP v2 benchmark, the method achieved significant performance improvements, notably a 10.56% increase in the 'Num' category and a 2.78% increase in the 'Other' category. These results indicate that the proposed method improves the VQA model's generalization ability and robustness through debiasing.
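To illustrate the kind of signal the abstract refers to, the following is a minimal sketch (not the authors' implementation) of Grad-CAM-based visual importance, which PCDG-style methods could use to decide which image regions are candidates for contextually plausible replacement. The backbone choice (torchvision ResNet-18), the target class (the model's top prediction), and the importance threshold are assumptions for illustration only.

```python
# Minimal Grad-CAM importance sketch (assumptions: ResNet-18 backbone, 0.6 threshold).
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()  # weights=None keeps the sketch self-contained

activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out            # feature maps of the last conv block

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]      # gradients w.r.t. those feature maps

model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)      # stand-in for a VQA image
logits = model(image)
logits[0, logits.argmax()].backward()    # backprop the score of the predicted class

# Grad-CAM: channel weights are the global-average-pooled gradients
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Regions above this (assumed) threshold would be candidates for
# object replacement when synthesizing a counterfactual image.
important_mask = cam > 0.6
print("fraction of image marked important:", important_mask.float().mean().item())
```

In an actual pipeline, the highlighted regions would then be filled with contextually appropriate substitute objects rather than random noise, which is the "plausible" aspect emphasized in the abstract.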
Table of Contents
I. INTRODUCTION
II. RELATED WORK
A. Retrieval Visual Contrastive Decoding
B. Counterfactual sample synthesis
III. METHOD
A. Visual Importance
B. Dynamic Counterfactual Image Generation
IV. EXPERIMENTS
A. Experimental Settings
B. Training
C. Results
V. CONCLUSION
VI. FUTURE WORK
ACKNOWLEDGMENT
