Attention Mask Consistency (AMC)
Here we showcase our model for localizing arbitrary objects and phrases in images, obtained by tuning the ALBEF vision-language model with human-like explanations from the Visual Genome dataset. In particular, we finetune ALBEF with our recently proposed Attention Mask Consistency (AMC) objective, which will appear at CVPR 2023: Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations. In this paper, we propose a max-margin fine-tuning objective that encourages gradient-based explanations to become softly aligned with explanations provided by humans. This greatly improves a vision-language model's ability to "ground", i.e. localize, objects in images for arbitrary phrases. Try your own images and textual phrases below and see what it does.
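To give a rough sense of the idea, here is a minimal PyTorch sketch of one plausible max-margin consistency term: the model's gradient-based explanation map is pushed to be higher inside the human-annotated region than anywhere outside it. The function name `amc_margin_loss` and the margin value are illustrative assumptions for this demo page, not the paper's exact formulation.

```python
import torch

def amc_margin_loss(attn: torch.Tensor, mask: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Illustrative max-margin consistency loss (not the paper's exact objective).

    attn: (B, H, W) gradient-based explanation map, values in [0, 1]
    mask: (B, H, W) binary human annotation mask (1 inside the region)
    """
    # Mean attention inside the human-annotated region
    inside = (attn * mask).sum(dim=(1, 2)) / mask.sum(dim=(1, 2)).clamp(min=1)
    # Strongest attention response outside the region
    outside = (attn * (1 - mask)).amax(dim=(1, 2))
    # Penalize whenever any outside response exceeds the inside mean minus a margin
    return torch.relu(outside - inside + margin).mean()
```

When the explanation map already concentrates on the annotated region, the hinge is inactive and the loss is zero; attention mass placed outside the region drives the loss up.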
Original Image
P("a man in blue shirt") = 0.65