LoCoRe: Image Re-ranking with Long-Context Sequence Modeling

Zilin Xiao; Pavel Suma; Ayush Sachdeva; Hao-Jen Wang; Giorgos Kordopatis-Zilos; Giorgos Tolias; Vicente Ordonez

publication

Conf. on Computer Vision and Pattern Recognition. CVPR 2025. Nashville, TN.

Lab News Desk

News Release Summary

This section is intentionally written in a reporter-style news release voice for general readers.

Researchers from Rice University and Czech Technical University in Prague have developed a new image retrieval system called LOCORE that rethinks how search engines narrow down and re-rank candidate images after an initial broad search. Traditional re-ranking systems compare a query image against each candidate image individually, one pair at a time, which means they miss useful relationships between the candidate images themselves: for instance, two gallery images might share features that together provide stronger evidence of a match.

LOCORE instead processes the query alongside an entire shortlist of up to 100 candidate images simultaneously, using a long-context transformer model called Longformer, originally developed for lengthy text documents, to capture those cross-image dependencies at the level of fine-grained local visual descriptors. To handle shortlists that exceed the model's context size, the team designed a sliding window strategy that moves through the candidate list in overlapping chunks.

In testing across five benchmark datasets covering landmarks, products, fashion items, and bird species, LOCORE consistently outperformed existing re-ranking methods, including pair-wise approaches using local descriptors and list-wise approaches using global descriptors, while running at comparable or lower latency and using significantly less memory. The work matters because better re-ranking directly improves the accuracy of image search systems, and the approach demonstrates that ideas from natural language processing, particularly long-context modeling and token-level classification, can be transferred effectively to visual retrieval tasks.

abstract

We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
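The two-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the use of cosine similarity for the first stage, and the `rerank_scores` callback (a stand-in for any list-wise re-ranker) are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two global descriptors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: Vector,
    gallery: List[Vector],
    rerank_scores: Callable[[List[int]], List[float]],
    shortlist_size: int,
) -> List[int]:
    """Return gallery indices: the re-ranked shortlist first, then the untouched tail."""
    # Stage 1: cheap global-descriptor ranking over the whole gallery.
    initial = sorted(
        range(len(gallery)), key=lambda i: cosine(query, gallery[i]), reverse=True
    )
    shortlist, tail = initial[:shortlist_size], initial[shortlist_size:]
    # Stage 2: a finer (hypothetical) scorer sees the whole shortlist at once.
    scores = rerank_scores(shortlist)
    order = sorted(range(len(shortlist)), key=lambda j: scores[j], reverse=True)
    return [shortlist[j] for j in order] + tail


# Toy data: four 2-D "global descriptors" and a made-up list-wise scorer.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_scorer = lambda ids: [{1: 0.9, 3: 0.5, 0: 0.2}.get(i, 0.0) for i in ids]
print(retrieve_then_rerank([1.0, 0.0], gallery, toy_scorer, shortlist_size=3))
# prints [1, 3, 0, 2]
```

Only the shortlist is passed to the expensive scorer; images outside it keep their first-stage order.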

details

comment: CVPR 2025

citation

@inproceedings{xiao2025locore,
  title = {LoCoRe: Image Re-ranking with Long-Context Sequence Modeling},
  author = {Xiao, Zilin and Suma, Pavel and Sachdeva, Ayush and Wang, Hao-Jen and Kordopatis-Zilos, Giorgos and Tolias, Giorgos and Ordonez, Vicente},
  year = {2025},
  booktitle = {Conf. on Computer Vision and Pattern Recognition. CVPR 2025},
  url = {https://arxiv.org/abs/2503.21772},
}

automatically generated questions, main contributions and limitations of this paper

Questions this paper helps answer

What is LOCORE and what problem does it address? LOCORE is a long-context image re-ranking model that jointly processes a query image and a shortlist of gallery images using local descriptors, improving the second-stage ranking used in image retrieval systems.
How is LOCORE different from pair-wise re-rankers? Pair-wise methods compare the query to each gallery image independently, while LOCORE models the whole shortlist together so it can exploit relationships among gallery images as well as query-gallery matches.
Why does LOCORE use a long-context sequence model? Re-ranking up to 100 gallery images with local descriptors creates a long token sequence, and Longformer-style attention lets the model capture useful dependencies with manageable memory and latency.
How does LOCORE handle shortlists longer than its context window? It uses an overlapping sliding-window strategy that reuses the list-wise re-ranker across parts of the shortlist, allowing the method to improve rankings beyond the maximum list size seen in one forward pass.
What retrieval benchmarks does LOCORE improve? The paper reports leading or state-of-the-art re-ranking results on landmarks, products, fashion, and bird-species retrieval benchmarks, including ROxf/RPar, SOP, In-Shop, and CUB-200.
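The overlapping sliding-window idea can be illustrated with a short sketch. This page does not state the window direction or stride, so the bottom-to-top traversal, the `stride` parameter, and the `score_window` callback (a stand-in for one forward pass of a list-wise re-ranker) are assumptions of this example rather than the paper's exact procedure.

```python
from typing import Callable, List

Scorer = Callable[[List[int]], List[float]]


def sliding_window_rerank(
    ranked: List[int], score_window: Scorer, window: int, stride: int
) -> List[int]:
    """Re-rank `ranked` (best first) with overlapping windows, moving bottom to top."""
    if not 0 < stride <= window:
        raise ValueError("need 0 < stride <= window")
    result = list(ranked)
    start = max(len(result) - window, 0)
    while True:
        chunk = result[start:start + window]
        scores = score_window(chunk)
        # Reorder the chunk by descending score; ties keep their previous order.
        order = sorted(range(len(chunk)), key=lambda j: -scores[j])
        result[start:start + window] = [chunk[j] for j in order]
        if start == 0:
            break
        start = max(start - stride, 0)
    return result


# Toy relevance table standing in for the model's per-image scores.
relevance = {9: 0.9, 8: 0.8, 5: 0.5, 3: 0.3, 2: 0.2, 1: 0.1}
scorer = lambda ids: [relevance[i] for i in ids]
print(sliding_window_rerank([5, 3, 8, 1, 9, 2], scorer, window=4, stride=2))
# prints [9, 8, 5, 3, 2, 1]
```

Because consecutive windows overlap, an image that starts near the bottom (image 9 here) can be carried upward across windows and end above positions that its first window never covered.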

Main contributions

The paper introduces the first list-wise image re-ranking framework that operates at the local-descriptor level rather than relying on pair-wise local matching or list-wise global descriptors.
LOCORE recasts image re-ranking as a long-context token-level classification problem, transferring ideas from NLP span extraction and sequence tagging into visual retrieval.
The model uses query global attention, separator tokens, and gallery shuffled training to avoid positional shortcuts and learn meaningful cross-image descriptor interactions.
Across ROxf/RPar and their 1M distractor variants, LOCORE improves over prior local-descriptor re-rankers such as geometric verification, RRT, CVNet, and AMES under comparable descriptor settings.
The method also improves metric-learning retrieval benchmarks including CUB-200, SOP, and In-Shop, showing that list-wise local-descriptor re-ranking is useful beyond landmark retrieval.
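To make the input layout concrete, the sketch below builds a symbolic token sequence with query tokens, gallery tokens, and separators, plus a mask marking which tokens receive global attention. The exact arrangement is not given on this page: placing one separator after each gallery image, and treating that separator as the position where a per-image score could be read, are assumptions of this example. The 0/1 mask convention (1 for global, 0 for local attention) follows the `global_attention_mask` argument of the Hugging Face Longformer models.

```python
from typing import List, Tuple

QUERY, SEP, GALLERY = "Q", "SEP", "G"  # symbolic token kinds, for illustration only


def build_layout(
    n_query: int, gallery_sizes: List[int]
) -> Tuple[List[str], List[int], List[int]]:
    """Return (token kinds, global-attention mask, separator positions)."""
    kinds = [QUERY] * n_query
    sep_positions = []
    for size in gallery_sizes:
        kinds += [GALLERY] * size
        sep_positions.append(len(kinds))  # index of the separator closing this image
        kinds.append(SEP)
    # Assumption: only query tokens attend globally; all others use local attention.
    global_mask = [1 if kind == QUERY else 0 for kind in kinds]
    return kinds, global_mask, sep_positions


kinds, mask, seps = build_layout(n_query=2, gallery_sizes=[3, 2])
print(kinds)  # ['Q', 'Q', 'G', 'G', 'G', 'SEP', 'G', 'G', 'SEP']
print(mask)   # [1, 1, 0, 0, 0, 0, 0, 0, 0]
print(seps)   # [5, 8]
```

Under this layout the sequence length is the number of query descriptors, plus all gallery descriptors, plus one separator per gallery image, which is why a shortlist of many images quickly calls for a long-context model.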

Limitations and cautions

LOCORE is a second-stage re-ranker, not a replacement for efficient first-stage retrieval; it fits large-scale search pipelines in which a compact global descriptor first narrows the candidate list.
The method depends on high-quality local descriptors from extractors such as DELG or DINOv2; because it is not tied to a single backbone, it can benefit from advances in local feature extraction.
Long-context processing has a finite context window, so very long shortlists require sliding-window inference; the paper shows this strategy works well and can extend the benefits beyond the training list size.
Training requires care to avoid positional shortcuts from the initial global ranking, but gallery shuffled training is a simple and effective fix demonstrated in the ablations.
The evaluation focuses on established instance-level retrieval benchmarks, leaving broader production search settings and domain-specific image collections as natural next deployment studies.
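Gallery shuffled training, mentioned in the list above, amounts to randomizing the order of gallery images in each training list so that position in the initial ranking carries no label information. A minimal sketch, assuming each gallery image comes with a binary relevance label; the function name and data shapes are hypothetical:

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_gallery(
    items: Sequence[T], labels: Sequence[int], rng: random.Random
) -> Tuple[List[T], List[int]]:
    """Apply one random permutation to gallery items and their labels together."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    perm = list(range(len(items)))
    rng.shuffle(perm)
    return [items[i] for i in perm], [labels[i] for i in perm]


# Initial ranking tends to place positives first; shuffling removes that cue.
gallery_ids = ["img_a", "img_b", "img_c", "img_d"]
relevance = [1, 1, 0, 0]
shuffled_ids, shuffled_labels = shuffle_gallery(gallery_ids, relevance, random.Random(0))
print(list(zip(shuffled_ids, shuffled_labels)))
```

Items and labels share one permutation, so each image keeps its own label while its position in the sequence becomes uninformative.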

How to read this result

This paper is best read as a strong contribution to image retrieval re-ranking: LOCORE shows that long-context list-wise modeling can make local descriptors more powerful, improving accuracy across diverse benchmarks while keeping latency and memory practical for second-stage retrieval.