Adaptive Re-Ranking with a Corpus Graph

Search systems often employ a re-ranking pipeline, wherein documents (or passages) from an initial pool of candidates are assigned new ranking scores. The process enables the use of highly-effective but expensive scoring functions that are not suitable for use directly in structures like inverted indices or approximate nearest neighbour indices. However, re-ranking pipelines are inherently limited by the recall of the initial candidate pool; documents that are not identified as candidates by the initial retrieval function can never be re-ranked. We propose a novel approach for overcoming the recall limitation based on the well-established clustering hypothesis. Throughout the re-ranking process, our approach adds documents to the pool that are most similar to the highest-scoring documents up to that point. This feedback process adapts the pool of candidates to those that may also yield high ranking scores, even if they were not present in the initial pool. It can also increase the score of documents that appear deeper in the pool that would otherwise have been skipped due to a limited re-ranking budget. We find that our Graph-based Adaptive Re-ranking (GAR) approach significantly improves the performance of re-ranking pipelines in terms of precision- and recall-oriented measures, is complementary to a variety of existing techniques (e.g., dense retrieval), is robust to its hyperparameters, and contributes minimally to computational and storage costs. For instance, on the MS MARCO passage ranking dataset, GAR can improve the nDCG of a BM25 candidate pool by up to 8% when applying a monoT5 ranker.


INTRODUCTION
Deep neural ranking models - especially those that use contextualised language models like BERT [6] - have brought significant benefits in retrieval effectiveness across a range of tasks [18]. The most effective techniques tend to be those that first retrieve a pool of candidate documents using an inexpensive retrieval approach and then re-score them using a more expensive function. This process is called re-ranking, since the documents from the candidate pool are given a new ranked order. Re-ranking enables the use of sophisticated scoring functions (such as cross-encoders, which jointly model the texts of the query and document) that are incompatible with inverted indexes or vector indexes. Since the scoring function can be computationally expensive, re-ranking is often limited to a predefined maximum number of documents that the system is willing to re-rank for each query (i.e., a re-ranking budget, such as 100).
The performance of a re-ranking pipeline is limited by the recall of the candidate pool, however. This is because documents that were not found by the initial ranking function have no chance of being re-ranked. Consequently, a variety of techniques are employed to improve the recall of the initial ranking pool, including document-rewriting approaches that add semantically-similar terms to an inverted index [30], or dense document retrieval techniques that enable semantic searching [12].
In this work, we explore a complementary approach to overcoming the recall limitation of re-ranking based on the long-standing clustering hypothesis [11], which suggests that closely-related documents tend to be relevant to the same queries. During the re-ranking process, our approach, called Graph-based Adaptive Re-Ranking (Gar), prioritises the scoring of the neighbours of documents that have received high scores up to this point. An overview of Gar is shown in Figure 1. The Gar feedback mechanism allows for documents to be retrieved that were not present in the initial re-ranking pool, which can improve system recall. It also allows for the re-ranking of documents that may have otherwise been skipped from the pool when the re-ranking budget is low. Finally, by including the feedback within the re-ranking itself (as opposed to post-scoring feedback mechanisms, such as PRF), our approach can find documents that are multiple hops away (i.e., neighbours of neighbours). Gar achieves low online overhead through offline computation of a corpus graph that stores the nearest neighbours of each document.
In experiments on the TREC Deep Learning datasets, we find that Gar significantly improves precision- and recall-oriented evaluation measures. Gar can improve virtually any re-ranking pipeline, with the results largely holding across a variety of initial retrieval functions (lexical, dense retrieval, document expansion, and learned lexical), scoring functions (cross-encoding, late interaction), document similarity metrics (lexical, semantic), and re-ranking budgets (high, low). Impressively, a Gar pipeline that uses only BM25 for both the initial retrieval and the document similarity is able to achieve comparable or improved performance in terms of re-ranked precision and recall over the competitive TCT-ColBERT-v2-HNP [19] and DocT5Query [30] models - both of which have far higher requirements in terms of offline computation and/or storage capacities. We find that the online overhead of Gar is low compared to a typical re-ranking, usually only adding around 2-4ms per 100 documents re-ranked. We also find that Gar is largely robust to its parameters, with major deviations in performance only occurring at extreme parameter values. Finally, we find that despite using document similarity, Gar does not significantly reduce the diversity among the relevant retrieved documents.
In summary, we propose a novel approach to embed a feedback loop within the neural re-ranking process to help identify un-retrieved relevant documents through application of the clustering hypothesis. Our contributions can therefore be summarised as follows: (1) We demonstrate a novel application of the clustering hypothesis in the context of neural re-ranking; (2) We show that our proposed approach can successfully improve both the precision and the recall of re-ranking pipelines with minimal computational overhead; (3) We demonstrate that the approach is robust across pipeline components and the parameters it introduces. The remainder of the paper is organised as follows: We first provide additional background and related work, positioning Gar in context with past work in neural retrieval, relevance feedback, and the clustering hypothesis (Section 2); We then briefly demonstrate that the clustering hypothesis still holds on a recent dataset to motivate our approach (Section 3); We formally describe our method (Section 4) and present our experiments that demonstrate its effectiveness (Sections 5 & 6); We wrap up with final conclusions and future directions of this promising area (Section 7).

BACKGROUND AND RELATED WORK
The recent advancements in deep neural ranking models have brought significant improvements in the effectiveness of ad-hoc ranking tasks in IR systems [18]. In particular, pre-trained language models such as BERT [6] and T5 [33] are able to learn semantic representations of words depending on their context, and these representations are able to better model the relevance of a document w.r.t. a query, with notable improvements over classical approaches. However, these improvements come at a high computational cost; BERT-based rankers [22, 28] are reported to be slower than classical rankers, such as those based on BM25, by orders of magnitude [10, 22]. Therefore, it is still usually infeasible to directly use pre-trained language models to rank all documents in a corpus for each query (even using various techniques to reduce the computational cost [12, 20, 21]). Deep neural ranking models are typically deployed as re-rankers in a pipeline architecture, where a preliminary first ranking stage is deployed before the more expensive neural re-ranker, in a cascading manner. During query processing, the first ranking stage retrieves a candidate pool of documents from the whole document corpus using a simple ranking function, with the goal of maximising recall. The following re-ranking stage processes the documents in the candidate pool, reordering them with a focus on high-precision results at the top positions, which will be returned to the user [31, 36]. In this setting, there is an efficiency-effectiveness tradeoff on the number of documents retrieved by the first ranker. From the efficiency perspective, a smaller candidate pool allows the re-ranker to spend less time on re-ranking, since the execution time is proportional to the candidate set size. From the effectiveness perspective, the larger the candidate pool, the higher the number of potentially relevant documents retrieved from the document corpus. In fact,
relevant documents can be retrieved from the corpus only during first-stage processing. The recall effectiveness of the candidate pool has been investigated in previous IR settings, in particular in learning-to-rank pipelines. Tonellotto et al. [35] studied how, given a time budget, dynamic pruning strategies [36] can be used in first-stage retrieval to improve the candidate pool size on a per-query basis. Macdonald et al. [23] studied the minimum effective size of the document pool, i.e., when to stop ranking in the first stage, and concluded that the smallest effective pool for a given query depends, among other factors, on the type of the information need and the document representation. In the context of neural IR, learned sparse retrieval focuses on learning new terms to be included in a document before indexing, along with the impact scores to be stored in the inverted index, such that the resulting ranking function approximates the effectiveness of a full transformer-based ranker while retaining the efficiency of the fastest inverted-index based methods [5, 7, 26]. In doing so, first-stage rankers based on learned impacts are able to improve recall w.r.t. BM25, but the end-to-end recall is still limited by the first-stage ranker.
Pseudo-Relevance Feedback (PRF) involves the reformulation of a query based on the top results (e.g., by adding distinctive terms from the top documents). This query is then re-issued to the engine, producing a new ranked result list. Adaptive re-ranking also makes use of these top-scoring documents, but differs in two important ways. First, the query remains unmodified, and therefore ranking scores from the model need not be re-computed. Second, the top scores are used in an intermediate stage of the scoring process; the process is guided by the highest-scoring documents known up until a given point, which may not reflect the overall top results. Finally, we note that the output of an adaptive re-ranking operation could be fed as input into a PRF operation to perform query reformulation. This work can be seen as a modern instantiation of the clustering hypothesis, which Jardine and van Rijsbergen [11] stated as "Closely associated documents tend to be relevant to the same requests". Many works have explored the clustering hypothesis for various tasks in information retrieval, such as visualisation of the corpus (e.g., [17]), visualisation of search results (e.g., [4]), enriching document representations [16], and fusing rankings (e.g., [15]). Most related to our application is the usage of the clustering hypothesis for first-stage retrieval (i.e., document selection), in which the documents to rank are identified by finding the most suitable cluster for a query [13]. However, these works focus on identifying the most suitable clusters for a given query and transforming their constituents into a ranking. In contrast, while our approach also takes a soft clustering approach [14], where each 'cluster' is represented by a document and its neighbours, instead of ranking clusters we identify 'good' clusters as those whose representative document is scored highly by a strong neural scoring function. We also address the problem of transforming the documents into a ranking by letting the
neural scoring function do that job as well. Overall, our novel approach is the first to embed a feedback loop within the re-ranking process to help identify un-retrieved relevant documents.
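As a concrete contrast with adaptive re-ranking, the PRF process described above (add distinctive terms from the top documents, then re-issue the query) can be sketched as follows. The whitespace tokenisation and frequency-based term selection are illustrative assumptions, not a specific PRF model from the literature:

```python
from collections import Counter

def prf_reformulate(query, top_docs, n_terms=3):
    """Expand a query with the most frequent terms from the top-ranked
    documents, then return the reformulated query to be re-issued.
    Note the contrast with adaptive re-ranking, which never modifies the
    query and so never needs to re-compute existing ranking scores."""
    counts = Counter(t for doc in top_docs for t in doc.split())
    for t in query.split():
        counts.pop(t, None)  # do not re-add terms already in the query
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return query + ' ' + ' '.join(expansion)
```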

PRELIMINARY ANALYSIS
We first perform a preliminary check to see whether the clustering hypothesis appears to hold on a recent dataset and using a recent model. Namely, we want to check whether the passages from the MS MARCO corpus [2] are more likely to be distributed closer to those with the same relevance labels than to those with differing grades. We explore two techniques for measuring similarity: a lexical similarity score via BM25, and a semantic similarity via TCT-ColBERT-HNP [19]. For the queries in the TREC DL 2019 dataset [3], we compute similarity scores between each pair of judged documents. Then, akin to Voorhees' cluster hypothesis test [37], we calculate the distribution of the relevance labels of the nearest neighbouring passage by relevance label (i.e., we calculate P(rel(nn(d)) = y | rel(d) = x) for all pairs of relevance labels x and y). Table 1 presents the results of this analysis. We observe a clear trend: passages with a given relevance label are far more likely to be closer to passages with the same label (among judged passages) than to those with other labels (in the same row). This holds across both lexical (BM25) and semantic (TCT-ColBERT) similarity measures, and across all four relevance labels (ranging from non-relevant to perfectly relevant).
This analysis suggests that the clustering hypothesis holds on TREC DL 2019. Therefore, it follows that the neighbours of passages that a scoring function considers most relevant are a reasonable place to look for additional relevant passages to be scored - which is the core motivation of our proposed method.
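The cluster-hypothesis check described above can be sketched in a few lines. The data layout (a pairwise similarity lookup and one relevance label per judged passage) is an illustrative assumption:

```python
from collections import Counter, defaultdict

def nn_label_distribution(sims, labels):
    """For each judged passage, find its most similar other judged passage
    and tally that neighbour's relevance label, grouped by the passage's
    own label. Returns {x: {y: P(label(nn(d)) = y | label(d) = x)}}.

    sims: dict mapping an unordered pair (i, j) to a similarity score.
    labels: dict mapping passage id to relevance label."""
    def sim(d, e):
        return sims.get((d, e), sims.get((e, d), float('-inf')))
    counts = defaultdict(Counter)
    for d, x in labels.items():
        # nearest neighbour of d among the other judged passages
        nn = max((e for e in labels if e != d), key=lambda e: sim(d, e))
        counts[x][labels[nn]] += 1
    return {x: {y: c / sum(ctr.values()) for y, c in ctr.items()}
            for x, ctr in counts.items()}
```

If the hypothesis holds, the diagonal entries (same label for passage and neighbour) should dominate each row, as observed in Table 1.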

GRAPH-BASED ADAPTIVE RE-RANKING
We now introduce the document re-ranking scenario, and we present a description of our proposed re-ranking algorithm. Let R0 denote an initial ranked pool of |R0| documents produced by a first-stage ranker, and let R1 denote a subsequent re-ranked pool of |R1| = |R0| documents. A certain number of top-ranked documents from the R1 pool will subsequently be returned to the user who issued the query. In re-ranking, we assume that the documents from R0 are processed in batches of at most b documents (the size of the last batch depends on the re-ranking budget). A scoring function Score() takes as input a batch of documents, e.g., the top-scoring b documents in R0, and re-scores them according to a specific re-ranking stage implementation. The re-scored batch is added to the final re-ranked pool R1, and then removed from the initial ranked pool R0. Note that by setting b = 1, we are re-ranking one document at a time, as in classical learning-to-rank scenarios; in contrast, when b > 1, we allow for more efficient re-ranking function implementations leveraging advanced hardware, such as GPUs and TPUs.
Since the time available for re-ranking is often small, and given that it is directly proportional to the number of documents re-ranked, the re-ranking process can be provided with a budget c, denoting the maximum number of documents to be re-ranked given the proportional time constraint. If the budget does not allow all the documents in the initial ranked pool to be re-ranked, the Backfill function returns the documents in R0 that have not been re-ranked, i.e., those not in R1, which are used to fill up the final re-ranked pool R1 so that it contains all the documents initially included in R0. For example, if R0 contains 1000 documents and, due to the budget, only 100 documents can be re-scored, the 900 top-ranked documents in R0 that were not re-ranked into R1 are appended to R1 in the same order as in R0, to obtain a re-ranked list of 1000 documents. The uncoloured lines in Alg. 1 illustrate this re-ranking algorithm, which corresponds to the common re-ranking adopted in a pipelined cascading architecture.
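As a reference point for the adaptive variant, the standard budgeted re-ranking with backfill (the uncoloured lines of Alg. 1) can be sketched as follows. The `score_fn` interface (a batch of docids in, a dict of scores out) is our own assumption standing in for Score():

```python
def rerank(R0, score_fn, b=16, budget=100):
    """Standard budgeted re-ranking with backfill.

    R0: list of docids in first-stage rank order.
    score_fn: maps a batch (list) of docids to a {docid: score} dict.
    b: batch size; budget: maximum number of documents to re-score.
    Returns the re-scored docs by descending score, followed by the
    un-scored remainder of R0 in its original order (the backfill)."""
    to_score = R0[:budget]
    R1 = {}
    for start in range(0, len(to_score), b):
        R1.update(score_fn(to_score[start:start + b]))
    reranked = sorted(R1, key=R1.get, reverse=True)
    backfill = [d for d in R0 if d not in R1]  # keeps first-stage order
    return reranked + backfill
```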
In our adaptive re-ranking algorithm, we leverage a corpus graph G = (V, E). This directed graph encodes the similarity between documents, and can be computed offline using a lexical or semantic similarity function between two documents. Every node in V represents a document in the corpus, and every pair of documents may be connected with an edge in E, labelled with the documents' similarity. To address the graph's quadratic space (and time) complexity, we limit the number of edges per node in the corpus graph to a small value k, i.e., |E| = k|V|. The top k edges are selected according to their similarity scores, in decreasing order.
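An illustrative, brute-force construction of such a top-k corpus graph might look like the following. A real corpus would use BM25 retrieval or approximate nearest-neighbour search rather than the quadratic loop shown here, and the dense-vector input is an assumption:

```python
def build_corpus_graph(vectors, k=8):
    """Offline corpus-graph construction sketch: for each document, keep
    directed edges to its k most similar other documents.

    vectors: dict docid -> list[float] (one embedding per document).
    Returns dict docid -> list of k neighbour docids, most similar first."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    graph = {}
    for d, u in vectors.items():
        scored = sorted((e for e in vectors if e != d),
                        key=lambda e: dot(u, vectors[e]), reverse=True)
        graph[d] = scored[:k]  # keep only the top-k edges per node
    return graph
```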
Our adaptive re-ranking algorithm, illustrated in Alg. 1, receives an initial ranking pool of documents R0, a batch size b, a budget c, and the corpus graph G as input. We consider a dynamically updated re-ranking pool P, initialised with the contents of R0 (P ← R0), and a dynamically updated graph frontier F, initially empty (F ← ∅). After the re-ranking of the top b documents selected from P subject to the budget c (called batch B, where b = |B|), we update the initial and re-ranked pools R0 and R1. The documents in the batch are removed from the frontier F because there is no need to re-rank them again. Now we consider the documents in the batch B, and we look up in the corpus graph the documents whose nodes are directly connected to the documents in B. These documents (except any that have already been scored) are added to the frontier (F ← F ∪ (Neighbours(B, G) \ R1)), prioritised by the computed ranking score of the source document. Note that the neighbours may occur later in the ranking list. Next, instead of using the current contents of the initial pool R0 for the next batch evaluation, we alternate between R0 and the current frontier F. In doing so, we ensure that R1 contains documents from R0 as well as newly identified documents not included in R0. The algorithm proceeds by alternating between these two options, populating the frontier at each step, for as long as the budget allows, then backfills the final pool with the initial candidates as before.
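A minimal sketch of the adaptive algorithm follows, assuming a `score_fn` that maps a batch of docids to a dict of scores and a plain dict-of-lists corpus graph. Details such as empty-batch handling and tie-breaking are our own simplifications, not Alg. 1 verbatim:

```python
import heapq

def gar(R0, score_fn, graph, b=16, budget=100):
    """Adaptive re-ranking sketch: alternate between batches drawn from
    the initial ranking R0 and batches drawn from a frontier of unscored
    neighbours of the highest-scoring documents seen so far."""
    R1 = {}            # docid -> score for every re-scored document
    frontier = []      # max-heap of (-source_score, docid)
    pool = list(R0)    # unscored initial candidates, in rank order
    use_initial = True
    while len(R1) < budget and (pool or frontier):
        if use_initial and pool:
            batch = [d for d in pool[:b] if d not in R1]
            pool = pool[b:]
        else:
            batch = []
            while frontier and len(batch) < b:
                _, d = heapq.heappop(frontier)
                if d not in R1 and d not in batch:
                    batch.append(d)
        batch = batch[:budget - len(R1)]   # respect the budget
        scores = score_fn(batch)
        R1.update(scores)
        for d in batch:                    # expand the frontier
            for n in graph.get(d, []):
                if n not in R1:
                    heapq.heappush(frontier, (-scores[d], n))
        use_initial = not use_initial      # alternate batch sources
    # backfill: unscored documents keep their original order
    return sorted(R1, key=R1.get, reverse=True) + [d for d in R0 if d not in R1]
```

Note how a document reachable only through the graph (never in R0) can enter the final ranking, which is exactly the recall gain the approach targets.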
We note that alternating between the initial ranking and the frontier is somewhat naïve; perhaps it is better to score more/fewer documents from the frontier, or to dynamically decide whether to select batches from the frontier or the initial ranking based on recent scores.Indeed, we investigated such strategies in pilot experiments but were unable to identify a strategy that consistently performed better than the simple alternating technique.We therefore decided to leave the exploration of alternative techniques to future work.

EXPERIMENTAL SETUP
We experiment to answer the following research questions:
RQ1 What is the impact of Gar on retrieval effectiveness compared to typical re-ranking?
RQ2 What is the computational overhead introduced by Gar? (Section 6.2)
RQ3 How sensitive is Gar to the parameters it introduces, namely the number of neighbours k included in the corpus graph and the batch size b? (Section 6.3)
RQ4 What is the impact of Gar on retrieval effectiveness compared to state-of-the-art neural IR systems?
Finally, because Gar is based on scoring similar documents, we recognise that it has the potential to reduce the diversity of the retrieved passages (i.e., it could make the retrieved passages more homogeneous). Therefore, we ask:
RQ5 Does Gar result in more homogeneous relevant passages than existing techniques?

Datasets and Evaluation
Our primary experiments are conducted using the TREC Deep Learning 2019 (DL19) and 2020 (DL20) test collections [3]. DL19 is used throughout development and for the analysis of Gar, and therefore acts as our validation set. DL20 is held out until the final evaluation, allowing us to confirm that our approach has not over-fit to DL19. Both datasets use the MS MARCO passage ranking corpus, which consists of 8.8M passages [2]. DL19 consists of 43 queries with an average of 215 relevance assessments per query; DL20 has 54 queries with 211 assessments per query. We evaluate our approach using nDCG, MAP, and Recall at rank 1000.
For the binary measures (MAP and Recall), we use the standard practice of setting a minimum relevance score of 2, which counts answers that are highly or perfectly relevant. In our experiments we are concerned with both precision and recall, so we focus on nDCG without a rank cutoff, though we also report the official task measure of nDCG with a rank cutoff of 10 (nDCG@10) to provide meaningful comparisons with other works. We select DL19 and DL20 because they provide more complete relevance assessments than the MS MARCO development set; this is especially important given that Gar is designed to retrieve documents that were not necessarily in the initial re-ranking pool. For completeness, we also report performance on the small subset of MS MARCO dev, which consists of 6,980 queries, each with 1.1 relevance assessments per query on average. For this dataset, we report the official measure of Mean Reciprocal Rank at 10 (MRR@10) and the commonly-reported value of Recall at 1000.
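For illustration, binarising graded judgments at a minimum relevance of 2 when computing recall can be sketched as follows; the qrels-as-dict layout is an assumption:

```python
def recall_at_k(ranking, qrels, k=1000, min_rel=2):
    """Recall@k with graded judgments binarised at min_rel. With
    min_rel=2, only 'highly' or 'perfectly' relevant passages count
    as relevant, matching the standard practice described above.

    ranking: list of docids, best first.
    qrels: dict docid -> graded relevance label."""
    relevant = {d for d, r in qrels.items() if r >= min_rel}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)
```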

Retrieval and Scoring Models
To test the effect of Gar under a variety of initial ranking conditions, we conduct experiments using four retrieval functions as first stage rankers, each representing a different family of ranking approaches.
• BM25, a simple and long-standing lexical retrieval approach. We retrieve the top 1000 BM25 results from a PISA [27] index using default parameters.
• TCT, a dense retrieval approach. We conduct exact (i.e., exhaustive) retrieval of the top 1000 results using a TCT-ColBERT-HNP model [19] trained on MS MARCO. This is among the most effective dense retrieval models to date.
• D2Q, a document expansion approach. We retrieve the top 1000 BM25 results from a PISA index of documents expanded using a docT5query model [30] trained on MS MARCO. We use the expanded documents released by the authors. This is the most effective document expansion model we are aware of to date.
• SPLADE, a learned sparse lexical retrieval model. We retrieve the top 1000 results for a SPLADE++ model [7] trained on MS MARCO (CoCondenser-EnsembleDistil version). We use code released by the authors for indexing and retrieval. This is the most effective learned lexical retrieval model we are aware of to date.
Similarly, we experiment with the following neural re-ranking models to test the effect of the scoring function on Gar.
• MonoT5, a sequence-to-sequence scoring function. We test two versions of the MonoT5 model [29] trained on MS MARCO from two base language models: MonoT5-base and MonoT5-3b. The 3b model has the same structure as the base model, but has more parameters (13× more; 2.9B, compared to base's 223M), so it is consequently more expensive to run. These models are among the most effective scoring functions reported to date.
• ColBERT (scorer only), a late interaction scoring function. Although ColBERT [12] can be used in an end-to-end fashion (i.e., using its embeddings to perform dense retrieval), we use it as a scoring function over the aforementioned retrieval functions. The model represents two paradigms: one where representations are pre-computed to reduce query latency, and another where the representations are computed on-the-fly.
We use the implementations of the above methods provided by PyTerrier [24].Following PyTerrier notation, we use » to denote a re-ranking pipeline.For instance, "BM25»MonoT5-base" retrieves using BM25 and re-ranks using MonoT5-base.

Corpus Graphs
In our experiments, we construct and exploit two corpus graphs, namely a lexical similarity graph and a semantic similarity graph.
The lexical graph (denoted as Gar BM25) is constructed by retrieving the top BM25 [34] results using the text of the passage as the query.
We use PISA to perform top k + 1 lexical retrieval (discarding the passage itself). Using a 4.0 GHz 24-core AMD Ryzen Threadripper Processor, the MS MARCO passage graph takes around 8 hours to construct. The semantic similarity graph (denoted as Gar TCT) is constructed using the TCT-ColBERT-HNP model. We perform an exact (i.e., exhaustive) search over an index to retrieve the top k + 1 most similar embeddings to each passage (discarding the passage itself). Using an NVIDIA GeForce RTX 3090 GPU to compute similarities, the MS MARCO passage graph takes around 3 hours to construct. We construct both graphs using k = 8 neighbours, and explore the robustness to various values of k in Section 6.3. Because the number of edges (i.e., neighbours) per node (i.e., passage) is known, the graphs are both stored as an uncompressed sequence of docids. Using unsigned 32-bit integer docids, only 32 bytes per passage are needed, which amounts to 283 MB to store an MS MARCO graph. We note that there are likely approaches that reduce the computational overhead of graph construction by making use of approximate searches; we leave this for future work. The two graphs differ substantially in their content. We release these graphs through our implementation to aid other researchers and enable future works.
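The fixed-width storage layout described above can be sketched as follows; the function names and file handling are illustrative, not the released implementation:

```python
from array import array

K = 8                         # neighbours per node, fixed offline
ITEM = array('I').itemsize    # unsigned 32-bit docids (4 bytes)

def save_graph(graph, path):
    """Write the corpus graph as a flat, uncompressed run of docids:
    node i's K neighbours occupy slots [i*K, (i+1)*K). With K = 8 this
    is 32 bytes per passage (~283 MB for 8.8M MS MARCO passages)."""
    flat = array('I')
    for docid in range(len(graph)):
        flat.extend(graph[docid])
    with open(path, 'wb') as f:
        flat.tofile(f)

def neighbours(path, docid):
    """Constant-time lookup: seek straight to the node's fixed slot."""
    with open(path, 'rb') as f:
        f.seek(docid * K * ITEM)
        out = array('I')
        out.fromfile(f, K)
    return list(out)
```

Because every node's slot has the same width, no offsets or compression dictionaries need to be stored alongside the docids.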

Other Parameters and Settings
We use a Gar batch size of b = 16 by default, matching a typical batch size for a neural cross-encoder model. We explore the robustness of Gar to various values of b in Section 6.3. We explore two budgets: c = 100 (a reasonable budget for a deployed re-ranking system, e.g., [9]) and c = 1000 (the de facto default threshold commonly used in shared tasks like TREC).

RESULTS AND ANALYSIS
We now present the results of our experiments and conduct associated analysis to answer our research questions.

Effectiveness
To understand whether Gar is generally effective, it is necessary to test the effect it has on a variety of retrieval pipelines. Therefore, we construct re-ranking pipelines based on every pair of our initial ranking functions (BM25, TCT, D2Q, and SPLADE) and scoring functions (MonoT5-base, MonoT5-3b, and ColBERT). These 12 pipelines collectively cover a variety of paradigms. Table 2 presents the results of Gar on these pipelines for TREC DL 2019 and 2020 using both the lexical BM25-based graph and the semantic TCT-based corpus graph. We report results using both re-ranking budgets, c = 100 and c = 1000.
Each box in Table 2 allows the reader to inspect the effect on retrieval effectiveness that Gar has on a particular re-ranking pipeline and re-ranking budget. In general, we see that the greatest improvement occurs when the initial retrieval pool is poor. In particular, BM25 only provides a R@1k of 0.755 and 0.805 on DL19 and DL20, respectively, while improved retrieval functions offer up to 0.872 and 0.899, respectively (SPLADE). Gar enables the pipelines to find additional relevant documents. Using BM25 as the initial pool, our approach reaches a R@1k of up to 0.846 and 0.892, respectively (BM25»MonoT5-3b w/ Gar TCT and c = 1000). Perhaps unsurprisingly, this result is achieved using both a corpus graph (Gar TCT) that differs substantially from the technique used for initial retrieval (BM25) and the most effective re-ranking function (MonoT5-3b). However, we also note surprisingly high recall in this setting when using the Gar BM25 corpus graph: up to 0.831 (DL19) and 0.881 (DL20). These results are on par with the recall achieved by TCT and D2Q - an impressive feat considering that this pipeline only uses lexical signals and a single neural model trained with a conventional process. The pipelines that use a BM25 initial ranker also benefit greatly in terms of nDCG, which is likely due in part to the improved recall.
Significant improvements are also observed in all other pipelines, particularly in terms of nDCG when there is a low re-ranking budget available (c = 100) and in recall when a high budget is available (c = 1000). In general, the corpus graph that is least similar to the initial ranker is most effective (e.g., the BM25 graph when using a TCT ranking). However, we note that both corpus graphs improve every pipeline, at least in some settings. For instance, the Gar TCT corpus graph consistently improves the nDCG of pipelines that use TCT as an initial ranker, but rarely the recall.
We also note that Gar can nearly always improve the precision of the top results, as measured by nDCG, in settings with a limited re-ranking budget (c = 100), even when R@1k remains unchanged. This is likely due to the fact that Gar is able to pick out documents from lower depths of the initial ranking pool to score within the limited available budget. For instance, in the case of the strong SPLADE»MonoT5-base pipeline with c = 100, which offers high recall to begin with (0.872 on DL19 and 0.899 on DL20), Gar BM25 improves the nDCG from 0.750 to 0.762 (DL19) and from 0.748 to 0.757 (DL20), while leaving the R@1k unchanged.
In a few rare cases, we observe that Gar can yield a lower mean performance than the baseline (e.g., MAP for the D2Q»MonoT5-base pipeline with c = 1000). However, these differences are never statistically significant and are usually accompanied by significant improvements to other measures (e.g., the R@1k improves). We note that the same trends appear for both our validation set (DL19) and our held-out test set (DL20), suggesting that Gar is not over-fitted to the data that we used during the development of Gar.

Table 2: Effectiveness of Gar on TREC DL 2019 and 2020 in a variety of re-ranking pipelines and re-ranking budgets (c). The top result for each pipeline is in bold. Significant differences with the baseline (typical re-ranking) are marked with *, while insignificant differences are in grey (paired t-test, p < 0.05, using Bonferroni correction).

Finally, we test Gar on the MS MARCO dev (small) set. This setting differs from the TREC DL experiments in that each of the queries has only a few (usually just one) passages that are labelled as relevant, but there are far more queries (6,980, compared to 43 in DL19 and 54 in DL20). Thus, experiments on this dataset test a pipeline's capacity to retrieve a single (and somewhat arbitrary) relevant passage for a query. Due to the cost of running multiple versions of highly-expensive re-ranking pipelines, we limit this study to a low re-ranking budget of c = 100 and to the two less expensive scoring functions (MonoT5-base and ColBERT). Table 3 presents the results. We find that Gar offers the most benefit in pipelines that suffer from lower recall - namely, the BM25-based pipelines. In this setting, the improved R@1k also boosts the RR@10. In the TCT, D2Q, and SPLADE pipelines, R@1k is often significantly improved, but this results in non-significant (or marginal) changes to RR@10.
To answer RQ1, we find that Gar provides significant benefits in terms of precision- and recall-oriented measures. The results hold across a variety of initial retrieval functions, re-ranking functions, and re-ranking budgets. The most benefit is apparent when the initial pool has low recall, though we note that Gar also improves over systems with high initial recall, particularly by enabling higher precision at a lower re-ranking budget. Overall, we find that Gar is safe to apply to any re-ranking pipeline (i.e., it will not harm the effectiveness), and it will often improve performance (particularly when the re-ranking budget is limited or when a low-cost first-stage retriever is used).

Table 3: Effectiveness of Gar on the MS MARCO dev (small) set with a re-ranking budget of c = 100. The top result for each pipeline is in bold. Significant differences with the baseline (typical re-ranking) are marked with * (paired t-test, p < 0.05, using Bonferroni correction).
To illustrate the ability of Gar to promote low-ranked documents under limited ranking budgets, Figure 2 plots the initial rank (x-axis) of documents and their final rank (y-axis), for a particular query. Each point represents a retrieved document, with colour/size indicative of the relevance label. Lines between points indicate links followed in the corpus graph. It can be seen that by leveraging the corpus graph, Gar is able to promote highly relevant documents that were lowly scored in the initial ranking, as well as retrieve 'new' relevant documents that were not retrieved in the initial BM25 pool. For instance, Gar is able to select five rel=2 documents from around initial rank 250-300, and ultimately score them within the top 40 documents. Meanwhile, it retrieves two rel=2 and one rel=3 documents that were not found in the first stage.

Computational Overhead
Gar is designed to have a minimal impact on query latency. By relying on a pre-computed corpus graph that will often be small enough to fit into memory (283MB with k = 8 for MS MARCO), neighbour lookups are performed in O(1) time. With the frontier F stored in a heap, insertions take only O(1), meaning that finding neighbours and updating the frontier adds only constant time for each scored document. Sampling the top b items from the heap takes O(b log c), since the number of items in the heap never needs to exceed the budget c.
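As a concrete illustration, the bookkeeping described above can be sketched in Python. This is a minimal sketch with hypothetical function names (`score`, `neighbours`); the actual implementation alternates batches between the initial ranking and the graph frontier, which this sketch folds into a single heap for brevity.

```python
import heapq

def gar_rerank(initial_ranking, score, neighbours, batch_size, budget):
    """Minimal sketch of Graph-based Adaptive Re-ranking (GAR).

    initial_ranking: list of (first_stage_score, doc_id) pairs.
    score(doc_id): the expensive neural scoring function (e.g., monoT5).
    neighbours(doc_id): constant-time lookup in the pre-computed corpus graph.
    """
    # Max-heap frontier (heapq is a min-heap, so scores are negated).
    frontier = [(-s, d) for s, d in initial_ranking]
    heapq.heapify(frontier)
    scored = {}  # doc_id -> neural score
    while frontier and len(scored) < budget:
        # Draw the next batch of most-promising candidates from the frontier.
        batch = []
        while frontier and len(batch) < min(batch_size, budget - len(scored)):
            _, doc = heapq.heappop(frontier)
            if doc not in scored and doc not in batch:
                batch.append(doc)
        # Score the batch; feed each document's neighbours back into the
        # frontier, prioritised by the score of the document that led to them.
        for doc in batch:
            s = score(doc)
            scored[doc] = s
            for nbr in neighbours(doc):
                if nbr not in scored:
                    heapq.heappush(frontier, (-s, nbr))
    # Final ranking: scored documents by descending neural score.
    return sorted(scored.items(), key=lambda kv: -kv[1])
```

Note how a document absent from the initial pool can still be surfaced: if a high-scoring document links to it in the corpus graph, it enters the frontier and competes for the remaining budget.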
To obtain a practical sense of the computational overhead of Gar, we conduct latency tests. To isolate the effect of Gar itself, we find it necessary to factor out the overhead of the re-ranking model, since the variance in latency between neural scoring runs often exceeds the overhead introduced by Gar. To this end, we pre-compute and store all the needed query-document scores and simply look them up as the documents would otherwise be scored. We then test various re-ranking budgets (c) on DL19, taking 10 latency measurements each of the typical re-ranking process and of Gar. Table 4 reports the differences between the latency of Gar and the typical re-ranking results, isolating the overhead of Gar itself. We find that Gar introduces less than 37.3ms of overhead per 1000 documents scored (i.e., 2.68-3.73ms of overhead per 100 documents scored), on average, using 16 documents per batch. We report results using the semantic TCT-based corpus graph, though we find little difference when using the lexical BM25-based corpus graph. The overhead can be further reduced (down to 3.1ms per 100 documents) by using a larger batch size, i.e., 64 documents per batch; we explore the effect of the batch size parameter on effectiveness in Section 6.3. When compared to the cost of monoT5 scoring (rightmost column in Table 4), the Gar process adds negligible overhead, typically amounting to less than a 2% increase in latency, and falls within the variance of the scoring function's latency for low re-ranking budgets.
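The score-caching trick used to isolate the pipeline's overhead can be sketched as follows. This is a hypothetical harness (the names `measure_overhead` and `rerank_fn` are ours): `rerank_fn` stands in for either the typical re-ranking loop or the Gar process, and the neural scorer is replaced by a dictionary lookup.

```python
import time

def measure_overhead(rerank_fn, queries, cached_scores, runs=10):
    """Time a re-ranking procedure with neural scoring replaced by a lookup.

    cached_scores[qid][doc_id] holds pre-computed query-document scores, so
    the measurement reflects only the pipeline's own logic (graph lookups,
    heap updates), not the variance of GPU inference.
    """
    per_query = []
    for _ in range(runs):
        start = time.perf_counter()
        for qid in queries:
            # The "scorer" passed to the pipeline is just a dict lookup.
            rerank_fn(qid, lambda doc, q=qid: cached_scores[q][doc])
        per_query.append((time.perf_counter() - start) / len(queries))
    return sum(per_query) / runs  # mean seconds per query
```

Subtracting the measurement for the typical re-ranker from the measurement for Gar then yields the overhead attributable to Gar alone, as in Table 4.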
This experiment answers RQ2: the online computational overhead of Gar is minimal. It can be efficiently implemented using a heap, and adds only around 3-4ms per 100 documents in the re-ranking budget. This overhead is negligible when compared with the latency of a leading neural scoring function, though it will represent a higher proportion for more efficient scoring functions.

Robustness to Parameters
Recall that Gar introduces two new parameters: the number of nearest neighbours in the corpus graph k and the batch size b. In this section, we conduct experiments to test whether Gar is robust to the settings of these parameters. We separately sweep k ∈ [1, 16] and b ∈ [1, 512] (by powers of 2) over DL19 with c = 1000 for all Gar pipelines, and present the effectiveness metrics in Figure 3.
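The one-at-a-time sweep can be sketched as follows. The `evaluate` hook is hypothetical (it would run a Gar pipeline on DL19 and return its metrics), and the fixed defaults k = 8 and b = 16 are our assumption, taken from the settings used elsewhere in this paper.

```python
def parameter_sweep(evaluate, k_fixed=8, b_fixed=16):
    """One-at-a-time sweep of the GAR parameters k and b, by powers of two.

    evaluate(k, b) -> dict of metrics for a given pipeline,
    e.g., {'nDCG': ..., 'MAP': ..., 'R@1k': ...}; supplied by the caller.
    """
    k_values = [2 ** i for i in range(5)]   # 1, 2, 4, 8, 16
    b_values = [2 ** i for i in range(10)]  # 1, 2, ..., 512
    return {
        'k': {k: evaluate(k, b_fixed) for k in k_values},
        'b': {b: evaluate(k_fixed, b) for b in b_values},
    }
```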
With regard to the number of graph neighbours k, the nDCG, MAP, and recall metrics are relatively stable from around k = 6 to k = 16 for almost all pipelines. MAP appears to be the least stable in this range, with some fluctuations in performance between k = 7 and k = 13. Recall appears to be most affected overall, with sharp gains for some pipelines between k = 1 and k = 4. This trend is also present for nDCG.
The batch size b is remarkably stable from b = 1 to b = 128, with only a blip in effectiveness for the BM25 graph at b = 16. The most prominent shift in performance occurs at large batch sizes, e.g., b = 512. We note that, when b = 512, the corpus graph can only be traversed for a single hop: the neighbours of the top-scoring documents from the frontier batch cannot be fed back into the re-ranking pool. This validates our technique of incorporating the feedback mechanism into the re-ranking process itself, which gives the model more chances to traverse the graph. While it may be tempting to prefer the stability of the system at very low batch sizes, we note that this affects performance: as seen in Section 6.2, lower batch sizes reduce the speed of Gar itself. Further, and more importantly, b imposes a maximum batch size on the scoring function itself; given that neural models benefit considerably in terms of throughput from larger batch sizes (since operations on the GPU are parallelised), larger values of b (e.g., b = 16 to b = 128) should be preferred for practical reasons.
To answer RQ3, we find that the performance of Gar is stable across various pipelines when the number of neighbours is sufficiently large (k ≥ 6) and the batch size is sufficiently low (b ≤ 128).

Baseline Performance
Section 6.1 established the effectiveness of Gar through ablations over a variety of re-ranking pipelines. We now explore how the approach fits into the broader context of approaches proposed for passage retrieval and ranking. We explore two classes of pipelines: 'Kitchen Sink' approaches that combine numerous approaches and models together, and 'Single-Model' approaches that involve only a single neural model at any stage. We select representative Gar variants based on nDCG@10 performance on DL19 (i.e., as a validation set), with DL20 again treated as the held-out test set. All systems use a re-ranking budget of c = 1000. In Table 5, we report nDCG@10 to allow comparisons against prior work. We also report the judgment rate at 10 to provide context about how missing information in the judgments may affect the nDCG@10 scores.
The Kitchen Sink results are reported in the top section of Table 5. All systems involve three ranking components: an initial retriever, a Mono-scorer (which assigns a relevance score to each document), and a Duo-scorer (which scores and aggregates pairs of documents). Duo-style models are known to improve the ranking of the top documents [32]. Although we leave the exploration of how Gar can be used to augment the Duo process directly for future work, we still want to check what effect Gar has on these pipelines. We ablate two Duo systems (based on either D2Q or SPLADE) using Gar for the first-stage re-ranker and a DuoT5-3b-based second-stage re-ranker (the second stage uses the suggested cutoff of 50 from [32]). We observe no significant difference in terms of precision of the top 10 results. However, Gar can still provide a significant improvement in terms of nDCG later in the ranking and in terms of recall. These results suggest that although Gar identifies more relevant documents, the Duo models are not capable of promoting them to the top ranks.
We next explore Single-Model systems, shown in the bottom section of Table 5. Using only a single model likely has some practical advantages: pipelines that use a single model tend to be simpler, and practitioners only need to train a single model. Here, we compare with a variety of systems that fall into this category, most notably the recently-proposed ColBERT-PRF approaches that operate over dense indexes [39]. A Gar pipeline that operates over BM25 results also falls into this category, since only a single neural model (the scorer) is needed. Among this group, Gar performs competitively, outmatched only by ColBERT-PRF [39] and the recent SPLADE [7] model (though the differences in performance are not statistically significant). Compared to these methods, though, Gar requires far less storage: the corpus graph for Gar is only around 283MB, while the index for SPLADE is 8GB, and the vectors required for ColBERT-PRF are 160GB.
To answer RQ4: we observe that Gar can be incorporated into a variety of larger, state-of-the-art re-ranking pipelines. It frequently boosts the recall of the systems it is applied to, though the scoring functions we explore tend to have difficulty making use of the additional relevant passages. This motivates exploring further improvements to re-ranking models. For instance, cross-encoder

Table 5: Performance of Gar compared to a variety of other baselines. Significant differences are computed within groups, with significance denoted with superscript letters (paired t-test, p < 0.05, Bonferroni correction). Rows marked with † are provided for additional context; their metrics are copied from other papers, so they do not include statistical tests.

Figure 1: Overview of Gar. Traditional re-ranking exclusively scores results seeded by the retriever. Gar (in green) adapts the re-ranking pool after each batch based on the computed scores and a pre-computed graph of the corpus.

Figure 2: Plot of the initial and final rankings of BM25»MonoT5-base using Gar with the TCT graph and c = 100, for the DL19 query 'how long is life cycle of flea'. The colour/size of dots indicates the relevance label. Lines between points indicate links followed in the corpus graph.

Figure 3: Performance of Gar as the number of neighbours in the corpus graph k and the batch size b vary. Each line represents a system from Table 2. The dashed blue (solid green) lines are for the BM25 (TCT) graph.

Table 1: Distribution of nearest neighbouring passages among pairs of judged passages in TREC DL 2019, based on BM25 and TCT-ColBERT-HNP similarity scores. Each cell represents the percentage of passages with a given relevance label (rows) whose nearest neighbour has the column's relevance label; each row sums to 100%.
Algorithm 1: Graph-based Adaptive Re-Ranking. Input: initial ranking R0, batch size b, budget c, corpus graph G. Output: re-ranked pool.