ColBERT-PRF: Semantic Pseudo-Relevance Feedback for Dense Passage and Document Retrieval

Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval – through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores – has shown promising performance on several information retrieval tasks, while no longer relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of a single embedded representation for each passage and query, e.g., using BERT's [CLS] token, or multiple representations, e.g., using an embedding for each token of the query and document (exemplified by ColBERT). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback, and present our proposed approach, ColBERT-PRF. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, ColBERT-PRF extracts representative feedback embeddings from the document embeddings of the pseudo-relevant set. Among these representative feedback embeddings, the embeddings that most highly discriminate among documents are employed as the expansion embeddings, which are then added to the original query representation. We show that these additional expansion embeddings enhance the effectiveness of both a reranking of the initial query results and an additional dense retrieval operation.
Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach. We further validate the effectiveness of our proposed pseudo-relevance feedback technique for a dense retrieval model on the MSMARCO document ranking and TREC Robust04 document ranking tasks. For instance, ColBERT-PRF exhibits up to 21% and 14% improvements in MAP over the ColBERT E2E model on the MSMARCO document ranking task for the TREC 2019 and TREC 2020 query sets, respectively. Additionally, we study the effectiveness of variants of the ColBERT-PRF model with different weighting methods. Finally, we show that ColBERT-PRF can be made more efficient, attaining up to a 4.54× speedup over the default ColBERT-PRF model with little impact on effectiveness, through the application of approximate scoring and different clustering methods.


INTRODUCTION
When searching for information, users often formulate queries in a different way to the relevant documents. For instance, a user may search for information about "surname meaning" using the query "where do last names come from". However, a relevant document may describe the "last name" using "family name" or "surname", and may use terms such as "originate" or "history" instead of "come from". Thus, a lexical gap can arise between a relevant document and the user query, which must be bridged for effective retrieval.
Query expansion approaches, which rewrite the user's query, have been shown to be effective in alleviating the vocabulary discrepancies between the user query and the relevant documents, by modifying the user's original query to improve retrieval effectiveness. Many approaches follow the pseudo-relevance feedback (PRF) paradigm – such as Rocchio's algorithm [37], the RM3 relevance language model [1], or the DFR query expansion models [4] – where terms appearing in the top-ranked documents for the initial query are used to expand it. Query expansion (QE) approaches have also found a useful role when integrated with effective BERT-based neural reranking models, by providing a high quality set of candidate documents obtained using the expanded query, which can then be reranked [35,42,47].
On the other hand, many studies have focused on the use of static word embeddings, such as Word2Vec, within query expansion methods [12,19,39,40]. Indeed, most of the existing embedding-based QE methods [12,19,39,40,49] are based on static embeddings, where a word embedding is always the same within different sentences, and hence they do not address contextualised language models such as BERT. Recently, CEQE [29] was proposed, which makes use of contextualised BERT embeddings for query expansion. The resulting refined query representation is then used for a further round of retrieval using a traditional (sparse) inverted index. In contrast, in this paper, we focus on implementing contextualised embedding-based query expansion for dense retrieval.
Indeed, BERT models have demonstrated further promise as a suitable basis for dense retrieval. In particular, instead of using a classical inverted index, in dense retrieval the documents and queries are represented using embeddings. The documents can then be retrieved using an approximate nearest neighbour algorithm – as exemplified by the FAISS toolkit [15]. Two distinct families of approaches have emerged: single representation dense retrieval and multiple representation dense retrieval. In single representation dense retrieval, as used by DPR [16] and ANCE [46], each query or document is represented entirely by the single embedding of the [CLS] (classification) token computed by BERT. Query-document relevance is estimated in terms of the similarity of the corresponding [CLS] embeddings. In contrast, in multiple representation dense retrieval – as proposed by ColBERT [17] – each term of the queries and documents is represented by a single embedding.

RELATED WORK

Classical pseudo-relevance feedback models such as Bo1 [4], KL [2], and the RM3 relevance model [1] have demonstrated their effectiveness on many test collections. Typically, these models identify and weight feedback terms that are frequent in the feedback documents and infrequent in the corpus, by exploiting statistical information about the occurrence of terms in the documents and in the whole collection. In all cases, the reformulated query is then re-executed on the traditional (so-called sparse) inverted index.
Recently, deep learning solutions based on transformer networks have been used to enrich the statistical information about terms by rewriting or expanding the collection of documents. For instance, DeepCT [10] reweights terms occurring in the documents according to a fine-tuned BERT model to highlight important terms. This results in augmented document representations, which can be indexed using a traditional inverted indexer. Similarly, doc2query [33] and its more modern variant docT5query [32] apply text-to-text translation models to each document in the collection to suggest queries that may be relevant to the document. When the suggested queries are indexed along with the original document, the retrieval effectiveness is enhanced.
More recently, instead of leveraging (augmented) statistical information such as the in-document and collection frequency of terms to model a query or a document, dense representations, also known as embeddings, are becoming commonplace. Embeddings encode terms in queries and documents by learning a vector representation for each term, which takes into account the word's semantics and context. Instead of identifying the related terms in the pseudo-relevance feedback documents using statistical methods, embedding-based query expansion methods [12,19,39,40,49] expand a query with terms that are closest to the query terms in the word embedding space. However, the expansion terms may not be sufficiently informative to distinguish relevant documents from non-relevant documents – for instance, the embedding of "grows" may be closest to "grow" in the embedding space, but adding "grows" to the query may not help to identify more relevant documents. Moreover, all these embedding-based methods are based on non-contextualised embeddings, where a word embedding is always the same within different sentences, and hence they do not address contextualised language models. Pre-trained contextualised language models such as BERT [11] have brought large effectiveness improvements over the prior art in information retrieval tasks. In particular, deep learning is able to successfully exploit general language features in order to capture the contextual semantic signals, allowing it to better estimate the relevance of documents w.r.t. a given query.
Query expansion approaches have been used for generating a high quality pool of candidate documents to be reranked by effective BERT-based neural reranking models [35,42,47]. However, the use of BERT models directly within the pseudo-relevance feedback mechanism has seen comparatively little use in the literature. The current approaches leveraging the BERT contextualised embeddings for PRF are Neural PRF [20], BERT-QE [51], and CEQE [29].
In particular, Neural PRF uses neural ranking models, such as DRMM [14] and KNRM [45], to score the similarity of a document to a top-ranked feedback document. BERT-QE is conceptually similar to Neural PRF, but it measures the similarity of each document w.r.t. feedback chunks that are extracted from the top-ranked feedback documents. This results in an expensive application of many BERT computations – approximately 11× as many GPU operations as a simple BERT reranker [51]. Both the Neural PRF and BERT-QE approaches leverage contextualised language models to rerank an initial ranking of documents retrieved by a preliminary sparse retrieval system. However, they cannot identify any new relevant documents from the collection that were not retrieved in the initial ranking.
Meanwhile, Rocchio's relevance feedback algorithm has also been implemented for a learned sparse index by SNRM [50]. However, this model relies on a sparse index representation, which loses the advantages of dense retrieval. CEQE exploits BERT to compute contextualised representations for the query as well as for the terms in the top-ranked feedback documents, and then selects as expansion terms those which are the closest to the query embeddings according to some similarity measure. In contrast to Neural PRF and BERT-QE, CEQE is used to generate a new query of terms for execution upon a traditional (sparse) inverted index. This means that the contextual meaning of an expansion term is lost -for instance, a polysemous word added to the query can result in a topic drift.
In contrast to the aforementioned approaches, our proposed ColBERT-PRF approach can be exploited in a dense retrieval system, in both end-to-end ranking and reranking scenarios. Dense retrieval approaches, exemplified by ANCE [46] and ColBERT [17], are of increasing interest, due to their use of BERT embedding(s) for representing queries and documents. By directly using the BERT embeddings for retrieval, topic drift for polysemous words can be avoided. Concurrently with our work, ANCE-PRF [22,48] has been proposed to improve the effectiveness of the single representation ANCE model by retraining the query encoder using pseudo-relevance feedback information. In contrast, our work does not require any further training. To the best of our knowledge, ColBERT-PRF is the first work investigating PRF in a multiple representation dense retrieval setting.

MULTIPLE REPRESENTATION DENSE RETRIEVAL
The queries and documents are represented by tokens from a vocabulary V. Each token occurrence has a contextualised real-valued vector with dimension d, called an embedding. More formally, let f : V^n → R^{n×d} be a function mapping a sequence of n tokens to their embeddings, such that a query q, composed of |q| tokens, is mapped into a set of embeddings {φ_q1, ..., φ_q|q|}, and a document d, composed of |d| tokens, into a set of embeddings {φ_d1, ..., φ_d|d|}.
Khattab & Zaharia [17] recommended that the number of query embeddings be 32, with extra [MASK] tokens being used as query augmentation. Indeed, these mask tokens are a differentiable mechanism that allows documents to gain score contributions from embeddings that do not actually occur in the query, but which the model assumes could be present in the query. In practice, as we later show in Section 4.4, the [MASK] embeddings are very similar to embeddings of the existing query tokens, and hence cannot be considered as a form of query expansion. Moreover, they do not make use of pseudo-relevance feedback information obtained from the top-ranked documents of the original query, which has repeatedly been shown to be an effective source to improve query representations.
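Schematically, ColBERT's query augmentation pads the tokenised query to a fixed length of 32 with [MASK] tokens before encoding. The sketch below illustrates only the padding step; the function name is ours, and the real ColBERT implementation operates on BERT token ids rather than strings:

```python
def augment_query_tokens(tokens, max_len=32, mask_token="[MASK]"):
    """Pad the query token sequence to a fixed length with [MASK] tokens,
    truncating if the query is longer than max_len (ColBERT-style augmentation)."""
    return (tokens + [mask_token] * max_len)[:max_len]
```

The padded sequence is then encoded by BERT, so each [MASK] position also receives a contextualised embedding that contributes to scoring.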
The similarity of two embeddings is computed by the dot product. Hence, for a query q and a document d, their similarity score s(q, d) is obtained by summing the maximum similarity between the query token embeddings and the document token embeddings [17]:

s(q, d) = Σ_{i=1}^{|q|} max_{j=1,...,|d|} φ_qi^T φ_dj    (1)

Indeed, Formal et al. [13] showed that the dot product φ_qi^T φ_dj used by ColBERT implicitly encapsulates token importance, by giving higher scores to tokens that have higher IDF values.
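The max-sim scoring of Equation (1) can be sketched in a few lines of numpy; this is an illustrative reimplementation of the scoring function, not ColBERT's batched GPU code:

```python
import numpy as np

def maxsim_score(Q, D):
    """ColBERT max-sim: for each query token embedding, take the largest
    dot product over all document token embeddings, then sum over the query.
    Q has shape (|q|, dim); D has shape (|d|, dim)."""
    sim = Q @ D.T                 # (|q|, |d|) pairwise dot products
    return sim.max(axis=1).sum()  # max over document tokens, sum over query tokens
```

Because each query embedding independently picks its best-matching document embedding, a document can score well even when only some of its tokens align with the query.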
To obtain a first set of candidate documents, Khattab & Zaharia [17] make use of FAISS, an approximate nearest neighbour search library, on the pre-computed document embeddings. Conceptually, FAISS allows retrieving the k documents containing the nearest neighbour document embeddings to a query embedding φ_qi, i.e., it provides a function F_d(φ_qi, k) → (d, ...) that returns a list of k documents, sorted in decreasing approximate scores.
However, these approximate scores are insufficient for accurately depicting the similarity scores of the documents, hence the accurate final document scores are computed using Equation (1) in a second pass. Typically, for each query embedding, the nearest k = 1,000 documents are identified. The set formed by the union of these documents is reranked 1 using Equation (1). A separate index data structure (typically in memory) is used to store the uncompressed embeddings for each document. To the best of our knowledge, ColBERT [17] exemplifies the implementation of an end-to-end IR system that uses multiple representations. Algorithm 1 summarises the ColBERT retrieval algorithm for the end-to-end dense retrieval approach proposed by Khattab & Zaharia, while the top part of Table 1 summarises the notation for the main components of the algorithm. The easy access to the document embeddings used by ColBERT provides an excellent basis for our dense retrieval pseudo-relevance feedback approach. Indeed, while the use of embeddings in ColBERT addresses the vocabulary mismatch problem, we argue that identifying more related embeddings from the top-ranked documents may help to further refine the document ranking. In particular, as we will show, this permits representative embeddings from a set of pseudo-relevant documents to be used to refine the query representation φ.
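The first-pass candidate generation (function F_d) can be sketched as follows; here a brute-force numpy search stands in for the FAISS ANN index, and `emb2doc` (mapping each stored embedding to its owning document) is an illustrative name rather than the ColBERT API:

```python
import numpy as np

def first_pass_candidates(Q, doc_embs, emb2doc, k):
    """For each query embedding, find the k nearest document embeddings by
    dot product and return the union of the documents that own them; the
    real system then rescores this candidate set exactly with Equation (1)."""
    sim = Q @ doc_embs.T                    # (|q|, n_embeddings) similarities
    topk = np.argsort(-sim, axis=1)[:, :k]  # indices of the k nearest embeddings
    return {emb2doc[j] for row in topk for j in row}
```

In the real system this lookup is approximate and index-accelerated; only the second pass computes exact max-sim scores.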

DENSE PSEUDO-RELEVANCE FEEDBACK
The aim of a pseudo-relevance feedback approach is typically to generate a refined query representation by analysing the text of the feedback documents. In our proposed ColBERT-PRF approach, we are inspired by conventional PRF approaches such as Bo1 [4] and RM3 [1], which assume that good expansion terms will occur frequently in the feedback set (and hence are somehow representative of the information need underlying the query), but infrequently in the collection as a whole (and therefore are sufficiently discriminative). Therefore, we aim to encapsulate these intuitions while operating in the contextualised embedding space R^d, where the exact counting of frequencies is not actually possible. In particular, by operating entirely in the embedding space rather than directly on tokens, we conjecture that we can identify similar embeddings (corresponding to tokens with similar contexts), which can be added to the query representation for improved effectiveness. 2 The bottom part of Table 1 summarises the main notations that we use in describing ColBERT-PRF.
In this section, we detail how we identify representative (centroid) embeddings from the feedback documents (Section 4.1), how we ensure that those centroid embeddings are sufficiently discriminative (Section 4.2), and how we apply these discriminative representative centroid embeddings for (re)ranking (Section 4.3). We conclude with an illustrative example (Section 4.4) and a discussion of the novelty of ColBERT-PRF (Section 4.5).

Representative Embeddings in Feedback Documents
First, we need to identify representative embeddings {υ_1, ..., υ_K} among all the embeddings in the feedback document set. A typical "sparse" PRF approach – such as RM3 – would count the frequency of terms occurring in the feedback set to identify representative ones. However, in a dense embedded setting, the document embeddings are not countable. Instead, we resort to clustering to identify patterns in the embedding space that are representative of the embeddings occurring in the feedback documents.
Specifically, let Φ(q, f_b) be the set of all document embeddings from the f_b top-ranked feedback documents. Then, we apply a clustering approach, e.g., the KMeans clustering algorithm, to Φ(q, f_b):

{υ_1, ..., υ_K} = KMeans(K, Φ(q, f_b))    (2)

By applying the clustering algorithm, we obtain K representative centroid embeddings of the feedback documents. The embeddings forming each cluster may or may not correspond to the exact same tokens spread across the feedback documents. In this way, a cluster can represent one or more tokens that appear in similar contexts, rather than a particular exact token. This is a key advantage of ColBERT-PRF.
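A minimal sketch of this clustering step using scikit-learn's KMeans (the function name and shapes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def feedback_centroids(feedback_embs, K):
    """Cluster all token embeddings from the f_b feedback documents and
    return the K representative centroid embeddings."""
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(feedback_embs)
    return km.cluster_centers_  # shape (K, dim)
```

Note that the centroids live in the same embedding space as the query and document embeddings, so they can later be scored against documents directly.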

Identifying Discriminative Embeddings among Representative Embeddings
Many of the K representative embeddings may represent stopwords and therefore are not sufficiently informative when retrieving documents. Typically, identifying informative and discriminative expansion terms from feedback documents would involve examining the collection frequency or the document frequency of the constituent terms [6,38]. However, there may not be a one-to-one relationship between query/centroid embeddings and actual tokens, hence we seek to map each centroid υ_i to a possible token t. We resort to FAISS to achieve this, through the function F_t(υ_i, r) → (t, ...) that, given the centroid embedding υ_i and r, returns the list of the r token ids corresponding to the r closest document embeddings to the centroid. 3 From a probabilistic viewpoint, the likelihood P(t|υ_i) of a token t given an embedding υ_i can be obtained as:

P(t|υ_i) = (1/r) Σ_{t' ∈ F_t(υ_i, r)} 1[t' = t]    (3)

where 1[] is the indicator function. For simplicity, we choose the most likely token id, i.e., t_i = arg max_t P(t|υ_i). Mapping back to a token id allows us to make use of Inverse Document Frequency (IDF), which can be pre-recorded for each token id. The importance σ_i of a centroid embedding υ_i is obtained using a traditional IDF formula: 4

σ_i = log(N / N_i)

where N_i is the number of passages containing the token t_i and N is the total number of passages in the collection. While this approximation of embedding informativeness is obtained by mapping back to tokens, as we shall show, it is very effective. In addition, we will discuss different derivations of a tailored informativeness measure in Section 7, including the Inverse Collection Term Frequency and Mean Cosine Similarity methods. Finally, we select the f_e most informative centroids as expansion embeddings based on the σ_i importance scores as follows:

F_e = TopScoring({(υ_1, σ_1), ..., (υ_K, σ_K)}, f_e)    (4)

where TopScoring(A, c) returns the c elements of A with the highest importance score.
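The token mapping, IDF weighting, and TopScoring selection can be sketched together as follows; `nearest_token_ids` (the output of the F_t FAISS lookup), `df` (pre-recorded document frequencies), and the function name are hypothetical stand-ins for the actual implementation:

```python
import numpy as np
from collections import Counter

def select_expansion_embeddings(centroids, nearest_token_ids, df, N, f_e):
    """For each centroid, pick the most likely token id among the r nearest
    document embeddings, weight the centroid by that token's IDF, and keep
    the f_e highest-weighted (sigma, centroid, token) triples."""
    scored = []
    for v, tok_ids in zip(centroids, nearest_token_ids):
        t = Counter(tok_ids).most_common(1)[0][0]  # t_i = arg max_t P(t | v_i)
        sigma = np.log(N / df[t])                  # traditional IDF weight
        scored.append((sigma, v, t))
    scored.sort(key=lambda x: -x[0])               # TopScoring by sigma
    return scored[:f_e]
```

A centroid whose nearest token is a stopword receives a near-zero IDF weight and is thus unlikely to survive the TopScoring cut.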

Ranking and Reranking with ColBERT-PRF
Given the original |q| query embeddings and the f_e expansion embeddings, we incorporate the score contributions of the expansion embeddings in Equation (1) as follows:

s(q, d) = Σ_{i=1}^{|q|} max_{j=1,...,|d|} φ_qi^T φ_dj + β Σ_{i=1}^{f_e} σ_i max_{j=1,...,|d|} υ_i^T φ_dj    (5)

where β > 0 is a parameter weighting the contribution of the expansion embeddings, and the score produced by each expansion embedding is further weighted by the IDF weight of its most likely token, σ_i. Note that Equation (5) can be applied to rerank the documents obtained from the initial query, or as part of a full re-execution of the dense retrieval operation including the additional f_e expansion embeddings. In both ranking and reranking, ColBERT-PRF has four parameters: f_b, the number of feedback documents; K, the number of clusters; f_e ≤ K, the number of expansion embeddings; and β, the importance of the expansion embeddings during scoring. Figure 1 presents the five stages of ColBERT-PRF in its ranking configuration. Furthermore, we provide the pseudo-code of our proposed ColBERT-PRF ReRanker in Algorithm 2. The ColBERT-PRF Ranker can be easily obtained by inserting lines 3-4 of Algorithm 1 at line 10 of Algorithm 2, to perform retrieval using both the original query embeddings and the expansion embeddings, and by similarly adapting the max-sim scoring in Equation (1) to encapsulate the original query embeddings as well as the expansion embeddings.
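A numpy sketch of the augmented scoring of Equation (5), with illustrative variable names (Q: the |q| original query embeddings, E: the f_e expansion embeddings, sigmas: their IDF weights):

```python
import numpy as np

def colbert_prf_score(Q, E, sigmas, D, beta):
    """Original max-sim score plus beta-weighted, IDF-scaled max-sim
    contributions of the expansion embeddings.
    Q: (|q|, dim), E: (f_e, dim), sigmas: (f_e,), D: (|d|, dim)."""
    original = (Q @ D.T).max(axis=1).sum()
    expansion = (sigmas * (E @ D.T).max(axis=1)).sum()
    return original + beta * expansion
```

Setting beta = 0 recovers the unexpanded ColBERT scoring, which makes the contribution of the expansion embeddings easy to isolate.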

Illustrative Example
We now illustrate the effect of ColBERT-PRF upon one query from the TREC 2019 Deep Learning track, 'do goldfish grow'. We use PCA to project the 128-dimensional embeddings into two dimensions, purely to allow visualisation. Firstly, Figure 2(a) shows the embeddings of the query terms (##fish, gold, grow). Meanwhile, document embeddings extracted from 10 feedback documents are shown as light blue ellipses in Figure 2(a). There appear to be visible clusters of document embeddings near the query embeddings, but other document embeddings also exhibit some clustering. The mass of embeddings near the origin is not distinguishable in PCA. Figure 2(b) demonstrates the application of KMeans clustering upon the document embeddings; we map back to the original tokens by virtue of Equation (3). In Figure 2(b), the point size is indicative of the IDF of the corresponding token. We can see that the cluster centroids with high IDF correspond to the original query tokens ('gold', '##fish', 'grow'), as well as the related terms ('tank', 'size'). In contrast, a centroid with low IDF is 'the'. This illustrates the utility of our proposed ColBERT-PRF approach in using KMeans to identify representative clusters of embeddings, as well as using IDF to differentiate useful clusters. Furthermore, Figure 2(b) also includes, marked by an × and denoted 'tank (war)', the embedding for the word 'tank' when placed in the passage "While the soldiers advanced, the tank bombarded the troops with artillery". It can be seen that, even in the highly compressed PCA space, the 'tank' centroid embedding is distinct from the embedding of 'tank (war)'. This shows the utility of ColBERT-PRF when operating in the embedding space, as the PRF process for the query 'do goldfish grow' will not retrieve documents containing 'tank (war)', but will focus on a fish-related context, thereby dealing with the polysemous nature of a word such as 'tank'. To the best of our knowledge, this is a unique feature of ColBERT-PRF among PRF approaches.
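The two-dimensional PCA projection used purely for visualisation can be reproduced with scikit-learn; the random matrix below is a stand-in for real 128-dimensional ColBERT embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for query/document embeddings: 200 vectors of dimension 128.
rng = np.random.default_rng(0)
embs = rng.normal(size=(200, 128))

# Project to 2-D for scatter-plotting (e.g., with matplotlib).
coords = PCA(n_components=2).fit_transform(embs)  # shape (200, 2)
```

As noted above, such a projection is lossy: points that appear close near the origin may be well separated in the original 128-dimensional space.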

Discussion
To the best of our knowledge, ColBERT-PRF is the first investigation of pseudo-relevance feedback for multiple representation dense retrieval. Existing works on neural pseudo-relevance feedback, such as Neural PRF [20] and BERT-QE [51], only function as rerankers. Other approaches such as DeepCT [10] and doc2query [32,33] use neural models to augment documents before indexing using a traditional inverted index. CEQE [29] generates words to expand the initial query, which is then executed on the inverted index. However, returning the BERT embeddings back to textual word forms can result in polysemous words negatively affecting retrieval. In contrast, ColBERT-PRF operates entirely on an existing dense index representation (without augmenting documents), and can function for both ranking and reranking. By retrieving using feedback embeddings directly, ColBERT-PRF addresses polysemous words (such as 'tank', illustrated above). It is also of note that ColBERT-PRF requires no additional neural network training beyond that of ColBERT. Indeed, while ANCE-PRF requires further training of the refined query encoder, ColBERT-PRF does not require any further retraining. Furthermore, compared to the single embedding of ANCE-PRF, ColBERT-PRF is also more explainable in nature, as the expansion embeddings can be mapped to tokens (as shown in Figure 2), and their contribution to document scoring can be examined, as we will show in Section 5.3.4.
In the following, we first show the retrieval effectiveness of ColBERT-PRF for the passage ranking and document ranking tasks in Sections 5 and 6, respectively. In particular, in Section 5, we examine the characteristics of ColBERT-PRF, including how ColBERT-PRF addresses polysemous words, how it performs compared with traditional query expansion techniques, and how to quantify the extent of its semantic matching ability. Next, we discuss three variants of ColBERT-PRF with different discriminative power measures in Section 7, and we address the effectiveness and efficiency trade-off of ColBERT-PRF in Section 8.

PASSAGE RANKING EFFECTIVENESS OF COLBERT-PRF
In this section, we analyse the performance of ColBERT-PRF for passage ranking. In particular, we evaluated the performance of ColBERT-PRF on TREC 2019 and TREC 2020 query sets. Section 5.1 describes the research question addressed by our passage ranking experiments. The experimental setup and the obtained results are detailed in Sections 5.2 and 5.3, respectively.

Research Questions
Our passage ranking experiments address the following four research questions: • RQ1: Can a multiple representation dense retrieval approach be enhanced by pseudo-relevance feedback, i.e., can ColBERT-PRF outperform ColBERT dense retrieval?

Dataset & Measures.
Experiments are conducted on the MSMARCO passage corpus, using the TREC 2019 Deep Learning track topics (43 topics with an average of 215.35 relevance judgements per query) and the TREC 2020 Deep Learning track topics (54 topics with an average of 210.85 relevance judgements per query) from the TREC DL passage ranking task. We omit topics from the MSMARCO Dev set, which have only sparse judgements, ∼1.1 per query. Indeed, pseudo-relevance feedback approaches are known to be not effective on test collections with few judged passages [3].
We report the commonly used metrics for the TREC 2019 and TREC 2020 query sets following the corresponding track overview papers [7,8]: we report mean reciprocal rank (MRR) and normalised discounted cumulative gain (NDCG) calculated at rank 10, as well as Recall and Mean Average Precision (MAP) at rank 1000 [8]. For the MRR, MAP and Recall metrics, we treat passages with label grade 1 as non-relevant, following [7,8]. In addition, we also report the Mean Response Time (MRT) for each retrieval system. For significance testing, we use the paired t-test (p < 0.05) and apply the Holm-Bonferroni multiple testing correction.
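The significance-testing procedure can be sketched as follows: a paired t-test per system comparison (e.g., via scipy.stats.ttest_rel on per-topic scores) yields p-values, to which the Holm-Bonferroni step-down correction is applied. The function below is our own illustrative implementation of the correction, not code from the paper:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm-Bonferroni step-down: visit hypotheses in ascending p-value
    order, rejecting while p <= alpha / (number of remaining hypotheses);
    the first failure also fails all hypotheses with larger p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down stops at the first non-rejected hypothesis
    return reject
```

Holm's procedure is uniformly more powerful than the plain Bonferroni correction while still controlling the family-wise error rate.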

Implementation and Settings.
We conduct experiments using PyTerrier [25] and, in particular, using our PyTerrier_ColBERT plugin, 5 which includes ColBERT-PRF as well as our adaptations of the ColBERT source code. ColBERT and ColBERT-PRF are expressed as PyTerrier transformer operations – the source code of the ColBERT-PRF ranker and re-ranker pipelines is shown in Appendix A.2.
In terms of the ColBERT configuration, we train ColBERT upon the MSMARCO passage ranking triples file for 44,000 batches, applying the parameters specified by Khattab & Zaharia in [17]: the maximum document length is set to 180 tokens and queries are encoded into 32 query embeddings (including [MASK] tokens); we encode all passages into a FAISS index that has been trained using 5% of all embeddings; at retrieval time, FAISS retrieves k = 1000 passage embeddings for every query embedding. ColBERT-PRF is implemented using the KMeans implementation [5] of scikit-learn (sklearn). For the query expansion settings, we follow the default settings of Terrier [34], which is 10 expansion terms obtained from three feedback passages; we follow the same default setting for ColBERT-PRF, additionally using representative values, namely K = 24 clusters, 6 and β = {0.5, 1} for the weight of the expansion embeddings. We later show the impact of these parameters when we address RQ3.

Baselines.
To test the effectiveness of our proposed dense PRF approach, we compare with five families of baseline models, for which we vary the use of a BERT-based reranker (namely BERT or ColBERT). For the BERT reranker, we use OpenNIR [24] and the capreolus/bert-base-msmarco fine-tuned model from [21]. For the ColBERT reranker, unless otherwise noted, we use the existing pre-indexed ColBERT representation of passages for efficient reranking. The five families are:
Lexical Retrieval Approaches: These are traditional retrieval models using a sparse inverted index, with and without BERT and ColBERT rerankers, namely: (i) BM25, (ii) BM25+BERT, (iii) BM25+ColBERT, (iv) BM25+RM3, (v) BM25+RM3+BERT, and (vi) BM25+RM3+ColBERT.
Neural Augmentation Approaches: These use neural components to augment the (sparse) inverted index: (i) BM25+DeepCT and (ii) BM25+docT5query, both without and with BERT and ColBERT rerankers. For BM25+docT5query+ColBERT, the ColBERT reranker is applied on expanded passage texts encoded at querying time, rather than the indexed ColBERT representation. The response time for BM25+docT5query+ColBERT reflects this difference.
Dense Retrieval Models: This family consists of the dense retrieval approaches: (i) ANCE: The ANCE [46] model is a single representation dense retrieval model. We use the trained models provided by the authors, trained on MSMARCO training data. (ii) ANCE-PRF: ANCE-PRF [48] is a PRF variant of the ANCE model – we use the results released by the authors. (iii) ColBERT E2E: ColBERT end-to-end (E2E) [17] is the dense retrieval version of ColBERT, as defined in Section 3.

BERT-QE Models:
We apply BERT-QE [51] on top of a strong sparse baseline and our dense retrieval baseline, ColBERT E2E, i.e., (i) BM25+RM3+ColBERT+BERT-QE and (ii) ColBERT E2E+BERT-QE. Where possible, we use the ColBERT index for scoring passages; for identifying the top scoring chunks within passages, we use ColBERT in a slower "text" mode, i.e., without using the index. For the BERT-QE parameters, we follow the settings in [51], in particular using the recommended settings of α = 0.4 and β = 0.9, which are also the most effective on MSMARCO. Indeed, to the best of our knowledge, this is the first application of BERT-QE upon dense retrieval, the first application of BERT-QE on MSMARCO, and the first application using ColBERT. We did attempt to apply BERT-QE using the BERT reranker, but we found it to be ineffective on MSMARCO, and exhibiting a response time exceeding 30 seconds per query, hence we omit it from our experiments.
CEQE Models: This family consists of three CEQE variants [29], i.e., CEQE-Max, CEQE-Centroid, and CEQE-Mul. We apply each CEQE query expansion variant on top of the documents retrieved by BM25. Compared with the original CEQE, we apply the pipeline BM25 + RM3 + BM25, rather than the Dirichlet LM + RM3 + BM25 pipeline, for generating the expansion terms. For reproducibility, the ColBERT-PRF and baseline results are available in our virtual appendix. 7

Results for RQ1 -Overall Effectiveness of ColBERT-PRF.
In this section, we examine the effectiveness of a pseudo-relevance feedback technique for the ColBERT dense retrieval model on the passage ranking task. On analysing Table 2, we first note that the ColBERT dense retrieval approach outperforms the single representation dense retrieval models, i.e., ANCE and its PRF variant ANCE-PRF, on all metrics on both test query sets, probably because the single representation used in ANCE provides limited information for matching queries and documents [23]. In particular, compared with ANCE-PRF, ColBERT-PRF shows marked improvements on all metrics for both query sets, and shows significant improvements in terms of MAP on TREC 2019 and NDCG@10 on TREC 2020. This indicates that a PRF mechanism that explicitly expands the query with expansion embeddings to refine the query representation is superior to implicitly learning from PRF information to form a better query representation.
Based on this, we then compare the performances of our proposed ColBERT-PRF models, instantiated as ColBERT-PRF Ranker & ColBERT-PRF ReRanker, with the more effective ColBERT E2E model. We find that both the Ranker and ReRanker models outperform ColBERT E2E on all metrics for both query sets. Notably, on the TREC 2019 test queries, both the Ranker and ReRanker models exhibit significant improvements in terms of MAP over the ColBERT E2E model. In particular, for ColBERT-PRF Ranker, we observe a 26% increase in MAP on TREC 2019 and a 10% increase on TREC 2020 over ColBERT E2E. In addition, both ColBERT-PRF Ranker and ReRanker exhibit significant improvements over ColBERT E2E in terms of NDCG@10 on the TREC 2019 queries.
The high effectiveness of ColBERT-PRF Ranker (which is indeed higher than that of ColBERT-PRF ReRanker) can be explained in that the expanded query obtained using the PRF process retrieves more relevant passages, thus increasing recall when the query is re-executed on the dense index. As can be seen from Table 2, ColBERT-PRF Ranker exhibits significant improvements in Recall over both the ANCE and ColBERT E2E models. On the other hand, the effectiveness of ColBERT-PRF ReRanker also suggests that the expanded query provides a better query representation, which can better rank documents in the existing candidate set. Overall, in response to RQ1, we conclude that our proposed ColBERT-PRF model is effective compared to the ColBERT E2E dense retrieval model.

Results for RQ2 – Comparison to Baselines.
Next, to address RQ2(a)-(c), we analyse the performances of the ColBERT-PRF Ranker and ColBERT-PRF ReRanker approaches in comparison to different groups of baselines, namely sparse (lexical) retrieval approaches, neural augmented baselines, and BERT-QE. For RQ2(a), we compare the ColBERT-PRF Ranker and ReRanker models with the lexical retrieval approaches. For both query sets, both Ranker and ReRanker provide significant improvements on all evaluation measures compared to the BM25 and BM25+RM3 models. This is mainly due to the more effective contextualised representation employed in the ColBERT-PRF models compared to the traditional sparse representation used in the lexical retrieval approaches. Furthermore, both ColBERT-PRF Ranker and ReRanker outperform the sparse retrieval approaches when reranked by either the BERT or the ColBERT models (e.g., BM25+(Col)BERT and BM25+RM3+(Col)BERT) on all metrics. In particular, ColBERT-PRF Ranker exhibits marked improvements over the BM25 with BERT or ColBERT reranking approaches for MAP on the TREC 2019 queries. This indicates that our query expansion in the contextualised embedding space produces query representations that result in improved retrieval effectiveness. Hence, in answer to RQ2(a), we find that our proposed ColBERT-PRF models show significant improvements in retrieval effectiveness over sparse baselines.
To further gauge the extent of the improvements brought by the additional PRF information in the sparse and dense retrieval paradigms, we compare the performance improvements in terms of MAP for ColBERT-PRF vs. ColBERT, ANCE-PRF vs. ANCE, and BM25+RM3 vs. BM25 in Figure 3. We observe that more queries are improved, and by a larger margin, by ColBERT-PRF than by either RM3 or ANCE-PRF. Furthermore, from Figure 3, we find that most of the queries that failed for ColBERT-PRF also failed for the ANCE-PRF and RM3 approaches. These are hard queries that any PRF technique may struggle to improve. On the other hand, in Table 3, we present the number of queries whose performances are improved, unchanged and degraded when comparing a retrieval system with and without a PRF mechanism applied. We find that ColBERT-PRF has the highest number of improved queries and the lowest number of degraded queries. In the bottom half of Table 3, we compute Spearman's ρ correlation coefficient between the performance improvements of the different PRF methods; a high positive correlation coefficient would indicate that two methods have a similar effect on the same types of queries. From Table 3, we see that the correlation coefficient between ColBERT-PRF vs. ColBERT and ANCE-PRF vs. ANCE is the highest among all the compared pairs (0.41). Overall, this tells us that while there is no strong correlation between the queries improved by applying PRF to each baseline, ColBERT-PRF and ANCE-PRF are the most correlated pair. Indeed, only moderate correlations are observed, showing that the approaches improve different queries. Moreover, from Figure 3 we see that ColBERT-PRF improves more queries, and by a larger margin, than ANCE-PRF.
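The per-query analysis above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's evaluation code: the AP values are invented, and the tie-free Spearman implementation is a simplification.

```python
# A minimal sketch of the per-query analysis in Table 3: counting queries
# improved/unchanged/degraded by a PRF method, and correlating the per-query
# gains of two methods with Spearman's rho. All AP values are illustrative.

def delta_stats(ap_with_prf, ap_without_prf, tol=1e-6):
    """Count queries improved / unchanged / degraded by applying PRF."""
    deltas = [w - wo for w, wo in zip(ap_with_prf, ap_without_prf)]
    improved = sum(d > tol for d in deltas)
    unchanged = sum(abs(d) <= tol for d in deltas)
    degraded = sum(d < -tol for d in deltas)
    return deltas, (improved, unchanged, degraded)

def spearman_rho(xs, ys):
    """Spearman's rank correlation (assumes no ties, for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-query AP values, with and without PRF, for two systems
colbert_deltas, counts = delta_stats([0.55, 0.40, 0.31], [0.50, 0.42, 0.30])
ance_deltas, _ = delta_stats([0.46, 0.34, 0.33], [0.44, 0.35, 0.30])
rho = spearman_rho(colbert_deltas, ance_deltas)
```

A high positive ρ between the two delta lists would indicate that the two PRF methods help and hurt the same kinds of queries.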
For RQ2(b), on analysing the neural augmentation approaches, we observe that both the DeepCT and docT5query neural components lead to effectiveness improvements over the corresponding lexical retrieval models without neural augmentation. However, despite their improved effectiveness, our proposed ColBERT-PRF models exhibit marked improvements over the neural augmentation approaches. Specifically, on the TREC 2019 query set, ColBERT-PRF Ranker significantly outperforms four of the six neural augmentation baselines, including the BM25+DeepCT baseline, on MAP. Meanwhile, both ColBERT-PRF Ranker and ReRanker exhibit significant improvements over BM25+DeepCT and BM25+docT5query on MAP for the TREC 2020 queries, and exhibit improvements of up to 9.5% over neural augmentation approaches with neural re-ranking (e.g., MAP 0.4671→0.5116). The effectiveness of the ColBERT-PRF models in these comparisons indicates that query representation enrichment in a contextualised embedding space leads to higher effectiveness than sparse representation passage enrichment. Thus, in response to RQ2(b), the ColBERT-PRF models exhibit markedly higher performances than the neural augmentation approaches.
We further compare the ColBERT-PRF models with the recently proposed BERT-QE reranking model. In particular, we provide results when using BERT-QE to rerank both BM25+RM3 as well as ColBERT E2E. Before comparing the ColBERT-PRF models with the BERT-QE rerankers, we first note that BERT-QE does not benefit MAP on either query set, but can lead to marginal improvements for NDCG@10 and MRR@10. However, the BERT-QE reranker models still underperform compared to our ColBERT-PRF models. Indeed, ColBERT E2E+BERT-QE exhibits performance significantly lower than both ColBERT-PRF Ranker and ReRanker on the TREC 2019 query set. Hence, in response to RQ2(c), we find that the ColBERT-PRF models significantly outperform the BERT-QE reranking models. Finally, we consider the mean response times reported in Table 2, noting that ColBERT-PRF exhibits higher response times than the other ColBERT-based baselines, similar to those of the BERT-based rerankers. There are several reasons for ColBERT-PRF's slower response: firstly, the KMeans clustering of the feedback embeddings is conducted online, and the scikit-learn implementation we used is fairly slow; we tried other markedly faster KMeans implementations, but they were limited in terms of effectiveness (particularly for MAP), perhaps due to their lack of the KMeans++ initialisation procedure [5], which scikit-learn adopts. Secondly, ColBERT-PRF adds more expansion embeddings to the query: for the ranking setup, each feedback embedding can potentially cause a further k = 1000 passages to be scored. Further tuning of ColBERT's k parameter may allow efficiency improvements for ColBERT-PRF without much loss of effectiveness, at least for the first retrieval stage. Based on this, we further investigate how to attain a better balance between effectiveness and efficiency by leveraging techniques such as approximate scoring [26] and other clustering algorithms.

Results for RQ3 – Impact of ColBERT-PRF Parameters.
To address RQ3, we investigate the impact of the parameters of ColBERT-PRF. In particular, when varying the value of a specific hyper-parameter, we fix all the other hyper-parameters to their default settings, i.e., fb = 3, fe = 10, β = 1 and K = 24. Firstly, concerning the number of clusters, K, and the number of expansion embeddings, fe, selected from those clusters (fe ≤ K), Figures 5(a) and (b) report, for ColBERT-PRF Ranker and ColBERT-PRF ReRanker respectively, the MAP (y-axis) performance for different fe (x-axis) selected from K clusters (different curves). We observe that, for the same number of clusters and expansion embeddings, ColBERT-PRF Ranker exhibits higher MAP performance than ColBERT-PRF ReRanker, as we also observed in Section 5.3.1.
Then, for a given fe value, Figures 5(a) and (b) show that the best performance is achieved by ColBERT-PRF when using K = 24. To explain this, we refer to Figure 4 together with Figure 2(b), which both show the centroid embeddings obtained using different numbers of clusters K. Indeed, if the number of clusters K is too small, the informativeness of the returned embeddings is limited. For instance, in Figure 4(a), centroid embeddings representing stopwords such as 'in' and '##' are included, which are unlikely to be helpful for retrieving more relevant passages. However, if K is too large, the returned embeddings contain more noise, and hence are not suitable for expansion; for instance, using K = 64, feedback embeddings representing 'innocent' and 'stunt' are identified in Figure 4(b), which could cause a topic drift.
Next, we analyse the impact of the number of feedback passages, fb. Figure 5(c) reports the MAP performance for different numbers of feedback passages fb, for both ColBERT-PRF Ranker and ReRanker. We observe that both Ranker and ReRanker obtain their peak MAP values at fb = 3. In addition, for a given fb value, the Ranker exhibits higher performance than the ReRanker. Similar to existing PRF models, we also find that considering too many feedback passages causes a query drift, in this case by identifying unrelated embeddings.
Finally, we analyse the impact of the β parameter, which controls the emphasis of the expansion embeddings during the final passage scoring. Figure 5(d) reports MAP as β is varied for ColBERT-PRF Ranker and ReRanker. From the figure, we observe that in both scenarios, the highest MAP is obtained for β ∈ [0.6, 0.8], but good effectiveness is maintained for higher values of β, which emphasises the high utility of the centroid embeddings for effective retrieval.
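The role of β in the final scoring can be illustrated with a small sketch of the expanded Max-Sim operation. This is an illustrative toy, not the paper's implementation: the vectors, the informativeness function `sigma`, and the values are all assumed for the example.

```python
# Sketch of β-weighted expanded Max-Sim scoring: original query embeddings
# contribute their Max-Sim scores directly, while expansion embeddings are
# weighted by beta and by their informativeness weight sigma(v).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expanded_maxsim_score(query_embs, expansion_embs, sigma, doc_embs, beta):
    def maxsim(e):
        # Max-Sim: the highest similarity of e against all passage embeddings
        return max(dot(e, d) for d in doc_embs)
    original = sum(maxsim(q) for q in query_embs)
    expansion = sum(sigma(v) * maxsim(v) for v in expansion_embs)
    return original + beta * expansion

# Toy 2-d example: one original and one expansion embedding, beta = 0.5
score = expanded_maxsim_score(
    query_embs=[[1.0, 0.0]],
    expansion_embs=[[0.0, 1.0]],
    sigma=lambda v: 2.0,          # stand-in for an IDF-style weight
    doc_embs=[[1.0, 0.0], [0.0, 1.0]],
    beta=0.5,
)
```

Raising β thus increases the contribution of the expansion embeddings relative to the original query embeddings.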
Overall, in response to RQ3, we find that ColBERT-PRF, similar to existing PRF approaches, is sensitive to the number of feedback passages and the number of expansion embeddings added to the query (fb & fe), as well as to their relative importance during scoring (cf. β). However, going further, the K parameter of KMeans has a notable impact on performance: if too high, noisy clusters can be obtained; if too low, the obtained centroids can represent stopwords. Yet, the stable and effective results across these hyper-parameters demonstrate the overall promise of ColBERT-PRF.

Results for RQ4 – Semantic Matching by ColBERT-PRF.
We now analyse the expansion embeddings and the retrieved passages in order to better understand the behaviour of ColBERT-PRF, and why it demonstrates advantages over traditional (sparse) QE techniques. In the following, the symbol | denotes that multiple tokens are highly likely for a particular expansion embedding, and tokens with a darker red colour indicate a higher effectiveness contribution.
Firstly, it is useful to inspect the tokens corresponding to the expansion embeddings. Table 4 lists three example queries from the TREC 2019 and 2020 query sets, their tokenised forms, as well as the expansion tokens generated by the ColBERT-PRF model. For a given query, we used our default setting for the ColBERT-PRF model, i.e., selecting ten expansion embeddings; Equation (3) is used to resolve embeddings to tokens. On examination of Table 4, the relation of the expansion embeddings to the original query is clear: for instance, we observe that expansion embeddings for the tectonic concept of active margin relate to 'oceanic', 'volcanoes' and 'continental' 'plate'. Overall, we find that most of the identified expansion tokens provide credible supplementary information for each user query and can indeed clarify the information need.
To answer RQ4, we further conduct an analysis to measure the ability to perform semantic matching within the ColBERT Max-Sim operation. In particular, we examine which of the query embeddings match most strongly with a passage embedding that corresponds to exactly the same token, a so-called exact match; in contrast, a semantic match is a query embedding matching with a passage embedding that has a different token id. Indeed, in [13], the authors concluded that ColBERT is able to conduct exact matches for important terms based on their embedded representations. In contrast, little work has considered the extent to which ColBERT-based models perform semantic (i.e., non-exact) matching. Thus, firstly, following [28], we look into the interaction matrix between the query and passage embeddings. Figure 6 depicts the interaction matrix between the query "why did the us voluntarily enter ww1" (qid: 106375), expanded with 10 expansion embeddings, and the embeddings of its top returned passage (docid: 4337532); the darker shading indicates a higher similarity, the highest similarity among all the passage embeddings for a given query embedding is highlighted with an X symbol, and the histogram depicts the magnitude of the contribution of each query embedding to the final score of the passage. From Figure 6, we see that some query tokens, such as 'the', 'us', 'w', and '##w', experience exact matching, as these tokens have the same form as their corresponding highest Max-Sim scored passage tokens. In contrast, the remaining query tokens perform semantic matching to the passage, as their corresponding passage tokens with the highest Max-Sim score have different lexical forms; for instance, query token 'why' matches with passage token 'reason'.
In particular, the expansion tokens 'revolution' and 'entered', which do not occur in the original query but are added by ColBERT-PRF, also perform exact matching. In addition, expansion tokens such as 'attacked' and 'harbour' further perform semantic matching to the passage. This further indicates the usefulness of the expansion tokens for improving the matching between query and passage pairs.
To quantify the extent to which semantic matching takes place, we follow [43] and employ a recent measure that inspects the Max-Sim operation and determines whether each query embedding is matched with the same token (an exact match) vs. an inexact (semantic) match with a different token. Formally, let t_i and t_j denote the token ids of the i-th query embedding and the j-th passage embedding, respectively. Given a query q and the set R_k of the top-ranked k passages, the Semantic Match Proportion (SMP) at rank cutoff k w.r.t. q and R_k is defined as:

SMP(q, R_k) = (1/k) Σ_{d ∈ R_k} [ Σ_{i ∈ toks(q)} 1[t_i ≠ t_{j*}] · s_{i,d} ] / [ Σ_{i ∈ toks(q)} s_{i,d} ],

where s_{i,d} = max_j φ_{q_i}^T φ_{d_j} is the Max-Sim score of the i-th query embedding against passage d, j* = arg max_j φ_{q_i}^T φ_{d_j}, and toks(q) returns the indices of the query embeddings that correspond to BERT tokens, i.e., not [MASK] tokens. From Figure 7, we observe that, when the expansion embeddings are added to the original query by ColBERT-PRF, SMP is increased for most of the queries over the original ColBERT E2E model. Next, on both the TREC 2019 and TREC 2020 query sets, we investigate the impact of the rank cutoff k on the semantic match proportion for the ColBERT-PRF model, instantiated as Ranker and ReRanker, as well as for the ColBERT E2E model, as portrayed in Figure 8. In general, from Figure 8, we can see that Mean SMP grows as the rank cutoff k increases; this is expected, as we know that ColBERT prefers exact matches, and the number of exact matches decreases with rank (resulting in increasing SMP). However, ColBERT-PRF (both Ranker and ReRanker) has, in general, a higher SMP than the original ColBERT ranking. This verifies the results from Figure 7. The interesting exception is at the very highest ranks, where both ColBERT-PRF approaches exhibit lower SMP than the baseline. This suggests that at the very top ranks ColBERT-PRF exhibits a higher preference for exact token matches than the E2E baseline. However, overall, the higher SMPs exhibited by ColBERT-PRF indicate that, at deeper ranks, the embedding-based query expansion has the ability to retrieve passages with less lexical exact matching between the query and passage embeddings.
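The SMP computation can be sketched as follows. This is an illustrative reading of the measure with toy token ids and 2-d embeddings; it assumes each query embedding's match is weighted by its Max-Sim similarity when forming the proportion.

```python
# Sketch of Semantic Match Proportion (SMP): for each query embedding, find
# the passage embedding with the highest similarity (Max-Sim); matches whose
# token ids differ are semantic matches, weighted here by their similarity.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def smp(query, top_passages):
    """query: list of (token_id, emb); top_passages: list of passages,
    each a list of (token_id, emb). Returns the mean SMP over passages."""
    total = 0.0
    for passage in top_passages:
        semantic, overall = 0.0, 0.0
        for q_tok, q_emb in query:
            best_tok, best_emb = max(passage, key=lambda p: dot(q_emb, p[1]))
            sim = dot(q_emb, best_emb)
            overall += sim
            if best_tok != q_tok:   # different token id: a semantic match
                semantic += sim
        total += semantic / overall
    return total / len(top_passages)

# Toy example: one query embedding matches exactly, one semantically
query = [(1, [1.0, 0.0]), (2, [0.0, 1.0])]
passage = [(1, [1.0, 0.0]), (3, [0.0, 1.0])]
value = smp(query, [passage])
```

Here the first query embedding matches a passage embedding with the same token id (exact match), while the second matches one with a different token id (semantic match), giving an SMP of 0.5.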
In addition, we further investigate the potential for topic drift when applying ColBERT-PRF with different numbers of expansion embeddings on the TREC 2019 queries. In particular, in Figure 9(a) we measure retrieval effectiveness (MAP) as the number of expansion embeddings is varied and, in Figure 9(b), we present Mean SMP (y-axis) calculated upon the results retrieved after PRF, at different rank cutoffs (curves), also as the number of expansion embeddings is varied (x-axis).
From Figure 9(a), we can see that fe = 8 gives the highest (MAP) effectiveness (as also shown earlier in Figure 5(b)). At the same time, from Figure 9(b), we observe that (1) for 2 ≤ fe ≤ 8, Mean SMP falls; (2) however, for fe > 8, Mean SMP rises again. This trend is apparent when Mean SMP is analysed for five or more retrieved passages. This suggests that when more than eight expansion embeddings are selected, excessive semantic matching occurs (Figure 9(b)) and effectiveness approaches MAP 0.50 (Figure 9(a)). As expansion embeddings are selected using the IDF of the corresponding token, this suggests that, given the size of the feedback set (three passages, with length up to 180 tokens and on average 77 tokens), for more than eight embeddings we start to select non-informative expansion embeddings that can only be semantically matched in the retrieved passages, and hence there is no further positive benefit in terms of effectiveness. However, as effectiveness does not markedly decrease for fe > 8, this indicates that there is little risk of topic drift with ColBERT-PRF, due to the contextualised nature of the expansion embeddings. Overall, these analyses answer RQ4.

DOCUMENT RANKING EFFECTIVENESS OF COLBERT-PRF
After assessing the effectiveness of ColBERT-PRF on passage ranking in the previous section, we further demonstrate the performance of ColBERT-PRF on document ranking. In this task, documents are longer than passages, hence they need to be divided into smaller chunks with lengths comparable to those of passages. Moreover, for document ranking we do not fine-tune the ColBERT model on the new collection, due to the limited number of queries available; hence, we leverage the ColBERT model trained on MSMARCO as detailed in Section 5.2, i.e., in a zero-shot setting. Thus, in this section, we focus on testing the effectiveness of our proposed ColBERT-PRF on the MSMARCO document retrieval task and the TREC Robust04 document retrieval task. The research questions and experimental setup for the document ranking experiments are detailed in Sections 6.1 and 6.2, respectively. Results and analysis are discussed in Section 6.3.

Research Questions
Our document ranking experiments address the following research questions:
• RQ5: Can our pseudo-relevance feedback mechanism enhance the retrieval effectiveness of dense retrieval models, i.e., can the ColBERT-PRF model outperform the ColBERT, ANCE and ANCE-PRF dense retrieval models for the document retrieval task?
• RQ6: How does ColBERT-PRF compare to other existing baseline and state-of-the-art approaches for the document retrieval task, namely: (a) lexical (sparse) baselines, including using PRF,
In addition, we also conduct the evaluation using the 250 title-only and description-only query sets from the TREC Robust04 document ranking task.
We report the following metrics for the MSMARCO document ranking task: the normalised discounted cumulative gain (NDCG) calculated at rank 10, Mean Average Precision (MAP) at rank 1000, as well as Recall calculated at ranks 100 and 1000. For the Robust04 experiments, we use the same metrics used for the passage ranking tasks in Section 5.2. For significance testing, we use the paired t-test (p < 0.05) and apply the Holm-Bonferroni multiple testing correction.

Implementation and Settings.
As documents in these corpora are too long to fit into the BERT [11] model, and in particular into our trained ColBERT model (limited to 512 and 180 BERT WordPiece tokens, respectively), we split long documents into smaller passages and index the generated passages following [9]. In particular, when building the index for each document corpus, a sliding window of 150 tokens with a stride of 75 tokens is applied to split the documents into passages. All the passages are encoded into a FAISS index. At retrieval time, FAISS retrieves k = 1000 document embeddings for every query embedding. The final score for each document is obtained by taking that of its highest ranked passage, a.k.a. its max passage.
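The passaging and max-passage steps described above can be sketched as follows; the token lists, document ids and scores are illustrative.

```python
# Sketch of the document passaging (150-token window, 75-token stride) and
# the max-passage score aggregation described above.
def sliding_window(tokens, window=150, stride=75):
    """Split a token list into overlapping passages."""
    passages, start = [], 0
    while True:
        passages.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return passages

def max_passage(passage_scores):
    """passage_scores: iterable of (doc_id, score) pairs for scored passages;
    returns the best (max) passage score for each document."""
    best = {}
    for doc_id, score in passage_scores:
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return best

passages = sliding_window(list(range(300)))          # a 300-token "document"
doc_scores = max_passage([("d1", 0.5), ("d1", 0.8), ("d2", 0.3)])
```

With a 300-token document, the window/stride settings above yield three overlapping 150-token passages, and each document's final score is that of its best passage.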
To ensure a fair comparison, we apply passaging for all other indices used in this section, including the Terrier inverted index and the ANCE dense index. Similarly, all PRF methods are applied on feedback passages, and max passage is applied on the final ranking of passages. Finally, we follow the same ColBERT-PRF implementation as introduced in Section 5.2. For the query expansion settings, we follow the default settings for the passage ranking task in Section 5.3, i.e., 10 expansion embeddings obtained from three feedback passages 15 and K = 24 clusters.

Baselines.
To test the effectiveness of our ColBERT-PRF model on the document ranking task, we compare with all the baseline models used for the passage ranking task except the neural augmentation approaches, due to the high GPU indexing time required for performing the doc2query and DeepCT processing on these large document corpora.

Document Ranking Results
In this section, we further investigate the effectiveness of our proposed ColBERT-PRF for the document ranking task. Tables 5 and 6 present the performance of the ColBERT-PRF models as well as the baselines on the MSMARCO document dataset and the Robust04 dataset, respectively.
6.3.1 Results for RQ5. Similar to the passage retrieval task, in this section we validate the effectiveness of the pseudo-relevance feedback technique for the ColBERT dense retrieval model on the document retrieval task. On analysing Table 5, we find that both the ColBERT-PRF Ranker and ReRanker models significantly outperform both the single representation dense retrieval model, namely ANCE, and the multiple representation dense retrieval model, namely ColBERT E2E, in terms of MAP and Recall on both the TREC 2019 and TREC 2020 query sets. In particular, the application of ColBERT-PRF leads to up to 21% and 14% improvements over ColBERT E2E in terms of MAP for the TREC 2019 and TREC 2020 query sets, respectively.
Indeed, ColBERT-PRF outperforms all document retrieval runs submitted to the TREC 2019 Deep Learning track, exceeding the highest observed MAP by 23%. Similarly, on the TREC 2020 query set, the observed MAP is markedly above that attained by the second-ranked group on the leaderboard [7]. 16 In terms of NDCG@10, ColBERT-PRF outperforms both the ANCE and ColBERT E2E models on both MSMARCO query sets. Moreover, both the ColBERT-PRF Ranker and ReRanker models significantly outperform the ColBERT and ANCE models w.r.t. Recall@100, indicating the effectiveness of the refined query representations of ColBERT-PRF.
Similarly, when comparing the performance of ColBERT-PRF with the dense retrieval models without pseudo-relevance feedback on Robust04 in Table 6, we note that both the ColBERT-PRF Ranker and ReRanker models markedly improve over the ANCE and ColBERT E2E models on MAP, NDCG@10, and Recall for both the title-only and description-only types of queries. Overall, between the Ranker and ReRanker ColBERT-PRF models, we find that ColBERT-PRF Ranker is more effective than ColBERT-PRF ReRanker, likely due to its increased Recall, consistent with the results obtained on the passage ranking task (Section 5). Thus, in response to RQ5, we conclude that our ColBERT-PRF is effective at improving over ColBERT E2E on document ranking tasks, similar to the improvements observed in Section 5.

Results for RQ6.
In the following, we compare the effectiveness of the ColBERT-PRF model with various baselines. From Table 5, we find that ColBERT-PRF instantiated as the Ranker model significantly improves over the BM25-based lexical retrieval baselines, ColBERT E2E with BERT-QE as the reranker, as well as all the CEQE variants, in terms of the NDCG@10 and Recall@100 metrics on the TREC 2019 query set. In addition, for the TREC 2020 query set, ColBERT-PRF significantly improves over all the baselines, except those with BERT-based neural reranking models, namely BERT, ColBERT and BERT-QE, in terms of the MAP and Recall@100 metrics. Turning to the performance of the ColBERT-PRF models on the Robust04 query sets, from Table 6 we observe that the ColBERT-PRF models significantly outperform BM25 on both query sets and markedly outperform BM25 + RM3 on the title-only queries. In addition, ColBERT-PRF shows similar performance to the CEQE models in terms of MAP, but exhibits marked improvements in terms of NDCG@10 and MRR@10. Moreover, when comparing with the models with neural rerankers, both the ColBERT-PRF Ranker and ReRanker models significantly outperform the ColBERT E2E + BERT-QE baseline and exhibit comparable performance to the other neural reranker models. However, we argue that the limited performance of ColBERT-PRF compared with the BERT-based reranking models on the Robust04 query sets comes from the two following aspects: firstly, we used a zero-shot setting of the ColBERT model for the document ranking tasks, in that the ColBERT model was not trained on the larger document datasets; secondly, we did not perform further parameter tuning for ColBERT-PRF on the document ranking task. Thus, in response to RQ6, we find that ColBERT-PRF is more effective than most of the baseline models and comparable to the BERT-based neural reranking models.
15 We also tried filtering passages from the same document before applying PRF. We observed no significant improvements across multiple measures.
16 The first ranked group used expensive document expansion techniques.

MEASURING THE INFORMATIVENESS OF EXPANSION EMBEDDINGS OF COLBERT-PRF
In this section, we investigate the effectiveness of three variants of the ColBERT-PRF model, using different techniques to measure the informativeness of the expansion embeddings. The strategies are detailed in Section 7.1. Accordingly, a research question is posed in Section 7.2, with the corresponding experimental setup. Finally, Section 7.3 presents the performance and analysis of the three ColBERT-PRF variants.

Methodology
In Section 4.2, we proposed to map each expansion embedding back to its most likely token, and to use the IDF of that token to measure the importance σ of each expansion embedding υ_i generated by ColBERT-PRF. This results in a weight, σ(υ_i), that is used in the expanded Max-Sim calculation (Equation (5)). Indeed, notions of document frequency or collection frequency are commonly used in PRF models to weight expansion terms [2]. The intuition is that if a term appears more frequently in the feedback documents than in the whole corpus, the term is taken as informative. In contrast, terms that occur frequently in the corpus will not discriminate well between relevant documents and other documents in the collection. In this section, we revisit the use of IDF in ColBERT-PRF by additionally using the collection frequency of the token, while also examining the corresponding embeddings of the tokens.
Indeed, in addition to the document frequency focus of IDF, the collection frequency is also useful to reflect the informativeness of a term within the whole collection, measured as follows:

σ_ICTF(t) = log( |D| / tf(t, D) ),

where |D| is the number of terms in the collection D and tf(t, D) is the number of occurrences of expansion term t in the whole collection D. However, using either IDF or ICTF as expansion embedding weights does not consider the contextualised nature of the embeddings, namely that different tokens can have distinct meanings, and these may be more or less useful for retrieval. Use of IDF or ICTF can mask such distinctions.
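The two statistical weights can be sketched directly from their definitions. The natural logarithm and the absence of any smoothing are assumptions made for illustration.

```python
# Sketch of the two statistical importance weights: IDF from a token's
# document frequency, and ICTF from its collection frequency.
import math

def idf_weight(num_docs, df):
    """Inverse document frequency: num_docs documents in the collection,
    of which df contain the token."""
    return math.log(num_docs / df)

def ictf_weight(collection_len, cf):
    """Inverse collection term frequency: |D| tokens in the collection,
    of which cf are occurrences of the token."""
    return math.log(collection_len / cf)
```

Both weights grow as the token becomes rarer, so rarer (more discriminative) expansion embeddings receive higher weights.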
Hence, we examine a further method based directly on the embedded representations. In particular, for each token, we examine all corresponding embeddings in the index and determine how 'focused' these are; we postulate that a token with more focused embeddings will have only a single meaning (and therefore be less polysemous), and hence is more likely to be a good expansion embedding. Specifically, we measure the Mean Cosine similarity (MCos) for the embeddings of each token compared to the mean of all those embeddings:

σ_MCos(t) = (1/|E_t|) Σ_{υ ∈ E_t} cos(υ, ϒ),

where E_t is the set of embeddings in the index for token t and ϒ is the element-wise average embedding of those embeddings. MCos is intended to approximate the semantic coherence of the embeddings for a given token. The expansion embeddings of more coherent tokens are given a higher weight in ColBERT-PRF.
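A minimal sketch of the MCos weight, using toy 2-d embeddings in place of a token's real index embeddings:

```python
# Sketch of the MCos weight: the mean cosine similarity between each of a
# token's embeddings and their element-wise average embedding.
import math

def mcos_weight(embs):
    dim = len(embs[0])
    # Element-wise average of all embeddings for the token
    centroid = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / norm
    return sum(cos(e, centroid) for e in embs) / len(embs)

coherent = mcos_weight([[1.0, 0.0], [1.0, 0.0]])     # identical embeddings
polysemous = mcos_weight([[1.0, 0.0], [0.0, 1.0]])   # orthogonal embeddings
```

A token whose occurrences all produce near-identical embeddings (a single meaning) scores close to 1, while a polysemous token with scattered embeddings scores lower.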

Research Question & Experimental Setup
Our informativeness measurement experiments address the following research question:
• RQ7: What is the impact on the effectiveness of ColBERT-PRF of using different methods to measure the informativeness of the expansion embeddings, namely the IDF, ICTF and MCos weighting methods?
In our experiments addressing RQ7, while testing the IDF, ICTF and MCos importance measures, we vary the parameter of ColBERT-PRF that controls the overall weight of the expansion embeddings, β. We do not normalise the various importance measures σ_IDF(t), σ_ICTF(t) and σ_MCos(t): their inherent differences in scale are addressed by varying β. Dataset: To demonstrate the effectiveness of the three proposed variants of ColBERT-PRF, we use the MSMARCO TREC 2019 and TREC 2020 passage query sets for the passage retrieval task, and the Robust04 title-only and description-only query sets for the document retrieval task. Measures: Mean Average Precision (MAP) is used as the main metric. Figure 10 shows the impact of the different weighting methods on retrieval effectiveness, in terms of MAP, as β is varied, for both the MSMARCO passage ranking task and the Robust04 document ranking task. Specifically, for the passage ranking task, we measure retrieval effectiveness on both the TREC 2019 and TREC 2020 passage ranking queries, and for the document ranking task on the title-only and description-only queries of Robust04.

Results
On analysing the figure, we see that, for both the TREC 2019 and TREC 2020 query sets, the peak MAP scores for all three weighting methods are the same, at approximately MAP = 0.54 and MAP = 0.51, respectively. In addition, according to Figures 10(a) and 10(b), the overall trends for the IDF and ICTF weighting methods are the same, and both reach their highest MAP scores with β ∈ [0.4, 0.8]. Compared with IDF and ICTF, MCos exhibits its highest MAP performance with β ∈ [4.0, 6.0]. These trends allow us to draw the following observations: the lines for IDF and ICTF are very similar, varying only in terms of the β value needed to obtain the highest MAP; in contrast, the MCos weighting method achieves a similar maximum MAP, but at a larger β value, due to the lack of a common normalisation. Indeed, as the maximum MAP values obtained are similar for IDF, ICTF and MCos, this suggests that MCos is correlated with IDF, and that the statistical approaches are sufficient for measuring expansion embedding importance. A closer analysis of IDF and ICTF, as calculated on the BERT tokens, found that they exhibit a very high correlation (Spearman's ρ of ∼1.00 on the MSMARCO passage corpus). This is indeed higher than the correlation of 0.95 observed on a traditional sparse Terrier inverted index (which uses a more conventional tokeniser) of the MSMARCO document corpus. The differences in correlations can be explained as follows: firstly, the use of WordPieces by the BERT tokeniser reduces the presence of long-tail tokens (which are tokenised into smaller WordPieces); secondly, passage corpora use smaller indexing units than document corpora, so it is less likely for terms to occur multiple times in the same unit; this results in collection frequency being more correlated with document frequency.
For the Robust04 query set (Figures 10(c) and 10(d)), we see that while the peak MAP values for IDF and ICTF are again similar, the MCos weighting method gives lower MAP scores on both the Robust04 title and description query sets. This suggests that the coherence of a token's embeddings may not be a good indicator of the utility of the expansion embedding. Indeed, some tokens with high embedding coherence could be stopword-like in nature. This motivates the continued use of IDF and ICTF for identifying important expansion embeddings.
Overall, to address RQ7, we find that the statistics-based IDF and ICTF weighting methods are more stable than the MCos weighting method across different retrieval tasks. IDF and ICTF were shown to be equivalent, due to the high correlation between document frequency and collection frequency on passage corpora.

EFFICIENT VARIANTS OF COLBERT-PRF
In Section 5.3, we noted the high mean response time of the ColBERT-PRF approach. High response times are a feature of many PRF approaches, due to the need to analyse the contents of the feedback documents and decide upon the expansion terms/embeddings. In this section, we investigate several efficient variants of our ColBERT-PRF model, by experimenting with different clustering approaches, as well as different retrieval configurations of ColBERT.
In particular, we describe different variants in Section 8.1. Two research questions and the implementation setup are detailed in Section 8.2. Results and analysis are discussed in Section 8.3.

ColBERT-PRF Variants
The overall workflow of a ColBERT-PRF Ranker model can be described in five stages, as shown in Figure 1 (for the ColBERT-PRF ReRanker model, the fourth stage, ANN retrieval, is omitted): (1) a first-pass ANN retrieval using the query embeddings; (2) exact MaxSim scoring of the retrieved candidates; (3) clustering of the embeddings of the pseudo-relevant documents and selection of the expansion embeddings; (4) a second ANN retrieval using the expanded query; and (5) final exact MaxSim scoring. In the following, we discuss changes to the clustering (Stage 3, Section 8.1.1) and to the ANN retrieval (Stages 1 & 4, Section 8.1.2).

Clustering.
The default clustering technique in Stage 3 is the KMeans clustering algorithm. KMeans is a widely used clustering method, which groups samples into K clusters according to their Euclidean distance to each other. Hence, in ColBERT-PRF, given a set of document embeddings and the number of clusters to be returned, KMeans clustering is employed to return a list of representative centroid embeddings. Figure 11(a) provides an illustration of the KMeans clustering method. Indeed, as shown in Figure 11(a), both cluster centroids (which can be applied as expansion embeddings for PRF) are distinct from the input embeddings. As a consequence, to measure the importance of the representative centroid embeddings using IDF (or ICTF or MCos) and select the most informative ones, we need to map each centroid embedding to a corresponding token id. As a representative centroid embedding is, by definition, not an actual document embedding, we turn to the FAISS ANN index and apply Equation (3) to obtain a list of token ids (see Section 4.2).
However, the main drawback of this KMeans clustering method in ColBERT-PRF is that looking up the most likely token for each of the K centroid embeddings requires another K FAISS lookups. To address this issue, we propose variants that avoid these additional FAISS lookups by using the most likely token within each cluster. To do so, we recognise that the expansion embedding (which is added to the query) need not perfectly align with the embedding used to measure informativeness, which we call the indicative embedding.
Our first proposed alternative strategy, called KMeans-Closest, is still based on KMeans clustering but does not rely on additional FAISS lookups to obtain the most likely tokens. Once the K centroid embeddings are computed, for each centroid we identify the closest feedback document embedding in the corresponding cluster (the indicative embedding for that cluster) and use its token id to measure the importance score, such as the IDF, of the expansion embedding. As shown in Figure 11(b), the indicative embeddings (the diamonds) are the actual document embeddings closest to the KMeans centroid embeddings (the blue stars).
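The KMeans-Closest idea can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: the tiny Lloyd's-style `kmeans` stands in for the actual KMeans library call, and the toy data, function names and fixed iteration count are all assumptions.

```python
import numpy as np

def kmeans(emb: np.ndarray, k: int, iters: int = 10, seed: int = 0):
    # minimal Lloyd's iterations: pick k initial centroids from the data,
    # then alternate assignment and mean-update steps
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = emb[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def closest_indicative(emb, token_ids, centroids, assign):
    # KMeans-Closest: for each cluster, the member embedding nearest to the
    # centroid supplies the token id used to compute IDF/ICTF -- no extra
    # FAISS lookup is needed, since the member is a real document embedding
    indicative = []
    for c, centroid in enumerate(centroids):
        idx = np.flatnonzero(assign == c)
        if len(idx) == 0:
            continue
        best = idx[np.linalg.norm(emb[idx] - centroid, axis=1).argmin()]
        indicative.append((token_ids[best], emb[best]))
    return indicative

# two well-separated toy clusters of "document embeddings" with token ids
emb = np.array([[0., 0.], [1., 0.], [0., 1.], [10., 10.], [11., 10.], [10., 11.]])
tokens = [1, 2, 3, 7, 8, 9]
cents, assign = kmeans(emb, 2)
ind = closest_indicative(emb, tokens, cents, assign)
```

The centroid itself is still used as the expansion embedding; only the informativeness measurement switches to the indicative embedding's token id.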
Our second proposed clustering strategy is KMedoids [18]. The KMedoids algorithm returns the medoid of each cluster, i.e., the most centrally located of the input document embeddings. Thus, after clustering the feedback document embeddings, for each cluster we obtain a medoid (the indicative embedding for that cluster) that is also an actual document embedding, and can hence be mapped back to a token id without requiring an additional FAISS lookup for each centroid. Figure 11(c) depicts that both the expansion embeddings and the indicative embeddings are the medoid embeddings returned by the KMedoids clustering algorithm.
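The medoid computation underlying KMedoids can be sketched as follows: within a cluster, the medoid is the member minimising the total distance to all other members, so it is always a real document embedding with a known token id (a toy sketch, not the implementation of [18]):

```python
import numpy as np

def medoid(cluster: np.ndarray) -> int:
    # pairwise Euclidean distances between all members of one cluster
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
    # index of the member with the least total distance to the others
    return int(d.sum(axis=1).argmin())
```

For example, for the 1-D cluster {0, 1, 10}, the middle point (index 1) has the smallest total distance to the others and is returned as the medoid.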
Overall, while the use of the KMeans-Closest and KMedoids methods can speed up the third stage of ColBERT-PRF, there are potential risks (e.g., token id mismatches) that may hinder effectiveness; hence, we report effectiveness as well as efficiency in our experiments.

ANN Retrieval.
The overall ColBERT-PRF Ranker process encapsulates a total of five stages, as shown in Figure 1. An ANN retrieval stage is used in both Stages 1 & 4, and hence forms a significant part of the workflow. Indeed, as highlighted in Section 3, the approximate nearest neighbour search produces k′ document embeddings for each query embedding, which are then mapped to the corresponding documents, thereby forming an unordered set of candidate documents. However, the contribution of the different query embeddings to the final score of a document varies (c.f. the contribution histogram in Figure 6). Therefore, it is not efficient to take up to k′ = 1000 documents for each query embedding forward to the second stage for exact MaxSim scoring, as not all of these documents are likely to receive high scores.
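For reference, the exact MaxSim scoring performed in the second stage can be written as a short numpy sketch (assuming, as in ColBERT, L2-normalised query and document embeddings, so the dot products are cosine similarities):

```python
import numpy as np

def maxsim(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    # q_emb: (|q|, dim) query embeddings; d_emb: (|d|, dim) document embeddings
    sim = q_emb @ d_emb.T          # (|q|, |d|) similarity matrix
    # for each query embedding, keep its best match in the document, then sum
    return float(sim.max(axis=1).sum())
```

Each query embedding contributes its single best match to the total, which is why some query embeddings matter far more than others for a given document.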
To this end, we experiment with using Approximate Scoring [26] in the first stage, as well as in the later Stage 4 retrieval. In particular, this approach applies the MaxSim operator to the approximate cosine scores of the ANN algorithm, to generate a ranking of candidates from the first stage. Indeed, as this is a ranking rather than a set, the number of candidates k can be directly controlled, rather than indirectly through k′. While this requires more computation in Stage 1 (and has a small negative impact on the response time of that stage), it has marked overall efficiency benefits [26] for ColBERT dense retrieval, as a smaller number of candidates can be passed to MaxSim without loss of recall.
More specifically, for ColBERT-PRF instantiated as a Ranker model, we apply the Approximate Scoring technique either only in the first stage, or in both the first and fourth stages. Indeed, as we only require the three most relevant feedback passages for effective PRF, accurately scoring thousands of passages retrieved by the first-stage ANN is superfluous. For ColBERT-PRF instantiated as a ReRanker model, we apply Approximate Scoring in the first stage. In addition, we further investigate the efficiency and effectiveness trade-offs when applying the different clustering techniques and the Approximate Scoring technique in the various ColBERT stages.
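The candidate-selection idea behind Approximate Scoring can be sketched as follows. Instead of forming an unordered set of every document containing an ANN hit, we sum, per document, the best approximate ANN score obtained by each query embedding, rank documents by that total, and cut off at k1. The input format and names here are illustrative, not the actual API of [26]:

```python
from collections import defaultdict

def approx_rank(ann_hits, k1):
    # ann_hits: one list per query embedding, each a list of
    # (doc_id, approx_cosine) pairs returned by the ANN index
    best = defaultdict(dict)  # doc_id -> {query_emb_index: best approx score}
    for qi, hits in enumerate(ann_hits):
        for doc_id, score in hits:
            best[doc_id][qi] = max(score, best[doc_id].get(qi, float("-inf")))
    # approximate MaxSim: sum each query embedding's best score per document
    totals = {d: sum(per_q.values()) for d, per_q in best.items()}
    ranking = sorted(totals.items(), key=lambda kv: -kv[1])
    return ranking[:k1]  # a controllable cutoff, unlike the unordered set
```

Because the output is a ranking, the cutoff k1 bounds the work of the subsequent exact MaxSim stage directly.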

Research Questions & Experimental Setup
• RQ8: What is the impact on the efficiency and effectiveness of the ColBERT-PRF model of using different clustering methods, namely the KMeans, KMeans-Closest and KMedoids clustering methods?
• RQ9: What is the impact on the efficiency and effectiveness of the ColBERT-PRF model of applying Approximate Scoring?
Measures: We report the mean response time (MRT) of each ColBERT-PRF stage (c.f. Figure 1) and its overall MRT. Mean response times are measured with one Nvidia Titan RTX GPU (using a single thread for retrieval). In addition, we report effectiveness using the metrics of Section 5.2, namely MAP, NDCG@10, MRR and Recall. For significance testing, we use the paired t-test (p < 0.05) and apply the Holm-Bonferroni multiple testing correction.
Experimental setup: For both KMeans-Closest and KMedoids clustering, we reuse the default settings of the KMeans clustering algorithm, i.e., the number of clusters K = 24, the number of feedback documents fb = 3, and the number of expansion embeddings fe = 10. As for β, based on the conclusions of Section 5, we pick the appropriate β for each query set, namely β = 1 and β = 0.5 for the TREC 2019 and TREC 2020 passage ranking query sets, respectively. For the Approximate Scoring experiments, let k1 denote the number of passages retrieved in the Stage 1 ANN, and k4 denote the number of passages retrieved in the Stage 4 ANN. Then, for (i) the ColBERT-PRF Ranker model, we apply rank cutoffs of k1 = 300 and k4 = 1000, and for (ii) the ReRanker model, we apply a rank cutoff of k1 = 1000 in the first stage only, to ensure sufficient recall of relevant passages to be upranked after applying PRF. We later vary k1 and k4 to demonstrate their impact upon efficiency and effectiveness.
Table 7 lists the effectiveness and efficiency of ColBERT E2E and of ColBERT-PRF instantiated as Ranker and ReRanker models, on both the TREC 2019 and TREC 2020 passage ranking query sets. In terms of efficiency, we measure the MRT of the different ColBERT-PRF stages, as well as the overall MRT for each model variant. From Table 7, we note that, for both the TREC 2019 and TREC 2020 query sets, both the ColBERT-PRF Ranker and ReRanker model variants implemented with the KMeans-Closest and KMedoids clustering methods are much faster than with the KMeans clustering method, without markedly compromising their effectiveness. In particular, both KMeans-Closest and KMedoids still exhibit enhanced NDCG@10 and MAP (significantly so) over the ColBERT E2E baseline.
Moreover, this speed benefit is obtained by omitting the FAISS lookup step of the default ColBERT-PRF: large efficiency improvements for KMeans-Closest and KMedoids can be observed in the Stage 3 column of Table 7 (e.g., on TREC 2019, ∼900ms for KMeans-Closest vs. ∼3000ms for KMeans). Going further, KMedoids is faster still (218ms on TREC 2019), demonstrating the benefit of a fast clustering algorithm, with no further loss of effectiveness compared to KMeans-Closest.

RQ8 -Clustering Variants.
Overall, in a reranking scenario, the KMeans-Closest and KMedoids clustering methods attain up to 2.48× and 4.54× speedups, respectively. Indeed, the mean response time of KMedoids, 766ms on TREC 2020, is very respectable compared to the ColBERT E2E baseline, despite the normally expensive application of a PRF technique. Thus, in response to RQ8, we conclude that both the ColBERT-PRF Ranker and ReRanker models with KMeans-Closest or KMedoids clustering are more efficient than with the KMeans clustering method, without compromising effectiveness.

RQ9 -Variants using Approximate Scoring.
Next, we consider the application of Approximate Scoring within ColBERT-PRF. Again, efficiency and effectiveness results are reported in Table 7. We report response times only for KMeans. Firstly, on examining the table, we find that Approximate Scoring applied in both the first and the fourth stages of the ColBERT-PRF Ranker model exhibits similar effectiveness to, but is much more efficient than, the original ColBERT-PRF Ranker model. In addition, deploying Approximate Scoring within the ColBERT-PRF ReRanker model also reduces the response time while still outperforming the ColBERT E2E model (but not by a significant margin). From Table 7, we see that rows with the Approximate Scoring technique applied exhibit increased Stage 1 times (43ms → 95ms/90ms for Ranker, as MaxSim takes time to compute), but are much faster in Stage 2, as exact scoring occurs only on the selected high-quality candidates (344ms → 22ms/23ms for Ranker). The net effect of replacing both of the set-retrieval ANN stages with Approximate Scoring in the Ranker is an up to 18% speedup in response times (4103ms → 3466ms), while still maintaining high effectiveness, e.g., significant improvements in MAP over the baseline ColBERT E2E (indeed, [26] suggest that a cutoff of k = 300 is sufficient for high-precision retrieval).
Next, we further study the trade-off between the efficiency and effectiveness of ColBERT-PRF applied with Approximate Scoring, as well as the benefits brought by the different clustering techniques. Complementing the table, Figure 12 presents both the effectiveness and efficiency of the following three strategies on the TREC 2019 query set: (i) ColBERT-PRF Ranker with Approximate Scoring in Stage 1, using the three different clustering techniques; (ii) ColBERT-PRF Ranker with Approximate Scoring in both Stage 1 and Stage 4, using the three different clustering techniques; and (iii) ColBERT-PRF ReRanker with Approximate Scoring in Stage 1, using the three different clustering techniques. In each figure, we vary the cutoff, k1 or k4, of Approximate Scoring to produce curves for each setting (100 ≤ {k1, k4} ≤ 7300). We provide separate figures for MAP and NDCG@10. Each figure has two asterisk points denoting the performance of ColBERT E2E and of the ColBERT-PRF default setting (KMeans, ANN set retrieval). For the points in each curve, the marker • indicates that the corresponding performance is significantly improved over the ColBERT E2E baseline (and × indicates no significant improvement).
Firstly, we analyse the ColBERT-PRF Ranker when only Stage 1 Approximate Scoring is applied. From Figure 12(b), we observe that, for smaller k1, there is some minor degradation of NDCG@10; however, the impact on MAP (Figure 12(a)) is indistinguishable. In terms of efficiency, it can easily be seen that KMedoids is the most efficient technique, followed by KMeans-Closest and finally KMeans.
We next consider Figures 12(c) and 12(d), where the Approximate Scoring technique is applied in both the first and fourth stages of the ColBERT-PRF Ranker model, with the different clustering methods. More specifically, k1, the rank cutoff of the first-stage Approximate Scoring, is fixed to 300, while k4 is varied. From Figure 12(c), we find that all three clustering techniques exhibit a correlation between efficiency and effectiveness, in that increased MRT is accompanied by increased effectiveness. Moreover, reducing k4 results in more marked degradations for MAP than for NDCG@10. Finally, for the ColBERT-PRF ReRanker with Stage 1 Approximate Scoring, reducing k1 also degrades effectiveness; however, using a sufficiently large k1 can still result in significantly enhanced MAP (denoted using •), even with response times around 1000ms. This is markedly faster than the default ColBERT-PRF ReRanker setting, which attains 3500ms (shown as an asterisk), and much closer to the default response time of ColBERT E2E.
Overall, in response to RQ9, we conclude that the Approximate Scoring technique is useful to attain a better balance between the effectiveness and efficiency of the ColBERT-PRF model, by reducing the number of documents being re-ranked, and that it can also be combined with the more efficient clustering techniques.

CONCLUSIONS
This work is the first to propose a contextualised pseudo-relevance feedback mechanism for multiple representation dense retrieval. Based on the feedback documents obtained from the first-pass retrieval, our proposed ColBERT-PRF approach extracts representative feedback embeddings using a clustering technique. It then identifies discriminative embeddings among these representative embeddings and appends them to the query representation. ColBERT-PRF can be effectively applied in both ranking and reranking scenarios, and requires no further neural network training beyond that of ColBERT. Indeed, our passage ranking experimental results, on the TREC 2019 and 2020 Deep Learning track passage ranking query sets, show that our proposed approach can significantly improve the retrieval effectiveness of the state-of-the-art ColBERT dense retrieval approach. In particular, ColBERT-PRF outperforms the ColBERT E2E model by up to 26% and 10% in MAP on the TREC 2019 and TREC 2020 passage ranking query sets, respectively. Our proposed ColBERT-PRF is a novel and extremely promising approach for applying PRF in dense retrieval. It may also be adaptable to further multiple representation dense retrieval approaches beyond ColBERT. We further validate the effectiveness of the proposed ColBERT-PRF approach on the MSMARCO document ranking task and the TREC Robust04 document ranking task; on the former, ColBERT-PRF exhibits up to 21% and 14% improvements in MAP over the ColBERT E2E model on the TREC 2019 and TREC 2020 document ranking query sets, respectively. Moreover, we investigate ColBERT-PRF variants with different weighting approaches for measuring the usefulness of the expansion embeddings. Finally, to trade off efficiency and effectiveness, we explore efficient variants of ColBERT-PRF using the Approximate Scoring technique and/or different clustering algorithms, bringing up to a 4.54× speedup without compromising retrieval effectiveness.
In conclusion, the main findings of this work can be summarised as follows:
• The pseudo-relevance feedback information from the top-returned documents in multiple representation dense retrieval is beneficial for improving retrieval effectiveness on passage retrieval (Section 5) and document retrieval (Section 6). Indeed, our proposed pseudo-relevance feedback mechanism can significantly improve retrieval effectiveness over the ColBERT end-to-end model, the single representation dense retrieval models, as well as most of the baselines, for both the passage ranking and document ranking tasks;
• Techniques based on statistical information, namely IDF and ICTF, and on embedding coherency, namely Mean Cosine Similarity, can be used to measure the informativeness of the expansion embeddings of ColBERT-PRF (Section 7);
• Trade-offs between the retrieval effectiveness and efficiency of ColBERT-PRF can be attained using different clustering techniques and/or candidate selection techniques based on Approximate Scoring (Section 8).
Overall, our work makes it feasible to implement the pseudo-relevance feedback technique in a multiple-representation dense retrieval setting. In particular, the extensive experimental results provided demonstrate the effectiveness of our proposed ColBERT-PRF model. However, how this dense PRF technique can be applied to single-representation dense retrieval models remains an open problem. In addition, while the performance of most queries benefits from the expansion embeddings, the performance of some queries is still degraded. A more cautious design that applies selective query embedding expansion would likely alleviate this issue. We leave this to future work.

A APPENDIX
In the following, Appendix A.1 first details how variants of ColBERT-PRF can be implemented by weighting token occurrences, specifically using the Bo1 and RM3 query expansion models. These are compared with ColBERT-PRF implemented using the KMeans clustering technique. Next, in Appendix A.2, we demonstrate the experimental pipelines for ColBERT-PRF.

A.1 ColBERT-PRF (Bo1 or RM3) Variants
As discussed in Section 4.1, in ColBERT-PRF, embeddings are clustered, rather than the frequencies of the corresponding tokens. In this section, we analyse this choice, by separating the clustering from the embedded representation. In particular, we use traditional token counting to measure the informativeness of tokens in the feedback documents, but then expand the query using the corresponding embedded representation of the selected tokens. Therefore, for these variants, the informativeness of each feedback embedding is measured using the Bo1 or RM3 technique, and the feedback embeddings with the highest informativeness are selected as the expansion embeddings. For instance, for the ColBERT-PRF (Bo1) implementation, the expansion embeddings are weighted according to the following equation:

W_Bo1(t) = tf_x · log₂((1 + λ)/λ) + log₂(1 + λ),

where λ = tf_rel/N_rel, tf_rel denotes the frequency of (BERT WordPiece) token t in the pseudo-relevant feedback documents, N_rel denotes the number of feedback documents, and tf_x denotes the frequency of token t in the pseudo-relevant document set. Similarly, for the ColBERT-PRF (RM3) variant, the expansion embeddings are selected using:

W_RM3(t) = λ score_exp(t) + (1 − λ) score_orig(t), where 0 ≤ λ ≤ 1,

where score_orig(t) = 1 denotes the weight of the original query embeddings and score_exp(t) denotes the weight of the expansion embeddings, with score_exp(t) = S(t) / Σ_{d ∈ PRD} Σ_{t′ ∈ d} S(t′). Here, S(t) denotes the contribution of token t to the maximum similarity score MaxSim(q, d) = Σ_{i=1}^{|q|} max_{j=1,...,|d|} φ_{q_i}^T φ_{d_j}, and PRD is the set of pseudo-relevant feedback documents.
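The Bo1-style weighting of candidate expansion tokens can be illustrated with a short sketch, following the λ = tf_rel/N_rel definition above. The exact form of W_Bo1 used here is a reconstruction of the standard Bo1 formula and should be treated as an assumption, as are the toy feedback documents:

```python
import math
from collections import Counter

def bo1_weights(feedback_docs, top_k):
    # feedback_docs: one token list per pseudo-relevant document
    n_rel = len(feedback_docs)
    tf_rel = Counter(t for doc in feedback_docs for t in doc)
    weights = {}
    for t, tfx in tf_rel.items():
        lam = tf_rel[t] / n_rel          # lambda = tf_rel / N_rel
        # Bo1: tf_x * log2((1 + lambda) / lambda) + log2(1 + lambda)
        weights[t] = tfx * math.log2((1 + lam) / lam) + math.log2(1 + lam)
    # the top-weighted tokens supply the expansion embeddings
    return sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]

docs = [["cat", "dog"], ["cat"], ["cat", "fish"]]
top = bo1_weights(docs, 2)
```

Note how this weighting depends only on token occurrence counts, which is exactly the property that makes it blind to semantically coherent but lexically varied concepts.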
In Table 8, we report the ColBERT-PRF models with the various expansion embedding selection techniques on both the TREC 2019 and 2020 query sets. From Table 8, we find that both ColBERT-PRF Ranker and ReRanker with the Bo1 selection technique can outperform the ColBERT E2E model in terms of NDCG@10, MRR@10 and Recall on TREC 2019, while these improvements are not observed on the TREC 2020 query set. In particular, on the TREC 2020 query set, the ColBERT-PRF models with the RM3 expansion embedding selection approach exhibit lower performance than the ColBERT E2E model. More importantly, on comparing the ColBERT-PRF model with the KMeans clustering technique to the Bo1 and RM3 selection variants, we observe that the KMeans clustering technique significantly outperforms both the Bo1 and RM3 variants. Figure 13 shows the impact of β for the Bo1 and KMeans variants, in both the ranking and reranking settings, for MAP and NDCG@10. From the figures, it is clear that KMeans always outperforms Bo1, regardless of the setting of β. Thus, we conclude that the KMeans clustering selection technique is more effective than the traditional Bo1 and RM3 selection approaches for the ColBERT-PRF model. This is because the Bo1 and RM3 query expansion techniques rely solely on word occurrence statistics for selecting expansion embeddings, rather than on the semantic coherence of the embeddings, and hence select embeddings for tokens that occur frequently, rather than for frequently occurring semantic concepts. Selecting semantically coherent concepts is a key advantage of ColBERT-PRF in a dense retrieval environment.

A.2 ColBERT-PRF Pipeline
In this appendix, we demonstrate the stages of ColBERT-PRF when defined as PyTerrier [25, 27] pipelines. In particular, in PyTerrier, the >> operator is used to delineate the different stages of a retrieval pipeline. In Listing 1, we portray the experimental pipelines for ColBERT E2E and ColBERT-PRF. The original source code can be found in the PyTerrier_ColBERT repository.
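The behaviour of the >> composition operator can be sketched with a tiny transformer class. This is a toy illustration of the operator's semantics only, not PyTerrier's actual implementation, and the stage names are hypothetical stand-ins for ANN retrieval and PRF expansion:

```python
class Transformer:
    """A minimal pipeline stage: wraps a function and supports `>>` chaining."""
    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        # `a >> b` builds a new stage that feeds a's output into b
        return Transformer(lambda x: other.fn(self.fn(x)))

    def __call__(self, x):
        return self.fn(x)

# hypothetical stages standing in for first-pass retrieval and PRF expansion
ann_retrieval = Transformer(lambda q: q + ["cand1", "cand2"])
prf_expand    = Transformer(lambda res: res + ["expansion"])

pipeline = ann_retrieval >> prf_expand
result = pipeline(["q"])
```

In real PyTerrier pipelines, each stage transforms a dataframe of queries or results rather than a list, but the composition semantics of >> are the same.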