White Box: On the Prediction of Collaborative Filtering Recommendation Systems’ Performance

Collaborative Filtering (CF) recommendation algorithms are a popular solution to the information overload problem, aiding users in the item selection process. Relevant research has long focused on refining and improving these models to produce better (more effective) recommendations, and has converged on a methodology to predict their effectiveness on target datasets by evaluating them on random samples of the latter. However, predicting the efficiency of the solutions—especially with regard to their time- and resource-hungry training phase, whose requirements dwarf those of the prediction/recommendation phase—has received little to no attention in the literature. This article addresses this gap for a number of representative and highly popular CF models, including algorithms based on matrix factorization, k-nearest neighbors, co-clustering, and slope one schemes. To this end, we first study the computational complexity of the training phase of said CF models and derive time and space complexity equations. Then, using characteristics of the input and the aforementioned equations, we contribute a methodology for predicting the processing time and memory usage of their training phase. Our contributions further include an adaptive sampling strategy, to address the tradeoff between resource usage costs and prediction accuracy, and a framework that quantifies both the efficiency and effectiveness of CF. Finally, a systematic experimental evaluation demonstrates that our method outperforms state-of-the-art regression schemes by a considerable margin, with an overhead that is a small fraction of the overall requirements of CF training.


INTRODUCTION
Recommendation systems have been extensively used to aid users in the item selection process by producing tailored content in line with the users' tastes and needs [66]. This, in turn, has also impacted the way providers deliver content to the users, as they need to strike a balance between

Problem Formulation
As discussed in the previous section, the aim of this work is to estimate resource consumption, alongside recommendations' quality, for CF models using a white-box approach. In particular, determining when a CF algorithm could have an expensive training time or memory usage is an essential challenge, leading us to our problem formulation: given a CF algorithm A in the families of algorithms considered in this work and an input dataset B, predict the time and memory required to train algorithm A on B. This problem fits within the system resource tracking and management topic, which has been broadly studied in the context of computational performance [65,77].
Our goal is to produce a framework and methodology that allows the accurate prediction of the processing time and memory usage during the training phase of CF models, based on measurements acquired on samples of the input dataset. To this end, we focus on well-known CF algorithms including singular value decomposition, k-nearest neighbors, slope-one schemes, matrix factorization, and so forth [45]. In our study, we included all CF models implemented in the Surprise framework [45], covering a wide area of the design space. The selection criterion for these CF models was their popularity and frequency of appearance in recent and related benchmarking works, such as [17-19, 24]. Furthermore, we propose using the time and space complexity analysis of these CF algorithms combined with curve-fitting primitives as a reliable solution toward the accurate prediction of their training times and memory usage. Then, we sample the input (i.e., the user-item rating matrix (URM)) and extract characteristics of each sample (e.g., number of users/items/ratings, density of the rating matrix, etc.). As part of the sampling process, we investigate what sampling strategies could be used to make the resource utilization prediction problem viable (i.e., how many samples are enough for accurate predictions of processing time or memory usage versus the processing cost of training a given algorithm on those samples). To this end, we formulate the following central research question:

RQ:
Given the processing times and memory usage of a CF algorithm on a subset of the data, how can we quantify its expected time/memory consumption for the full dataset?
As part of the central RQ, we have examined the following secondary RQs: (a) How can we use complexity analysis to estimate the resource consumption of a CF algorithm? (b) Can the efficiency of a CF model on the full dataset be predicted using solely characteristics of the input (i.e., URM) and the efficiency of the CF model on a set of samples? (c) Given an upper sample size S%, how do we determine when to stop sampling based on the number of samples for which we obtain consistent resource usage measurements (i.e., the time/memory values are in a tight interval)? (d) How should we sample the base data, such that the quality of the predicted efficiency/effectiveness of a CF model is not harmed?

Contributions
As discussed in the next section, a large body of literature has investigated the effectiveness of CF, including how the quality (i.e., precision, recall, accuracy) of the recommendations changes with respect to different dataset characteristics [15,24,44]. This aspect therefore does not fall within the main scope of our work, and we refer to the relevant literature [14,22,42,70,78,80,83] and benchmarks [45] for comparing the selected CF algorithms regarding their accuracy in recommendation tasks. However, to demonstrate the flexibility of our sampling strategy (Section 3.4) for both efficiency (e.g., training processing time and memory) and effectiveness estimation, we also incorporate in our framework a component that predicts the quality of the recommendations based on the characteristics of the input (Section 4.5).
To the best of our knowledge, this is the first study that explores the efficiency of CF algorithms, addressing the processing time and memory estimation problem through an adaptive sampling-based strategy and different curve-fitting approaches. Our efforts led to the following contributions:
C1: An adaptive sampling strategy that dynamically draws samples to jointly satisfy a user-defined accuracy/error threshold and/or a predefined resource constraint (e.g., maximum time quota).
C2: A methodology that assesses the efficiency of a CF algorithm through training processing time and memory cost models based on its computational complexity.
C3: A tool and framework that predicts the training processing time, memory usage, and effectiveness of CF algorithms through sampling-based probabilistic analysis and characteristics of the input.
C4: An extensive experimental evaluation, comparing our framework against state-of-the-art regression models.
Finally, although in this work we have focused on applying the above on CF models, we believe that the proposed methodology and prediction tools are flexible enough to also be applicable to other algorithms and use cases; exploring such directions is outside the scope of this work.

BACKGROUND AND RELATED WORK
CF evaluation is a major area of interest, as numerous studies and projects have tried to determine the best metrics and practices in this field. So far, both efficiency and effectiveness have been established as the critical areas toward assessing CF performance [39]. While there are still ongoing debates about online versus offline evaluation [35,39], it is notably harder (if not impossible) to reproduce online studies. Consequently, offline assessment has been used as a primary tool for establishing the overall performance of a CF and gaining insights into its behavior under certain constraints and limitations [35,39]. Traditionally, CF evaluation focuses on splitting the dataset/input into training and testing collections, which are then used to assess the output of the CF model. The limitation of this approach is that sparsity and popularity biases often affect the evaluation protocol [9]. This issue can be alleviated using "random" sampling. Random sampling techniques have been intensively used in the past decades in various contexts and applications. For example, in databases, the size of the results for a given query was predicted using random sampling [36,40,55]. Other works focused on computing the optimal bound for the number of samples needed for satisfying a predefined error metric [4,21,33]. Furthermore, efficient sampling techniques have been developed to address the limited availability of computational resources for analyzing large datasets [59]. All these efforts have contributed toward better ways of drawing samples, which is critical for predicting a chosen quantity since the number of samples and their distribution impact the accuracy of the predictions [62].
One of the drawbacks of CF evaluation is that studies typically only report the quality of the recommendations through effectiveness metrics [35,39]. However, recent work [56] also presents some insights into the observed efficiency (processing time) of the models. Thus, and in the context of environmental awareness [76], we speculate that, as more complex models are developed, the community will shift its attention and efforts toward (1) reporting the (resource) cost of new models and (2) incorporating ways of minimizing hardware usage. This is yet another reason why it is essential to be able to predict the efficiency cost of CF by performing the training of the models on only small samples of the target datasets.
Recent interest has been shown toward investigating how dataset characteristics (e.g., number of users, items, ratings density/sparsity) affect the quality of the recommendations and their impact on the CF effectiveness. In [1,24], the authors explore the effect of the structural properties of the user-item rating matrices regarding the accuracy and robustness of the CF algorithms used in the studies. Their results confirm a relationship between dataset characteristics and the CF models' behavior and highlight the standard practice of using samples to evaluate the effectiveness of CF while alleviating the high processing costs of testing on the complete dataset. In [63], we argued that properties of the input data further affect the inherent tradeoff between the efficiency and effectiveness of a CF and that the choice of the algorithm should be based on the latter as well.
Lately, CF models have also been evaluated with respect to their accuracy using sampling-based probabilistic analysis methods [17,18]. To this end, the standard practice consists of training the selected CF on a sample of the dataset and using its offline measured accuracy as a proxy for the effectiveness over the complete dataset [24]. This approach could also be extended to quantify the efficiency of these CF algorithms, as discussed in [63,64]. However, several challenges are associated with sampling for efficiency and effectiveness prediction, as demonstrated in the experimental evaluation (Section 4.5). We argue that these challenges stem from a combination of two factors: (1) the samples produced by the standard-practice sampling strategies usually employed for effectiveness purposes do not lend themselves well to efficiency predictions, and (2) the computational requirements often scale non-linearly with the dataset size. This work aims to address these gaps, and the resource consumption prediction problem, for a set of highly popular and impactful CF models.
Over the past decades, the problem of predicting an algorithm's processing time has been extensively studied across different communities, with numerous results. For example, in parallel computing, linear regression models have been used to predict the processing time of different library implementations for multiprocessors [11]. Other works focused on predicting the runtime of various planning algorithms, for selecting which algorithm to run and for how long [28,41,68]. Predicting the processing time of parameterized algorithms has drawn high interest from the research community, with existing solutions incorporating the parameters as additional inputs for the prediction models [6,67]. Other explored areas consist of applications of runtime prediction, such as determining instance hardness [52] and parameter optimization/tuning [67]. Additionally, in Database Management Systems (DBMSs), over the past 15 years, the number of potential database designs (e.g., indexes, table partitions) and configurations (i.e., knobs to turn and fine-tune) has grown by 3× for Postgres and by nearly 6× for MySQL [82], making the Database Administrators' (DBAs) job very challenging. The main issues with configuration knobs are that a large number of parameters have to be optimized and that they control many aspects of the database system (e.g., disk I/O, memory, etc.) [82]. To this end, existing solutions [82] focus on auto-tuning these knobs and suggest configuration plans that are better than those derived by human experts. To further optimize DBMSs, other works [48,53] focus on computing cost models for estimating the resource consumption of processing different types of queries. We believe the same scenarios apply to CF, as there are numerous algorithms that can be used, and each one has different configurations and parameters to optimize/tune, resource usage costs, and accuracy rates, as showcased in our experimental evaluation. Therefore, the problem of quantifying the performance of CF models w.r.t. both resource consumption and effectiveness becomes all the more interesting.

THE WHITE-BOX APPROACH
When faced with the task of predicting the effectiveness of a CF on a dataset, based on its behavior on a sample of that dataset, the standard practice consists of building a regression model over data points gathered by iteratively (1) randomly sampling over all ratings in the dataset, (2) training the CF over the sample, and (3) evaluating its effectiveness over the sample [24]. We follow a similar strategy with a few notable changes summarized in Figure 1. The proposed pipeline covers the steps that our users need to follow to estimate the processing time and memory usage for training a CF algorithm on a chosen dataset. In the following subsections, we discuss each of the steps in more depth, and show how using the expected complexity of a CF algorithm, samples of the data, and a simple regression model can lead to fast, accurate, and interpretable predictions of the efficiency and effectiveness of the CF model.
At the core of our method is the idea that if we understand the computational complexity of each CF algorithm (captured using big-O time and space complexity analysis), we can predict its efficiency (i.e., processing time and memory usage). Throughout the article, we refer to time complexity as the processing time taken by an algorithm to train on a particular input. Similarly, we refer to space complexity as the memory usage incurred when a CF model is trained on an input. For all the CF models analyzed, we define the characteristics (Figure 1, step 1a) of their input (i.e., the user-item rating matrix) with respect to the number of users, m; the number of items, n; the total number of ratings, ρ; and the density of the rating matrix, δ = ρ/(m × n).
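To make these characteristics concrete, the following is a minimal sketch (our own illustration, not the article's code) that extracts m, n, ρ, and δ from a long-format rating table loaded with pandas; the column names are assumptions.

```python
import pandas as pd

def urm_characteristics(ratings: pd.DataFrame) -> dict:
    """Extract the input characteristics used throughout the article.

    Assumes a long-format rating table with columns
    'user_id', 'item_id', and 'rating' (hypothetical names).
    """
    m = ratings["user_id"].nunique()      # number of users
    n = ratings["item_id"].nunique()      # number of items
    rho = len(ratings)                    # total number of ratings
    delta = rho / (m * n)                 # density of the rating matrix
    return {"m": m, "n": n, "rho": rho, "delta": delta}

# Example: characteristics of a MovieLens-style CSV file
# print(urm_characteristics(pd.read_csv("ratings.csv")))
```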

Algorithms
A CF algorithm is considered explicit if its input is based on fixed ratings (usually from 1 to 5) emerging from scores awarded by users to items. In this work, we analyze the following CF categories: (i) Basic algorithms; chosen representatives include a Baseline algorithm derived from [46], as well as a Maximum Likelihood Estimation-based Random approach [61]. (ii) Algorithms based on k-nearest neighbors; chosen representatives include Basic KNN (KNN) [38], KNN taking into account the mean rating of each user (Centered KNN) [25], and KNN taking into account a baseline rating (KNN Baseline) (Equation (3), Section 2.2 in [46]). The former two use Mean Squared Difference (MSD) [16] as the distance metric, while the latter uses Pearson correlation coefficients [38] centered using baseline scores. (iii) Variants of matrix factorization (MF); chosen algorithms include Non-negative Matrix Factorization (NMF) [57] and Singular Value Decomposition (SVD) derived from [66]. (iv) Slope-one schemes; the chosen representative is the Slope One algorithm [78]. (v) Co-clustering approaches; chosen representatives include the algorithm presented in [32].

Fig. 1. Overview of the proposed pipeline. Given the input data (user-item rating matrix (URM)), we (step 1a) extract features such as the number of users, items, and ratings, the density of the matrix, and so forth, and (step 1b) sample the URM following the strategy described in Section 3.4. In step 2 we train the various classes of CF models (Section 3.2) on the samples drawn, while (step 3) gathering efficiency metrics, such as processing time and memory overhead, and effectiveness metrics for the quality of the recommendations (i.e., RMSE for the predicted rating values). In steps 4a and 4b, we train our proposed prediction models, detailed in Section 3.3, given the recorded metrics, then learn and predict the efficiency (processing time, memory) and effectiveness (RMSE) of the CF on the full dataset. This process (steps 2-4) is repeated until the user-defined termination condition (prediction accuracy, time budget, etc.) is met (step 5).
For further details on these algorithms and their implementation, we refer interested readers to the documentation of the popular Surprise framework [45], which was also used for our experiments. Surprise is a Python-based CF engine that allows users to build and test CF algorithms that work on explicit feedback datasets. This framework allows researchers to quickly set up their experimental evaluations, provides complete control over the experiments, and contains various tools and metrics to assess the CFs' performance. Surprise allows users to experiment with built-in datasets (e.g., MovieLens [37]), but also to incorporate their own bespoke collections. The Surprise engine comprises many ready-to-use traditional CF models, described in the previous paragraph, for solving the rating prediction problem [2], and is frequently used as a benchmark in the research community [85]. Our methodology naturally extends to implicit CF algorithms, which rely on inferring users' preferences based on their interactions with the items, such as which pages they visited and for how long, where they clicked, and so forth; we omit discussing them in this work due to space constraints.
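As a point of reference, the following is a minimal sketch of how a CF model is trained and evaluated with Surprise; it reflects the framework's public API but is our own illustration rather than the article's experimental code.

```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Load a built-in explicit-feedback dataset (MovieLens 100K).
data = Dataset.load_builtin("ml-100k")
trainset, testset = train_test_split(data, test_size=0.2)

# Train one of the CF models analyzed in this work.
algo = SVD()        # could also be KNNBasic(), NMF(), SlopeOne(), CoClustering(), ...
algo.fit(trainset)  # the training phase whose time/memory we want to predict

# Effectiveness on the held-out ratings (rating prediction task).
predictions = algo.test(testset)
accuracy.rmse(predictions)
```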

Complexity Analysis
Traditionally, the performance of an algorithm is captured through asymptotic worst-case complexity equations using big-O notation [23]. This method allows us to determine an upper bound on the way an algorithm's processing time grows or declines as a function of characteristics of its input. The CF models studied in this work are based on well-known algorithms, for which big-O analysis has been provided by the relevant literature [14,22,42,70,78,80,83]. However, it is often the case that design decisions make the complexity characteristics of particular implementations diverge from the theoretical bounds, a fact often hidden behind constant factors or terms ignored during big-O analysis [3]. We thus further formulate and propose time and space complexity equations based on the actual implementation of said CF models. In the following paragraphs, we list the algorithmic complexities based on (1) the literature [14,22,42,70,78,80,83] and (2) the implementation. For the latter, we used Surprise's [45] documentation and implementation and derived the expected time and space complexity of CF models, captured through big-O notation and the characteristics of the input outlined in Section 3. For the purpose of our approach, the number f of latent factors as well as the number e of epochs, where applicable, are considered constants set to the values predefined/recommended by [45].
The baseline CF is based on the Alternating Least Squares (ALS) algorithm, in its naive solver version, which has a complexity of O(mnf). If we further fix f to a default/recommended number, the complexity can be abstracted to O(mn) [42]. However, by examining the implementation of the baseline CF algorithm, for a given number of epochs e, first the users' baselines are computed in m² steps, followed by the items' baselines, which take n² operations. If we fix e to a predefined/recommended value, the baseline's overall time complexity is O(m² + n²). For memory usage, apart from the size of the URM, which in all cases takes O(ρ) of memory, the baseline algorithm stores the users' (items', respectively) baselines in an array of size O(m) (O(n), respectively); since both baselines are used by this CF, the overall space complexity is O(m + n).
The random algorithm, based on Maximum Likelihood Estimation (MLE), predicts the missing ratings over a normal distribution, computed in at most O(mn) steps [80]. The implementation reveals that the random CF computes a global mean and standard deviation during its training phase. These are typically done in two stages (first compute the mean, then the standard deviation), each of which scans over all rating values. As such, the algorithm's time complexity is O(ρ). The random CF does not require additional memory during its training phase, as there are no auxiliary data structures that need to be allocated for computing the mean and standard deviation. Therefore, its space complexity matches the size of the URM, namely O(ρ).
For the neighborhood-based CF algorithms (i.e., KNN, Centered KNN, and KNN Baseline), the training phase computes the distance of every user to every other user (or every item to every other item, depending on whether the approach is user or item centric), taking into account only the items (users, respectively) that are common across users (items, respectively). This leads to a complexity of O(m²n²) [83,89]. However, at the implementation level, for KNN we derived a time complexity of O(ρ²/x + x²), where x can be either m for user-based KNN or n for item-based KNN, respectively. At the core of the KNN-based CF, the similarity function computes the distance across the relevant users or items with respect to (1) the ratings they gave (for users) and (2) the rated items. By investigating the rating frequency distribution of all datasets outlined in Section 4.1, we concluded that the number of per-user and/or per-item ratings follows a uniform distribution (i.e., ρ/m ratings per user, or ρ/n ratings per item). We thus make the following simplifying assumption: in the similarity function, the distances are computed in x × ρ²/x² steps, which can be simplified to ρ²/x. Then, the distance is computed for pairs of common users/items in x² time (m² for users or n² for items, respectively). Centered KNN has a similar complexity to KNN, as they use the same similarity metric (MSD), but takes an extra O(ρ) step to compute the mean ratings of each user (item, respectively), which brings the overall time complexity to O(ρ²/x + x² + ρ). KNN Baseline is also based on KNN and computes distances across users (items, respectively) using Pearson correlation coefficients [38], and takes into account baseline ratings. Its overall time complexity, as derived from its implementation, is the same as the one for Centered KNN, i.e., O(ρ²/x + x² + ρ). To compute the distances and similarities between pairs of users (items, respectively), the KNN-based CF utilizes additional matrices of size O(m²) (O(n²), respectively). Therefore, the memory usage for KNNs during training is O(m²) or O(n²), depending on whether distances/similarities are computed across users or across items.
The NMF model is based on the SGD algorithm, which achieves a computational complexity of O(emρ) [70]. If we fix the number of epochs, the complexity can be reduced to O(mρ). In the Surprise framework [45], for a fixed number of epochs and factors, NMF decomposes a given user-item ratings matrix with respect to the number of users (m), items (n), and ratings (ρ). Therefore, the missing ratings are computed in O(ρ + m + n) (or O(ef(ρ + m + n)) when including the number of epochs, e, and factors, f). The implementation also reveals that during training additional memory is allocated for two matrices, one for the user latent factors, which takes O(mf) of memory, and the other for the item latent factors, which is stored in O(nf). This brings NMF's overall memory usage to O(mf + nf).
SVD, a popular CF-based approach, has been intensively used to produce recommendations on explicit datasets. Over time, multiple variations of SVD have been developed [22], leading to a significant number of implementations. However, most of them converge to a complexity of O(mn²), even though other studies, such as [47], claim that the overall complexity of SVD is close to O(n²m + m²n). The SVD implementation found in Surprise [45] uses the user-item ratings matrix in O(eρf) time to factorize the corresponding user and item factors. SVD's time complexity can thus be simplified to O(ρ) for a fixed number of epochs (e) and factors (f). For memory requirements, during training, SVD follows similar storage requirements as NMF, having a space complexity of O(mf + nf).
For slope-based solutions, the Slope One algorithm has a generic time complexity of O(mn²), as it computes the average difference between pairs of relevant items as described in [78]. At the implementation level, Slope One first computes the frequency of the pairs of items (i, j), followed by the deviation between item i's ratings and item j's ratings. This is achieved in O(ρ²/m + n²). Then, the relevant ratings are predicted using the users' mean ratings combined with the aforementioned frequency and deviation arrays, which means another O(ρ), leading to an overall time complexity of O(ρ²/m + n² + ρ). For memory requirements, during training, Slope One allocates two additional matrices to compute the frequency and deviation between pairs of items across the dataset. Consequently, the expected memory usage is O(n²).
Lastly, the Co-clustering CF with a fixed number of user-item clusters converges toward a computational complexity of O(mn) [14]. By examining its implementation, we find that Co-clustering splits users and items into clusters in O(m) + O(n) steps and into co-clusters in O(ρ) steps, using an assignment technique similar to k-means. This makes Co-clustering train in O(m + n + ρ) time. During training, the Co-clustering CF computes the users' and items' means across the entire collection and stores them in arrays of size O(m) and O(n), respectively. Then, user and item clusters and co-clusters are built, which require another O(m), O(n), and O(mn), respectively, of memory. This brings the overall space complexity to O(m + n + mn).
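To summarize the implementation-level analysis above, the sketch below (our own illustration) encodes each model's derived time- and space-complexity expression as a feature function of the input characteristics m, n, and ρ; these are the X terms later fed to the prediction models, and the user-based variant (x = m) is shown for the KNN family.

```python
# Implementation-derived complexity features (constant factors omitted),
# expressed as functions of the input characteristics m (users), n (items),
# and rho (ratings). Our own summary of the analysis in this section.
TIME_COMPLEXITY_FEATURES = {
    "Baseline":     lambda m, n, rho: m**2 + n**2,
    "Random":       lambda m, n, rho: rho,
    "KNN (user)":   lambda m, n, rho: rho**2 / m + m**2,
    "CenteredKNN":  lambda m, n, rho: rho**2 / m + m**2 + rho,
    "KNNBaseline":  lambda m, n, rho: rho**2 / m + m**2 + rho,
    "NMF":          lambda m, n, rho: rho + m + n,
    "SVD":          lambda m, n, rho: rho,
    "SlopeOne":     lambda m, n, rho: rho**2 / m + n**2 + rho,
    "CoClustering": lambda m, n, rho: m + n + rho,
}

SPACE_COMPLEXITY_FEATURES = {
    "Baseline":     lambda m, n, rho: m + n,
    "Random":       lambda m, n, rho: rho,
    "KNN (user)":   lambda m, n, rho: m**2,
    "NMF":          lambda m, n, rho: m + n,      # O(mf + nf) with f fixed
    "SVD":          lambda m, n, rho: m + n,      # O(mf + nf) with f fixed
    "SlopeOne":     lambda m, n, rho: n**2,
    "CoClustering": lambda m, n, rho: m + n + m * n,
}
```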

Prediction Models
Knowing the worst-case (big-O) complexity can help determine an upper bound on the resources used by an algorithm when executed against all possible inputs [23]. However, in practice, the likelihood of encountering inputs that elicit the worst-case processing time of an algorithm is relatively low [23]. Therefore, computational complexity theory has devised average-case complexity to measure the efficiency of an algorithm through its expected processing time/memory averaged over all inputs. Computing average-case complexity is often a hard problem, since the distribution of all possible inputs is required to derive theoretical bounds analytically. Instead, our methodology is based on approximating probabilistic analysis, through an adaptive sampling strategy (Figure 1, step 1b) that predicts the expected processing time/memory of a given CF algorithm over an input/dataset. We employ this strategy for determining both the processing time and memory usage requirements.
Using the above algorithmic complexities, the processing times and memory usage measured across different inputs (Figure 1, step 3), and the characteristics of the data, we propose building the following types of models for time/memory prediction. Our approach is based on estimating the hidden factors (or unknown parameters) in the time and space complexity equations derived from the algorithms' implementations, as presented in Section 3.2. Given that the processing time/memory estimation is based on an overdetermined system, with more equations than unknowns, we constructed our models using the least-squares approach [10]. This technique minimizes the sum of the squares of the residuals (i.e., the differences between the observed/measured processing times/memory usage and the predicted/fitted values) computed from the equations.

Linear Models.
Since for each sample of the input we know the fixed number of users, items, and ratings, we encapsulate the performance (i.e., processing time and memory) of the CF algorithms through complexity equations summarized as follows:

    ŷ = α · X + β,

where ŷ is the predicted processing time or memory usage, X is a combination of the independent variables m, n, and ρ, and α and β are the slope and intercept computed using linear least-squares regression. For example, the equation for computing the processing time of the baseline becomes α(m² + n²) + β, that of NMF becomes α(ρ + m + n) + β, and so forth. Similarly, for predicting the memory usage of SVD, we compute α and β using α(mf + nf) + β. This approach allows us to quickly compute the hidden factors of the complexity equations, while capturing the characteristics of the URM, as evidenced by our experiments. Additionally, we also compute the prediction error interval [27,60], which quantifies the uncertainty of our predicted time and memory usage. This allows us to provide upper and lower bounds on the estimates at each sample size. Furthermore, we compute these intervals using a combination of the variance of the outcome variable (i.e., time, memory) and the estimated variance of the model [27,73].
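A minimal sketch of this fitting step (our own illustration), using NumPy's least-squares solver to recover α and β for a given model from measurements taken on the samples; the sample characteristics and measured times below are hypothetical.

```python
import numpy as np

def fit_cost_model(samples, measured_times, feature):
    """Fit alpha, beta in: predicted_cost = alpha * X + beta (least squares).

    samples        : list of (m, n, rho) tuples describing each drawn sample
    measured_times : training times (or memory usage) observed on those samples
    feature        : complexity feature, e.g. lambda m, n, rho: m**2 + n**2
    """
    X = np.array([feature(m, n, rho) for m, n, rho in samples])
    A = np.column_stack([X, np.ones_like(X)])        # [X, 1] design matrix
    (alpha, beta), *_ = np.linalg.lstsq(A, np.array(measured_times), rcond=None)
    return alpha, beta

# Example (hypothetical numbers): predict the full-dataset training time of the
# Baseline CF (feature m**2 + n**2) from measurements on three samples.
# alpha, beta = fit_cost_model(
#     [(100, 90, 5_000), (200, 180, 20_000), (300, 270, 45_000)],
#     [0.02, 0.07, 0.16],
#     lambda m, n, rho: m**2 + n**2)
# full_time = alpha * (610**2 + 9_724**2) + beta   # ML 100K characteristics
```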

Bayesian Models.
Another approach that we explored was to estimate the performance of the CF algorithms using Bayesian inference [20]. In this setup, our aim is still to compute the hidden coefficients of the complexity equations from Section 3.2, but using probability distributions rather than point estimates. Therefore, our predicted variable (i.e., processing time, memory) will be drawn from a probability distribution. To this end, we infer the performance of the CF using a normal (Gaussian) distribution [20], characterized by mean and variance, as seen below:

    y ∼ Normal(α · X + β, σ²),

where α, β, and σ² are themselves drawn from (prior) distributions. Since σ² will always be a positive number, we chose a prior distribution that yields only positive values, namely the Exponential distribution [20], with σ² ∼ Exp(1). For the α and β coefficients, we used normal (Gaussian) distributions and restricted the parameter space using priors learned with the previous linear regression models [31]. In Bayesian inference, the main goal is to use sampling methods to draw samples from the posterior in order to approximate it [20]. According to standard practice [20,31], we can use Monte Carlo methods [71] to draw random samples from a distribution to approximate said distribution. While there are several ways to perform Monte Carlo sampling, the most common and currently used [20,31] is Markov Chain Monte Carlo (MCMC) sampling [12]. One of the challenges of fitting the Bayesian models is to ensure that all parameters show convergence. This can be checked by computing the potential scale reduction statistic (R̂), which should always have a value below 1.1 [20]. The rule of thumb is that convergence has been achieved when R̂ is very close to 1.0 [31].
As with the linear regression models, for the Bayesian regressors we can compute the Monte Carlo Standard Error (MCSE), which is an estimate of the inaccuracy of Monte Carlo samples in MCMC algorithms [31]. MCSE can be used to quantify the uncertainty of the predictions for processing time and memory usage in MCMC models by computing the standard deviation and variance around the posterior mean of the samples [31,75].
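A minimal sketch of such a Bayesian cost model (our own illustration, using the PyMC library, which the article does not name), mirroring the linear formulation above; the priors and data below are placeholders.

```python
import numpy as np
import pymc as pm

# X: complexity feature per sample (e.g., m**2 + n**2 for the Baseline CF);
# y: training times measured on those samples. Hypothetical numbers.
X = np.array([1.8e4, 7.2e4, 1.6e5])
y = np.array([0.02, 0.07, 0.16])

with pm.Model() as cost_model:
    # Priors; in the article's setup these are informed by the linear regression fit.
    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=1.0)
    sigma2 = pm.Exponential("sigma2", 1.0)        # sigma^2 ~ Exp(1), positive by construction

    mu = alpha * X + beta                         # same mean structure as the linear model
    pm.Normal("time", mu=mu, sigma=pm.math.sqrt(sigma2), observed=y)

    idata = pm.sample(2000, tune=1000)            # MCMC sampling of the posterior

# Convergence (R-hat close to 1.0) and MCSE can be inspected with ArviZ:
# import arviz as az; print(az.summary(idata, var_names=["alpha", "beta", "sigma2"]))
```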

Baselines and Contenders.
We compare our proposed processing time and memory prediction models (white-box approach) against two types of baselines: (1) a hard baseline using linear regression to learn the hidden factors in the complexity equations described in the literature and (2) a soft baseline, which assumes that the complexity of the algorithms is unknown and therefore computes the processing time and memory predictions using just the characteristics of the input (black-box method). The latter was tested using several off-the-shelf state-of-the-art regression algorithms available through the H2O analytics platform. A few examples of the tested regressors include Random Forest, Deep Neural Net, Support Vector Machine (SVM), and Adaptive Boosting. In the experimental evaluation, we only report the results of the best performer with regard to prediction accuracy (i.e., lowest RMSE), namely the Gradient Boosting Machine (GBM) [30]. GBM was ranked as the best state-of-the-art regression model since it achieved the lowest RMSE, following the K-fold cross-validation procedure described in [49].
In our experimental evaluation (Section 4.5), we also predict the effectiveness (i.e., quality of the recommendations) of the CF given a sample of the input data and characteristics of the URM. To this end, we have employed state-of-the-art regressors, which include GBM [30], AdaBoost Regressor (ABR) [29], and Support Vector Regressor (SVR) [5]. These models were selected by running the H2O AutoML tool [49] in a process similar to the one described in the previous paragraph. Since the effectiveness of the CF algorithms is a well-studied area [1,8,24], it is not within the main scope of this article. However, as demonstrated in our results (Section 4.5), by using our adaptive sampling technique (Section 3.4), we do not need to employ different sampling strategies to draw samples for predicting efficiency and effectiveness. Consequently, using the sampler in one go, we gain information about the recommendations' quality and the performance and resources used while training the CF algorithms.
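For concreteness, the following is a minimal sketch of the black-box baseline setup (our own illustration of the H2O Python API); the file name, feature column names, and target column are assumptions.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Each row describes one sample: input characteristics plus the measured training time.
train = h2o.import_file("sample_measurements.csv")   # hypothetical file
features = ["m", "n", "rho", "density", "gini_users", "gini_items"]

gbm = H2OGradientBoostingEstimator(nfolds=5)          # K-fold cross-validation
gbm.train(x=features, y="training_time", training_frame=train)

# Predict the training time for the full dataset from its characteristics, e.g.:
# full = h2o.H2OFrame({"m": [53424], "n": [10000], ...})
# print(gbm.predict(full))
```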

Adaptive Sampling
The standard practice of drawing samples for assessing the effectiveness of the recommendations involves choosing random triplets of the form (user, item, rating) from the input, and then filtering them based on a predefined characteristic of the URM (e.g., the density of the sample needs to be above/below a certain threshold) [1,24]. However, this approach does not work well for drawing samples to predict the efficiency (i.e., processing time, memory) of the CF algorithms (see Section 4.5). Instead, we propose a simple yet reliable sampling strategy, which provides good samples that can be used for predicting both the efficiency and effectiveness of the CF at the same time. It should be noted that drawing samples from a dataset is not free, and therefore we believe it is essential to use a strategy that not only can be used to determine the accuracy of the recommendations but also reflects the complexity characteristics of the input data.
When sampling the URM, we asked ourselves how many samples are needed, and how large they should be, to obtain a representative measure of the CF algorithms' performance. To this end, we propose the following sampling strategy (Figure 1, step 1b), described in Algorithm 1. Initially, the user of our system provides us with an upper sample size, say S (%), as well as a time budget T for our method; if these are not provided or unknown, the default values, as outlined in Section 4.3, will be used for sampling the URM. We then draw an initial sample by selecting uniformly at random an S% subset of the users and an S% subset of the items, and including in the sample all associated ratings. We then use a strategy similar to Monte Carlo rejection sampling [79] to recursively subsample to produce even smaller samples. This strategy has two key characteristics: (1) sub-sampling allows us to produce a number of samples at different sampling rates at a fraction of the cost of sampling the complete dataset, and (2) by sampling user/item IDs, the sample better reflects the complexity characteristics of the base data. For each sample drawn, we train the CF models and record their training times (Figure 1, steps 2 and 3); we then decide whether to proceed with more samples given the so-far cumulative execution time of the above process and the time budget T (Figure 1, step 5). Furthermore, the number of samples we draw using Algorithm 1 should be dynamic and based on (1) a user-defined upper limit and (2) a user-defined accuracy/error threshold. Ideally, we would like to have a number of samples that provides good accuracy for the processing time and memory prediction models. However, we should not use so many samples that their total processing time would be larger than the training time of the entire dataset (see Section 4.5 for the related results).
For sampling the URM using the proposed algorithm, the user-defined upper limit is captured through a maximum time budget T and a maximum sample size S. As each sample is drawn, the remaining time budget and sample size are (re)calculated for each iteration through the adapt() function, which takes into account the previous time budget T and sample size S. The adapt() function gradually reduces the next sample size to be drawn (e.g., in decrements of 10%) and the amount of time allocated for this operation, based on the time t spent so far on drawing the current sample and the available time budget. The measurements are repeated multiple times for each (sub)sample to compute the average m of the resource usage (processing time or memory overhead) values recorded while training on the current sample, and the confidence interval c selected by the user (or the default of 99%). In this computation, we discard the measurement of the first iteration to avoid effects of cold caches and overheads of the language runtime. Thus, the algorithm filters out high-end outliers that can skew the accuracy of the processing time and memory prediction models. The proposed sampling algorithm stops drawing sub-samples for a given sample size if either there is not enough time left within the time budget T or the accuracy_threshold has been met. Furthermore, our sampling algorithm can also take a user-defined accuracy_threshold as part of its input. This accuracy_threshold controls the variance of the measurements of the processing times recorded by training the CF on the sub-samples within a sample size. The iterative execution on a single sample may then terminate early if the ratio c/m satisfies said threshold. Thus, we ensure that the processing times or memory overhead measured on the samples have low variance (i.e., the recorded time/memory values are in a tight interval). This is an important aspect to be considered while sampling, as the quality of the samples can impact the accuracy of the resource cost models, as demonstrated in Section 4.5. The proposed adaptive sampling algorithm allows us to minimize the number of samples needed for processing time and memory overhead prediction without sacrificing the accuracy of the resource cost models. Moreover, using an accuracy threshold and a fixed confidence interval leads to a reliable stopping strategy for drawing random samples and collecting time and memory measurements. Another idea that we investigated for determining when we have acquired a large enough sample is to analyze the prediction error interval in LR models and/or the uncertainty in Bayesian models, described in Section 3.3 and showcased in Section 4.5. However, as this method depends on the resource prediction model used (i.e., it works only with LR and Bayesian models), we prefer and propose a model-agnostic sampling strategy based on an upper limit for time and sample size, as well as an accuracy/error threshold that reflects the quality of the samples.
Moreover, this tool can be easily customized by the users, allowing them to determine the optimal bounds for the number of needed samples based on how much variance is allowed to be present in the generated samples. In our experiments, we used a 0.1 accuracy threshold, corresponding to 10% to 15% variance and 99% confidence interval (as described in Section 4.3). However, depending on the constraints imposed on the random samples, the users could alter the accuracy threshold values to be in line with their needs and requirements. We note that a smaller accuracy threshold will map to less variance in the processing time values, and hence the predictions are more accurate.
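The following is a compact sketch of the adaptive sampling loop described above (our own reconstruction; Algorithm 1 itself is not reproduced in this excerpt). The helper names, the 10% decrement step, and the simplified adapt() policy are assumptions, and for brevity each rate is sampled directly rather than by recursive subsampling as in the article.

```python
import random
import statistics
import time

def draw_sample(ratings, users, items, rate, seed):
    """Sample rate% of user IDs and item IDs, keep all their associated ratings."""
    rng = random.Random(seed)
    u = set(rng.sample(sorted(users), int(len(users) * rate)))
    i = set(rng.sample(sorted(items), int(len(items) * rate)))
    return [(usr, itm, r) for usr, itm, r in ratings if usr in u and itm in i]

def adaptive_sampling(ratings, users, items, train_fn,
                      S=1.0, T=100.0, accuracy_threshold=0.1, repeats=5):
    """Draw progressively smaller samples until the time budget T runs out."""
    measurements = []                      # (sample rate, average training time) pairs
    budget, rate = T, S
    while budget > 0 and rate > 0:
        sample = draw_sample(ratings, users, items, rate, seed=42)
        times = []
        for rep in range(repeats):
            start = time.process_time()
            train_fn(sample)               # train the CF model on this (sub)sample
            elapsed = time.process_time() - start
            if rep > 0:                    # discard the first (cold-cache) iteration
                times.append(elapsed)
            # stop early once the measurements are in a tight interval
            if len(times) >= 2 and statistics.stdev(times) / statistics.mean(times) < accuracy_threshold:
                break
        measurements.append((rate, statistics.mean(times)))
        budget -= sum(times)               # simplified adapt(): shrink budget...
        rate -= 0.1                        # ...and sample size in 10% decrements
    return measurements
```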

EXPERIMENTAL EVALUATION
This section describes our experimental evaluation methodology and discusses our results. In brief, our extensive experiments show that by using a relatively small subset of a dataset, D s , and the time it takes to train a CF algorithm A on D s , we can accurately predict A's expected processing time and memory usage on the full dataset D. Additionally, we show that there is an inherent efficiency-effectiveness tradeoff between the quality of the predictions and their procurement cost. Further, we provide insights into the cost of sampling compared to the cost of training the CF models, as well as the cost of running our prediction models on the base data. Lastly, we discuss the advantages of the proposed sampling strategy compared to current standard practices.

Datasets and Recommendation Task
For this study, we used the MovieLens (ML) 100K, 1M, and 20M collections [37], as well as the GoodBooks (GB) 10K dataset [88]. Each of these datasets consists of explicit ratings, from 1 to 5, given by users to items (i.e., films for ML and books for GB). While ML 100K (610 users and 9,724 items) and 1M (6,040 users and 3,706 items) are smaller collections with densities of 0.017 and 0.045, respectively, GB 10K (53,424 users and 10,000 items) is a larger collection with a density of 0.012. In addition, we also probed our efficiency cost models with ML 20M, a very large dataset containing 138,493 users, 26,744 items, and a density of 0.0043. The different datasets' characteristics and sizes allowed us to experiment with our proposed performance prediction tool and methodology, showing that it can be successfully used on collections with different properties.
The recommendation task investigated in this article refers to predicting the usefulness or relevance of a given item to a user [72]. In short, after an item is selected, the CF model estimates the rating the user would give to this item, and if the rating is above a particular threshold value, then the item is presented to the user as a recommendation.
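As an illustration of this rating prediction task (our own sketch, with a hypothetical relevance threshold), using the Surprise API introduced in Section 3.2:

```python
from surprise import SVD, Dataset

THRESHOLD = 4.0   # hypothetical relevance threshold

data = Dataset.load_builtin("ml-100k")
algo = SVD()
algo.fit(data.build_full_trainset())

# Estimate the rating user "196" would give to item "302" (raw MovieLens ids).
prediction = algo.predict(uid="196", iid="302")
if prediction.est >= THRESHOLD:
    print(f"Recommend item {prediction.iid} to user {prediction.uid} "
          f"(estimated rating: {prediction.est:.2f})")
```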

Evaluation Protocol
All experiments were carried out on Linux servers, each having two Intel Xeon E5-2660 CPUs (eight physical cores each with two-way hyper-threading (HT)) and 64GB of RAM, running Ubuntu Linux 14.04.6. As the GoodBooks dataset is significantly larger and denser, we ran the corresponding experiments on a higher-spec Linux server with four Intel Xeon E7-4870 v2 CPUs (15 physical cores each with two-way HT) and 512 GB of RAM, running Ubuntu Linux 16.04.7. During the experimental evaluation, all resource-intensive processes were suspended to avoid interference with our measurements.
We measured the processing times of the various CF algorithms on inputs sampled as described above. To this end, we utilized the getrusage method from Python's resource module, with the overall training time for each sample computed as the sum of the time spent executing in user mode (ru_utime) and system mode (ru_stime). To retrieve the memory usage of the CF, we recorded the information from the proc filesystem via the "/proc/[pid]/status" file, which contains the memory utilized by the current process (identified by pid) as reported directly by the Linux kernel. From this file, we based our memory usage computations on the "VmSize" field, which returns the overall memory used by a specific process. To cross-check our approach regarding memory measurements, we also recorded the memory usage while training the CF model using a memory profiler. For this task, we used the memory-profiler module from Python and ensured that the results reported by the profiler match the ones recorded using the proc filesystem.
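A minimal sketch of these two measurements (our own illustration of the described procedure):

```python
import os
import resource

def training_cpu_time() -> float:
    """CPU time (user + system mode) consumed so far by the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime

def current_vm_size_kb() -> int:
    """Overall memory of the current process, from the VmSize field of /proc."""
    with open(f"/proc/{os.getpid()}/status") as status:
        for line in status:
            if line.startswith("VmSize:"):
                return int(line.split()[1])   # reported by the kernel in kB
    raise RuntimeError("VmSize not found")

# Usage around a training call (algo and trainset as in the Surprise example):
# before = training_cpu_time()
# algo.fit(trainset)
# print("training time (s):", training_cpu_time() - before)
# print("memory (kB):", current_vm_size_kb())
```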
We gathered measurements for samples as provided by our sampling strategy, trained our models on the produced statistics, and used them to predict the respective resource usage over the complete dataset. We evaluated our predictions using the normalized RMSE (NRMSE), computed as NRMSE = RMSE / ȳ, where ȳ is the mean of the actual time/memory values in the corresponding sample. As KNN Baseline and Centered KNN follow the same time and space complexity and showed similar processing time and memory usage behavior in all experiments, we only show results for the latter.
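For clarity, the evaluation metric as a small helper (our own sketch):

```python
import numpy as np

def nrmse(actual, predicted) -> float:
    """Normalized RMSE: RMSE divided by the mean of the actual values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return rmse / np.mean(actual)
```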

Sampling Strategy
We gathered measurements for values of S (upper limit to the sample size) ranging from 10% all the way to 100% in increments of 10% (i.e., 10%, 20%, . . . , 100%) using Algorithm 1. For each draw, we also used a fixed random seed s from a predefined set of seeds to ensure reproducibility. For the scope of our experimental evaluation, in Algorithm 1, we set the confidence to 99% and use a predefined variance of the processing times and memory overhead in each sample size (e.g., 10%, . . . , 100% of the data) in the range of 10% to 15%, coupled with the default accuracy threshold of 0.1. Lastly, we set the overall time quota T to the default values of 50,000, 2,000, 500, and 100 seconds per sample for ML 20M, GB 10K, ML 1M, and ML 100K, respectively. The results (Section 4.5) indicate that this setup offered a good tradeoff between the number of samples needed and their quality. Other values were explored but led to comparable results and are omitted for space reasons. Furthermore, to highlight the performance of our resource consumption prediction models, we sampled the input data all the way to 100%, which accounts for the full dataset; however, our empirical findings (Section 4.5) indicate that a default sample size of 30% to 40% produces accurate predictions of the processing time and memory overhead required during CF training. Finally, we also analyze whether the standard practice sampling strategy [1,24] used for determining the effectiveness of a CF also performs well when predicting its efficiency (i.e., training time and memory), and we show that our proposed sampling approach leads to good accuracy for both efficiency and effectiveness prediction purposes.

Contenders
For RQ (a) and (b), we investigated whether the dependent variable (i.e., the expected training time and memory usage of a CF algorithm on a dataset D) can be predicted using samples of D and the characteristics of the input representing the independent variables as described in Section 3. As part of RQ (a), we conducted an extensive experimental evaluation of our White-Box approach, as discussed in Section 3.2 (denoted in the remainder of this section as WB/LR and WB/Bayes, when using a linear regression or Bayesian prediction approach, respectively). We compare our solution against two types of baseline approaches: (1) a hard baseline (denoted WB/Lit/LR), in the form of a white-box linear regressor built using algorithmic complexities provided by the relevant literature, and (2) a soft baseline, in the form of the best-performing state-of-the-art black-box regressor (GBM).
In the latter case, the regression model is trained on characteristics resulting from training the CF models on samples of the input dataset, without having any knowledge of the inner space/time complexity of the latter. To this end, we augmented the structural input features (m, n, ρ) with rating distribution/frequency-related features. Specifically, in line with the practice in the state of the art [1,24], we modeled the concentration of users' (items', respectively) ratings by using the Gini coefficient [34] as described in Equation (4), where w is the number of users (items, respectively), ρ_k is the number of ratings given by a user (or received by an item, respectively), and ρ_total is the total number of ratings. We thus compute the Gini coefficient for users, Gini_users, and for items, Gini_items, and include them as extra features for the black-box approach. We experimented with multiple state-of-the-art regressors and report results for the best performer, GBM. Figures 2(a)-2(c) depict the training time for the complete dataset, as predicted when using different upper sample sizes. Specifically, the curves on these figures show the predicted full-dataset training time (y-axis) versus the upper sample size limit (x-axis) on which the contenders were applied. The horizontal black line represents the actual training time over the entire dataset.

Results
In other words, the closer a curve is to the black line, the more accurate the prediction, and the earlier a curve approaches the black line, the smaller a sample is required to achieve this result. The orange and purple areas show the prediction error interval and uncertainty for the predictions of WB/LR and WB/Bayes, respectively, as presented in Section 3.3. Our results indicate that WB/LR, using simple linear regression, outperforms the much more complex best-performing state-of-the-art regressor (GBM). In most cases a 30% to 40% upper sample size limit seems enough to allow WB/LR to achieve highly accurate predictions. Interestingly, WB/Bayes also seems to achieve good accuracy in the processing time prediction task with similarly small sample sizes; however, its training cost is considerably greater than that of WB/LR and WB/Lit/LR, as discussed shortly (Table 1). When assessing the overall performance of our framework, we also need to consider the cost of sampling the dataset, training the CF algorithms on the samples, and running the prediction models on the base data (i.e., processing time and memory usage). To this end, Figures 3(a)-3(c) highlight the cost of acquiring samples from the dataset (blue bars) stacked on top of the cost of training the CF on these samples (red bars). The training time for the full dataset is depicted with a black horizontal line. In other words, when a composite blue-red bar reaches or exceeds the black horizontal line, this denotes that it is more efficient to train the CF algorithms directly on the full dataset rather than draw samples and train on them. We note that for CF algorithms with more expensive training costs (e.g., KNN-based CF), the processing time exhibits a steeper increase across samples. Hence, our framework is more valuable for predicting the efficiency of such CF models, as we can draw a small and cheap sample of the data while accurately estimating the processing time and memory of the CF on the full dataset.
So far, we examined how the processing time varies when sampling the datasets and training the CF algorithms on the given samples. However, training and running the prediction models on the base data also involves a cost. Table 1 presents the average time taken by each prediction model across ML 100K, ML 1M, and GB 10K. As we can see, the cheapest models are WB/LR and WB/Lit/LR, followed by GBM, while WB/Bayes is by far the slowest (by several orders of magnitude). Therefore, for predicting the efficiency of a CF model, it is important to select not only an adequate sampling strategy but also a cheap and accurate predictor, such as our proposed approach (WB/LR).

Figures 4(a)-4(c) illustrate the accuracy of the predictions for memory usage across the three datasets. These figures are similar in design to Figure 2; the y-axis shows the memory usage for training the CF models over the complete dataset, as predicted when using the value on the x-axis as the upper sample size limit, while the horizontal black line depicts the actual memory usage for training the CF models over the complete input dataset. As with the time prediction models, the proposed approach (WB/LR, orange curve) outperforms the best state-of-the-art regressor (GBM, green curve). WB/Bayes (purple curve) again achieves good accuracy in the prediction task, but at a much higher training cost, as discussed earlier. Again, a 30% to 40% sample suffices for WB/LR and WB/Bayes to produce highly accurate predictions. WB/Lit/LR is not shown on this figure as, alas, the literature around the CF models considered here provides no space complexity analysis.

Another benefit of using the processing time and memory prediction models is knowing when to stop sampling the input data and training the CF models. We have discussed in Section 4.3 a simple stopping strategy based on the variance of the resource (i.e., time, memory) measurements across samples. However, we can assess the quality of the predictions on a given upper sample size S% by quantifying the prediction error interval in the linear models and the uncertainty in the Bayesian models, as presented in Section 3.3. As the predictors get more accurate, the prediction error interval and the uncertainty shrink; hence, we can stop sampling the dataset and training the models. In our experiments, we noticed that an upper sample size of 30% to 40% produces good estimations for the processing time and memory usage. From Figures 3(a)-3(c) we can see that sampling and training our best predictor (LR) on a sample size of 30% to 40% of the entire collection is much cheaper than training the CF models on the complete datasets.
To fully assess the performance of a CF algorithm on a given input, we need to examine both efficiency and effectiveness. The results above demonstrated how the proposed sampling strategy and models can address the former (i.e., efficiency). For the latter (i.e., effectiveness), we employ state-of-the-art regression models to learn the quality of the recommendations given a CF algorithm, a collection of samples (drawn using the strategy in Section 3.4), and characteristics of the input. Figures 5(a)-5(c) show the predicted recommendation effectiveness for the popular CF algorithms implemented in Surprise [45]. We note that a sample size of 30% to 40% (the same size as for the efficiency models) attains good accuracy for predicting the effectiveness of a CF model. Consequently, our tool and methodology allow one to draw a single set of samples for predicting both the effectiveness and efficiency of a CF in one go.
As presented in Table 2, we also analyzed the efficiency-effectiveness tradeoffs emerging from the training cost on various sample sizes compared to the normalized RMSE for the estimated processing times. For the selected CF algorithms, we measured the training cost in terms of time (seconds) and power consumption (kWh), which combined can indicate the monetary cost quantified in US dollars ($). Then, this cost is compared to the normalized RMSE values obtained from predicting the processing time of the entire dataset using our proposed methods (WB/LR and WB/Bayes), as well as the hard (WB/Lit/LR) and soft (GBM) baselines. This study was conducted on GoodBooks 10K, as we believe it is a good case of when a bigger sample that is more expensive does not necessarily improve the accuracy of the predictions.
For example, let us examine the baseline CF model. As the sample size increases, and therefore so does the training cost, the prediction error decreases, leading to more accurate expected processing times. However, not all CF algorithms display this behavior. For SVD, a bigger and more expensive sample does not necessarily minimize the NRMSE value; this means that while the cost of training increases, the processing time predictions' accuracy remains relatively constant, without further improvements. Therefore, we believe that knowing the training cost of a CF algorithm versus its nominal effectiveness (i.e., the quality of the recommendations) is a critical aspect that should be taken into consideration for building and deploying efficient CF models without sacrificing the users' satisfaction while maximizing the content providers' revenue. Another interesting aspect to examine when choosing a CF algorithm is its effectiveness, as well as its training and deployment costs. For the former, benchmarks such as the one from [45] and also our results (i.e., Figures 5(a)-5(c)) show that different CF models can have similar recommendation quality (e.g., RMSE and MAE for KNN and SVD). However, the training cost can be considerably higher for one CF model than for another. Therefore, when CF algorithm selection represents a critical decision based on the infrastructure available, we propose using our prediction tool, methodology, and sampling strategy to determine whether a CF model would be feasible and fit the operational time and memory constraints before allocating and spending significant resources on it.
As part of RQ (c), in Table 3, we report the average variance ratio (where 0 corresponds to 0% and 1 corresponds to 100%) across samples in different subsets of data for a fixed confidence value of 99% and the constraints from Section 4.3 (i.e., 10% to 15% variance in the processing time values). While most algorithms seem to achieve a low variance in the processing times from the 20% subset of data on MovieLens 100K, we can observe that for the 10% partition, which also corresponds to the smallest input, some CF models, such as KNN, KNN Baseline, and Baseline, are less stable. This was expected since these algorithms were the fastest on very small inputs, and the recorded training times were as small as 0.78 milliseconds (KNN). However, as the size of the input data increases, the variance across samples decreases, resulting in steadier measurements. This behavior was also observed on the larger datasets (ML 1M and GB 10K), which yielded similar results (i.e., 5% variance, or 0.05 in Table 3, starting with the 10% data split); these are not included for space considerations.

Fig. 7. Predicted processing times for the full (a) MovieLens 100K and (b) MovieLens 1M datasets using our approaches (WB/LR and WB/Bayes) and the two baselines (WB/Lit/LR and GBM). Each prediction model has been trained and tested against the two sampling strategies, i.e., sampling w.r.t. ratings/interactions (standard practice (SP) sampling) versus users and items (proposed sampling).
For RQ (d), we first investigated if the standard practice sampling strategy can be employed to draw samples that capture both the efficiency and effectiveness of CF on a subset of the data. Figures 6(a) and 6(b) present the raw measured effectiveness of a CF algorithm on an upper sample S% that has been chosen using (1) standard practice sampling [1,24] and (2) the proposed sampling mechanism (Section 3.4). Interestingly, the two sampling strategies have a similar performance regarding the accuracy of the produced recommendations. We note that for the random algorithm, it is expected to have different RMSEs every time a sample is drawn, using either strategy, since the recommendations are produced at random. Thus, for effectiveness, we conclude that both strategies can be successfully used with similar performance.
However, if we examine the two sampling strategies to predict the efficiency (e.g., processing time) of a CF model, more notable changes are observed. Figures 7(a) and 7(b) show the performance of the prediction models, which have been trained on samples drawn using the two strategies. Our results indicate that the standard practice sampling strategy fails to capture the complexity characteristics of the base data, as the prediction models (non-dashed curves) are far away from the ground truth (black horizontal line). On the other hand, the models that have been trained using the proposed sampling approach (dashed curves) are a lot more accurate in predicting the efficiency of the CF algorithms. Consequently, since our adaptive sampling algorithm (Section 3.4) works well for both efficiency and effectiveness prediction, it has been used throughout the experimental evaluation.

CONCLUSIONS
The accurate prediction of the resource consumption during the training phase of CF models is of exceptional practical interest to establishments of all sizes from both academia and industry. However, so far, the relevant literature has not addressed this pressing problem. This article addresses the challenge of predicting the processing time and memory overhead of CF algorithms using a simple yet highly effective approach. This incorporates a fit-for-purpose sampling scheme and a fast but accurate linear regression scheme over time and space complexity equations drawn from the algorithms' implementations. Furthermore, we showed that, using a smaller sample of a dataset, the CF models' performance and resource cost can be estimated with our methodology without training on the entire collection. Our sampling strategy also allows the prediction of both efficiency and effectiveness while paying the cost of sampling only once.
Despite its simplicity, our proposed methodology and resource cost models for CF algorithms manage to considerably outperform in accuracy even the best-performing off-the-shelf state-of-the-art regressor. Moreover, our approach is also faster than all contenders, including the best state-of-the-art regressor, and utilizes only a fraction of the resources used by them. We view this work as one of the first core steps toward a systematic exploration of the efficiency-effectiveness tradeoffs inherent in modern recommendation systems. While our methodology was developed and probed with CF models, we believe it could be applied and used for other classes of algorithms and use cases. In the near future we plan to expand our methodology to other areas of the recommendation systems' design space, such as deep learning models, and to estimate the training cost for other resources (e.g., GPU usage).