Risk prediction models for head and neck cancer: A rapid review

Abstract

Background: Cancer risk assessment models are used to support prevention and early detection. However, few models have been developed for head and neck cancer (HNC).

Methods: A rapid review of Embase and MEDLINE identified n = 3045 articles. Following dual screening, n = 14 studies were included. Quality appraisal using the PROBAST (risk of bias) instrument was conducted, and a narrative synthesis was performed to identify the best performing models in terms of risk factors and designs.

Results: Six of the 14 models were assessed as “high” quality. Of these, three had high predictive performance, achieving area under the curve values over 0.8 (0.87–0.89). The common features of these models were their inclusion of predictors carefully tailored to the target population/anatomical subsite and development with external validation.

Conclusions: Some existing models do possess the potential to identify and stratify those at risk of HNC, but there is scope for improvement.

at risk and direct them to appropriate prevention and early detection/ diagnosis and treatment pathways.
There has already been some success and clinical adoption of other cancer prediction models in primary care; for example, the Qseries risk prediction models or cancer Risk Assessment Tool. 15,16 These models have been well evaluated, demonstrating the potential of "personalized medicine" to identify and stratify those at risk. [17][18][19][20][21] However, they do not assess for HNC risk and there seem to be few risk prediction models for HNC developed or adopted for clinical use.
Furthermore, no comprehensive reviews of HNC risk prediction models or tools have been published. The aim of this study was to undertake such a review by systematically searching for and identifying models in the international literature, describing their characteristics and performance, appraising their quality, and performing a narrative synthesis to compare and contrast risk prediction models for HNC.

| METHODS
A rapid review methodology was employed, following the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) guidelines. 22 The review was also based on similar reviews of risk prediction models for other cancer sites/diseases. A search of the Embase and MEDLINE databases (1947-Present, updated daily) was conducted using a combination of key headings and search terms associated with "head and neck cancer" and "risk/risk factor/risk assessment" and "prediction/model/tool/score" (see Supplementary Material file for the full list of search terms).

| Search strategy
Studies were included if they satisfied all of the following criteria: (i) used a statistical model/tool to predict HNC risk including subsites and potentially malignant conditions; (ii) were published in English; (iii) considered multiple different risk factors; (iv) provided a measurement of risk; and (v) were applicable to the general population.
Given that the focus of this review was on risk prediction, studies that developed prognostic or recurrence models were excluded. Similarly, studies that only considered highly selected groups or risk variables such as highly specific genes were also excluded (as per criterion [v] of the inclusion criteria). If multiple publications of the same model were identified, the most extensive and recent report of the model was included.
The requirement for a statistical model/tool was included to ensure that a robust methodology underlay model development. Crucially, it also separated risk models from the numerous case-control studies that considered multiple risk factors individually, often expressing these as odds ratios but not evaluating them in the form or context of a risk prediction model or tool. A measurement of risk was taken to mean, at a minimum, the reporting of performance metrics (e.g., AUC), and ideally the assignment of a risk value to an individual (e.g., 5-year risk).
Articles retrieved by the search were loaded into EndNote X9 (Clarivate Analytics) reference management software and then imported into Covidence (Covidence systematic review software, Veritas Health Innovation), which was used to remove duplicates and perform study screening and data extraction.

| Screening and study selection
Two reviewers (CS, DIC) independently screened search results at a title/abstract level and then at a full-text level using the eligibility criteria. In the event of a disagreement, articles were discussed and included or excluded by mutual consensus.

| Data extraction and quality assessment
Following full-text screening, data extraction was undertaken by two reviewers (CS, DIC) using a customized form containing pre-defined fields, including items on study characteristics (location, study design, cancer site/subsite, and risk factors included). The form also recorded whether clinician input was required (based on whether this was reported, or on the nature of the data required to run the model; for instance, a patient would not be able to use machine learning tools or conduct HPV serology analysis), along with items on predictive performance (discrimination, sensitivity/specificity, calibration, positive predictive value/negative predictive value [PPV/NPV] and risk threshold cut-offs) and the method of validation (if undertaken).
Discrimination was considered "acceptable" if an area under the curve (AUC) value over 0.7 was reported, and "excellent" if a value greater than 0.8 was reported. 23 Calibration was assessed by how close the expected/observed ratio or the gradient of the calibration slope was to the ideal value of 1, and how close the calibration intercept was to the ideal value of 0. 24,25 Two reviewers (CS, DIC) also examined the risk of bias of each model using PROBAST, a tool specifically designed to appraise clinical risk prediction models. 26 Risk of bias ("high," "low," or "unclear") and applicability of the risk models were assessed using 20 questions across four domains (participants, predictors, analysis, and outcomes).
An overall quality assessment was also given to each model, considering model validation together with the risk of bias and applicability concern assessments (from PROBAST). This was categorized as "High," "Moderate," or "Low." If the model had (i) a low risk of bias, (ii) a low or unclear applicability concern, and (iii) a robust method of validation, then it was considered "High" overall quality. Model performance was considered separately by evaluating each model's discriminative ability. If a model achieved an "excellent" AUC over 0.8 it was classed as high performing (green in table). 27 Models that achieved acceptable discrimination between 0.7 and 0.8 were classed as moderate (amber in table) and discrimination less than 0.7 was classed as poor (red in table). These classifications also took account of the confidence intervals for model AUC (where reported).
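As a minimal sketch of the expected/observed (E/O) calibration check described above, the ratio can be computed from a list of predicted risks and a list of observed outcomes. The function name and the example data are hypothetical and are not taken from any of the reviewed models; the calibration slope and intercept are not covered here.

```python
def expected_observed_ratio(predicted_risks, outcomes):
    """Return the E/O ratio: sum of predicted risks over observed event count.

    A value of 1 is ideal; values above 1 suggest over-prediction and
    values below 1 suggest under-prediction.
    """
    expected = sum(predicted_risks)
    observed = sum(outcomes)
    if observed == 0:
        raise ValueError("no observed events; E/O ratio is undefined")
    return expected / observed


# Hypothetical example: six individuals, two of whom developed the outcome.
predicted = [0.25, 0.5, 0.75, 0.5, 0.25, 0.75]
observed = [0, 0, 1, 0, 0, 1]
eo_ratio = expected_observed_ratio(predicted, observed)
print(f"E/O ratio: {eo_ratio:.2f}")
```

In this made-up example the model expects three events but only two occur, so the ratio is above 1 and the model over-predicts.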
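The classification rules in this section can be sketched as two small functions. The "High" quality rule and the AUC bands follow the text; how exact boundary values (0.7, 0.8) are handled, and the mapping of one unmet criterion to "Moderate" and two or more to "Low", are assumptions made for illustration (the latter loosely based on the description of moderate and low quality models in the synthesis). All names are hypothetical.

```python
def performance_class(auc):
    """Band a reported AUC into the performance classes used in this review."""
    if auc > 0.8:
        return "high"      # "excellent" discrimination (green)
    if auc >= 0.7:         # assumption: exactly 0.7 counts as acceptable
        return "moderate"  # "acceptable" discrimination (amber)
    return "poor"          # red


def overall_quality(risk_of_bias, applicability_concern, robust_validation):
    """Combine PROBAST results and validation into an overall quality grade."""
    unmet = 0
    if risk_of_bias != "low":
        unmet += 1
    if applicability_concern not in ("low", "unclear"):
        unmet += 1
    if not robust_validation:
        unmet += 1
    if unmet == 0:
        return "High"
    # Assumption: one unmet criterion is "Moderate", two or more is "Low".
    return "Moderate" if unmet == 1 else "Low"


print(performance_class(0.87), overall_quality("low", "unclear", True))
```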

| Synthesis
The heterogeneous nature of the risk prediction models made pooling of data between models inappropriate. Instead, a narrative synthesis was conducted, focusing on overall model quality and performance and comparing and contrasting the risk factors used in the risk prediction models. Risk factors were grouped as sociodemographic factors (e.g., age, sex, socioeconomic characteristics), behaviors (smoking, alcohol), biomarkers (e.g., HPV, genetic/polygenic data), and clinical information (e.g., symptoms, oral potentially malignant disorders [OPMD]). Models were also compared across subsites of HNC.

| RESULTS
Following the removal of 100 duplicates, 2945 studies remained from the search. Of these, 2900 were excluded at title or abstract screening. Of the remaining 45 studies, 34 were excluded following full-text assessment, with reasons for exclusion noted (Figure 1). The most common reasons for exclusion were that studies did not use a statistical method to develop a risk model or did not consider multiple risk factors. A further three articles were identified: two from reviewing reference lists and the third (yet to be published at the time of writing) from one of the reviewers' research collaborations (DIC). Thus, in total, 14 papers were ultimately included in this review (Figure 1). [28][29][30][31][32][33][34][35][36][37][38][39][40][41] Three of the 14 models featured "sub-models," using broadly similar methods but stratifying by subsite or sex. 29,30,35 These sub-models have been considered separately where they were reported.

based case-control design. One study used data collected from a randomized controlled trial. 31 The other two used a cross-sectional and a prospective cohort design, respectively. Eleven models used a form of logistic regression to evaluate HNC risk. Two of the three remaining models used machine learning methods, while one used Cox regression.
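The screening counts reported above can be cross-checked with a few lines of arithmetic. This is only a consistency check on the numbers stated in the text (the 3045 figure comes from the abstract); the variable names are illustrative.

```python
# Counts as reported in the abstract and results.
records_identified = 3045
duplicates_removed = 100
excluded_title_abstract = 2900
excluded_full_text = 34
added_from_other_sources = 3  # two from reference lists, one from a collaboration

screened = records_identified - duplicates_removed
full_text_assessed = screened - excluded_title_abstract
included_from_search = full_text_assessed - excluded_full_text
total_included = included_from_search + added_from_other_sources

print(screened, full_text_assessed, included_from_search, total_included)
```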

| Study/model characteristics
A variety of cancer outcomes were considered. One model considered the risk of OPMD, and another two considered the risk of developing oral cancer from OPMD. Three further models evaluated the risk of oral cancer. Six models evaluated the overall risk of HNC, two of these stratifying risk by sex and by various subsites including cancers of the oral cavity, hypopharynx, oropharynx and larynx. One model considered the risk of oral and oropharyngeal cancer. Finally, one model considered the risk of oropharyngeal cancer.
Eight of the 14 models were deemed likely to require clinician input to be used, and six could possibly be used in a self-assessment role by patients.

| Discrimination
Discrimination, the ability of a model to distinguish between those with and without disease, is a crucial performance metric of a risk model. All 14 models provided measurements of discriminatory accuracy in their development population, validation population, or both.
Ten of these models described the statistical uncertainty of their findings. Many models (n = 9) reported AUC values (and intervals where reported) greater than 0.7, achieving "acceptable" or "excellent" discrimination. One model achieved a "perfect" model discrimination of 1.0; however, this model was constructed from a very small sample, in addition to other key limitations and bias concerns such as a failure to report statistical uncertainty and any missing data. 36
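To illustrate what the AUC measures, the sketch below computes it directly as the proportion of case-control pairs in which the case receives the higher predicted risk (ties count as half). This pairwise formulation is a standard equivalent of the area under the ROC curve; it is not the method used by any specific reviewed study, and the scores are invented.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the probability that a random case outranks a random control."""
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties contribute half a win
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical predicted risks for three cases and four controls.
cases = [0.9, 0.8, 0.4]
controls = [0.7, 0.3, 0.2, 0.4]
auc = auc_from_scores(cases, controls)
print(f"AUC: {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls, which is why a reported value of 1.0 from a very small sample warrants caution.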

| Accuracy
Measurements of accuracy in the form of sensitivity and specificity were described in seven of the 14 studies. Sensitivity ranged from 67.53% to 100%, 36,39 and specificity from 67.7% to 100%. 28,36 The model that reported the highest sensitivity and specificity achieved 100% for both metrics in its validation population. 36

cancer models in one study where calibration was sub-optimal in these particular calibration plots. 35 Most of the models that did report calibration presented graphs or statistics that were close to the ideal calibration slope (expected/observed) value of 1, with some models slightly above this value indicating some over-prediction. 24

| PPV/NPV
PPV is defined as the proportion of patients with a positive test result who actually have the disease, and NPV as the proportion of patients with a negative test result who do not have the disease. Six models reported measurements of PPV and NPV. 43 Of these, only one model reported the statistical uncertainty of these values. 33 The PPV and NPV values reported ranged from 20.7% to 100% and 83% to 100%, respectively. Again, Liu and colleagues 36 achieved the highest PPV and NPV values of 100%.
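The accuracy measures discussed in these sections (sensitivity, specificity, PPV and NPV) all derive from the same 2x2 table of test result against true disease status. The sketch below shows the standard formulas; the function name and the counts are hypothetical and do not come from any reviewed study.

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Compute standard accuracy measures from 2x2 table counts.

    tp: diseased, test positive      fp: not diseased, test positive
    fn: diseased, test negative      tn: not diseased, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # diseased who test positive
        "specificity": tn / (tn + fp),  # non-diseased who test negative
        "ppv": tp / (tp + fp),          # test positives who are diseased
        "npv": tn / (tn + fn),          # test negatives who are not diseased
    }


# Hypothetical validation sample: 50 with disease, 150 without.
metrics = accuracy_metrics(tp=45, fp=30, fn=5, tn=120)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the disease is in the sample, so values from case-control data may not transfer directly to a general population.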

| Model risk cut-offs
Of the 14 models, only five reported model risk cut-offs during development. Two models used a risk probability as a cut-off. 36,40 Three models reported cut-offs using performance metrics including AUC, sensitivity and specificity, and PPV/NPV. 28

| Validation
Only three of the 14 models reported external validation: Amarasinghe et al., 28 Koyanagi et al., 32 and Tota et al. 41 Three others reported robust methods of internal validation via random split sampling. 29,35,37

| Risk factors
Altogether, the 14 models considered more than 30 different risk factors. The most common factors included were age (13 models), alcohol consumption (13 models), sex (12 models), and tobacco smoking (12 models). Notably, two models considered HPV serostatus in model development: Budhathoki et al. 29 and Tota et al. 41 The number of risk factors included in models ranged between 5 36,39 and 13. 30

| PROBAST
The evaluation of each domain of the PROBAST risk of bias assessment tool is summarized in Table 3. Of the 14 models, seven were deemed to have a "high" risk of bias in at least one domain. The "analysis" section was the most common domain where a high risk of bias was identified. Common reasons included low numbers of the outcome of interest, a lack of external validation and limited or no internal validation, failure to report the statistical uncertainty of findings, and no discussion of missing data (or of how it was handled). Five of the 14 models had an "unclear" applicability concern, whereby aspects of the model may limit its applicability but were not considered major limitations. The "participants" section was the most common domain where applicability concerns were classified as "unclear." Reasons included limited generalizability owing to the outcome considered, limiting the analysis to those of one ethnicity, use of non-primary HNC cancer sites and low-quality reporting of methods. Where models had a limitation but were otherwise fairly robust and well developed, the risk of bias was deemed "low." Overall, 7 of the 14 models were deemed to have an overall low risk of bias. 28,29,32,35,37,40,41

| Overall quality assessment
The overall quality assessment of the 14 models, the components considered in this assessment, and the assessment of model predictive performance can be seen in Table 4. Of the 14 models, six were assessed as "high" quality, 28,32,35,37,40,41 three as "moderate" quality, and five as "low" quality. The main components that affected quality were PROBAST risk of bias, applicability concern, and validation methods.

TABLE 3 PROBAST performance by model (columns: study ID; risk of bias and applicability concern for the participants, predictors, outcome and analysis domains; overall)

In terms of performance, eight of the 14 models were high performing, with AUCs greater than 0.8, ranging from 0.83 to 1. 28,31,33,36,38,39,40,41 Of the six high quality models, three had high predictive performance with good discriminative accuracy: Amarasinghe et al., 28 Tikka et al., 40 and Tota et al. 41

| Synthesis
All six of the high-quality models were more recently developed (since 2010). Despite the heterogeneity of the models, those that were assessed as high quality generally shared common design aspects. All six were developed from case-control study data, with some variation in design such as hospital, community, or synthetic controls, or a mixture of population and hospital controls. All six also used a form of logistic regression to derive their risk models. These included binary, multivariate or conditional logistic regression. Three of the models required clinician input to use: two of these due to HPV or genotype information, 32,41 and one due to use of clinical examination information. 40

With regard to the factors included in the high-quality models, all six had some sociodemographic factors. Age was evaluated in all six and sex in five; one model did not analyze or adjust for sex, which in turn resulted in reduced applicability. Four of the models adjusted for at least one additional sociodemographic factor (education, ethnicity, or socioeconomic deprivation). 28,35,37,41 Two high-quality models evaluated socioeconomic deprivation, one synthesizing educational and occupational status to define this, the other measuring deprivation using an area-based socioeconomic index. 28,37 Similarly, all six incorporated behavioral factors into their model, using both alcohol intake and smoking. Notably, one model used betel quid chewing as an additional behavioral predictor, an important aetiological risk factor for the target population of this model. 28 One model also used exercise and fruit/vegetable intake as additional factors. 37

Two of the three models that used biomarker (genetic or HPV) data were ultimately among the six assessed as high quality. One high-quality model used HPV serostatus as a predictor. 41 Another used DNA sampling to assess ALDH2 genotype. 32 Only one of the high-quality models reported family history as a predictor. 35

The critical design feature common to all of the high-quality models was robust validation. This included the use of external validation in another setting, 28,32,41 a history of external validation (in a previous developmental version of the model), 40 or the use of well-conducted internal validation with a large split sample. 35,37 All the high-quality models that did not use an external validation approach described this as a limitation and a next step in their model development.
In addition to the high quality models having a low risk of bias, five of the six also had a low applicability concern, 32,35,37,40,41 while the remaining model scored "unclear" for applicability concern assessment. 28 This was primarily due to the model evaluating the risk of OPMD only and not accounting for sex as a predictor during model development.
The high-quality models mostly had fair to good performance. One model had a sub-optimal AUC of 0.64, 37 two high-quality models reported a fair AUC over 0.7, 32,35 and three of the high-quality models achieved excellent AUC values over 0.8. 28,40,41 The better the discriminative performance, the more accurately those at risk of disease can be identified. Two of the six high-quality models did not report calibration metrics. 28,40 This was the main limitation of these models.

The three high-quality, high performing models 28,40,41 predicted the risk of OPMD, HNC, and oropharyngeal cancer, respectively. Two of the models were externally validated in another cohort, 28,41 while the other model had a history of external validation in its first version. 40 This model has also seen some clinical use, being used to triage patients remotely during the COVID-19 pandemic. 44 All three used similar sociodemographic and behavioral factors; where they differed, and ultimately excelled, was in the choice of additional predictors. These included the aforementioned use of betel quid in one model, 28 clinical examination and symptoms in another model, 40 and the use of HPV serostatus and ethnicity in the third model. 41

Models that were classified as moderate overall quality had at least one major methodological limitation, but generally had fair predictive performance. Studies that were classified as low quality had at least two significant limitations; some of these reported very good model discrimination, although this needs to be interpreted with caution.

| DISCUSSION
A range of risk prediction models for HNC were identified. These models were heterogeneous in their risk factors and outcomes, were developed with variable methodological approaches and rigor, and several models demonstrated the potential to predict and identify those at a higher risk of HNC.
The six high-quality models incorporate the major HNC risk factors of tobacco smoking and alcohol consumption, in addition to the sociodemographic factors of age and sex. Additional factors are included, contributing to improved performance: four included socioeconomic factors, one family history, one betel quid chewing, one HPV serology, one a genetic marker, and one clinical examination findings. These selected factors are consistent with the international analytical epidemiological literature, which has identified tobacco smoking and alcohol drinking as the major risk factors (accounting for up to 70% of the population attributable risk), [45][46][47] the important role of age in cancer risk, 48 and sex, particularly men being more predisposed to HNC. 49 Moreover, the important role of HPV, particularly in oropharyngeal cancer, 50 of betel quid chewing in oral cavity cancer in particular populations, 51 and the increasingly refined role of genetic factors in HNC are reflected in the models. 52,53

Three of these high-quality models were also high performing, with consistent methodological rigor; all included major behavioral and sociodemographic factors along with an additional factor. They were generally more specific models that were tailored to their target population or subsite, for example, to South Asia (the inclusion of betel quid) 28 or to oropharyngeal cancer (the inclusion of HPV serology). 41 Perhaps counterintuitively, some of the models that included many additional risk factors generally had lower predictive performance. This could be explained by a statistical phenomenon known as model overfitting, whereby a model becomes too tailored to a developmental dataset through the inclusion of unnecessary components. This violates the principle of parsimony, in turn limiting a model's generalisability when applied to another independent dataset. 54

The higher performing models, and particularly the high performing, high-quality models, generally required clinician input, reflecting the nature of the variables required.
It could be hypothesized that HPV serostatus and genetic markers could help better inform individual risk, in line with the growing popularity of "personalized medicine" in other diseases. However, this may in turn limit the practicality of a model, for example, in primary care medical or dental practices where time and resources may already be limited.
Several similar reviews of risk prediction models have been undertaken for other cancer sites including colorectal, breast, [55][56][57][58] and lung. 59 The reviews of colorectal and lung cancer models both identified models with high performance (AUC ranges 0.65-0.75, 0.76-0.96, and 0.57-0.879), while the breast cancer models generally reported poorer performance (0.56-0.63 and 0.56-0.71). The poor performance of these breast cancer models was attributed in the reviews to limited knowledge and data on risk factors for breast cancer, leading to sub-optimal prediction. The performance range of the HNC models was similar to that in the colorectal and lung cancer reviews and might in part be attributed to the growing epidemiological research base in the field. 60,61

The methodology of this review is similar to reviews of risk models for other cancers. [55][56][57][58][59] This review employed robust quality and methodological assessment including the PROBAST risk of bias and applicability concerns tool, and a focus on the nature of model development and validation approach. While validation is a domain of the PROBAST tool, it was explicitly assessed separately because external validation is the gold standard in risk prediction model development. 62,63

To our knowledge, this is the first review of HNC risk prediction models. This review has some strengths, including searching multiple databases, dual article screening, and comprehensive quality assessment. A detailed thematic narrative synthesis drew on model quality and performance to identify key design characteristics.
There are some limitations to this review. First, a protocol was not published. The study started as a rapid review but ultimately became more systematic in nature, particularly in terms of quality assessment methods. It was not feasible to register the review retrospectively, hence the review was not registered with PROSPERO. However, the PICO/research question and search/inclusion criteria were developed a priori and did not change during the review. The review was also conducted following PRISMA guidelines and was advised by a subject librarian in the field. Second, the inclusion of papers published only in English may have excluded other pre-existing models. Third, as with most reviews, the nature and limitations of the available data can influence the overall quality of the evidence synthesized; the source data of this review are largely from case-control studies, which have some potential for recall and selection biases. 64 This review has also been conducted within the overarching objectives of developing a risk model for HNC and translating it to a clinical setting. Any findings of this review are intended to help inform model development.

| CONCLUSIONS
This review illustrates that there is a limited but growing number of HNC risk prediction models. Some of the models reviewed do have the potential to identify and stratify those at risk of HNC. Model predictor selection should include, as a minimum, well-established risk factors as well as sociodemographic predictors. Additional genetic, biomarker, or clinical factors have the potential to improve predictive performance. However, care should be taken to keep the number of predictors limited in order to avoid model overfitting. Such early identification of risk factors in the context of an HNC risk level could have important applications, including using this "teachable moment" for behavior change, directing patients to preventive care pathways (e.g., for smoking cessation), or identifying the need for tailored recall intervals for clinical examination (e.g., with primary care dental practitioners). These models could form the basis of a personalized approach to HNC prevention. Further work can be undertaken to refine, improve and validate these models and potentially trial them in the clinical setting.