
8. Choosing a Statistical Test


Choosing the appropriate statistical test(s) for a study can be a daunting task when you are new to clinical research. More comprehensive resources are available for each statistical test, so this article aims only to provide a flavour of a few tests to get you started. We hope this will help you start the process of considering what type of analysis might be necessary to answer the clinical questions posed in a research study. At the end of the article, there are some brief signposts for support with planning your research statistics.

Questions to consider 

What is the main study hypothesis? 

If your study has a proposed hypothesis then a statistical test is likely to be required to determine the  significance of the answer to the clinical question, for example “is modality B a suitable replacement for  modality A for measurement X?”, “does factor X influence variable Y?”, or “which of modalities A, B or C is  most sensitive for identifying condition X?” etc. The answer to your research question can be tested to check whether it is likely to be ‘real’ (i.e. statistically significant - unlikely to have occurred by chance) by using an appropriate statistical test. 

What type of data is being used in the study? 

In some cases, questions regarding the statistical validity of study data may need to be assessed prior to choosing a statistical test. These questions may include “what type of data are being investigated?”, “are the data independent?”, “are the data normally distributed?” and “what is the study power using this data group?”.

What degree of statistical significance should be used? 

Most clinical studies use a probability value (P-value) threshold of P < 0.05 to indicate statistical significance of results. A P-value below 0.05 means that, if there were truly no relationship or difference between the groups being compared (the null hypothesis), results at least as extreme as those observed would be expected less than 5% of the time by chance alone. Strictly, it is not the probability that the results are due to chance, nor does it mean there is a >95% probability that the relationship or difference is real. In some studies it may be appropriate to use a different significance threshold. For example, in studies where multiple independent hypotheses are being tested it may be appropriate to use a Bonferroni correction, where the required significance level is P < 0.05/n and n represents the number of independent hypotheses. In a study with two independent hypotheses, using a Bonferroni correction would give a required P-value of P < 0.025.
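The Bonferroni arithmetic can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the two P-values are hypothetical and are not taken from any study discussed in this article.

```python
def bonferroni_threshold(alpha: float, n_hypotheses: int) -> float:
    """Return the per-test significance threshold alpha / n."""
    if n_hypotheses < 1:
        raise ValueError("n_hypotheses must be at least 1")
    return alpha / n_hypotheses


# Hypothetical P-values from two independent hypotheses (illustrative only).
p_values = [0.030, 0.012]

# With two hypotheses the threshold becomes 0.05 / 2 = 0.025.
threshold = bonferroni_threshold(0.05, len(p_values))

# A result is only called significant if it falls below the corrected threshold.
significant = [p < threshold for p in p_values]
print(threshold, significant)
```

Here the first P-value (0.030) would have passed an uncorrected 0.05 threshold but does not pass the corrected one, which is the purpose of the correction.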

Examples of statistical tests in existing studies 

In this article, three clinical studies have been briefly analysed to demonstrate the appropriate statistical test for a particular clinical hypothesis, and to show how these tests are used to answer the question. Only the main statistical test used in answering the study hypothesis has been included for the sake of brevity, so this article will not cover analysing independence of study variables or normalisation of data. The aim of this article’s analysis is to demonstrate how some common clinical hypotheses are answered in published research and thus provide an idea of how you might proceed with study design when proposing similar types of research.

 

Statistical aim: comparing the level of agreement between two different  measurement modalities 

Statistical test: Bland-Altman analysis 

Bland-Altman analysis is a test designed to determine the level of agreement between two different modalities that are measuring the same variable. If one modality is considered the ‘gold-standard’ modality for this type of measurement, then Bland-Altman analysis can be used to assess the accuracy of the second modality by comparing it to the gold-standard method. Bland-Altman analysis involves graphically plotting the measurement differences between the two modalities on the y-axis, against the mean of the two measurements on the x-axis. The average measurement difference provides a ‘bias’; the smaller the bias, the higher the level of agreement between the two modalities.

Next, ‘limits of agreement’ are obtained by calculating ±1.96 standard deviations from the bias; 95% of the differences between modalities should ideally lie within these limits. Additionally, ‘limits of acceptability’ should also be agreed upon beforehand and these can also be inputted into the graph: if the limits of agreement fall within the limits of acceptability, then the two modalities can be said to demonstrate an acceptable level of agreement and could in theory be used interchangeably to obtain similar results when  measuring that variable in future. 
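The bias and limits of agreement described above can be sketched using only the Python standard library. This is a minimal illustration under stated assumptions: the function name, the paired measurements and the limits of acceptability are all hypothetical, and a real analysis would also plot the points and check the distribution of the differences.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower_loa, upper_loa, plot_points) for paired measurements."""
    if len(method_a) != len(method_b):
        raise ValueError("measurements must be paired")
    # Difference (y-axis) and mean (x-axis) for each pair of measurements.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    return bias, lower, upper, list(zip(means, diffs))


# Hypothetical paired stenosis measurements (%), invented for illustration.
reference = [50, 62, 71, 45, 80, 66]
new_method = [52, 60, 74, 44, 83, 65]
bias, lower, upper, points = bland_altman(new_method, reference)

# Hypothetical limits of acceptability, agreed before the study.
acceptable_low, acceptable_high = -10.0, 10.0
agreement_ok = acceptable_low <= lower and upper <= acceptable_high
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

The `points` list holds the (mean, difference) pairs that would be plotted on the x- and y-axes of the Bland-Altman graph.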

In the following study, Bland-Altman analysis was used to demonstrate that tomographic 3D-ultrasound (tUS) and duplex ultrasound (DUS) both demonstrate good agreement with fistulography for identifying and measuring the degree of stenosis within an arteriovenous fistula (AVF). 

Rogers et al. (2021). Arteriovenous Fistula Surveillance Using Tomographic 3D Ultrasound. Eur J  Vasc Endovasc Surg. 62(1), pp.82-88. 

Study aim: 

To investigate the level of agreement between tUS, DUS and fistulography for identifying and measuring  AVF stenosis, with the aim of determining whether tUS is a suitable replacement for DUS in the assessment  of AVF stenosis in future, as tUS is significantly less time-consuming and therefore would be beneficial for  department workflow and reducing ultrasound-related musculoskeletal disorders. 

Study summary: 

97 patients with a poor-flow arteriovenous fistula (AVF) underwent imaging with fistulography, tUS and DUS, which identified 101 stenoses for analysis. The degree of stenosis was measured and Bland-Altman analysis was performed to assess the level of agreement between each ultrasound modality and fistulography, which was considered the gold-standard measurement modality. Bland-Altman analysis demonstrated close agreement between fistulography and DUS / tUS, with tUS showing slightly better agreement, indicating that all three measurement modalities are interchangeable when measuring degree of AVF stenosis. However, tUS has the additional benefits of being non-invasive, unlike fistulography, and taking less than half the time to perform compared to DUS, indicating that tUS is a promising modality for obtaining non-invasive, fast and accurate AVF stenosis measurements.

Figure 1. Bland-Altman agreement for (A) duplex ultrasound (DUS) and (B) tomographic 3D-ultrasound (tUS) compared with fistulography as the gold-standard and (C) tUS compared with DUS as the gold-standard in the measurement of arteriovenous fistula (AVF) stenosis. SD = standard deviation; LOA = limit of agreement.


 

Statistical aim: investigating differences in means between independent groups  

Statistical test: One-way ANOVA 

One-way analysis of variance (ANOVA) is a statistical test used to compare the means of independent groups in a study and determine whether any of the means are significantly different from each other. In this study, one-way ANOVA has been used to demonstrate that there is a statistically significant increase in mean AAA growth rate (GR) as AAA size increases. One-way ANOVA has also been used in this study to demonstrate that there was no statistically significant difference in mean AAA GR between non-smokers and smokers or ex-smokers, between genders, or between normotensive and hypertensive patients. This shows that for this patient cohort the most significant factor affecting AAA GR is AAA size, and thus the authors suggest AAA size should be taken into consideration when determining AAA surveillance intervals.
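As a rough sketch of what one-way ANOVA computes, the F statistic compares the variation between group means with the variation within groups. The example below uses only the Python standard library; the function name and the growth-rate values are hypothetical and are not data from the study. Converting F to a P-value requires the F distribution, which a statistics package would normally handle (for example, SciPy's `scipy.stats.f_oneway`).

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of independent groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of each group mean around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical annual growth rates (cm/year) for three size groups.
small = [0.10, 0.15, 0.12, 0.13]
medium = [0.20, 0.25, 0.22, 0.21]
large = [0.35, 0.30, 0.38, 0.33]

f_stat, df_between, df_within = one_way_anova_f([small, medium, large])
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

A large F value suggests that the group means differ by more than would be expected from the spread within each group. Note that ANOVA only indicates that at least one mean differs; it does not say which groups differ from each other.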

Ian Hornby-Foster. (2023). Abdominal aortic aneurysm growth rates in patients undergoing local ultrasound surveillance. Ultrasound. 31(1), pp.23-32. 

Study aim: 

A retrospective analysis of abdominal aortic aneurysm (AAA) ultrasound surveillance in University Hospitals  Bristol and Weston (UHBW), with the aim of assessing AAA growth rate (GR) and the concurrent impact of  AAA risk factors (RFs) and associated medications, to inform whether the current UHBW AAA surveillance  protocol is safe and appropriate. 

Study summary: 

315 patients comprising 1312 AAA scans were investigated, with exclusion criteria including aortic  diameter measurements <3.0cm or >5.5cm, and patients who had fewer than 2 AAA scans. The patients were divided into groups of 0.5cm increments (3.0  – 3.4cm, 3.5 – 3.9cm, 4.0 – 4.4cm, 4.5 – 4.9cm, 5.0 – 5.5cm), based on baseline AAA size. Annual GR between groups was compared using one-way ANOVA. One-way ANOVA was also used to investigate the influence of risk factors on AAA GR.  

Mean GR for all patients was 0.25cm per year; however, one-way ANOVA demonstrated a significant increase in GR with increasing AAA diameter. One-way ANOVA also demonstrated that there was no statistically significant impact of age, smoking, gender, hypertension, or hypercholesterolaemia on AAA GR for this patient cohort. However, there was a significant difference between the mean growth rate of diabetic and non-diabetic patients, suggesting an inverse relationship between the presence of diabetes and AAA GR.

Figure 2. Mean annual AAA growth rates (cm/year) with error bars indicating the upper and lower bounds of the 95% confidence intervals. AAA GR can be seen to increase with AAA diameter.

 

Statistical aim: investigating sensitivity and specificity of a measurement  modality 

Statistical test: receiver operating characteristic (ROC) curve analysis 

A ROC curve is a graphical plot that illustrates the performance of a binary classifier at varying threshold values, by plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity) at each threshold. A higher area under the curve (AUC) indicates better overall discrimination, i.e. across thresholds the model is more likely to provide a true positive result and less likely to provide a false positive result. In the study below, ROC curves are used to demonstrate that intra-arterial fractional flow reserve (FFR) measurement has the highest sensitivity and joint highest specificity (with translesional pressure measurement (Pd/Pa)) for predicting presence of critical limb-threatening ischaemia (CLTI).
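The construction of a ROC curve and its AUC can be sketched in plain Python. This is a minimal illustration, not the method used in the study: the function names, scores and labels are hypothetical, and the sketch assumes that a higher score means the condition is more likely to be present (a measurement where lower values indicate disease would need its direction reversed first).

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) points, treating score >= threshold as a positive call."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative case")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more cases are called positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical test scores and true disease status (1 = condition present).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(f"AUC = {area:.3f}")
```

An AUC of 0.5 corresponds to a test that performs no better than chance, while an AUC of 1.0 corresponds to a test that separates the two groups perfectly.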

Albayati et al. (2024). Intra-arterial Fractional Flow Reserve Measurements Provide an Objective  Assessment of the Functional Significance of Peripheral Arterial Stenoses. Eur J Vasc Endovasc Surg.  67(2), pp.332 - 340. 

Study aim: 

To use fractional flow reserve (FFR) to investigate the ischaemic potential of peripheral arterial stenoses,  and compare this technique to other methods of investigating stenosis: ankle brachial pressure index  (ABPI), duplex ultrasound (DUS), CT angiography (CTA), translesional pressure measurement (Pd/Pa).  

Study summary: 

61 isolated iliac or superficial femoral artery stenoses were studied in 41 patients (10 patients with bilateral disease) with either short-distance claudication or CLTI, who were recruited prior to elective angioplasty and/or stenting. Pre-procedural investigations (resting and exercise ABPI, DUS peak systolic velocity ratio (PSVR), and CTA) were performed; intravascular Doppler-derived flow reserve and pressure-derived FFR were obtained during angioplasty. Blood oxygen level dependent (BOLD) cardiovascular magnetic resonance (CMR) was performed before and after angioplasty to assess calf oxygenation.

Association between variables and disease severity was assessed using ROC curve analysis, which showed that a lower FFR was associated with CLTI in the cohort studied. The degree of lesional stenosis measured by CTA, ABPI and PSVR had weaker associations with CLTI than FFR. FFR demonstrated the highest sensitivity and joint highest specificity for predicting CLTI in this cohort.

Figure 3. Association between standard of care assessments and intra-arterial pressure-flow measurements with CLTI. ROC curve analysis with corresponding AUC, 95% confidence interval (CI) and sensitivity (Sens) and specificity (Spec) values displayed. FFR demonstrates the greatest AUC for association with CLTI in this cohort.

Research support resources 

Only a handful of statistical tests have been covered in this article, hopefully providing you with a starting point for how you might begin to consider testing your own research questions and data. There is a much wider range of statistical tests to choose from, and selecting the correct statistical method for your research warrants careful consideration. During the design phase of your research, it is valuable to consider the type and structure of data you intend to generate, as it may impact how you use statistics during analysis. For support, try contacting your local research & development department and asking whether they have an associated statistician. Additionally, take a look at the support offered through the National Institute for Health and Care Research (NIHR).


 

Written by Ben Warner-Michel (Kingston Hospital, London)

Edited by Isaac Colliver (University Hospitals Coventry & Warwickshire, Coventry)