Senior Lecturer, Department of Oral Growth and Development, Bart's and The London School of Medicine and Dentistry, Institute of Dentistry, Queen Mary's College, London, UK
There is an increasing volume of research undertaken within orthodontics and with this comes a need to evaluate what is available. This short series aims to help the orthodontist revise basic concepts of critical appraisal and pertinent statistics.
Clinical Relevance: Critical appraisal skills are valuable tools that can aid clinical decision-making. In this article, we cover concepts including basic descriptive statistics, significance testing and confidence intervals.
Descriptive statistics
Data classification
Information is collected for specified variables to provide the data to address a chosen research question. They are called variables as they provide information as to how individuals or items vary. The appropriate summary measures and statistics to use depend on how the variable is classified.
Categorical
Classification is within one of several categories. If there are only two categories (for example, male or female), the variable is known as binary or dichotomous. If there are more than two categories, these may be ordered (for example, low, average and high angle), in which case the variable is known as ordinal, or unordered, with the categories bearing no mathematical relationship to each other (for example, country of birth or ethnicity), in which case the variable is known as nominal. The number in each category may be expressed as a proportion or percentage of the total, or as the risk or odds of being in one category relative to another. A contingency table is often used to summarize the data.
For example, in a study examining the aetiology of temporomandibular joint ankylosis, the researchers investigated whether patients in their sample had previous middle ear infection. The data might be displayed in a contingency table (Table 1).
Table 1. Previous middle ear infection and temporomandibular joint ankylosis.

                Ankylosis   No ankylosis   TOTAL
Infection           44            8          52
No infection         7           41          48
TOTAL               51           49         100
Numerical
Measurements are made on some form of numeric scale; values may either be whole numbers (for example, number of clinic visits or parity) or lie on a continuous scale (for example, patient height or cephalometric data). Numerical data are usually summarized using two measures – one to give an average and the other to describe the spread.
Measures of average for numerical data
Mean: obtained by dividing the sum of the observations by the number of observations.
Median: the middle value of the ordered observations, corresponding to the 50th centile. Centiles divide the ordered values into 100ths. So, for example, the 10th centile is the value below which 10% of the entire distribution of values lies. The median is the 50th centile: half the population has values higher than this and the other half lower.
Mode: the value that occurs most frequently in a data set.
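These three averages can be illustrated with a short Python sketch; the overjet values used here are invented purely for demonstration.

```python
import statistics

# Hypothetical overjet measurements (mm) for nine patients
overjet = [2, 3, 3, 4, 4, 4, 5, 6, 12]

print(statistics.mean(overjet))    # mean: 43/9 ≈ 4.78
print(statistics.median(overjet))  # median: middle ordered value = 4
print(statistics.mode(overjet))    # mode: most frequent value = 4
```

Note how the single large value (12) pulls the mean above the median, a point taken up in the next section.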
Data distribution.
If the distribution of values is symmetric around the average, tailing off evenly in both directions away from this central value, then the data are said to be normally distributed (parametric) and, in this case, the mean = median. For example, Figure 1 shows the heights of a set of individuals.
If data are not normally distributed, then the distribution will be skewed and the mean will be a biased estimate of the centre of the data – thus the mean and median will be different. For example, Figure 2 shows the distribution of the incomes of a set of individuals. In this data set the mean income will be much higher than the median. The mean is pulled in the direction of the skew as it is overly influenced by the relatively few individuals in the tail of the distribution (in this case the relatively few very high earners).
Note: the Shapiro-Wilk and Kolmogorov-Smirnov tests, and Q-Q (normal probability) plots, may be used to check for normality. If data are not normally distributed, then a log transformation is often used to induce normality.
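As a minimal sketch, these checks can be run in Python with SciPy; the right-skewed sample below is simulated rather than taken from any study data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=10, sigma=0.8, size=200)  # simulated right-skewed incomes

# Shapiro-Wilk test: a small p-value is evidence against normality
print(stats.shapiro(incomes))

# A log transformation often induces approximate normality in right-skewed data
print(stats.shapiro(np.log(incomes)))
```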
Measures of spread for normally distributed data
If values are normally distributed, then the mean should be used to summarize the average, and the variance or standard deviation should be used to summarize the spread.
The variance is the average of the squared distances of each observation from the arithmetic mean. Squaring eliminates the sign, so that deviations above and below the mean both contribute, and the result is a measure of the average deviation around the mean. If values are very widespread then some distances from the mean will be large and there will be a large variance. If the values are tightly grouped, then all values will be quite close to the mean and the variance will be small.
The standard deviation (SD) is defined as the square root of the variance. Therefore, as for the variance, a small value indicates tightly grouped measurements and a large value indicates that they are widely spread (Figures 3 and 4).
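The relationship between the variance and the standard deviation can be made concrete with a short sketch; the heights are hypothetical and the usual sample (n − 1) divisor is assumed.

```python
import statistics

heights = [158.0, 162.5, 165.0, 167.5, 170.0, 172.5, 177.0]  # hypothetical heights (cm)

mean = statistics.fmean(heights)
# Variance: average of the squared distances of each observation from the mean
variance = sum((x - mean) ** 2 for x in heights) / (len(heights) - 1)
sd = variance ** 0.5  # the standard deviation is the square root of the variance

assert abs(variance - statistics.variance(heights)) < 1e-9  # matches the library value
print(mean, variance, sd)
```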
Measures of spread for non-normally distributed data
If values are not normally distributed, then the median should be used to summarize the average, and the range or interquartile range may be used to quantify spread.
The range is defined as the largest value minus the smallest value.
The interquartile range represents the range over which the middle 50% of the data lie, between the 1st quartile (25th centile of the distribution) and the 3rd quartile (75th centile). It is preferred to the range, which is, by definition, dependent on the two most extreme observations and may therefore be very unstable (Figure 5).
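A sketch contrasting the two measures on invented data containing one extreme observation:

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 4, 4, 5, 6, 40])  # hypothetical skewed data

data_range = values.max() - values.min()  # 39: driven entirely by the two extremes
q1, q3 = np.percentile(values, [25, 75])  # 1st and 3rd quartiles
iqr = q3 - q1                             # span of the middle 50% of the data

print(data_range, iqr)  # the IQR is far more stable than the range
```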
Prevalence and incidence
These terms are closely related yet distinct. They describe how commonly a disease or condition occurs in a population.
Prevalence: The number or proportion of individuals with a disease at a given point in time. This term is often preferred in the study of rarer conditions.
Incidence: The number or proportion of individuals who contract the disease in a particular time period, customarily expressed as a rate.
Summarizing differences between two groups
Many studies collect information on two groups and make comparisons between them. Commonly, comparisons are made of new or standard treatment groups or of diseased individuals versus healthy controls. The appropriate comparison statistic to use depends on the type of variable being compared.
If the variable is numerical, then the difference in means or medians will provide a suitable summary to compare the groups.
However, if the outcome variable is categorical, several different summaries can be used to quantify the likelihood of being in one of the categories:
Absolute risk: The number developing the event in the group divided by the total number in that group.
Control event rate (CER): The absolute risk in the control group.
Experimental event rate (EER): The absolute risk in the experimental group.
Odds: For each group, the odds are the number with the event divided by the number without the event.
The difference in the likelihood of being in one category may then be compared between the groups in one of several ways:
Absolute risk reduction: How much more likely the individuals in a group exposed to a factor are to have the feature compared to a non-exposed (control) group.
Number needed to treat: The number of individuals that would need to receive a treatment, compared with the control, in order for one individual to benefit. It is the reciprocal of the absolute risk reduction. The ideal number needed to treat is 1, which means that everyone who is treated will benefit. The higher the number needed to treat (NNT), the less effective the treatment is.
Relative risk: Risk in people exposed divided by risk in people not exposed (controls).
Odds ratio: Experimental group odds divided by control group odds.
To clarify this, suppose there are a+b individuals who are exposed to a factor, of whom a have the disease, and c+d individuals, who are not exposed and of whom c have the disease (Table 2).
Table 2. Generic 2 x 2 table of exposure and disease status.

                Disease status:
                Positive   Negative   TOTAL
Exposed             a          b       a+b
Not exposed         c          d       c+d
TOTAL              a+c        b+d    a+b+c+d
It can therefore be seen that:
CER = c/(c+d)
EER = a/(a+b)
Absolute risk reduction (ARR) = CER − EER, which is c/(c+d) − a/(a+b)
Number needed to treat = 1/ARR
Relative risk = [a/(a+b)] / [c/(c+d)]
Odds ratio = (a/b)/(c/d).
The contingency table from the study looking at the aetiology of temporomandibular joint ankylosis (Table 1) may be enhanced (Table 3) to illustrate the concepts defined in Table 2. Here, a = 44, b = 8, c = 7 and d = 41.
Table 3. Infection and ankylosis (Table 1) labelled with the notation of Table 2.

                             Ankylosis    No ankylosis
                             (positive)   (negative)    TOTAL
Infected (exposed)             (a) 44       (b) 8         52
Not infected (not exposed)     (c) 7        (d) 41        48
TOTAL                            51           49         100
44/52 = 0.846, or 84.6%, of patients reporting past infection demonstrated ankylosis (EER), compared to 7/48 = 0.146, or 14.6%, of uninfected patients (CER).
The absolute risk reduction is 14.6 − 84.6 = −70.0%. This means that 70 more patients per 100 of those reporting past infection have ankylosis (a negative risk reduction can be interpreted as an increase in risk).
The number needed to treat is 1/−0.700 = −1.43 (a negative value is sometimes described as a number needed to harm). This means that, on average, for every 1.4 patients who are infected, one extra will develop ankylosis.
The relative risk is (44/52) / (7/48) = 0.846/0.146 = 5.80. This means that those who are infected are, on average, 5.8 times as likely to develop ankylosis.
The odds of a previously infected patient having ankylosis are 44/8 = 5.5, and the odds of an uninfected patient having the same are 7/41 = 0.17.
The odds ratio for ankylosis in previously infected compared with uninfected patients is therefore (44/8)/(7/41) = 5.5/0.17 = 32.2, which shows that the odds of ankylosis are considerably higher in those with a past history of infection.
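These calculations can be reproduced with a few lines of Python; the cell counts are those of Table 3.

```python
a, b, c, d = 44, 8, 7, 41  # cell counts from Table 3

eer = a / (a + b)               # experimental event rate: 44/52 ≈ 0.846
cer = c / (c + d)               # control event rate: 7/48 ≈ 0.146
arr = cer - eer                 # absolute risk reduction ≈ -0.700 (negative: an increase in risk)
nnt = 1 / arr                   # number needed to treat ≈ -1.43 (one extra case per ~1.4 exposed)
rr = eer / cer                  # relative risk ≈ 5.80
odds_ratio = (a / b) / (c / d)  # odds ratio ≈ 32.2

print(f"EER={eer:.3f} CER={cer:.3f} ARR={arr:.3f} NNT={nnt:.2f} RR={rr:.2f} OR={odds_ratio:.1f}")
```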
Evaluating screening test results
A diagnostic or screening test is evaluated by assessing whether some potentially predictive feature is present or not in individuals with and without disease. If the feature is predictive of disease, then it will often be present in those with disease and absent in those without. There are a number of summaries, defined below, that are generally used to quantify how good the test is. All of these summaries can take values between 0 and 100% and an ideal test would have high values of all four.
Sensitivity: The proportion of individuals with a target disorder who have a positive test.
Specificity: The proportion of individuals without a target disorder who have a negative test.
Positive predictive value: The proportion of individuals who have a positive test who have the target disorder.
Negative predictive value: The proportion of individuals who have a negative test who do not have the target disorder.
In addition, two other terms are often used:
Positive likelihood ratio: The extent to which an individual's odds of disease are estimated to increase if they test positive.
Negative likelihood ratio: The extent to which an individual's odds of disease are estimated to fall if they test negative.
For example, the use of pulse oximetry to diagnose obstructive sleep apnoea (OSA) has been explored owing to the ease of administration and lower relative costs compared to more formal sleep studies. Sensitivity and specificity have been reported at 82% and 76%, respectively. Thus, even though these figures are quite high, 18% of those with OSA will be missed by the test and about a quarter of those without any OSA will test positive.
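Sketching this in Python, the likelihood ratios implied by the quoted sensitivity and specificity can be derived directly:

```python
sensitivity = 0.82  # proportion of OSA patients with a positive oximetry result
specificity = 0.76  # proportion of patients without OSA with a negative result

lr_positive = sensitivity / (1 - specificity)  # ≈ 3.4: odds of OSA rise ~3.4-fold on a positive test
lr_negative = (1 - sensitivity) / specificity  # ≈ 0.24: odds fall to ~a quarter on a negative test

print(lr_positive, lr_negative)
```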
Making inferences
We now discuss ways in which sample data can be used to make inferences. How valid and generalizable these inferences are will depend on how the data were accrued and we should always bear in mind the design of the study when interpreting the results.
Hypothesis (significance) tests
Research questions are generally concerned with finding differences. For example:
(a) Do children with osteogenesis imperfecta have a different average height compared to the normal healthy population?
(b) Does a fluoride-containing mouthwash reduce decalcification around brackets (does it reduce the average number of lesions more than a placebo)?
However, significance tests assess how likely the sample was to have occurred if the converse (no difference) were true.
Thus the above examples would test the following null hypotheses:
(c) Children with osteogenesis imperfecta have the same average height as the normal healthy population.
(d) Fluoride mouthwash has no effect on the number of enamel decalcification lesions.
If we show that these statements (c and d) are unlikely to be true, then we will also have shown that the previous statements (a and b) are likely to be true.
Statistical analyses allow valid inferences to be made from a random sample to the population from which the sample was selected. The research question is phrased as a hypothesis, from which a null hypothesis is derived (one which predicts no difference, thus attributing any observed differences to chance) (Box 2).
Significance testing.
The steps in significance testing are as follows:1
Define null hypothesis.
Collect sample data.
Obtain test statistic from sample data.
Use test statistic to determine a p-value (probability value) using appropriate significance test.
Provide confidence interval.
Decide whether to reject null hypothesis or not.
There are many different significance tests and the appropriate one to use depends on the type of outcome being compared, the number of groups being compared and, in the case of two groups, whether there is a pairing between groups (for example, age and sex matched pairs of diseased individuals and healthy controls; the same patient assessed when on two different treatments as part of a crossover trial). Figure 6 provides a guide to which test is appropriate when two groups (for example, diseased vs healthy, treated vs placebo) or more (for example, five different ethnic groups) are compared.
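As an illustration of these steps, a two-sample (unpaired) t-test, appropriate for comparing a normally distributed outcome between two independent groups, can be carried out with SciPy; the data below are simulated purely for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(loc=4.0, scale=1.5, size=30)  # simulated change in mandibular length (mm)
control = rng.normal(loc=2.5, scale=1.5, size=30)  # simulated untreated group

# Test statistic and p-value for the null hypothesis of equal population means
t_stat, p_value = stats.ttest_ind(treated, control)
print(t_stat, p_value)
```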
P-values
A p-value gives the probability of obtaining the observed sample data (or data more extreme) if the null hypothesis were true; it is not the probability that the null hypothesis is true. Since p-values are probabilities, they lie within the range zero to 1.
P-values close to zero are more likely to occur if the study has demonstrated something real (such as the treatment works; the disease has an effect; the prevalence is different from that suggested by previous studies). The smaller the p-value, the greater the evidence against the null hypothesis. If the p-value is close to zero, we say the result is significant. P-values further from zero (usually >0.05) are said to be non-significant (NS) at the 5% level.
Note that nothing is ever proven one way or the other: the p-value merely shows how likely the sample data were to have occurred if a certain situation (the null hypothesis) were true. Papers should ideally give the actual p-value for each statistical test performed. Be wary of studies that only give the results from statistically significant tests and/or omit confidence intervals alongside the significance tests (see later).
Continuing with the previous example, if the sample of children with osteogenesis imperfecta has an average height similar to that of healthy children, then this is a likely sample to obtain under scenario (c) and the p-value will be close to 1. That is, there is a high probability of obtaining the sample if (c) is true.
If the sample of children with osteogenesis imperfecta has a very low average height compared to the healthy children, then this would be unlikely under scenario (c) and the p-value will be close to zero.
Similarly, if the treatment and placebo groups of orthodontic patients using a fluoride mouthwash have very different numbers of early decalcified lesions, then (d) appears to be false. The probability of obtaining samples by chance with such different numbers of lesions if the treatment does not work will be low (close to zero) and hence the p-value will be close to zero.
Errors and associated concepts
In statistical hypothesis testing, there are two types of incorrect conclusions or errors that can be drawn. These are explained below, together with important consequential applications.
Type I error: This occurs when the null hypothesis is rejected even though it is true. A false positive result is given, which means that a difference is found when it does not really exist. The p-value is the probability of obtaining the observed results by chance alone (ie if the null hypothesis were true) and so indicates how likely we are to be making a type I error if we reject the null hypothesis on the basis of our sample data.
Multiple hypothesis testing: If a number of significance tests are carried out on a data set, and we use the same threshold (usually p < 0.05) as evidence against the null hypothesis for each, then the overall type I error rate increases, making us more likely to conclude that an ineffective treatment or intervention is actually effective. The Bonferroni correction can be used to adjust the individual significance thresholds to give the correct overall type I error rate (see the sketch after this list).
Type II error: This occurs when the null hypothesis is accepted when it is in fact false. A false negative result is given which means that no difference is found when one actually does exist, usually attributed to the sample size being too small to detect the difference. It is guarded against by adequate, pre-study, sample size/power calculations (see below).
Power: The power of a study is the probability that a Type II error will not be made (the probability of rejecting the null hypothesis when it is false). It represents the study's ability to detect a true difference in outcome. A power of 0.80 (or 80%) is often accepted as adequate. Conversely, note that a power of 80% also means that there is a 20% chance of missing the real difference if it truly exists. Hence a higher power (90 or 95%) would probably be more appropriate for many studies (Box 3).
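As flagged above, a minimal sketch of the Bonferroni approach, here applied by tightening the significance threshold for a set of hypothetical p-values:

```python
p_values = [0.04, 0.01, 0.20, 0.03]  # hypothetical p-values from four tests on one data set
alpha = 0.05                         # desired overall type I error rate

# Bonferroni: compare each p-value against alpha divided by the number of tests
threshold = alpha / len(p_values)
significant = [p < threshold for p in p_values]
print(threshold, significant)  # 0.0125 -> only p = 0.01 remains significant
```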
Power calculations.
Larger studies give more precise answers to research questions. A power calculation is made to ensure the sample size of a study is large enough to have a high chance of detecting a statistically significant result if one truly did exist. For numeric outcomes, the more variable the outcome, the larger the sample will need to be to detect a difference of a set size.
Optimal sample size can be evaluated using this information by formulae, computer programs or a diagrammatic tool called a nomogram.2 Many internet applications are also available.3,4,5
To undertake a power calculation, the following are required:
The probability of type I and type II error rates (significance level and power required).
An estimate of variability (variance) if the outcome is numeric.
The clinically relevant difference (the minimum size of difference that would be of clinical interest and we would not want to miss − for example, in a study measuring a change in overjet, 0.001 mm is irrelevant, but 2 mm is clinically relevant). Previous studies can help determine this.
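Putting these ingredients together, a sketch of a sample size calculation for a two-group comparison using the statsmodels package; the 2 mm clinically relevant difference and the assumed 3 mm standard deviation are illustrative values, not drawn from a specific study.

```python
from statsmodels.stats.power import TTestIndPower

# Standardized effect size: clinically relevant difference / estimated SD
effect_size = 2.0 / 3.0  # 2 mm difference, assumed 3 mm standard deviation

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,  # type I error rate (significance level)
    power=0.90,  # 1 - type II error rate
)
print(n_per_group)  # ≈ 48 patients per group
```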
Confidence intervals
A random sample should give an unbiased estimate of a population parameter. We usually want an estimate of a population mean (for example, the average height of 5-year-olds), population percentage (for example, the prevalence of molar-incisor hypoplasia), or the difference attributable to a treatment (for example, change in mandibular length). In each case, the sample gives an estimate. How precise an estimate this is depends on the sample size (larger samples give better estimates) and, in the case of numeric data, the variability of the observations made.
A standard error (SE or SEM) may be used to quantify the precision of an estimate (for example, a sample mean, difference in means or proportions between two groups, or odds ratio). Essentially, the SE quantifies the expected variability of the estimate given the sample size it is based on. Larger standard errors imply less precision of the estimate. Larger samples give more precise estimates (hence a smaller standard error).
A confidence interval (CI) can be constructed as follows:
95% CI for a population estimate = sample estimate ± 1.96 × SEM
Confidence intervals are another approach to address the potential effect of chance on the results. Rather than phrasing the research question as a hypothesis, the value of a particular population parameter is estimated using data collected from a sample. Rather than a point estimate, it is more useful to have an interval estimate.
The confidence interval for a parameter is the range of values within which we are (usually 95%) confident that the true population parameter lies. It thus describes the degree of uncertainty around an estimate when quoted in relation to it (Box 4).
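Following the formula above, a 95% confidence interval for a mean might be computed as follows (the heights are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
heights = rng.normal(loc=110, scale=5, size=50)  # simulated heights of 5-year-olds (cm)

mean = heights.mean()
sem = heights.std(ddof=1) / np.sqrt(len(heights))  # standard error of the mean
ci_lower, ci_upper = mean - 1.96 * sem, mean + 1.96 * sem

print(f"mean {mean:.1f} cm, 95% CI {ci_lower:.1f} to {ci_upper:.1f} cm")
```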
Interpreting confidence intervals.
When using confidence intervals, it is helpful to bear in mind the following:
Confidence level: The specified probability is called the confidence level and it is widespread convention to create confidence intervals at the 95% level (this means that, 95% of the time, properly constructed intervals should include the true value of the parameter of interest).
Width: This indicates the precision of the estimate. A narrow interval is more precise.
Confidence limits: The end points of the confidence interval are called the confidence limits, often symmetric around the point estimate. They represent the largest and smallest effects that are likely given the observed data. To interpret the results properly we should consider the implications of the population scenarios for each limit.
If the p-value displayed is <0.05, the 95% confidence interval will not contain zero (or 1 if the estimate is a relative risk or odds ratio) – since both show that the data are not compatible with zero difference. Similarly, if the p-value displayed is > 0.05, the 95% confidence interval will contain zero (or 1) – indicating no significant difference (compatibility of the sample with there being zero difference in the population). If these rules are broken this indicates a potential problem with the analyses presented.
For example, in a study comparing the mean change in mandibular growth between patients treated with a functional appliance (Group A) and untreated patients (Group B), the 95% CI is reported as 1.4 to 2.8 mm. Thus, we would expect the mean of population A to be between 1.4 and 2.8 mm greater than the mean of population B. This is the range of population scenarios (population mean differences between A and B from 1.4 to 2.8 mm) with which the sample data are reasonably compatible (ie with 95% confidence); the sample data lead us to believe that differences outside this range are unlikely. Since the interval does not contain zero, the data suggest that the treatment does cause a difference in mandibular growth. We can also conclude that a significance test of the difference would yield a p-value < 0.05.