Are descriptive statistics sometimes more useful than tests of significance?


How does the biology community currently feel about publishing descriptive and effect size statistics rather than significance statistics? Almost every journal article I read in the cell biology field reports P values and statistical tests to establish significance, but should effect sizes be more important to a biologist? Do we even care if something is statistically significant if the effect size is negligible? Rather than crunch for significance, could one get away with showing things like confidence intervals, eta^2, Cohen's d, and r values instead of P values? A P value tells you the probability, assuming the null hypothesis is true, of an observation at least as extreme as the one you made (less than 5% when P<0.05). However, this can lead to a logical fallacy noted by Aristotle: theory A predicts that changing X will cause Y. An experimenter manipulates X, sees changes in Y, and therefore concludes that theory A is supported, which does not follow. Theories B, C, D, E… could all also predict that X changes Y and may even be better at it. Even if you conclude that your findings "support" theory A, the support is weak because you haven't ruled out all of the other possibilities.

So, in order to avoid null-hypothesis significance testing and all of its pitfalls, can one use descriptive and effect size statistics just as effectively, if not more so?

I'm not a statistician, but I think the comments have got it right. There is never a reason to omit P values, statistical power or some other measure showing that your result is not a random outcome.

For the sake of reference, let's define the terms you mention:
Eta squared is the proportion of the total variance in a measurement that is attributable to a factor (a ratio of two sums of squares).
Cohen's d is a standardized measure of the difference between two means.
The r value, or Pearson correlation, describes the strength of the linear relationship between two variables, usually one of them being a measurement and the second being an experimental variable.
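As a rough illustration of the three effect sizes just defined, here is a minimal Python sketch. The measurements and the function names (`cohens_d`, `eta_squared`, `pearson_r`) are made up for this example and are not from the original answer; Cohen's d is computed with the pooled standard deviation, which is one common convention among several.

```python
import math
import statistics

# Hypothetical measurements for two groups (illustrative only).
control = [4.1, 5.0, 4.6, 5.3, 4.8, 5.1]
treated = [5.2, 5.9, 5.4, 6.1, 5.7, 6.0]


def cohens_d(a, b):
    """Difference between two means in units of the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(pooled_var)


def eta_squared(groups):
    """Between-group sum of squares divided by the total sum of squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    return ss_between / ss_total


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


print("Cohen's d:", round(cohens_d(control, treated), 2))
print("eta squared:", round(eta_squared([control, treated]), 2))
# Hypothetical dose levels 1..6 paired with the treated measurements.
print("Pearson r:", round(pearson_r([1, 2, 3, 4, 5, 6], treated), 2))
```

None of these numbers says anything about whether the result could have arisen by chance, which is the point the rest of this answer makes.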

These numbers, as you say, are descriptive, but they could be produced by flipping coins and writing down the results. With a small number of measurements and a large enough range of possibilities, it's possible to get terribly large values here.

Biology, medicine, social science and economics are really susceptible to this. You go into the field and measure butterfly wings, or do surveys of people's opinions, or try to guess who is going to win the election, and it's quite expensive to do more measurements.

If you are measuring something hard to determine, either because accuracy is important (such as a close election race) or because the system is really complicated (such as which genes convey a susceptibility to type 2 diabetes, a problem which remains unsolved because so many genes play a role), you need large numbers of responses. Each study that comes out gets some answer, but in most cases these numbers alone should convince no one.

Microarray data and RNASeq data analysis, for instance, often suffer from this problem. The measurements are all statistically significant, but each one costs hundreds or even thousands of dollars. Most experiments do a minimal three measurements to understand the variance in each measurement and then do 2 to 8 actual measurements. That's not going to be so revealing when working with a system with thousands of genes in it. One bad sample with slightly different culture conditions can ruin the experiment.

Our butterfly biologist may measure 100 butterflies and stop when the P values are 0.05 or 0.001; it's a lot of work camping out and setting nets. The truth is that a 5% event happens at random quite often: about one in twenty such experiments. Even a 0.1% error will happen in one in a thousand such experiments. Across the thousands of experiments published, that means a good number of them contain a mistake. Not so great.
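To make the "5% happens a lot" point concrete, here is a small simulation sketch. It assumes a deliberately simple setup that is not in the original answer: every experiment draws 20 values from a standard normal population where the null hypothesis (mean = 0) is true, and runs a two-sided z-test with known standard deviation. The function names are illustrative.

```python
import random
from statistics import NormalDist, mean


def null_experiment_p_value(rng, n=20):
    """Two-sided z-test P value for a sample drawn while H0 (mean = 0) is true."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = mean(sample) * n ** 0.5  # sigma is known to be 1, so SE = 1 / sqrt(n)
    return 2.0 * NormalDist().cdf(-abs(z))


def false_positive_rate(trials=2000, alpha=0.05, seed=1):
    """Fraction of null experiments that come out 'significant' at level alpha."""
    rng = random.Random(seed)
    hits = sum(null_experiment_p_value(rng) < alpha for _ in range(trials))
    return hits / trials


rate_05 = false_positive_rate(alpha=0.05)
rate_001 = false_positive_rate(alpha=0.001)
print(f"share of null experiments with P < 0.05:  {rate_05:.3f}")
print(f"share of null experiments with P < 0.001: {rate_001:.4f}")
```

The first share should land near 0.05 and the second near 0.001, even though no real effect exists in any of the simulated experiments.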

It gets worse though - not only significance, but bias needs consideration. Because biologists and most other scientists don't understand statistics, the assumptions that we use when we calculate a P value are often inappropriate and don't give honest estimates of the chance that this is a random phenomenon.

Bias creeps in if a good-looking result is chosen specifically to prove the point, if some data are thrown out because they simply don't look good, or if a hypothesis is chosen in a biased way simply to fit an unreliable set of data.

Or the statistical assumptions of the calculation might simply be so inappropriate that to really cite them is an out and out lie. An everyday example of this is to do a BLAST search. The E-value calculations, if read as a P value would be wrong, even though they are mathematically correct - two strings that show a 10% identity will have a small E value - 10^-8 for instance, but this is only the chance two strings of these lengths will have so many letters in common. Anyone who plays with BLAST will quickly throw out anything that has less than 30% identity unless they are desperate, even though the E-values are infinitesimal.

John Ioannidis has made this subject his focus and has published widely on this topic. A good place to start is his commentary "Why Most Published Research Findings Are False".

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias. In this essay, I discuss the implications of these problems for the conduct and interpretation of research.

Choosing a statistical test

This table is designed to help you decide which statistical test or descriptive statistic is appropriate for your experiment. In order to use it, you must be able to identify all the variables in the data set and tell what kind of variables they are.

test | nominal variables | measurement variables | ranked variables | purpose | notes | example
Exact test for goodness-of-fit | 1 | - | - | test fit of observed frequencies to expected frequencies | use for small sample sizes (less than 1000) | count the number of red, pink and white flowers in a genetic cross, test fit to expected 1:2:1 ratio, total sample <1000
Chi-square test of goodness-of-fit | 1 | - | - | test fit of observed frequencies to expected frequencies | use for large sample sizes (greater than 1000) | count the number of red, pink and white flowers in a genetic cross, test fit to expected 1:2:1 ratio, total sample >1000
G-test of goodness-of-fit | 1 | - | - | test fit of observed frequencies to expected frequencies | use for large sample sizes (greater than 1000) | count the number of red, pink and white flowers in a genetic cross, test fit to expected 1:2:1 ratio, total sample >1000
Repeated G-tests of goodness-of-fit | 2 | - | - | test fit of observed frequencies to expected frequencies in multiple experiments | - | count the number of red, pink and white flowers in a genetic cross, test fit to expected 1:2:1 ratio, do multiple crosses
Fisher's exact test | 2 | - | - | test hypothesis that proportions are the same in different groups | use for small sample sizes (less than 1000) | count the number of live and dead patients after treatment with drug or placebo, test the hypothesis that the proportion of live and dead is the same in the two treatments, total sample <1000
Chi-square test of independence | 2 | - | - | test hypothesis that proportions are the same in different groups | use for large sample sizes (greater than 1000) | count the number of live and dead patients after treatment with drug or placebo, test the hypothesis that the proportion of live and dead is the same in the two treatments, total sample >1000
G-test of independence | 2 | - | - | test hypothesis that proportions are the same in different groups | use for large sample sizes (greater than 1000) | count the number of live and dead patients after treatment with drug or placebo, test the hypothesis that the proportion of live and dead is the same in the two treatments, total sample >1000
Cochran-Mantel-Haenszel test | 3 | - | - | test hypothesis that proportions are the same in repeated pairings of two groups | alternate hypothesis is a consistent direction of difference | count the number of live and dead patients after treatment with drug or placebo, test the hypothesis that the proportion of live and dead is the same in the two treatments, repeat this experiment at different hospitals
Arithmetic mean | - | 1 | - | description of central tendency of data | - | -
Median | - | 1 | - | description of central tendency of data | more useful than mean for very skewed data | median height of trees in forest, if most trees are short seedlings and the mean would be skewed by a few very tall trees
Range | - | 1 | - | description of dispersion of data | used more in everyday life than in scientific statistics | -
Variance | - | 1 | - | description of dispersion of data | forms the basis of many statistical tests; in squared units, so not very understandable | -
Standard deviation | - | 1 | - | description of dispersion of data | in same units as original data, so more understandable than variance | -
Standard error of the mean | - | 1 | - | description of accuracy of an estimate of a mean | - | -
Confidence interval | - | 1 | - | description of accuracy of an estimate of a mean | - | -
One-sample t-test | - | 1 | - | test the hypothesis that the mean value of the measurement variable equals a theoretical expectation | - | blindfold people, ask them to hold arm at 45° angle, see if mean angle is equal to 45°
Two-sample t-test | 1 | 1 | - | test the hypothesis that the mean values of the measurement variable are the same in two groups | just another name for one-way anova when there are only two groups | compare mean heavy metal content in mussels from Nova Scotia and New Jersey
One-way anova | 1 | 1 | - | test the hypothesis that the mean values of the measurement variable are the same in different groups | - | compare mean heavy metal content in mussels from Nova Scotia, Maine, Massachusetts, Connecticut, New York and New Jersey
Tukey-Kramer test | 1 | 1 | - | after a significant one-way anova, test for significant differences between all pairs of groups | - | compare mean heavy metal content in mussels from Nova Scotia vs. Maine, Nova Scotia vs. Massachusetts, Maine vs. Massachusetts, etc.
Bartlett's test | 1 | 1 | - | test the hypothesis that the standard deviation of a measurement variable is the same in different groups | usually used to see whether data fit one of the assumptions of an anova | compare standard deviation of heavy metal content in mussels from Nova Scotia, Maine, Massachusetts, Connecticut, New York and New Jersey
Nested anova | 2+ | 1 | - | test hypothesis that the mean values of the measurement variable are the same in different groups, when each group is divided into subgroups | subgroups must be arbitrary (model II) | compare mean heavy metal content in mussels from Nova Scotia, Maine, Massachusetts, Connecticut, New York and New Jersey; several mussels from each location, with several metal measurements from each mussel
Two-way anova | 2 | 1 | - | test the hypothesis that different groups, classified two ways, have the same means of the measurement variable | - | compare cholesterol levels in blood of male vegetarians, female vegetarians, male carnivores, and female carnivores
Paired t-test | 2 | 1 | - | test the hypothesis that the means of the continuous variable are the same in paired data | just another name for two-way anova when one nominal variable represents pairs of observations | compare the cholesterol level in blood of people before vs. after switching to a vegetarian diet
Wilcoxon signed-rank test | 2 | 1 | - | test the hypothesis that the means of the measurement variable are the same in paired data | used when the differences of pairs are severely non-normal | compare the cholesterol level in blood of people before vs. after switching to a vegetarian diet, when differences are non-normal
Linear regression | - | 2 | - | see whether variation in an independent variable causes some of the variation in a dependent variable; estimate the value of one unmeasured variable corresponding to a measured variable | - | measure chirping speed in crickets at different temperatures, test whether variation in temperature causes variation in chirping speed, or use the estimated relationship to estimate temperature from chirping speed when no thermometer is available
Correlation | - | 2 | - | see whether two variables covary | - | measure salt intake and fat intake in different people's diets, to see if people who eat a lot of fat also eat a lot of salt
Polynomial regression | - | 2 | - | test the hypothesis that an equation with X^2, X^3, etc. fits the Y variable significantly better than a linear regression | - | -
Analysis of covariance (ancova) | 1 | 2 | - | test the hypothesis that different groups have the same regression lines | first test the homogeneity of slopes; if they are not significantly different, test the homogeneity of the Y-intercepts | measure chirping speed vs. temperature in four species of crickets, see if there is significant variation among the species in the slope or Y-intercept of the relationships
Multiple regression | - | 3+ | - | fit an equation relating several X variables to a single Y variable | - | measure air temperature, humidity, body mass, leg length, see how they relate to chirping speed in crickets
Simple logistic regression | 1 | 1 | - | fit an equation relating an independent measurement variable to the probability of a value of a dependent nominal variable | - | give different doses of a drug (the measurement variable), record who lives or dies in the next year (the nominal variable)
Multiple logistic regression | 1 | 2+ | - | fit an equation relating more than one independent measurement variable to the probability of a value of a dependent nominal variable | - | record height, weight, blood pressure, age of multiple people, see who lives or dies in the next year
Sign test | 2 | - | 1 | test randomness of direction of difference in paired data | - | compare the cholesterol level in blood of people before vs. after switching to a vegetarian diet, only record whether it is higher or lower after the switch
Kruskal-Wallis test | 1 | - | 1 | test the hypothesis that rankings are the same in different groups | often used as a non-parametric alternative to one-way anova | 40 ears of corn (8 from each of 5 varieties) are ranked for tastiness, and the mean rank is compared among varieties
Spearman rank correlation | - | - | 2 | see whether the ranks of two variables covary | often used as a non-parametric alternative to regression or correlation | 40 ears of corn are ranked for tastiness and prettiness, see whether prettier corn is also tastier
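As an illustration of one row of the table, the chi-square test of goodness-of-fit for the flower example can be sketched in a few lines of Python. The counts below are hypothetical, and the function names are made up for this sketch. With three categories and no parameters estimated from the data there are 2 degrees of freedom, and for exactly 2 degrees of freedom the upper-tail chi-square probability has the closed form exp(-x/2), so no statistics library is needed.

```python
import math


def chi_square_gof(observed, ratio):
    """Chi-square statistic for observed counts against an expected ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


def chi_square_p_value_df2(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-statistic / 2.0)


# Hypothetical counts of red, pink and white flowers (total 1200).
observed = [320, 590, 290]
statistic, expected = chi_square_gof(observed, [1, 2, 1])
p_value = chi_square_p_value_df2(statistic)

print("expected counts:", expected)
print(f"chi-square = {statistic:.3f}, P = {p_value:.3f}")
```

The closed form only holds for 2 degrees of freedom; a table or a statistics package is needed for any other number of categories.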


This page was last revised December 4, 2014. It may be cited as:
McDonald, J.H. 2014. Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. This web page contains the content of pages 293-296 in the printed version.

©2014 by John H. McDonald. You can probably do what you want with this content; see the permissions page for details.

What is Mean Statistics?

The mean is the most commonly used measure of central tendency. It is the average of all the values in the sample. To determine the mean of a sample, you only need to find the sum of all the values involved and divide it by the number of values.

For example, if you want to find the mean score of students in a particular test, you need to sum up all the scores and divide the total by the number of students. Take the marks of ten students with the following values:

50, 67, 35, 46, 21, 77, 92, 46, 88, 63

The sum of the ten scores is 585, so the mean is 585 / 10 = 58.5.

What is Median Statistics?

The median statistic is the value found in the exact middle of a set of values. To find the median of a set of values, you will need to organize all the values in numerical order and identify the value at the center of the sample. For example, if you have 101 values, then the 51st value would be the median; with an even number of values, such as 100, the median is the average of the two middle values (the 50th and the 51st). In our case above, the median is found as follows:

First, let’s arrange the scores in ascending order:

21, 35, 46, 46, 50, 63, 67, 77, 88, 92

Here positions 5 and 6 are in the middle, so to get the median we add those two values and divide by 2: (50 + 63) / 2 = 56.5.

What is the Mode Statistic?

In a set of values, the mode is the most frequently occurring value. To find it, arrange the numbers in ascending order and then count each of them to identify the one that occurs most often.

For our example above, the most frequently occurring number is 46. It occurs two times in the set of values. A set of values can also have two modal values. This happens in a bi-modal distribution, where two values occur with the same highest frequency.

Note that, in the same set of values, we have obtained different values for the three measures of central tendency: a mean of 58.5, a median of 56.5 and a mode of 46.

For a normal distribution, the mean, median and the mode are usually equal.
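The three measures of central tendency for the ten test scores can be checked with Python's standard `statistics` module. This is only a convenience sketch; the variable names are illustrative.

```python
import statistics

# The ten test scores used in the example above.
scores = [50, 67, 35, 46, 21, 77, 92, 46, 88, 63]

mean_score = statistics.mean(scores)      # sum of the scores divided by their count
median_score = statistics.median(scores)  # average of the two middle values here
mode_score = statistics.mode(scores)      # most frequently occurring value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

For data with more than one most-frequent value, `statistics.multimode` returns all of them as a list.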


Dispersion is a term used to describe how values are spread around the central tendency. There are two common measures of dispersion: the range and the standard deviation.


The range is just the maximum value minus the minimum value.

For our example above, the highest value is 92 and the lowest is 21. So the range is 92 - 21 = 71.

Standard Deviation

The standard deviation is a more detailed and more informative description of dispersion than the range, because it shows how every value in the set relates to the mean.

21, 35, 46, 46, 50, 63, 67, 77, 88, 92

To compute the standard deviation, we first need to find the differences between the values and the mean (58.5):

-37.5, -23.5, -12.5, -12.5, -8.5, 4.5, 8.5, 18.5, 29.5, 33.5

Note that all the values above the mean have positive discrepancies while the values below the mean have negative ones.

The next step is to square all the discrepancies:

1406.25, 552.25, 156.25, 156.25, 72.25, 20.25, 72.25, 342.25, 870.25, 1122.25

Now we need to determine the variance:

We get this by finding the sum of the squares of the discrepancies (the sum of squares, here 4770.5) and dividing it by (n-1).

Variance = Sum of Squares / (n-1) = 4770.5 / 9 ≈ 530.06

Our standard deviation is now the square root of the variance:

Standard deviation = √530.06 ≈ 23.02

This computation seems complicated but it is actually very simple. We can capture it in the formula below.

s = sqrt( Σ (x - mean)² / (n - 1) )

The standard deviation can be described as the square root of the sum of the squared deviations from the mean divided by the number of values minus one.
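The same calculation can be reproduced in Python, once step by step and once with the standard `statistics` module, which also divides by n - 1 for the sample variance. The variable names are illustrative.

```python
import math
import statistics

scores = [21, 35, 46, 46, 50, 63, 67, 77, 88, 92]

# Step by step, following the description in the text.
mean_score = sum(scores) / len(scores)
deviations = [x - mean_score for x in scores]
sum_of_squares = sum(d ** 2 for d in deviations)
variance_by_hand = sum_of_squares / (len(scores) - 1)
std_dev_by_hand = math.sqrt(variance_by_hand)

# The same quantities from the standard library.
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

print("range:", value_range)
print("sum of squares:", sum_of_squares)
print(f"variance: {variance:.2f}")
print(f"standard deviation: {std_dev:.2f}")
```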

It is important to note that, although it is possible to calculate the univariate statistics manually, doing so can be very tedious, especially when dealing with many variables. There are quite a number of statistics software packages that can help. An example is SPSS.

The standard deviation is a very important descriptive statistic because it allows us to draw a number of conclusions from the values we have. If we assume that our values are normally distributed (bell-shaped), or something close to this, then we can make the following conclusions:

  1. Approximately 68% of all the values in the sample are found within one standard deviation of the mean
  2. Approximately 95% of the values in the sample are found within two standard deviations of the mean
  3. Approximately 99.7% of all the values in the sample fall within three standard deviations of the mean.
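These percentages follow directly from the normal distribution and can be checked with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(k):.4f}")
```

The figures apply only when the data are at least roughly normal; for strongly skewed data the shares can be quite different.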

This kind of information is especially valuable when we want to compare the performance of two samples on a single variable. Because distances are expressed in standard deviations, the comparison is possible even when the two samples have been measured on completely different scales.

Importance of Descriptive Statistics

Descriptive statistics are vital because they help us present data in a manner that people can easily visualize and, therefore, easily absorb.

For example, if you are presenting the performance of students in a test, then the measure of central tendency can give an indication of how the class performed.

The mode tells you the score that the largest number of students got, and the mean tells you the average performance of the class. The measures of spread, on the other hand, summarize how much the performance varies within the group. For example, the range gives the bracket of scores the students got.

In general, descriptive statistics are a great way of breaking raw data into meaningful pieces of information that can be easily understood by the people they are intended for. However, presenting the raw data can sometimes be important too, because it preserves the original information and its meaning is not distorted.


The medical journals are replete with P values and tests of hypotheses. It is common practice among medical researchers to report whether the test of hypothesis they carried out is significant or non-significant, and many researchers get very excited when they discover a “statistically significant” finding without really understanding what it means. Additionally, while medical journals are full of statements such as “statistically significant”, “unlikely due to chance”, “not significant” and “due to chance”, or notations such as “P > 0.05” and “P < 0.05”, the question of whether to declare a test of hypothesis significant or not on the basis of the P value has generated an intense debate among statisticians. It began among the founders of statistical inference more than 60 years ago [1-3]. One contributing factor is that the medical literature shows a strong tendency to accentuate positive findings; many researchers would like to report positive findings based on previously reported research, as “non-significant results should not take up” journal space [4-7].

The idea of significance testing was introduced by R.A. Fisher, but over the past six decades its utility and interpretation have been widely misunderstood, generating a great deal of scholarly writing aimed at remedying the situation [3]. Alongside the statistical test of hypothesis is the P value, whose meaning and interpretation have similarly been misused. To delve into the subject matter, a short history of the evolution of the statistical test of hypothesis is warranted to clear up some misunderstandings.

A Brief History of P Value and Significance Testing

Significance testing evolved from the ideas and practice of the eminent statistician R.A. Fisher in the 1930s. His idea is simple: suppose we found an association between poverty level and malnutrition among children under the age of five years. This is a finding, but could it be a chance finding? Or perhaps we want to evaluate whether a new nutrition therapy improves the nutritional status of malnourished children. We study a group of malnourished children treated with the new therapy and a comparable group treated with the old nutritional therapy, and find in the new therapy group an improvement of nutritional status by 2 units over the old therapy group. This finding will obviously be welcomed, but it is also possible that it is purely due to chance. Thus, Fisher saw the P value as an index measuring the strength of evidence against the null hypothesis (in our examples, the hypothesis that there is no association between poverty level and malnutrition, or that the new therapy does not improve nutritional status). To quantify the strength of evidence against the null hypothesis, “he advocated P < 0.05 (5% significance) as a standard level for concluding that there is evidence against the hypothesis tested, though not as an absolute rule” [8]. Fisher did not stop there but graded the strength of evidence against the null hypothesis. He proposed: “if P is between 0.1 and 0.9 there is certainly no reason to suspect the hypothesis tested. If it’s below 0.02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at 0.05” [9]. Since Fisher made this statement over 60 years ago, the 0.05 cut-off point has been used by medical researchers worldwide, and its use has become ritualistic, as if other cut-off points could not be used. Through the 1960s it was standard practice in many fields to report P values with one star attached to indicate P < 0.05 and two stars to indicate P < 0.01. Occasionally three stars were used to indicate P < 0.001. While Fisher developed this practice of quantifying the strength of evidence against the null hypothesis, some eminent statisticians were not comfortable with the subjective interpretation inherent in the method [7]. This led Jerzy Neyman and Egon Pearson to propose a new approach, which they called “hypothesis tests”. They argued that there were two types of error that could be made in interpreting the results of an experiment, as shown in Table 1.

Table 1. Errors associated with results of experiment.

Result of experiment | The truth: null hypothesis true | The truth: null hypothesis false
Reject null hypothesis | Type I error rate (α) | Power = 1 - β
Accept null hypothesis | Correct decision | Type II error rate (β)

The outcome of the hypothesis test is one of two: to reject one hypothesis and to accept the other. Adopting this practice exposes one to two types of errors: rejecting the null hypothesis when it should be accepted (i.e., concluding that the two therapies differ when they are actually the same, also known as a false-positive result, a type I error or an alpha error) or accepting the null hypothesis when it should have been rejected (i.e., concluding that they are the same when in fact they differ, also known as a false-negative result, a type II error or a beta error).

What does P value Mean?

The P value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed. The P stands for probability and measures how likely a difference at least as large as the one observed would be if chance alone were at work. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests no difference between the groups other than that due to chance. Thus, it is common in medical journals to see adjectives such as “highly significant” or “very significant” after quoting the P value, depending on how close to zero the value is.

Before the advent of computers and statistical software, researchers depended on tabulated values of P to make decisions. This practice is now obsolete and the use of the exact P value is much preferred. Statistical software can give the exact P value and allows appreciation of the range of values that P can take between 0 and 1. Briefly, for example, the weights of 18 subjects were taken from a community to determine if their body weight is ideal (i.e. 100 kg). Using Student’s t test, t turned out to be 3.76 at 17 degrees of freedom. Comparing this t statistic with the tabulated values, t = 3.76 is more than the critical value of 2.11 at P = 0.05 and therefore falls in the rejection zone. Thus we reject the null hypothesis that μ = 100 and conclude that the difference is significant. But using SPSS (a statistical software package), the following information came out when the data were entered: t = 3.758, P = 0.0016, mean difference = 12.78, and confidence limits of 5.60 and 19.95. Methodologists are now increasingly recommending that researchers report the precise P value, for example, P = 0.023 rather than P < 0.05 [10]. Further, to use P = 0.05 “is an anachronism. It was settled on when P values were hard to compute and so some specific values needed to be provided in tables. Now calculating exact P values is easy (i.e., the computer does it) and so the investigator can report (P = 0.04) and leave it to the reader to (determine its significance)” [11].
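The reported figures hang together, which can be checked with a few lines of Python. This sketch uses only the numbers quoted above (mean difference, t statistic and the tabulated critical value of 2.11 for 17 degrees of freedom); it assumes the interval is the usual 95% confidence interval for a one-sample t test, and the variable names are illustrative.

```python
# Values quoted in the text for the one-sample t test on 18 subjects.
mean_difference = 12.78
t_statistic = 3.758
t_critical = 2.11  # tabulated two-sided 5% critical value at 17 degrees of freedom

# t = mean difference / standard error, so the standard error can be recovered.
standard_error = mean_difference / t_statistic

# 95% confidence interval: mean difference +/- critical value * standard error.
lower = mean_difference - t_critical * standard_error
upper = mean_difference + t_critical * standard_error

print(f"standard error: {standard_error:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

The recovered limits agree with the reported 5.60 and 19.95 to within rounding. Because the interval excludes zero, it carries the same reject-or-not information as the test while also showing the size of the difference.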

Hypothesis Tests

A statistical test provides a mechanism for making quantitative decisions about a process or processes. The purpose is to make inferences about a population parameter by analyzing differences between the observed sample statistic and the results one expects to obtain if some underlying assumption is true. This comparison may be a single observed value versus some hypothesized quantity, or it may be between two or more related or unrelated groups. The choice of statistical test depends on the nature of the data and the study design.

Neyman and Pearson proposed this process to circumvent Fisher’s subjective practice of assessing the strength of evidence against the null effect. In its usual form, two hypotheses are put forward: a null hypothesis (usually a statement of null effect) and an alternative hypothesis (usually the opposite of the null hypothesis). Based on the outcome of the hypothesis test, one hypothesis is rejected and the other accepted, according to a predetermined, arbitrary benchmark. This benchmark is the significance level, against which the P value is compared. However, one risks making an error: one may reject a hypothesis when in fact it should be accepted, and vice versa. There is the type I error or α error (concluding that there is a difference when really there is none) and the type II error or β error (concluding that there is no difference when really there is one). In its simple format, testing a hypothesis involves the following steps:

  1. Identify the null and alternative hypotheses.
  2. Determine the appropriate test statistic and its distribution under the assumption that the null hypothesis is true.
  3. Specify the significance level and determine the corresponding critical value of the test statistic under the assumption that the null hypothesis is true.
  4. Calculate the test statistic from the data.

Having discussed the P value and hypothesis testing, the fallacies of hypothesis testing and of the P value are now looked into.
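The four steps listed above can be sketched with an exact binomial test, chosen here only because its null distribution can be computed with the Python standard library. The scenario (15 of 20 patients improving, tested against a null proportion of 0.5) is hypothetical, and the function names are made up. The sketch compares the exact P value with the significance level, which leads to the same decision as comparing the test statistic with its critical value.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def exact_two_sided_p(k, n, p0=0.5):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = binomial_pmf(k, n, p0)
    return sum(
        binomial_pmf(i, n, p0)
        for i in range(n + 1)
        if binomial_pmf(i, n, p0) <= observed * (1 + 1e-9)
    )


# Step 1: H0: proportion improving = 0.5; H1: proportion improving != 0.5.
# Step 2: test statistic = number improving, binomial(n, 0.5) under H0.
# Step 3: significance level fixed before looking at the data.
alpha = 0.05
# Step 4: calculate the test statistic (hypothetical data) and its P value.
n_patients, n_improved = 20, 15
p_value = exact_two_sided_p(n_improved, n_patients)

decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"P = {p_value:.4f}: {decision}")
```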

Fallacies of Hypothesis Testing

In a paper I submitted for publication in one of the widely read medical journals in Nigeria, one of the reviewers commented on the age-sex distribution of the participants: “Is there any difference in sex distribution, subject to chi square statistics?” Statistically, this question does not convey any query, and this is one of many instances among medical researchers (postgraduate supervisors alike) in which a test of hypothesis is quickly and spontaneously resorted to without due consideration of its appropriate application. The aim of my research was to determine the prevalence of diabetes mellitus in a rural community; it was not part of my objectives to determine any association between sex and prevalence of diabetes mellitus. To the inexperienced, this comment will definitely prompt conducting a test of hypothesis simply to satisfy the editor and reviewer so that the article will sail through. However, the results of such statistical tests become difficult to understand and interpret in the light of the data. (The study found that all those with elevated fasting blood glucose were females.) There are several fallacies associated with hypothesis testing. Below is a small list that will help avoid these fallacies.

Failure to reject null hypothesis leads to its acceptance. (No. When you fail to reject the null hypothesis, it means there is insufficient evidence to reject it.)

The use of α = 0.05 is a standard with an objective basis. (No. α = 0.05 is merely a convention that evolved from the practice of R.A. Fisher. There is no sharp distinction between “significant” and “not significant” results, only increasingly strong evidence against the null hypothesis as P becomes smaller. P = 0.02 is stronger evidence than P = 0.04.)

Small P value indicates large effects. (No. The P value does not tell anything about the size of an effect.)

Statistical significance implies clinical importance. (No. Statistical significance says very little about the clinical importance of a relation. There is a big gulf between statistical significance and clinical significance. By statistical definition, at α = 0.05, 1 in 20 comparisons in which the null hypothesis is true will result in P < 0.05!) Finally, with these and many other fallacies of hypothesis testing, it is rather sad to read in journals how significance testing has become an insignificance testing.

Fallacies of P Value

Just as the test of hypothesis is associated with some fallacies, so also is the P value, with common root causes: “It comes to be seen as natural that any finding worth its salt should have a P value less than 0.05 flashing like a divinely appointed stamp of approval” 12. The inherent subjectivity of Fisher’s P value approach and the subsequent poor understanding of this approach by the medical community could be the reason why the P value is associated with a myriad of fallacies. Thirdly, P values produced by researchers as mere “passports to publication” aggravated the situation 13. We were earlier on awakened to the inadequacy of the P value in clinical trials by Feinstein 14,

“The method of making statistical decisions about ‘significance’ creates one of the most devastating ironies in modern biologic science. To avoid usual categorical data, a critical investigator will usually go to enormous efforts in mensuration. He will get special machines and elaborate technologic devices to supplement his old categorical statement with new measurements of ‘continuous’ dimensional data. After all this work in getting ‘continuous’ data, however, and after calculating all the statistical tests of the data, the investigator then makes the final decision about his results on the basis of a completely arbitrary pair of dichotomous categories. These categories, which are called ‘significant’ and ‘nonsignificant’, are usually demarcated by a P value of either 0.05 or 0.01, chosen according to the capricious dictates of the statistician, the editor, the reviewer or the granting agency. If the level demanded for ‘significant’ is 0.05 or lower and the P value that emerge is 0.06, the investigator may be ready to discard a well-designed, excellently conducted, thoughtfully analyzed, and scientifically important experiment because it failed to cross the Procrustean boundary demanded for statistical approbation.”

We should try to understand that Fisher wanted an index of measurement that would help him decide the strength of evidence against the null effect. But, as has been said earlier, his idea was poorly understood and criticized, and this led Neyman and Pearson to develop hypothesis testing in order to get round the problem. But this is the result of their attempt: “accept” or “reject” the null hypothesis, or alternatively “significant” or “non significant”. The inadequacy of the P value in decision making pervades all epidemiological study designs. This head-or-tail approach to the test of hypothesis has pushed the stakeholders in the field (statistician, editor, reviewer or granting agency) into ever-increasing confusion and difficulty. The inadequacy of the P value as a sole standard of judgment in the analysis of clinical trials is an accepted fact among statisticians 15. Just as hypothesis testing is not devoid of caveats, so also P values. Some of these are exposed below.

The threshold value, P < 0.05, is arbitrary. As has been said earlier, it was the practice of Fisher to assign P the value of 0.05 as a measure of evidence against the null effect. One can make the “significance test” more stringent by moving to 0.01 (1%) or less stringent by moving the borderline to 0.10 (10%). By dichotomizing P values into “significant” and “non significant”, one loses information in the same way as when demarcating laboratory findings into “normal” and “abnormal”; one may ask, what is the difference between a fasting blood glucose of 25 mmol/L and 15 mmol/L?

Statistically significant (P < 0.05) findings are assumed to result from real treatment effects, ignoring the fact that 1 in 20 comparisons of effects in which the null hypothesis is true will result in a significant finding (P < 0.05). This problem is more serious when several tests of hypothesis involving several variables are carried out without using the appropriate statistical test (e.g., repeated t-tests where ANOVA is called for).

A statistically significant result does not translate into clinical importance. A large study can detect a small, clinically unimportant finding.

Chance is rarely the most important issue. Remember that when conducting research, a questionnaire is usually administered to participants. This questionnaire in most instances collects a large amount of information on several variables. The manner in which the questions were asked and the manner in which they were answered are important sources of error (systematic error) which are difficult to measure.
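To make the "1 in 20" point concrete, the chance of at least one spurious P < 0.05 can be computed directly. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` is introduced here for illustration only.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one P < alpha among n independent tests
    when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error_rate(k):.3f}")
```

Under these assumptions the risk of at least one false "significant" finding grows quickly with the number of comparisons, which is why unplanned repeated testing is hazardous.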

What Influences P Value?

Generally, these factors influence P value.

Effect size. It is a usual research objective to detect a difference between two drugs, procedures or programmes. Several statistics are employed to measure the magnitude of the effect produced by these interventions. They include r², η², ω², R², Q², Cohen’s d, and Hedges’ g. Two problems are encountered: the choice of an appropriate index for measuring the effect, and the size of the effect. A 7-kg or 10 mmHg difference will have a lower P value (and is more likely to be significant) than a 2-kg or 4 mmHg difference.

Size of sample. The larger the sample, the more likely a difference is to be detected. Further, a 7 kg difference in a study with 500 participants in each group will give a lower P value than a 7 kg difference observed in a study involving 250 participants in each group.

Spread of the data. The spread of observations in a data set is commonly measured with the standard deviation. The bigger the standard deviation, the greater the spread of observations and, for the same difference and sample size, the higher the P value.
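The three influences can be seen side by side in a small sketch. It assumes a two-sided test on a difference in means with a normal approximation, equal group sizes and a common standard deviation; the 2 kg and 7 kg differences echo the text, while the standard deviations and the helper `two_sample_p` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided P value for a difference in means
    (normal approximation, equal SDs and equal group sizes)."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


baseline = two_sample_p(diff=2, sd=15, n_per_group=250)
bigger_effect = two_sample_p(diff=7, sd=15, n_per_group=250)
bigger_sample = two_sample_p(diff=2, sd=15, n_per_group=500)
bigger_spread = two_sample_p(diff=2, sd=30, n_per_group=250)

print(f"baseline       P = {baseline:.3f}")
print(f"larger effect  P = {bigger_effect:.2e}")
print(f"larger sample  P = {bigger_sample:.3f}")
print(f"larger spread  P = {bigger_spread:.3f}")
```

In this sketch a larger effect or a larger sample lowers the P value, while a larger standard deviation raises it.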

P Value and Statistical Significance: An Uncommon Ground

Neither the Fisherian nor the Neyman-Pearson (N-P) school upheld the practice of stating, “P values of less than 0.05 were regarded as statistically significant” or “P-value was 0.02 and therefore there was statistically significant difference.” These and many similar statements have criss-crossed medical journals and standard textbooks of statistics and provided an uncommon ground for marrying the two schools. This marriage of inconvenience further deepened the confusion and misunderstanding of the Fisherian and Neyman-Pearson schools. The combination of Fisherian and N-P thoughts (as exemplified in the above statements) did not shed light on the correct interpretation of the statistical test of hypothesis and the P value. The hybrid of the two schools, as often read in medical journals and textbooks of statistics, makes it seem as if the two schools were and are compatible as a single coherent method of statistical inference 4, 23, 24. This confusion, perpetuated by medical journals, textbooks of statistics, reviewers and editors, has made it almost impossible for a research report to be published without statements or notations such as “statistically significant” or “statistically insignificant” or “P<.05” or “P>.05”. Sterne then asked, “Can we get rid of P-values?” His answer was “practical experience says no-why?” 21

However, the next section, “P value and confidence interval: a common ground”, provides one of the possible ways out of the seemingly insoluble problem. Goodman commented on the P value and confidence interval approach in statistical inference and its ability to solve the problem: “The few efforts to eliminate P values from journals in favor of confidence intervals have not generally been successful, indicating that the researchers’ need for a measure of evidence remains strong and that they often feel lost without one” 6.

P Value and Confidence Interval: A Common Ground

Thus far, this paper has examined the historical evolution of ‘significance’ testing as initially proposed by R.A. Fisher. Neyman and Pearson were not accustomed to his subjective approach and therefore proposed ‘hypothesis testing’ involving binary outcomes: “accept” or “reject” the null hypothesis. This, as we saw, did not “solve” the problem completely. Thus, a common ground was needed, and the combination of P value and confidence intervals provided it.

Before proceeding, we should briefly understand what confidence intervals (CIs) mean, having gone through what P values and hypothesis testing mean. Suppose that we have two diets, A and B, given to two groups of malnourished children. An 8-kg increase in body weight was observed among children on diet A, while a 3-kg increase in body weight was observed among those on diet B. The effect on weight increase is therefore 5 kg on average. But it is obvious that the increase might be less than 3 kg and also more than 8 kg; thus a range can be presented, together with the chance associated with this range, as a confidence interval. A 95% confidence interval in this example means that if the study were repeated 100 times, about 95 of the 100 intervals computed would contain the true increase in weight. Formally, 95% CI: “the interval computed from the sample data which when the study is repeated multiple times would contain the true effect 95% of the time.”
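The "repeated many times" reading of a 95% CI can be checked by simulation. This is a minimal sketch under stated assumptions: the true effect of 5 kg echoes the diet example, while the standard deviation, the number of children per study and the normal-approximation interval are hypothetical choices made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)           # fixed seed so the run is repeatable
TRUE_EFFECT = 5.0        # true mean difference in weight gain (kg)
SD = 4.0                 # hypothetical spread of individual results
N = 50                   # hypothetical number of children per study
Z_95 = NormalDist().inv_cdf(0.975)


def confidence_interval(sample):
    """95% CI for the mean, using the normal approximation."""
    half_width = Z_95 * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - half_width, centre + half_width


n_studies = 2000
covered = 0
for _ in range(n_studies):
    sample = [random.gauss(TRUE_EFFECT, SD) for _ in range(N)]
    low, high = confidence_interval(sample)
    covered += low <= TRUE_EFFECT <= high   # True counts as 1

coverage = covered / n_studies
print(f"{coverage:.1%} of the simulated intervals contain the true effect")
```

The proportion should land close to 95%; any single interval either contains the true effect or does not.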

In the 1980s, a number of British statisticians tried to promote the use of this common ground approach in presenting statistical analysis 16, 17, 18. They encouraged the combined presentation of P values and confidence intervals. The use of confidence intervals in addressing hypothesis testing is one of the four popular methods, and journal editors and eminent statisticians have issued statements supporting its use 19. In line with this, the American Psychological Association’s Board of Scientific Affairs commissioned a white paper, “Task Force on Statistical Inference”. The Task Force suggested,

“When reporting inferential statistics (e.g. t - tests, F - tests, and chi-square) include information about the obtained ….. value of the test statistic, the degree of freedom, the probability of obtaining a value as extreme as or more extreme than the one obtained [i.e., the P value]…. Be sure to include sufficient descriptive statistics [e.g. per-cell sample size, means, correlations, standard deviations]…. The reporting of confidence intervals [for estimates of parameters, for functions of parameter such as differences in means, and for effect sizes] can be an extremely effective way of reporting results… because confidence intervals combine information on location and precision and can often be directly used to infer significance levels” 20 .

Jonathan Sterne and Davey Smith came up with their suggested guidelines for reporting statistical analysis as shown in the box 21 :

Box 1: Suggested guidelines for the reporting of results of statistical analyses in medical journals.

The description of differences as statistically significant is not acceptable.

Confidence intervals for the main results should always be included, but 90% rather than 95% levels should be used. Confidence intervals should not be used as a surrogate means of examining significance at the conventional 5% level. Interpretation of confidence intervals should focus on the implication (clinical importance) of the range of values in the interval.

When there is a meaningful null hypothesis, the strength of evidence against it should be indexed by the P value. The smaller the P value, the stronger is the evidence.

While it is impossible to reduce substantially the amount of data dredging that is carried out, authors should take a very skeptical view of subgroup analyses in clinical trials and observational studies. The strength of the evidence for interaction-that effects really differ between subgroups – should always be presented. Claims made on the basis of subgroup findings should be even more tempered than claims made about main effects.

In observational studies it should be remembered that considerations of confounding and bias are at least as important as the issues discussed in this paper.

Since the 1980s, when British statisticians championed the use of confidence intervals, journal after journal has issued statements regarding their use. An editorial in Clinical Chemistry read as follows,

“There is no question that a confidence interval for the difference between two true (i.e., population) means or proportions, based on the observed difference between sample estimate, provides more useful information than a P value, no matter how exact, for the probability that the true difference is zero. The confidence interval reflects the precision of the sample values in terms of their standard deviation and the sample size …..” 22

On a final note, it is important to know why it is statistically superior to use P value and confidence intervals rather than P value and hypothesis testing:

Confidence intervals emphasize the importance of estimation over hypothesis testing. It is more informative to quote the magnitude of the effect than to adopt significant/nonsignificant hypothesis testing.

The width of the CIs provides a measure of the reliability or precision of the estimate.

Confidence intervals make it far easier to determine whether a finding has any substantive (e.g. clinical) importance, as opposed to statistical significance.

While statistical significance tests are vulnerable to type I error, CIs are not.

Confidence intervals can be used as a significance test. The simple rule is that if the 95% CI does not include the null value (usually zero for differences in means and proportions, and one for relative risk and odds ratio), the null hypothesis is rejected at the 0.05 level.

Finally, the use of CIs promotes cumulative knowledge development by obligating researchers to think meta-analytically about estimation, replication and comparing intervals across studies 25. For example, a meta-analysis of trials dealing with intravenous nitrates in acute myocardial infarction found a reduction in mortality of somewhere between one quarter and two-thirds. Meanwhile, the previous six trials 26 showed conflicting results: some trials revealed that it was dangerous to give intravenous nitrates while others revealed that it actually reduced mortality. For the six trials, the odds ratios, 95% CIs and P values are: OR = 0.33 (CI = 0.09, 1.13, P = 0.08); OR = 0.24 (CI = 0.08, 0.74, P = 0.01); OR = 0.83 (CI = 0.33, 2.12, P = 0.07); OR = 2.04 (CI = 0.39, 10.71, P = 0.04); OR = 0.58 (CI = 0.19, 1.65, P = 0.29); and OR = 0.48 (CI = 0.28, 0.82, P = 0.007). The first, third, fourth and fifth studies appear harmful while the second and the sixth appear useful (in reducing mortality).
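The rule of checking whether a 95% CI includes the null value can be applied mechanically to the six intervals quoted above. The sketch below uses only the odds ratios and limits as printed; the helper name `excludes_null` is introduced here for illustration.

```python
# (odds ratio, CI lower limit, CI upper limit) for the six trials, as quoted
trials = [
    (0.33, 0.09, 1.13),
    (0.24, 0.08, 0.74),
    (0.83, 0.33, 2.12),
    (2.04, 0.39, 10.71),
    (0.58, 0.19, 1.65),
    (0.48, 0.28, 0.82),
]
NULL_VALUE = 1.0   # an odds ratio of 1 means no effect


def excludes_null(low: float, high: float, null: float = NULL_VALUE) -> bool:
    """True when the interval lies entirely on one side of the null value."""
    return not (low <= null <= high)


verdicts = [excludes_null(low, high) for _, low, high in trials]
for number, ((odds_ratio, low, high), clear) in enumerate(zip(trials, verdicts), 1):
    label = "excludes 1" if clear else "includes 1"
    print(f"trial {number}: OR = {odds_ratio} ({low}, {high}) -> {label}")
```

Only the second and sixth intervals lie wholly below 1, which matches the two trials the text singles out as showing a reduction in mortality.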

What is to be done?

While it is possible to make a change and improve on the practice, as Cohen warns, “Don’t look for a magic alternative … It does not exist” 27.

The foundation for change in this practice should be laid where the teaching of statistics begins: the classroom. The curriculum and classroom teaching should clearly differentiate between the two schools. The historical evolution should be clearly explained, and so should the meaning of “statistical significance”. Classroom teaching of the correct concepts should begin at undergraduate level and move up to graduate instruction, even if it means this teaching would be at an introductory level.

We should promote and encourage the use of confidence intervals around sample statistics and effect sizes. This duty lies in the hands of statistics teachers, medical journal editors, reviewers and any granting agency.

Generally, researchers preparing a study are encouraged to consult a statistician at the initial stage of their study to avoid misinterpreting the P value, especially if they are using statistical software for their data analysis.

Other variability measures

Standard deviation describes the typical distance of each data point from the mean of the data set. It’s calculated by subtracting the mean from each value, squaring the differences, adding them up, dividing by one less than the number of values, and taking the square root of the result. For example, in a data set of five systolic blood pressures of 125, 128, 142, 145, and 150, the mean would be 138, based on this calculation: (125+128+142+145+150)/5. The standard deviation would be 10.9, based on this calculation: √(((125-138)² + (128-138)² + (142-138)² + (145-138)² + (150-138)²)/(5-1)), indicating that there’s not a large dispersion in this set of systolic measures. (Don’t worry about the complexity of the formula; you can enter the data points in a free standard deviation tool that does the calculation for you. The formula is here to illustrate the point.)

The variance also describes the variation of data points from the mean, but it’s affected by outliers. If the standard deviation and variance are large, the spread of data points in the data set also is large; however, if the standard deviation and variance are small, most data points are close to the mean. Whether standard deviation and variance are determined to be small or large depends on the range of data. For example, in data with a range of 5, a standard deviation of 4 would be large; however, in data with a range of 10,000, a standard deviation of 4 would be small.

Quartiles are three points, q1 (lower), q2 (median), and q3 (upper), that divide an ordered list of numbers into four equal parts. When using quartiles, you can identify the interquartile range (q3-q1), which describes the middle half of the data set.
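The blood pressure calculation can be reproduced in Python, both step by step and with the standard library's `statistics.stdev`, which uses the same n − 1 divisor. The variable names are illustrative only.

```python
from math import sqrt
from statistics import mean, stdev

pressures = [125, 128, 142, 145, 150]

m = mean(pressures)                                   # 690 / 5
squared_deviations = [(x - m) ** 2 for x in pressures]
sd_by_hand = sqrt(sum(squared_deviations) / (len(pressures) - 1))
sd_library = stdev(pressures)                         # sample SD, n - 1 divisor

print(f"mean = {m}, SD by hand = {sd_by_hand:.1f}, SD via stdev = {sd_library:.1f}")
```

Both routes give the same value, so the library call can stand in for the hand calculation.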
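As a rough sketch, quartiles and the interquartile range can be obtained with `statistics.quantiles`. The blood pressure values are reused from the example above purely for illustration; note that several conventions exist for interpolating quartiles, so other tools may report slightly different q1 and q3 for the same data.

```python
from statistics import median, quantiles

pressures = [125, 128, 142, 145, 150]

# n=4 asks for the three cut points that split the data into four parts.
# The default interpolation method is "exclusive"; "inclusive" is the other option.
q1, q2, q3 = quantiles(pressures, n=4)
iqr = q3 - q1

print(f"q1 = {q1}, q2 (median) = {q2}, q3 = {q3}, IQR = {iqr}")
```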

Descriptive Statistics

Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Descriptive statistics are very important because if we simply presented our raw data it would be hard to visualize what the data was showing, especially if there was a lot of it. Descriptive statistics therefore enable us to present the data in a more meaningful way, which allows simpler interpretation of the data. For example, if we had the results of 100 pieces of students' coursework, we may be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this. How to properly describe data through statistics and graphs is an important topic and is discussed in other Laerd Statistics guides. Typically, there are two general types of statistic that are used to describe data:

  • Measures of central tendency: these are ways of describing the central position of a frequency distribution for a group of data. In this case, the frequency distribution is simply the distribution and pattern of marks scored by the 100 students from the lowest to the highest. We can describe this central position using a number of statistics, including the mode, median, and mean. You can learn more in our guide: Measures of Central Tendency.
  • Measures of spread: these are ways of summarizing a group of data by describing how spread out the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower and others higher. Measures of spread help us to summarize how spread out these scores are. To describe this spread, a number of statistics are available to us, including the range, quartiles, absolute deviation, variance and standard deviation.

When we use descriptive statistics it is useful to summarize our group of data using a combination of tabulated description (i.e., tables), graphical description (i.e., graphs and charts) and statistical commentary (i.e., a discussion of the results).

The 2 Main Types of Descriptive Statistics (with Examples)

Descriptive statistics has 2 main types:

  • Measures of Central Tendency (Mean, Median, and Mode).
  • Measures of Dispersion or Variation (Variance, Standard Deviation, Range).

1. Central Tendency

Central tendency (also called measures of location or central location) is a method to describe what’s typical for a group (set) of data.

This means central tendency doesn’t tell us about any single piece of data; instead, it gives us an overview of the entire data set.

It tells us what is normal or average for a given set of data. There are three key methods to show central tendency: mean, mode, and median.

As the name suggests, mean is the average of a given set of numbers. The mean is calculated in two very easy steps:

1. Find the sum by adding all the data together
2. Divide the sum by the total number of data

The below is one of the most common descriptive statistics examples.

Let’s say you have a sample of 5 girls and 6 boys.

The girls’ heights in inches are: 62, 70, 60, 63, 66.

To calculate the mean height for the group of girls you need to add the data together: 62 + 70 + 60 + 63 + 66 = 321.

Now, you take the sum (321) and divide it by the total number of girls (5): 321 / 5 = 64.2.

The best advantage of the mean is that it can be used for both continuous and discrete numerical data (see our post about continuous vs discrete data).

Of course, the mean has limitations. Data must be numerical in order to calculate the mean. You cannot work with the mean when you have nominal data (see our post about nominal vs ordinal data).

The mode of a set of data is the number in the set that occurs most often.

Let’s see the next of our descriptive statistics examples, problems and solutions.

Consider you have a dataset with the retirement age of 10 people, in whole years:

55, 55, 55, 56, 56, 57, 58, 58, 59, 60

To illustrate this let’s see table below that shows the frequency of the retirement age data.

Retirement Age    Frequency
55                3
56                2
57                1
58                2
59                1
60                1

As you see, the most common value is 55. That is why the mode of this data set is 55 years.

The mode has one very important advantage over the median and the mean. It can be calculated for both numerical and categorical data (see our post about categorical data examples).

Limitations of the mode: In some data sets, the mode may not reflect the centre of the set. In the above example, if we order the retirement ages from lowest to highest, we would see that the centre of the data set is around 57 years, but the mode is lower, at 55 years.

Simply said, the median is the middle value in a data set. As you might guess, in order to find the middle, you need to:

– first, list the data in numerical order
– second, locate the value in the middle of the list.

The middle number in the below set is 26 as there are 4 numbers above it and 4 numbers below:

21, 22, 24, 24, 26, 27, 28, 29, 31.

But this was an odd set of data – you have 9 numbers. How to find the middle if you have an even set of data?

Easily – you just need to find the average of the two middle numbers.

For example, in the below dataset of 10 numbers, the median is the average of the two middle numbers: (26 + 27) / 2 = 26.5.

21, 22, 24, 24, 26, 27, 28, 29, 31, 32

As an advantage of the median, we can say that it is less affected by outliers and skewed data than the mean. We usually prefer the median when the data set is not symmetrical.

And to point out the limitation, we should say that because nominal data cannot be ordered in a logical way, the median cannot be calculated for it.
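The three measures can be checked against the examples above with Python's `statistics` module. The height, retirement age and median data come from the text; the `eye_colours` list is a hypothetical addition to show that the mode also works for categorical data.

```python
from statistics import mean, median, mode

girls_heights = [62, 70, 60, 63, 66]
retirement_ages = [55, 55, 55, 56, 56, 57, 58, 58, 59, 60]
odd_set = [21, 22, 24, 24, 26, 27, 28, 29, 31]
even_set = odd_set + [32]
eye_colours = ["brown", "blue", "brown", "green"]   # hypothetical nominal data

print("mean height:", mean(girls_heights))
print("mode of retirement ages:", mode(retirement_ages))
print("median of 9 values:", median(odd_set))
print("median of 10 values:", median(even_set))
print("mode of eye colours:", mode(eye_colours))
```

Calling `mean` or `median` on `eye_colours` would not make sense, which mirrors the limitations described for nominal data.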

Having trouble remembering the difference between the mode, mean, and median? Here are some hints:

  • The word MOde is very like MOst (the most frequent number)
  • “Mean” requires you do some arithmetic (adding all the numbers together and dividing).
  • “Median” practically means “Middle” and has the same number of letters.

Having trouble deciding which measure to use when you have nominal, ordinal or interval data? The above table can help.

2. Dispersion

Central tendency tells us important information but it doesn’t show everything we want to know about average values. Central tendency fails to reveal the extent to which the values of the individual items differ in a data set.

Measures of dispersion do a lot more – they complement the averages and allow us to interpret them much better.

Dispersion in statistics describes the spread of the data values in a given dataset. In other words, it shows how the data is “dispersed” around the mean (the central value).

Imagine you have to compare the performance of 2 groups of students on the final math exam. You find that the average math test results are identical for both groups.

Does that mean the students in the two groups are performing equally? NO! Let’s see why.

Group of students A: 56, 58, 60, 62, 64
Group of students B: 40, 50, 60, 70, 80

Both of these groups have mean scores of 60.

However, in group A the individual scores are concentrated around the center – 60. All students in A have a very similar performance. There is consistency.

On the other hand, in group B the mean is also 60 but the individual scores are not even close to the center. One score is quite small – 40 and one score is very large – 80.

We can conclude that there is greater dispersion in group B.

The study of dispersion has a key role in statistical data. If in a given country there are very poor people and very rich people, we say there is serious economic disparity. Dispersion is also very useful when we want to find the relation between sets of data.

There are two popular measures of dispersion: standard deviation and range.

Let’s see some more descriptive statistics examples and definitions for dispersion measures.

The range is simply the difference between the largest and smallest value in a data set. It shows how widely the values are spread overall.

You might guess that a low range tells us that the data points are close together, and so close to the mean. A high range shows the opposite.

Here is the formula for calculating the range:

Range = max. value – min. value

Let’s see the next of our descriptive statistics examples.

If we use the math results from Example 6:

Group of students A: 56, 58, 60, 62, 64
Group of students B: 40, 50, 60, 70, 80

we can easily calculate the range:

Range of group A = 64 – 56 = 8
Range of group B = 80 – 40 = 40

You see that the data values in Group A are much closer to the mean than the ones in Group B.

A serious disadvantage of the Range is that it only provides information about the minimum and maximum of the data set. It tells nothing about the values in between.
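A short sketch of the range calculation for the two groups. The third list, `group_c`, is hypothetical and is included only to show that two data sets with very different in-between values can share the same range.

```python
group_a = [56, 58, 60, 62, 64]
group_b = [40, 50, 60, 70, 80]
group_c = [40, 60, 60, 60, 80]   # hypothetical: same extremes as group B


def value_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)


for name, scores in (("A", group_a), ("B", group_b), ("C", group_c)):
    print(f"group {name}: range = {value_range(scores)}")
```

Groups B and C get the same range even though most scores in C sit exactly at the mean, which is the disadvantage described above.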

Standard deviation also provides information on how much variation from the mean exists. However, the standard deviation goes further than Range and shows how each value in a dataset varies from the mean.

As in the Range, a low standard deviation tells us that the data points are very close to the mean. And a high standard deviation shows the opposite.

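The standard deviation formula for a sample of a population is:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

where \(x_i\) are the individual values, \(\bar{x}\) is the sample mean and \(n\) is the number of values.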

If we use the math results in Example 6:

Group of students A: 56, 58, 60, 62, 64

The mean is 60.

Let’s find the standard deviation of the math exam scores by hand. We use simple values for the purposes of easy calculations.

Now, let’s replace the values in the formula:

\[ s = \sqrt{\frac{(56-60)^2 + (58-60)^2 + (60-60)^2 + (62-60)^2 + (64-60)^2}{5 - 1}} = \sqrt{\frac{40}{4}} = \sqrt{10} \approx 3.16 \]

The result above shows that the math exam scores in group A typically lie about 3.16 points away from the mean of 60.

Of course, you can calculate the above values with a calculator instead of by hand.

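Note: The above formula is for a sample of a population. The standard deviation of an entire population is represented by the Greek lowercase letter sigma and looks like this:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

where \(\mu\) is the population mean and \(N\) is the size of the population.

You can see more examples of standard deviation on the Explorable site.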
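The difference between the sample and population versions can be seen with `statistics.stdev` (divides by n − 1) and `statistics.pstdev` (divides by N). This is a minimal sketch using the two groups of exam scores from the text.

```python
from statistics import pstdev, stdev

group_a = [56, 58, 60, 62, 64]
group_b = [40, 50, 60, 70, 80]

sample_sd_a = stdev(group_a)        # sqrt(40 / 4)
population_sd_a = pstdev(group_a)   # sqrt(40 / 5)
sample_sd_b = stdev(group_b)        # sqrt(1000 / 4)

print(f"group A: sample SD = {sample_sd_a:.2f}, population SD = {population_sd_a:.2f}")
print(f"group B: sample SD = {sample_sd_b:.2f}")
```

Group B's standard deviation is several times larger than group A's, even though both groups share a mean of 60.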

The above 8 descriptive statistics examples, problems and solutions are simple but aim to make you understand the descriptive data better.

As you saw, descriptive statistics are used just to describe some basic features of the data in a study.

They provide simple summaries about the sample and enable us to present data in a meaningful way. It allows a simpler interpretation of the data.

Together with some plain graphics analysis, they form a solid basis for almost every quantitative analysis of data.

Descriptive statistics cannot, however, be used for making conclusions beyond the data we have analyzed or making conclusions regarding any hypotheses.

5.6 Standard scores

Suppose my friend is putting together a new questionnaire intended to measure “grumpiness”. The survey has 50 questions, which you can answer in a grumpy way or not. Across a big sample (hypothetically, let’s imagine a million people or so!) the data are fairly normally distributed, with the mean grumpiness score being 17 out of 50 questions answered in a grumpy way, and the standard deviation is 5. In contrast, when I take the questionnaire, I answer 35 out of 50 questions in a grumpy way. So, how grumpy am I? One way to think about it would be to say that I have grumpiness of 35/50, so you might say that I’m 70% grumpy. But that’s a bit weird, when you think about it. If my friend had phrased her questions a bit differently, people might have answered them in a different way, so the overall distribution of answers could easily move up or down depending on the precise way in which the questions were asked. So, I’m only 70% grumpy with respect to this set of survey questions. Even if it’s a very good questionnaire, this isn’t a very informative statement.

A simpler way around this is to describe my grumpiness by comparing me to other people. Shockingly, out of my friend’s sample of 1,000,000 people, only 159 people were as grumpy as me (that’s not at all unrealistic, frankly), suggesting that I’m in the top 0.016% of people for grumpiness. This makes much more sense than trying to interpret the raw data. This idea – that we should describe my grumpiness in terms of the overall distribution of the grumpiness of humans – is the qualitative idea that standardisation attempts to get at. One way to do this is to do exactly what I just did, and describe everything in terms of percentiles. However, the problem with doing this is that “it’s lonely at the top”. Suppose that my friend had only collected a sample of 1000 people (still a pretty big sample for the purposes of testing a new questionnaire, I’d like to add), and this time gotten a mean of 16 out of 50 with a standard deviation of 5, let’s say. The problem is that almost certainly, not a single person in that sample would be as grumpy as me.

However, all is not lost. A different approach is to convert my grumpiness score into a standard score, also referred to as a \(z\)-score. The standard score is defined as the number of standard deviations above the mean that my grumpiness score lies. To phrase it in “pseudo-maths” the standard score is calculated like this:

\[ \mbox{standard score} = \frac{\mbox{raw score} - \mbox{mean}}{\mbox{standard deviation}} \]

In actual maths, the equation for the \(z\)-score is

\[ z_i = \frac{X_i - \bar{X}}{\hat\sigma} \]

So, going back to the grumpiness data, we can now transform Dan’s raw grumpiness into a standardised grumpiness score. If the mean is 17 and the standard deviation is 5 then my standardised grumpiness score would be

\[ z = \frac{35 - 17}{5} = 3.6 \]

To interpret this value, recall the rough heuristic that I provided in Section 5.2.5, in which I noted that 99.7% of values are expected to lie within 3 standard deviations of the mean. So the fact that my grumpiness corresponds to a \(z\) score of 3.6 indicates that I’m very grumpy indeed. Later on, in Section 9.5, I’ll introduce a function called pnorm() that allows us to be a bit more precise than this. Specifically, it allows us to calculate a theoretical percentile rank for my grumpiness, as follows:
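The text refers to R's pnorm() here. As a minimal sketch, assuming Python's standard-library `NormalDist` as a stand-in for that function, the same theoretical percentile rank can be computed like this:

```python
from statistics import NormalDist

z = (35 - 17) / 5                         # standardised grumpiness score
percentile_rank = NormalDist().cdf(z)     # standard normal CDF evaluated at z

print(f"z = {z}, percentile rank = {percentile_rank:.4f}")
```

`NormalDist()` with no arguments is the standard normal distribution (mean 0, standard deviation 1), so its `cdf` gives the proportion of a normally distributed population expected to score below a given \(z\).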

At this stage, this command doesn’t make too much sense, but don’t worry too much about it. It’s not important for now. But the output is fairly straightforward: it suggests that I’m grumpier than 99.98% of people. Sounds about right.

In addition to allowing you to interpret a raw score in relation to a larger population (and thereby allowing you to make sense of variables that lie on arbitrary scales), standard scores serve a second useful function. Standard scores can be compared to one another in situations where the raw scores can’t. Suppose, for instance, my friend also had another questionnaire that measured extraversion using a 24-item questionnaire. The overall mean for this measure turns out to be 13 with standard deviation 4, and I scored a 2. As you can imagine, it doesn’t make a lot of sense to try to compare my raw score of 2 on the extraversion questionnaire to my raw score of 35 on the grumpiness questionnaire. The raw scores for the two variables are “about” fundamentally different things, so this would be like comparing apples to oranges.

What about the standard scores? Well, this is a little different. If we calculate the standard scores, we get \(z = (35-17)/5 = 3.6\) for grumpiness and \(z = (2-13)/4 = -2.75\) for extraversion. These two numbers can be compared to each other. I’m much less extraverted than most people (\(z = -2.75\)) and much grumpier than most people (\(z = 3.6\)): but the extent of my unusualness is much more extreme for grumpiness (since 3.6 is a bigger number than 2.75). Because each standardised score is a statement about where an observation falls relative to its own population, it is possible to compare standardised scores across completely different variables.
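The comparison can be sketched in a few lines of Python. The helper name standard_score is an assumption made for illustration; the means, standard deviations and raw scores are those given above.

```python
def standard_score(raw, mean, sd):
    """Number of standard deviations by which `raw` lies above the mean."""
    return (raw - mean) / sd

z_scores = {
    "grumpiness": standard_score(35, 17, 5),
    "extraversion": standard_score(2, 13, 4),
}

# The sign says which side of the mean I am on; the absolute value says how unusual I am.
most_unusual = max(z_scores, key=lambda name: abs(z_scores[name]))
print(z_scores, "-> most unusual on:", most_unusual)
```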

Topic 1: Multiple regression: Revision/Introduction

Contents of this handout:

  • What is multiple regression, where does it fit in, and what is it good for?
  • The idea of a regression equation
  • From simple regression to multiple regression
  • Interpreting and reporting multiple regression results
  • Carrying out multiple regression
  • Exercises
  • Worked examples using Minitab and SPSS

These notes cover the material of the first lecture, which is designed to remind you briefly of the main ideas in multiple regression. They are not full explanations; they assume you have at least met multiple regression before. If you haven't, you will probably need to read Bryman & Cramer, pp. 177-186 and pp. 235-246. The words and phrases printed in bold type are all things which you should understand by the end of the course. Many of them you will already know; some will be explained in the course of this lecture, and others later in the course. Some of the material in these notes will not be gone through in the lecture, and you should make sure to read it over and ask us for explanations if you don't understand it.

What is multiple regression, where does it fit in, and what is it good for?

Multiple regression is the simplest of all the multivariate statistical techniques. Mathematically, multiple regression is a straightforward generalisation of simple regression, the process of fitting the best straight line through the dots on an x-y plot or scattergram. We will discuss what "best" means later in the lecture.

Regression (simple and multiple) techniques are closely related to the analysis of variance (anova). Both are special cases of the General Linear Model or GLM, and you can in fact do an anova using the regression commands in statistical packages (though the process is clumsy). You can combine the two, when what you have is an analysis of covariance (ancova), which we will discuss briefly later in this course.

What distinguishes multiple regression from other techniques? The following are the main points:

  • In multiple regression, we work with one dependent variable and many independent variables. In simple regression, there is only one independent variable; in factor analysis, cluster analysis and most other latent variable multivariate techniques, there are many dependent variables.
  • In multiple regression, the independent variables may be correlated. In analysis of variance, we arrange for all the independent variables to vary completely independently of each other.
  • In multiple regression, the independent variables can be continuous. For analysis of variance, they have to be categorical, and if they are naturally continuous, we have to force them into categories, for example by a median split.

This means that multiple regression is useful in the following general class of situations. We observe one dependent variable, whose variation we want to explain in terms of a number of other independent variables, which we can also observe. These other variables are not under experimental control - we just have to accept the variations in them that happen to occur in the sample of people or situations we can observe. We want to know which if any of these independent variables is significantly correlated with the dependent variable, taking into account the various correlations that may exist between the independent variables. So typically we use multiple regression to analyse data that come from "natural" rather than experimental situations. This makes it very useful in social psychology, and social science generally. Note, however, that it is inherently a correlational technique; it cannot of itself tell us anything about the causalities that may underlie the relationships it describes.

There are some additional rules that have to be obeyed if multiple regression is to be useful:

  • The units (usually people) we observe should be a random sample from some well defined population. This is a basic requirement for all statistical work if we want to draw any kind of general inference from the observations we have made.
  • The dependent variable should be measured on an interval, continuous scale. In practice an ordinal (ranking or rating) scale is usually good enough unless the number of levels is small. If the dependent variable is only measured on a nominal (unordered category, including dichotomies) scale, we have to use discriminant analysis or logistic regression instead. These are dealt with in a later lecture.
  • The independent variables should be measured on interval scales. However, this is not a serious restriction, since most ordinal scale measurement will be acceptable in practice; 2-valued categorical variables (dichotomies) can be used directly; and there is a way of dealing with k-valued categorical variables (k usually stands for any integer greater than 2), by dummy variables, which we will discuss in the next lecture.
  • The distributions of all the variables should be normal. If they are not roughly normal, this can often be corrected by using an appropriate transformation (e.g. taking logarithms of all the measurements).
  • The relationships between the dependent variable and the independent variable should be linear. That is, it should be possible to draw a rough straight line through an x-y scattergram of the observed points. If the line looks curved, but is monotonic (increases or decreases all the time), things are not too bad and could be made better by transformation. If the line looks U-shaped, we will need to take special steps before regression can be used.
  • Although the independent variables can be correlated, there must be no perfect (or near-perfect) correlations among them, a situation called multicollinearity (which will be explained later in the course).
  • There must be no interactions, in the anova sense, between independent variables - the effect of each on the dependent variable must be roughly independent of the effects of all others. However, if interactions are obviously present, and not too complex, there are special steps we can take to cope with the situation.

The idea of a regression equation

Like many statistical procedures, multiple regression has two functions: to summarise some data, and to examine it for (statistically) significant trends. The first of these is part of descriptive statistics, the second of inferential statistics. We spend most of our time in elementary statistics courses thinking about inferential statistics, because at that level they are usually more difficult. But at any level, descriptive statistics are more important. In this section, we concentrate on how multiple regression describes a set of data.

How do we choose a descriptive statistic?

Any number we use to summarise a set of numbers is called a descriptive statistic. Many different descriptive statistics can be calculated for a given set of numbers, and different ones are useful for different purposes. In many cases, a descriptive statistic is chosen because it is in some sense the best summary of a particular type. But what do we mean by "best"?

Consider the best known of all descriptive statistics, the arithmetic mean - what lay people call the average. Why is this the best summary of a set of numbers? There is an answer, but it isn't obvious. The mean is the value from which the numbers in the set have the minimum sum of squared deviations. For the meaning of this, see Figure 1.

Consider observation 1. Its y value is \(y_1\). If we consider an "average" value \(\bar{y}\), we define the deviation from the average as \(y_1 - \bar{y}\), the squared deviation from the average as \((y_1 - \bar{y})^2\), and the sum of squared deviations as \(\sum_i (y_i - \bar{y})^2\). The arithmetic mean turns out to be the value of \(\bar{y}\) that makes this sum lowest. It also, of course, has the property that \(\sum_i (y_i - \bar{y}) = 0\); that, indeed, is its definition.
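Both properties of the mean are easy to check numerically. The following is a small sketch on made-up numbers (the data and the grid of candidate "averages" are assumptions chosen purely for illustration): it searches a grid of candidate values for the one with the lowest sum of squared deviations and confirms that the deviations from the mean sum to zero.

```python
from statistics import mean

ys = [2.0, 4.0, 4.0, 5.0, 9.0]   # made-up observations

def sum_sq_dev(values, centre):
    """Sum of squared deviations of the values from a candidate 'average'."""
    return sum((y - centre) ** 2 for y in values)

y_bar = mean(ys)

# Candidate averages from 2 below the mean to 2 above it, in steps of 0.1.
candidates = [y_bar + step / 10 for step in range(-20, 21)]
best = min(candidates, key=lambda c: sum_sq_dev(ys, c))

# The raw (unsquared) deviations from the mean cancel out.
sum_dev = sum(y - y_bar for y in ys)

print("mean:", y_bar, "| best candidate:", best, "| sum of deviations:", sum_dev)
```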

Describing data with a simple regression equation

If we look at Figure 1, it's obvious that we could summarise the data better if we could find some way of representing the fact that the observations with high y values tend to be those with high x values. Graphically, we can do this by drawing a straight line on the graph so it passes through the cluster of points, as in Figure 2. Simple regression is a way of choosing the best straight line for this job.

This raises two problems: what is the best straight line, and how can we describe it when we have found it?
Let's deal first with describing a straight line. This is GCSE maths. Any straight line can be described by an equation relating the y values to the x values. In general, we usually write

\[ y = mx + c \]

Here m and c are constants whose values tell us which of the infinite number of possible straight lines we are looking at. m (from French monter) tells us about the slope or gradient of the line. Positive m means the line slopes upwards to the right; negative m, that it slopes downwards. High m values mean a steep slope, low values a shallow one. c (from French couper) tells us about the intercept, i.e. where the line cuts the y axis: positive c means that when x is zero, y has a positive value; negative c means that when x is zero, y has a negative value. But for regression purposes, it's more convenient to use different symbols. We usually write:

\[ y = a + bx \]

This is just the same equation with different names for the constants: a is the intercept, b is the gradient.

The problem of choosing the best straight line then comes down to finding the best values of a and b. We define "best" in the same way as we did when we explained why the mean is the best summary: we choose the a and b values that give us the line such that the sum of squared deviations from the line is minimised. This is illustrated in Figure 3. The best line is called the regression line, and the equation describing it is called the regression equation. The deviations from the line are also called residuals.
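As a sketch of how a and b can be found, the usual closed-form least-squares solution is b = Sxy / Sxx and a = mean(y) - b × mean(x), where Sxy is the sum of products of deviations and Sxx the sum of squared x deviations. These formulas are standard results rather than something derived in this handout, and the data and function names below are made up for illustration.

```python
from statistics import mean

xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up x values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]    # made-up y values

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_sq_residuals(a, b, xs, ys):
    """Sum of squared deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = fit_line(xs, ys)
print(f"regression equation: y = {a:.2f} + {b:.2f}x")

# Nudging either constant away from the fitted value makes the fit worse.
print(sum_sq_residuals(a, b, xs, ys) < sum_sq_residuals(a + 0.1, b, xs, ys))
print(sum_sq_residuals(a, b, xs, ys) < sum_sq_residuals(a, b - 0.1, xs, ys))
```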

Goodness of fit

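Having found the best straight line, the next question is how well it describes the data. We measure this by the fraction

\[ R^2 = 1 - \frac{\mbox{sum of squared deviations from the regression line}}{\mbox{sum of squared deviations from the mean}} \]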

This is called the variance accounted for, symbolised by VAC or \(R^2\). Its square root is the Pearson correlation coefficient. \(R^2\) can vary from 0 (the points are completely random) to 1 (all the points lie exactly on the regression line); quite often it is reported as a percentage (e.g. 73% instead of 0.73). The Pearson correlation coefficient (usually symbolised by \(r\)) is always reported as a decimal value. It can take values from -1 to +1; if the value of \(b\) is negative, the value of \(r\) will also be negative.

Note that two sets of data can have identical a and b values and very different \(R^2\) values, or vice versa. Correlation measures the strength of a linear relationship: it tells you how much scatter there is about the best fitting straight line through a scattergram. a and b, on the other hand, tell you what the line is. The values of a and b will depend on the units of measurement used, but the value of \(r\) is independent of units. If we transform y and x to z-scores, which involves rescaling them so they have means of zero and standard deviations of 1, b will equal \(r\).
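These claims can be checked on a small example. The sketch below uses made-up data and illustrative function names; it computes the variance accounted for as 1 minus the ratio of squared deviations from the line to squared deviations from the mean, then shows that rescaling x changes b but not the variance accounted for, and that after converting both variables to z-scores the slope equals r.

```python
from math import sqrt
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def r_squared(xs, ys):
    """Variance accounted for by the least-squares line."""
    a, b = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_line / ss_mean

def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up data with a downward trend
ys = [9.0, 7.5, 8.0, 5.0, 4.5, 2.0]

a, b = fit_line(xs, ys)
r2 = r_squared(xs, ys)
r = -sqrt(r2) if b < 0 else sqrt(r2)   # r takes the sign of b

# Change the units of x (say, metres to centimetres): b changes, R squared does not.
xs_rescaled = [100 * x for x in xs]
_, b_rescaled = fit_line(xs_rescaled, ys)
r2_rescaled = r_squared(xs_rescaled, ys)

# After standardising both variables, the slope equals r.
_, b_standardised = fit_line(z_scores(xs), z_scores(ys))

print(f"b = {b:.3f}, rescaled b = {b_rescaled:.5f}, r = {r:.3f}, standardised b = {b_standardised:.3f}")
```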

Note carefully that a, b, \(R^2\) and \(r\) are all descriptive statistics. We have not said anything about significance tests. Given a set of paired x and y values, we can use virtually any statistics package to find the corresponding values of a, b and \(R^2\). It will also do some significance tests for us. The way to do this is described later. All the calculations can also be done by hand, or on a pocket calculator that has statistical functions.

From simple regression to multiple regression

What happens if we have more than two independent variables? In most cases, we can't draw graphs to illustrate the relationship between them all. But we can still represent the relationship by an equation. This is what multiple regression does. It's a straightforward extension of simple regression. If there are \(n\) independent variables, we call them \(x_1\), \(x_2\), \(x_3\) and so on up to \(x_n\). Multiple regression then finds values of \(a\), \(b_1\), \(b_2\), \(b_3\) and so on up to \(b_n\) which give the best fitting equation of the form

\[ y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n \]

\(b_1\) is called the coefficient of \(x_1\), \(b_2\) is the coefficient of \(x_2\), and so forth. The equation is exactly like the one for simple regression, except that it is very laborious to work out the values of \(a\), \(b_1\) etc. by hand. Most statistics packages, however, do it with exactly the same command as for simple regression.

What do the regression coefficients mean? The coefficient of each independent variable tells us what relation that variable has with \(y\), the dependent variable, when all the other independent variables are held constant. So, if \(b_1\) is high and positive, that means that if \(x_2\), \(x_3\) and so on up to \(x_n\) do not change, then increases in \(x_1\) will correspond to large increases in \(y\).
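To give a feel for what a statistics package does behind the scenes, here is a minimal pure-Python sketch that finds a, b_1, ..., b_n by solving the normal equations (X'X)β = X'y with Gaussian elimination. The normal-equations approach is one standard route to the least-squares solution, not necessarily what Minitab or SPSS do internally, and the function names and data are assumptions made for illustration. The made-up y values are generated exactly from y = 1 + 2x_1 - 3x_2, so the fit should recover those constants.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest entry in this column, for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution, starting from the last unknown.
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def fit_multiple(rows, ys):
    """Return [a, b_1, ..., b_n] minimising the sum of squared residuals."""
    design = [[1.0] + [float(x) for x in row] for row in rows]   # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Made-up data: each row holds (x_1, x_2) for one observation.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

a, b1, b2 = fit_multiple(rows, ys)
print(f"y = {a:.2f} + {b1:.2f} x1 + {b2:.2f} x2")
```

Here b1 describes how y changes with x_1 when x_2 is held constant, which is why the fit recovers 2 even though x_1 and x_2 are correlated in the sample.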

Goodness of fit in multiple regression

In multiple regression, as in simple regression, we can work out a value for \(R^2\). However, every time we add another independent variable, we necessarily increase the value of \(R^2\) (you can get a feel for how this happens if you compare Fig 3 with Fig 1). Therefore, in assessing the goodness of fit of a regression equation, we usually work in terms of a slightly different statistic, called \(R^2\)-adjusted or \(R^2_{adj}\). This is calculated as

\[ R^2_{adj} = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1} \]

where \(N\) is the number of observations in the data set (usually the number of people) and \(n\) the number of independent variables or regressors. This allows for the extra regressors. You can see that \(R^2_{adj}\) will always be lower than \(R^2\) if there is more than one regressor. There is also another way of assessing goodness of fit in multiple regression, using the F statistic which is discussed below. It is possible in principle to take the square root of \(R^2\) or \(R^2_{adj}\) to get what is called the multiple correlation coefficient, but we don't usually bother.
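As a quick sketch of the adjustment, using the standard definition in which the unexplained fraction (1 - R²) is multiplied by (N - 1)/(N - n - 1), the function below shows how the penalty grows as regressors are added to a small sample. The function name and the example numbers are assumptions for illustration.

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R squared for a model with an intercept and n_regressors regressors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Same raw R squared of 0.50 from 16 observations, with more and more regressors.
for n in (1, 3, 6):
    print(n, "regressors ->", round(adjusted_r2(0.50, 16, n), 3))
```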


Regression equations can also be used to obtain predicted or fitted values of the dependent variable for given values of the independent variable. If we know the values of \(x_1\), \(x_2\), ..., \(x_n\), it is obviously a simple matter to calculate the value of \(y\) which, according to the equation, should correspond to them: we just multiply \(x_1\) by \(b_1\), \(x_2\) by \(b_2\), and so on, and add all the products to \(a\). We can do this for combinations of independent variables that are represented in the data, and also for new combinations. We need to be careful, though, of extending the independent variable values far outside the range we have observed (extrapolating), as it is not guaranteed that the regression equation will still hold accurately.
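A sketch of this calculation, with a simple guard against extrapolation, might look as follows. The intercept, coefficients and observed ranges are hypothetical values invented for the example, as are the function names.

```python
def predict(intercept, coefficients, values):
    """Fitted value: a + b_1*x_1 + ... + b_n*x_n."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

def outside_observed_range(values, observed_ranges):
    """For each regressor, True if the value lies outside the (low, high) range seen in the data."""
    return [not (low <= v <= high) for v, (low, high) in zip(values, observed_ranges)]

a = 10.0                          # hypothetical intercept
bs = [0.5, -2.0]                  # hypothetical coefficients b_1, b_2
ranges = [(0.0, 10.0), (0.0, 10.0)]   # hypothetical observed range of each regressor

print(predict(a, bs, [4.0, 3.0]), outside_observed_range([4.0, 3.0], ranges))
print(predict(a, bs, [4.0, 30.0]), outside_observed_range([4.0, 30.0], ranges))
```

The second prediction is still computable, but the flag marks it as an extrapolation that should be treated with caution.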

Interpreting and reporting multiple regression results

The main questions multiple regression answers

Multiple regression enables us to answer five main questions about a set of data, in which \(n\) independent variables (regressors), \(x_1\) to \(x_n\), are being used to explain the variation in a single dependent variable, \(y\).

  1. How well do the regressors, taken together, explain the variation in the dependent variable? This is assessed by the value of \(R^2_{adj}\). As a very rough guide, in psychological applications we would usually reckon an \(R^2_{adj}\) of above 75% as very good; 50-75% as good; 25-50% as fair; and below 25% as poor and perhaps unacceptable. Alas, \(R^2_{adj}\) values above 90% are rare in psychological data, and if you get one, you should wonder whether there is some artefact in your data.
  2. Are the regressors, taken together, significantly associated with the dependent variable? This is assessed by the statistic F in the "Analysis of Variance" or anova part of the regression output from a statistics package. This is the Fisher F as used in the ordinary anova, so its significance depends on its degrees of freedom, which in turn depend on sample sizes and/or the nature of the test used. As in anova, F has two degrees of freedom associated with it. In general they are referred to as the numerator and denominator degrees of freedom (because F is actually a ratio). In regression, the numerator degrees of freedom are associated with the regression (and equal the number of regressors used), and the denominator degrees of freedom with the residual or error; you can find them in the Regression and Error rows of the anova table in the output from a statistics package. If you were finding the significance of an F value by looking it up in a book of tables, you would need the degrees of freedom to do it. Statistics packages normally work out significances for you, and you will find them in the anova table next to the F value, but you need to use the degrees of freedom when reporting the results (see below). It is useful to remember that the higher the value of F, the more significant it will be for given degrees of freedom.
  3. What relationship does each regressor have with the dependent variable when all other regressors are held constant? This is answered by looking at the regression coefficients. Some statistics packages (e.g. Minitab) report these twice, once in the form of a regression equation and again (to an extra decimal place) in a table of regression coefficients and associated statistics. Note that regression coefficients have units. So if the dependent variable is number of cigarettes smoked per week, and one of the regressors is annual income, the coefficient for that regressor would have units of (cigarettes per week) per (pound of income per year). That means that if we changed the units of one of the variables, the regression coefficient would change - but the relationship it is describing, and what it is saying about it, would not. So the size of a regression coefficient doesn't tell us anything about the strength of the relationship it describes until we have taken the units into account. The fact that regression coefficients have units also means that we can give a precise interpretation to each coefficient. So, staying with smoking and income, a coefficient of 0.062 in this case would mean that, with all other variables held constant, increasing someone's income by one pound per year is associated with an increase of cigarette consumption of 0.062 cigarettes per week (we might want to make this easier to grasp by saying that an increase in income of 1 pound per week would be associated with an increase in cigarette consumption of 52 * 0.062 = 3.2 cigarettes per week). Negative coefficients mean that when the regressor increases, the dependent variable decreases. If the regressor is a dichotomous variable (e.g. gender), the size of the coefficient tells us the size of the difference between the two classes of individual (again, with all other variables held constant). So a gender coefficient of 2.6, with women coded 0 and men coded 1, would mean that with all other variables held constant, men's dependent variable scores would average 2.6 units higher than women's.
  4. Which independent variable has most effect on the dependent variable? It is not possible to give a fully satisfactory answer to this question, for a number of reasons. The chief one is that we are always looking at the effect of each variable in the presence of all the others; since the independent variables need not be independent of one another, it is hard to be sure which one is contributing to a joint relationship (or even to be sure that that means anything). However, the usual way of addressing the question is to look at the standardised regression coefficients or beta weights for each variable; these are the regression coefficients we would get if we converted all variables (independent and dependent) to z-scores before doing the regression. SPSS reports beta weights for each independent variable in its regression output; Minitab, unfortunately, does not.
  5. Are the relationships of each regressor with the dependent variable statistically significant, with all other regressors taken into account? This is answered by looking at the t values in the table of regression coefficients. The degrees of freedom for t are those for the residual in the anova table, but statistics packages work out significances for us, so we need to know the degrees of freedom only when it comes to reporting results. Note that if a regression coefficient is negative, most packages will report the corresponding t value as negative, but if you were looking it up in tables, you would use the absolute (unsigned) value, and the sign should be dropped when reporting results.
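Two of the points in this list lend themselves to a short sketch: changing the units of a regressor, and turning an ordinary coefficient into a beta weight. The beta weight formula used here, b multiplied by the standard deviation of the regressor and divided by the standard deviation of the dependent variable, is the standard identity for what you get by z-scoring all variables first; it is stated as an assumption because the handout does not give it explicitly. Function names and the small data set are made up.

```python
from statistics import stdev

def rescale_coefficient(b, old_units_per_new_unit):
    """Coefficient after re-expressing the regressor in a unit worth this many old units."""
    return b * old_units_per_new_unit

def beta_weight(b, xs, ys):
    """Standardised coefficient: b * sd(x) / sd(y)."""
    return b * stdev(xs) / stdev(ys)

# Smoking example from the text: 0.062 cigarettes/week per pound of income per year.
# One pound per week is 52 pounds per year.
per_pound_per_week = rescale_coefficient(0.062, 52)
print(f"{per_pound_per_week:.3f} cigarettes per week per (pound per week)")

# Made-up data lying exactly on y = 2x, so b = 2 and the relationship is perfect.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print("beta weight:", beta_weight(2.0, xs, ys))
```

The raw coefficient changes by a factor of 52 with the change of units, while a beta weight is unit-free, which is why beta weights are the usual basis for comparing regressors.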

Further questions to ask

Either the nature of the data, or the regression results, may suggest further questions. For example, you may want to obtain means and standard deviations or histograms of variables to check on their distributions; or plot one variable against another, or obtain a matrix of correlations, to check on first order relationships. You should also check for unusual observations or "outliers": these will be discussed in the next session.

Reporting regression results

Research articles frequently report the results of several different regressions done on a single data set. In this case, it is best to present the results in a table. Where a single regression is done, however, that is unnecessary, and the results can be reported in text. The wording should be something like the following - this is for the depression vs age, income and gender example used as a Minitab example below:

The data were analysed by multiple regression, using as regressors age, income and gender. The regression was a rather poor fit (\(R^2_{adj}\) = 40%), but the overall relationship was significant (\(F_{3,12}\) = 4.32, p < 0.05). With other variables held constant, depression scores were negatively related to age and income, decreasing by 0.16 for every extra year of age, and by 0.09 for every extra pound per week income. Women tended to have higher scores than men, by 3.3 units. Only the effect of income was significant (\(t_{12}\) = 3.18, p < 0.01).

Normally you will need to go on to discuss the meaning of the trends you have described.

Note the following pitfalls for the unwary:

  • The above brief paragraph does not exhaust what you can say about a set of regression results. There may be features of the data you should look at - "Unusual observations", for example.
  • Always report what happened before moving on to its significance - so \(R^2_{adj}\) values before F values, regression coefficients before t values. Remember, descriptive statistics are more important than significance tests.
  • Degrees of freedom for both F and t values must be given. Usually they are written as subscripts. For F the numerator degrees of freedom are given first. You can also put degrees of freedom in parentheses, or report them explicitly, e.g.: "F(3,12) = 4.32" or "F = 4.32, d. of f. = 3, 12".
  • Significance levels can either be reported exactly (e.g. p = 0.032) or in terms of conventional levels (e.g. p < 0.05). There are arguments in favour of either, so it doesn't much matter which you do. But you should be consistent in any one report.
  • Beware of highly significant F or t values, whose significance levels will be reported by statistics packages as, for example, 0.0000. It is an act of statistical illiteracy to write p = 0.0000, because significance levels can never be exactly zero - there is always some probability that the observed data could arise if the null hypothesis was true. What the package means is that this probability is so low it can't be represented with the number of columns available. We should write it as p < 0.00005 (or, if we are using conventional levels, p < 0.001).
  • Beware of spurious precision, i.e. reporting coefficients etc. to huge numbers of significant figures when, on the basis of the sample you have, you couldn't possibly expect them to replicate to anything like that degree of precision if someone repeated the study. F and t values are conventionally reported to two decimal places, and \(R^2_{adj}\) values to the nearest percentage point (sometimes to one decimal place). For coefficients, you should be guided by the sample size: with a sample size of 16 as in the example above, two significant figures is plenty, but even with more realistic samples, in the range of 100 to 1000, three significant figures is usually as far as you should go. This means that you will usually have to round off the numbers that statistics packages give you.
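The last three pitfalls are mechanical enough to sketch in code. The helper names below are hypothetical, and the conventional levels are the ones mentioned above (0.05, 0.01, 0.001).

```python
def format_p(p, decimals=4):
    """Exact significance level, never reported as zero."""
    text = f"{p:.{decimals}f}"
    if float(text) == 0.0:
        # The package would print 0.0000; report the bound that rounding implies instead.
        bound = 0.5 * 10 ** -decimals
        return f"p < {bound:.{decimals + 1}f}"
    return f"p = {text}"

def conventional_level(p):
    """Significance in terms of conventional levels."""
    for level in (0.001, 0.01, 0.05):
        if p < level:
            return f"p < {level}"
    return "p >= 0.05"

def round_sig(value, figures):
    """Round to a given number of significant figures."""
    return float(f"{value:.{figures}g}")

print(format_p(0.0000001))         # tiny p value that a package would print as 0.0000
print(format_p(0.032, decimals=3))
print(conventional_level(0.032))
print(round_sig(0.16234, 2), round_sig(3.22456, 2))
```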

Carrying out multiple regression


At the end of the handout there is a complete worked example on some made-up data, in which we attempt to predict scores on a paper and pencil test of depression (running from 0 to 100) from income (in pounds/week), gender (coded 0 for men and 1 for women) and age. Note that the REGRESS command, which actually carries out the regression, needs us to tell it how many independent variables there are. It is very important to make sure that we then provide the corresponding number of columns - if we provide too many, Minitab will not warn us of the error, but will write some detailed results into the extra columns, thus overwriting any data we might have in them, and producing mystifying errors later in our analysis.

The SPSS example uses a set of data on the psychology of tax avoidance. An appropriate command file would be as follows:

  • The regression command indicates that one or several regression analyses are to be carried out, and is followed by a list of all the variables that are to be used, either as dependent or as independent variables. In this case they include an index of tax evasion, and 15 questionnaire items measuring alienation, free-rider tendencies and attitudes to the law.
  • The /statistics line can be used to control what sort of output we get. The default output is very similar to Minitab's regression output.
  • The /missing line tells the system how to deal with missing values. Replacing them with the mean, as here, is not very satisfactory, but was necessary with this data set because there were too many missing values to discard all cases involving missing values on any variable (the more usual procedure).
  • The /dependent line tells us which variable will be the dependent variable. If we give no other information, all the others will be used as independent variables. This line must come immediately before the /method line.
  • The /method line tells us how to use the independent variables. The simple enter option used here will run one regression, using all the independent variables. We shall look at other possibilities in a later lecture.

Output from this file is given at the end of this handout. It shows that the 15 questionnaire items do quite a good job of predicting tax avoidance.


1. The following are the IQ scores on the Verbal and Numerical scales of a certain test for a group of students:

Use Minitab to calculate the mean and standard deviation of the scores on each scale. Use LET to work out the difference between them and put it in a new column. Use TTEST on this column to see whether there is a significant difference between the verbal and numerical scores.

2. Using the data from the previous example, work out the regression line for predicting Numerical scores (dependent variable) from Verbal scores (independent variable).

3. A social psychologist observes the scores achieved on a video game in a pub, by the first new (previously unobserved) player to use the machine after each half hour through the evening. They are as follows:

Use SPSS to investigate whether the data support the psychologist's hypothesis that more expert players use the machine later in the evening. What would be the most likely score to observe at 9.45pm?

4. The following data show the levels of anxiety recorded by a paper-and-pencil test just before a group of students took an examination, together with the exam marks obtained. Use Minitab's PLOT command to decide whether it would be appropriate to use linear regression to summarize these data.

5. The Singer file /singer1/eps/psybin/stats/teengamb.DAT contains, for each of 47 teenagers, the following information:

  1. subject number
  2. gender (0=male, 1=female)
  3. status (arbitrary scale based on parents' occupation. Higher numbers => higher status)
  4. income (pocket money+earnings) in pounds/wk
  5. verbal intelligence (number of words out of 12 correctly defined)
  6. estimate (from questionnaire answers) of expenditure on all forms of gambling, in pounds/yr

Each line of the file contains all 6 data items for a single person. These are real data, collected during an undergraduate project a few years ago, and since published (Ide-Smith & Lea, 1988, Journal of Gambling Behavior, 4, 110-118). Note, though, that you won't get quite the same results as in the published article, because I've cut out the data from some subjects whose data would have given you problems.

Set up a Minitab worksheet with columns with appropriate names, and read these data into it using READ. Note that you don't need to type the file extension (.DAT) because this is the default for READ, but if you do type it, you must use CAPITALS. The rest of the filename must be typed in lower case.

How to choose the right statistical test?

Today statistics provides the basis for inference in most medical research. Yet, for want of exposure to statistical theory and practice, it continues to be regarded as the Achilles heel by all concerned in the loop of research and publication – the researchers (authors), reviewers, editors and readers.

Most of us are familiar to some degree with descriptive statistical measures such as those of central tendency and those of dispersion. However, we falter at inferential statistics. This need not be the case, particularly with the widespread availability of powerful and at the same time user-friendly statistical software. As we have outlined below, a few fundamental considerations will lead one to select the appropriate statistical test for hypothesis testing. However, it is important that the appropriate statistical analysis is decided before starting the study, at the stage of planning itself, and the sample size chosen is optimum. These cannot be decided arbitrarily after the study is over and data have already been collected.

The great majority of studies can be tackled with a basket of some 30 tests out of the more than 100 that are in use. The test to be used depends on the type of research question being asked. The other determining factors are the type of data being analyzed and the number of groups or data sets involved in the study. The following schemes, based on five generic research questions, should help.[1]

Question 1: Is there a difference between groups that are unpaired? Groups or data sets are regarded as unpaired if there is no possibility of the values in one data set being related to or influenced by the values in the other data sets. Different tests are required for quantitative (numerical) data and qualitative (categorical) data, as shown in Fig. 1. For numerical data, it is important to decide whether they follow a normal (Gaussian) distribution, in which case parametric tests are applied. If the distribution of the data is not normal, or if one is not sure about the distribution, it is safer to use non-parametric tests. When comparing more than two sets of numerical data, a multiple group comparison test such as one-way analysis of variance (ANOVA) or the Kruskal-Wallis test should be used first. Only if it returns a statistically significant p value (usually meaning p < 0.05) should it be followed by a post hoc test to determine exactly which data sets differ from each other. Repeatedly applying the t test or its non-parametric counterpart, the Mann-Whitney U test, to a multiple group situation increases the probability of incorrectly rejecting the null hypothesis.

Fig. 1. Tests to address the question: Is there a difference between groups – unpaired (parallel and independent groups) situation?
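To see why repeated pairwise testing is discouraged, it helps to put a number on the inflation. The following is a minimal sketch, not part of the original scheme: the function name `familywise_error_rate` is hypothetical, and the formula assumes the pairwise tests are independent, which is a simplification.

```python
from math import comb


def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across all pairwise tests.

    Assumes every pairwise test is independent and run at the same alpha.
    Real pairwise tests share data, so treat the result as a rough guide.
    """
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    for k in (2, 3, 5):
        rate = familywise_error_rate(k)
        print(f"{k} groups, {comb(k, 2)} comparisons: {rate:.3f}")
```

Under these assumptions, five groups need ten pairwise comparisons, and the chance of at least one spurious "significant" result rises to roughly 40%.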

Question 2: Is there a difference between groups that are paired? Pairing signifies that data sets are derived by repeated measurements (e.g. before-after measurements or multiple measurements across time) on the same set of subjects. Pairing will also occur if subject groups are different but values in one group are in some way linked or related to values in the other group (e.g. twin studies, sibling studies, parent-offspring studies). A crossover study design also calls for the application of paired group tests for comparing the effects of different interventions on the same subjects. Sometimes subjects are deliberately paired to match baseline characteristics such as age, sex, severity or duration of disease. A scheme similar to Fig. 1 is followed in paired data set testing, as outlined in Fig. 2. Once again, multiple data set comparison should be done through appropriate multiple group tests followed by post hoc tests.

Fig. 2. Tests to address the question: Is there a difference between groups – paired situation?
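What sets paired tests apart is that they work on the within-subject differences rather than on the two groups separately. As an illustrative sketch only (the function name and the before-after values are made up for this example), the paired t statistic can be computed like this:

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired before/after measurements.

    The test reduces each subject to a single difference and then asks
    whether the mean difference is far from zero relative to its
    standard error.
    """
    if len(before) != len(after):
        raise ValueError("paired data need one 'after' value per 'before' value")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical measurements on the same four subjects
    t, df = paired_t_statistic([10, 12, 9, 11], [12, 14, 10, 14])
    print(f"t = {t:.3f} on {df} degrees of freedom")
```

The resulting t value would then be referred to the t distribution with n - 1 degrees of freedom, which statistical software does automatically.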

Question 3: Is there any association between variables? The various tests applicable are outlined in Fig. 3. It should be noted that the tests meant for numerical data are for testing the association between two variables. These are correlation tests, and they express the strength of the association as a correlation coefficient. An inverse correlation between two variables is indicated by a minus sign. All correlation coefficients vary in magnitude from 0 (no correlation at all) to 1 (perfect correlation). Even a perfect correlation may suggest causality but does not by itself establish it. When two numerical variables are linearly related to each other, a linear regression analysis can generate a mathematical equation that predicts the dependent variable from a given value of the independent variable.[2] Odds ratios and relative risks are the staple of epidemiologic studies and express the association between categorical data that can be summarized as a 2 × 2 contingency table. Logistic regression is actually a multivariate analysis method that expresses the strength of the association between a binary dependent variable and two or more independent variables as adjusted odds ratios.

Fig. 3. Tests to address the question: Is there an association between variables?
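The measures of association mentioned here are simple enough to compute by hand. The sketch below is only an illustration: the function names and the example numbers are hypothetical. It shows Pearson's correlation coefficient with the matching least-squares line, and the odds ratio and relative risk from a 2 × 2 table.

```python
from math import sqrt
from statistics import mean


def pearson_and_regression(x, y):
    """Return (r, slope, intercept) for the least-squares line y = slope*x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, my - slope * mx


def odds_ratio_and_relative_risk(exp_cases, exp_noncases, unexp_cases, unexp_noncases):
    """Return (odds ratio, relative risk) for a 2 x 2 exposure-by-outcome table."""
    odds_ratio = (exp_cases * unexp_noncases) / (exp_noncases * unexp_cases)
    risk_exposed = exp_cases / (exp_cases + exp_noncases)
    risk_unexposed = unexp_cases / (unexp_cases + unexp_noncases)
    return odds_ratio, risk_exposed / risk_unexposed


if __name__ == "__main__":
    r, slope, intercept = pearson_and_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    print(f"r = {r:.3f}, y = {slope:.2f}x + {intercept:.2f}")
    # Hypothetical table: 20/100 exposed and 10/100 unexposed subjects are cases
    print(odds_ratio_and_relative_risk(20, 80, 10, 90))
```

Note that the odds ratio and the relative risk differ for the same table; they are close only when the outcome is rare.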

Question 4: Is there agreement between data sets? This can be a comparison of a new screening technique against the standard test, of a new diagnostic test against the available gold standard, or of the ratings or scores given by different observers. As seen from Fig. 4, agreement between numerical variables may be expressed quantitatively by the intraclass correlation coefficient or graphically by constructing a Bland-Altman plot, in which the difference between two variables x and y is plotted against the mean of x and y. In the case of categorical data, Cohen’s kappa statistic is frequently used, with kappa (which varies from 0 for no agreement at all to 1 for perfect agreement) indicating strong agreement when it is > 0.7. It is inappropriate to infer agreement by showing that there is no statistically significant difference between means or by calculating a correlation coefficient.

Fig. 4. Tests to address the question: Is there an agreement between assessment (screening / rating / diagnostic) techniques?
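A small numerical sketch may make the two agreement measures concrete. The function names and the data are hypothetical. The limits of agreement use the conventional mean difference ± 1.96 standard deviations, which assumes the differences are roughly normally distributed.

```python
from collections import Counter
from statistics import mean, stdev


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same category at random
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)


def bland_altman_limits(x, y):
    """Return (bias, lower limit, upper limit) of agreement between two methods."""
    diffs = [a - b for a, b in zip(x, y)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
    b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "pos"]
    print(f"kappa = {cohen_kappa(a, b):.2f}")
    # Method y always reads one unit higher than method x
    print(bland_altman_limits([10.0, 12.0, 14.0, 16.0], [11.0, 13.0, 15.0, 17.0]))
```

The second data set shows why a correlation coefficient is a poor measure of agreement: x and y are perfectly correlated, yet one method consistently reads one unit higher than the other, which the Bland-Altman bias makes visible.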

Question 5: Is there a difference between time-to-event trends or survival plots? This question is specific to survival analysis[3] (the endpoint for such analysis could be death or any event that can occur after a period of time), which is characterized by censoring of data, meaning that a sizeable proportion of the original study subjects may not reach the endpoint in question by the time the study ends. Data sets for survival trends are always considered to be non-parametric. If there are two groups, the applicable tests are the Cox-Mantel test, Gehan’s (generalized Wilcoxon) test or the log-rank test. In the case of more than two groups, Peto and Peto’s test or the log-rank test can be applied to look for a significant difference between time-to-event trends.
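To show how censoring enters the calculation, here is a minimal sketch of the two-group log-rank test. The function name, the data layout as (time, event) pairs and the example values are assumptions made for illustration; in practice one would use an established statistics package.

```python
from math import erfc, sqrt


def logrank_test(group_a, group_b):
    """Two-group log-rank test.

    Each group is a list of (time, event) pairs, where event is 1 if the
    endpoint was observed and 0 if the subject was censored at that time.
    Returns (chi-square statistic on 1 degree of freedom, p value).
    Assumes both groups still have subjects at risk at some event time,
    otherwise the variance is zero and the statistic is undefined.
    """
    event_times = sorted({t for t, e in group_a + group_b if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t (censored ones count until they leave)
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        d_a = sum(1 for time, e in group_a if time == t and e)
        d_b = sum(1 for time, e in group_b if time == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n ** 2 * (n - 1))
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom
    return chi2, erfc(sqrt(chi2 / 2.0))


if __name__ == "__main__":
    a = [(2, 1), (3, 1), (5, 0)]  # hypothetical: last subject censored at time 5
    b = [(4, 1), (6, 1), (7, 0)]
    chi2, p = logrank_test(a, b)
    print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

Censored subjects contribute to the numbers at risk up to their censoring time but never count as events, which is how the test uses incomplete follow-up without discarding it.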

It can be appreciated from the above outline that distinguishing between parametric and non-parametric data is important. Tests of normality (e.g. the Kolmogorov-Smirnov test or the Shapiro-Wilk goodness of fit test) may be applied rather than making assumptions. Some of the other prerequisites of parametric tests are that the samples have the same variance (i.e. are drawn from the same population), that observations within a group are independent, and that the samples have been drawn randomly from the population.
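As a rough illustration of what a normality test measures, the sketch below computes the Kolmogorov-Smirnov distance between a sample and a normal distribution fitted to it. The function name is hypothetical. Because the mean and standard deviation are estimated from the same data, the standard Kolmogorov-Smirnov critical values do not strictly apply, so only the statistic is returned here, not a p value.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic_vs_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF.

    Values near 0 mean the sample looks normal; larger values mean a
    poorer fit. Parameters are estimated from the sample itself.
    """
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


if __name__ == "__main__":
    bell_shaped = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
    skewed = [1.0] * 9 + [100.0]  # hypothetical sample with one extreme value
    print(f"bell-shaped: D = {ks_statistic_vs_normal(bell_shaped):.3f}")
    print(f"skewed:      D = {ks_statistic_vs_normal(skewed):.3f}")
```

Statistical software reports the corresponding p value directly; the point of the sketch is only that the test compares the observed cumulative distribution with the one a normal curve would predict.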

A one-tailed test calculates the probability of deviation from the null hypothesis in a specific direction, whereas a two-tailed test calculates the probability of deviation from the null hypothesis in either direction. When Intervention A is compared with Intervention B in a clinical trial, the null hypothesis assumes there is no difference between the two interventions. Deviation from this hypothesis can occur in favor of either intervention in a two-tailed test, but in a one-tailed test it is presumed that only one intervention can show superiority over the other. Although for a given data set a one-tailed test will return a smaller p value than a two-tailed test when the observed effect lies in the hypothesized direction, the latter is usually preferred unless there is a watertight case for one-tailed testing.
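The relationship between the two p values is easy to check numerically. This sketch assumes a test statistic that follows the standard normal distribution (a z test); the function name is hypothetical, and the one-tailed hypothesis is taken to be "the statistic is greater than zero".

```python
from statistics import NormalDist


def z_test_p_values(z):
    """Return (one-tailed upper p value, two-tailed p value) for a z statistic."""
    std_normal = NormalDist()
    one_tailed = 1.0 - std_normal.cdf(z)               # area above z only
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # area in both tails
    return one_tailed, two_tailed


if __name__ == "__main__":
    for z in (1.8, -1.8):
        one, two = z_test_p_values(z)
        print(f"z = {z:+.1f}: one-tailed p = {one:.4f}, two-tailed p = {two:.4f}")
```

With z = 1.8 the one-tailed p value falls below 0.05 while the two-tailed value does not, which is why the choice has to be justified in advance. With z = -1.8 the effect lies in the unexpected direction and the one-tailed p value is close to 1.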

It is obvious that we cannot refer to all statistical tests in one editorial. However, the schemes outlined will cover the hypothesis testing demands of the majority of observational as well as interventional studies. Finally, one must remember that there is no substitute for actually working hands-on with dummy or real data sets, and for seeking the advice of a statistician, in order to learn the nuances of statistical hypothesis testing.
