Multiple testing adjustment has gained in popularity with the large-scale datasets now used for exploratory purposes, and it has become a key consideration in statistical inference problems.

Common examples of multiple testing problems include testing whether several variables have an effect on a given outcome, or testing the effect of a single variable on a myriad of outcomes. In Genome Wide Association Studies (GWAS), for instance, researchers are interested in testing the relationship between a phenotype and millions of DNA nucleotide mutations.

In this article, I discuss the traditional frequentist approach to multiple testing, which is centered around the p-value. We will review the Bonferroni correction, Holm’s procedure, and the Benjamini-Hochberg procedure. Other popular (and sometimes more powerful) methods, such as Scheffé’s, Tukey’s, and resampling approaches, will not be covered. The structure of the article is inspired by a chapter in “An Introduction to Statistical Learning” (ISL) by Tibshirani et al.:

(I) Hypothesis testing and p-values: a quick refresher
(II) Controlling the Family-Wise Error Rate (FWER)
(III) Controlling the False Discovery Rate (FDR)

Concluding remarks will discuss how to approach multiple testing, and hypothesis testing in general. A link to the lab section of the ISL textbook, with R code, is provided at the end of this article.

(I) Hypothesis testing and p-values: a quick refresher

Before diving into multiple testing, let’s quickly go over the 4 steps to follow when testing a hypothesis. Table 1 below is a reminder of the two types of errors one can make when testing a hypothesis: the type I error (the false positive rate) and the type II error (the false negative rate).

Table 1: Type I and type II errors in hypothesis testing (by author)

(Step 1) Define H0 (the null hypothesis), the confidence level, and the power. Traditionally, we want to reach a conclusion with a confidence level of 95% and a statistical power of 80%. The “confidence level” is (1 − the type I error rate), and the “statistical power” is (1 − the type II error rate). In other words, we fix the Type I error at 5% and accept that we will falsely claim a finding 5% of the time.

(Step 2) Using the observed data, construct a test statistic to test the hypothesis. Usually, this statistic is based on the z, t or chi-square distribution.

(Step 3) Compute the p-value. The p-value measures how extreme it is to observe the test statistic we constructed in step 2, under the null hypothesis of no effect.

(Step 4) Make a decision to reject, or not to reject, the null hypothesis. This decision is based on the p-value and the pre-defined type I error rate: if the p-value is below 0.05, we reject the null hypothesis.

Testing a large number of hypotheses within this framework requires specific considerations. Because we fixed the Type I error at 5%, under regularity conditions we will, on average, falsely reject the null 5% of the time. This means that if we test 1000 true null hypotheses simultaneously, we expect to claim about 50 false findings just by chance. This is what makes multiple testing adjustment important. So, how do we adjust for multiple testing?

(II) Controlling the family-wise error rate (FWER)

When testing one hypothesis, we are concerned about erroneously rejecting that single null hypothesis. When testing multiple hypotheses simultaneously, we can instead be concerned about erroneously rejecting the 1st, 2nd… or mth null hypothesis. We define the family-wise error rate (FWER) as the probability of making at least one Type I error. Two of the most used methods to control the FWER are (A) the Bonferroni correction and (B) Holm’s procedure.

For the Bonferroni correction, we look for a per-test level that controls the FWER at 0.05. Since the probability of at least one false rejection is at most the sum of the individual rejection probabilities (the union bound), testing each hypothesis at level alpha/m keeps the FWER at or below m × (alpha/m) = alpha; this results in a simple division of alpha by the number of hypotheses m. So in practice, a p-value is considered significant if it is below the level alpha/m, where m is the number of hypotheses being tested.

The Bonferroni correction controls the probability of making any false positive claim (Type I error), but this comes at the price of making more false negative claims (Type II errors). When the hypotheses being tested are not independent, it follows from the addition rule that P(A or B) can be much smaller than P(A) + P(B), because the sum ignores the overlap between the two events. Consequently, the actual FWER achieved by the Bonferroni correction is often much smaller than the target level: the correction is known to be conservative, and is not the best approach in general. When p-values are not independent, other methods such as Tukey’s can provide a more suitable alternative.

Example of multiple testing using the Bonferroni correction: let’s imagine that we ran 5 tests simultaneously and obtained a p-value for each.
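As a minimal sketch of the Bonferroni correction on 5 simultaneous tests, the snippet below applies it in Python. The p-values themselves are hypothetical, chosen only for illustration (the article’s own results table is not reproduced here):

```python
# Bonferroni correction on m = 5 simultaneous tests.
# NOTE: these p-values are hypothetical, chosen only to illustrate the method.
alpha = 0.05
p_values = [0.001, 0.012, 0.03, 0.04, 0.55]
m = len(p_values)

threshold = alpha / m  # per-test level: 0.05 / 5 = 0.01
reject_unadjusted = [p < alpha for p in p_values]      # naive, no correction
reject_bonferroni = [p < threshold for p in p_values]  # Bonferroni-corrected

# Equivalently, inflate each p-value by m (capped at 1) and compare to alpha:
p_adjusted = [min(p * m, 1.0) for p in p_values]

print(reject_unadjusted)  # → [True, True, True, True, False]
print(reject_bonferroni)  # → [True, False, False, False, False]
```

Without any correction, four of the five tests would be declared significant; after the Bonferroni correction, only the smallest p-value survives. This is exactly the trade-off discussed above: fewer false positives, at the cost of more false negatives.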
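Holm’s procedure (a step-down method controlling the FWER) and the Benjamini-Hochberg procedure (a step-up method controlling the FDR instead) can both be sketched in a few lines of plain Python. The function names and the five p-values below are hypothetical, chosen only for illustration:

```python
# Sketch of Holm's step-down procedure and the Benjamini-Hochberg (BH)
# step-up procedure, on five hypothetical p-values.
alpha = 0.05
p_values = [0.001, 0.012, 0.03, 0.04, 0.55]  # hypothetical

def holm(pvals, alpha=0.05):
    """Reject the k-th smallest p-value while p_(k) <= alpha / (m - k + 1);
    stop at the first failure. Controls the FWER at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order, start=1):
        if pvals[i] <= alpha / (m - k + 1):
            reject[i] = True
        else:
            break  # once one rank fails, all larger p-values fail too
    return reject

def benjamini_hochberg(pvals, q=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * q. Controls the FDR at level q for independent tests."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= (k / m) * q:
            k_max = k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

print(holm(p_values, alpha))                # → [True, True, False, False, False]
print(benjamini_hochberg(p_values, alpha))  # → [True, True, True, True, False]
```

On these values, Holm rejects two hypotheses where Bonferroni would reject only one (Holm is uniformly more powerful while still controlling the FWER), and Benjamini-Hochberg rejects four, at the cost of controlling only the expected proportion of false discoveries. In practice, ready-made implementations exist, e.g. `statsmodels.stats.multitest.multipletests` in Python or `p.adjust` in R.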