onlineFDR.Rmd
onlineFDR
?
Multiple hypothesis testing is a fundamental problem in statistical inference, and the failure to manage multiple testing problems has been highlighted as one of the elements contributing to the replicability crisis in science (Ioannidis 2015). Methodologies have been developed to manage the multiple testing situation by adjusting the significance levels for a family of hypotheses, in order to control error metrics such as the familywise error rate (FWER) or the false discovery rate (FDR).
Frequently, modern data analysis problems have a further complexity in that the hypotheses arrive in a stream.
This introduces the challenge that at each step, the investigator must decide whether to reject the current null hypothesis without having access to the future p-values or the total number of hypotheses to be tested, but with the knowledge of the historic decisions to date.
The onlineFDR
package provides a family of algorithms
you can apply to a historic or growing dataset to control the FDR or
FWER in an online manner. At a high-level, these algorithms rely on a
concept called “alpha wealth” in which experiments cost some amount of
error from your “budget” but a discovery earns some of the budget
back.
This vignette explains the two main uses of the package and demonstrates their typical workflows.
We strive to make our R package as easy to use as possible. Please see the flowchart below to decide which function is best to solve your problem. The interactive version (click-to-functions) is available here.
We also have a provided a non-exhaustive list of answers to some questions you may have when navigating the flowchart.
The FDR is the expected proportion of false rejections out of all rejections. The FWER is the probability of making any false rejections at all. Controlling the FWER is more conservative than controlling the FDR. Note that in the case when all null hypotheses are true, the FDR and FWER are the same.
Offline refers to the case when all the hypotheses are tested simultaneously by an algorithm. Batch refers to the case when the hypotheses are tested as they arrive in batches. One-by-one refers to the case when hypotheses are tested as they arrive, one at a time.
‘Independent’ means that a given null p-value does not depend on any other non null p-values. A simple way to think about p-values being ‘positively dependent’ is to consider correlated hypothesis tests. For instance, consider testing for pairwise differences in means between 4 groups. If group A has an especially low mean, then not only would A vs. B yield a small p-value, but also A vs. C and A vs. D. Finally, ‘arbitrary dependence’ includes the situation where some of your p-values happen to be correlated with p-values from a long time ago.
LOND
, LORD
, SAFFRON
, and
ADDIS
?LOND
is a fairly simple algorithm where the significance
levels are multiplied by the number of discoveries/rejections that have
been made thus far. It also provably controls the FDR when the p-values
are positively correlated. However, the drawback is that unless many
discoveries are continually being made right from the start of an online
experiment, the adjusted significance levels (and hence the power) will
very quickly go towards zero. In this way, LOND
is
oblivious to the information it gained from the previous hypothesis
tests and does not take full advantage of its alpha-wealth.
LORD
improves upon LOND
by taking advantage
of “alpha investing” where it can regain some of its alpha-wealth when
it makes a discovery/rejection. The adjusted significance levels depend
not only on how many discoveries have been made, but also the timing of
these discoveries. However, one drawback is that LORD does not take
advantage of the strength of the signals present in the data (i.e. the
size of the p-values).
SAFFRON
improves upon this by focusing on the stronger
signals in the experiment (i.e. the smaller p-values). By removing the
possibility of ever rejecting weaker signals (those which are a
priori more likely to be truly null hypotheses),
SAFFRON
preserves alpha-wealth. When there is a substantial
fraction of non-nulls in the online experiment, SAFFRON
will often be more powerful than LORD
.
ADDIS
is a further improvement upon SAFFRON
because it invests alpha-wealth more effectively by explicitly
discarding the weakest signals (i.e. the largest p-values) in a
principled way. This can result in an even higher power.
This Quick Start guide is meant to provide a framework for you to use
any of the algorithms within the onlineFDR
package. The
algorithms used in the examples below were selected arbitrarily for the
sake of example.
In general, your dataset should contain, at the minimum, a column of p-values (‘pval’)). You can also pass in an id column (‘id’) or a date column (‘date’), but that is optional; the p-values will be treated as being ordered in sequence. Alternatively, you can also use just the vector of p-values, in which case, the p-values will also be treated as being ordered in sequence.
If you are using the Batch algorithms, ensure that your dataset
contains a column (‘batch’) where batches are defined in sequence
starting from 1. For more complex data structures, you may want to
consider using the STAR algorithms (see LONDstar()
,
LORDstar()
, and SAFFRONstar()
). If you are not
sure which algorithm to use, click here.
All p-values generated should be passed to the function (and not just the significant p-values). An exception to this would be if you have implemented an orthogonal filter to reduce the dataset size, such as discussed in (Burgon et al., 2010).
If you’re using LOND()
, LORD()
,
SAFFRON()
or ADDIS()
, it orders the p-values
by date. If there are multiple p-values with the same date (i.e. the
same batch), the order of the p-values within each batch is randomised
by default. Generally, users should randomise unless they have a
priori knowledge that hypotheses should be ordered in such way such
that the ones with smaller p-values are more likely to appear first. In
order for the randomisation of the p-values to be reproducible, it is
necessary to set a seed (via the set.seed
function) before
calling the wrapper function.
Otherwise, the other algorithms will take in the p-values in the original order of the data.
For each hypothesis test, the functions calculate the adjusted
significance thresholds (alphai
) at which the corresponding
p-value would be declared statistically significant.
Also calculated is an indicator function of discoveries
(R
), where R[i] = 1
corresponds to hypothesis
i being rejected, otherwise R[i] = 0
.
A dataframe is returned with the original data and the newly
calculated alphai
and R
.
onlineFDR
Exploratively
This package (and the corresponding Shiny
app) can be used in an exploratory way post-hoc. If you have a
dataset of p-values for a series of experiments that have completed, you
can use the algorithms provided in onlineFDR
to explore how
you could control the FDR and how the different algorithms have
different levels of power.
First, we initialize a toy dataset with three columns: an identifier (‘id’), date (‘date’) and p-value (‘pval’). Note that the date should be in the format “YYYY-MM-DD”.
sample.df <- data.frame(
id = c('A15432', 'B90969', 'C18705', 'B49731', 'E99902',
'C38292', 'A30619', 'D46627', 'E29198', 'A41418',
'D51456', 'C88669', 'E03673', 'A63155', 'B66033'),
date = as.Date(c(rep("2014-12-01",3),
rep("2015-09-21",5),
rep("2016-05-19",2),
"2016-11-12",
rep("2017-03-27",4))),
pval = c(2.90e-14, 0.00143, 0.06514, 0.00174, 0.00171,
3.61e-05, 0.79149, 0.27201, 0.28295, 7.59e-08,
0.69274, 0.30443, 0.000487, 0.72342, 0.54757))
Next, we call our algorithm of interest. Note that we also set a seed
using the set.seed
function in order for the results to be
reproducible.
library(onlineFDR)
set.seed(1)
LOND_results <- LOND(sample.df)
LOND_results
#> pval alphai R
#> 1 2.9000e-14 0.0026758385 1
#> 2 1.4300e-03 0.0011638206 0
#> 3 6.5140e-02 0.0009912499 0
#> 4 1.7400e-03 0.0008243606 0
#> 5 1.7100e-03 0.0006988870 0
#> 6 2.7201e-01 0.0006045900 0
#> 7 3.6100e-05 0.0005319444 1
#> 8 7.9149e-01 0.0007117838 0
#> 9 7.5900e-08 0.0006421423 1
#> 10 2.8295e-01 0.0007796504 0
#> 11 6.9274e-01 0.0007155186 0
#> 12 7.2342e-01 0.0006610273 0
#> 13 3.0443e-01 0.0006141682 0
#> 14 5.4757e-01 0.0005734509 0
#> 15 4.8700e-04 0.0005377472 1
To check how many hypotheses we’ve rejected, we can do:
sum(LOND_results$R)
#> [1] 4
To compare the results of one algorithm to another, we can visualize the adjusted significance thresholds:
set.seed(1)
LORD_results <- LORD(sample.df)
set.seed(1)
Bonf_results <- Alpha_spending(sample.df) # Bonferroni-like test
x <- seq_len(nrow(LOND_results))
par(mar=c(5.1, 4.1, 4.1, 9.1))
plot(x, log(LOND_results$alphai), ylim = c(-9.5, -2.5), type = 'l',
col = "green", xlab = "Index", ylab = "log(alphai)", panel.first = grid())
lines(x, log(LORD_results$alphai), col = "blue") # LORD
lines(x, log(Bonf_results$alphai), col = "red") # Bonferroni-like test
lines(x, rep(log(0.05),length(x)), col = "purple") # Unadjusted
legend("right", legend = c("Unadjusted", "Bonferroni", "LORD", "LOND"),
col = c("purple", "red", "blue", "green"), lty = rep(1,4),
inset = c(-0.35,0), xpd = TRUE)
Note that both LOND and LORD result in higher significance thresholds (alpha_i) than a Bonferroni adjustment. When alphai jumps, that indicates that the algorithm is recovering some of its “alpha wealth” when it makes a discovery. You can see how if the algorithm does not discover anything over time, its alpha wealth decreases (the alphai will monotonically decrease), and it becomes harder to reject a null hypothesis since the significance threshold gets smaller and smaller.
onlineFDR
over time
This package can be used over time as your dataset grows. In order for the randomisation of the data within the previous batches to remain the same (and hence to allow for reproducibility of the results), the same seed should be used for all analyses. Ideally, you will have selected your algorithm a priori based on your needs (click here. You can pass your growing dataset to the same algorithm.
# Initial experimental data
sample.df <- data.frame(
id = c('A15432', 'B90969', 'C18705'),
date = as.Date(c(rep("2014-12-01",3))),
pval = c(2.90e-14, 0.06743, 0.01514))
set.seed(1)
LOND_results <- LOND(sample.df)
# After you've completed more experiments
sample.df <- data.frame(
id = c('A15432', 'B90969', 'C18705', 'B49731', 'E99902',
'C38292', 'A30619', 'D46627', 'E29198', 'A41418',
'D51456', 'C88669', 'E03673', 'A63155', 'B66033'),
date = as.Date(c(rep("2014-12-01",3),
rep("2015-09-21",5),
rep("2016-05-19",2),
"2016-11-12",
rep("2017-03-27",4))),
pval = c(2.90e-14, 0.06743, 0.01514, 0.08174, 0.00171,
3.61e-05, 0.79149, 0.27201, 0.28295, 7.59e-08,
0.69274, 0.30443, 0.000487, 0.72342, 0.54757))
set.seed(1)
LOND_results <- LOND(sample.df)
This section covers some more use cases for more “advanced”
onlineFDR
users.
If your p-values came from hypothesis tests that were performed in
batches, you might consider using the batch algorithms:
BatchPRDS()
, BatchBH()
, and
BatchStBH()
.
sample.df <- data.frame(
id = c('A15432', 'B90969', 'C18705', 'B49731', 'E99902',
'C38292', 'A30619', 'D46627', 'E29198', 'A41418',
'D51456', 'C88669', 'E03673', 'A63155', 'B66033'),
pval = c(2.90e-08, 0.06743, 0.01514, 0.08174, 0.00171,
3.60e-05, 0.79149, 0.27201, 0.28295, 7.59e-08,
0.69274, 0.30443, 0.00136, 0.72342, 0.54757),
batch = c(rep(1,5), rep(2,6), rep(3,4)))
batchprds_results <- BatchPRDS(sample.df)
In the cases where you a priori expect a certain
number of hypothesis tests, you can set a bound. Note that the bounds
for LOND and LORDdep depend on alpha, so ensure that the alpha value
used for the bound is the same alpha value used for the algorithm.
Supply your bound to either the betai
or
gammai
argument in your chosen algorithm.
sample.df <- data.frame(
id = c('A15432', 'B90969', 'C18705', 'B49731', 'E99902',
'C38292', 'A30619', 'D46627', 'E29198', 'A41418',
'D51456', 'C88669', 'E03673', 'A63155', 'B66033'),
date = as.Date(c(rep("2014-12-01",3),
rep("2015-09-21",5),
rep("2016-05-19",2),
"2016-11-12",
rep("2017-03-27",4))),
pval = c(2.90e-14, 0.06743, 0.01514, 0.08174, 0.00171,
3.61e-05, 0.79149, 0.27201, 0.28295, 7.59e-08,
0.69274, 0.30443, 0.000487, 0.72342, 0.54757))
# Assuming a bound of 20 hypotheses
bound <- setBound("LOND", alpha = 0.04, 20)
set.seed(1)
LOND_results <- LOND(sample.df, alpha = 0.04, betai = bound)
LOND()
implements the LOND procedure for online FDR
control, where LOND stands for (significance) Levels based On Number of
Discoveries, as presented by Javanmard and Montanari (2015). The
procedure controls the FDR for independent or positively dependent
(PRDS) p-values, with an option (dep = TRUE)
which
guarantees control for arbitrarily dependent p-values.
LORD()
implements the LORD procedure for online FDR
control, where LORD stands for (significance) Levels based On Recent
Discovery, as presented by Javanmard and Montanari (2018), Ramdas et
al. (2017) and Tian & Ramdas (2019). The function provides
different versions of the procedure valid for independent p-values, see
vignette("theory")
. There is also a version (‘dep’) that
guarantees control for dependent p-values.
SAFFRON()
implements the SAFFRON procedure for
online FDR control, where SAFFRON stands for Serial estimate of the
Alpha Fraction that is Futilely Rationed On true Null hypotheses, as
presented by Ramdas et al. (2018). The procedure provides an
adaptive method of online FDR control.
Alpha_investing()
Implements a variant of the
Alpha-investing algorithm of Foster and Stine (2008) that guarantees FDR
control, as proposed by Ramdas et al. (2018). This procedure uses a
variant of SAFFRON’s update rule. This procedure controls the FDR for
independent p-values.
ADDIS()
implements the ADDIS algorithm for online
FDR control, where ADDIS stands for an ADaptive algorithm that DIScards
conservative nulls, as presented by Tian & Ramdas (2019). The
algorithm compensates for the power loss of SAFFRON with conservative
nulls, by including both adaptivity in the fraction of null hypotheses
(like SAFFRON) and the conservativeness of nulls (unlike SAFFRON). This
procedure controls the FDR for independent p-values.
BatchPRDS()
implements the BatchPRDS algorithm for
online FDR control, where PRDS stands for positive regression dependency
on a subset, as presented by Zrnic et al. (2020). The BatchPRDS
algorithm controls the FDR when the p-values in one batch are positively
dependent, and independent across batches, by running the
Benjamini-Hochberg procedure on each batch.
BatchBH()
implements the BatchBH algorithm for
online FDR control, as presented by Zrnic et al. (2020). The BatchBH
algorithm controls the FDR when the p-values in a batch are independent,
and independent across batches, by running the Benjamini-Hochberg
procedure on each batch.
BatchStBH()
implements the BatchSt-BH algorithm for
online FDR control, as presented by Zrnic et al. (2020). This algorithm
makes one modification to the original Storey-BH algorithm (Storey
2002), by adding 1 to the numerator of the null proportion estimate for
more stable results. The BatchSt-BH algorithm controls the FDR when the
p-values in a batch are independent, and independent across batches, by
running the Storey Benjamini-Hochberg procedure on each batch.
LONDstar()
implements the LOND algorithm for
asynchronous online testing, as presented by Zrnic et al.
(2021). This controls the mFDR.
LORDstar()
implements LORD algorithms for
asynchronous online testing, as presented by Zrnic et al.
(2021). This controls the mFDR.
SAFFRONstar()
implements the SAFFRON algorithm for
asynchronous online testing, as presented by Zrnic et al.
(2021). This controls the mFDR.
Alpha_spending()
implements online FWER control
using a Bonferroni-like test. Alpha-spending provides strong FWER
control for arbitrarily dependent p-values.
online_fallback()
implements the online fallback
algorithm for FWER control, as proposed by Tian & Ramdas (2021).
Online fallback is a uniformly more powerful method than Alpha-spending,
as it saves the significance level of a previous rejection. Online
fallback strongly controls the FWER for arbitrarily dependent
p-values.
ADDIS_spending()
implements the ADDIS-spending
algorithm for online FWER control, as proposed by Tian & Ramdas
(2021). The algorithm compensates for the power loss of Alpha-spending,
by including both adaptivity in the fraction of null hypotheses and the
conservativeness of nulls. ADDIS-spending provides strong FWER control
for independent p-values. Tian & Ramdas (2021) also presented a
version for handling local dependence.
All questions regarding onlineFDR should be posted to the Bioconductor support site, which serves as a searchable knowledge base of questions and answers:
https://support.bioconductor.org
Posting a question and tagging with “onlineFDR” will automatically send an alert to the package authors to respond on the support site.
We would like to thank the IMPC team (via Jeremy Mason and Hamed Haseli Mashhadi) for useful discussions during the development of the package.
Aharoni, E. and Rosset, S. (2014). Generalized \(\alpha\)-investing: definitions, optimality results and applications to public databases. Journal of the Royal Statistical Society (Series B), 76(4):771–794.
Benjamini, Y., and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165-1188.
Bourgon, R., Gentleman, R., and Huber, W. (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences, 107(21), 9546-9551.
Foster, D. and Stine R. (2008). \(\alpha\)-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society (Series B), 29(4):429-444.
Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2.8:e124.
Javanmard, A., and Montanari, A. (2015). On Online Control of False Discovery Rate. arXiv preprint, https://arxiv.org/abs/1502.06197.
Javanmard, A., and Montanari, A. (2018). Online Rules for Control of False Discovery Rate and False Discovery Exceedance. Annals of Statistics, 46(2):526-554.
Koscielny, G., et al. (2013). The International Mouse Phenotyping Consortium Web Portal, a unified point of access for knockout mice and related phenotyping data. Nucleic Acids Research, 42.D1:D802-D809.
Li, A., and Barber, F.G. (2017). Accumulation Tests for FDR Control in Ordered Hypothesis Testing. Journal of the American Statistical Association, 112(518):837-849.
Ramdas, A., Yang, F., Wainwright M.J. and Jordan, M.I. (2017). Online control of the false discovery rate with decaying memory. Advances in Neural Information Processing Systems 30, 5650-5659.
Ramdas, A., Zrnic, T., Wainwright M.J. and Jordan, M.I. (2018). SAFFRON: an adaptive algorithm for online control of the false discovery rate. Proceedings of the 35th International Conference in Machine Learning, 80:4286-4294.
Robertson, D.S. and Wason, J.M.S. (2018). Online control of the false discovery rate in biomedical research. arXiv preprint, https://arxiv.org/abs/1809.07292.
Robertson, D.S., Wason, J.M.S. and Ramdas, A. (2022). Online multiple hypothesis testing for reproducible research. arXiv preprint, https://arxiv.org/abs/2208.11418.
Robertson, D.S., Wildenhain, J., Javanmard, A. and Karp, N.A. (2019). Online control of the false discovery rate in biomedical research. Bioinformatics, 35:4196-4199, https://doi.org/10.1093/bioinformatics/btz191.
Storey, J. D. (2002). A direct approach to false discovery rates. JRSS B, 64(3):479–498.
Tian, J. and Ramdas, A. (2019). ADDIS: an adaptive discarding algorithm for online FDR control with conservative nulls. Advances in Neural Information Processing Systems, 32.
Tian, J. and Ramdas, A. (2021). Online control of the familywise error rate. Statistical Methods in Medical Research, 30(4):976–993.
Zrnic, T., Jiang, D., Ramdas, A. and Jordan, M. (2020). The power of batching in multiple hypothesis testing. International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, PMLR, 108:3806-3815.
Zrnic, T., Ramdas, A. and Jordan, M.I. (2021). Asynchronous Online Testing of Multiple Hypotheses. JMLR, 22:1-33.