Abstract

In modern science, data collection often precedes the careful specification of hypotheses. Large datasets are mined by testing a large number of possible hypotheses, with the goal of identifying those that hold promise for follow-up. In this framework, controlling the false discovery rate (FDR) is an appropriate criterion to avoid investing time and resources in nonviable leads. In many contexts, the initial large collection of explored hypotheses is somewhat redundant: in an effort to maximize power, the same scientific statement can be probed with a number of related hypotheses. For example, the association between a phenotype and one genetic locus can be investigated by exploring the association between the phenotype and many genetic variants in the locus. Once the first pass through the data is complete and it is time to take stock of the identified scientific leads, however, this redundancy is corrected: in the example above, rather than reporting all variants associated with a phenotype, scientists routinely report only a "lead" variant, selected to represent the entire locus. Because the false discovery proportion is defined with reference to the total set of discoveries, these subsets of discoveries identified post hoc are not equipped with guarantees of FDR control. To overcome this problem, we note that if the criterion by which discoveries will be filtered can be specified in advance, it is possible to modify the Benjamini-Hochberg procedure so that it yields a focused set of discoveries with FDR guarantees. Our framework allows not only subsetting of discoveries, but also their prioritization with weights reflecting the extent to which they provide insight into distinct scientific questions. We illustrate our methodology with examples from gene set enrichment on the Gene Ontology, a collection of hypotheses organized in a directed acyclic graph.
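
The contrast between filtering after the fact and accounting for the filter in advance can be illustrated with a small sketch. The abstract does not spell out the modified procedure, so everything below is an assumption made for illustration: `benjamini_hochberg` is the standard procedure, while `filtered_bh`, `lead_per_locus`, and the toy p-values and locus labels are hypothetical. The sketch simply counts only the discoveries that survive the filter when estimating the false discovery proportion at each candidate threshold; it is not claimed to be the speakers' method or to carry their guarantees.

```python
def benjamini_hochberg(pvals, q):
    """Standard BH: indices of the hypotheses rejected at target level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank  # largest rank whose p-value falls under the BH line
    return sorted(order[:k])


def lead_per_locus(locus):
    """Hypothetical filter: keep only the smallest p-value in each locus."""
    def filter_fn(rejected, pvals):
        leads = {}
        for i in rejected:
            g = locus[i]
            if g not in leads or pvals[i] < pvals[leads[g]]:
                leads[g] = i
        return list(leads.values())
    return filter_fn


def filtered_bh(pvals, q, filter_fn):
    """Assumed filter-aware variant (illustration only).

    For each candidate threshold t, estimate the false discovery
    proportion as m * t / (number of discoveries left after filtering)
    and keep the filtered set at the largest t where the estimate is <= q.
    """
    m = len(pvals)
    best = []
    for t in sorted(set(pvals)):  # ascending, so the last success wins
        rejected = [i for i in range(m) if pvals[i] <= t]
        kept = filter_fn(rejected, pvals)
        if kept and m * t / len(kept) <= q:
            best = kept
    return sorted(best)


# Toy data (made up): three variants in locus A, one in B, two in C.
pvals = [0.001, 0.002, 0.003, 0.04, 0.6, 0.8]
locus = ["A", "A", "A", "B", "C", "C"]
lead_filter = lead_per_locus(locus)

bh_set = benjamini_hochberg(pvals, q=0.1)
post_hoc_leads = sorted(lead_filter(bh_set, pvals))
focused_leads = filtered_bh(pvals, q=0.1, filter_fn=lead_filter)
```

On this toy input, standard BH rejects the first four hypotheses and post hoc filtering reports the leads of loci A and B, whereas the filter-aware sketch reports only the lead of locus A: once the three redundant variants in A count as a single discovery, the evidence for B no longer clears the threshold. With a filter that keeps everything, `filtered_bh` returns the same set as `benjamini_hochberg` on this input.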

The talk will be based on joint work with Eugene Katsevich (Stanford) and Marina Bogomolov (Technion).

Video Recording