Prioritizing Disease-Associated Genes with Heterogeneous Network Edge Prediction


The first decade of genome-wide association studies (GWAS) has uncovered a wealth of disease-associated variants. Two important next steps are translating this information into a multiscale understanding of pathogenic variants and leveraging existing data, through prioritization, to increase the power of current and future studies. We explore edge prediction on heterogeneous networks (graphs with multiple node and edge types) to accomplish both tasks.

First, we constructed a network from high-throughput, publicly available resources. It has 19 edge types and 18 node types: genes, diseases, tissues, pathophysiologies, and 14 collections from the Molecular Signatures Database (MSigDB). From this network of 40,343 nodes and 1,608,168 edges, we extracted features that describe the topology between specific genes and diseases. Next, we trained a model on GWAS associations and predicted the probability of association between each protein-coding gene and each of 29 well-studied complex diseases. The model, which achieved 132-fold enrichment in precision at 10% recall, outperformed every individual information domain, highlighting the benefit of integrative approaches. We identified pleiotropy, transcriptional signatures of perturbations, pathways, and protein interactions as fundamental mechanisms explaining pathogenesis.
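To make the idea of a topology feature between a gene and a disease concrete, here is a minimal Python sketch on a toy heterogeneous network. It counts the paths that connect a gene to a disease along a fixed sequence of edge types. Everything in it is an assumption for illustration: the `HetNet` class, the node names (`G1`, `D1`, ...), and the edge-type labels are hypothetical, and a raw path count is only one simple way to summarize topology, not necessarily the feature used in this project.

```python
from collections import defaultdict


class HetNet:
    """Toy heterogeneous network: typed nodes joined by typed, undirected edges."""

    def __init__(self):
        self.node_type = {}
        # (node, edge_type) -> set of neighbouring nodes
        self.adj = defaultdict(set)

    def add_node(self, name, kind):
        self.node_type[name] = kind

    def add_edge(self, a, b, edge_type):
        self.adj[(a, edge_type)].add(b)
        self.adj[(b, edge_type)].add(a)

    def count_paths(self, source, target, edge_types):
        """Count paths from source to target that follow edge_types in order.

        A path may not visit the same node twice.
        """
        def walk(node, depth, visited):
            if depth == len(edge_types):
                return 1 if node == target else 0
            total = 0
            # .get avoids creating empty entries in the defaultdict
            for nxt in self.adj.get((node, edge_types[depth]), ()):
                if nxt not in visited:
                    total += walk(nxt, depth + 1, visited | {nxt})
            return total

        return walk(source, 0, {source})


# Hypothetical toy data: three genes, two diseases, two edge types.
net = HetNet()
for gene in ("G1", "G2", "G3"):
    net.add_node(gene, "gene")
for disease in ("D1", "D2"):
    net.add_node(disease, "disease")

net.add_edge("G1", "G2", "interaction")
net.add_edge("G1", "G3", "interaction")
net.add_edge("G2", "D1", "association")
net.add_edge("G3", "D1", "association")
net.add_edge("G3", "D2", "association")

# Gene -interaction- gene -association- disease
via_interaction = net.count_paths("G1", "D1", ["interaction", "association"])

# Gene -association- disease -association- gene -association- disease
via_shared_disease = net.count_paths(
    "G2", "D2", ["association", "association", "association"]
)

print(via_interaction, via_shared_disease)
```

Each distinct sequence of edge types yields one feature per gene-disease pair, so a model trained on known associations can weigh how informative each kind of connection is.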

Publication »


View the results of our analysis. Browse predictions by gene or disease and learn more about the network-based features. Results include context-specific summaries and performance assessments. Applications of interest include:

  • Using the predictions as prior probabilities of association to increase the power of GWAS analysis.
  • Determining candidate causal genes within genomic regions identified by GWAS.
  • Identifying genes of biological interest for a specific disease.
  • Identifying diseases of interest for a specific gene.
  • Exploring a high-confidence, gene-centric translation of the GWAS Catalog.
  • Comparing the informativeness of various information domains for identifying pathogenic variants.
  • Viewing the contribution of each network-based feature composing an association prediction.
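
As an illustration of the first application in the list, a predicted probability can serve as a prior that is updated with evidence from a new study using Bayes' rule on the odds scale: posterior odds equal prior odds times the Bayes factor. The sketch below is generic, not this project's procedure. The function name `posterior_probability` and all numbers are hypothetical, and it assumes a Bayes factor summarizing the GWAS evidence for the gene is already available.

```python
def posterior_probability(prior, bayes_factor):
    """Update a prior probability of association with a Bayes factor.

    prior        -- prior probability of association, strictly between 0 and 1
    bayes_factor -- evidence for association relative to no association (>= 0)
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if bayes_factor < 0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical example: the same GWAS evidence applied to two genes
# with different network-based priors.
evidence = 4.0
well_supported_gene = posterior_probability(0.2, evidence)
poorly_supported_gene = posterior_probability(0.01, evidence)
print(well_supported_gene, poorly_supported_gene)
```

With identical study evidence, the gene with the higher prior ends up with the higher posterior, which is how informative priors can help surface associations that fall short of significance on their own.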

View details »


Download an extensive collection of files related to the project. Available downloads include our predictions, positive set of disease-associated genes, vector graphics, serialized networks, and processed data sources.

View details »


Browse the collection of media relating to the project. Check back here for citation information.

View details »