Commit 345c5d50 authored by Nicasia Beebe-Wang

updated example plot

parents de221952 01650852
@@ -4,7 +4,7 @@
##### Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle
###### * Equal contribution
Although biological pathways are essential for interpreting results from computational biology studies, the growing number of pathway databases makes it difficult to perform pathway analysis. Our study seeks to reconcile pathways from different databases and reduce pathway redundancy by revealing informative groups with distinct biological functions. Uniquely applying the Louvain community detection algorithm to a network of 4,847 pathways from KEGG, REACTOME and Gene Ontology databases, we identify 35 distinct communities of pathways and show that these communities are consistent with expert-curated pathway categories. Further, we develop an algorithm to automatically annotate each community based on member pathways’ names. By learning informative categories, we progress towards a tool that computational biologists can use to more efficiently interpret their biological findings.
Although knowledge of biological pathways is essential for interpreting results from computational biology studies, the growing number of pathway databases complicates efforts to efficiently perform pathway analysis. Our study seeks to reconcile pathways from different databases and reduce pathway redundancy by revealing informative groups with distinct biological functions. Uniquely applying the Louvain community detection algorithm to a network of 4,847 pathways from KEGG, REACTOME and Gene Ontology databases, we identify 35 distinct communities of pathways and show that these communities are consistent with expert-curated pathway categories. Further, we develop an algorithm to automatically annotate each community based on member pathways’ names. By learning informative categories, we help computational biologists more efficiently interpret their biological findings and lay important groundwork for developing an interactive pathway community query tool to further enrich pathway analysis.
<img align="center" src="Concept_Figure.jpg" width="50%">.
@@ -12,33 +12,30 @@ Although biological pathways are essential for interpreting results from computa
### Resources
**Community_Members.xlsx** and **Community_Members.csv** files contain the list of all pathways included in each of the 35 pathway communities we learned.
**Community_kmers.xlsx** and **Community_kmers.csv** files contain the list of k-mers for each pathway community, along with the number of occurrences and hubness of each pathway (within the community's subgraph).
- **Community_Members.xlsx** and **Community_Members.csv** files contain the list of all pathways included in each of the 35 pathway communities we learned.
- **Community_kmers.xlsx** and **Community_kmers.csv** files contain the list of k-mers for each pathway community, along with the number of occurrences and hubness of each pathway (within the community's subgraph).
---
### Pipeline
**pathways_raw** folder contains raw pathway files from MSigDB.
**curated_hierarchies** folder contains the hierarchies and high-level categories for KEGG, REACTOME, and GO databases.
**pipeline** folder contains all the scripts in our pipeline.
- **pathways_raw** folder contains raw pathway files from MSigDB.
- **pipeline** folder contains all the scripts in our pipeline.
#### 1. Generating adjacency matrices and curated hierarchies
**gmts_to_adj_matrices.py** and **gmts_to_adj_matrices_offdiagonal.py** define adjacency matrices for the MSigDB pathways by measuring the pairwise similarities between pathways from each database and across databases, respectively.
**save_true_labels.ipynb** script records the true curated labels for each pathway category.
**adj_matrices** and **curated_labels** folders contain the pathway adjacency matrices and curated labels for each pathway category, respectively.
- **gmts_to_adj_matrices.py** and **gmts_to_adj_matrices_offdiagonal.py** define adjacency matrices for the MSigDB pathways by measuring the pairwise similarities between pathways from each database and across databases, respectively.
- **adj_matrices** and **curated_labels** folders contain the pathway adjacency matrices and curated labels for each pathway category, respectively.
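
As a rough illustration of the adjacency step, the sketch below parses GMT-formatted lines (pathway name, description, then gene symbols, tab-separated) and scores each pathway pair by the overlap of their gene sets. The Jaccard index is an assumption made for this example; the similarity measure actually used in `gmts_to_adj_matrices.py` is not stated here. The function names, pathway names, and genes are all hypothetical.

```python
from itertools import combinations


def parse_gmt(lines):
    """Parse GMT lines: name <TAB> description <TAB> gene1 <TAB> gene2 ..."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        pathways[fields[0]] = set(fields[2:])
    return pathways


def jaccard_adjacency(pathways):
    """Return {(name_a, name_b): similarity} for every pair sharing a gene.

    Jaccard similarity is an assumption for this sketch, not necessarily
    the measure used by the pipeline scripts.
    """
    adjacency = {}
    for (a, genes_a), (b, genes_b) in combinations(sorted(pathways.items()), 2):
        shared = len(genes_a & genes_b)
        if shared:
            adjacency[(a, b)] = shared / len(genes_a | genes_b)
    return adjacency


# Toy input with hypothetical pathway names and gene sets.
example = [
    "PATHWAY_A\tna\tTP53\tMDM2\tATM",
    "PATHWAY_B\tna\tTP53\tATM\tCHEK2",
    "PATHWAY_C\tna\tINS\tIGF1",
]
adj = jaccard_adjacency(parse_gmt(example))
print(adj)  # {('PATHWAY_A', 'PATHWAY_B'): 0.5}
```

Pairs with no shared genes are left out, so the result is a sparse edge list rather than a dense matrix.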
#### 2. Comparison of clustering algorithms
**algorithm_helpers.py** is the helper script to run various clustering and community detection algorithms.
**CNM_networkx.py** is a modified version of the CNM algorithm (which allows us to select the number of communities to generate) originally from [NetworkX]( https://networkx.github.io/documentation/stable/_modules/networkx/algorithms/community/modularity_max.html).
**select_resolution_for_Louvain.ipynb** script selects the best resolution for the Louvain algorithm for each database category.
**comparison_of_algorithms.ipynb** script executes all the clustering algorithms and compares them across all pathway databases.
- **algorithm_helpers.py** is the helper script to run various clustering and community detection algorithms.
- **CNM_networkx.py** is a modified version of the CNM algorithm (which allows us to select the number of communities to generate) originally from [NetworkX]( https://networkx.github.io/documentation/stable/_modules/networkx/algorithms/community/modularity_max.html).
- **select_resolution_for_Louvain.ipynb** script selects the best resolution for the Louvain algorithm for each database category.
- **comparison_of_algorithms.ipynb** script executes all the clustering algorithms and compares them across all pathway databases.
#### 3. Generating combined pathway network and learning communities
**combined_graph_louvain_with_weights.ipynb** defines the combined pathway network and applies the Louvain algorithm to learn pathway communities.
**Full_graph_louvain_with_weights_community_labels** includes the community labels learned using different resolutions.
- **combined_graph_louvain_with_weights.ipynb** defines the combined pathway network and applies the Louvain algorithm to learn pathway communities.
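
The Louvain algorithm greedily searches for a partition that maximizes modularity, so it can help to see what that score measures. The sketch below is not the notebook's code: it only computes weighted modularity for a given partition, using the common formulation with a resolution parameter (here, a larger `resolution` penalizes large communities more heavily; libraries differ in their convention). The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Weighted modularity of a partition of an undirected graph.

    edges: iterable of (u, v, weight), each edge listed once, no self-loops.
    membership: dict mapping node -> community id.
    """
    total = 0.0                    # total edge weight m
    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree per community
    for u, v, w in edges:
        total += w
        degree[membership[u]] += w
        degree[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    return sum(
        internal[c] / total - resolution * (degree[c] / (2 * total)) ** 2
        for c in degree
    )


# Toy graph: two triangles joined by a single bridge edge.
edges = [
    ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
    ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
    ("c", "d", 1.0),
]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(q_split > q_merged)  # True: the two-triangle partition scores higher
```

On this toy graph the partition that separates the two triangles scores higher than placing every node in one community, which is the kind of grouping a modularity-maximizing method favors.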
#### 4. Analysis of combined pathway network
**combined_graph_analyses/community_sizes_and_distributions.ipynb** investigates the size and pathway distribution for each pathway community.
**combined_graph_analyses/curated_category_distributions_clustermaps.ipynb** generates cluster maps showing distributions of curated categories as they relate to our communities.
**combined_graph_analyses/generate_kmer_labels.ipynb** automatically generates labels for each community based on their members' names.
\ No newline at end of file
- **combined_graph_analyses/community_sizes_and_distributions.ipynb** investigates the size and pathway distribution for each pathway community.
- **combined_graph_analyses/curated_category_distributions_clustermaps.ipynb** generates cluster maps showing distributions of curated categories as they relate to our communities.
- **combined_graph_analyses/generate_kmer_labels.ipynb** automatically generates labels for each community based on their members' names.
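
To give a feel for name-based labeling, the sketch below counts word-level k-mers (runs of k consecutive words) across the names of a community's member pathways and returns the most frequent ones. Treating k-mers as word n-grams, splitting names on underscores, and ranking by raw count are assumptions made for this example; the notebook may define and weight k-mers differently (for instance, using hubness). The function names and the member list are hypothetical.

```python
from collections import Counter


def word_kmers(name, k):
    """Split a pathway name into lowercase words and return its word k-mers."""
    words = name.lower().replace("_", " ").split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def top_kmers(names, k=2, n=3):
    """Return the n most common k-mers, counting each k-mer once per pathway."""
    counts = Counter()
    for name in names:
        counts.update(set(word_kmers(name, k)))
    return counts.most_common(n)


# Hypothetical members of one community.
members = [
    "KEGG_DNA_REPLICATION",
    "REACTOME_DNA_REPLICATION",
    "GO_DNA_REPLICATION_INITIATION",
]
label = top_kmers(members, k=2, n=1)
print(label)  # [('dna replication', 3)]
```

Counting each k-mer once per pathway keeps a single long name from dominating the label.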
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
```
%% Cell type:code id: tags:
``` python
# Load the GO cellular component (CC) hierarchy; the first column becomes the index.
HIERARCHY = pd.read_table('GO_CC_Hierarchy_FINAL_ONLY_V7.tsv', index_col=0)
print("All pathways ", HIERARCHY.shape)
# Keep only the rows whose first remaining column is not marked OBSOLETE.
first_column = HIERARCHY.iloc[:, 0].values
nonobsolete_sets = [s for s, name in enumerate(first_column) if 'OBSOLETE' not in name]
HIERARCHY = HIERARCHY.iloc[nonobsolete_sets]
print("All pathways, obsoletes excluded ", HIERARCHY.shape)
# Count the distinct entries in each column (one column per hierarchy layer).
for i in range(HIERARCHY.shape[1]):
    print("Layer ", i + 1)
    print(len(np.unique(HIERARCHY.iloc[:, i].values.astype(str))))
```
%% Cell type:code id: tags:
``` python
# Load the GO molecular function (MF) hierarchy; the first column becomes the index.
HIERARCHY = pd.read_table('GO_MF_Hierarchy_FINAL_ONLY_V7.tsv', index_col=0)
print("All pathways ", HIERARCHY.shape)
# Keep only the rows whose first remaining column is not marked OBSOLETE.
first_column = HIERARCHY.iloc[:, 0].values
nonobsolete_sets = [s for s, name in enumerate(first_column) if 'OBSOLETE' not in name]
HIERARCHY = HIERARCHY.iloc[nonobsolete_sets]
print("All pathways, obsoletes excluded ", HIERARCHY.shape)
# Count the distinct entries in each column (one column per hierarchy layer).
for i in range(HIERARCHY.shape[1]):
    print("Layer ", i + 1)
    print(len(np.unique(HIERARCHY.iloc[:, i].values.astype(str))))
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```