Bioinformatics projects for internship, master thesis, or bachelor thesis in the Sonnhammer group:
Projects are offered in the area of protein function prediction and analysis, mainly within three themes:
- Gene network construction and analysis
- Orthology identification and analysis
- Protein domain architecture evolution and analysis
Specific project descriptions are listed here:
- Gene regulatory network inference from single cell perturbations.
The Perturb-seq technique combined with CRISPRi or CRISPRa can generate large-scale perturbation-based expression data at single-cell resolution. Such data is cleaner than bulk data but has the drawback of being fragmented. The primary goal of the project is to develop a simulator for such single-cell data based on experimental data properties, in order to explore which processing techniques produce optimal GRN inference. A further goal is to apply this protocol to real data to derive GRNs for different cell types. The project will involve script writing and analysis of results.
- Gene regulatory network inference with prior information.
The goal is to use prior information on gene regulation from sources such as TRRUST, RegNetwork, or ChIP-seq data as weights in the LASSO penalty during GRN inference. The increase in accuracy will be measured in a cross-validation procedure.
The project will involve script writing and analysis of results.
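As an illustration of how prior information could enter the LASSO penalty, here is a minimal sketch in plain Python. It is not the group's implementation. It assumes the objective (1/2n)*||y - Xb||^2 + lam * sum_j w_j*|b_j|, where a smaller penalty weight w_j marks a regulator-target link with stronger prior support. That weighting direction, the function names, and the toy data are all assumptions made for this example.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, penalty_weights, lam, n_iter=200):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam * sum_j w_j * |b_j|.

    X: list of n rows, each holding p regulator expression values.
    y: list of n expression values of one target gene.
    penalty_weights: p per-regulator weights (hypothetical convention:
        smaller weight = stronger prior support for that link).
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    residual = list(y)  # equals y - X b, with b = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of regulator j with the residual that excludes j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * b[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * penalty_weights[j]) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


# Toy data (made up): the target depends strongly on regulator 0
# and weakly on regulator 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0 * r[0] + 0.3 * r[1] for r in X]

b_uniform = weighted_lasso(X, y, [1.0, 1.0], lam=0.4)   # weak link is dropped
b_prior = weighted_lasso(X, y, [1.0, 0.25], lam=0.4)    # weak link is kept
```

With uniform weights the weak link to regulator 1 is shrunk to zero, whereas lowering its penalty weight, as a prior from TRRUST or ChIP-seq might justify, retains it. How well such a weighting works on real data is what the cross-validation step would have to show.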
- Layering directionality onto functional association networks
This project aims to layer directed functional associations onto the FunCoup networks, either as inferred (via training) or known links. Examples of such associations are signalling, modification, regulation, or enzymatic cascades taken from KEGG or Reactome pathway maps, as well as from other sources such as high-throughput phosphorylation data. Once networks enhanced with directed links have been produced for the species with suitable data, various analyses will be performed and the networks will be used for augmented pathway enrichment analysis.
The project will involve script writing and analysis of results.
- Network-based identification of new disease genes.
The goal is to use the FunCoup network to identify genes not previously associated with a disease but with an enrichment of network links to known disease genes (MaxLink algorithm) or found as connectors with the TOPAS algorithm. Such enhanced disease modules will be analyzed in model organisms in order to identify particularly interesting new disease gene candidates. The project will involve script writing and analysis of results.
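The idea of ranking candidate genes by an enrichment of links to known disease genes can be sketched as follows. This is a simplified illustration, not the MaxLink implementation: the hypergeometric test, the `alpha` cut-off, the function names, and the toy network are assumptions made for this example.

```python
from collections import defaultdict
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / comb(N, n)


def rank_candidates(network, disease_genes, alpha=0.05):
    """Rank non-disease genes by their number of links to disease genes.

    network: dict mapping each gene to the set of its neighbours
        (undirected, no self-links).
    Returns (gene, links_to_disease_genes, p_value) tuples for genes whose
    link count is significantly enriched, most links first.
    """
    all_genes = set(network)
    seeds = set(disease_genes) & all_genes
    results = []
    for gene in all_genes - seeds:
        neighbours = network[gene]
        k = len(neighbours & seeds)
        if k == 0:
            continue
        # Every other gene in the network is a potential neighbour.
        p = hypergeom_sf(k, len(all_genes) - 1, len(seeds), len(neighbours))
        if p <= alpha:
            results.append((gene, k, p))
    return sorted(results, key=lambda r: (-r[1], r[2], r[0]))


# Toy network (made up): D* are known disease genes, C* are candidates.
edges = [("C1", "D1"), ("C1", "D2"), ("C1", "D3"), ("D1", "D2"),
         ("C2", "D1"), ("C2", "U1"), ("C2", "U2"), ("C2", "U3"),
         ("U4", "U5"), ("U6", "U7"), ("U8", "U1")]
network = defaultdict(set)
for a, b in edges:
    network[a].add(b)
    network[b].add(a)

candidates = rank_candidates(network, ["D1", "D2", "D3"])
```

In this toy network only C1 passes the filter, since all three of its links go to disease genes, while C2 has a single such link among four. A real analysis on FunCoup would also need to handle link confidence scores and multiple testing, which this sketch ignores.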
- Benchmarking the next generation of homology search tools
Recently, ultrafast homology search tools like MMseqs2 and DIAMOND have become popular, as they claim to combine much higher speed than BLAST with the same sensitivity. The goal of the project is to use a previously developed benchmark to evaluate the sensitivity and speed of these new tools. The project will involve script writing and analysis of results.
- Modules in gene regulatory networks
The aim of the project is to find biologically relevant or functional modules in Gene Regulatory Networks (GRNs) inferred from cancer cell line knockdown data. This will be done by applying module detection methods to large GRNs in order to identify submodules. The modules found will be characterized and evaluated for reproducibility and functional features. The project will involve script writing and analysis of results.
Previous projects:
- Network-based drug repurposing.
The goal is to use the FunCoup network to identify drugs that are likely to be repurposable for diseases other than those they are approved for. We also want to define network modules and analyze the number of distinct modules that a drug repurposing candidate targets. See Scientific Reports, 11:20687 (2021) by a previous project student.
The project will involve script writing and analysis of results.
- Fast orthology analysis
Orthology inference in InParanoid and Hieranoid uses the BLAST tool. Recently, much faster homology search tools have become available. The goal is to speed up these orthology inference algorithms while retaining the same accuracy. An additional aim is to make the 2-pass homology search strategy more efficient. The project will involve script writing (mostly Perl) and analysis of results.
- Differential network analysis
The project aims to identify modules (sets of tightly interconnected genes) in gene networks of prostate cancer. Further, modular differences will be studied in healthy and cancerous regions of tissue samples to reveal changed genes and pathways. The project will involve script writing and analysis of results.
- Disease module detection and analysis
- HiPathway - discovery of novel pathways from high-throughput data
The goal of this project is to use high-throughput omics data to derive groups of proteins with a coherent function and to map these to known pathways. This will reveal which pathways are rediscoverable and also provide novel protein sets that represent novel pathways. There are several databases and methods that can be used, and the goal is to apply several of them and compare the results. The project will involve script writing and analysis of results.
- Domain orthology
Orthology in InParanoid, Hieranoid, and most other ortholog databases is defined on the whole-protein level. The goal is to explore the difference when orthology is defined on the domain level, and to establish rules for when domain orthology can give an advantage. The project will involve script writing and analysis of results.
- Benchmarking of pathway analysis methods
BinoX is a new tool for measuring crosstalk between gene sets, particularly aimed at pathway annotation. The goal is to benchmark BinoX and other similar tools such as NEAT, NEA+, and LEGO, possibly also in combination with clustering of the gene sets. The project will involve script writing and analysis of results.
- Ultrafast sequence clustering for Pfam-B
Pfam is a comprehensive database of annotated protein domains, known as Pfam-A domains. Between these Pfam-A domains are long stretches of unclassified sequence. In the past, homology-based methods were used to cluster these stretches into Pfam-B domains, but with the huge growth of the sequence databases this approach is no longer feasible. Instead, we want to employ ultrafast alignment-free methods to produce an approximate but feasible sequence clustering. The project will involve exploration of alignment-free algorithms and packages, programming, script writing, and analysis of results.
- Development of an interactive website to explore protein networks
The FunCoup network database is a vast resource of functional coupling between proteins and genes. The goal of the project is to develop a new network viewer with modern web technologies that is integrated with the FunCoup database and can make full use of its features. The project will mainly involve programming.
- Pathway crosstalk enrichment visualisation
The PathwAX website provides online pathway annotation based on crosstalk derived through FunCoup's genome-wide functional association networks. The goal is to develop a network visualisation tool that will show the crosstalk between a given query gene set and a given pathway. The project will mainly involve programming.
- Development of an online Hieranoid database.
- A tool for analysing protein domain architecture evolution.
The goal is to develop a web-based tool that can display evolutionary trees of proteins and their domain architecture. This is very useful for understanding how domains have been shuffled during evolution. The tool should either be written in Java and integrated in PfamAlyzer or be implemented using JavaScript.
- Protein domain architecture evolution
- Benchmarking the next generation of homology inference tools
- Adaptive evolution in birds
- Interactive website to explore protein networks