Evolution and Function of the Environmental Protein Sequence Universe (ProteinSpace)
A project sponsored by the Austrian Science Fund (FWF)
Project Number: P27703
Protein sequences are generated in large quantities by DNA sequencing and represent one of the most important reservoirs of molecular biological data. They point to the molecular functions and biological roles of their gene products, because they carry the blueprints of the function and structure of the encoded proteins and reflect their evolutionary relationships. During the last decade, the sequencing of metagenomes directly from environmental samples, without cultivation, has significantly expanded the known protein sequence universe. However, although hundreds of metagenomes have been deeply sequenced and now account for the majority of protein sequences stored in databases, the environmental protein universe is still mainly unstructured and awaits specific utilization in computational biology.
The central aim of this proposal is to investigate the fundamental evolutionary structures behind the environmental protein sequences obtained so far. We will cluster the entire protein sequence universe, including metagenomes, into evolutionarily related families. Building on established concepts such as orthology and protein domains, this project will develop novel clustering methods for large protein networks.
Based on this large-scale evolutionary reconstruction, we will investigate the function of protein families in the environmental protein sequence universe. We will comprehensively determine the relative abundances of protein families in different environments. We expect to discover many associations that will not only link known protein families to specific habitat types but will also establish connections between families of unknown function and the environment. The abundance matrix of protein families in different environments will be further studied to assess how well environmental co-occurrence profiles predict functional interactions between protein families. We expect to develop a novel method that will significantly extend current principles for the prediction of protein interactions.
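The idea of an abundance matrix and co-occurrence profiles can be illustrated with a minimal Python sketch. It is not the method the project proposes to develop: it assumes a toy count matrix (rows are protein families, columns are environments), uses plain Pearson correlation as a stand-in similarity measure, and all names (`relative_abundance`, `cooccurrence_pairs`, `famA`, the threshold of 0.9) are hypothetical.

```python
from math import sqrt


def relative_abundance(counts):
    """Normalize each environment (column) so its family counts sum to 1."""
    n_env = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_env)]
    return [
        [row[j] / totals[j] if totals[j] else 0.0 for j in range(n_env)]
        for row in counts
    ]


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cooccurrence_pairs(counts, names, threshold=0.9):
    """Return family pairs whose abundance profiles correlate at or above threshold."""
    profiles = relative_abundance(counts)
    pairs = []
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            r = pearson(profiles[i], profiles[j])
            if r >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs


# Toy data (invented for illustration): 3 families x 4 environments.
names = ["famA", "famB", "famC"]
counts = [
    [10, 0, 30, 5],   # famA
    [20, 0, 60, 10],  # famB: same habitat preference as famA
    [0, 50, 5, 40],   # famC: prefers the other environments
]
pairs = cooccurrence_pairs(counts, names)
```

In this toy case only `famA` and `famB` are reported as a candidate pair, since their profiles are proportional across all environments. A real analysis would need to handle compositional effects, sampling depth and multiple testing, none of which this sketch addresses.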
In a case study, we will utilize the structured environmental protein sequence universe to investigate the phylogenetic and ecological diversity of the monophyletic PVC superphylum (Planctomycetes, Verrucomicrobia, Chlamydiae, Lentisphaerae, etc.), a bacterial clade with exceptional physiologies and major medical, ecological and biotechnological importance.
Although this proposal is mainly focused on fundamental biological questions, it also comprises broader aspects such as developing novel and universal methods and resources in computational biology as well as improving our knowledge about biotechnologically and medically important bacteria.
Schnitzer D., Flexer A.: The Unbalancing Effect of Hubs on K-medoids Clustering in High-Dimensional Spaces, Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 2015. Also available as TR-2015-01.
Matlab implementations of algorithms developed in a previous project are published as the HUB Toolbox. The full toolbox is available as a .zip file:
- Download: hub_toolbox (November 3, 2015)
The toolbox includes Matlab code for Mutual Proximity (with different distance distribution assumptions), Local Scaling and Shared Nearest Neighbors. Evaluation functions include hubness analysis functions, the Goodman-Kruskal index, KNN evaluation functions and a function to estimate the intrinsic dimensionality. Please refer to the README and the corresponding publication for a detailed description.
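As a rough illustration of what a hubness analysis measures, the following Python sketch computes the k-occurrence of each point (how often it appears among the k nearest neighbors of other points) and the skewness of that distribution, which is commonly used as a hubness indicator. This is not the HUB Toolbox code or its interface; the function names `k_occurrence` and `skewness` and the example data are assumptions made for this sketch.

```python
def k_occurrence(dist, k):
    """Count how often each point is among the k nearest neighbors of the others.

    dist is a full, symmetric distance matrix given as a list of lists.
    """
    n = len(dist)
    counts = [0] * n
    for i in range(n):
        # Rank all other points by their distance to point i.
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for j in others[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment; strongly positive values suggest hubs."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Toy data (invented for illustration): four points on a line.
points = [0.0, 1.0, 2.5, 10.0]
dist = [[abs(a - b) for b in points] for a in points]
occurrences = k_occurrence(dist, k=1)
hubness = skewness(occurrences)
```

With k = 1, the second point is the nearest neighbor of two others while the outlying last point is nobody's nearest neighbor. In high-dimensional data this imbalance tends to become much stronger, which is the effect that rescaling methods such as Mutual Proximity and Local Scaling aim to reduce.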