Evolution and Function of the Environmental Protein Sequence Universe

Protein sequences are generated in large quantities by DNA sequencing and represent one of the most important reservoirs of molecular biological data. They point to the molecular functions and biological roles of their gene products, because they encode the structure and function of the proteins and reflect their evolutionary relationships. During the last decade, the sequencing of metagenomes directly from environmental samples, without cultivation, has significantly expanded the known protein sequence universe. Hundreds of metagenomes have been deeply sequenced and now account for the majority of protein sequences stored in databases. However, the environmental protein universe is still largely unstructured and awaits specific utilization in computational biology.

The central aim of this proposal is to investigate the fundamental evolutionary structures behind the environmental protein sequences obtained so far. We will cluster the entire protein sequence universe, including metagenomes, into evolutionarily related families. Building on established concepts such as orthology and protein domains, this project will develop novel clustering methods for large protein networks.
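To make the idea of clustering a protein network concrete, the following is a minimal sketch, not the project's method: it groups proteins into families by single-linkage clustering (connected components) over pairwise similarity edges. The function name `cluster_families`, the `min_score` cutoff, the protein identifiers and the scores are all assumptions made for illustration.

```python
def cluster_families(edges, min_score=50.0):
    """Group proteins into families via connected components.

    edges: iterable of (protein_a, protein_b, similarity_score).
    min_score: hypothetical cutoff; weaker edges are ignored.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in edges:
        root_a, root_b = find(a), find(b)  # registers both proteins
        if score >= min_score and root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in families.values())


# Toy similarity network (identifiers and scores are invented).
edges = [
    ("p1", "p2", 120.0),
    ("p2", "p3", 80.0),
    ("p3", "p4", 20.0),   # below the cutoff, so it does not link families
    ("p4", "p5", 95.0),
]
families = cluster_families(edges)
print(families)
```

Single linkage is only the simplest baseline; it tends to merge unrelated families through multi-domain proteins, which is one reason concepts such as orthology and protein domains matter for methods at this scale.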

Based on this large-scale evolutionary reconstruction, we will investigate the function of protein families in the environmental protein sequence universe. We will comprehensively determine the relative abundances of protein families in different environments. We expect to discover many associations that will not only link known protein families to specific habitat types but will also establish connections between families of unknown function and the environment. We will further study the resulting abundance matrix of protein families across environments to assess how well environmental co-occurrence profiles predict functional interactions between protein families. We expect to develop a novel method that will significantly extend current principles for the prediction of protein interactions.
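The co-occurrence idea can be sketched as follows, under stated assumptions: each protein family has a vector of raw counts across environments, counts are normalized to relative abundances per environment, and pairs of families whose profiles correlate strongly are flagged as candidate functional partners. The family names, counts, the choice of Pearson correlation and the `threshold` value are illustrative assumptions, not the method the project proposes.

```python
from itertools import combinations
from math import sqrt


def relative_abundance(counts):
    """Normalize each environment (column) so family abundances sum to 1."""
    families = list(counts)
    n_env = len(next(iter(counts.values())))
    totals = [sum(counts[f][j] for f in families) for j in range(n_env)]
    return {
        f: [counts[f][j] / totals[j] if totals[j] else 0.0 for j in range(n_env)]
        for f in families
    }


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cooccurring_pairs(profiles, threshold=0.9):
    """Return family pairs whose profiles correlate at or above threshold."""
    pairs = []
    for f, g in combinations(sorted(profiles), 2):
        r = pearson(profiles[f], profiles[g])
        if r >= threshold:
            pairs.append((f, g, r))
    return pairs


# Invented counts: rows are families, columns are four environments.
counts = {
    "famA": [10, 0, 30, 5],
    "famB": [20, 0, 60, 10],
    "famC": [0, 50, 10, 85],
}
profiles = relative_abundance(counts)
pairs = cooccurring_pairs(profiles)
print(pairs)
```

In this toy matrix only famA and famB are reported, because their abundances rise and fall together across environments. Real data would additionally need to handle the compositional nature of relative abundances and correct for multiple testing.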

In a case study, we will use the structured environmental protein sequence universe to investigate the phylogenetic and ecological diversity of the monophyletic PVC superphylum (Planctomycetes, Verrucomicrobia, Chlamydiae, Lentisphaerae, etc.), a bacterial clade with exceptional physiologies and major medical, ecological and biotechnological importance.

Although this proposal is mainly focused on fundamental biological questions, it also comprises broader aspects such as developing novel and universal methods and resources in computational biology as well as improving our knowledge about biotechnologically and medically important bacteria.

  • Feldbauer R., Flexer A.: A comprehensive empirical comparison of hubness reduction in high-dimensional spaces, Knowledge and Information Systems, Volume 59, Issue 1, pp. 137–166, 2019. (published online 18th of May, 2018) DOI:
  • Feldbauer R., Flexer A.: Centering versus Scaling for Hubness Reduction, in Proceedings of the 25th International Conference on Artificial Neural Networks (ICANN'16), Part I, pp. 175–183, Springer International Publishing, 2016.
  • Feldbauer R., Flexer A., Rattei T.: Deep learning for extremely fast protein similarity search (abstract), Austrian High Performance Computing Meeting 2019, Grundlsee, Austria, 2019.
  • Feldbauer R., Flexer A., Rattei T.: Protein vector representations for fast similarity search (abstract), German Conference on Bioinformatics, Vienna, Austria, 2018.
  • Feldbauer R., Leodolter M., Plant C., Flexer A.: Fast approximate hubness reduction for large high-dimensional data, Proceedings of the IEEE International Conference on Big Knowledge (ICBK), 2018.
  • Feldbauer R., Rattei T., Flexer A.: scikit-hubness: Hubness Reduction and Approximate Neighbor Search, Journal of Open Source Software, 5(45), 1957, 2020. DOI:
  • Schnitzer D., Flexer A.: The Unbalancing Effect of Hubs on K-medoids Clustering in High-Dimensional Spaces, Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 2015.

Research staff