Evolution and Function of the Environmental Protein Sequence Universe (ProteinSpace)

A project sponsored by the Austrian Science Fund (FWF)
Project Number: P27703

Project lead: Prof. Thomas Rattei, CUBE - Computational Systems Biology, Department of Microbiology and Ecosystem Science, Faculty of Life Sciences, University of Vienna
Partners: Arthur Flexer, OFAI


Protein sequences are generated in large quantities by DNA sequencing and represent one of the most important reservoirs of molecular biological data. They point to the molecular functions and biological roles of their gene products, because they encode the blueprints of protein structure and function and record the evolutionary relationships between proteins. During the last decade, the sequencing of metagenomes directly from environmental samples, without cultivation, has significantly expanded the known protein sequence universe. Although hundreds of metagenomes have been deeply sequenced and now account for the majority of protein sequences stored in databases, the environmental protein universe is still largely unstructured and awaits specific utilization in computational biology.

The central aim of this proposal is to investigate the fundamental evolutionary structures behind the environmental protein sequences obtained so far. We will cluster the entire protein sequence universe, including metagenomes, into evolutionarily related families. Building on established concepts such as orthology and protein domains, this project will develop novel clustering methods for large protein networks.

Based on this large-scale evolutionary reconstruction, we will investigate the function of protein families in the environmental protein sequence universe. We will comprehensively determine the relative abundances of protein families in different environments. We expect to discover many associations that will not only link known protein families to specific habitat types but will also establish connections between families of unknown function and the environment. The abundance matrix of protein families in different environments will be studied further to assess how well environmental co-occurrence profiles predict functional interactions between protein families. We expect to develop a novel method that will significantly extend current principles for the prediction of protein interactions.
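The idea of an abundance matrix and co-occurrence profiles can be illustrated with a minimal sketch. It is not the method developed in this project: the environment names, family names, and counts are invented for illustration, the helper functions are hypothetical, and plain Pearson correlation stands in for whatever association measure is actually used. Only the Python standard library is assumed.

```python
from math import sqrt

def relative_abundance(counts):
    """Turn raw counts {environment: {family: count}} into per-environment fractions."""
    rel = {}
    for env, fam_counts in counts.items():
        total = sum(fam_counts.values())
        # An environment without any counts gets an empty profile.
        rel[env] = {fam: c / total for fam, c in fam_counts.items()} if total else {}
    return rel

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / sqrt(var_x * var_y)

def cooccurrence(rel, fam_a, fam_b):
    """Correlate the abundance profiles of two families across all environments."""
    envs = sorted(rel)
    profile_a = [rel[e].get(fam_a, 0.0) for e in envs]
    profile_b = [rel[e].get(fam_b, 0.0) for e in envs]
    return pearson(profile_a, profile_b)

# Hypothetical toy counts: three environments, three protein families.
toy_counts = {
    "soil":   {"famA": 30, "famB": 60, "famC": 10},
    "marine": {"famA": 10, "famB": 20, "famC": 70},
    "gut":    {"famA": 20, "famB": 40, "famC": 40},
}
toy_rel = relative_abundance(toy_counts)
print(cooccurrence(toy_rel, "famA", "famB"))  # families that rise and fall together
print(cooccurrence(toy_rel, "famA", "famC"))  # families with opposing profiles
```

Because relative abundances within one environment sum to one, correlations between them are not independent of each other, so a real analysis would likely need a measure that accounts for this compositional effect.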

In a case study, we will utilize the structured environmental protein sequence universe to investigate the phylogenetic and ecological diversity of the monophyletic PVC superphylum (Planctomycetes, Verrucomicrobia, Chlamydiae, Lentisphaerae, etc.), a bacterial clade with exceptional physiologies and major medical, ecological and biotechnological importance.

Although this proposal is mainly focused on fundamental biological questions, it also comprises broader aspects such as developing novel and universal methods and resources in computational biology as well as improving our knowledge about biotechnologically and medically important bacteria.


Feldbauer R., Flexer A.: A comprehensive empirical comparison of hubness reduction in high-dimensional spaces, Knowledge and Information Systems, Volume 59, Issue 1, pp. 137–166, 2019. (published online 18th of May, 2018) DOI: https://doi.org/10.1007/s10115-018-1205-y

Feldbauer R., Flexer A.: Centering versus Scaling for Hubness Reduction, in Proceedings of the 25th International Conference on Artificial Neural Networks (ICANN'16), Part I, pp. 175-183, Springer International Publishing, 2016. also available as: TR-2016-05.

Feldbauer R., Flexer A., Rattei T.: Deep learning for extremely fast protein similarity search (abstract), Austrian High Performance Computing Meeting 2019, Grundlsee, Austria, 2019.

Feldbauer R., Flexer A., Rattei T.: Protein vector representations for fast similarity search (abstract), German Conference on Bioinformatics, Vienna, Austria, 2018.

Feldbauer R., Leodolter M., Plant C., Flexer A.: Fast approximate hubness reduction for large high-dimensional data, Proceedings of the IEEE International Conference on Big Knowledge (ICBK), 2018. also available as: TR-2018-02.

Feldbauer R., Rattei T., Flexer A.: scikit-hubness: Hubness Reduction and Approximate Neighbor Search, Journal of Open Source Software, 5(45), 1957, 2020. DOI: https://doi.org/10.21105/joss.01957

Schnitzer D., Flexer A.: The Unbalancing Effect of Hubs on K-medoids Clustering in High-Dimensional Spaces, Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 2015. also available as: TR-2015-01.


scikit-hubness is an easy-to-use, fully scikit-learn compatible Python package for hubness analysis ("Is my data affected by hubness?"), hubness reduction ("How can we improve neighbor retrieval in high dimensions?"), and approximate neighbor search ("Does it work for large data sets?"). The package is aimed at both hubness researchers and practitioners, and is the successor to the HUB Toolboxes.
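To make the question "Is my data affected by hubness?" concrete, the following minimal sketch measures hubness as the skewness of the k-occurrence distribution, where the k-occurrence of a point is the number of times it appears among the k nearest neighbors of other points. It deliberately does not use the scikit-hubness API; the function names are hypothetical, the data are random Gaussian vectors, and only the Python standard library (3.8 or later, for `math.dist`) is assumed.

```python
import math
import random

def k_occurrence(points, k):
    """Count how often each point appears among the k nearest neighbors of the others."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        # Euclidean distances from p to every other point (brute force).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

def skewness(values):
    """Sample skewness: third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def random_points(n, dim, seed):
    """Draw n standard-normal vectors of the given dimension (illustrative data only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Compare a low-dimensional and a high-dimensional data set of the same size.
for dim in (3, 50):
    occ = k_occurrence(random_points(200, dim, seed=0), k=10)
    print(f"dim={dim}: k-occurrence skewness = {skewness(occ):.2f}")
```

A strongly positive skewness indicates that a few points (hubs) appear in very many neighbor lists while many others appear in almost none. The brute-force neighbor search here scales quadratically with the number of points, which is the kind of cost that approximate neighbor search is meant to avoid on large data sets.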

The OFAI HUB Toolbox is developed in the course of this project. It contains functions for hubness analysis, hubness reduction, and implementations of additional algorithms relevant for high-dimensional data. A package for Python3 is available from GitHub and licensed under GPLv3.

Please visit the GitHub page for documentation, installation instructions, usage examples, issue monitoring, etc. While the HUB Toolbox for Python3 is under active development and contains all bleeding-edge functionality, we also provide a MATLAB version with core functionality: