Introduction
Reproducible research is a cornerstone of scientific progress and a core principle of the scientific method itself. Within the last decade, an increasing number of reports have drawn attention to unreproducible results across various scientific disciplines (Begley and Ioannidis, 2015). Reportedly, more than 70 % of researchers have been unable to reproduce the results of another researcher, and more than 50 % have been unable to reproduce their own findings (Baker, 2016).
Since reproducibility is a non-standardized term, it often serves as a vague umbrella term that conflates (i) repeatable analysis, carried out by the same scientist based on given information, (ii) reproducible analysis, carried out by another scientist based on given information, and (iii) replicable analysis, carried out by another scientist with new information.
Reproducibility is therefore not an absolute term but a relative one, depending on the scope in which it is used (Goodman et al., 2016). Goodman et al. (2016) proposed to combine the word reproducibility with its respective scope, thus subdividing reproducibility into “methods reproducibility”, “results reproducibility” and “inferential reproducibility”. According to Goodman et al. (2016), “methods reproducibility refers to the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated”. Results reproducibility builds on methods reproducibility and is given if the same results are obtained from an independent study. Inferential reproducibility is given if qualitatively similar conclusions are drawn from the results of independent studies or from a reanalysis of a given study. Inferential reproducibility therefore constitutes the highest level of reproducibility.
Possible reasons for non-reproducible experiments are broadly discussed and range from study design to publication bias and malpractice (Munafò et al., 2017; Miyakawa, 2020; Baker, 2016). Once published, the results of non-reproducible studies are harmful to science, since the reported results and the associated data assets are likely incorrect.
Low reproducibility also affects the field of enzymology, where varying experimental standards are described as the leading cause (Bell et al., 2021). With the aim of increasing the reproducibility of biocatalytic experiments, the STRENDA commission has defined Standards for Reporting Enzymology Data (STRENDA) (Tipton et al., 2014). The guidelines recommend reporting the identity of the enzyme, enzyme preparation steps, storage conditions, assay conditions and method, as well as enzyme activity with the corresponding kinetic parameters and units. Furthermore, the precision of the measurements and, preferably, the raw data itself should be shared. The STRENDA guidelines thus pave the way for reproducible and therefore reusable functional enzymology data.
Besides being reproducible, data assets must also be available in order to enable the reuse of existing data. This is especially important for the application of big data technologies such as machine learning, which rely on large amounts of high-quality data. In the field of biocatalysis, machine learning bears the potential to investigate how the structure of an enzyme is related to its catalytic properties. From this, valuable conclusions for enzyme engineering might be drawn.
One prerequisite for applying big data technologies, and thus for using their potential in biocatalysis, is the availability of data. With the aim of enhancing data reusability, the FAIR guiding principles for scientific data management were established (Wilkinson et al., 2016). The acronym FAIR denotes four foundational principles for contemporary data management: data should be Findable, Accessible, Interoperable and Reusable. Special emphasis is placed on the ability of machines to find and use data, due to their fundamental importance in modern science. Accordingly, (meta)data should be findable by a globally unique identifier. Data should be accessible by a standardized and open protocol. Interoperability is achieved if data contains qualified references to other (meta)data and uses conclusive vocabulary. For reusability, the data needs to be extensively described with accurate and relevant attributes (Wilkinson et al., 2016). Broad application of these principles will promote the reuse of data and ultimately amplify the knowledge that can be deduced from each published dataset.
To enable data exchange in biocatalysis, the standardized exchange format EnzymeML was developed, abiding by the FAIR data principles as well as the STRENDA guidelines (Pleiss, 2021). An EnzymeML document stores information on reaction conditions, the obtained measurement data, and modeling results. The reaction conditions comprise the pH value, the temperature and the reaction vessel. Each species within a reaction (e.g. substrate, product or inhibitor) is uniquely labeled with an identifier, which allows unequivocal referencing of species. In addition, proteins are labeled with their UniProt ID, and reactants are labeled with their respective SMILES or InChI code.
A special focus of the EnzymeML format is to store the modeling results, that is, the model equations and the kinetic parameters, alongside the measurement data on which the results are based. In summary, data and metadata of biocatalytic experiments can be stored and shared between scientists and databases in compliance with contemporary data management practices.
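The kind of information described above can be illustrated with a small data structure. The following is a minimal sketch in plain Python and not the EnzymeML schema or the API of any EnzymeML software: all class names, field names and example values (`ExperimentRecord`, `Species`, `Measurement`, `KineticModel`, the placeholder UniProt ID and the numbers) are assumptions made for illustration only. It shows how reaction conditions, uniquely identified species, measurement data and modeling results might be kept together in one record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Species:
    """One reaction participant (hypothetical field names)."""
    species_id: str                    # unique identifier within the record
    name: str
    role: str                          # e.g. "protein", "substrate", "product"
    uniprot_id: Optional[str] = None   # external reference for proteins
    smiles: Optional[str] = None       # external reference for reactants
    inchi: Optional[str] = None


@dataclass
class Measurement:
    """Time-course data of a single species, including units."""
    species_id: str
    time: List[float]
    concentration: List[float]
    time_unit: str
    concentration_unit: str


@dataclass
class KineticModel:
    """Model equation and estimated parameters, stored next to the data."""
    equation: str
    parameters: Dict[str, float] = field(default_factory=dict)


@dataclass
class ExperimentRecord:
    """Reaction conditions, species, measurements and modeling results."""
    ph: float
    temperature: float
    temperature_unit: str
    vessel: str
    species: Dict[str, Species] = field(default_factory=dict)
    measurements: List[Measurement] = field(default_factory=list)
    model: Optional[KineticModel] = None

    def add_species(self, species: Species) -> None:
        # Identifiers must be unique so that references are unambiguous.
        if species.species_id in self.species:
            raise ValueError(f"duplicate species id: {species.species_id}")
        self.species[species.species_id] = species

    def add_measurement(self, measurement: Measurement) -> None:
        # A measurement may only refer to a species that is already defined.
        if measurement.species_id not in self.species:
            raise KeyError(f"unknown species id: {measurement.species_id}")
        if len(measurement.time) != len(measurement.concentration):
            raise ValueError("time and concentration differ in length")
        self.measurements.append(measurement)


def build_example() -> ExperimentRecord:
    """Assemble a record from made-up, purely illustrative values."""
    record = ExperimentRecord(ph=7.4, temperature=25.0,
                              temperature_unit="C", vessel="cuvette")
    record.add_species(Species("p0", "example enzyme", "protein",
                               uniprot_id="PLACEHOLDER_ID"))
    record.add_species(Species("s0", "ethanol", "substrate", smiles="CCO"))
    record.add_measurement(Measurement("s0", [0.0, 60.0, 120.0],
                                       [10.0, 8.0, 6.5], "s", "mmol/l"))
    record.model = KineticModel("v = vmax * s0 / (km + s0)",
                                {"vmax": 0.05, "km": 2.0})
    return record
```

Keeping the species in a dictionary keyed by identifier makes every reference from a measurement resolvable and rejects ambiguous labels early, which mirrors the unequivocal referencing described above. The model equation and its parameters are stored in the same record as the measurements they were derived from.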
In this work, a computational workflow based on EnzymeML for kinetic parameter estimation of enzyme reactions was developed with the following requirements: the workflow should (i) enable comprehensive data analysis, (ii) be based on raw data, and (iii) yield reproducible kinetic parameter estimates. Furthermore, the workflow should be developed iteratively by applying it to ongoing research projects of EnzymeML project partners. To this end, two software packages should be developed that facilitate the parameter estimation process. Finally, the estimated parameters should be written back to the EnzymeML file, unifying the measurement data with information on the applied estimation model and the resulting kinetic parameters.