Discussion
Ensuring data integrity throughout an analysis workflow is essential for trustworthy results. Hence, the analysis workflow was conceptualized starting from the photometric measurement data in order to control all data processing steps. For that purpose, discussing the output format of the analytical device with the experimental partners proved valuable for planning the data acquisition process. On this basis, a project-specific parser function was established, which transferred the measurement data into the respective EnzymeML document containing information on the measurement conditions. Thereby, error-prone and tedious manual copying between files was avoided, ensuring raw data integrity from the very start of the workflow.
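The following minimal sketch illustrates what such a parser can look like. The file layout and column names are assumptions chosen for illustration; the actual parser was tailored to the export format of the photometer used and subsequently wrote the measurements into an EnzymeML document.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class RawMeasurement:
    """One absorbance time course from the plate reader export."""
    well_id: str
    time_min: List[float]
    absorbance: List[float]


def parse_photometer_export(path: str) -> List[RawMeasurement]:
    """Read a hypothetical long-format CSV export (columns: well, time, absorbance)
    and group it into one time course per well."""
    df = pd.read_csv(path)
    measurements = []
    for well_id, group in df.groupby("well"):
        group = group.sort_values("time")
        measurements.append(
            RawMeasurement(
                well_id=well_id,
                time_min=group["time"].tolist(),
                absorbance=group["absorbance"].tolist(),
            )
        )
    return measurements
```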
Data integrity of the analysis workflow
Another aspect of data integrity is accurate concentration calculation based on the measured absorption signal. The relation between the concentration of an analyte and its absorption signal can be linear as well as non-linear. Therefore, wrongly assuming a linear relationship, although the relation is better described non-linearly, leads to avoidable inaccuracy in concentration calculation (Hsu and Chen, 2010; Martin et al., 2017) and thus to imprecise kinetic parameter estimates. Hence, concentration calculation based on extinction coefficients was discouraged if the underlying relation between absorption and concentration of an analyte was not determined experimentally for the respective reaction system. To enable precise concentration calculation for linear as well as non-linear absorption-to-concentration relationships, CaliPytion was applied to analyte standards provided by the experimental partners. All of the analyzed standard curves, except for the ABTS standard at pH 5 in scenario C, were best described by non-linear calibration equations, based on their fit quality as determined by the Akaike information criterion (AIC). Hence, uncertainties from concentration calculation, and thus their impact on the kinetic parameters, were reduced.
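The principle of this model comparison can be illustrated in a few lines: a linear and a non-linear calibration model are fitted to the same standard series, and the model with the lower AIC is preferred. The concentration and absorbance values below are made up, and the code sketches the general idea rather than CaliPytion's internal implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical standard series: absorbance readings of known analyte concentrations (mM)
conc = np.array([0.0, 0.25, 0.5, 1.0, 2.0, 4.0])
absorbance = np.array([0.00, 0.21, 0.41, 0.78, 1.42, 2.40])

def linear(c, a):
    """Linear (Beer-Lambert-like) calibration model."""
    return a * c

def quadratic(c, a, b):
    """Simple non-linear calibration model."""
    return a * c + b * c ** 2

def aic(y, y_hat, n_params):
    """Akaike information criterion based on the residual sum of squares."""
    rss = np.sum((y - y_hat) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * n_params

for name, model in [("linear", linear), ("quadratic", quadratic)]:
    popt, _ = curve_fit(model, conc, absorbance)
    print(f"{name:9s} AIC = {aic(absorbance, model(conc, *popt), len(popt)):.1f}")
```

The model with the lowest AIC is then used (and inverted) for concentration calculation, which penalizes additional parameters and thereby avoids overfitting the standard curve.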
Besides data integrity, high-quality data is required for accurate parameter estimates, since unnoticed experimental errors distort the resulting kinetic parameters. Therefore, quality control of measurement data is an important aspect of every data analysis process. EnzymePynetics allows measurement data to be visualized together with the fitted kinetic model, whereby systematic deviations of single measurements were identified.
By this method, errors in pipetting, mixing, and temperature equilibration during experimental preparation, as well as issues with the experimental design itself, were identified in different scenarios.
Systematic pipetting errors, for instance, were revealed by systematic deviations between the measurement data and the fitted model.
In scenario D, individual enzyme reactions showed substrate concentrations that deviated from the assay protocol. Parameter estimation without the presumably erroneous measurements yielded a better fit between data and model, reflected in lower standard deviations of the kinetic parameters. Hence, the experimental partners were advised to repeat the deviating measurement.
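Such a visual check can also be turned into a simple numeric one by comparing the mean residual of each measurement with its internal scatter. The sketch below uses made-up residuals and a rough t-statistic-like criterion; it illustrates the idea only and does not represent EnzymePynetics' implementation.

```python
import numpy as np

def flag_offset_measurements(residuals_per_measurement, threshold=3.0):
    """Flag measurements whose residuals (data minus fitted model) are
    systematically offset, i.e. whose mean residual is large compared
    with the standard error of that mean."""
    flagged = []
    for mid, res in residuals_per_measurement.items():
        res = np.asarray(res, dtype=float)
        sem = res.std(ddof=1) / np.sqrt(len(res))
        if sem > 0 and abs(res.mean()) / sem > threshold:
            flagged.append(mid)
    return flagged

# Hypothetical residuals: measurement "D3" sits systematically above the model
residuals = {
    "D1": [0.01, -0.02, 0.00, 0.01],
    "D2": [-0.01, 0.02, -0.01, 0.00],
    "D3": [0.12, 0.10, 0.14, 0.11],
}
print(flag_offset_measurements(residuals))  # -> ['D3']
```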
In two projects, the progress curves of the reactions indicated a lag phase after the reaction start. In one case, enzyme reactions were started by adding a small volume of enzyme solution. In consequence, inhomogeneous mixing led to increasing reaction rates until the enzyme was distributed evenly. As a countermeasure, the project partners were advised to increase the pipetting volume of the enzyme solution.
In another case, incomplete temperature equilibration of the MTP, caused by a prolonged assay preparation time, resulted in an initial lag phase. Due to small reaction volumes and low mass, MTPs have a low heat capacity and are thus highly susceptible to temperature changes. Hence, the project partners were advised to pre-incubate the MTP within the photometer at reaction temperature for 5 min and then start the reaction by adding enzyme as quickly as possible.
In another project, the modeling results for the estimated \(k_{cat}\) and \(K_{m}\) were used to iteratively improve the design of the enzyme assay. In addition, the correlation between these parameters was used as a measure to ensure that the highest applied substrate concentration was sufficiently high and thus that the kinetic parameters could be determined independently of each other.
Thereby, the appropriate enzyme concentration and substrate concentration range were identified through multiple rounds of laboratory experiments with subsequent kinetic modeling.
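The parameter correlation can be obtained directly from the covariance matrix of the fit. The sketch below uses a hypothetical covariance matrix for (\(k_{cat}\), \(K_{m}\)); a correlation close to ±1 indicates that the two parameters cannot be estimated independently and that higher substrate concentrations, well above \(K_{m}\), are needed.

```python
import numpy as np

def parameter_correlation(cov, i=0, j=1):
    """Pearson correlation between two fitted parameters, computed from the
    covariance matrix returned by a least-squares fit (e.g. scipy's curve_fit)."""
    return cov[i, j] / np.sqrt(cov[i, i] * cov[j, j])

# Hypothetical covariance matrix of (kcat, Km) from a progress-curve fit
cov = np.array([[0.04, 0.90],
                [0.90, 25.0]])
print(f"corr(kcat, Km) = {parameter_correlation(cov):.2f}")
# A value close to +/-1 means kcat and Km are not identifiable independently,
# i.e. the highest applied substrate concentration was too low.
```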
Assessing the data quality through modeling and improving assay designs proved to be a strength of the implemented progress curve method, on which the parameter estimation of this workflow is based. In contrast to the predominantly applied initial rates method (Tang and Leyh, 2010), the progress curve method offers intrinsic advantages with regard to methods reproducibility. In initial rate kinetics, no consensus exists on the linear reaction period from which the kinetic parameters are estimated. Hence, the linear period is determined manually or merely assumed, and the resulting parameters are influenced by the subjective choice of this period. For progress curve analysis, the entire dataset of a continuous enzyme assay is used. If the fit statistics reveal that the data is not in accordance with the model, either the assay should be repeated due to an underlying issue or the model should be questioned. In contrast, the initial rates method can be applied by arbitrarily choosing any seemingly linear subset, or even just the first two data points, of a time-course dataset. In discussions with project partners, this proved to be a common practice, although it is scientifically questionable.
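The principle of progress curve analysis can be sketched in a few lines: the Michaelis-Menten rate law is integrated numerically and fitted to the entire time course rather than to an initial linear segment. The enzyme concentration, initial substrate concentration, simulated data, and noise level below are assumptions for illustration; the code does not represent EnzymePynetics' implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import curve_fit

E0, S0 = 0.01, 5.0  # enzyme and initial substrate concentration (mM), assumed known

def progress_curve(t, kcat, Km):
    """Substrate concentration over time, obtained by numerically integrating
    the Michaelis-Menten rate law dS/dt = -kcat * E0 * S / (Km + S)."""
    rate = lambda _t, s: -kcat * E0 * s / (Km + s)
    sol = solve_ivp(rate, (t[0], t[-1]), [S0], t_eval=t, rtol=1e-8)
    return sol.y[0]

# Hypothetical noisy progress-curve data (time in min, substrate in mM)
t_data = np.linspace(0, 15, 16)
rng = np.random.default_rng(1)
s_data = progress_curve(t_data, kcat=100.0, Km=1.5) + rng.normal(0, 0.05, t_data.size)

# Fit kcat and Km to the entire time course, not only an initial linear segment
popt, pcov = curve_fit(progress_curve, t_data, s_data, p0=[50.0, 1.0],
                       bounds=(1e-6, np.inf))
print("kcat, Km    =", np.round(popt, 2))
print("std. errors =", np.round(np.sqrt(np.diag(pcov)), 2))
```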
The choice of the modeling method is just one decision affecting the kinetic parameters in the long chain of actions from enzyme expression all the way to data modeling. Each individual treatment step needs to be documented and later reported in order to make the results of the experiment reproducible. The STRENDA guidelines offer valuable orientation on minimum reporting standards for making an enzyme kinetics experiment reproducible. However, the guidelines are mainly focused on laboratory aspects and do not specify reporting standards for data treatment steps. Hence, methodological reproducibility is not achieved by solely adhering to the STRENDA guidelines if kinetic parameters are to be reproduced. Therefore, the method for concentration calculation as well as the method for parameter estimation should be added to the STRENDA guidelines. As a result, the importance of data-analytic aspects for methodologically reproducible kinetic parameters would receive more attention.
Repeatability, reproducibility and FAIRness of the analysis workflow
Availability of raw data is a prerequisite for repeating a data analysis process, and it additionally reinforces trust in scientific findings and the process itself. The lack of raw data is therefore also described as one of the reasons for the reproducibility crisis, since reported findings cannot be verified (Miyakawa, 2020).
A second necessity for reanalysis is documentation, providing information on the data treatment steps that led to a particular result. Furthermore, detailed documentation makes the analysis comprehensible for others.
The developed parameter estimation workflow unifies these two requirements. On the one hand, the workflow is based on raw data. On the other hand, the Jupyter Notebook facilitates the analysis while containing documentation of all analysis steps. Thus, the workflow is repeatable and reproducible on the methods as well as the results level. Moreover, by unifying data with executable code, the implementation of the analysis workflow satisfies the highest reproducibility standards in computer science (Peng, 2011).
In conclusion, the Jupyter Notebook with the implemented workflow can be understood as a graph describing how experimental data is transformed to yield kinetic parameters. As such, the measurement data and modeling results are stored within an EnzymeML document, while the information on the modeling process is stored in the notebook file.
Starting from the data, the FAIR guiding principles were implemented on multiple levels of this thesis. On the data level, all experimental data as well as the modeling results were stored as EnzymeML files, which are compliant with the FAIR data principles (Pleiss, 2021; Range et al., 2022). On the methods level, the workflow and all of its components were designed in a FAIR fashion.
The Python packages are findable and accessible on PyPI and GitHub, which represent the most important distribution platforms for Python code. Additionally, the software is interoperable with other software. This was achieved by segregating the data model from the functionality of the software. Therefore, other Python tools can utilize the functionalities of CaliPytion and EnzymePynetics by supplying data through the underlying data models, and thus no modifications of the software functionalities are necessary. Furthermore, the data model and its vocabulary are described in the specifications of CaliPytion as well as EnzymePynetics, making the data model comprehensible for others. Finally, the software packages are reusable, since the provided documentation allows anyone to install and apply the software.
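The design principle can be illustrated with a deliberately simplified sketch. The classes and the fitting routine below are hypothetical and much reduced compared with the actual data models of CaliPytion and EnzymePynetics, but they show how functionality that depends only on a declarative data model can be reused by any tool able to populate that model.

```python
from dataclasses import dataclass, field
from typing import List

# Data model layer: a plain, serializable description of a standard curve,
# independent of any fitting functionality (hypothetical, simplified class).
@dataclass
class Standard:
    analyte: str
    wavelength_nm: float
    concentrations: List[float] = field(default_factory=list)
    absorbances: List[float] = field(default_factory=list)

# Functionality layer: operates on the data model only, so any tool that can
# populate a Standard can reuse the calibration routine unchanged.
def fit_linear_slope(standard: Standard) -> float:
    """Least-squares slope of absorbance vs. concentration through the origin."""
    num = sum(c * a for c, a in zip(standard.concentrations, standard.absorbances))
    den = sum(c * c for c in standard.concentrations)
    return num / den

abts = Standard("ABTS", 420.0, [0.1, 0.2, 0.4], [0.05, 0.11, 0.22])
print(round(fit_linear_slope(abts), 3))
```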
Since data integrity as well as reproducibility of the analysis were key parts of this work, the thesis itself was conceptualized to enable reproducible data analysis. Therefore, the Jupyter Notebooks were integrated into the Jupyter Book which constitutes this thesis. Thus, all data analysis is reproducible by launching the individual notebooks with Binder. As a result, a level of data integrity and transparency is achieved which is not attainable with established word processing software. Furthermore, the Jupyter Book format allows computational content to be displayed in its native form. Due to the toggle functionality of code cells, the readability of text and code is equally maintained, which is not possible with analog formats. Hence, preserving readability as well as reproducibility was weighted higher than printed readability. The Jupyter Book format allows this thesis to be findable and accessible on GitHub, while the interoperability between book and notebooks makes all code content of this work reusable and reanalyzable. In conclusion, the Jupyter Book format allows scientific reporting of computational content in compliance with the FAIR guiding principles.