An entropy-reducing data representation approach for bioinformatic data

McCulloch, Alan; Jauregui, Ruy; Maclean, Paul; Ashby, Rachael; Moraga, Roger; Laugraud, Aurelie; Brauning, Rudiger; Dodds, Ken; McEwan, John

journal contribution

posted on 2023-05-03, 10:14 authored by Alan McCulloch, Ruy Jauregui, Paul Maclean, Rachael Ashby, Roger Moraga, Aurelie Laugraud, Rudiger Brauning, Ken Dodds, John McEwan

Non-semantic approaches to bioinformatic data analysis have potential relevance where semantic resources such as annotated finished reference genomes are lacking, for example in the analysis and utilisation of growing amounts of sequence data from non-model organisms, often associated with sequence-based agricultural, aquacultural and environmental sampling studies and commercial services. Even where rich semantic resources are available, semantic approaches to problems such as contrasting and comparing reference assemblies, and utilising multiple references in parallel to avoid reference bias, are costly and difficult to fully automate. We introduce and discuss a non-semantic data representation approach, intended mainly for bioinformatic data, called non-semantic labelling. Non-semantic labelling involves tensorially combining multiple kinds of model-based entropy-reducing data representation, with multiple representation models, so as to map both data and models into dual metric representation spaces, with the goals of both reducing the statistical complexity of the data and highlighting latent structure via machine learning and statistical analyses conducted within the dual representation spaces. As part of the framework, we introduce a novel algebraic abstraction of data representation mappings, and present four proof-of-concept examples of its application to problems such as comparing and contrasting sequence assemblies, utilising multiple references for annotation, and developing quality control diagnostics in a variety of high-throughput sequencing contexts.

Rights statement

© The Author(s) 2018. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Language

English

Does this contain Māori information or data?

No

Publisher

Oxford University Press

Journal title

Database

ISSN

1758-0463

Citation

McCulloch, A. F., Jauregui, R., Maclean, P. H., Ashby, R. L., Moraga, R. A., Laugraud, A., … McEwan, J. C. (2018). An entropy-reducing data representation approach for bioinformatic data. Database, 2018, bay029. doi:10.1093/database/bay029

Funder

Ministry of Business Innovation & Employment

Contract number

A20201

Job code

49050x01

Keywords

bioinformatics gbs control information quantum NGS quality sequencing

Licence

In Copyright
