
Latest recommendations

25 Feb 2025

Proper account of auto-correlations improves decoding performances of state-space (semi) Markov models

An empirical study on the impact of neglecting dependencies in the observed or the hidden layer of an H(S)MM on decoding performances

Recommended by Nathalie Peyrard based on reviews by Sandra Plancade and 1 anonymous reviewer

The article by Bez et al. [1] addresses an important issue for statisticians and ecological modellers: the impact of modelling choices when state-space models are used to represent time series with hidden regimes.

The authors present an empirical study of the impact of model misspecification for models in the HMM and HSMM family. The misspecification can be at the level of the hidden chain (Markovian or semi-Markovian assumption) or at the level of the observed chain (AR0 or AR1 assumption). 
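
To make the two kinds of assumption concrete, here is a minimal sketch of Viterbi decoding for a hidden Markov chain with Gaussian emissions that are either independent given the state (AR0) or dependent on the previous observation (AR1). It is an illustration only, not the authors' models or code: the function names, the two-state setup and all parameter values are assumptions, and the semi-Markov case is not covered.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log-density of a normal distribution N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(obs, log_init, log_trans, means, sds, phis):
    """Most likely hidden state path for a Markov hidden chain.

    In state k, obs[t] ~ N(means[k] + phis[k] * (obs[t-1] - means[k]), sds[k]).
    Setting phis[k] = 0 gives the AR0 emission model; phis[k] != 0 gives AR1.
    """
    n_states = len(means)

    def log_emit(k, t):
        prev_dev = obs[t - 1] - means[k] if t > 0 else 0.0
        return gaussian_logpdf(obs[t], means[k] + phis[k] * prev_dev, sds[k])

    score = [log_init[k] + log_emit(k, 0) for k in range(n_states)]
    back = []  # back[t-1][k] = best predecessor of state k at time t
    for t in range(1, len(obs)):
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k] + log_emit(k, t))
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical vessel speeds: state 0 = slow (fishing), state 1 = fast (not fishing).
speeds = [1.0, 1.2, 0.9, 8.0, 8.5, 7.9]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
path_ar0 = viterbi(speeds, log_init, log_trans, [1.0, 8.0], [1.0, 1.0], [0.0, 0.0])
path_ar1 = viterbi(speeds, log_init, log_trans, [1.0, 8.0], [1.0, 1.0], [0.5, 0.5])
```

Decoding the same series under both emission models and comparing the two paths with a known activity sequence is one simple way to measure the effect of a misspecified observation layer.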

The study uses data on the movements of fishing vessels. Vessels can exert pressure on fish stocks when they are fishing, and the aim is to identify the periods during which fishing vessels are fishing or not fishing, based on GPS tracking data. Two sets of data are available, from two vessels with contrasting fishing behaviour. The empirical study combines experiments on the two real datasets and on data simulated from models whose parameters are estimated on the real datasets. In both cases, the actual sequence of activities is available. The impact of a model misspecification is mainly evaluated on the restored hidden chain (decoding task), which is very relevant since in many applications we are more interested in the quality of decoding than in the accuracy of parameter estimation. Results on parameter estimation are also presented and metrics are developed to help interpret the results. The study is conducted in a rigorous manner and extensive experiments are carried out, making the results robust.

The main conclusion of the study is that choosing the wrong AR model at the observed sequence level has more impact than choosing the wrong model at the hidden chain level.

The article ends with an interesting discussion of this finding, in particular the impact of resolution on the quality of the decoding results. As the authors point out in this discussion, the results of this study are not limited to the application of GPS data to the activities of fishing vessels. Beyond ecology, H(S)MMs are also widely used in epidemiology, seismology, speech recognition and human activity recognition, among other fields. The conclusion of this study will therefore be useful in a wide range of applications. It is a warning that should encourage modellers to design their hidden Markov models carefully or to interpret their results cautiously.

References

[1] Nicolas Bez, Pierre Gloaguen, Marie-Pierre Etienne, Rocio Joo, Sophie Lanco, Etienne Rivot, Emily Walker, Mathieu Woillez, Stéphanie Mahévas (2024) Proper account of auto-correlations improves decoding performances of state-space (semi) Markov models. HAL, ver.3 peer-reviewed and recommended by PCI Math Comp Biol https://hal.science/hal-04547315v3

Title: Proper account of auto-correlations improves decoding performances of state-space (semi) Markov models
Authors: Nicolas Bez, Pierre Gloaguen, Marie-Pierre Etienne, Rocio Joo, Sophie Lanco, Etienne Rivot, Emily Walker, Mathieu Woillez, Stéphanie Mahévas
Abstract: State-space models are widely used in ecology to infer hidden behaviors. This study develops an extensive numerical simulation-estimation experiment to evaluate the state decoding accuracy of four simple state-space models. These models are obt...
Thematic fields: Dynamical systems, Ecology, Probability and statistics
Recommender: Nathalie Peyrard
Submission date: 2024-05-29 16:29:25
13 Aug 2024

Phenotype control and elimination of variables in Boolean networks

Disclosing effects of Boolean network reduction on dynamical properties and control strategies

Recommended by Claudine Chaouiya based on reviews by Tomas Gedeon and David Safranek

Boolean networks stem from seminal work by M. Sugita [1], S. Kauffman [2] and R. Thomas [3] over half a century ago. Since then, a very active field of research has developed, leading to theoretical advances accompanied by a wealth of work on modelling genetic and signalling networks involved in a wide range of cellular processes. Boolean networks provide a successful formalism for the mathematical modelling of biological processes, with a qualitative abstraction particularly well adapted to processes for which precise, quantitative data are scarcely available. Nevertheless, these abstract models reveal fundamental dynamical properties, such as the existence and reachability of attractors, which embody stable cellular responses (e.g. differentiated states). Analysing these properties still faces serious computational complexity. Reduction of model size was proposed as a means to cope with this issue. Furthermore, to enhance the capacity of Boolean networks to produce relevant predictions, formal methods have been developed to systematically identify control strategies enforcing desired behaviours.

In their paper, E. Tonello and L. Paulevé [4] assess the most popular reduction that consists in eliminating a model component. Considering three typical update schemes (synchronous, asynchronous and general asynchronous updates), they thoroughly study the effects of the reduction on attractors, minimal trap spaces (minimal subspaces from which the model dynamics cannot leave), and on phenotype controls (interventions which guarantee that the dynamics ends in a phenotype defined by specific component values). Because they embody potential behaviours of the biological process under study, these are all properties of great interest for a modeller.
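
The dependence of attractors on the update scheme can be seen on a toy example. The sketch below is not taken from the paper: the two-component network and the helper names are assumptions, and only the synchronous and (fully) asynchronous schemes are covered, not the general asynchronous one.

```python
from itertools import product

def sync_step(state, rules):
    """Synchronous update: every component is updated at once."""
    return tuple(rule(state) for rule in rules)

def async_successors(state, rules):
    """Asynchronous update: exactly one component changes per transition."""
    successors = set()
    for i, rule in enumerate(rules):
        value = rule(state)
        if value != state[i]:
            successors.add(state[:i] + (value,) + state[i + 1:])
    return successors

def sync_attractors(rules):
    """Enumerate the attractors (fixed points and cycles) of the synchronous dynamics."""
    attractors = set()
    for start in product((0, 1), repeat=len(rules)):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = sync_step(state, rules)
        cycle = seen[seen.index(state):]
        k = cycle.index(min(cycle))  # rotate so each cycle has one representation
        attractors.add(tuple(cycle[k:] + cycle[:k]))
    return attractors

# Hypothetical positive feedback loop: x0 copies x1 and x1 copies x0.
toy_rules = [lambda s: s[1], lambda s: s[0]]
toy_sync = sync_attractors(toy_rules)
toy_async = async_successors((0, 1), toy_rules)
```

In this toy network the synchronous dynamics has two fixed points and one cycle between (0, 1) and (1, 0), whereas under the asynchronous scheme the state (0, 1) can move to either fixed point, so that cycle is not an attractor there.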

The authors show that eliminating a component can significantly affect some dynamical properties and may render a control strategy ineffective. The different update schemes, targets of phenotype control and control strategies are carefully handled with useful supporting examples.

Whether the eliminated component shares any of its regulators with its targets is shown to affect the preservation of minimal trap spaces. Since, in practice, model reduction amounts to eliminating several components, it would have been interesting to further explore this type of structural constraint, e.g. for members of acyclic pathways or of circuits.

Overall, E. Tonello and L. Paulevé’s contribution underlines the need for caution when defining a regulatory network and characterises the consequences on critical model properties when discarding a component [4].

References

[1] Motoyosi Sugita (1963) Functional analysis of chemical systems in vivo using a logical circuit equivalent. II. The idea of a molecular automation. Journal of Theoretical Biology, 4, 179–92. https://doi.org/10.1016/0022-5193(63)90027-4

[2] Stuart Kauffman (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22, 437–67. https://doi.org/10.1016/0022-5193(69)90015-0

[3] René Thomas (1973)  Boolean formalization of genetic control circuits. Journal of Theoretical Biology, 42, 563–85. https://doi.org/10.1016/0022-5193(73)90247-6

[4] Elisa Tonello, Loïc Paulevé (2024) Phenotype control and elimination of variables in Boolean networks. arXiv, ver.2 peer-reviewed and recommended by PCI Math Comp Biol https://arxiv.org/abs/2406.02304

 
Title: Phenotype control and elimination of variables in Boolean networks
Authors: Elisa Tonello, Loïc Paulevé
Abstract: We investigate how elimination of variables can affect the asymptotic dynamics and phenotype control of Boolean networks. In particular, we look at the impact on minimal trap spaces, and identify a structural condition that guarantees their pre...
Thematic fields: Dynamical systems, Systems biology
Recommender: Claudine Chaouiya
Submission date: 2024-06-05 10:12:39
18 Sep 2023

General encoding of canonical k-mers

Minimal encodings of canonical k-mers for general alphabets and even k-mer sizes

Recommended by Paul Medvedev based on reviews by 2 anonymous reviewers

As part of many bioinformatics tools, one encodes a k-mer, which is a string, into an integer. The natural encoding uses a bijective function to map the k-mers onto the interval [0, s^k - 1], where s is the alphabet size. This encoding is minimal, in the sense that the encoded integer ranges from 0 to the number of represented k-mers minus 1.
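
A minimal sketch of such a natural encoding reads a k-mer as a number in base s. The function names and the default DNA alphabet order are illustrative assumptions, not taken from the paper.

```python
def encode_kmer(kmer, alphabet="ACGT"):
    """Read the k-mer as a base-s number, with s = len(alphabet)."""
    s = len(alphabet)
    rank = {char: i for i, char in enumerate(alphabet)}
    code = 0
    for char in kmer:
        code = code * s + rank[char]
    return code

def decode_kmer(code, k, alphabet="ACGT"):
    """Inverse of encode_kmer for k-mers of length k."""
    s = len(alphabet)
    chars = []
    for _ in range(k):
        code, remainder = divmod(code, s)
        chars.append(alphabet[remainder])
    return "".join(reversed(chars))
```

With this choice, the 4^3 = 64 DNA 3-mers are mapped exactly onto the integers 0 to 63, which is what minimality means here.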

However, often one is only interested in encoding canonical k-mers. One common definition is that a k-mer is canonical if it is lexicographically not larger than its reverse complement. In this case, only about half the k-mers from the universe of k-mers are canonical, and the natural encoding is no longer minimal. For the special case of a DNA alphabet and odd k, there exists a "parity-based" encoding for canonical k-mers which is minimal. 
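
The loss of minimality can be checked by brute force on small k. The sketch below uses the definition of canonical given above; the helper names are assumptions made for illustration.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(kmer):
    return "".join(COMPLEMENT[char] for char in reversed(kmer))

def is_canonical(kmer):
    """Canonical = lexicographically not larger than its reverse complement."""
    return kmer <= reverse_complement(kmer)

def canonical_kmers(k):
    """All canonical DNA k-mers, in lexicographic order (brute force, small k only)."""
    return [
        "".join(chars)
        for chars in product("ACGT", repeat=k)
        if is_canonical("".join(chars))
    ]
```

For odd k no k-mer equals its own reverse complement, so exactly half of the 4^k k-mers are canonical (32 for k = 3). For even k the self-complementary k-mers push the count slightly above half (10 of 16 for k = 2). In both cases the natural codes of the canonical k-mers do not form a gap-free range starting at 0, which is the gap a minimal canonical encoding closes.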

In [1], the author presents a minimal encoding for canonical k-mers that works for general alphabets and both odd and even k. They also give an efficient bit-based representation for the DNA alphabet. 

This paper fills a theoretically interesting and often overlooked gap in how to encode k-mers as integers. It is not yet clear what practical applications this encoding will have, as the author readily acknowledges in the manuscript. Neither the author nor the reviewers are aware of any practical situations where the lack of a minimal encoding "leads to serious limitations." However, even in an applied field like bioinformatics, it would be short-sighted to only value theoretical work that has an immediate application; often, the application is several hops away and not apparent at the time of the original work. 

In fact, I would speculate that there may be significant benefits reaped if there was more theoretical attention paid to the fact that k-mers are often restricted to be canonical. Many papers in the field sweep under the rug the fact that k-mers are made canonical, leaving it as an implementation detail. This may indicate that the theory to describe and analyze this situation is underdeveloped. This paper makes a step forward to develop this theory, and I am hopeful that it may lead to substantial practical impact in the future. 

References

[1] Roland Wittler (2023) General encoding of canonical k-mers. bioRxiv, ver.2, peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology https://doi.org/10.1101/2023.03.09.531845

Title: General encoding of canonical k-mers
Authors: Roland Wittler
Abstract: To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are encoded as integers, e.g., applying som...
Thematic fields: Combinatorics, Computational complexity, Genomics and Transcriptomics
Recommender: Paul Medvedev
Reviewers: Anonymous
Submission date: 2023-03-13 17:01:37
27 Sep 2024

In silico identification of switching nodes in metabolic networks

A computational method to identify key players in metabolic rewiring

Recommended by Claudine Chaouiya based on reviews by 2 anonymous reviewers

Significant progress has been made in developing computational methods to tackle the analysis of the numerous (genome-wide scale) metabolic networks that have been documented for a wide range of species. Understanding the behaviours of these complex reaction networks is crucial in various domains such as biotechnology and medicine.

Metabolic rewiring is essential as it enables cells to adapt their metabolism to changing environmental conditions. Identifying the metabolites around which metabolic rewiring occurs is certainly useful in the case of metabolic engineering, which relies on metabolic rewiring to transform micro-organisms into cellular factories [1], as well as in other contexts.

This paper by F. Mairet [2] introduces a method to disclose these metabolites, named switch nodes, relying on the analysis of the flux distributions for different input conditions. Basically, considering fluxes for different inputs, which can be computed using e.g. Parsimonious Flux Balance Analysis (pFBA), the proposed method consists in identifying metabolites involved in reactions whose different flux vectors are not collinear. The approach is supported by four case studies, considering core and genome-scale metabolic networks of Escherichia coli, Saccharomyces cerevisiae and the diatom Phaeodactylum tricornutum.
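
The collinearity idea can be sketched on a toy network. This is a simplified binary test written for illustration, not the scoring actually used in the paper: the data structures (`reactions_of`, one flux dictionary per condition), the metabolite and reaction names, and the tolerance are all assumptions, and the flux values are made up rather than computed by pFBA.

```python
import math

def collinear(u, v, tol=1e-9):
    """True if vectors u and v are linearly dependent (up to a relative tolerance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return True
    # Cauchy-Schwarz: |u.v| equals |u||v| exactly when u and v are collinear.
    return norm_u * norm_v - abs(dot) <= tol * norm_u * norm_v

def switch_candidates(reactions_of, fluxes_by_condition):
    """Metabolites whose adjacent-reaction flux vectors are not collinear across conditions."""
    candidates = []
    for metabolite, reactions in reactions_of.items():
        vectors = [[fluxes[r] for r in reactions] for fluxes in fluxes_by_condition]
        all_collinear = all(
            collinear(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
        )
        if not all_collinear:
            candidates.append(metabolite)
    return candidates

# Hypothetical toy network: M1 is a pass-through, M2 is a branch point.
toy_reactions = {"M1": ["uptake", "r1"], "M2": ["r1", "to_A", "to_B"]}
toy_fluxes = [
    {"uptake": 10.0, "r1": 10.0, "to_A": 10.0, "to_B": 0.0},  # condition 1
    {"uptake": 5.0, "r1": 5.0, "to_A": 0.0, "to_B": 5.0},     # condition 2
]
toy_switches = switch_candidates(toy_reactions, toy_fluxes)
```

Around M1 the fluxes only rescale between conditions, so its vectors stay collinear. Around M2 the flux is redirected from one branch to the other, which makes it the only candidate switch node in this example.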

Whilst identified switch nodes may be biased because computed flux vectors satisfying given objectives are not necessarily unique, the proposed method still has relevant predictive potential, complementing the current array of computational methods to study metabolism.

References

[1] Tao Yu, Yasaman Dabirian, Quanli Liu, Verena Siewers, Jens Nielsen (2019) Strategies and challenges for metabolic rewiring. Current Opinion in Systems Biology, Vol 15, pp 30-38. https://doi.org/10.1016/j.coisb.2019.03.004.

[2] Francis Mairet (2024) In silico identification of switching nodes in metabolic networks. bioRxiv, ver.3 peer-reviewed and recommended by PCI Math Comp Biol https://doi.org/10.1101/2023.05.17.541195

Title: In silico identification of switching nodes in metabolic networks
Authors: Francis Mairet
Abstract: Cells modulate their metabolism according to environmental conditions. A major challenge to better understand metabolic regulation is to identify, from the hundreds or thousands of molecules, the key metabolites where the re-orientation of flux...
Thematic fields: Graph theory, Physiology, Systems biology
Recommender: Claudine Chaouiya
Reviewers: Anonymous
Submission date: 2023-05-26 17:24:26
07 Dec 2021

The emergence of a birth-dependent mutation rate in asexuals: causes and consequences

A new perspective in modeling mutation rate for phenotypically structured populations

Recommended by Yuan Lou based on reviews by Hirohisa Kishino and 1 anonymous reviewer

In standard mutation-selection models for describing the dynamics of phenotypically structured populations, it is often assumed that the mutation rate is constant across the phenotypes. In particular, this assumption leads to a constant diffusion coefficient for diffusion approximation models (Perthame, 2007 and references therein).   

Patout et al. (2021) study the dependence of the mutation rate on the birth rate, by introducing some diffusion approximations at the population level, derived from the large population limit of a stochastic, individual-based model. The reaction-diffusion model in this article is of the “cross-diffusion” type. This form of “cross-diffusion” has also appeared in the ecological literature as a model of biased movement behavior in organisms (Shigesada et al., 1979). The key underlying assumption for “cross-diffusion” is that the transition probability at the individual level depends solely upon the condition at the departure point. Patout et al. (2021) envision that a higher birth rate yields more mutations per unit of time. One of their motivations is that during cancer development, the mutation rates of cancer cells at the population level could be correlated with reproduction success.
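
As a hedged illustration, the contrast between a constant diffusion coefficient and a departure-point-dependent one can be written in a generic textbook form, which is not necessarily the exact equation of Patout et al. Here $u(t,x)$ denotes the density of phenotype $x$, $R(u)$ an unspecified selection term, and $D(x)$ a mutation coefficient that is assumed, for illustration, to increase with the birth rate at phenotype $x$:

$$
\partial_t u = D\,\Delta u + R(u)
\qquad\text{versus}\qquad
\partial_t u = \Delta\big(D(x)\,u\big) + R(u).
$$

Expanding the second operator gives $\Delta(D u) = D\,\Delta u + 2\,\nabla D\cdot\nabla u + u\,\Delta D$, so a phenotype-dependent coefficient adds drift-like terms that are absent from the constant-coefficient model.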

The reaction-diffusion approximation model derived in this article illustrates several interesting phenomena: For the time evolution situation, their model predicts different solution trajectories under various assumptions on the fitness function, e.g. the trajectory could initially move towards the birth optimum but eventually end up at the survival optimum. Their model also predicts that the mean fitness could be flat for some period of time, which might provide another alternative to explain observed data. At the steady-state level, their model suggests that the populations are more concentrated around the survival optimum, which agrees with the evolution of the time-dependent solution trajectories.   

Perhaps one of the most interesting contributions of the study of Patout et al. (2021) is to give us a new perspective on modelling the mutation rate in phenotypically structured populations and, subsequently, to help us better understand the connection between mutation and selection. More broadly, this article offers some new insights into the evolutionary dynamics of phenotypically structured populations, along with potential implications for empirical studies.

References

Perthame B (2007) Transport Equations in Biology. Frontiers in Mathematics. Birkhäuser, Basel. https://doi.org/10.1007/978-3-7643-7842-4_2

Patout F, Forien R, Alfaro M, Papaïx J, Roques L (2021) The emergence of a birth-dependent mutation rate in asexuals: causes and consequences. bioRxiv, 2021.06.11.448026, ver. 3 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.1101/2021.06.11.448026

Shigesada N, Kawasaki K, Teramoto E (1979) Spatial segregation of interacting species. Journal of Theoretical Biology, 79, 83–99. https://doi.org/10.1016/0022-5193(79)90258-3

Title: The emergence of a birth-dependent mutation rate in asexuals: causes and consequences
Authors: Florian Patout, Raphaël Forien, Matthieu Alfaro, Julien Papaïx, Lionel Roques
Abstract: In unicellular organisms such as bacteria and in most viruses, mutations mainly occur during reproduction. Thus, genotypes with a high birth rate should have a higher mutation rate. However, standard models of asexu...
Thematic fields: Dynamical systems, Evolutionary Biology, Probability and statistics, Stochastic dynamics
Recommender: Yuan Lou
Reviewers: Anonymous, Hirohisa Kishino
Submission date: 2021-06-12 13:59:45
21 Oct 2024

Benchmarking the identification of a single degraded protein to explore optimal search strategies for ancient proteins

Systematic investigation of software tools and design of a tailored pipeline for paleoproteomics research

Recommended by Raquel Assis based on reviews by Shevan Wilkin and 1 anonymous reviewer

Paleoproteomics is a rapidly growing field with numerous challenges, many of which are due to the highly fragmented, modified, and degraded nature of ancient proteins. Though there are established standards for analysis, it is unclear how different software tools affect the identification and quantification of peptides, proteins, and post-translational modifications. To address this knowledge gap, Rodriguez-Palomo et al. design a controlled system by experimentally degrading and purifying bovine beta-lactoglobulin, and then systematically compare the performance of many commonly used tools in its analysis.

They present comprehensive investigations of false discovery rates, open and narrow searches, de novo sequencing coverage bias and accuracy, and peptide chemical properties and bias. In each investigation, they explore wide ranges of appropriate tools and parameters, providing guidelines and recommendations for best practices. Based on their findings, Rodriguez-Palomo et al. propose a pipeline that is tailored for the analysis of ancient proteins. This pipeline is an important contribution to paleoproteomics and is likely to be of great value to the research community, as it is designed to enhance power, accuracy, and consistency in studies of ancient proteins.

References

Ismael Rodriguez-Palomo, Bharath Nair, Yun Chiang, Joannes Dekker, Benjamin Dartigues, Meaghan Mackie, Miranda Evans, Ruairidh Macleod, Jesper V. Olsen, Matthew J. Collins (2023) Benchmarking the identification of a single degraded protein to explore optimal search strategies for ancient proteins. bioRxiv, ver.3 peer-reviewed and recommended by PCI Math Comp Biol https://doi.org/10.1101/2023.12.15.571577

Title: Benchmarking the identification of a single degraded protein to explore optimal search strategies for ancient proteins
Authors: Ismael Rodriguez-Palomo, Bharath Nair, Yun Chiang, Joannes Dekker, Benjamin Dartigues, Meaghan Mackie, Miranda Evans, Ruairidh Macleod, Jesper V. Olsen, Matthew J. Collins
Abstract: Palaeoproteomics is a rapidly evolving discipline, and practitioners are constantly developing novel strategies for the analyses and interpretations of complex, degraded protein mixtures. The community has also esta...
Thematic fields: Genomics and Transcriptomics, Probability and statistics
Recommender: Raquel Assis
Reviewers: Anonymous, Shevan Wilkin
Submission date: 2024-03-12 15:17:08
10 Apr 2024

Revisiting pangenome openness with k-mers

Faster method for estimating the openness of species

Recommended by Leo van Iersel based on reviews by Guillaume Marçais, Abiola Akinnubi and 1 anonymous reviewer

When sequencing more and more genomes of a species (or a group of closely related species), a natural question to ask is how quickly the total number of distinct sequences grows as a function of the total number of sequenced genomes. A similar question can be asked about the number of distinct genes or the number of distinct k-mers (length-k subsequences).
 
The paper “Revisiting pangenome openness with k-mers” [1] describes a general mathematical framework that can be applied to each of these versions. A genome is abstractly seen as a set of “items” and a species as a set of genomes. The question then is how fast the function f_tot, the average size of the union of m genomes of the species, grows as a function of m. Basically, the faster the growth the more “open” the species is. More precisely, the function f_tot can be described by a power law plus a constant and the openness $\alpha$ refers to one minus the exponent $\gamma$ of the power law.
 
With these definitions one can make a distinction between “open” genomes ($\alpha < 1$) where the total size f_tot tends to infinity and “closed” genomes ($\alpha > 1$) where the total size f_tot tends to a constant. However, performing this classification is difficult in practice and the relevance of such a dichotomy is debatable. Hence, the authors of the current paper focus on estimating the openness parameter $\alpha$.
 
The definition of openness given in the paper was suggested by one of the reviewers and fixes a problem with a previous definition (in which it was mathematically impossible for a pangenome to be closed).
 
While the framework is very general, the authors apply it by using k-mers to estimate pangenome openness. This is an innovative approach because, even though k-mers are used frequently in pangenomics, they had not been used before to estimate openness. One major advantage of using k-mers is that the method can be applied directly to data consisting of sequencing reads, without the need for preprocessing. In addition, k-mers also cover non-coding regions of the genomes, which is particularly relevant when studying the openness of eukaryotic species.
 
The method is evaluated on 12 bacterial pangenomes with impressive results. The estimated openness is very close to the results of several gene-based tools (Roary, Pantools and BPGA) but the running time is much better: it is one to three orders of magnitude faster than the other methods.
 
Another appealing aspect of the method is that it computes the function f_tot exactly using a method that was known in the ecology literature but had not been noticed in the pangenomics field. The openness is then estimated by fitting a power law function.
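
One way to compute such an average union size exactly, without enumerating subsets, is the standard rarefaction identity: an item present in c of the N genomes is missed by a random m-subset with probability C(N - c, m) / C(N, m). The sketch below assumes this identity is what is meant; it is not the authors' implementation, and the function names and toy genomes are made up. The power-law fit is not shown.

```python
from itertools import combinations
from math import comb

def f_tot_exact(genomes):
    """Average union size of m genomes, for m = 1..N, from item frequencies only.

    genomes: list of sets of items (e.g. genes or k-mers).
    """
    n = len(genomes)
    counts = {}
    for genome in genomes:
        for item in genome:
            counts[item] = counts.get(item, 0) + 1
    # hist[c] = number of items present in exactly c genomes
    hist = [0] * (n + 1)
    for c in counts.values():
        hist[c] += 1
    result = []
    for m in range(1, n + 1):
        total = comb(n, m)
        expected = sum(
            h * (1 - comb(n - c, m) / total) for c, h in enumerate(hist) if h
        )
        result.append(expected)
    return result

def f_tot_bruteforce(genomes, m):
    """Same quantity by enumerating all m-subsets (exponential; for checking only)."""
    subsets = list(combinations(genomes, m))
    return sum(len(set().union(*subset)) for subset in subsets) / len(subsets)

# Hypothetical toy "genomes" given as sets of item identifiers.
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
toy_curve = f_tot_exact(toy_genomes)
```

The exact version needs only the histogram of item frequencies, so its cost does not grow with the number of subsets, whereas the brute-force check is usable only for a handful of genomes.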
 
Finally, the paper [1] offers a clear presentation of the problem, the approach and the results, with nice examples using real data.

References

[1] Parmigiani L., Wittler, R. and Stoye, J. (2024) "Revisiting pangenome openness with k-mers". bioRxiv, ver. 4 peer-reviewed and recommended by Peer Community In Mathematical and Computational Biology. https://doi.org/10.1101/2022.11.15.516472

Title: Revisiting pangenome openness with k-mers
Authors: Luca Parmigiani, Roland Wittler, Jens Stoye
Abstract: Pangenomics is the study of related genomes collectively, usually from the same species or closely related taxa. Originally, pangenomes were defined for bacterial species. After the concept was extended to eukaryoti...
Thematic fields: Combinatorics, Genomics and Transcriptomics
Recommender: Leo van Iersel
Reviewers: Guillaume Marçais, Yadong Zhang
Submission date: 2022-11-22 14:48:18
14 Mar 2023

Marker and source-marker reprogramming of Most Permissive Boolean networks and ensembles with BoNesis

Reprogramming of locally-monotone Boolean networks with BoNesis

Recommended by Sergiu Ivanov based on reviews by Ismail Belgacem and 1 anonymous reviewer

Reprogramming of cellular networks is a well-known challenge in computational biology. It consists first in properly representing an ensemble of networks that play a role in a phenomenon of interest, and second in designing strategies to alter the functioning of this ensemble in the desired direction. Important applications involve disease study: a therapy can be seen as a reprogramming strategy, and the disease itself can be considered a result of a series of adversarial reprogramming actions. The origins of this domain go back to the seminal paper by Barabási et al. [1] which formalized the concept of network medicine.

Boolean networks are an abstract tool that has met with considerable success in network medicine and network biology. They are sets of Boolean variables, each equipped with a Boolean update function describing how to compute the next value of the variable from the values of the other variables. Despite their apparent dissimilarity with biological systems, which involve varying quantities and continuous processes, Boolean networks have been very effective in representing biological networks whose entities are typically seen as being on or off. Particular examples are protein signalling networks as well as gene regulatory networks.

The paper [2] by Loïc Paulevé presents a versatile tool for tackling reprogramming of Boolean networks seen as models of biological networks.  The problem of reprogramming is often formulated as the problem of finding a set of perturbations which guarantee some properties on the attractors.  The work [2] relies on the most permissive semantics [3], which together with the modelling assumption allows for considerable speed-up in the practically relevant subclass of locally-monotone Boolean networks.
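
To illustrate the problem statement only, the following brute-force sketch searches for perturbations (components forced to a constant value) such that every fixed point of the perturbed network carries a given marker. It does not use BoNesis or its API, it checks fixed points only rather than all attractors under the most permissive semantics, and it does not scale. The function names and the toy network are assumptions.

```python
from itertools import combinations, product

def fixed_points(rules):
    """States left unchanged by every update function."""
    n = len(rules)
    return [
        state
        for state in product((0, 1), repeat=n)
        if all(rules[i](state) == state[i] for i in range(n))
    ]

def perturb(rules, forced):
    """Replace the update function of each forced component by a constant."""
    return [
        (lambda state, value=forced[i]: value) if i in forced else rule
        for i, rule in enumerate(rules)
    ]

def marker_reprogramming(rules, marker, max_size):
    """All perturbations of at most max_size components after which the network
    has at least one fixed point and all its fixed points match the marker."""
    solutions = []
    for size in range(max_size + 1):
        for indices in combinations(range(len(rules)), size):
            for values in product((0, 1), repeat=size):
                forced = dict(zip(indices, values))
                points = fixed_points(perturb(rules, forced))
                if points and all(
                    all(state[i] == v for i, v in marker.items()) for state in points
                ):
                    solutions.append(forced)
    return solutions

# Hypothetical network: x0 and x1 activate each other, x2 = x0 AND x1.
toy_network = [lambda s: s[1], lambda s: s[0], lambda s: s[0] & s[1]]
toy_solutions = marker_reprogramming(toy_network, {2: 1}, max_size=1)
```

In this toy case the unperturbed network has the fixed points (0, 0, 0) and (1, 1, 1), so the marker x2 = 1 is not guaranteed. Forcing x0 or x1 to 1 removes the unwanted fixed point, and forcing x2 itself to 1 satisfies the marker trivially, which a practical tool would typically let the user exclude.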

The paper is structured as a tutorial.  It starts by introducing the formalism, defining 4 different general variants of reprogramming under the most permissive semantics, and presenting evaluations of their complexity in terms of the polynomial hierarchy.  The author then describes the software tool BoNesis which can handle different problems related to Boolean networks, and in particular the 4 reprogramming variants.  The presentation includes concrete code examples with their output, which should be very helpful for future users.

The paper [2] introduces a novel scenario: reprogramming of ensembles of Boolean networks delineated by some properties, including for example the property of having a given interaction graph.  Ensemble reprogramming looks particularly promising in situations in which the biological knowledge is insufficient to fully determine all the update functions, i.e. in the majority of modelling situations.  Finally, the author also shows how BoNesis can be used to deal with sequential reprogramming, which is another promising direction in computational controllability, potentially enabling more efficient therapies [4,5].

REFERENCES
  1. Barabási A-L, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12, 56–68. https://doi.org/10.1038/nrg2918
  2. Paulevé L (2023) Marker and source-marker reprogramming of Most Permissive Boolean networks and ensembles with BoNesis. arXiv, ver. 2 peer-reviewed and recommended by Peer Community in Mathematical and Computational Biology. https://doi.org/10.48550/arXiv.2207.13307
  3. Paulevé L, Kolčák J, Chatain T, Haar S (2020) Reconciling qualitative, abstract, and scalable modeling of biological networks. Nature Communications, 11, 4256. https://doi.org/10.1038/s41467-020-18112-5
  4. Mandon H, Su C, Pang J, Paul S, Haar S, Paulevé L (2019) Algorithms for the Sequential Reprogramming of Boolean Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16, 1610–1619. https://doi.org/10.1109/TCBB.2019.2914383
  5. Pardo J, Ivanov S, Delaplace F (2021) Sequential reprogramming of biological network fate. Theoretical Computer Science, 872, 97–116. https://doi.org/10.1016/j.tcs.2021.03.013
Title: Marker and source-marker reprogramming of Most Permissive Boolean networks and ensembles with BoNesis
Authors: Loïc Paulevé
Abstract: Boolean networks (BNs) are discrete dynamical systems with applications to the modeling of cellular behaviors. In this paper, we demonstrate how the software BoNesis can be employed to exhaustively identify combinat...
Thematic fields: Combinatorics, Computational complexity, Dynamical systems, Molecular Biology, Systems biology
Recommender: Sergiu Ivanov
Reviewers: Ismail Belgacem, Anonymous
Submission date: 2022-08-31 15:00:21
02 May 2023

Population genetics: coalescence rate and demographic parameters inference

Estimates of Effective Population Size in Subdivided Populations

Recommended based on reviews by 2 anonymous reviewers

We often use genetic data from a single site, or even a single individual, to estimate the history of effective population size, Ne, over time scales in excess of a million years. Mazet and Noûs [2] emphasize that such estimates may not mean what they seem to mean.  The ups and downs of Ne may reflect changes in gene flow or selection, rather than changes in census population size. In fact, gene flow may cause Ne to decline even if the rate of gene flow has remained constant.

Consider for example the estimates of archaic population size in Fig. 1, which show an apparent decline in population size between roughly 700 kya and 300 kya. It is tempting to interpret this as evidence of a declining number of individuals, but that is not the only plausible interpretation.

Each of these estimates is based on the genome of a single diploid individual. As we trace the ancestry of that individual backwards into the past, the ancestors are likely to remain in the same locale for at least a generation or two. Being neighbors, there’s a chance they will mate. This implies that in the recent past, the ancestors of a sampled individual lived in a population of small effective size.

As we continue backwards into the past, there is more and more time for the ancestors to move around on the landscape. The farther back we go, the less likely they are to be neighbors, and the less likely they are to mate. In this more remote past, the ancestors of our sample lived in a population of larger effective size, even if neither the number of individuals nor the rate of gene flow has changed.

For a while, then, Ne should increase as we move backwards into the past. This process does not continue forever, because eventually the ancestors will be randomly distributed across the population as a whole. We therefore expect Ne to increase towards an asymptote, which represents the effective size of the entire population.
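
This argument can be checked numerically in a textbook setting. The sketch below follows two lineages sampled from the same deme of a symmetric island model, with constant deme size and constant migration rate, and converts the per-generation coalescence probability into an apparent Ne. It is an illustration under simplifying assumptions (discrete generations, at most one migration event per generation), not the model of Mazet and Noûs, and the function name and parameter values are made up.

```python
def apparent_ne_island(n_demes, deme_size, mig_rate, generations):
    """Apparent Ne(t), going backwards in time, for two lineages from one deme.

    deme_size is the number of diploid individuals per deme, so two lineages
    in the same deme coalesce with probability 1 / (2 * deme_size) per generation.
    """
    coal = 1.0 / (2 * deme_size)
    leave = 2 * mig_rate                 # same deme -> different demes
    join = 2 * mig_rate / (n_demes - 1)  # different demes -> same deme
    # Probabilities of each configuration, given no coalescence so far.
    p_same, p_diff = 1.0, 0.0
    ne = []
    for _ in range(generations):
        hazard = p_same * coal           # coalescence probability in this generation
        ne.append(1.0 / (2 * hazard))
        same = p_same * (1 - coal)       # lineages that did not coalesce
        diff = p_diff
        same, diff = same * (1 - leave) + diff * join, same * leave + diff * (1 - join)
        total = same + diff
        p_same, p_diff = same / total, diff / total
    return ne

# Hypothetical parameters: 10 demes of 100 diploids, migration rate 0.01.
ne_curve = apparent_ne_island(n_demes=10, deme_size=100, mig_rate=0.01, generations=500)
```

With these values the curve starts at the deme size, rises steadily as the two lineages become less likely to share a deme, and levels off at an asymptote, even though neither the census size nor the migration rate ever changes.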

This simple story gets more complex if there is change in either the census size or the rate of gene flow.  Mazet and Noûs [2] have shown that one can mimic real estimates of population history using models in which the rate of gene flow varies, but census size does not. This implies that the curves in Fig. 1 are ambiguous. The observed changes in Ne could reflect changes in census size, gene flow, or both.

For this reason, Mazet and Noûs [2] would like to replace the term “effective population size” with an alternative, the “inverse instantaneous coalescence rate,” or IICR. I don’t share this preference, because the same critique could be made of all definitions of Ne. For example, Wright [3, p. 108] showed in 1931 that Ne varies in response to the sex ratio, and this implies that changes in Ne need not involve any change in census size. This is also true when populations are geographically structured, as Mazet and Noûs [2] have emphasized, but this does not seem to require a new vocabulary.

Figure 1: PSMC estimates of the history of population size based on three archaic genomes: two Neanderthals and a Denisovan [1].

Mazet and Noûs [2] also show that estimates of Ne can vary in response to selection. It is not hard to see why such an effect might exist. In genomic regions affected by directional or purifying selection, heterozygosity is low, and common ancestors tend to be recent. Such regions may contribute to small estimates of recent Ne. In regions under balancing selection, heterozygosity is high, and common ancestors tend to be ancient. Such regions may contribute to large estimates of ancient Ne. The magnitude of this effect presumably depends on the fraction of the genome under selection and the rate of recombination.

In summary, this article describes several processes that can affect estimates of the history of effective population size. This makes existing estimates ambiguous. For example, should we interpret Fig. 1 as evidence of a declining number of archaic individuals, or in terms of gene flow among archaic subpopulations? But these questions also present research opportunities. If the observed decline reflects gene flow, what does this imply about the geographic structure of archaic populations? Can we resolve the ambiguity by integrating samples from different locales, or using archaeological estimates of population density or interregional trade?

REFERENCES

[1] Fabrizio Mafessoni et al. “A high-coverage Neandertal genome from Chagyrskaya Cave”. Proceedings of the National Academy of Sciences, USA 117.26 (2020), pp. 15132–15136. https://doi.org/10.1073/pnas.2004944117

[2] Olivier Mazet and Camille Noûs. “Population genetics: coalescence rate and demographic parameters inference”. arXiv, ver. 2 peer-reviewed and recommended by Peer Community In Mathematical and Computational Biology (2023). https://doi.org/10.48550/ARXIV.2207.02111

[3] Sewall Wright. “Evolution in Mendelian populations”. Genetics 16 (1931), pp. 97–159. https://doi.org/10.1093/genetics/16.2.97

Population genetics: coalescence rate and demographic parameters inference
Authors: Olivier Mazet, Camille Noûs
Abstract: We propose in this article a brief description of the work, over almost a decade, resulting from a collaboration between mathematicians and biologists from four different research laboratories, identifiable as the c...
Thematic fields: Genetics and population Genetics, Probability and statistics
Recommender: Alan Rogers
Reviewers: Joseph Lachance, Anonymous
Submission date: 2022-07-11 14:03:04
08 Nov 2024

Bayesian joint-regression analysis of unbalanced series of on-farm trials

Handling Data Imbalance and G×E Interactions in On-Farm Trials Using Bayesian Hierarchical Models

Recommended by ORCID_LOGO based on reviews by Pierre Druilhet and David Makowski

The article, "Bayesian Joint-Regression Analysis of Unbalanced Series of On-Farm Trials," presents a Bayesian statistical framework tailored for analyzing highly unbalanced datasets from participatory plant breeding (PPB) trials, specifically wheat trials. The key goal of this research is to address the challenges of genotype-environment (G×E) interactions in on-farm trials, which often have limited replication and varied testing conditions across farms.

The study applies a hierarchical Bayesian model with Finlay-Wilkinson regression, which improves the estimation of G×E effects despite substantial data imbalance. By incorporating a Student’s t-distribution for residuals, the model is more robust to extreme values, which are common in on-farm trials due to variable environments. The model also allows a detailed breakdown of variance, which identifies environment effects as the largest contributors and highlights areas for future breeding focus. Using Hamiltonian Monte Carlo methods, the study achieves reasonable computation times, even for large datasets.
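For readers unfamiliar with Finlay-Wilkinson regression, the following minimal sketch shows the classical two-step version of the idea: each genotype's yields are regressed on an environment index, and the slope measures the genotype's sensitivity to the environment. This is not the hierarchical Bayesian model of the article (no priors, no Student's t residuals, no Hamiltonian Monte Carlo). The function name, the record layout and the toy data are assumptions made for illustration.

```python
from collections import defaultdict


def finlay_wilkinson(records):
    """Classical two-step Finlay-Wilkinson regression.

    records: list of (genotype, environment, yield_value) tuples.
    Returns (env_index, slopes); a slope is None when it cannot be estimated.
    """
    # Step 1: environment index = mean of the yields observed in that environment.
    by_env = defaultdict(list)
    for _, env, y in records:
        by_env[env].append(y)
    env_index = {env: sum(ys) / len(ys) for env, ys in by_env.items()}

    # Step 2: ordinary least-squares slope of each genotype on the index.
    by_geno = defaultdict(list)
    for geno, env, y in records:
        by_geno[geno].append((env_index[env], y))

    slopes = {}
    for geno, pairs in by_geno.items():
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        if n < 2 or sxx == 0.0:
            slopes[geno] = None  # too few environments for this genotype
            continue
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        slopes[geno] = sxy / sxx
    return env_index, slopes


# Toy, noise-free, balanced data: yield = base + sensitivity * environment effect.
base = {"A": 3.0, "B": 4.0, "C": 5.0}
sensitivity = {"A": 0.5, "B": 1.0, "C": 1.5}
effect = {"farm1": -1.0, "farm2": 0.0, "farm3": 2.0}
records = [(g, e, base[g] + sensitivity[g] * effect[e]) for g in base for e in effect]
env_index, slopes = finlay_wilkinson(records)
```

On balanced, noise-free data this two-step estimate recovers the sensitivities exactly. Once genotypes are missing from some farms, the environment means depend on which genotypes happened to be grown there, which is one reason a joint model is attractive for unbalanced on-farm data.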

The main limitation of the method concerns data balance and replication: it requires a minimum level of both, which can be a challenge in very decentralized breeding networks. Moreover, the Bayesian framework, though computationally feasible, may still be too complex for widespread adoption without computational resources or statistical expertise.

The paper presents a sophisticated Bayesian framework specifically designed to tackle the challenges of unbalanced data in participatory plant breeding (PPB). It showcases a novel way to manage the variability in on-farm trials, a common issue in decentralized breeding programs. 

This study's methods accommodate the inconsistencies inherent in on-farm trials, such as extreme values and minimal replication. By using a hierarchical Bayesian approach with a Student’s t-distribution for robustness, it provides a model that maintains precision despite these real-world challenges. This makes it especially relevant for those working in unpredictable agricultural settings or decentralized trials. From a more general perspective, this paper’s findings support breeding methods that prioritize specific adaptation to local conditions. It is particularly useful for researchers and practitioners interested in breeding for agroecological or organic farming systems, where G×E interactions are critical but hard to capture in traditional trial setups.

Beyond agriculture, the paper serves as an excellent example of advanced statistical modeling in highly variable datasets. Its applications extend to any field where data is incomplete or irregular, offering a clear case for hierarchical Bayesian methods to achieve reliable results.

Finally, although quite methodological, the paper provides practical insights into how breeders and researchers can work with farmers to achieve meaningful varietal evaluations.

References

Michel Turbet Delof, Pierre Rivière, Julie C Dawson, Arnaud Gauffreteau, Isabelle Goldringer, Gaëlle van Frank, Olivier David (2024) Bayesian joint-regression analysis of unbalanced series of on-farm trials. HAL, ver. 2 peer-reviewed and recommended by PCI Math Comp Biol https://hal.science/hal-04380787

Bayesian joint-regression analysis of unbalanced series of on-farm trials
Authors: Michel Turbet Delof, Pierre Rivière, Julie C Dawson, Arnaud Gauffreteau, Isabelle Goldringer, Gaëlle van Frank, Olivier David
Abstract: Participatory plant breeding (PPB) is aimed at developing varieties adapted to agroecologically-based systems. In PPB, selection is decentralized in the target environments, and relies on collaboration between farmers, farmers' organisations an...
Thematic fields: Agricultural Science, Genetics and population Genetics, Probability and statistics
Recommender: Sophie Donnet
Reviewers: Pierre Druilhet, David Makowski
Submission date: 2024-01-11 14:17:41