# Can the share of genes inherited by descendants be easily calculated?


I saw an explanation on the web of how to calculate the share of certain genes in a person. For instance: if your grandfather was French and your granny was American, then your father is 50% French and 50% American. If he then has a child with an American woman, their child will be 50/2 = 25% French and 75% American. And so on, dividing by two each time. Is this true?

Every parent passes half of their DNA on to their kids. The amount of DNA is not infinite, but it is very large. This means that for the first generation the rule is precisely true: if Mom has 100% red genes and Dad has 100% green genes, the child will have 50% green and 50% red genes. For the second generation, where people have mixed genes, it is true only on average. A child with 50% green and 50% red genes passes on half of its genes, so it could theoretically pass on all red genes or all green genes, and it very likely won't pass on exactly half of each. But because there are a lot of genes, the law of averages means that in practice you can say it passes on half of each: if the other parent has all green genes, the resulting child will have about 75% green genes and 25% red. This goes on through the generations for a while until you run into the finiteness of DNA. At some point the amounts become small enough that they no longer follow the law of averages, and it becomes much more variable whether a child gets the full complement of their parent's red genes, none, or some intermediate amount.
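The halving argument can be sketched numerically. The following is a rough illustration rather than a genetic model: it assumes that in every generation the other parent carries none of the tracked ("red") ancestry, and it treats the genome as a hypothetical number of independently inherited segments (`n_segments`), which ignores real linkage and recombination.

```python
import random


def expected_fraction(start, generations):
    """Expected 'red' share after `generations` rounds of outcrossing."""
    return start / (2 ** generations)


def simulate_lineage(generations, n_segments, rng):
    """Follow one line of descent from a 100% 'red' ancestor.

    Each person has two copies of every segment. Only the copy inherited
    along the tracked lineage can be red; the other copy always comes from
    a parent with no red ancestry.
    """
    inherited = [True] * n_segments  # copy passed down by the red ancestor
    fractions = []
    for _ in range(generations):
        # Share of red among the 2 * n_segments copies this person carries.
        fractions.append(sum(inherited) / (2 * n_segments))
        # For each segment the parent passes on one of its two copies at random.
        inherited = [is_red and rng.random() < 0.5 for is_red in inherited]
    return fractions


if __name__ == "__main__":
    rng = random.Random(1)
    simulated = simulate_lineage(8, 50, rng)
    for generation, value in enumerate(simulated, start=1):
        print(generation, expected_fraction(1.0, generation), value)
```

With few segments the simulated shares drift away from the expected halving after a few generations and can hit zero, which is the "finiteness" effect described above.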

This is further complicated when we aren't talking about abstract "red" or "green" genes, but "American" or "French" genes. What the heck are "American" or "French" genes anyway? The human gene pool is fairly well-mixed, with most genes being widely shared. Those that can be used to identify specific ethnic or even national origins are low enough in number that what I said earlier about the law of averages no longer applying happens earlier if you look at specific subcategories of genes. Still, it works for several generations I believe.

## VITCOMIC2: visualization tool for the phylogenetic composition of microbial communities based on 16S rRNA gene amplicons and metagenomic shotgun sequencing

The 16S rRNA gene-based amplicon sequencing analysis is widely used to determine the taxonomic composition of microbial communities. Once the taxonomic composition of each community is obtained, evolutionary relationships among taxa are inferred by a phylogenetic tree. Thus, the combined representation of taxonomic composition and phylogenetic relationships among taxa is a powerful method for understanding microbial community structure; however, applying phylogenetic tree-based representation with information on the abundance of thousands or more taxa in each community is a difficult task. For this purpose, we previously developed the tool VITCOMIC (VIsualization tool for Taxonomic COmpositions of MIcrobial Community), which is based on the genome-sequenced microbes’ phylogenetic information. Here, we introduce VITCOMIC2, which incorporates substantive improvements over VITCOMIC that were necessary to address several issues associated with 16S rRNA gene-based analysis of microbial communities.

### Results

We developed VITCOMIC2 to provide (i) sequence identity searches against broad reference taxa including uncultured taxa; (ii) normalization of 16S rRNA gene copy number differences among taxa; (iii) rapid sequence identity searches by applying the graphics processing unit-based sequence identity search tool CLAST; (iv) accurate taxonomic composition inference and nearly full-length 16S rRNA gene sequence reconstructions for metagenomic shotgun sequencing; and (v) an interactive user interface for simultaneous representation of the taxonomic composition of microbial communities and phylogenetic relationships among taxa. We validated the accuracy of processes (ii) and (iv) by using metagenomic shotgun sequencing data from a mock microbial community.

### Conclusions

The improvements incorporated into VITCOMIC2 enable users to acquire an intuitive understanding of microbial community composition based on the 16S rRNA gene sequence data obtained from both metagenomic shotgun and amplicon sequencing.

## Introduction

The vast amount and increasing variety of genomic and proteomic data generated for model organisms creates an opportunity for in silico prediction of gene function through extrapolation of the functional properties of known genes. Genes with similar patterns of expression [1], synthetic lethality [2], or chemical sensitivity [3] often have similar functions. Additionally, function tends to be shared among genes whose gene products interact physically [4], are part of the same complex [5], or have similar three-dimensional structures [6]. Computational analyses have also revealed shared function among genes with similar phylogenetic profiles [7] or with shared protein domains [8]. More accurate predictions can be made by combining multiple heterogeneous sources of genomic and proteomic data [9]. Collectively, these observations have led to functional categorization of a number of previously uncharacterized genes using the so-called 'guilt-by-association' principle [10–12].

Algorithms that predict gene function using the guilt-by-association principle do so by extending a 'seed list' of genes known to have the given function by adding other genes highly associated with the seed list in one or more genomic and proteomic data sources. These algorithms typically compute a 'functional association network' to represent each dataset; in this network the nodes correspond to genes or proteins and the undirected links (or edges) are weighted according to evidence of co-functionality implied by the data source. Types of functional association networks include kernels used by support vector machines (SVMs) [9], functional linkage networks [13], and protein-protein linkage maps [14]. Individual functional association networks are often combined to generate a composite functional association network that summarizes all of the evidence of co-functionality. This network is then used as input to an algorithm that scores each gene based on its proximity to the genes in the seed list. When employed on multiple complementary data sources, these algorithms can accurately predict previously annotated gene functions in blind tests [15], suggesting that their predictions for unannotated genes are also quite accurate.
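To make the seed-list idea concrete, the sketch below scores every non-seed gene by the total weight of its links to seed genes in a small weighted network. This is only one simple scoring heuristic chosen for illustration, not the GeneMANIA algorithm; the gene names, edge weights and function names are hypothetical.

```python
def build_network(edges):
    """Turn (gene_a, gene_b, weight) triples into a symmetric adjacency map."""
    network = {}
    for a, b, weight in edges:
        network.setdefault(a, {})[b] = weight
        network.setdefault(b, {})[a] = weight
    return network


def score_by_seed_links(network, seeds):
    """Rank non-seed genes by the summed weight of their links to seed genes."""
    seeds = set(seeds)
    scores = {}
    for gene, neighbours in network.items():
        if gene in seeds:
            continue
        scores[gene] = sum(w for nb, w in neighbours.items() if nb in seeds)
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical composite network: weights stand for evidence of co-functionality.
EDGES = [
    ("geneA", "geneB", 0.9),
    ("geneA", "geneC", 0.2),
    ("geneB", "geneC", 0.4),
    ("geneC", "geneD", 0.7),
    ("geneB", "geneD", 0.1),
]

if __name__ == "__main__":
    ranking = score_by_seed_links(build_network(EDGES), seeds=["geneA", "geneB"])
    print(ranking)
```

Here `geneC` ranks above `geneD` because it is linked to both seed genes, while `geneD` has only one weak link to a seed.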

Despite these successes, guilt-by-association algorithms have yet to achieve widespread use in gene annotation or as sources of new hypotheses about gene function; to do so, their predictions need to become more accessible, more accurate, and more regularly updated. In principle, all available data should be used when generating hypotheses about gene function; however, compiling a large number of heterogeneous data sources, generating functional association networks to represent these sources, and then mapping gene identifiers among the networks is a complex and onerous task that is best handled by specialists. Centrally managed web-based 'prediction servers' are an efficient strategy to ensure that casual users have access to the best available predictions.

However, maintaining accurate and up-to-date prediction servers can be computationally prohibitive. Though a large number of algorithms have been developed to predict the function of unannotated genes by combining heterogeneous data sources (see [16] for a recent review), the most accurate of these algorithms have long running times, which can range from minutes [17] to hours [9] on yeast. Larger mammalian genomes increase the run time of these algorithms even more. As such, these algorithms cannot feasibly be run online and instead their predictions are made offline based on sets of pre-defined seed lists derived, for example, from Gene Ontology (GO) annotations [18]. However, because new data and annotations are being generated at a rapid rate, maintaining an up-to-date database of the best available predictions for all possible functions requires substantial and potentially unavailable computational resources.

Due to this limitation, most prediction servers sacrifice accuracy for speed by relying on a single, or a small number of, pre-computed composite functional association networks and using simple heuristics to score genes based on a given seed list (for example, see [13, 14, 19]). While the scoring heuristics are fast enough to provide online predictions for arbitrary seed lists, we will show that their predictions are much less accurate than more advanced methods. Furthermore, by using a single pre-computed network, these servers do not take advantage of the fact that different data sources are more relevant for different categories of gene function [2, 9] and are not extensible to new or user-supplied data sources.

Here we demonstrate that it is not necessary to surrender either accuracy or flexibility when building a prediction server by showing that GeneMANIA (Multiple Association Network Integration Algorithm) can, in seconds, generate genome-wide predictions that achieve state-of-the-art accuracy on arbitrary seed gene lists without relying on a pre-specified association network. We have achieved this goal through a series of algorithmic and technical advances that we have encapsulated in a new software package. With GeneMANIA, it is no longer necessary to maintain lists of in silico predictions of gene function because they can be recomputed as needed.

## Results

### Probability of state transitions along a branch

The probabilistic models can be used to infer whether there has been a change in the gene family size between the ancestor and the descendant along each branch in the species tree. This is done by substituting the rate parameters that optimize the likelihood function in the transition probability matrix P(t) (refer to the Methods section for the definitions), where t is the length of the branch. Using these transition probabilities, the probabilities of each state at LUCA can be calculated. Each of the models discussed in this work suggests that, even as gene losses and gene gains occur in evolution (the off-diagonal entries in the transition probability matrix), the most likely outcome along any branch is that the gene family size remains the same, with higher probabilities for maintaining gene absence than for maintaining gene presence. Another common property of all models (with the exception of model (B1), which is constrained to have the same rates of gene gain and gene loss) is that gene losses are typically from two to four times as likely as gene gains. The median transition probability matrices (with the highest probability in each row highlighted) for a branch with the length 0.35 (the median of the observed branch lengths in the tree) are
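The fitted matrices are not available in this text, so the following is only an illustration of how a transition probability matrix P(t) arises from rate parameters. It assumes the simplest possible case, a two-state (gene absent / gene present) continuous-time Markov chain, for which P(t) has a closed form. The gain and loss rates are hypothetical values, not estimates from the study; only the branch length 0.35 is taken from the text.

```python
import math


def transition_matrix(gain_rate, loss_rate, t):
    """P(t) for a two-state chain: state 0 = gene absent, state 1 = gene present.

    Row i, column j is the probability of being in state j at the end of a
    branch of length t, given state i at its start.
    """
    total = gain_rate + loss_rate
    decay = math.exp(-total * t)
    pi_absent = loss_rate / total    # stationary probability of absence
    pi_present = gain_rate / total   # stationary probability of presence
    return [
        [pi_absent + pi_present * decay, pi_present * (1.0 - decay)],
        [pi_absent * (1.0 - decay), pi_present + pi_absent * decay],
    ]


if __name__ == "__main__":
    # Hypothetical rates with loss three times as fast as gain.
    for row in transition_matrix(gain_rate=0.5, loss_rate=1.5, t=0.35):
        print([round(p, 3) for p in row])
```

With these made-up rates the diagonal entries are the largest in each row, staying absent is more probable than staying present, and loss is three times as probable as gain, which mirrors the qualitative pattern described above without reproducing the paper's numbers.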

Additionally, transition probabilities of models (M1) and (M2) suggest that the state of multiple in-paralogs is more prone to changes along a branch than the state of a single-copy gene. The second rows of these probability matrices indicate that acquiring a new gene is less likely than duplicating the existing gene in the species, and that the loss of an existing gene is more likely than its duplication. The main difference between the models (M1) and (M2) is in the gene loss transition probabilities when there are multiple copies in the ancestor. In model (M2), it is less likely that a gene loses all its copies along a branch, whereas in (M1) the probability of losing all copies of genes along a branch is about the same as the probability of maintaining multiple copies of the gene.

### The ancestral probabilities

For each model discussed in the previous section, the probability that each COG appeared in LUCA can be inferred. A gene set LUCA-MLx consists of genes whose ancestral probabilities are at least x in their preferred model among (M1) and (M2). Table 1 (column II) shows the number of gene sets that are inferred as ancestral under the different values of x from 0.5 to 1. We construct an ancestral COG list using the probability 0.7; whenever the probability level is not stated, we refer to LUCA-ML 0.7 as LUCA-ML.

Our LUCA-ML is not the same as LUCA1.0 reconstructed in [2], most likely because the two ancestors were inferred using different methods, which were moreover applied to different sets of species and COGs. LUCA-ML 0.7 and LUCA-ML 0.6 share, respectively, about 57% and 50% of their genes with LUCA 1.0, and more than 65% of LUCA 1.0 are included in each of our ML ancestral gene sets.

### Gene content of LUCA-ML 0.7 and LUCA-1.0

The proportion of all COGs that is scored as ancestral is similar in the two reconstructed ancestors - 23% of total in the case of LUCA 1.0 (517 COGs) compared to 26% (597 COGs) in LUCA-ML 0.7. On the other hand, the identity of the COGs in the two sets differs considerably, with only 346 COGs found in both sets.

Figure 1 shows the distribution of the input set of COGs, as well as the inferred ancestral sets, by the number of genomes in which they are found under different models. The numbers of COGs in LUCA 1.0 and LUCA-ML 0.7 are similar for those COGs that are found in more than 80 genomes, but differ considerably for rare COGs: model (M2) and other ML approaches tend to place a higher proportion of sparsely distributed COGs into LUCA.

Figure 1. Distribution of all COGs under models B2 and M2, as well as high-ancestrality COGs (LUCA-ML and LUCA1.0), by the number of genomes in which they are present.

High-level classification of the known and predicted molecular functions of the ancestral COGs is shown in Table 2.

Poorly characterized conserved genes (categories R and S) are more frequent among the COGs that were scored as ancestral by the ML approach only, which correlates with the higher proportion of rare COGs in these categories and the relative favoring of these COGs by the ML approaches. These “high-ancestrality” COGs from the R and S categories account for about 16% of all COGs in these functional groups, and more insight into their function will be useful for better understanding of ancestral biochemistry.

The other extreme in “ancestrality” is represented by the COGs that belong to the category J (Translation Machinery and Ribosome Biogenesis), as well as category E (Amino Acid Biosynthesis). The vast majority of all COGs in these two categories were predicted to be ancestral by all approaches, which may be attributed in large part to their broad distribution in the genomes.

Figure 2 shows the distribution of all COGs by probability of being ancestral under each model, as well as the number of ancestral COGs under different probability cutoffs. The probabilities are well distributed throughout the range, but a considerable fraction of them (at least 15%) are clustered around 0.5. This is the “gray zone” of ancestrality, which may be resolved by future analysis, some directions of which are discussed below.

Figure 2. Probability distribution of the COG ancestrality under various models. The first panel shows the frequency of COGs with the different probability of occurrence at LUCA, and the second panel shows the number of COGs above the different probability thresholds.

## Phylogenetic Tree Distances

### Abstract

Phylogenetic trees are mathematical objects which summarize the most recent common ancestor relationships between a given set of organisms. There is often a need to quantify the degree of similarity or discordance between two proposed trees. For instance, a person may be interested in knowing whether the phylogenetic trees reconstructed from two distinct sequence alignments are truly different, or if the differences are so minor as to be attributable only to statistical variation. In this article we summarize several of the most widely known methods for defining distances between phylogenetic trees, and provide examples of the calculations when feasible.
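As one example of such a distance, the sketch below computes a Robinson-Foulds style distance between two rooted trees: the number of clades (sets of leaves below an internal node) that appear in one tree but not in the other. The abstract does not say which methods the article covers, so the choice of this metric, the nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
def clades(tree):
    """Return the set of leaf sets found below each internal node.

    A tree is a nested tuple; leaves are strings, e.g. (("A", "B"), "C").
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        leaf_set = frozenset().union(*(leaves_below(child) for child in node))
        found.add(leaf_set)
        return leaf_set

    leaves_below(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


if __name__ == "__main__":
    tree_1 = ((("A", "B"), "C"), "D")
    tree_2 = ((("A", "C"), "B"), "D")
    print(rf_distance(tree_1, tree_2))
```

The two example trees differ only in whether A groups with B or with C, so each contributes one clade the other lacks. The count says nothing about whether a difference is statistically meaningful; that requires a separate test.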

## Sharing the whole HeLa genome

In March 2013, a group of researchers at the European Molecular Biology Laboratory sequenced the genome of HeLa cells. With the last decades’ advances in sequencing techniques, the sequencing was done easily. It was also done with good intentions.

The cancer cells, which were first taken from a lump removed from Henrietta Lacks’ cervix months before her death from cervical cancer in 1951, are the most widely used cell line in the world. The cells are hardy and have helped develop many antitumor and viral treatments, including the polio vaccine. However, the genomic data published in 2013, which can be used to glean sensitive medical information about Lacks’ descendants, was shared without their knowledge.

“It’s like, ‘Here we go again, being involved in research without our permission or our consent,’” says David Lacks Jr. He is a grandson of Henrietta Lacks, who was a black tobacco farmer and a mother of five. When Henrietta Lacks went to seek medical attention at Johns Hopkins Hospital for a small mass in her cervix in 1951, the gynecologist on duty, Howard Jones, took a biopsy of the tumor cells. After a diagnosis, the cells made their way to George Gey, the head of tissue culture research at Johns Hopkins, by way of a mutual colleague.

Henrietta Lacks wasn’t asked for permission for her cells to be shared in this manner, although taking samples from patients without permission was a standard practice at the time. While her cells, which divided indefinitely at an unprecedented rate, went on to revolutionize medical research, the Lacks family was kept in the dark until researchers came looking to draw blood samples from family members in the 1970s. The HeLa cells generated billions of dollars of profit for biomedical industries, while the Lacks family was unable to afford medical care and health insurance.

These injustices were brought to the world’s attention with Rebecca Skloot’s bestselling 2010 book, “The Immortal Life of Henrietta Lacks.” Before publishing the book, Skloot established the Henrietta Lacks Foundation, which now has awarded more than 50 grants for education-related, health-care and pre-approved emergency expenses to a number of members of Lacks’ immediate family.

When the genome was put up on the European Nucleotide Archive in early 2013, “there weren’t any policies that said the data couldn’t be made available,” says Dina Paltoo at the National Institutes of Health. Paltoo is the director of the scientific data sharing policy division at the NIH’s office of science policy. “This is pretty much the standard practice in the genomics community, and a lot of journals require that data has been shared before they’ll publish the findings.” A study about the genome and epigenome of the HeLa cells by researchers at the University of Washington was also about to be published in Nature.

After the genomic information was put up in a public database by the German researchers at EMBL, Skloot published an op-ed in the New York Times that garnered a significant amount of attention. NIH Director Francis S. Collins met up with the Lacks family to discuss their options.

“We could leave it out there as is, for the whole world to see, but the issue with that is when you sequence Henrietta Lacks’ genome, you also include family traits of our genome as well,” says Lacks. “We don’t know what would be known 20 years from now with that sequence just being out there for anybody to use and how that would have an effect on us.”

### Reaching a consensus

The family came to the conclusion that the best way to handle the HeLa genomic sequence would be to have researchers apply to access it. “We didn’t want it to be cut off, because the family is unanimously proud of what the cells have helped accomplish,” says Lacks.

Collins and Kathy Hudson, who then was the NIH’s deputy director for science, outreach and policy, put together a working group consisting of bioethicists, geneticists, clinicians and members of the Lacks family. According to the terms of the agreement in August 2013 that the family reached with the NIH, any researchers’ plans to use the data had to meet certain criteria: The data should be used only for biomedical research purposes, the requesters must disclose any commercial plans that they would have for the data, and the requesters would agree to acknowledge the family and the contributions of the cells in any publications and presentations. The study from the University of Washington group, which had been put on hold, appeared in an issue of Nature that ran that month with a discussion of the agreement by Hudson and Collins.

The HeLa Genome Data Access Working Group includes Lacks and Veronica Spencer, a great-grandchild of Henrietta Lacks. The group evaluates requests to access this data and then sends its findings to the advisory committee to the NIH director. That committee then makes a recommendation to Collins, who makes a final decision.

“The NIH director has also reached out to journals and has encouraged them to make sure that investigators that are pursuing publication are abiding by the HeLa genome data use agreement and are also acknowledging the agreement and the family appropriately,” says Paltoo.

David Lacks Jr. (right) and his cousin Jeri Lacks-Whye often speak publicly about the Lacks family’s experiences with the HeLa cell line. PHOTO PROVIDED BY JERI LACKS-WHYE

### Fruits of the database

The NIH’s database of genotypes and phenotypes, or dbGaP, currently contains five data sets related to the sequenced HeLa genome. So far, Collins has approved 47 requests from researchers from 20 different countries. The only rejected request was for a group that didn’t want to share its findings. The two papers that caused the uproar were published after they were approved by the group.

One of those approved investigators is Andrew Adey at Oregon Health & Science University. As a graduate student, Adey was the first author on the University of Washington genome paper led by Jay Shendure.

Early in his career, Adey helped investigate what gives the HeLa cells the ability to divide in such an aggressive manner. The capability arose from the integration of DNA from the human papilloma virus into the genome of a cell in Henrietta Lacks that led to her cervical carcinoma.

“The viral foreign DNA integration that occurred in the HeLa genome happens in some subset of cervical carcinomas, but in this case it happened in a very unfortunate way,” says Adey. “It happened to integrate in a location that activates a cancer gene, so it was really a perfect storm of events that happened in the cell that resulted in this extremely aggressive form of cancer and, ultimately, immortalization of the cell.”

The inserted viral DNA carried the E6 and E7 viral oncogenes, which inhibit tumor suppressors such as the well-known p53. The virus also inserted 30 copies of a regulatory enhancer near a proto-oncogene, MYC, which can cause unregulated cell division when hijacked. This interaction contributed to a much more aggressive form of cancer.

Adey and colleagues recently characterized the stability and heterogeneity of HeLa cells using a technique called combinatorial indexing. The technique allows them to perform single-cell whole-genome sequencing at a higher throughput than was previously possible by barcoding individual cells.

The researchers first applied the technique to cancer cells from an advanced adenocarcinoma and were able to identify subpopulations within the tumor. In future uses, “we’ll be able to sample very low abundance subpopulations,” says Adey. “We might be able to then infer and detect some aspects that could be targetable in a different way than the rest of the tumor.”

In addition to all of the lifesaving medicines developed with HeLa cells, researchers trying to develop new medical technologies can use the HeLa genome as a powerful calibration tool.

“We’re developing new technologies and tools to look at cancer as well as other aspects or other diseases,” says Adey. “When we develop these tools, we want to test them out on something where we know the answer, so that’s what we use HeLa for. We know exactly what it’s going to look like.”

Controlled access to the HeLa genomic data has also resulted in the development of a new analytical method by Shendure’s group. The method involves chromosome-scale scaffoldings to assemble highly contiguous genomes from short reads. The reassembly is made possible by an algorithm that clusters fragments of the genome based on chromatin interaction data sets, which are useful for assigning, ordering and orienting the genomic sequences to chromosomes. The researchers first described the method, for which Shendure has also filed a patent, in a paper in the journal Nature Biotechnology in November 2013. In the paper, the researchers used the HeLa genome as one way to test the method to find interchromosomal rearrangements in cancer genomes.

Additionally, new insights into the effect of the genome’s spatial organization on transcription, which has significant implications for aberrations that occur in diseases, have been made by Yijuan Ruan’s group at The Jackson Laboratory Cancer Center in Bar Harbor, Maine.

While researchers use the HeLa cells to better understand countless aspects of cell biology, Lacks and Jeri Lacks-Whye, another one of Henrietta Lacks’ grandchildren, have traveled to speak to audiences of up to 4,000 about their family and the broader issues raised in Skloot’s book.

“Even though we speak a lot on the book, we’re also starting to speak more on the issues that are encompassed in the book, like health, prosperity and precision medicine,” says Lacks.

“Everybody is going to be sick at some point in time or affected by somebody who’s sick,” he adds. “We want to help scientists find cures.”

The top image, which is the same image shown on this month's cover, is a multiphoton fluorescence image of HeLa cells. Microtubules are in magenta; DNA is in cyan. Image is courtesy of Tom Derrinck at the National Center for Microscopy and Imaging Research.

## 1 INTRODUCTION

Admixture between populations and hybridisation between species are common and a bifurcating tree is often insufficient to capture their evolutionary history (Green et al., 2010; Kozak et al., 2018; Malinsky et al., 2018; Patterson et al., 2012; Tung & Barreiro, 2017). Patterson's D statistic, first used to detect introgression between modern human and Neanderthal populations (Durand et al., 2011; Green et al., 2010), has been widely applied across a broad range of taxa (Fontaine et al., 2015; Kozak et al., 2018; Malinsky et al., 2018; Tung & Barreiro, 2017; vonHoldt et al., 2016). The D statistic and the related estimate of admixture fraction f, referred to as the f4-ratio (Patterson et al., 2012), are simple to calculate and well suited for taking advantage of genomic-scale data sets, while being robust under most demographic scenarios (Durand et al., 2011).
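As a rough illustration of why D is "simple to calculate", the sketch below uses the commonly cited ABBA-BABA form, D = (ABBA − BABA) / (ABBA + BABA), with per-site pattern weights computed from derived-allele frequencies in populations P1, P2, P3 and an outgroup O. This is a minimal sketch, not the implementation of any particular package: the function name and the example frequencies are hypothetical, and it omits the block-jackknife step normally used to assess significance.

```python
def patterson_d(sites):
    """Compute D from per-site derived-allele frequencies.

    `sites` is an iterable of (p1, p2, p3, p_out) tuples, one per variable
    site, where each value is the derived-allele frequency in that population.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3, p_out in sites:
        # ABBA: derived allele shared by P2 and P3 but not P1 or the outgroup.
        abba += (1.0 - p1) * p2 * p3 * (1.0 - p_out)
        # BABA: derived allele shared by P1 and P3 but not P2 or the outgroup.
        baba += p1 * (1.0 - p2) * p3 * (1.0 - p_out)
    if abba + baba == 0.0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


# Hypothetical frequencies for four sites: (P1, P2, P3, outgroup).
EXAMPLE_SITES = [
    (0.0, 1.0, 1.0, 0.0),
    (1.0, 0.0, 1.0, 0.0),
    (0.0, 1.0, 1.0, 0.0),
    (0.5, 0.5, 1.0, 0.0),
]

if __name__ == "__main__":
    print(round(patterson_d(EXAMPLE_SITES), 4))
```

A value near zero is what a tree without gene flow predicts; in this form a positive value means an excess of allele sharing between P2 and P3.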

The D and f4-ratio statistics belong to a class of methods based on studying correlations of allele frequencies across populations and were developed within a population genetic framework (Patterson et al., 2012). However, the methods can be successfully applied for learning about hybridisation and introgression within groups of closely related species, as long as common population genetic assumptions hold, namely that (a) the species share a substantial amount of genetic variation due to common ancestry and incomplete lineage sorting; (b) recurrent and back mutations at the same sites are negligible; and (c) substitution rates are uniform across species (Patterson et al., 2012; Pease & Hahn, 2015).

With more genomic data becoming available, there is a need for handling data sets with tens or hundreds of taxa. Applying the D and f4-ratio statistics has the advantage of computational efficiency and is powerful even when using whole genome data from only a single individual per population (Green et al., 2010). On the other hand, as each calculation of D and f applies to four populations or taxa, the number of calculations/quartets grows rapidly with the size of the data set. The number of quartets is $\binom{n}{4}$, i.e. n choose 4, where n is the number of populations. This can present challenges in terms of increased computational requirements. Moreover, the resulting test statistics are correlated when quartets share an (internal) branch in the overall population or species tree, which may make a system of all possible four taxon tests across a data set difficult to interpret.
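To give a feel for how quickly "n choose 4" grows, the short sketch below counts the quartets for a few data set sizes using Python's standard library. The sizes are arbitrary examples, and the helper name is hypothetical.

```python
import math
from itertools import combinations


def quartet_count(n_populations):
    """Number of distinct sets of four populations (n choose 4)."""
    return math.comb(n_populations, 4)


if __name__ == "__main__":
    for n in (10, 20, 100):
        print(n, quartet_count(n))

    # For a small data set the quartets can simply be enumerated.
    populations = ["pop1", "pop2", "pop3", "pop4", "pop5"]
    for quartet in combinations(populations, 4):
        print(quartet)
```

Going from 10 to 100 populations multiplies the number of quartets by more than four orders of magnitude, which is why run time and memory become a concern.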

Because pinpointing specific introgression events in data sets with tens or hundreds of populations or species remains challenging, the f-branch or fb(C) metric was introduced in Malinsky et al. (2018) to disentangle correlated f4-ratio results and assign gene flow evidence to specific, possibly internal, branches on a phylogeny. The f-branch metric builds upon and formalises verbal arguments employed by Martin et al. (2013) to assign gene flow to specific internal branches on the phylogeny of Heliconius butterflies. Thus, the f-branch statistic can be seen as an aid for formulating gene flow hypotheses in data sets of many populations or species.

Patterson's D and related statistics have also been used to identify introgressed loci by sliding window scans along the genome (Fontaine et al., 2015; Heliconius Genome Consortium, 2012), or by calculating these statistics for particular short genomic regions. Because the D statistic itself has large variance when applied to small genomic windows and because it is a poor estimator of the amount of introgression (Martin et al., 2015), additional statistics which are related to the f4-ratio have been designed specifically to investigate signatures of introgression in genomic windows along chromosomes. These statistics include fd (Martin et al., 2015), its extension fdM (Malinsky et al., 2015), and the distance fraction df (Pfeifer & Kapan, 2019).

Programs for calculating Patterson's D and related statistics include admixtools (Patterson et al., 2012), hyde (Blischak et al., 2018), angsd (Paul et al., 2011; Soraggi et al., 2018), popgenome (Pfeifer & Kapan, 2019; Pfeifer et al., 2014), and comp-d (Mussmann et al., 2020). However, a number of factors call for an introduction of new software. First, most of the existing programs cannot handle the variant call format (VCF) (Danecek et al., 2011), the standard file format for storing genetic polymorphism data produced by variant callers such as samtools (Li, 2011) and gatk (DePristo et al., 2011). Second, the computational requirements of these programs in terms of either run time or memory (or both) make comprehensive analyses of data sets with tens or hundreds of populations or species either difficult or infeasible. Third, the programs implement only a subset of the statistics discussed above, and there are some statistics, namely fdM and f-branch, which have not yet been implemented in any publicly available software package.

To address these issues, we introduce the Dsuite software package. Dsuite brings the calculation of different related statistics together into one software package, combining genome-wide and sliding window analyses, and downstream analyses aiding their interpretation (Table 1). Dsuite has a user-friendly straightforward workflow and uses the standard VCF format, thus generally avoiding the need for format conversions or data duplication. Moreover, Dsuite is computationally more efficient than other software in the core task in calculating the D statistics, making it more practical for analysing large genome-wide data sets with tens or even hundreds of populations or species. Finally, Dsuite implements the calculation of the fdM and f-branch statistics for the first time in publicly available software. While researchers can implement these and other statistics in their own custom scripts, the inclusion of the whole package of statistics in Dsuite facilitates their use and reproducibility of results.

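Table 1. Software for calculating Patterson's D and related statistics (angsd, comp-d, hyde, popgenome, dsuite), compared by VCF input support, genome-wide tests/statistics (D, f4-ratio, f-branch) and sliding window statistics (D, fd, fdM, df).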

## How much of human height is genetic and how much is due to nutrition?

This question can be rephrased as: "How much variation (difference between individuals) in height is attributable to genetic effects and how much to nutritional effects?" The short answer to this question is that about 60 to 80 percent of the difference in height between individuals is determined by genetic factors, whereas 20 to 40 percent can be attributed to environmental effects, mainly nutrition. This answer is based on estimates of the "heritability" of human height: the proportion of the total variation in height due to genetic factors.

Human height is a quantitative, or metric, trait, i.e., a characteristic that is measured in quantity, and is controlled by multiple genes and environmental factors. Many studies have estimated the heritability of human height. Often, these studies determine heritability by estimating the degree of resemblance between relatives. One can separate genetic effects from environmental effects by correlating genetic similarity between relatives (twins, siblings, parents and offspring) with their similarity in height. To accurately measure how genetically similar relatives are, one can measure the number of genetic markers they share. For example, Peter M. Visscher of the Queensland Institute of Medical Research in Australia recently reported that the heritability of height is 80 percent, based on 3,375 pairs of Australian twins and siblings. This estimate is considered to be unbiased, as it was based on a large population of twins and siblings and a broad survey of genetic markers. In the U.S., the heritability of height was estimated as 80 percent for white men. These estimates are well supported by another study of 8,798 pairs of Finnish twins, in which the heritability was 78 percent for men and 75 percent for women. Other studies have shown height heritability among whites to be even higher than 80 percent.

Because different ethnic populations have different genetic backgrounds and live in different environments, however, height heritability can vary from one population to another, and even from men to women. In Asian populations, the heritability of height is much lower than 80 percent. For example, in 2004 Miao-Xin Li of Hunan Normal University in China and his colleagues estimated a height heritability of 65 percent, based on a Chinese population of 385 families. In African populations, height heritability is also lower: 65 percent for the population of western Africa, according to a 1978 study by D. F. Roberts, then at Newcastle University in England, and colleagues. Such differences in heritability are mainly due to the different genetic backgrounds of ethnic groups and the distinct environments (climates, dietary habits and lifestyle) they experience.

Heritability allows us to examine how genetics directly impact an individual's height. For example, a population of white men has a heritability of 80 percent and an average height of 178 centimeters (roughly five feet, 10 inches). If we meet a white man in the street who is 183 cm (six feet) tall, the heritability tells us what fraction of his extra height is caused by genetic variants and what fraction is due to his environment (dietary habit and lifestyle). The man is five centimeters taller than the average. Thus, 80 percent of the extra five centimeters, or four centimeters, is due to genetic variants, whereas one centimeter is due to environmental effects, such as nutrition.
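The arithmetic of this example can be written out as a small Python sketch. It simply applies the article's reading of heritability to one person's deviation from the population mean; the function name is a label chosen here, not an established method.

```python
def split_height_deviation(height_cm, population_mean_cm, heritability):
    """Split a deviation from the population mean into genetic and environmental parts."""
    deviation = height_cm - population_mean_cm
    genetic = heritability * deviation
    environmental = deviation - genetic
    return genetic, environmental


# The man from the example: 183 cm tall, population mean 178 cm, heritability 0.80.
genetic, environmental = split_height_deviation(183.0, 178.0, 0.80)
print(f"genetic: {genetic:.1f} cm, environmental: {environmental:.1f} cm")
# genetic: 4.0 cm, environmental: 1.0 cm
```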

Heritability can also be used to predict an individual's height if the parents' heights are known. For example, say a man 175 cm tall marries a woman 165 cm tall, and both are from a Chinese population with a population mean of 170 cm for men and 160 cm for women. We can predict the height of their children, assuming the heritability is 65 percent for men and 60 percent for women in this population. For a son, the expected height difference from the population mean is 0.65 x [(175 - 170) + (165 - 160)] / 2, which equals 3.25 cm; for a daughter, the difference is 0.6 x [(175 - 170) + (165 - 160)] / 2, which equals 3 cm. Thus, the expected height of a son is 170 + 3.25, or 173.25 cm, and of a daughter 160 + 3, or 163 cm. On the other hand, environmental effects can add 1.75 cm to a son's height, 0.35 x [(175 - 170) + (165 - 160)] / 2, and 2 cm to a daughter's, 0.4 x [(175 - 170) + (165 - 160)] / 2. Of course, these predictions only reflect the mean expected height for each of the two siblings (brothers and sisters); the actual observed height may be different.
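The prediction above follows one rule: take the average of the parents' deviations from their sex-specific means, scale it by the heritability, and add it to the mean for the child's sex. A minimal Python sketch of that rule, with function and parameter names chosen here for illustration, reproduces the numbers (the unrounded value for a son is 173.25 cm):

```python
def expected_offspring_height(father_cm, mother_cm, mean_men_cm, mean_women_cm,
                              heritability, offspring_mean_cm):
    """Expected height = offspring mean + heritability * mid-parent deviation."""
    midparent_deviation = ((father_cm - mean_men_cm) + (mother_cm - mean_women_cm)) / 2
    return offspring_mean_cm + heritability * midparent_deviation


# Values from the example: father 175 cm, mother 165 cm, means 170 cm and 160 cm.
son = expected_offspring_height(175, 165, 170, 160, heritability=0.65, offspring_mean_cm=170)
daughter = expected_offspring_height(175, 165, 170, 160, heritability=0.60, offspring_mean_cm=160)
print(f"son: {son:.2f} cm, daughter: {daughter:.2f} cm")
# son: 173.25 cm, daughter: 163.00 cm
```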

From these calculations, it appears that the environment (mainly nutrition) changes a given offspring's height by only about 2 centimeters in this Chinese population. Does that mean that no matter what happens in the child's environment, the height can never change more than this? Can special treatment and nutrient supplements increase the height further? The answer to the second question is yes. The most important nutrient for final height is protein in childhood. Minerals, in particular calcium, and vitamins A and D also influence height. Because of this, malnutrition in childhood is detrimental to height. In general, boys will reach maximum height in their late teens, whereas girls reach their maximum heights around their mid-teens. Thus, adequate nutrition before puberty is crucial for height.

Reviewer 1: Mikhail Gelfand, Department of Bioengineering and Bioinformatics, Moscow State University, and Institute for Information Transmission Problems RAS, Moscow, Russia

The paper addresses an important problem of selecting a good similarity measure for comparing gene expression patterns. It does not provide definitive answers, but demonstrates correct approaches. The main conclusion, "the choice of a proper measure depends on the biological problem at hand", is difficult to argue against. The following comments are mainly of a discussion and editorial nature.

While the basic assumption, that homologous tissues in different organisms should be more similar in terms of gene expression than tissues in one organism, is reasonable, some caveats are due. For instance, if the tissues in question are very close developmentally, one can easily expect concerted, organism-specific changes in expression. In fact, the paper's results demonstrate exactly that.

The rat spleen and thymus are clustered by all measures (Fig. 1). The human spleen and thymus are clustered by some measures, and I think that clustering [(thymus_rat + spleen_rat) + (thymus_human + spleen_human)] should not be counted as an error, as opposed to a version with human spleen being an outlier: [((thymus_rat + spleen_rat) + thymus_human) + spleen_human]. Similarly, I'd assume that both versions [(muscle_human + heart_human) + (muscle_rat + heart_rat)] and [(muscle_human + muscle_rat) + (heart_human + heart_rat)] are biologically relevant, as opposed to [((muscle_human + heart_human) + muscle_rat) + heart_rat]. Hence, the procedure of counting errors should not be limited to considering pairs of non-clustered homologous tissues, but should take into account finer topological detail (as well as, maybe, branch length).

Authors' response: We agree with the reviewer that there may be more than one biologically relevant clustering solution, and concerted organism-specific co-expression of genes might cause species-specific tissue clusters. However, we believe that in most cases the clustering of non-homologous tissues is directly related to tissue sampling and the number of replicates available. Curiously, the pattern [((thymus_rat + spleen_rat) + thymus_human) + spleen_human] was observed with all four distance measures that we tried. Also note that part of our intention was to demonstrate that in the problem of tissue clustering there is no valid reason to dismiss the correlation-based distance, despite the concerns raised in ref. [13]. Indeed, the correlation-based and Euclidean distances gave the same results in our hands, and even for the binary transformed data the correlation-based distance detected some of the relevant signal.

While this may go beyond the limits of the present study, I think it would be interesting to look in more detail into the cluster trees generated by different measures, and specifically, into which genes contribute most to different clusters, depending on the expression patterns. In doing so, one should keep in mind that in each tissue we observe an averaged expression of genes from a mixture of quite different cell types. For instance, clustering of the spleen, thymus and the bone marrow may be related to blood cell development, while clustering of the spleen, thymus and the pituitary gland may be caused by genes expressed in the gland tissue.

Some hint of analysis is given in the last paragraph of "Distance estimates". The overrepresentation of heart and muscle development genes is not surprising, given the robust clustering of these tissues in all trees. On the other hand, I would question the statement that the Euclidean distance does not provide a functionally meaningful set: one can easily see blood cell development genes there (not surprising given the spleen, thymus and bone marrow data) and neurological process (the source of which is admittedly less clear: could it be the pituitary gland?).

Authors' response: We agree that there is good information in the clusters produced by Euclidean distance, even if there is no single dominant theme there. Note, however, that genes selected using the Euclidean distance tend to be expressed in all tissues at a uniformly low level, while genes selected using correlation-based distance tend to be expressed in several orthologous tissues at a much higher level.

Reviewer 2: Eugene Koonin, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health

The paper by Glazko and Mushegian makes the case that different measures of expression divergence (in particular, Euclidean distances and correlation-based distances) are best suited for revealing different trends in the evolution of gene expression. I would like to strongly endorse this work that shows flexibility which is vital for understanding such a complex phenomenon as evolution of gene expression in multicellular organisms. A versatile approach like this gives the only hope of progress in this field and is a welcome contrast to the common attempts to propose one approach claimed to be best for all purposes.

Authors' response: We appreciate the reviewer's positive comment. Taking a more familiar example of distances between biological sequences, we know that those can be roughly estimated even without an explicit model of sequence evolution, but it is also known that, as sequences diverge, the error of the estimate becomes more and more significant. Similarly, the ultimate goal in gene expression analysis is to have an evolutionary model for gene expression. Short of that, the divergence between expression profiles can be estimated with appropriate distance measures.
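To make the contrast between the two families of measures concrete, the following Python sketch compares a Euclidean distance with a correlation-based distance on toy expression profiles. It assumes the correlation-based distance is defined as 1 minus the Pearson correlation, which is a common convention but not necessarily the exact definition used in the paper; the profiles are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Assumed convention: 1 - Pearson r (0 means identical profile shape)."""
    return 1.0 - pearson_r(x, y)


# Hypothetical profiles of one orthologous gene pair across four tissues.
human = [1.0, 2.0, 3.0, 4.0]
rat = [10.0, 20.0, 30.0, 40.0]  # same shape, ten times the expression level

print(f"Euclidean:   {euclidean_distance(human, rat):.2f}")
print(f"correlation: {correlation_distance(human, rat):.2f}")
```

The two profiles have the same shape, so the correlation-based distance is essentially zero, while the Euclidean distance is large because it is sensitive to the overall expression level.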

Reviewer 3: Subhajyoti De (nominated by Sarah Teichmann), Computational Biology Program, Memorial Sloan-Kettering Cancer Center

In the paper entitled "Measuring gene expression divergence: the distance to keep", Glazko and Mushegian present a discussion about which distance measure to use in inter-species expression divergence analyses. While the topic is of broad interest, I have some comments.

1. How were the transcripts with multiple probes treated? How were the probes that map to multiple genes treated?

Authors' response: The raw data preprocessing step is described in the Methods section.

If a gene had multiple transcripts, how did the authors choose the representative transcript?

Authors' response: Affymetrix Human hgu133a and Rat rgu34a arrays do not provide information about multiple transcripts.

Why was no between-array normalization performed for rat samples?

Authors' response: RMA procedure was implemented for both human and rat arrays.

2. The distributions of Euclidean distance and correlation-based distance for pairs of randomly chosen gene pairs differ in their shapes. Can the authors discuss this issue and also how that may affect their comparative analysis and tree-building?

Authors' response: This is exactly the point of the presented paper. Not only are the distributions for randomly chosen gene pairs different, but the distributions for orthologous gene pairs also differ for all distance measures that we tried. As we have shown in the paper, this difference most certainly may have an effect on the analysis, and the kind of effect depends on the type of the analysis, i.e., on the biological question that is asked.

3. In the recent releases of Ensembl, there are about 14,000 one-to-one orthologs. The authors present results based on 3152 genes. It remains unclear why the dataset analyzed is so small and whether the conclusions made in this paper can be extended to the whole genome dataset.

Authors' response: hgu133a and rgu34a arrays contain 22283 and 8799 probe sets, respectively. After mapping them to unique genes, only 4939 genes for rat were left. The conclusions made in this paper refer to the distance properties and hardly depend on the number of the orthologs studied.

4. In Figure 1 it is not clear how the tree was drawn (e.g. Neighbour joining, Maximum likelihood) and how that method may affect the tree structure. Furthermore, the authors should perform bootstrapping to assess the quality of the trees.

Authors' response: We used average-link clustering for tree inference. As we were interested in how different distance measures affect the tree structure, we applied the same clustering approach to each distance matrix. A different clustering approach may indeed produce trees with different topologies, but we expect that the effect of varying the distance measure would be observed with any clustering algorithm. As for the support of the trees, we expect it to be relatively low given the sample size and the number of replicates, and our focus here is on the qualitative estimate of how different distances perform in the problem of tissue clustering.
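As an illustration of what average-link clustering does with a distance matrix, here is a minimal pure-Python sketch. It repeatedly merges the two clusters with the smallest mean pairwise distance and returns the tree as nested tuples, ignoring branch lengths. The tissue labels and distances are hypothetical, and this is not the authors' code.

```python
from itertools import combinations


def average_link_clustering(labels, dist):
    """Build a nested-tuple tree by average linkage.

    dist maps frozenset({a, b}) to the distance between leaf labels a and b.
    """
    # Each cluster is (subtree, tuple of leaf labels contained in it).
    clusters = [(label, (label,)) for label in labels]

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest mean leaf-to-leaf distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical distances between four tissue samples.
labels = ["heart_human", "heart_rat", "spleen_human", "spleen_rat"]
dist = {
    frozenset(("heart_human", "heart_rat")): 1.0,
    frozenset(("spleen_human", "spleen_rat")): 1.5,
    frozenset(("heart_human", "spleen_human")): 4.0,
    frozenset(("heart_human", "spleen_rat")): 4.2,
    frozenset(("heart_rat", "spleen_human")): 4.1,
    frozenset(("heart_rat", "spleen_rat")): 3.9,
}
tree = average_link_clustering(labels, dist)
print(tree)
# (('heart_human', 'heart_rat'), ('spleen_human', 'spleen_rat'))
```

Running the same function on matrices computed with different distance measures is one way to see how the choice of measure alone changes the topology.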

5. In Figure 2 the histogram bars corresponding to orthologous and random gene pairs should be provided side-by-side. In its current form, it is hard to interpret how the distributions of orthologous gene pairs differ from the random pairs.

Authors' response: We think that bar plots with stacked columns demonstrate the difference between these distributions quite clearly.

6. In Figure 3, the y-axis label is missing. Why does skeletal muscle show high Euclidean and correlation distances that are significantly above those of other tissue types (as seen in the boxplot), with a trend that is consistent in all four panels? Is it an array normalization artifact or a biologically meaningful pattern?

Authors' response: We labeled the y-axis in Figure 3. The meaning of the pattern observed in Figure 3, we believe, is that genes selected using the Euclidean distance tend to be expressed in all tissues at a uniformly low level (close to the background), while genes selected using correlation-based distance tend to be expressed in several orthologous tissues at a higher level.

1. The Ensembl Release version is not provided.

Authors' response: The release version is now included.

2. GO has many functional categories organized in a hierarchical structure. It is unclear which level of GO hierarchy was used in the current analysis.

Authors' response: The levels were chosen based on the significant p-values provided by the enrichment test, and therefore the categories from different levels of the hierarchy could be reported.

3. Tables S1 and S2 carry insufficient detail about the methodology involved and the message they convey. For instance, it is unclear whether the over-represented GO categories in Table S1 arise from analysis of heart tissue. How is the p-value calculated?

Authors' response: We now provide a more comprehensive description of Tables S1 and S2 in Additional file 4. We first identified orthologous gene pairs with expression profiles conserved at the 1% significance level, using different distances. For these gene pairs we performed GO enrichment analysis. Genes identified using correlation-based distance, binary correlation distance, and GA distances shared 15 overrepresented GO categories (Table S1), whereas genes identified using the Euclidean distance were from completely different GO categories (Table S2). This was the lesson learned from the analysis, i.e., that different distances select functionally different conserved orthologous gene pairs. The over-represented GO categories in Table S1 arise from the genes expressed in all tissues and identified as conserved by three different distances. p-values were calculated by hypergeometric test using the GOstat module from Bioconductor.
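For readers unfamiliar with the hypergeometric enrichment test mentioned in the response, the sketch below computes the upper-tail p-value directly from binomial coefficients in Python. It is an illustration of the test itself, not of the Bioconductor module the authors used; the function and parameter names are chosen here, and apart from the 3152 genes mentioned by the reviewer the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample).

    population: genes in the background set
    annotated:  background genes carrying the GO category
    sample:     genes in the selected (e.g. conserved) set
    hits:       selected genes carrying the GO category
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, k) * comb(population - annotated, sample - k)
               for k in range(hits, upper + 1))
    return tail / total


# Hypothetical example: 100 of 3152 background genes carry the category,
# and 8 of 50 selected genes carry it.
p = hypergeom_enrichment_pvalue(population=3152, annotated=100, sample=50, hits=8)
print(f"p = {p:.3g}")
```

A small p-value means that drawing at least that many annotated genes by chance, without replacement, would be unlikely under the background frequencies.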

4. In Figure S3, in each panel, the outliers cross the whisker and also appear to be shifted. Please revise the figure. Also please adjust the y-axis scale in the two bottom panels to make the figures easier to visualize.

Authors' response: In the R implementation, whiskers extend to 1.5*IQR, but the parameters can be adjusted so that outliers are not displayed at all. The message of Figure S3 is that genes with high entropy are not 'genes with a conserved uniform pattern of expression'.

### Why We Procrastinate

In light of the evidence that goal-management ability may be a central underlying problem for both procrastination and impulsivity, executive functions may also be predictive of individual differences in both of these traits, especially at the genetic level (p. 9).

I couldn’t agree more, and I’m confident that findings of future twin studies that include measures of executive function and conscientiousness will take the emphasis off of the risk factor of impulsivity alone in an understanding of the evolutionary etiology of procrastination. In fact, impulsivity can be seen as a failure of executive function, particularly a key function commonly labeled inhibition.

As with all complex behaviors, procrastination does not have a single causal factor such as impulsivity. There are both risk and resilience factors, each of which is partially explained by genetic variation. Of course, this nuanced answer is not such an appealing message for a media headline where we simply want to say “you inherited your procrastination!” We’re eager to read an article that explains our procrastination today as a by-product of human evolutionary history. Doesn’t it feel great to blame it on our genes and evolutionary history? It’s only human after all.

Of course procrastination is only human. I agree. I also agree that impulsivity (“a bird in the hand”) may have paid off for our ancestors, leading to selection for this trait, but so did conscientiousness, that planful, organized approach to life. That’s why we see substantial heritability for this trait as well.

So, before you impulsively (pardon the pun) blame your genes and human evolutionary history for your procrastination and find yet another excuse for justifying needless, self-defeating delay, take a moment to put these new truth claims in the context of your other traits and abilities that show substantial genetic contributions. And, perhaps most importantly, remember that the genetic contributions amount to half of the variability in these traits. The rest is that “nature via nurture” dance where environment makes a great deal of difference. How will you nurture your goal-management ability and better inhibit that only too human desire to impulsively give in to feel good now?

Gustavson, D., Miyake, A., Hewitt, J., & Friedman, N. (2014). Genetic relations among procrastination, impulsivity, and goal-management ability: Implications for the evolutionary origin of procrastination. Psychological Science. DOI: 10.1177/0956797614526260