What is meant by 'genes overlapping inherited structural variants'?

I am reading a journal paper about genes that are related to autism, and I have come across the following statement:

To assess commonality of biological function among rare risk alleles, we compared functional knowledge of genes overlapping inherited structural variants in idiopathic ASD subjects relative to healthy controls

I am not sure what is meant by 'genes overlapping inherited structural variants'. I searched the definition for structural variants and according to Wikipedia, structural variation refers to the variation in structure of an organism's chromosome, such as deletions, duplications, copy-number variants, insertions, inversions and translocations.

However I am not sure what is meant by genes overlapping the inherited structural variants. Does this mean for structural variants that are common between ASD subjects, there are some common genes that are found in these regions (the inherited structural variants)? Any insights are appreciated.

If I understand correctly, what this paper is looking at is rare copy-number variation in ASD cases relative to controls. So in the context of the paper, the passage you quoted is basically saying we found genes that had copy-number variation between cases and controls, i.e. the cases had a pattern of more/fewer copies of a stretch of the genome that includes a gene than the controls. Then they look at the functions of the genes they found in a gene ontology database to see if they seem like they might relate to ASD.
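To make "genes overlapping structural variants" concrete, here is a minimal Python sketch of the underlying interval test. It is only an illustration, not the paper's pipeline: the `Interval` type, the function names, and all coordinates are made up, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # 1-based, inclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def genes_overlapping_svs(genes: List[Interval], svs: List[Interval]) -> List[str]:
    """Names of genes that overlap at least one structural variant."""
    return sorted({g.name for g in genes for sv in svs if overlaps(g, sv)})


# Hypothetical annotation and one hypothetical inherited deletion.
genes = [
    Interval("chr1", 1000, 5000, "GENE_A"),
    Interval("chr1", 8000, 9000, "GENE_B"),
    Interval("chr2", 1000, 5000, "GENE_C"),
]
svs = [Interval("chr1", 4500, 8500, "del_1")]

hits = genes_overlapping_svs(genes, svs)
print(hits)  # ['GENE_A', 'GENE_B']
```

A gene counts as overlapping even if the variant covers only part of it, which is why both `GENE_A` and `GENE_B` are reported here. The resulting gene list is what would then be looked up in a gene ontology database.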

Coronavirus Mutations and Variants: What Does It Mean?

Organisms in general, be they humans, plants, insects, bacteria, or viruses, undergo genetic mutations, which can be beneficial or detrimental. Although viruses are not technically alive, they also mutate and evolve as they infect a host’s cell, replicate, and move on to another cell or a new host. The process by which a virus spreads is called transmission. Mutation rates differ among types of viruses. As an example, SARS-CoV-2, the coronavirus that causes COVID-19, mutates approximately every 11-15 days. That is about half the rate of influenza (flu) and about a quarter of the rate of HIV. Mutations generate variability within a population, which allows natural selection to amplify traits that are beneficial, in this case to the viral particle, as viruses are not considered organisms per se.

We know the coronavirus currently has 12,700 identified mutations, 12 main types of the virus (identified as 19A, the original type, through 20J), five strains, and almost 4000 variants. The strains are known as L, the original strain, which mutated into the S strain, followed by V and G (further mutating into GR, GH and GV, with several infrequent mutations collectively grouped together as O). The G strains are now dominant around the world. SARS-CoV-2 variants carrying the spike (S) protein D614G mutation have become the most common. The mutation is so named because one amino acid is changed from a D (aspartate) to a G (glycine) at position 614 of the viral spike protein. The spike protein mediates binding to the target receptors and fusion with the human cell membrane. The S protein extends from the viral membrane, giving the virus surface a crown-like appearance, from which the virus takes its name: corona is Latin for crown. Most of the variants of concern contain mutations in the receptor-binding domain (RBD). These mutations seem to be responsible for increased viral infectivity, virulence, and immune evasion. The RBD is involved in viral recognition and in cell receptor binding and interaction, so any structural changes seem to be directly related to viral transmissibility and virulence. Numerous studies have also found that antibodies developed against the RBD have maximum potency against SARS-CoV-2.
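The naming convention described above (original amino acid, position, new amino acid) can be unpacked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, and the lookup table covers just the one-letter amino acid codes that appear in the mutation names in this article.

```python
import re

# One-letter amino acid codes used by the mutation names in this article.
AMINO_ACIDS = {
    "A": "alanine", "D": "aspartate", "E": "glutamate", "G": "glycine",
    "H": "histidine", "K": "lysine", "L": "leucine", "N": "asparagine",
    "Q": "glutamine", "R": "arginine", "S": "serine", "T": "threonine",
    "V": "valine", "Y": "tyrosine",
}

# Simple substitutions only: one letter, a position, one letter (e.g. D614G).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label: str):
    """Split a label such as 'D614G' into (original, position, replacement)."""
    match = SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    original, position, replacement = match.groups()
    return original, int(position), replacement


def describe(label: str) -> str:
    """Spell out a substitution label in words."""
    original, position, replacement = parse_substitution(label)
    return (f"{AMINO_ACIDS.get(original, original)} to "
            f"{AMINO_ACIDS.get(replacement, replacement)} at position {position}")


print(describe("D614G"))  # aspartate to glycine at position 614
```

Labels that are not a single substitution, such as `K417N/T` (two alternative replacements) or `Q677` (no replacement given), are deliberately rejected by this sketch rather than guessed at.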

It all starts in the coronavirus RNA genome, which is composed of 30,000 nucleotides, the basic structural units of nucleic acids. The best way to think about it is as a text of 30,000 letters that spell the sequences for 29 genes. The virus itself is a coil of genetic material in a protein shell, usually with an outer envelope. The virus binds to a receptor on a human target cell, injects its genetic material, and takes over the cell, turning it into a virus-replicating factory. As it replicates, mutations can take place that either help or compromise the virus. Many of the identified mutations are inconsequential because they do not change the biology of the virus. Mutations are passed down through a lineage, best described as a branch in the family tree. A group of coronaviruses that share the same inherited set of distinctive mutations is called a variant. The lineage becomes known as a strain, and in this specific example, COVID-19 is caused by a coronavirus strain known as SARS-CoV-2. Through the course of the pandemic, several variants have been identified globally, five of which are of concern because they are associated with higher transmission rates that may affect the effectiveness of vaccines and therapies; increased mortality may also be associated with at least one variant. More recently, several variants have been identified in the United States that share some mutations with the more aggressive variants initially identified in other countries.

The first step in understanding the variants, and their impact on infection, reinfection, vaccines, and treatments, is knowing the mutations. Although thousands of mutations have been identified, seven of them are so far the most critical to know.

D614G Spike Mutation

The D614G Spike mutation was the first mutation of concern identified in China early in the pandemic. This mutation quickly spread around the world allowing the mutated viruses to rapidly replace strains without the mutation. Although it seemed to increase the infectiousness, it was not associated with more severe disease or reduced vaccine effectiveness.

A222V Mutation

The first variant-associated mutation seen in Europe is known as A222V, identified in the B.1.177 (20A.EU1) variant that originated in Spain and dominated the European landscape for months. We do not hear much about it, because it has not been associated with increased transmission.

N501Y Spike Mutation

Of greater concern are the next five mutations. The first is the N501Y Spike mutation, which has been identified in at least three variants of concern. Found at the tip of the Spike protein, this mutation seems to cause a tighter, and thus more effective, fit to the human cell receptors.

E484K Spike Mutation

The E484K Spike mutation is of significant concern as it has been identified not only in three of the global variants, but also in the newly described American variants. It has been observed in vitro that this mutation alters the shape of the proteins in the viral spike, which can potentially mask the antigenic portion from antibodies. There has been much speculation as to whether this mutation may reduce the effectiveness of monoclonal antibody treatments and cause reinfection in some patients.

L452R Mutation

The next mutation, although infrequent in the United States, has been associated with many cases in California. L452R has triggered the emergence of numerous global variants and is present in the two recently identified California variants which also carry other mutations. The L452R mutation might enhance the interaction between virus and host cell, which in turn may significantly increase viral transmission and virulence. It may also reduce the virus-neutralizing ability of antibodies specifically targeting the spike RBD.

K417N/T Mutation

The sixth is the K417N/T mutation located on the tip of the spike protein, an area important to the antibody recognition process. In some experiments the K417N/T mutation has been associated with decreased antibody recognition and possible resistance to some antibodies. The possibility of a more effective virus/cell binding process has also been described.

Q677 Mutation

More recently the Q677 mutation has been described in at least seven lineages initially identified in Louisiana and New Mexico. It’s now in seven states, mainly in the south central and southeast United States. This mutation is four amino acids away from the S1/S2 cleavage site, an area where other mutations have been identified in the more infectious strains. It is unclear currently whether this mutation increases transmission rates.

B.1.1.7 Variant

The Centers for Disease Control and Prevention (CDC) has defined three levels of threat associated with variants. These are variants of interest (B.1.526, B.1.525 and P.2), variants of concern (B.1.1.7, P.1, B.1.351, B.1.427 and B.1.429), and variants of high consequence. In the U.S., only the first two categories have been identified. For the variants of concern there is evidence of increased transmissibility, more severe disease, and reduced therapeutic effectiveness. Globally identified variants have now been detected in multiple states. The variant known as B.1.1.7 was first identified in the United Kingdom and is now present in at least 90 countries and 51 states including Washington. This variant accumulated a high number of mutations, including several in the spike protein. Of the 17 identified, the most notable is the N501Y mutation, which has been found to help the virus form a tighter attachment to the ACE2 receptors. This variant is approximately 50% more infectious than the wild-type virus and is estimated to be doubling in the U.S. every 10 days.

B.1.351 Variant

Just as the B.1.1.7 variant was being identified, another variant with the same type of N501Y mutation was identified in South Africa. This variant is known as B.1.351 and contains additional mutations such as the K417N, and of more concern the E484K mutation. The latter has been identified in 48 countries and 30 states including Washington. In vitro studies have suggested a potential for a blunted immune response, and a small impact on vaccine efficacy.

P.1 Variant

A variant that originated in Brazil was first reported in Japan, where it was identified in four people screened upon arrival at an airport outside of Tokyo. It is postulated that the travelers acquired the variant, known as P.1, while in Brazil, where the lineage is traced back to Manaus, the largest city in the Amazon region. This variant has 17 unique mutations in the spike protein, including the N501Y, E484K and K417N mutations previously discussed. It has been identified in at least 25 countries and 22 states, including Washington. Of particular concern are the anecdotal reports of reinfection in people who had recovered from disease.

CAL.20C – B.1.427 and B.1.429

Several other variants have been identified in the U.S. over the past several months. One of these variants, carrying the L452R mutation, was identified in California and is considered a variant of concern. The variant designated as CAL.20C has two forms: B.1.427 and B.1.429. The mutation is believed to cause a stronger attachment that may prevent neutralizing antibodies from interfering with the attachment process. Additional work is needed to determine its impact on transmissibility and disease severity.

B.1.526 and B.1.525

Identified in New York and traced back to Washington Heights, a Manhattan neighborhood, the B.1.526 variant has two types: one with the E484K spike mutation, which may blunt the antibody response, and another with the S477N mutation, which may increase the effectiveness of the attachment process. The E484K mutation is also present in the Brazilian and South African variants. B.1.526 and B.1.525, in addition to the P.2 variant identified in Brazil, are currently classified as variants of interest.

Novel Midwest Variants

In the Midwest, specifically in Columbus, Ohio, two novel SARS-CoV-2 clade 20G variants have been identified. The predominant variant has several mutations, including Q677H, and has been identified in several states in the upper Midwest. It is referred to as the “Midwest” variant. Subsequently, a second variant has been identified that carries the spike N501Y mutation, a marker of B.1.1.7, but lacks all other mutations associated with that strain. This mutation has also been associated with the South African variant. It will be important to further determine what impact these variants will have on the overall pandemic pattern.

It is important to remember that what most of these variants have in common is a more effective transmission pattern, which has been associated with a surge in infections in various areas of the world. In addition, some data suggest that increased morbidity and perhaps mortality can be associated with some of the variants. The impact on treatment and vaccines is still being determined, although preliminary data points to minimal impact on vaccine efficacy. Moreover, vaccine manufacturers have significant capabilities in the reformulation of vaccines.

We also know the transmission mode is the same as with the wild-type (unchanged) coronavirus, thus preventing infection should follow a similar public health guidance: facial coverings, social distancing, avoiding gatherings, and practicing appropriate hygiene and sanitation. These simple rules along with increased immunization and appropriate levels of testing continue to be the key pillars in our management of the pandemic.

IV. How NCBI Displays Variant Data

Capturing Variant Information

Structural Variation (SV) can be complex to represent. Current technologies rarely provide base pair resolution for variant breakpoints. However, there is a core set of data that captures all the necessary information on a variant, including the degree of uncertainty present in the location of breakpoints. This data set includes:

start-stop coordinates: used to define events where breakpoints are known to basepair resolution. For insertions, start=stop, indicating the base immediately prior to the inserted sequence.

inner start-stop coordinates: used to define regions that are known to be affected by a variant, but do not define the actual breakpoints. The breakpoints lie outside of the defined region.

outer start-stop coordinates: used to define the absolute outer boundary of a variation event but do not define the actual breakpoints. The breakpoints lie inside of the defined region.

allele length: the length of the affected variant. For example, paired-end mapping may identify a 5-kb deletion that is known to reside within a defined 40-kb interval, but its breakpoints are not known. Allele length (in this case, 5 kb) does not have to be exact - approximations are acceptable, depending on the method.
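The core data set described above can be pictured as a single record in which every coordinate pair is optional. The Python sketch below is an assumption-laden illustration, not NCBI's actual schema: the class name, field names, and the ordering check are hypothetical and simply follow the definitions in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SVPlacement:
    """Hypothetical record for one variant call; unknown fields stay None."""
    start: Optional[int] = None          # exact breakpoints, if known
    stop: Optional[int] = None
    inner_start: Optional[int] = None    # region known to be affected
    inner_stop: Optional[int] = None
    outer_start: Optional[int] = None    # absolute outer boundary
    outer_stop: Optional[int] = None
    allele_length: Optional[int] = None  # may be approximate

    def __post_init__(self):
        # Whatever bounds are present must nest: outer around inner.
        bounds = [self.outer_start, self.inner_start,
                  self.inner_stop, self.outer_stop]
        known = [b for b in bounds if b is not None]
        if known != sorted(known):
            raise ValueError(
                "expected outer_start <= inner_start <= inner_stop <= outer_stop")
        if (self.start is not None and self.stop is not None
                and self.start > self.stop):
            raise ValueError("start must not exceed stop")


# The paired-end example from the text: a roughly 5-kb deletion somewhere
# inside a 40-kb interval (coordinates are made up).
deletion = SVPlacement(outer_start=100_000, outer_stop=140_000,
                       allele_length=5_000)

# An insertion known to base-pair resolution: start equals stop.
insertion = SVPlacement(start=250_000, stop=250_000, allele_length=300)
```

Keeping the uncertain bounds separate from `start` and `stop` means a consumer can tell at a glance whether a call has breakpoint resolution or only a bracketing interval.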

Visual representation of Variants

Displaying uncertainty of breakpoint locations

Displaying the uncertainty in defining the regions depends on the combination of coordinates that are associated with the variant:

Start and stop only: This implies that we have breakpoint resolution and is represented simply:

Inner/outer start/stop: Typical of a probe-based method, but could occur with other methods as well. Inner start/stop define region known to be involved with the event. Outer start/stop define region where breakpoint is likely to occur.

Inner start/stop only: May occur in probe studies, curated studies or historical studies.

Outer start/stop only: Likely to occur with mapping studies, but could show up in other studies as well. Note inward-pointing grey arrows, which indicate that the inner boundaries are not known.
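The four display cases listed above depend only on which coordinate pairs a variant carries, so the choice can be sketched as a small decision function. The function name and returned labels below are hypothetical, and the precedence (exact breakpoints win over inner/outer bounds) is an assumption.

```python
def coordinate_style(has_start_stop: bool, has_inner: bool, has_outer: bool) -> str:
    """Pick a display case from the coordinate pairs that are present."""
    if has_start_stop:
        return "breakpoint resolution"       # start and stop only
    if has_inner and has_outer:
        return "inner and outer bounds"      # typical of probe-based methods
    if has_inner:
        return "inner bounds only"           # probe, curated or historical studies
    if has_outer:
        return "outer bounds only"           # likely from mapping studies
    raise ValueError("no coordinates supplied")


print(coordinate_style(False, True, True))  # inner and outer bounds
```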


No previous study has comprehensively compared the accuracies of existing SV detection algorithms. While papers describing new SV detection algorithms often include some benchmarking, they have done so using only a limited number of comparator algorithms. One recent study compared the performance of seven existing MEI detection algorithms [74], and its results correlate well with our evaluation of MEI detection algorithms. Despite the overall consistency in the accuracy rank of algorithms between the datasets (Additional file 1: Figure S12), the recall values for the real data were low overall relative to those for the simulated data. This would be due in part to the presence of overlapping, redundant SVs in the NA12878 reference SV data, because the DGV data is derived from multiple studies. Alternatively, several falsely detected SVs might be included in the reference set. In addition, the lower levels of precision observed in the real data, especially for DUP and INV calls, would be due in part to a number of unidentified DUPs/INVs absent from the NA12878 reference SV dataset. More elaborate refinement of the NA12878 SV reference data, involving experimental validation, should be made in the future. Despite these shortcomings, the recall and precision values for the real data can be treated as relative values for ranking the performance of the algorithms.

Based on our evaluation results, we list the algorithms exhibiting higher precision and recall values for both the simulated and NA12878 real datasets (Table 1, see also Additional file 1: Table S19 for an extended list), although this list can be changed depending on what level of precision or recall is required. It shows the top 2–7 (the top 30% for Table S19) algorithms for each category exhibiting high values of the sum of the normalized F-measures of the simulated and real data and exhibiting short run time (< 200 min in Fig. 5). Overall, GRIDSS, Lumpy, SVseq2, SoftSV, and Manta show good performances in calling DELs of diverse sizes. TIDDIT [75], forestSV [76], ERDS, and CNVnator call large DELs well whereas SV detection algorithms using long reads, including pbsv, Sniffles, and PBHoney, are good at detecting small DELs. For DUP detection, good choices include Wham, SoftSV, MATCHCLIP, and GRIDSS. CNVnator, ERDS, and iCopyDAV [77] achieve good performances in calling large sizes of DUPs. For INSs, MELT, Mobster, inGAP-sv, and SV detection algorithms with long read data would effectively call reliable variants. AS-GENESENG, Control-FREEC, OncoSNP-Seq, and GenomeSTRiP may more accurately detect SVs in other types of applications, such as somatic SV detection or SV calling with whole exome sequencing data or multiple sample data because these algorithms have been more intensively designed for such applications. We also listed the poor performing algorithms in Table S20 in Additional file 1.
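For readers unfamiliar with the metrics used in this ranking, the Python sketch below computes precision, recall, and the F-measure from counts of true positive, false positive, and false negative calls. It is a generic illustration with made-up counts; the normalization applied to the F-measures in the evaluation is not described in this excerpt and is not reproduced here.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Return (precision, recall, F-measure) for one call set.

    tp: calls that match a reference SV
    fp: calls with no matching reference SV
    fn: reference SVs that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts: 80 correct calls, 20 false calls, 120 missed reference SVs.
precision, recall, f_measure = precision_recall_f(tp=80, fp=20, fn=120)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.3f}")
```

Because the F-measure is a harmonic mean, an algorithm with high precision but low recall (as in this example) still scores modestly, which is why it is a convenient single number for ranking.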

In almost all cases, SVs called in common between multiple algorithms exhibit higher precision and lower recall than those called with a single algorithm, but the degree of the increased precision and the decreased recall varies with the specific combination of algorithms, including both short read- and long read-based algorithms. Mills et al. examined the accuracy of overlapping calls between five methods and demonstrated that combining algorithms based on the same method increased precision, but the increase was lower than when combining algorithms based on different methods [14]. This is consistent with our observations. However, combining algorithms based on the same methods gives a moderate increase in precision and a smaller decrease in recall. Previous studies have selected SV calls overlapping between at least two sets from multiple SV call sets in order to increase precision [13, 14, 24,25,26,27,28]. However, this strategy could take overlapping calls from “bad” pairs of algorithms whose overlapping calls give only a small increase in precision with a considerable decrease in recall. It is promising, therefore, to iteratively merge the overlapping calls from selected pairs of algorithms that give high-quality overlapping calls, thereby generating an SV call set with high accuracy and recovery. Furthermore, the use of overlapped calls should also improve the accuracies of the BPs, sizes, and genotypes of the SVs, because we can select the BPs/sizes/genotypes from the algorithms shown in this study to provide higher accuracy for these SV properties.
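The idea of keeping only calls shared by two algorithms can be sketched as follows. This is a minimal illustration under stated assumptions, not the matching procedure used in the evaluation: the 50% reciprocal-overlap threshold, the half-open coordinate convention, the requirement that SV types match, and all names and coordinates are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end, sv_type), half-open interval [start, end)
Call = Tuple[str, int, int, str]


def reciprocal_overlap(a: Call, b: Call) -> float:
    """Smaller of the two overlap fractions; 0.0 if the calls cannot match."""
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    len_a, len_b = a[2] - a[1], b[2] - b[1]
    if len_a <= 0 or len_b <= 0:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / len_a, shared / len_b)


def shared_calls(calls_a: List[Call], calls_b: List[Call],
                 min_ro: float = 0.5) -> List[Call]:
    """Calls from calls_a supported by at least one call in calls_b."""
    return [a for a in calls_a
            if any(reciprocal_overlap(a, b) >= min_ro for b in calls_b)]


# Made-up call sets from two hypothetical callers.
caller_1 = [("chr1", 1000, 2000, "DEL"), ("chr1", 5000, 5600, "DUP")]
caller_2 = [("chr1", 1200, 2100, "DEL"), ("chr1", 5000, 5600, "DEL")]

consensus = shared_calls(caller_1, caller_2)
print(consensus)  # [('chr1', 1000, 2000, 'DEL')]
```

The second call is dropped even though its coordinates agree exactly, because the two callers disagree on the SV type. This shows how an intersection raises precision at the cost of recall.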

Trait association and clinical genetics

Most large-scale trait association studies have only considered SNVs in genome-wide association studies (GWAS). Taking advantage of the sample size and resolution of gnomAD-SV, we evaluated whether SNVs associated with human traits might be in linkage disequilibrium with SVs not directly genotyped in GWAS. We identified 15,634 common SVs (allele frequency >1%) in strong linkage disequilibrium (R² ≥ 0.8) with at least one common short variant (Supplementary Fig. 7), 14.8% of which matched a reported association from the NHGRI-EBI GWAS catalogue or a recent analysis of 4,203 phenotypes in the UK Biobank [33,34]. Common SVs in linkage disequilibrium with GWAS variants were enriched for genic SVs across multiple functional categories (Supplementary Table 6), and included candidate SVs such as a deletion of a thyroid enhancer in the first intron of ATP6V0D1 at a hypothyroidism-associated locus [34] (Extended Data Fig. 7). We also identified matches for previously proposed causal SVs tagged by common SNVs, including pLoF deletions of CFHR3 or CFHR1 in nephropathies and of LCE3B or LCE3C in psoriasis [35,36]. These results demonstrate the value of imputing SVs into GWAS, and for the eventual unification of short variants and SVs in all trait association studies. Given the potential value of this resource, we have released these linkage disequilibrium maps in Supplementary Table 7.
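As a reminder of what the linkage disequilibrium statistic measures, the Python sketch below computes the squared correlation between two biallelic variants from phased haplotypes. It is a textbook calculation on made-up data, offered as an assumption about the general approach; the exact estimator used to build the released maps is not described in this excerpt.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation between two variants.

    Each sequence holds one 0/1 allele per haplotype (1 = alternate allele),
    in the same haplotype order for both variants.
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("need two non-empty haplotype vectors of equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                 # frequency at variant A
    p_b = sum(hap_b) / n                                 # frequency at variant B
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n  # both alternate alleles
    d = p_ab - p_a * p_b                                 # disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a variant is monomorphic")
    return d * d / denom


# Made-up haplotypes: an SV and a nearby SNV across eight chromosomes.
sv_alleles = [1, 1, 0, 0, 1, 0, 0, 0]
snv_alleles = [1, 1, 0, 0, 1, 0, 0, 1]
print(round(ld_r2(sv_alleles, snv_alleles), 3))  # 0.6
```

In this toy example the SNV would fall short of the R² ≥ 0.8 threshold quoted above, so it would not be treated as a strong tag for the SV.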

As genomic medicine advances towards diagnostic screening at sequence resolution, computational methods for variant discovery from WGS and population references for interpretation will become indispensable. One category of disease-associated SVs, recurrent CNVs mediated by homologous segmental duplications known as genomic disorders, are particularly important because they collectively represent a common cause of developmental disorders [37]. Accurate detection of large, repeat-mediated CNVs is thus crucial for WGS-based diagnostic testing as chromosomal microarray is the recommended first-tier diagnostic screen at present for unexplained developmental disorders [37]. Using gnomAD-SV, we evaluated our ability to detect genomic disorders in WGS data by calculating CNV carrier frequencies for 49 genomic disorders across 10,047 unrelated samples with no known neuropsychiatric disease and found that CNV carrier frequencies in gnomAD-SV were consistent with those reported from chromosomal microarray in the UK Biobank [38] (R² = 0.669; Pearson correlation test, P = 7.38 × 10⁻¹³) (Fig. 6a, Supplementary Table 8, Supplementary Fig. 20). The frequencies of carriers of genomic disorders did not vary significantly among populations, with the exception of duplications of NPHP1 at 2q13, in which carrier frequencies in East Asian samples were up to 4.6-fold higher than in other populations, further highlighting the potential for variant interpretation to be confounded by the limited diversity of existing SV references (Supplementary Fig. 21).

Fig. 6: a, Comparison of carrier frequencies for 49 putatively disease-associated deletions (red) and duplications (blue) at genomic disorder loci between gnomAD-SV and microarray analyses in the UK Biobank (UKBB) [38]. Light bars indicate binomial 95% confidence intervals. Solid grey line represents linear best fit. b, At least one pLoF or copy-gain SV was detected in 36.9% and 23.7% of all autosomal genes, respectively. ‘Constrained’ and ‘unconstrained’ includes the least and most constrained 15% of all genes based on LOEUF [4], respectively. c, Carrier rates for very rare (allele frequency < 0.1%) pLoF SVs in medically relevant genes across several gene lists [7,39,44]. SVs per category listed in Supplementary Table 9. d, Carrier rates for very large (≥1 Mb) rare autosomal SVs among 12,653 genomes. Bars represent binomial 95% confidence intervals. e, A complex SV involving at least 49 breakpoints and seven chromosomes (also see Extended Data Fig. 8). Teal arrows indicate insertion point into chromosome 1.

In the context of variant interpretation, the current gnomAD-SV resource will permit a screening threshold of allele frequencies less than 0.1% when matching on ancestry to the populations sampled here, and allele frequencies less than 0.004% globally. In the current release, we catalogued at least one pLoF or copy-gain variant for 36.9% and 23.7% of all autosomal genes, respectively, and 490 genes with at least one homozygous pLoF SV (Fig. 6b, Extended Data Fig. 6e, Supplementary Fig. 22). We also benchmarked carrier rates for several categories of clinically relevant variants in gnomAD-SV. First, 0.32% of samples carried a very rare (allele frequency < 0.1%) SV resulting in pLoF of a gene for which incidental findings are clinically actionable, nearly half of which (that is, 0.13% of all samples) would meet diagnostic criteria as pathogenic or likely pathogenic based upon the American College of Medical Genetics (ACMG) recommendations [7] (Fig. 6c). Second, 7.22% of individuals were heterozygous carriers of rare pLoF SVs in known recessive developmental disorder genes [39]. Third, we estimated that 3.8% of the general population (95% confidence interval of 3.2–4.6%) carries at least one very large (≥1 Mb) rare autosomal SV, roughly half of which (45.2%) were balanced or complex (Fig. 6d). Among these was an example of localized chromosome shattering involving at least 49 breakpoints, yet resulting in largely balanced products, reminiscent of chromothripsis, in an adult with no known severe disease or DNA repair defect [13,14,22] (Fig. 6e, Extended Data Fig. 8). Collectively, these analyses highlight the potential of gnomAD-SV and WGS-based SV methods to augment disease-association studies and clinical interpretation across a broad spectrum of variant classes and study designs.
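Carrier rates such as those above are reported with binomial 95% confidence intervals. The Python sketch below shows one common way to compute such an interval, the Wilson score interval. This is an assumption: the excerpt does not say which binomial method was used, and the counts are made up, so the sketch does not attempt to reproduce the interval quoted in the text.

```python
import math


def wilson_interval(carriers: int, n: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z = 1.959964 gives about 95%)."""
    if n <= 0 or not 0 <= carriers <= n:
        raise ValueError("need 0 <= carriers <= n and n > 0")
    p = carriers / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Made-up example: 38 carriers observed among 1,000 genomes (3.8%).
low, high = wilson_interval(38, 1000)
print(f"carrier rate 3.8%, 95% CI {low:.1%} to {high:.1%}")
```

The interval narrows roughly with the square root of the sample size, so the same 3.8% rate observed in a larger cohort would come with a tighter interval.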


We present the first study to use behavioral phenotyping and genomic methods to address the underlying genetics of personality and behavioral traits in domestic dogs. We identified and resequenced a candidate locus associated with WBS in humans and known to be under positive selection in the domestic dog genome (19). We found that this region also harbors a large number of highly polymorphic SVs in canines, some of which are private to an individual dog or breed. This finding is concordant with the genetic heterogeneity of WBS in humans, where deletions range from 100 kb to 1.8 Mb in size with variable breakpoints, attributed to chromosomal instability (41–43). Therefore, it is not surprising that the same is true for dogs. Here, we identified SVs found in multiple individuals that were significantly associated with one or more quantified behavioral traits informative on HYP and cognition.

Notably, our study revealed a statistically significant association between SVs in GTF2I and GTF2IRD1, basal transcription factors that regulate vertebrate development (44–48), with measures of human-directed social behavior typical of WBS. Haploinsufficiency of GTF2I and GTF2IRD1 has been repeatedly linked to HYP in knockout mice and WBS patients (34, 35, 37, 48, 49). Tellingly, WBS patients with intact GTF2I and GTF2IRD1 did not exhibit HYP (36, 46). Furthermore, a recent study linked GTF2I polymorphisms to social context–dependent salivary oxytocin levels in humans, suggesting a possible mechanism by which GTF2I may exert its effects on sociability (50). The copy number variation associated with WBS is known to reduce transcription of both genes within and flanking the hemizygous deletions, a molecular signature also found in other human syndromes (for example, Smith-Magenis syndrome and DiGeorge syndrome) (42, 51). The causal SVs have been confirmed in a mouse model to reduce transcription, consistent with changes in gene dosage, and result in HYP, delayed growth rates, and cognitive defects (35).

Our third described gene, WBSCR17, has not been previously associated with sociability. However, this gene is up-regulated in cells treated with N-acetylglucosamine, a glucose derivative, suggesting a role in carbohydrate metabolism (52). SVs in WBSCR17 may represent an adaptation to a starch-rich diet typical of living in human settlements, a speculation concordant with a previous study (53).

Two of the SVs most associated with HYP, a trait uniquely displayed in domestic dogs among the canids, were SINE (short interspersed nuclear element) and LINE (long interspersed nuclear element) TEs, subtypes of retrotransposons that have high rates of insertion [for example, 1 in 108 human births has a de novo L1 insertion (54)]. With large phenotypic consequences due to the amplification of a few loci, these mobile elements have been implicated in the evolution of the canid genome (55, 56), as well as canine disease, syndromes, and morphology (57–62). Because of their recent development and strong selective breeding, a simple genetic architecture controlling many canine traits is expected. This has been well documented for a number of canine complex traits, such as behavior (16, 63, 64), coat color (59, 65), body size (60), and leg length (61).

We surveyed these TEs in an extended sampling of wild and domestic canines and found them to be extremely rare in coyotes, whereas other insertions were derived and found only to segregate within domestic dogs. With a larger sample size and leveraging behavioral phenotypes from breed stereotypes, we found a significant association between TE copy number and behavior. Hence, it is conceivable that selection acting on HYP-associated TEs may have helped shape the evolution of the canid family. We further suggest that canine WBS-linked SVs likely contribute to the developmental delay that facilitates ease of forming interspecies bonds and the juvenile-like HYP exhibited toward these social companions into adulthood. This coupling presents an intriguing parallel to the same processes observed in WBS-affected individuals (20). Together, these findings suggest a major role for the TFII-I family of transcription factors in a defining behavioral phenotype of domestic dogs, thereby mapping canine HYP to the genes associated with HYP in humans with WBS. Our study exemplifies the successful strategy of canine genetic studies to fine-map a heterogeneous region, informed by and relevant to an orthologous complex human trait.

In light of our findings, we propose a unifying hypothesis to explain one aspect of canid domestication, where individuals with hypersocial tendencies were favored under selective breeding, accentuating a behavior likely influenced by SVs in the canine WBS locus. Unlike the “human-like social cognition” hypothesis of domestication (3), which argues that dogs developed advanced forms of social cognition otherwise unique to human beings, the HYP hypothesis presented here posits that adult dogs show exaggerated motivation to seek social contact, which is absent in adult wolves. Our findings provide insight into one genetic mechanism by which the hypersocial response of domestic dogs toward humans compared with human-reared wolves can be acted on and shaped by selection during species domestication. This mechanism is expected to predispose dogs for hypersocial responses toward any bonded companion. This is consistent with the finding that domestic dogs appear to maintain, or even increase, the duration of social engagements with humans and conspecifics as they approach adulthood, with the opposite trend found in wolves (66). In summary, our findings suggest that the same region affected by structural variants in human WBS is associated with the exuberant sociability of domestic dogs. The evidence presented here represents a shift regarding the role of domestication in the evolution of canine behavior, from a vehicle of advanced social cognition to one of HYP.


The authors thank Caitlin Clements, Patience Gallagher, Stephanie Kravitz, and Preetha Palasuberniam for their assistance in conducting the literature review for this paper. Dr. Dunn was supported in part by funding from the Center on the Developing Child at Harvard University. Dr. Smoller was funded in part by NIMH grant K24MH094614. Dr. Nugent was funded in part by NIMH grant K01MH087240. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Mental Health or the National Institutes of Health.

Genetic link between face and brain shape

An interdisciplinary team led by KU Leuven and Stanford has identified 76 overlapping genetic locations that shape both our face and our brain. What the researchers didn't find is evidence that this genetic overlap also predicts someone's behavioural-cognitive traits or risk of conditions such as Alzheimer's disease. This means that the findings help to debunk several persistent pseudoscientific claims about what our face reveals about us.

"There were already indications of a genetic link between the shape of our face and that of our brain," says Professor Peter Claes from the Laboratory for Imaging Genetics at KU Leuven, who is the joint senior author of the study with Professor Joanna Wysocka from the Stanford University School of Medicine. "But our knowledge on this link was based on model organism research and clinical knowledge of extremely rare conditions," Claes continues. "We set out to map the genetic link between individuals' face and brain shape much more broadly, and for commonly occurring genetic variation in the larger, non-clinical population."

Brain scans and DNA from the UK Biobank

To study genetic underpinnings of brain shape, the team applied a methodology that Peter Claes and his colleagues had already used in the past to identify genes that determine the shape of our face. Claes: "In these previous studies, we analysed 3D images of faces and linked several data points on these faces to genetic information to find correlations." This way, the researchers were able to identify various genes that shape our face.

For the current study, the team relied on these previously acquired insights as well as the data available in the UK Biobank, a database from which they used the MRI brain scans and genetic information of 20,000 individuals. Claes: "To be able to analyse the MRI scans, we had to measure the brains shown on the scans. Our specific focus was on variations in the folded external surface of the brain -- the typical 'walnut shape'. We then went on to link the data from the image analyses to the available genetic information. This way, we identified 472 genomic locations that have an impact on the shape of our brain. 351 of these locations have never been reported before. To our surprise, we found that as many as 76 genomic locations predictive of the brain shape had previously already been found to be linked to the face shape. This makes the genetic link between face and brain shape a convincing one."

The team also found evidence that genetic signals that influence both brain and face shape are enriched in the regions of the genome that regulate gene activity during embryogenesis, either in facial progenitor cells or in the developing brain. This makes sense, Wysocka explains, as the development of the brain and the face are coordinated. "But we did not expect that this developmental cross-talk would be so genetically complex and would have such a broad impact on human variation."

No genetic link with behaviour or neuropsychiatric disorders

At least as important is what the researchers did not find, says Dr Sahin Naqvi from the Stanford University School of Medicine, who is the first author of this study. "We found a clear genetic link between someone's face and their brain shape, but this overlap is almost completely unrelated to that individual's behavioural-cognitive traits."

Concretely: even with advanced technologies, it is impossible to predict someone's behaviour based on their facial features. Peter Claes continues: "Our results confirm that there is no genetic evidence for a link between someone's face and that individual's behaviour. Therefore, we explicitly dissociate ourselves from pseudoscientific claims to the contrary. For instance, some people claim that they can detect aggressive tendencies in faces by means of artificial intelligence. Not only are such projects completely unethical, they also lack a scientific foundation."

In their study, the authors also briefly address conditions such as Alzheimer's, schizophrenia, and bipolar disorder. Claes: "As a starting point, we used the results that were previously published by other teams about the genetic basis of such neuropsychiatric disorders. The possible link with the genes that determine the shape of our face had never been examined before. If you compare existing findings with our new ones, you see a relatively large overlap between the genetic variants that contribute to specific neuropsychiatric disorders and those that play a role in the shape of our brain, but not for those that contribute to our face." In other words: our risk of developing a neuropsychiatric disorder is not written on our face either.

This research is a collaboration between KU Leuven, Stanford University School of Medicine, University of Pittsburgh, Pennsylvania State University, Indiana University Purdue University Indianapolis, Cardiff University, and George Mason University.

Materials and methods

Plant material

Seeds for 1135 Arabidopsis (A. thaliana) genotypes were obtained from the 1001 genomes catalog of A. thaliana genetic variation. All Arabidopsis genotypes were grown at 22°C/24°C (day/night) under long-day conditions (16 hr of light/8 hr of dark). Two independent replicates were performed, each including the full set of genotypes. The replicates, obtained from independent maternal plants, were grown in randomized fashion. Only accessions from Europe and its surroundings were included in the analyses (Figure 3A), resulting in 797 accessions. A list of the accessions can be found in Supplementary file 1.

GSL extractions and analyses

3 mg of seeds were harvested in 200 μL of 90% methanol. Samples were homogenized for 3 min in a paint shaker, centrifuged, and the supernatants were transferred to a 96-well filter plate with DEAE sephadex. The filter plate with DEAE sephadex was washed with water, 90% methanol and water again. The sephadex-bound GSLs were eluted after an overnight incubation with 110 μL of sulfatase. Individual desulfo-GSLs within each sample were separated and detected by HPLC-DAD, identified, quantified by comparison to standard curves from purified compounds and further normalized to the weight. A list of GSLs and their structure is given in Supplementary file 1A. Raw GSLs data are given in Supplementary file 1B.

Statistics, heritability and data visualization

Statistical analyses were conducted using R software with the RStudio interface. For each independent GSL, a linear model followed by ANOVA was used to analyze the effect of accession, replicate and location in the experiment plate on the measured GSL amount. Broad-sense heritability (Supplementary file 1C) for the different metabolites was estimated from this model by taking the variance due to accession and dividing it by the total variance. Estimated marginal means (emmeans) for each accession were calculated for each metabolite from the same model using the package emmeans (CRAN, 2021a; Supplementary file 1D). PCAs were done with the FactoMineR and factoextra packages (Abdi and Williams, 2010). Data analyses and visualization were done using R software with the tidyverse (Wickham et al., 2019) and ggplot2 (Kahle and Wickham, 2013) packages.
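The heritability estimate described above can be illustrated with a minimal Python sketch. It is a simplification and not the authors' R code: it uses accession as the only factor (the replicate and plate-location terms are omitted), and it assumes the variance partition is taken from ANOVA sums of squares. The function name and input layout are hypothetical.

```python
from statistics import mean


def broad_sense_heritability(measurements):
    """Estimate H2 as SS(accession) / SS(total).

    measurements: dict mapping accession name -> list of replicate values.
    Assumption: accession is the only factor in the model.
    """
    all_values = [v for values in measurements.values() for v in values]
    grand_mean = mean(all_values)

    # Sum of squares explained by differences between accession means.
    ss_accession = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in measurements.values()
    )
    # Total sum of squares around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    # With no variation at all the ratio is undefined.
    return ss_accession / ss_total if ss_total > 0 else float("nan")


if __name__ == "__main__":
    example = {"acc_1": [1.0, 1.2], "acc_2": [3.1, 2.9], "acc_3": [2.0, 2.2]}
    print(round(broad_sense_heritability(example), 3))
```

A value near 1 means that most of the measured variation lies between accessions rather than between replicates of the same accession.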

Maps were generated using the ggmap package (Kahle and Wickham, 2013).

Phenotypic classification based on GSL content

For each accession, the expressed enzyme in each of the following families was determined based on the content (presence and amounts) of short-chained aliphatic GSLs.

MAM enzymes: The total amounts of three-carbon GSLs and four-carbon GSLs were calculated for each accession. Three-carbon GSLs include 3MT, 3MSO, 3OHP and Allyl GSL. Four-carbon GSLs include 4MT, 4MSO, 4OHB, 3-Butenyl and 2-OH-3-Butenyl GSL (for structures and details, see Supplementary file 1). Accessions in which the majority of short-chained aliphatic GSLs contained three carbons in their side chains were classified as MAM2 expressed (Figure 4—figure supplement 1). Accessions in which the majority of short-chained aliphatic GSLs contained four carbons in their side chains were classified as MAM1 expressed (Figure 4—figure supplement 1). The accessions were plotted on a map based on their original collection sites (Figure 4—figure supplement 1).
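The MAM rule above can be sketched in Python as follows. This is an illustration and not the authors' code: the function name and the dictionary input are hypothetical, and leaving ties (including empty profiles) unclassified is an assumption, since the text does not say how they were handled.

```python
# GSLs grouped by side-chain length, as listed in the text.
C3_GSLS = ("3MT", "3MSO", "3OHP", "Allyl")
C4_GSLS = ("4MT", "4MSO", "4OHB", "3-Butenyl", "2-OH-3-Butenyl")


def classify_mam(gsl_amounts):
    """Classify an accession as 'MAM2' (C3 majority) or 'MAM1' (C4 majority).

    gsl_amounts: dict mapping GSL name -> measured amount; missing GSLs count as 0.
    """
    c3_total = sum(gsl_amounts.get(name, 0.0) for name in C3_GSLS)
    c4_total = sum(gsl_amounts.get(name, 0.0) for name in C4_GSLS)
    if c3_total == c4_total:
        # Assumption: ties, including all-zero profiles, stay unclassified.
        return None
    return "MAM2" if c3_total > c4_total else "MAM1"


if __name__ == "__main__":
    print(classify_mam({"Allyl": 5.0, "3MSO": 0.5, "4MSO": 1.0}))
```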

AOP enzymes: The relative amounts of alkenyl GSL, alkyl GSL and MSO GSL were calculated with respect to the total short-chained aliphatic GSL, that is, the amount of each class divided by the total.

The expressed AOP enzyme was determined based on those ratios: Accessions with majority alkenyl GSL were classified as AOP2 expressed. Accessions with majority of alkyl GSL were classified as AOP3 expressed. Accessions with majority of MSO GSL were classified as AOP null. The accessions were plotted on a map based on their original collection sites (Figure 4—figure supplement 2).
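A small Python sketch of this ratio-based rule is given below. It is an illustration under stated assumptions and not the authors' implementation: the function name and arguments are hypothetical, each ratio is taken as the class amount divided by the total short-chained aliphatic GSL, and "majority" is interpreted as the largest of the three ratios.

```python
def classify_aop(alkenyl, alkyl, mso, total):
    """Return 'AOP2', 'AOP3' or 'AOP null' from GSL class amounts.

    total: total short-chained aliphatic GSL amount for the accession.
    Assumption: the class with the largest ratio decides the call.
    """
    if total <= 0:
        # No short-chained aliphatic GSL detected, so no call can be made.
        return None
    ratios = {
        "AOP2": alkenyl / total,   # alkenyl GSL majority
        "AOP3": alkyl / total,     # alkyl GSL majority
        "AOP null": mso / total,   # MSO GSL majority
    }
    # Ties resolve to the first entry in dictionary order.
    return max(ratios, key=ratios.get)


if __name__ == "__main__":
    print(classify_aop(alkenyl=8.0, alkyl=1.0, mso=1.0, total=10.0))
```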

GS-OH enzyme: The ratio of 2-OH-3-Butenyl GSL to 3-Butenyl GSL was calculated only for MAM1-expressed accessions (accessions in which the majority of GSLs contain four carbons in their side chains). Accessions with high amounts of 2-OH-3-Butenyl GSL were classified as GS-OH functional. Accessions with high amounts of 3-Butenyl GSL were classified as GS-OH non-functional. The accessions were plotted on a map based on their original collection sites (Figure 4—figure supplement 3).

Each accession was assigned to one of seven short-chained aliphatic GSL chemotypes based on the combination of expressed enzymes, as follows: MAM2, AOP null: 3MSO dominant; MAM1, AOP null: 4MSO dominant; MAM2, AOP3: 3OHP dominant; MAM1, AOP3: 4OHB dominant; MAM2, AOP2: Allyl dominant; MAM1, AOP2, GS-OH non-functional: 3-Butenyl dominant; MAM1, AOP2, GS-OH functional: 2-OH-3-Butenyl dominant. The accessions were plotted on a map based on their original collection sites and colored based on their dominant chemotype (Figure 4).
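The seven-way assignment above amounts to a lookup on the enzyme calls, which can be sketched in Python as below. The function name, argument names and string labels are hypothetical; only the mapping itself comes from the text.

```python
# Chemotypes that depend only on the MAM and AOP calls.
CHEMOTYPE_BY_ENZYMES = {
    ("MAM2", "AOP null"): "3MSO",
    ("MAM1", "AOP null"): "4MSO",
    ("MAM2", "AOP3"): "3OHP",
    ("MAM1", "AOP3"): "4OHB",
    ("MAM2", "AOP2"): "Allyl",
}


def dominant_chemotype(mam, aop, gsoh_functional=None):
    """Map enzyme calls to the dominant short-chained aliphatic GSL.

    gsoh_functional is only consulted for MAM1/AOP2 accessions.
    """
    if (mam, aop) == ("MAM1", "AOP2"):
        if gsoh_functional is None:
            raise ValueError("GS-OH status is required for MAM1/AOP2 accessions")
        return "2-OH-3-Butenyl" if gsoh_functional else "3-Butenyl"
    return CHEMOTYPE_BY_ENZYMES[(mam, aop)]


if __name__ == "__main__":
    print(dominant_chemotype("MAM1", "AOP2", gsoh_functional=True))
```

Keeping the GS-OH status optional mirrors the text, where that enzyme is only evaluated for MAM1-expressed accessions.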

Environmental and demographic data

Environmental and demographic data (referred to as ‘genomic group’) were obtained from the 1001 genomes website (for geographical and demographic data) and from the Arabidopsis CLIMtools (Ferrero-Serrano and Assmann, 2019) for environmental data. We chose the five variables that captured a majority of the variance in this dataset based on PCA using different combinations of variables. The chosen variables are maximal temperature of warmest month (WC2_BIO5), minimal temperature of coldest month (WC2_BIO6), precipitation of wettest month (WC2_BIO13), precipitation of driest month (WC2_BIO14) and distance to the coast (in km). Each of the above variables (including genomic group) was assigned to each accession.

Environmental models

Linear models to test the effect of geographical and environmental parameters (Figure 3—figure supplement 1 and Figure 4—source data 1) were conducted using the dplyr package (CRAN, 2021b) and included the following parameters:

Figure 3—figure supplement 1 linear models for collection sites: PC score ~ Latitude + Longitude + Latitude * Longitude.

Table 1 and Figure 4—source data 1 for all the data: C length (C3 and C4) or the chemotypes (Allyl and 2-OH-3-Butenyl) ~ Genomic group + Geography (north versus south) + Max temperature of warmest month + Min temperature of coldest month + Precipitation of wettest month + Precipitation of driest month + Distance to the coast + Geography * Genomic group + Geography * Max temperature of warmest month + Geography * Min temperature of coldest month + Geography * Precipitation of driest month + Geography * Precipitation of wettest month + Geography * Distance to the coast.

For the north and the south: C length (C3 and C4) or the chemotypes (Allyl and 2-OH-3-Butenyl) ~ Genomic group + Geography (north versus south) + Max temperature of warmest month + Min temperature of coldest month + Precipitation of wettest month + Precipitation of driest month + Distance to the coast.

Genome-wide association studies

The phenotypes for the GWA studies were each accession's scores for PC1 and PC2. GWA was implemented with the easyGWAS tool (Grimm et al., 2017) using the EMMAX algorithm (Kang et al., 2010) and a minor allele frequency (MAF) cutoff of 5%. The results were visualized as Manhattan plots using the qqman package in R (Turner, 2014).
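To make the MAF cutoff concrete, the sketch below shows what such a filter does in principle. It is not easyGWAS code: the function names and input layout are hypothetical, allele calls are assumed to be coded 0/1 per inbred accession, and treating the 5% cutoff as inclusive is an assumption.

```python
def minor_allele_frequency(alleles):
    """Frequency of the rarer allele among non-missing calls.

    alleles: list of 0/1 calls (0 = reference, 1 = alternate), None = missing.
    """
    called = [a for a in alleles if a is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / len(called)
    return min(alt_freq, 1.0 - alt_freq)


def filter_by_maf(snps, cutoff=0.05):
    """Keep SNPs whose MAF is at least the cutoff (inclusive by assumption)."""
    return {
        snp_id: calls
        for snp_id, calls in snps.items()
        if minor_allele_frequency(calls) >= cutoff
    }


if __name__ == "__main__":
    snps = {
        "snp_rare": [0] * 99 + [1],         # MAF = 0.01, removed
        "snp_common": [0] * 60 + [1] * 40,  # MAF = 0.40, kept
    }
    print(sorted(filter_by_maf(snps)))
```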


Genomic sequences from the accessions for MAM3 (AT5G23020), AOP2 (Chr4, positions 1351568 to 1354216), AOP3 (AT4G03050.2), GS-OH (AT2G25450) and MYB37 (AT5G23000) were obtained using the Pseudogenomes tool.

Multiple sequence alignment was done with the msa package (default settings) in R using the ClustalW, ClustalOmega and Muscle algorithms (Bodenhofer et al., 2015). Phylogenetic trees were generated with the ‘ape’ package (neighbor-joining tree) (Paradis and Schliep, 2019) and were visualized with ggtree package in R (Yu, 2020). Each tree was rooted by the genes matching A. lyrata’s functional orthologue or closest homologue.

Bootstrap analyses (Bootstrap = 100) were done with the ‘ape’ package in R (Paradis and Schliep, 2019), with the same tree inference method as described before. For the MAM3 bootstrap analysis, the accessions with low-quality sequencing were excluded.

Amino acid phylogenies: Sequences were taken from Abrahams et al., 2020, which uses the A. thaliana Col-0 genome and the MAM2 amino acid sequence 1006452109 from the Arabidopsis Information Resource (TAIR) database. Alignments were run using MAFFT (Katoh et al., 2017; Kuraku et al., 2013) and cleaned using Phyutility at a 50% occupancy threshold (Smith and Dunn, 2008). RAxML was used for phylogenetic inference (Stamatakis, 2014) with the PROTCATWAG model (Bootstrap = 1000).
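The idea behind cleaning an alignment at a 50% occupancy threshold can be illustrated with the sketch below. It is not Phyutility itself: the function name, the set of characters treated as missing, and the inclusive comparison are assumptions made for illustration.

```python
def clean_alignment(sequences, min_occupancy=0.5, gap_chars="-?"):
    """Drop alignment columns in which too few sequences have a residue.

    sequences: dict mapping sequence name -> aligned sequence (equal lengths).
    A column is kept when the fraction of non-gap characters is at least
    min_occupancy (inclusive by assumption).
    """
    names = list(sequences)
    if not names:
        return {}
    length = len(sequences[names[0]])
    if any(len(sequences[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    kept_columns = []
    for col in range(length):
        filled = sum(1 for name in names if sequences[name][col] not in gap_chars)
        if filled / len(names) >= min_occupancy:
            kept_columns.append(col)

    return {
        name: "".join(sequences[name][col] for col in kept_columns)
        for name in names
    }


if __name__ == "__main__":
    alignment = {"s1": "MK-L", "s2": "M--L", "s3": "M-AL", "s4": "M--L"}
    print(clean_alignment(alignment))
```

Sparse columns carry little phylogenetic signal and mostly reflect insertions in a few sequences, which is why they are commonly trimmed before tree inference.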


PacBio long read-based de novo genome assemblies of the relevant accession were generated as part of the 1001 Genomes Plus project. The genomes were assembled with Canu (v1.71) (Koren et al., 2017) and polished using the long reads followed by a second polishing step with PCR-free short reads.

