Information

How to compare two RNA transcripts?


In this paper, the cuffcompare RNA package is reported:

Because of the stochastic nature of sequencing, assembly of the same transcript in two different samples may result in transfrags of slightly different lengths. A Cufflinks transfrag was considered a complete match when there was a transcript with an identical chain of introns in the combined annotation.

What does this mean in terms of comparing two transcripts to see if they are the same? Does that mean two transcripts are considered equal if they have the same introns, even some exons are missing in one of the transcript? What do they mean by transfrag? Is there an example?


If some exons are missing in one of the samples then the introns will, by definition, be different. What this allows is for the bounds of the outermost exons to vary a bit. This is particularly useful for exon 1, which often has lower coverage.


Counts vs. FPKMs in RNA-seq

Most of the time, the reason people perform RNA-seq is to quantify gene expression levels. In theory, RNA-seq is ratio-level data, and you should be legitimately able to compare Gene A in Sample 1 vs. Sample 2 as well as Gene A vs. Gene B within Sample 1.

There are two main ways of measuring the expression of a gene, or transcript, or whatever, in RNA-seq data:

  1. counts are simply the number of reads overlapping a given feature such as a gene.
  2. FPKMs or F ragments P er K ilobase of exon per M illion reads are much more complicated. Fragment means fragment of DNA, so the two reads that comprise a paired-end read count as one. Per kilobase of exon means the counts of fragments are then normalized by dividing by the total length of all exons in the gene (or transcript). This bit of magic makes it possible to compare Gene A to Gene B even if they are of different lengths. Per million reads means this value is then normalized against the library size. This bit of magic makes it possible to compare Gene A in Sample 1 to Sample 2 even if Sample 1′s RNA-seq library has 60 million pairs of reads and Sample 2′s library has only 30 million pairs of reads.

(In fact, as this post will show, there are more differences between the two methods than just these – I’ll return to this in the conclusion.)

To my mind, normalizing by exonic length and library size seems like a no-brainer, so I use FPKMs and had never understood why anyone would use counts. But if you really want to defend your analysis you need to be able to answer any question with “Yes, I tried that and here’s what I found,” and so I wanted to repeat my analysis using counts. Meanwhile, a colleague who’s into counts told me that FPKMs apply too much normalization, glossing away some of the difference between one sample and another. Why would that be the case? I decided as long as I was going to repeat my analyses using counts I might as well do a side-by-side comparison with FPKMs in order to really understand how the behavior differs.

To compare the two, I turned to my go-to RNA-seq dataset: Human BodyMap 2.0. For the purposes of this exercise, I’ll look only at known transcripts.

how to calculate FPKMs

how to calculate counts

You can calculate counts using bedtools multicov, but you need a transcript annotation file in BED format to tell bedtools where to look – unlike Cufflinks with the -N 1 setting, multicov is not going to go out and discover novel transcripts for you. To make the counts directly comparable to the FPKMs I calculated earlier, I wanted to use that same transcript annotation file and convert it from GTF to BED format.

Right off the bat, things get complicated. I noticed that the original transcript annotation file has one row per every combination of one transcript with an exon or coding sequence or start or stop codon. Consider PRNP, which only has two exons (exon 1 is the 5′ UTR and exon 2 is the coding sequence and 3′UTR) and really only one transcript – there are no major splicing variations I’m aware of. It has 18 rows in this file.

That’s because 4 distinct versions of PRNP somehow made it into Ensembl’s database as distinct transcripts – some blatantly have the wrong coding sequence coordinates (the real coding sequence ends at chr20:4680625 in hg19).

In any case, if it’s that bad for PRNP you can imagine how many rows are present for genes that legitimately have a lot of splicing variants:

This presents us with a problem. Now, if we wanted counts for each possible exon, we could just use the gtf2bed tool in bedops which will convert that original GTF file into a BED file, line by line:

5 mins). But more likely our unit of analysis is transcripts or gene symbols. If we were to do counts by exon and then group by transcript or gene symbol and take the sum of exon counts, we’d be quadruple-counting each exon in PRNP and counting each exon in TTN many more times than that! What we need is to convert the GTF file to one row per, say, gene symbol, if gene symbol is our unit of analysis.

It turns out that Erik Aronesty‘s ea-utils contains a Perl script to do just that. It is named gtf2bed just like the bedops tool above, so for clarity I’ve renamed it gtf2bed_2.pl . To download and run it:

1 minute. (Warning: if you’re using this post as a pipeline, note that using the resulting BED file without modification can give very nonsensical results for genes listed in multiple loci – see discussion of SNORD60 further on in this post).

If you open the resulting bed file you’ll see that the first three columns are simply the chromosome, (earliest) transcription start site and (latest) transcription end site for the gene – or in other words, the union of all transcribed sites in that gene over all possible transcripts.

Now, gtf2bed_2.pl observes very proper BED12 format and therefore does preserve the exon structure information in the form of the blockSizes and blockStarts columns. But multicov doesn’t read anything but the first three columns. Therefore when you do counts on this file we just created, you’ll be counting introns and exons alike. As far as I can tell by asking around, this is how everyone does their RNA-seq counts.

Contrast this with FPKMs, where Cufflinks will count only exonic reads and will normalize by a gene’s (or transcript’s) total exon length, if you do counts (at least according to this pipeline / unless you do other, fancier stuff) you are including intronic reads. So counts – unlike FPKMs – will be affected by how much pre-mRNA contamination (and therefore intronic coverage) you have in your libraries.

With all that said, next I ran multicov, like so:

Which took about 50 hours of CPU time.

By the way, generating the list of BAM files for this command is annoying this time I did it with echo -n :

The resulting file will have the original 12-column bed file created by gtf2bed_2.pl plus, in this case, 16 additional columns for each of the 16 BAMs I called counts on using multicov.

a couple of covariates

Since FPKMs are, in theory, just counts normalized by library size and transcript size, I figured I should have those two values on hand for this analysis as well. I calculated the library size as number of reads in each BAM with samtools view -c :

Which takes a surprisingly long time (

30 min/BAM), hence the need to submit each one as a job.

The other covariate I wanted was the length of each gene. But which length, you ask? Using the BED file I just created, it’s easy to get the length from the earliest transcription start site to the latest possible transcription end site:

If you want the exonic length, that’s slightly tricker. Obviously, Cufflinks knows this information in some form since it’s used for normalization, so I looked back at the isoforms.fpkm_tracking file from Cufflinks and saw that indeed it has a length value for each transcript. You can pull that out like so:

But genes.fpkm_tracking does not have this for genes, presumably since choosing one length as “the” length for a gene with multiple transcripts is awkward. With some fancier scripting and bedtools merge , you could get the length of the union of all possible exons in a gene, sort of analogous to gene.lengths.txt file we just created, which is the length of the union of all possible transcripts. But I won’t necessarily need that for today.

For the sake of argument, I also calculated straight-up average lengths for each gene symbol, crude though that is. First I grabbed gene symbols and length in bash:

and then just some SQL wrapped in R:

With the counts, FPKMs and covariates in hand I set out to understand how and why these measures differed from one another.

First, the boring setup stuff:

Most basic question: are counts and FPKMs correlated? I would certainly hope so! We can ask this a couple ways. First, let’s ask this question across all gene symbol – tissue combinations.

This is strange. In linear space (Pearson’s correlation), the counts and FPKMs are significantly but barely correlated, with rho = .006. In rank space (Spearman’s correlation) they’re quite strongly correlated, rho = .81. What could these data possibly look like?

This is so extreme: in this view there appear to be basically two kinds of genes: those with some counts but

0 FPKMs, and those with some FPKMs but

0 counts. Amazing that we saw any correlation at all.

This is even true if we take the average value for each gene across the multiple tissues considered here:

The two most extreme outliers were IGHJ6 and SNORD60, so I looked those up individually.

IGHJ6 is only 61 bp long,at chr14:106,329,408-106,329,468, so it’s no wonder that it could have low counts but high FPKMs. SNORD60, on the other hand, is also a short gene, a snoRNA of only 83 bp at chr16:2,205,024-2,205,106. So what is SNORD60′s deal?

First I looked at the raw data:

13-21 million reads but zero FPKMs in many tissues. It didn’t take long to find the source of the problem: in the BED file I used to create counts, SNORD60 is 204 Mb long:

Which turns out to be because in the original GTF file it is listed with three exons in completely different genomic loci.

So when I ran gtf2bed_2.pl to convert this GTF to a BED file, it simply chose the lowest start base and highest end base as the endpoints of a transcript.

It proved surprisingly difficult to find some way of filtering such cases out. The histogram of gene lengths in my BED file is just as extreme as the plots earlier:

Looking for some cutoff to filter out the genes whose length is obviously an error, I Googled “longest human gene” and found DMD, which measures almost 2.3Mb. The histogram of genes ≤ 2.3Mb looks slightly better than the first histogram:

This is closer to the exponential distribution I would expect, though I suspect that there are still some erroneously long genes in this distribution too.

If this subset, of genes < 2.3Mb, is more rational and has at least eliminated some of the most outrageous errors, I would have hoped that it would be possible to explain much of the variability in counts vs. FPKMs within this subset:

But no, linear model of FPKMs

counts gives an R^2 of only .008. Including gene length in the model didn’t help:

And explicitly dividing counts by gene length only helped a little bit, getting us up to an R^2 of .016:

This dataset includes 52,686 Ensembl gene symbols, so I wondered if perhaps the data would be better behaved if we only considered the 23,705 hg19 RefSeq genes. This helped only a little bit, getting us up to an R^2 of .026:

And when I went back to all gene-tissue combinations with this more limited dataset I finally got a rho of .26 for a Pearson’s correlation, and a .83 for Spearman’s.

This is still not as tight a correlation as I had hoped, considering that these two measures are supposed to be broadly measuring the same thing – gene expression – in the exact same dataset. For comparison, when I run my standard gene expression QC pipeline on RNA-seq data for different samples but called using the same pipeline, I often find a Pearson’s correlation between samples of .85 or better. Whereas here, for the same data called with two different pipelines, I get a Pearson’s of only .26. This is perhaps another unfortunate reminder of just how irreproducible gene expression findings can be. The technologies used (including the different bioinformatic pipelines) introduce more variability than is present in the underlying samples themselves.

I figured one possible explanation might be the difference between exonic length and total gene length. Here counts are assessed over total gene length, and I’ve then divided them by total gene length, whereas FPKMs are assessed over exons and normalized by exonic length. Within this relatively well-behaved set of genes ≤ 2.3Mb and in RefSeq, the correlation between total length and exonic length is still only 0.19 in linear space and 0.49 in rank space:

Which suggests that at least part of the problem here is just that counts, which include exons and introns, are measuring something very different than FPKMs, which include only exons.

So it seems these two metrics are just measuring something different, and getting different answers (as evidenced by the low correlation between them). That suggests that at most one of the two methods – counts and FPKMs – is suitable for comparing Gene A to Gene B. At least at a ratio level, that is. Arguably, since the Spearman’s correlation is stronger, both could be okay for ordinal-level analyses.

That’s just comparing Gene A to Gene B. But often the answer we’re looking for in our analyses is to find genes whose expression level correlates with some variable of interest – say, a genotype, drug treatment, or timepoint. Such results will be reproducible between counts and FPKMs only to the extent that counts and FPKMs for each individual gene are correlated across samples. In this case, our “samples” are the 16 different tissues in Human BodyMap 2.0. To assess how reproducible the level of each gene is across different tissues, I made a “volcano plot,” first, of Pearson’s correlations:

The results are much better than I expected:

Pearson's correlation % of genes
positive (p < .05) 83%
none (p > .05) 6%
negative (p < .05) 0.01%
NA* 11%

*The NA values result from rows where either all tissues had 0 counts or all had 0 FPKMs, thus the correlation test failed.

Surprisingly, when I re-ran this with Spearman’s, the results were virtually identical (all of the numbers in the above table were within a fraction of a percent).

So for most genes, the difference between various samples’ expression levels of that gene is at least nominally reproducible between the two metrics considered here: counts and FPKMs. However, I hesitate to assign too much significance to this finding because what I’m using here as my example dataset is expression across different tissues, as opposed to different individuals. Gene expression differences between tissues are pretty large and pretty fundamental to biology, and I would expect differences between individuals to be far more subtle. Whether the same inter-individual differences show up in counts as show up in FPKMs, I can’t say in this example.

conclusions

The name “FPKM” – fragments per kilobase of exon per million reads – implies that FPKM is a measure of gene expression normalized by exonic length and library size, in contrast to raw counts. However in the course of this example I’ve realized there are several other differences between counts and FPKMs:

  • When a read overlaps multiple exon definitions or multiple transcript definitions, Cufflinks makes a decision about which transcript(s) to assign the read to when it calculates FPKMs. The calculation of counts, at least in the simple pipeline I’ve presented here, is not nearly so sophistocated.
  • As a result of that, counts are normally only assessed by gene symbol. If they were assessed by transcript, many reads would be double-counted (or even counted tens of times) as many genes have a multiplicity of transcripts. In comparison there are relatively few genomic loci where two distinct genes overlap.
  • FPKMs only count exonic alignments, counts (at least this pipeline) include introns. A gene’s total length (including introns) is only modestly correlated with its exonic length (rho = .19), so this makes a big difference.
  • Count-generating pipelines are generally not capable of transcript discovery. Instead you have to feed them a list of genomic loci with known genes (with FPKMs this is optional). It’s important to be careful that the merging of transcripts into one row per gene doesn’t create nonsensical results like we saw for SNORD60 above.

All of these differences seem to contribute to accounting for why the FPKMs and counts that I called here – on the exact same dataset – have so little correlation with each other (R^2 < .01 even after removing gene length outliers). In spite of this, the FPKMs and counts for any one gene may be somewhat more reproducible, though this analysis considered different tissues (which have enormous differences in gene expression) and not different individuals (which have subtle differences in gene expression).

Since counts and FPKMs seem to be measuring pretty different things, it’s up for debate which is the more valid measurement. I’ll put myself out there and argue a bit for FPKMs. mRNA-seq libraries are enriched for mRNAs, usually through polyA selection, thus hopefully eliminating most intronic coverage. Given that you’re using a laboratory method specifically to only get mRNAs, your pipeline should match that and only count exons. Clearly, FPKMs also represent a more sophisticated method involving assignment of reads to particular transcripts and normalization for exonic length and library size, all good things. I haven’t heard anyone deny this the argument I’ve heard for counts has been that they’re a different measurement that may have more variability and more power for certain things. But nothing I’ve seen here has convinced me that this extra variability reflects anything meaningful that you would want to analyze.

That said, my original motivation for this post – you always want to do the analysis both ways so you can answer any questions – still stands.

About Eric Vallabh Minikel

Eric Vallabh Minikel is on a lifelong quest to prevent prion disease. He is a scientist based at the Broad Institute of MIT and Harvard.


Interactions between RNA polymerase and the core recognition element are a determinant of transcription start site selection

During transcription initiation, RNA polymerase (RNAP) holoenzyme unwinds ∼13 bp of promoter DNA, forming an RNAP-promoter open complex (RPo) containing a single-stranded transcription bubble, and selects a template-strand nucleotide to serve as the transcription start site (TSS). In RPo, RNAP core enzyme makes sequence-specific protein-DNA interactions with the downstream part of the nontemplate strand of the transcription bubble ("core recognition element," CRE). Here, we investigated whether sequence-specific RNAP-CRE interactions affect TSS selection. To do this, we used two next-generation sequencing-based approaches to compare the TSS profile of WT RNAP to that of an RNAP derivative defective in sequence-specific RNAP-CRE interactions. First, using massively systematic transcript end readout, MASTER, we assessed effects of RNAP-CRE interactions on TSS selection in vitro and in vivo for a library of 4(7) (∼16,000) consensus promoters containing different TSS region sequences, and we observed that the TSS profile of the RNAP derivative defective in RNAP-CRE interactions differed from that of WT RNAP, in a manner that correlated with the presence of consensus CRE sequences in the TSS region. Second, using 5' merodiploid native-elongating-transcript sequencing, 5' mNET-seq, we assessed effects of RNAP-CRE interactions at natural promoters in Escherichia coli, and we identified 39 promoters at which RNAP-CRE interactions determine TSS selection. Our findings establish RNAP-CRE interactions are a functional determinant of TSS selection. We propose that RNAP-CRE interactions modulate the position of the downstream end of the transcription bubble in RPo, and thereby modulate TSS selection, which involves transcription bubble expansion or transcription bubble contraction (scrunching or antiscrunching).

Keywords: RNA polymerase promoter transcription bubble transcription initiation transcription start site selection.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Analysis of effects of sequence-specific…

Analysis of effects of sequence-specific RNAP–CRE interactions by MASTER (11). ( A )…

Model for TSS selection and…

Model for TSS selection and hypothesis for effects of RNAP–CRE interactions on TSS…

Effects of disrupting RNAP–G CRE…

Effects of disrupting RNAP–G CRE interactions in vitro: analysis by MASTER. ( A…

Effects of disrupting RNAP–G CRE…

Effects of disrupting RNAP–G CRE interactions in vitro: analysis by primer extension. (…

Effects of disrupting RNAP–G CRE…

Effects of disrupting RNAP–G CRE interactions in vivo: 5′ mNET-seq analysis of 4…

Effects of disrupting RNAP-G CRE…

Effects of disrupting RNAP-G CRE interactions in vivo: 5′ mNET-seq analysis of E.…


2.5 – Transcription and Translation

A complimentary copy of the DNA is made in the nucleus to form the mRNA. This process is catalysed by the enzyme RNA polymerase. To copy the mRNA, the DNA double helix is unwound by DNA helicase, with the hydrogen bonds breaking between the base pairs to be copied. The DNA opens at the transcription site, or position of the gene that needs to be copied.

The coding strand, or the sense strand, is the template for the mRNA. However, the mRNA is actually built against the anti-sense strand. It has the same pattern as the opposite strand due to complimentary base pairing.

The free nucleotides pair with the DNA nucleotides. The only difference is that uracil replaces thymine, bonding to adenine. The RNA polymerase forms the phosphodiester bonds to make the backbone of the mRNA molecule. The mRNA then detaches and leaves the nucleus via the nuclear pores in the membrane. It enters the cytoplasm for reading at the ribosomes. The DNA double helix reforms.

2.5.3 – Describe the genetic code in terms of codons composed of triplets of bases

Each sequence of three bases codes for one amino acid, called a triplet code. These groups of three are called codons.

For every amino acid, it has two or three triplets which code for them. Other triplets act as the ‘start’ or ‘stopcodons, which define where to begin and end the polypeptide sequence.

There are also multiple triplets which code for these ‘punctuation’ codons.

2.5.4 – Explain the process of translation, leading to polypeptide formation

The amino acids are activated by combining with tRNA (transfer RNA) in the cytoplasm. tRNA molecules are in the shape of a clover leaf. Each molecule binds to a specific amino acid codon, the other end binding to the amino acid. The other end has an anticodon, which
is the complimentary codon for the mRNA. The tRNA binds to the amino acid, catalysed by an enzyme. This process uses ATP.

Once the mRNA molecule has been transcribed, it is sent to the ribosome in the cytoplasm or endoplasmic reticulum for translation. The protein is formed from the polypeptides, which are built up at the ribosomes. The ribosomes move along the mRNA the ‘read’ the code, beginning at the start codon.

From here, the tRNA molecules, with their amino acids, find their complimentary codon on the mRNA. The amino acids are bound in the ribosomes to form the polypeptide chains. The tRNA then separates from the amino acid and the mRNA, and is sent back to the cytoplasm to find more amino acids. This process continues until a stop codon is reached, at which point the polypeptide chain is released.

In order to provide enough free amino acids for translation, heterotrophs consume them in the protein of their diet.

The first codon on the mRNA molecule is AUG, the start codon, which bonds to the anti codon [UAC] on the tRNA molecule. This tRNA molecule carries the amino acids Methionine. Codon to anti-codon binding is anti-parallel.

The polypeptides formed with fold into their shape for the protein as a result of various intermolecular forces.

The process continues until the complete polypeptide is formed.

2.5.5 – Discuss the relationship between one gene and one polypeptide

The theory is that one gene forms one polypeptide. This is true in most cases, however there are a few exceptions:


Transcription and translation are two different steps of gene expression. We can identify the difference between transcription and translation based on several factors such as a template, raw material, location, product, enzymes involved, etc. Primarily, transcription is the process of producing a mRNA molecule from a DNA template of a gene. On the other hand, translation is the process of producing an amino acid sequence of a protein from a mRNA molecule. Therefore, this is the key difference between transcription and translation.

Furthermore, based on the raw material, the difference between transcription and translation is that transcription requires four types of ribonucleotides as its raw materials while translation requires 20 different amino acids as its raw materials. Similarly, transcription occurs in the nucleus while translation occurs in the ribosomes. Hence, this is the difference between transcription and translation in relation to the location of occurence. More differences between transcription and translation are shown in the below infographic.


Calculating most abundant transcript from RNA-Seq data

vcf2maf uses VEP to annotate variants, and I believe selects the default Ensembl transcript to use for annotation. Sometimes the transcript that VEP selects is not the transcript I'm interested in, usually because the selected transcript is not the most highly expressed transcript in my tissue of interest (skin). vcf2maf allows you to provide a transcript override list so that VEP annotates the variant using the specified transcripts instead.

I have several skin samples sequenced with bulk RNA-Seq. I want to estimate the average abundance for each transcript across all samples and then use these abundances to rank transcripts from most to least abundant. Then I will use the most abundant transcript as the default VEP transcript. I plan to use salmon or kallisto to quantify transcript abundance. Should I use TPM or normalized counts to calculate average expression?

My initial thought is to use normalized counts (generated by DESeq2 from raw counts). Are there any problems with this approach? GTEx displays transcript abundance with average TPM, but I thought TPM was inappropriate to use across samples because it doesn't account for between sample differences.

Update: I forgot to mention I also tried using TPM ranks like @ATpoint describes. I haven't fully compared how this compares to transcripts identified by normalized counts, but the initial genes I checked showed good concordance between methods


The improved database of Chimeric Transcripts and RNA-seq data, ChiTaRS-5.0

The ESTs and mRNAs from GenBank have been used to identify chimeric RNAs of two or more different genes. By analyzing hundreds of thousands of chimeric ESTs by RNA sequencing, we found that the expression level of chimeric ESTs is generally low and they are highly tissue specific in normal cells.

Here we present the improved version of the ChiTaRS database (ChiTaRS-5.0) with more then (66,243 + 41,584 + 3,052 + 19 + 67 + 20 + 292 + 305) = 111,582 chimeric transcripts in humans, mice, fruit flies, rats, zebrafishes, cows, pigs, and yeast. In the current version we extended the experimental data evidence as well as included a novel type of the sense-antisense chimeric transcripts of the same gene confirmed experimentally by RT-PCR, qPCR, RNA-sequencing and mass-spec peptides. In addition, we collected 23,167 human cancer breakpoints with the expression levels of chimeric RNAs confirmed by the paired-end RNA-sequencing experiments in different tissues in humans, mice and fruit flies.

This web site is optimized for use with the Google Chrome, Mozilla Firefox or Opera desktop web browsers. If you find an issue and you are not using Google Chrome or Mozilla Firefox or Opera, please try using Google Chrome or Mozilla Firefox or Opera to see if the issue appears to be browser specific.


Background

Grape (Vitis vinifera) is the most widely grown fruit crop globally. The area under grape cultivation is approximately 7.8 million hectares with a production of about 67.5 million tons. The berries are categorized mainly into table grapes (fresh) and wine grapes (wine), as well as for several value-added products [1]. China is the leading grape-producing country, accounting for 14% of the global grape production [2].

There are several developmental and metabolic processes that occur in the buds and twigs of grape plants during the winter period. These processes include enzyme synthesis, respiration, cell division, photosynthesis, growth stimulator production and growth inhibitor down-regulation. Dormancy is a controlling mechanism that enables woody perennials to adapt seasonal environmental changes and thus affects the following season’s vegetative growth and fruit production. Currently, global warming has a substantial influence on winter chilling accumulation and dormancy release of fruit trees [3]. To ensure sustainable fruit production, it is necessary to investigate the underlying genetic factors responsible for controlling dormancy [4]. Extended dormancy is a key hindrance for the large scale fruit production, including grape, in warm or mild winter regions under temperate and subtropical climates [5, 6]. Several studies have been conducted to determine the association between natural and chemical-induced ED, analyze gene expression during long and short photoperiods, and identify the transcript profile of bud development and signaling of bud dormancy break in grape [7–10]. Dormancy is generally classified into three main types: paradormancy (PD), endodormancy (ED), and ecodormancy (ECD) [11]. PD is the plant growth suspension initiated by factors outside the meristem. It is essentially the effect of one organ on another and involves the dominance of apical buds. ED is regulated by internal growth inhibitors, even under favorable conditions without exposure to cold temperature for a specific duration (chilling requirement), endodormant buds (EDBs) cannot initiate growth. Exposure to low temperature (2–9 °C) shifts the ED state of the plant to ECD. ECDBs can break and grow when exposed to suitable growth conditions [12]. When EDB’s chilling requirement are fulfilled, the ED is released. EDBs steadily transition to the ECD state, especially under adverse environmental conditions. Summer buds (SB), which are green in color and small in size and grow on one side of winter buds that have no scales, can be observed after dormancy release during the new growth period and remain active for a short time during the transition from dormancy release to early summer dormancy. Like other perennial deciduous fruit plants, grape undergoes a characteristic dormant period during its growth cycle. In southeast China, grape buds fulfill their chilling requirement in the end of February and blossom in following spring. Inadequate cold accumulation hours during this period lead to irregular flowering, which consequently decreases fruit production.

The investigations have been made on dormancy at physiological as well as molecular levels in different deciduous fruits. MADS-box (DAM) genes associated with dormancy-have been isolated to investigate their expression pattern in some fruit plants during dormancy [12, 13]. For example, DAM1 through DAM6 have been identified in peach and Japanese apricot [14, 15], while MADS13-1, MADS13-2, MADS13-3, PpMADS1 and PpMADS2 were found in Japanese pear and Chinese white pear (Suli) [16, 17]. The expression profile of these genes during the induction and release of endodormancy indicated that DAMs serve as dose-dependent inhibitors of bud break [15]. Additionally, several other genes are involved in the complex molecular network regulating dormancy in deciduous plants. Therefore, segregating single gene is not sufficient for illuminating underlying molecular processed associated with bud dormancy [13].

Recently, the next-generation sequencing (NGS) technology has uplifted the transcriptomic by allowing the RNA-sequencing using cDNA libraries on a large scale. RNA-seq is a highly efficient and modern tool that involves deep sequencing technologies to generate millions of short cDNA reads which is considerably more efficient than microarray analysis [18]. In previous studies, RNA-seq was successfully applied to investigate dormancy based on direct sequencing of cDNAs in several woody plants using 454-pyrosequencing technology [19]. Moreover, in another study the transcriptomic analysis revealed the dormancy-related regulatory pathways involving photoperiod, hormones and circadian clocks [20–22]. Although previous studies have investigated the physiological as well as the molecular mechanism of bud dormancy using the transcriptomic approach in deciduous fruits as well as other crops [13, 16, 23], no attempt has yet been made to study grape bud dormancy at the transcriptomic level.

This study was undertaken to investigate underlying molecular processes regulating bud dormancy in grape and to develop robust foundation for molecular research. RNA-seq technology was used to categorize and characterize the expression profile of differentially expressed genes (DEGs) during three different grape bud dormancy stages. This novel transcriptome and transcript expression profiling data generated through RNA-seq will offer an improved understanding of underlying molecular process of bud dormancy and will pave the way to identifying key genes involved in dormancy for the ultimate improvement of table grape industry.


Transcription vs Translation

The difference between transcription and translation is that transcription involves the creation of mRNA from DNA whereas translation does the protein synthesis by using the mRNA strands. In molecular biology, the decoding of DNA into mRNA is done by transcription and the development of proteins by RNA is done by translation is defined as the important and central dogma.

The first step in the gene expression is called Transcription where enzyme RNA polymerase copies the genes from the particular segment of the DNA into mRNA (messenger RNA).

DNA bases get bound to the appropriate nucleosides after DNS helix unwinds and then connect to the matching RNA segment of the DNA strand to make a complementary RNA i.e. mRNA.

The translation is the second step and happens after transcription where mRNA is converted further into the required proteins. In this, mRNA gets attached to ribosomes and further decoded to specific amino acids that form polypeptide by connecting each other, and then makes the protein.


Compare and contrast the structure and functions of DNA and RNA.

DNA and RNA are both essential components of cells, and thus life. The structure of DNA is similar to that of RNA in that these are both made of nucleotides, which in turn are made of the same basic units: a phosphate group, a pentose sugar, a nitrogenous base. The pentose sugars differ between DNA and RNA - DNA has deoxyribose (missing OH group on 2' carbon) while RNA has ribose (has OH group on 2' carbon). Both DNA and RNA nucleotides have one of four possible nitrogenous bases, two of which are pyrimidines and two which are purines. Both DNA and RNA can have cytosine, adenine and guanine as nitrogenous bases - however RNA has uracil as its final base while DNA has thymine. Structurally, single strands of DNA and RNA are similar as nucleotides are linked together by covalent, phosphodiester bonds created between the 5' phosphate group of one nucleotide and the 3' hydroxyl (OH) group of another. However, DNA is typically found in a double-helix format which is antiparallel while RNA is typically found as single-stranded.
RNA and DNA also differ in their functionality. DNA is the means by which all genetic information is stored within the cells, and propagation by semi-conservative replication can allow for the genesis of new cells (meiosis, mitosis) or organisms (fusion of haploid gametes). Thus, DNA is only found in the nucleus and mitochondria where it is super-condensed into chromatin and then organized into chromosomes. By contrast, RNA is created by the process of transcription and enables the process of translation - the expression of genetic information in the form of proteins. This happens in two ways: strands of mRNA code for differing sequences of amino-acids - which form proteins - via the codon code, and some RNA can form translation machinery (ribosomes, tRNA) via secondary double-stranded structures. In the latter case, RNA complementary base-pairing differs to DNA pairing as uracil - instead of thymine - now pairs with adenine.