Gene expression was historically evaluated with Northern blotting and RT-qPCR. RNA sequencing (RNA-Seq) moved to the forefront of next-generation sequencing (NGS) applications in the mid-to-late 2000s and rapidly emerged as a standard tool for comprehensive analysis of an organism’s transcriptome. Over the years, the technology has evolved to expand the breadth and scope of RNA-Seq, which now includes detection of various transcript types, sequencing of intact and degraded samples alike, and analysis down to the single-cell level.
A transcriptome encompasses the full range of coding and non-coding RNA transcripts expressed by an organism, also referred to as total RNA (Table 1). In contrast with the genome, the transcriptome actively changes due to many factors, including an organism’s developmental stage, environmental conditions, tissue type or even cell type. Messenger RNA (mRNA) is usually of primary interest; however, it accounts for only about 5% of the total RNA present in eukaryotic cells. Roughly 80% is rRNA and 10-15% is tRNA, with the remaining <1% made up of a mix of lncRNA, miRNA and other small RNAs.
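To make the proportions above concrete, the following minimal sketch converts them into expected masses for a given amount of total RNA. The function name and the 1000 ng input are illustrative assumptions; the fractions are the approximate values quoted in the text, and real samples will vary.

```python
# Approximate mass fractions of total RNA in a eukaryotic cell, taken from
# the text above. tRNA is quoted as a range (10-15%), so every class is
# stored as a (low, high) pair.
RNA_FRACTIONS = {
    "rRNA": (0.80, 0.80),
    "tRNA": (0.10, 0.15),
    "mRNA": (0.05, 0.05),
}


def expected_mass_ng(total_rna_ng, rna_class):
    """Return the (low, high) expected mass in ng of one RNA class."""
    low, high = RNA_FRACTIONS[rna_class]
    return (total_rna_ng * low, total_rna_ng * high)


if __name__ == "__main__":
    # Hypothetical input: 1000 ng (1 microgram) of total RNA.
    for rna_class in RNA_FRACTIONS:
        low, high = expected_mass_ng(1000, rna_class)
        print(f"{rna_class}: {low:.0f}-{high:.0f} ng per 1000 ng total RNA")
```

Under these assumptions only about 50 ng of a 1000 ng extraction is mRNA, which illustrates why the RNA of interest is usually enriched before sequencing.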
| Class | Coding | Non-coding | Non-coding | Non-coding | Non-coding | Non-coding | Non-coding |
| Type | Messenger RNA (mRNA) | Ribosomal RNA (rRNA) | Transfer RNA (tRNA) | Long non-coding RNA (lncRNA) | MicroRNA (miRNA) | Short interfering RNA (siRNA) | Piwi-interacting RNA (piRNA) |
| Length | >200 nt | 120-5000 nt | 75-95 nt | >200 nt | 21-25 nt | 20-25 nt | 26-31 nt |
| Standard modifications | Eukaryotes: 5' cap, 3' poly(A) tail | 2'-O-methylation of ribose | 5' phosphate, 3' hydroxyl | N6-methyladenosine (m6A) | 5' phosphate, 3' hydroxyl | 5' phosphate, 3' hydroxyl | 5' phosphate, 3'-end 2'-O-methylation |
| Organisms | All | All | All | Eukaryotes | Animals, Plants | Eukaryotes, Viruses | Animals |
Table 1. Types of RNA in the transcriptome
RNA sequencing (RNA-Seq) can be used to simultaneously measure the expression of thousands of genes under one condition or to compare expression across multiple conditions, an analysis known as differential gene expression (DGE). In today’s research, RNA-Seq is an indispensable tool and the most frequently used technology for transcriptome analysis, enabling high-throughput profiling of coding and non-coding RNA at single-nucleotide resolution.
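The comparison across conditions can be illustrated with a minimal sketch that scales raw counts to counts per million (CPM) and reports a log2 fold change per gene. The gene names, counts, and pseudocount are hypothetical, and this is not a substitute for a statistical DGE analysis, which models biological replicates and variance.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(cpm_reference, cpm_test, pseudocount=1.0):
    """Return log2(test / reference) per gene.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in one condition.
    """
    return {
        gene: math.log2(
            (cpm_test[gene] + pseudocount) / (cpm_reference[gene] + pseudocount)
        )
        for gene in cpm_reference
    }


# Hypothetical raw counts for three genes in two conditions.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treated = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(counts_per_million(control), counts_per_million(treated))
for gene, value in lfc.items():
    print(f"{gene}: log2 fold change = {value:+.2f}")
```

A positive value indicates higher expression in the treated sample and a negative value indicates lower expression; values near zero indicate little change.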
The process begins with harvesting total RNA from cells and purifying the RNA molecules of interest. Single-stranded RNA is then converted into double-stranded complementary DNA (cDNA) in a reverse transcription reaction. Sequencing adapters and barcodes are then added to create RNA-Seq libraries that are subsequently analyzed with next-generation sequencing (NGS). The resulting sequences, called ‘reads’, are then mapped to the reference genome if one is available. The number of reads aligning to a gene reflects the transcriptional activity of that gene (Figure 1).
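The last step, counting the reads that align to each gene, can be sketched as follows. The annotation, coordinates, and function name are hypothetical, and the overlap rule is deliberately simple; dedicated counting tools also handle strandedness, spliced reads, and reads that overlap more than one gene.

```python
from collections import Counter

# Hypothetical gene annotation: name -> (chromosome, start, end),
# using 1-based inclusive coordinates.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}


def count_reads_per_gene(alignments, genes):
    """Count aligned reads that overlap each gene by at least one base.

    alignments: iterable of (chromosome, start, end) tuples for mapped reads.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in alignments:
        for name, (g_chrom, g_start, g_end) in genes.items():
            # Two intervals overlap when each one starts before the other ends.
            if chrom == g_chrom and start <= g_end and end >= g_start:
                counts[name] += 1
    return counts


# Hypothetical aligned reads.
reads = [
    ("chr1", 120, 270),   # inside geneA
    ("chr1", 450, 600),   # overlaps the 3' end of geneA
    ("chr1", 900, 1050),  # inside geneB
    ("chr2", 100, 250),   # on a chromosome with no annotated gene
]

gene_counts = count_reads_per_gene(reads, GENES)
print(dict(gene_counts))
```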
RNA-Seq can provide a comprehensive or targeted characterization of the transcriptome. By studying transcriptomes, researchers can determine when and where genes are expressed. In general, two types of information can be obtained from RNA-Seq: qualitative and quantitative. Qualitative information can include genome annotation, transcript orientation, transcriptional start sites, intron/exon boundaries, polyadenylation sites, alternative splicing or isoforms, gene fusions and variant discovery. Quantitative information can include absolute or relative gene expression levels, isoform expression levels, and differential gene expression, that is, the comparison of expression levels across two or more conditions.
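Relative expression levels are commonly reported in normalized units so that genes of different lengths and samples of different depths can be compared. One widely used unit is transcripts per million (TPM); the sketch below computes it from hypothetical counts and gene lengths. It is an illustration of the formula, not the method of any particular pipeline.

```python
def transcripts_per_million(counts, lengths_bp):
    """Compute TPM from raw read counts and gene lengths.

    Step 1: divide each count by the gene length (reads per base).
    Step 2: rescale so the values across all genes sum to one million.
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical data: geneB has twice the reads of geneA but is twice as long,
# so both genes have the same length-normalized expression.
example_counts = {"geneA": 100, "geneB": 200}
example_lengths = {"geneA": 1000, "geneB": 2000}

tpm = transcripts_per_million(example_counts, example_lengths)
print(tpm)
```

Because TPM values always sum to one million within a sample, they describe relative rather than absolute abundance.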
Before beginning an RNA-Seq experiment, it is important to carefully consider each step of the RNA-Seq workflow (Figure 2), including experimental design, extraction, library preparation, sequencing, and data analysis.
The design stage of an experiment is arguably the most critical step in ensuring the success of an RNA-Seq experiment. Researchers must make key decisions at the start of any NGS project, including the type of assay and the number of samples to analyze. The optimal approach will depend largely on the objectives of the experiment, hypotheses to be tested, and expected information to be gathered.
The first step in characterizing the transcriptome involves isolating and purifying cellular RNA. RNA can be extracted from a variety of input sample types including fresh/fixed/frozen cells, blood, or tissue. The quality and quantity of the input material have a significant impact on data quality; therefore, care must be taken when isolating and preparing RNA for sequencing. Given the chemical instability of RNA, there are two major reasons for RNA degradation during experiments: spontaneous hydrolysis, which is promoted by the reactive 2' hydroxyl group of the ribose sugar, and enzymatic digestion by ribonucleases (RNases), which are widespread in cells and in the laboratory environment.
Following extraction, RNA should be measured using a Qubit® or NanoDrop™ to indicate the quantity of the RNA and/or DNA present in the sample. Additionally, the RNA should be evaluated with an Agilent® Bioanalyzer™ or TapeStation™ to analyze fragment length, quality, and integrity.
In the short term, RNA may be stored in RNase-free water or TE buffer at -80°C for up to 1 year without degradation. For the long term, RNA samples may be stored as ethanol precipitates at -20°C. Avoid repeated freeze-thaw cycles, which can lead to degradation. RNA of high integrity will maximize the likelihood of obtaining reliable and informative results.
Library preparation involves generating a collection of RNA-derived fragments that are compatible with sequencing. The process involves enrichment of target (non-ribosomal) RNA, fragmentation, reverse transcription (i.e. cDNA synthesis), addition of sequencing adapters, and amplification. The enrichment method determines which types of transcripts (e.g. mRNA, lncRNA, miRNA) will be included in the library.
Figure 3. Strand-specific RNA library preparation. (A) Incorporation of dUTP during second-strand synthesis and subsequent uracil-specific digestion selects for the first-strand cDNA. (B) In SMART cDNA synthesis, unique adapters can be specifically attached to the 5’ and 3’ ends during cDNA synthesis, preserving orientation.
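For the dUTP approach in panel (A), the consequence for data analysis can be sketched in a few lines. Because only the first-strand cDNA survives, read 1 aligns to the strand opposite the original transcript, while read 2 of a pair aligns to the transcript strand. The function name is hypothetical, and the sketch assumes a standard dUTP library; other protocols, including some SMART-based kits, can have the opposite orientation.

```python
def infer_transcript_strand(alignment_strand, is_read2=False):
    """Infer the strand of the original transcript for a dUTP library read.

    alignment_strand: '+' or '-', the genomic strand the read aligned to.
    is_read2: True for the second read of a paired-end fragment.
    """
    if alignment_strand not in ("+", "-"):
        raise ValueError("alignment_strand must be '+' or '-'")
    opposite = {"+": "-", "-": "+"}
    # Read 1 is sequenced from the first-strand cDNA, which is antisense to
    # the RNA, so its alignment strand must be flipped.
    # Read 2 comes from the other end of the fragment in the reverse
    # orientation, so it already matches the transcript strand.
    return alignment_strand if is_read2 else opposite[alignment_strand]


print(infer_transcript_strand("-"))                 # read 1 on '-' strand
print(infer_transcript_strand("+", is_read2=True))  # read 2 on '+' strand
```

Both example reads above would be assigned to a transcript on the '+' strand.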
Regardless of which method is selected, the final RNA library is evaluated for quality control with an Agilent® Bioanalyzer™ or TapeStation™ to analyze the library fragment length, quality and integrity, and with qPCR to determine the library concentration.
Parameters for sequencing—such as read length, configuration, and output—depend on the goals of the project and will influence the choice of instrument and sequencing chemistry (Table 2). The main NGS technologies can be grouped into two categories: short-read sequencing and long-read sequencing. Both have distinct benefits for RNA-Seq.
| Platform | Type | Read Configuration | Number of Reads | Data Output |
| Illumina® NovaSeq™ X Plus | Short-read | 2×150 bp (PE150); >85% of bases higher than Q30 | 3.25 billion single reads (6.5 billion paired end reads) per 25B lane. 65 transcriptomes (50M) per lane | 1 Tb per 25B lane |
| Illumina® MiSeq™ i100 Plus | Short-read | 2×150 bp (PE150); >90% of bases higher than Q30 | 25 million single reads (50 million paired end reads) per 25M flow cell. 1 transcriptome (50M) per flow cell | 7.5 Gb per 25M flow cell |
| PacBio® Revio™ | Long-read | Up to 500 kb (average read lengths up to 30 kb) | Over 60 million per flow cell | 120 Gb per SMRT® cell |
| Oxford Nanopore Technologies® PromethION™ | Long-read | Up to 4 Mb (average read lengths up to 100 kb) | 20-30 million per flow cell | Over 150 Gb per flow cell |
Table 2. Sequencing Configurations. Other read configurations are available with different outputs. The 2x150 bp configuration is one of the most common for short-read sequencing. It is also known as ‘paired-end 150 bp sequencing’, or ‘PE150’.
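The short-read data outputs in Table 2 follow from simple arithmetic: the number of sequenced fragments, multiplied by the number of reads per fragment (two for paired-end), multiplied by the read length. The sketch below reproduces the two Illumina rows; the function name is an assumption made for illustration.

```python
def data_output_gb(fragments, read_length_bp, paired_end=True):
    """Return the sequencing output in gigabases (1 Gb = 1e9 bases).

    fragments: number of sequenced fragments (the 'single reads' in Table 2).
    paired_end: each fragment yields two reads when True, one when False.
    """
    reads_per_fragment = 2 if paired_end else 1
    return fragments * reads_per_fragment * read_length_bp / 1e9


# 25 million fragments at 2x150 bp (25M flow cell row of Table 2).
miseq_gb = data_output_gb(25_000_000, 150)

# 3.25 billion fragments at 2x150 bp (one 25B lane in Table 2).
novaseq_lane_gb = data_output_gb(3_250_000_000, 150)

print(f"25M flow cell: {miseq_gb} Gb")
print(f"25B lane: {novaseq_lane_gb} Gb (about 1 Tb)")
```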
Short-read sequencing is relatively inexpensive on a per-base basis and can generate billions of reads in a massively parallel manner, with single-end read lengths ranging between 50 and 300 bp. The high-throughput nature of this technology is ideal for quantifying the relative abundance of transcripts or identifying rare transcripts. Several platforms available on the market offer flexible outputs using roughly similar chemistry.
Multiple samples are typically multiplexed, or run together, on a flow cell to make experiments more cost-effective. The choice of platform (and flow cell) together with the number of samples multiplexed determines the sequencing depth per sample. Each cDNA fragment can be sequenced from only one end, called single-end (SE) sequencing, or from both ends, called paired-end (PE) sequencing. While SE sequencing is generally less expensive and faster, PE sequencing detects genomic rearrangements and resolves alignments in repetitive sequences better than the single-end configuration, since more information is collected from each fragment.
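The relationship between flow cell capacity, multiplexing, and depth can be sketched as below. The sketch assumes that reads are divided evenly among samples and that depth is counted in sequenced fragments (read pairs for PE runs); in practice, pooling is never perfectly even, so some margin is usually planned. Function names and the 24-sample example are hypothetical.

```python
def depth_per_sample(fragments_per_lane, n_samples):
    """Average number of fragments per sample, assuming even pooling."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return fragments_per_lane / n_samples


def max_samples(fragments_per_lane, target_depth):
    """Largest number of samples that still reach the target depth each."""
    return int(fragments_per_lane // target_depth)


# Lane capacity taken from Table 2 (3.25 billion fragments per 25B lane).
LANE_CAPACITY = 3_250_000_000

# How many samples fit at a target of 50 million fragments per sample?
samples_at_50m = max_samples(LANE_CAPACITY, 50_000_000)
print(f"Samples per lane at 50M each: {samples_at_50m}")

# What depth does each sample get if 24 samples share the lane?
depth_24 = depth_per_sample(LANE_CAPACITY, 24)
print(f"Depth with 24 samples: {depth_24 / 1e6:.1f} million fragments each")
```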
Long-read sequencing can resolve inaccessible regions of the genome and read through the entire length of RNA transcripts, allowing precise determination of specific isoforms. Two of the leading long-read sequencing platform providers are PacBio® and Oxford Nanopore Technologies®. The PacBio platform uses Single-Molecule Real-Time (SMRT®) sequencing to generate exceptionally long reads reaching up to 10 kb in length. This method, called ‘isoform sequencing’ or ‘Iso-Seq’, produces full-length transcripts [1]. The Oxford Nanopore platform uses a unique sequencing technology for direct, real-time analysis of exceedingly long transcripts, generating reads exceeding 20 kb in length [2]. In non-model species with no genomic references available, long reads can provide valuable information to detect full-length transcripts accurately. However, if cost reduction is paramount and/or high data output is required, short-read sequencing is a better choice.
Evaluating data quality and extracting biologically relevant information is the final and most rewarding step in an RNA-Seq experiment. It is important to find the best analysis pipeline as one pipeline does not fit all approaches. In general, raw sequencing data is pre-processed or ‘trimmed’ to remove adapter sequences and low-quality reads. If a reference genome is available, reads are typically aligned (or mapped) to the reference to discover the genomic origin of the RNA molecule. In samples lacking a reference genome, de novo transcriptome assembly can be performed using overlapping reads. Once reads are mapped, more sophisticated analyses such as differential gene expression (DGE) and isoform variant discovery provide researchers with detailed views of dynamic gene expression profiles.
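The pre-processing step can be illustrated with a minimal trimming sketch. It removes an adapter found by exact match, trims low-quality bases from the 3' end, and discards reads that end up too short. The function name, thresholds, and example read are assumptions for illustration; production trimmers also tolerate mismatches and partial adapter matches at the read end.

```python
def trim_read(seq, quals, adapter, min_quality=20, min_length=20):
    """Trim one read; return (sequence, qualities) or None if discarded.

    seq: read sequence as a string.
    quals: list of per-base Phred quality scores, same length as seq.
    adapter: adapter sequence to remove (exact match only in this sketch).
    """
    # 1. Cut the read at the first exact occurrence of the adapter.
    index = seq.find(adapter)
    if index != -1:
        seq, quals = seq[:index], quals[:index]

    # 2. Remove bases below the quality threshold from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    seq, quals = seq[:end], quals[:end]

    # 3. Discard reads that are too short to align reliably.
    if len(seq) < min_length:
        return None
    return seq, quals


# Hypothetical read: 24 bases of insert followed by a 13-base adapter.
example_adapter = "AGATCGGAAGAGC"
example_seq = "ACGT" * 6 + example_adapter
example_quals = [35] * 22 + [10] * 2 + [35] * 13

trimmed = trim_read(example_seq, example_quals, example_adapter)
print(trimmed)
```

In this example the adapter is removed first, then the two low-quality bases at the new 3' end are trimmed, leaving a 22-base read that passes the length filter.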
RNA sequencing (RNA-Seq) is a highly effective method for studying the transcriptome qualitatively and quantitatively. It can identify the full catalog of transcripts, precisely define gene structures, and accurately measure gene expression levels. In today’s research, RNA-Seq is an indispensable tool and the most frequently used technology for transcriptome analysis, enabling researchers to achieve novel biological insights.
References
1. Wang, B., et al., Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun, 2016. 7: p. 11708.
2. Kono, N. and K. Arakawa, Nanopore sequencing: Review of potential applications in functional genomics. Dev Growth Differ, 2019. 61(5): p. 316-326.