A guide to RNA-Seq workflows and methods

Gene expression was historically evaluated with Northern blotting and RT-qPCR. RNA sequencing (RNA-Seq) moved to the forefront of next-generation sequencing (NGS) applications in the mid-to-late 2000s and rapidly emerged as a standard tool for comprehensive analysis of an organism’s transcriptome. Over the years, the technology has evolved to expand the breadth and scope of RNA-Seq, which now includes detection of various transcript types, sequencing of intact and degraded samples alike, and profiling of samples down to the single-cell level.

 

The Basics of RNA Sequencing

A transcriptome encompasses the full range of coding and non-coding RNA transcripts expressed by an organism, also referred to as total RNA (Table 1). In contrast with the genome, the transcriptome changes continually in response to many factors, including an organism’s developmental stage, environmental conditions, tissue type or even cell type. Messenger RNA (mRNA) is usually of primary interest; however, it accounts for only about 5% of the total RNA present in eukaryotic cells. Roughly 80% is rRNA and 10-15% is tRNA, with the remaining <1% made up of a mix of lncRNA, miRNA and other small RNAs.

| Class | Type | Length | Standard Modifications | Organisms |
|---|---|---|---|---|
| Coding | Messenger RNA (mRNA) | >200 nt | Eukaryotes: 5' cap, 3' poly(A) tail | All |
| Non-Coding | Ribosomal RNA (rRNA) | 120-5000 nt | 2'-O-methylation (2' OMe) of sugar | All |
| Non-Coding | Transfer RNA (tRNA) | 75-95 nt | 5' phosphate, 3' hydroxyl | All |
| Non-Coding | Long non-coding RNA (lncRNA) | >200 nt | N6-methyladenosine (m6A) | Eukaryotes |
| Non-Coding | MicroRNA (miRNA) | 21-25 nt | 5' phosphate, 3' hydroxyl | Animals, Plants |
| Non-Coding | Short interfering RNA (siRNA) | 20-25 nt | 5' phosphate, 3' hydroxyl | Eukaryotes, Viruses |
| Non-Coding | Piwi-interacting RNA (piRNA) | 26-31 nt | 5' phosphate, 3'-end 2'-O-methylation | Animals |

 

Table 1. Types of RNA in the transcriptome

RNA sequencing (RNA-Seq) can be used to measure the expression of thousands of genes simultaneously under one condition or to compare expression across multiple conditions, an analysis known as differential gene expression (DGE). In today’s research, RNA-Seq is an indispensable tool and the most frequently used technology for transcriptome analysis, enabling high-throughput profiling of coding and non-coding RNA at single-nucleotide resolution.

The process begins with harvesting total RNA from cells and purifying the RNA molecules of interest. Single-stranded RNA is then converted into double-stranded complementary DNA (cDNA) by reverse transcription followed by second-strand synthesis. Sequencing adapters and barcodes are then added to create RNA-Seq libraries that are subsequently analyzed with next-generation sequencing (NGS). The resulting data, called ‘reads’, are then mapped to the reference genome if one is available. The number of reads aligning to a gene serves as a measure of that gene’s transcriptional activity (Figure 1).

Figure 1. Different parts of a pre-mRNA showing intron (orange), exon (blue), and splice junctions processed into mRNA. Removal of the introns by splicing at the junctions gives rise to coding mRNA. Short reads from RNA-Seq experiments can be assembled to determine coding regions in mRNA and aligned with a reference genome to map read counts.
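The idea that read counts reflect transcriptional activity can be sketched in a few lines of Python. This is a toy illustration only: the gene coordinates and read positions are invented, and `count_reads_per_gene` is a hypothetical helper. Real counting tools also account for strand, spliced alignments, and reads that map to more than one place.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

def count_reads_per_gene(alignments, genes=GENES):
    """Count aligned reads whose start position falls inside each gene."""
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in alignments:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                counts[name] += 1
    return dict(counts)

# Invented alignment start positions for five reads; the chr2 read hits no gene.
reads = [("chr1", 120), ("chr1", 130), ("chr1", 450), ("chr1", 900), ("chr2", 150)]
print(count_reads_per_gene(reads))  # {'geneA': 3, 'geneB': 1}
```

In this toy data set, geneA collects three reads and geneB one, so geneA would be read as the more highly expressed gene, before any correction for gene length or sequencing depth.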

RNA-Seq can provide a comprehensive or targeted characterization of the transcriptome. By studying transcriptomes, researchers can determine when and where genes are expressed. In general, two types of information can be obtained from RNA-Seq: qualitative and quantitative. Qualitative information can include genome annotation, transcript orientation, transcriptional start sites, intron/exon boundaries, polyadenylation sites, alternative splicing or isoforms, gene fusions and variant discovery. Quantitative information can include absolute or relative gene expression levels, isoform expression levels, and differential gene expression, that is, the comparison of expression levels across two or more conditions.
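As one illustration of a relative expression level, the sketch below computes transcripts per million (TPM), a widely used normalization that this guide does not itself prescribe. The counts and gene lengths are invented, and `tpm` is a hypothetical helper name.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM).

    Each count is first divided by gene length (reads per base), then the
    per-gene rates are rescaled so that they sum to one million.
    """
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

# Invented example: geneA has three times the reads but is three times longer.
counts = {"geneA": 300, "geneB": 100}
lengths_bp = {"geneA": 3000, "geneB": 1000}
print(tpm(counts, lengths_bp))
```

Both genes end up with the same TPM here, which shows why raw counts alone can overstate the expression of long genes.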

RNA-Seq Workflow

Before beginning an RNA-Seq experiment, it is important to carefully consider each step of the RNA-Seq workflow (Figure 2), including experimental design, extraction, library preparation, sequencing, and data analysis.

Figure 2. RNA-Seq experimental workflow.

STEP 1 Experimental Design 

The design stage of an experiment is arguably the most critical step in ensuring the success of an RNA-Seq experiment. Researchers must make key decisions at the start of any NGS project, including the type of assay and the number of samples to analyze. The optimal approach will depend largely on the objectives of the experiment, hypotheses to be tested, and expected information to be gathered. 

STEP 2 Extraction 

The first step in characterizing the transcriptome involves isolating and purifying cellular RNA. RNA can be extracted from a variety of input sample types including fresh/fixed/frozen cells, blood, or tissue. The quality and quantity of the input material have a significant impact on data quality; therefore, care must be taken when isolating and preparing RNA for sequencing. Given the chemical instability of RNA, there are two major reasons for RNA degradation during experiments: 

  • RNA contains ribose sugar, whose reactive 2' hydroxyl group makes it unstable in alkaline conditions. RNA is also more prone to heat degradation than DNA.
  • Ribonucleases (RNases) are ubiquitous and very stable, so avoiding them is nearly impossible. It is essential to maintain an RNase-free environment by wearing sterile disposable gloves when handling reagents and RNA samples, employing RNase inhibitors, and using DEPC-treated water instead of PCR-grade water. Additionally, proper storage of RNA is crucial to avoid RNA degradation. 

Following extraction, RNA should be measured using a Qubit® or NanoDrop™ to indicate the quantity of the RNA and/or DNA present in the sample. Additionally, the RNA should be evaluated with an Agilent® Bioanalyzer™ or TapeStation™ to analyze fragment length, quality, and integrity. 
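For spectrophotometric readings such as those from a NanoDrop, concentration is conventionally derived from absorbance at 260 nm. The sketch below assumes the common rule of thumb that one A260 unit corresponds to about 40 µg/mL of single-stranded RNA at a 1 cm path length. The function names and example readings are hypothetical, and instrument software normally performs this calculation itself.

```python
# Assumption: 1 A260 unit ~ 40 ug/mL single-stranded RNA (1 cm path length).
RNA_UG_PER_ML_PER_A260 = 40.0

def rna_concentration_ng_per_ul(a260, dilution_factor=1.0):
    """Estimate RNA concentration; ug/mL is numerically equal to ng/uL."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor

def a260_a280_ratio(a260, a280):
    """Purity indicator; values near 2.0 are commonly taken to suggest clean RNA."""
    return a260 / a280

# Invented readings for a sample diluted 1:10 before measurement.
print(rna_concentration_ng_per_ul(0.5, dilution_factor=10))  # 200.0
print(a260_a280_ratio(0.5, 0.25))  # 2.0
```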

In the short term, RNA may be stored in RNase-free water or TE buffer at -80°C for 1 year without degradation. For the long term, RNA samples may be stored as ethanol precipitates at -20°C. Avoid repeated freeze-thaw cycles of samples, which can lead to degradation. RNA of high integrity will maximize the likelihood of obtaining reliable and informative results.

STEP 3 Library Preparation 

Library preparation involves generating a collection of RNA-derived cDNA fragments that are compatible with sequencing. The process involves enrichment of target (non-ribosomal) RNA, fragmentation, reverse transcription (i.e. cDNA synthesis), addition of sequencing adapters, and amplification. The enrichment method determines which types of transcripts (e.g. mRNA, lncRNA, miRNA) will be included in the library.

  • mRNA-Seq: In eukaryotes, mRNA transcripts carry polyadenylated tails, which are used to enrich mRNA molecules through poly(A) selection. In this process, total RNA is isolated and subjected to either hybridization with oligo(dT)-conjugated beads/columns or reverse transcription with oligo(dT) primers. Polyadenylated RNA molecules make up just 1-5% of total RNA in many species, meaning the amount of RNA recovered after poly(A) selection is typically 20- to 100-fold lower than the input. Purified mRNA is converted into cDNA libraries and amplified via PCR to increase library concentration. Poly(A) selection is the most common and the most cost-effective option for eukaryotic mRNA library preparation.
  • Strand-Specific RNA-Seq: Transcript polarity is important for correct annotation of genes. Since there are many genomic regions that generate transcripts from both strands, identifying the polarity of a given transcript provides essential information about the possible function of a gene. However, the polarity of transcripts can be lost during cDNA synthesis and subsequent amplification. Strand-specific RNA-Seq, also known as ‘stranded’ or ‘directional’ RNA-Seq, preserves this information during library preparation, allowing researchers to determine the orientation of the gene on the DNA template. It can be used in conjunction with mRNA-Seq and total RNA-Seq. There are at least two methods for creating stranded RNA-Seq libraries (Figure 3).

Figure 3. Strand-specific RNA library preparation. (A) Incorporation of dUTP during second-strand synthesis and subsequent uracil-specific digestion selects for the first-strand cDNA. (B) In SMART cDNA synthesis, unique adapters can be specifically attached to the 5’ and 3’ ends during cDNA synthesis, preserving orientation.
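To make the dUTP approach in panel (A) concrete, the sketch below infers the strand of the original transcript from the FLAG field of an aligned read. It assumes the usual convention for dUTP-based libraries, in which read 1 aligns antisense to the transcript and read 2 aligns in the sense orientation. Other kits may use the opposite convention, so this should be checked against the library protocol. The function name is hypothetical; the bit values come from the SAM format.

```python
# SAM FLAG bits (from the SAM specification).
FLAG_REVERSE = 0x10  # read aligned to the reverse strand
FLAG_READ2 = 0x80    # read is the second in its pair

def transcript_strand_dutp(flag):
    """Infer transcript strand for a dUTP-stranded library from a SAM FLAG.

    Assumption: read 2 shares the transcript's strand; read 1 (or a
    single-end read) is on the opposite strand.
    """
    read_is_reverse = bool(flag & FLAG_REVERSE)
    if flag & FLAG_READ2:
        return "-" if read_is_reverse else "+"
    return "+" if read_is_reverse else "-"

# Typical FLAG values for properly paired reads.
for flag in (99, 147, 83, 163):
    print(flag, transcript_strand_dutp(flag))
```

FLAGs 99 and 147 describe the two mates of one fragment and both resolve to the minus strand, while 83 and 163 resolve to the plus strand.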

  • Total RNA-Seq: Total RNA-Seq is a method used for comprehensive analyses of protein-coding and long non-coding RNAs (lncRNAs). The latter have important regulatory functions in the genome and are of interest to molecular biologists due to their capacity for epigenetic regulation of transcriptional activity. Since many lack a poly(A) tail, lncRNA molecules are often excluded from poly(A) selection. In total RNA-Seq, oligos complementary to single-stranded rRNAs are used to capture and deplete these molecules prior to sequencing. 
  • Small RNA-Seq: Short RNA transcripts, such as microRNAs (miRNA) and small interfering RNAs (siRNA), also play important gene regulatory roles in the cell. In small RNA-Seq, RNA species are selected by size fractionation from total RNA. Library preparation typically includes ligation of sequencing adapters to 5’ phosphate ends, which are found in small RNAs but absent in degraded fragments of larger RNA molecules, such as mRNA. The RNA fragments are then converted to cDNA libraries prior to sequencing. Since small RNA molecules can be as short as 21 nucleotides in length, sequencing configurations with fewer cycles (e.g. 1x50 bp) can be used.
  • Single-Cell RNA-Seq: Distinct from traditional “bulk” RNA-Seq methods, single-cell RNA-Seq (scRNA-Seq) allows researchers to capture the transcriptome of individual cells and uncover heterogeneous patterns of gene expression in complex cellular populations. Microfluidics or microwells are typically used to isolate single cells before library preparation. These methods preserve cellular information by adding unique barcodes to each transcript during isolation, which can be bioinformatically traced back to the cell of origin. 
  • Ultra-Low Input RNA-Seq: Since faithful characterization of the transcriptome depends largely on the quality and quantity of the input RNA, standard RNA-Seq approaches call for an ample amount (>500 ng) of intact RNA. Samples with lower yields or degraded RNA typically require additional amplification steps, as well as higher depths of sequencing to boost data output. These samples are prone to transcriptional bias and poor read mapping to exons. Ultra-low input methods have been developed to selectively amplify full-length transcripts with minimal bias, allowing researchers to perform RNA-Seq on samples containing as few as 10 pg of RNA or just a single cell.
  • Kinnex Full-Length RNA Sequencing: Alternative splicing results in multiple isoforms being encoded by a single gene, which can be effectively analyzed by Kinnex™ full-length RNA sequencing, previously known as isoform sequencing (Iso-Seq). Developed by PacBio®, Kinnex full-length RNA-Seq uses long-read technology to sequence transcripts contiguously from end-to-end, eliminating the need for reconstruction. The result is unambiguous information about a transcript’s start, polyadenylation, and splice sites from a single read. Kinnex full-length RNA-Seq characterizes the full complement of isoforms across the transcriptome, with potential applications including better annotated genomes, detection of gene fusions, and discovery of novel isoforms.  
  • Direct RNA Sequencing: Oxford Nanopore Technologies® direct RNA sequencing reads native RNA molecules without reverse transcription or PCR amplification, and is the only available technology that sequences native RNA transcripts directly. The technology passes RNA through protein nanopores and reads it in real time, producing long reads that resolve isoforms, alternative splicing events, and poly(A) tail lengths. Because the native molecule is sequenced, gene and isoform expression can be quantified without PCR bias, and base modifications such as N6-methyladenosine (m6A) can be detected in the same run without additional chemical treatments or enrichment steps, providing insights into RNA biology and disease mechanisms.
  • High-Throughput Gene Expression Screening: Unlike more traditional RNA-Seq methods, high-throughput gene expression screening (HT-GEx) uses 3’ end counting, which yields gene-level rather than transcript-level results. HT-GEx libraries are prepared directly from cell lysates, removing the RNA extraction step and shortening the workflow considerably. Because only the 3’ end of each transcript is captured, less sequencing is required per sample. The method is well suited to screening, but it does not capture the transcript-level detail that most traditional analyses require.
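For the single-cell method in the list above, tracing barcodes back to the cell of origin amounts to grouping reads by their cell barcode. The sketch below is a minimal illustration with invented four-base barcodes and a hypothetical `counts_per_cell` helper. Real pipelines also correct barcode sequencing errors and collapse PCR duplicates, which is not shown here.

```python
from collections import defaultdict

def counts_per_cell(tagged_reads):
    """Build a {cell_barcode: {gene: read_count}} table from (barcode, gene) pairs."""
    table = defaultdict(lambda: defaultdict(int))
    for barcode, gene in tagged_reads:
        table[barcode][gene] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {barcode: dict(genes) for barcode, genes in table.items()}

# Invented reads: each carries the barcode of its cell and the gene it mapped to.
tagged_reads = [
    ("AAAC", "geneA"),
    ("AAAC", "geneA"),
    ("AAAC", "geneB"),
    ("TTTG", "geneB"),
]
print(counts_per_cell(tagged_reads))
```

Each barcode becomes one row of a cell-by-gene count table, which is the starting point for examining heterogeneity across cells.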

Regardless of which method is selected, the final RNA-Seq library is evaluated for quality control on an Agilent® Bioanalyzer™ or TapeStation™ to assess library fragment length, quality, and integrity, and by qPCR to determine the library concentration.
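The poly(A) selection figures quoted for mRNA-Seq above can be checked with simple arithmetic. The sketch below assumes perfect recovery of polyadenylated RNA, which real protocols do not achieve, and uses hypothetical function names and an invented 1000 ng input.

```python
def polya_yield_ng(total_rna_ng, polya_fraction):
    """Idealized mass of poly(A) RNA recovered from a total RNA input."""
    return total_rna_ng * polya_fraction

def fold_reduction(polya_fraction):
    """Factor by which the amount of RNA drops after selection."""
    return 1.0 / polya_fraction

# Polyadenylated RNA is described above as 1-5% of total RNA.
for fraction in (0.01, 0.05):
    print(fraction, polya_yield_ng(1000, fraction), fold_reduction(fraction))
```

With 1000 ng of total RNA, a 1% poly(A) fraction leaves about 10 ng (a 100-fold reduction) and a 5% fraction leaves about 50 ng (a 20-fold reduction), matching the range given in the text.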

STEP 4 Sequencing 

Parameters for sequencing—such as read length, configuration, and output—depend on the goals of the project and will influence the choice of instrument and sequencing chemistry (Table 2). The main NGS technologies can be grouped into two categories: short-read sequencing and long-read sequencing. Both have distinct benefits for RNA-Seq. 

| Platform | Type | Read Configuration | Number of Reads | Data Output |
|---|---|---|---|---|
| Illumina® NovaSeq™ X Plus | Short-read | 2×150 bp (PE150); >85% of bases higher than Q30 | 3.25 billion single reads (6.5 billion paired-end reads) per 25B lane; 65 transcriptomes (50M) per lane | 1 Tb per 25B lane |
| Illumina® MiSeq™ i100 Plus | Short-read | 2×150 bp (PE150); >90% of bases higher than Q30 | 25 million single reads (50 million paired-end reads) per 25M flow cell; 1 transcriptome (50M) per flow cell | 7.5 Gb per 25M flow cell |
| PacBio® Revio™ | Long-read | Up to 500 kb (average read lengths up to 30 kb) | Over 60 million per flow cell | 120 Gb per SMRT® cell |
| Oxford Nanopore Technologies® PromethION™ | Long-read | Up to 4 Mb (average read lengths up to 100 kb) | 20-30 million per flow cell | Over 150 Gb per flow cell |

Table 2. Sequencing Configurations. Other read configurations are available with different outputs. The 2x150 bp configuration is one of the most common for short-read sequencing. It is also known as ‘paired-end 150 bp sequencing’, or ‘PE150’.
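The short-read data outputs in Table 2 follow from the read counts and the read configuration. The sketch below reproduces that arithmetic, under the assumption that the "single reads" figure counts sequenced fragments, each of which is read from both ends in a 2×150 bp run. The function name is hypothetical.

```python
def output_gb(fragments, read_length_bp, ends=2):
    """Total sequenced bases in gigabases (1 Gb = 1e9 bases)."""
    return fragments * read_length_bp * ends / 1e9

# Fragment counts taken from Table 2, both run as 2x150 bp.
print(output_gb(3_250_000_000, 150))  # 975.0, i.e. roughly 1 Tb
print(output_gb(25_000_000, 150))     # 7.5
```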

Short-read sequencing is relatively inexpensive on a per-base basis and can generate billions of reads in a massively parallel manner, with single-end read lengths ranging between 50 and 300 bp. The high-throughput nature of this technology is ideal for quantifying the relative abundance of transcripts or identifying rare transcripts. Several platforms available on the market offer flexible outputs using roughly similar chemistry.  

Multiple samples are typically multiplexed, or run together, on a flow cell to make experiments more cost-effective. The choice of platform (and flow cell) together with the number of samples multiplexed determines the sequencing depth. Each cDNA fragment can be sequenced from only one end, called single-end (SE) sequencing, or both ends, called paired-end (PE) sequencing. While SE sequencing is generally less expensive and faster, PE sequencing detects genomic rearrangements and aligns reads in repetitive sequences more reliably than the single-end configuration, since more information is collected from each fragment.
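The relationship between flow cell capacity, multiplexing, and depth can be sketched as follows. The 3.25 billion read capacity and the 50 million reads per transcriptome come from Table 2; the function names and the 96-sample case are illustrative assumptions, and the sketch assumes reads are shared evenly across samples.

```python
def max_samples(total_reads, reads_per_sample):
    """How many samples fit on a run at a target depth per sample."""
    return total_reads // reads_per_sample

def depth_per_sample(total_reads, n_samples):
    """Average reads per sample when a run is split evenly."""
    return total_reads / n_samples

LANE_READS = 3_250_000_000  # NovaSeq X Plus 25B lane, from Table 2

print(max_samples(LANE_READS, 50_000_000))  # 65
print(depth_per_sample(LANE_READS, 96))     # about 33.9 million reads each
```

Pooling more samples on the same lane lowers the depth each one receives, so the target depth should be fixed at the design stage before choosing how many samples to multiplex.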

Long-read sequencing can resolve otherwise inaccessible regions of the genome and read through the entire length of RNA transcripts, allowing precise determination of specific isoforms. Two of the leading long-read sequencing platform providers are PacBio® and Oxford Nanopore Technologies®. The PacBio platform uses Single-Molecule Real-Time (SMRT®) sequencing to generate exceptionally long reads of 10 kb or more. This method, called ‘isoform sequencing’ or ‘Iso-Seq’, produces full-length transcripts [1]. The Oxford Nanopore platform uses a unique sequencing technology for direct, real-time analysis of exceedingly long transcripts, generating reads of more than 20 kb in length [2]. In non-model species with no genomic references available, long reads can provide valuable information to detect full-length transcripts accurately. However, if cost reduction is paramount and/or high data output is required, short-read sequencing is a better choice.

STEP 5 Data Analysis 

Evaluating data quality and extracting biologically relevant information is the final and most rewarding step in an RNA-Seq experiment. It is important to find the best analysis pipeline, as one pipeline does not fit all approaches. In general, raw sequencing data is pre-processed or ‘trimmed’ to remove adapter sequences and low-quality bases and reads. If a reference genome is available, reads are typically aligned (or mapped) to the reference to discover the genomic origin of the RNA molecule. For samples lacking a reference genome, de novo transcriptome assembly can be performed using overlapping reads. Once reads are mapped, more sophisticated analyses such as differential gene expression (DGE) and isoform and variant discovery provide researchers with detailed views of dynamic gene expression profiles.
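The trimming step can be illustrated with a simplified sketch. It assumes Phred+33 quality encoding and exact adapter matches, and the adapter sequence, quality threshold, and minimum length are example values rather than recommendations. Dedicated trimming tools tolerate mismatches and partial adapters, which this hypothetical `trim_read` helper does not.

```python
def trim_read(seq, qual, adapter="AGATCGGAAGAGC", min_q=20, min_len=20):
    """Trim one read; return (seq, qual), or None if the read is too short.

    Assumptions: Phred+33 quality string, exact adapter match only.
    """
    # Remove the adapter and everything downstream of it.
    idx = seq.find(adapter)
    if idx != -1:
        seq, qual = seq[:idx], qual[:idx]
    # Trim low-quality bases from the 3' end.
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - 33 < min_q:
        end -= 1
    seq, qual = seq[:end], qual[:end]
    # Discard reads that are too short to align reliably.
    return (seq, qual) if len(seq) >= min_len else None

# Invented read: 24 bases of insert, then adapter, then 4 extra bases.
read_seq = "ACGT" * 6 + "AGATCGGAAGAGC" + "TTTT"
read_qual = "I" * len(read_seq)  # 'I' encodes Phred 40
print(trim_read(read_seq, read_qual))
```

Here the adapter and the trailing bases are removed, leaving the 24-base insert, which would then be passed on to alignment.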

Conclusion

RNA sequencing (RNA-Seq) is a highly effective method for studying the transcriptome qualitatively and quantitatively. It can identify the full catalog of transcripts, precisely define gene structures, and accurately measure gene expression levels. In today’s research, RNA-Seq is an indispensable tool and the most frequently used technology for transcriptome analysis, enabling researchers to achieve novel biological insights. 

 


References 

1. Wang, B., et al., Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun, 2016. 7: p. 11708. 

2. Kono, N. and K. Arakawa, Nanopore sequencing: Review of potential applications in functional genomics. Dev Growth Differ, 2019. 61(5): p. 316-326.