Introduction to Most Dominant Transcripts

Introduction

This guide provides a step-by-step workflow for identifying the Most Dominant Transcripts (MDTs) within a bulk RNA-sequencing dataset. The MDTToolset package is designed to work with outputs generated by various RNA-Seq quantification tools, including, but not limited to, Salmon and Kallisto. For the accurate detection of MDTs, users must utilize normalized data values (e.g., TPM values) for bulk RNA-seq data.

Load MDTToolset

library(MDTToolset)

Prepare Data

In this section, we utilize the TPM outputs from the Kallisto tool, originating from an (RNA-Seq experiment) [https://www.nature.com/articles/s41597-020-0541-4]. The data is openly accessible and available for download. The MDTToolset includes two conditions as example cases for users. If you already have the quantification data prepared for each condition, you may proceed to the next step. Ensure that the first two column names are ‘ENST’ and ‘ENSG’, respectively, followed by the sample names as shown below.

## read TPM data 
cond1 <- MDTToolset::cond1 # molecular retina
cond2 <- MDTToolset::cond2 # peripheral retina 

# visalize the dataframes. Note that you should have knitr R package installed.
knitr::kable(head(cond1,2), caption = 'TPM Matrix')

TPM Matrix
ENST	tpm_kallisto_macular_retina.1	tpm_kallisto_macular_retina.10	tpm_kallisto_macular_retina.11	tpm_kallisto_macular_retina.12	tpm_kallisto_macular_retina.2	tpm_kallisto_macular_retina.3	tpm_kallisto_macular_retina.4	tpm_kallisto_macular_retina.5	tpm_kallisto_macular_retina.6	tpm_kallisto_macular_retina.7	tpm_kallisto_macular_retina.8	tpm_kallisto_macular_retina.9
ENST00000000233	189.3070	202.8500	203.376	232.7330	191.1680	184.5650	195.5500	228.8620	202.2350	248.3690	196.5360	235.9530
ENST00000000412	43.7888	52.5654	52.852	65.5688	59.9673	53.0147	64.1605	48.3739	60.3188	62.3249	46.5275	63.5542

knitr::kable(head(cond2,2), caption = 'TPM Matrix')

TPM Matrix
ENST	tpm_kallisto_peripheral_retina.1	tpm_kallisto_peripheral_retina.10	tpm_kallisto_peripheral_retina.11	tpm_kallisto_peripheral_retina.12	tpm_kallisto_peripheral_retina.2	tpm_kallisto_peripheral_retina.3	tpm_kallisto_peripheral_retina.4	tpm_kallisto_peripheral_retina.5	tpm_kallisto_peripheral_retina.6	tpm_kallisto_peripheral_retina.7	tpm_kallisto_peripheral_retina.8	tpm_kallisto_peripheral_retina.9
ENST00000000233	121.7770	136.121	154.4650	169.1770	148.1950	174.5520	147.8110	147.3530	132.3580	138.844	135.4470	165.8010
ENST00000000412	41.0036	54.297	55.0917	62.8938	59.8682	49.5176	61.5132	48.0374	63.8108	60.491	52.6708	56.7923

Prepare Files to Remove the Redundant Transcripts

This step is optional but highly recommended.

A challenge in transcript quantification is the presence of redundant transcripts, which are transcripts with identical sequences that yield the same TPM values. This redundancy can obscure the identification of Most Dominant Transcripts (MDTs) by preventing the detection of a singular, dominant transcript for a given gene. To address this issue, our initial step involves eliminating redundant transcripts from the dataset. This process includes assigning TPM values to a single representative transcript per redundant group, based on the comparison of their protein sequences.

Priority is given to transcripts classified as MANE (Matched Annotation from NCBI and EMBL-EBI) in the Ensembl database, designating them as the primary transcript. In cases where no MANE transcript is identified, a transcript is randomly selected to represent the group. This step is particularly crucial when working with tools like Kallisto, which may assign identical TPM values to redundant transcripts. Following this, the calculation of MDTs proceeds using the reassigned TPM values.

To be able to detect redundat transcripts, we need sequences of transcripts. Fasta sequence of transcripts (cdna sequences) can be downloaded from the designated databases (e.g Ensembl)[https://www.ensembl.org/info/data/ftp/index.html]. We provide Ensembl version 104 as an example. In addition, we provide the Ensembl MANE Select Data (v104) to help remove redundant transcripts found in Kallisto output files. To download your version of the MANE select, navigate to the Ensembl website for the version you used, proceed to the BioMart website, select the fields Gene Stable ID, Transcript Stable ID, and RefSeq Match Transcript (MANE Select), and download the TSV file. Please note that The user should employ the same version that was used to quantify their transcript abundances to ensure consistency. the function prepare_Seq() takes fasta and MANE file as input and gives a dataframe showing ENSG (Gene), ENST (Transcript), and ENSP (Protein), and MANE ID, and ENST Sequence.

# read MANE Select
ensg_enst_ensp_mane_v104 <- MDTToolset::ensg_enst_ensp_mane_v104
ensg_enst_ensp_mane_v104 <- ensg_enst_ensp_mane_v104 %>% dplyr::distinct() # While downloading data from Biomart, the values might not be unique - make sure the rows are unique
# prepare sequences and Mane select transcripts dataframe
ensg_enst_seq <- prepare_Seq(MDTToolset::Homo_sapiens.GRCh38.cdna.all, ensg_enst_ensp_mane_v104)

## Select transcripts having MANE ids
enst_mane_iso <- ensg_enst_ensp_mane_v104[!(is.na(ensg_enst_ensp_mane_v104$RefSeq.match.transcript..MANE.Select.) | ensg_enst_ensp_mane_v104$RefSeq.match.transcript..MANE.Select. == ''),c(1,2,3,4)]

We merge the transcript TPM expression matrix file with the ensg_enst_seq file that we have prepared in the previous step so that the second column includes the ENSG ids.

cond1_ENST_ENSG <- merge(ensg_enst_seq[,1:2], cond1, by='ENST')
cond2_ENST_ENSG <- merge(ensg_enst_seq[,1:2], cond2, by='ENST')

# visalize the dataframes. Note that you should have knitr R package installed.
knitr::kable(head(cond1_ENST_ENSG,2), caption = 'TPM Matrix')

TPM Matrix
ENST	ENSG	tpm_kallisto_macular_retina.1	tpm_kallisto_macular_retina.10	tpm_kallisto_macular_retina.11	tpm_kallisto_macular_retina.12	tpm_kallisto_macular_retina.2	tpm_kallisto_macular_retina.3	tpm_kallisto_macular_retina.4	tpm_kallisto_macular_retina.5	tpm_kallisto_macular_retina.6	tpm_kallisto_macular_retina.7	tpm_kallisto_macular_retina.8	tpm_kallisto_macular_retina.9
ENST00000000233	ENSG00000004059	189.3070	202.8500	203.376	232.7330	191.1680	184.5650	195.5500	228.8620	202.2350	248.3690	196.5360	235.9530
ENST00000000412	ENSG00000003056	43.7888	52.5654	52.852	65.5688	59.9673	53.0147	64.1605	48.3739	60.3188	62.3249	46.5275	63.5542

knitr::kable(head(cond2_ENST_ENSG,2), caption = 'TPM Matrix')

TPM Matrix
ENST	ENSG	tpm_kallisto_peripheral_retina.1	tpm_kallisto_peripheral_retina.10	tpm_kallisto_peripheral_retina.11	tpm_kallisto_peripheral_retina.12	tpm_kallisto_peripheral_retina.2	tpm_kallisto_peripheral_retina.3	tpm_kallisto_peripheral_retina.4	tpm_kallisto_peripheral_retina.5	tpm_kallisto_peripheral_retina.6	tpm_kallisto_peripheral_retina.7	tpm_kallisto_peripheral_retina.8	tpm_kallisto_peripheral_retina.9
ENST00000000233	ENSG00000004059	121.7770	136.121	154.4650	169.1770	148.1950	174.5520	147.8110	147.3530	132.3580	138.844	135.4470	165.8010
ENST00000000412	ENSG00000003056	41.0036	54.297	55.0917	62.8938	59.8682	49.5176	61.5132	48.0374	63.8108	60.491	52.6708	56.7923

As transcripts with the same sequences might have different peptide lengths, we aim to select the ones with the longest sequence when there is a redundancy. This information cannot be accessed directly via Ensembl Biomart, this we need to download the transcripts’ peptide sequences and then calculate the length of the peptides. We provide Ensembl version 104 as an example. In case you want to prepare a different version, download ENSG, ENST, peptide sequences from Biomart and use the following function from MDTToolset:

df_biomart_seq_length <- MDTToolset::MDTToolset::prepare_seq_length(‘mart_export.txt’) The file ‘mart_export.txt’ is in the fasta file format: >ENSG0000000XXXX|ENST0000000KKK MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLEREWDRE LASKKNPKLINALRRCFFWRFMFYGIFLYLGEVTKAVQPLLLGRIIASYDPDNKEERSIA IYLG* >ENSG0000000YYYY|ENST0000000ZZZ MTAEEMKATESGAQSAPLPMEGVDISPKQDEGVLKVIKREGTGTEMPMIGDRVFVHYTGW LLDGTKFDSSLD

# Example dataframe for df_biomart_seq_length 
df_biomart_seq_length <- MDTToolset::df_biomart_seq_selection

knitr::kable(head(df_biomart_seq_length,2), caption = "Transcripts' Peptide Seq Length")

Transcripts’ Peptide Seq Length
	ENSG	ENST	Length
ENSG00000001626\|ENST00000003084	ENSG00000001626	ENST00000003084	1481
ENSG00000004478\|ENST00000001008	ENSG00000004478	ENST00000001008	460

Remove Redundant Transcripts

remove_redundant() function takes three inputs, transcript TPM file (having ENST and ENSG columns), ENST sequence file, and Mane select file prepared in the previous step. If there is more than one transcript of a gene with the same sequences, one of them is removed, its TPM value is added to the selected transcript. Additionally, if there is a transcript assigned as MANE, this transcript is prioritized over the other ones. If there is no MANE selects for the redundant transcripts, the transcripts with the longest peptide sequence is chosen. The output will be the same as previous TPM file without redundant ones.

## Example Usage
#cond1_ENST_ENSG_nonredundant <- remove_redundants(data_matrix = cond1_ENST_ENSG, enst_ensp_sequences = ensg_enst_seq, mane_select =  NULL, peptide_lengths = NULL)
cond1_ENST_ENSG_nonredundant <- remove_redundants(cond1_ENST_ENSG, ensg_enst_seq, enst_mane_iso, df_biomart_seq_length)
cond2_ENST_ENSG_nonredundant <- remove_redundants(cond2_ENST_ENSG, ensg_enst_seq, enst_mane_iso, df_biomart_seq_length)

Find MDTs

The next step involves identifying the Most Dominant Transcripts (MDTs) in the dataset. The find_mdts() function calculates MDTs based on two criteria: a user-defined cutoff and min_exp. The cutoff parameter specifies the enrichment rate between the expressions of the first and second-ranked transcripts. min_exp determines the minimum expression level that transcripts must meet to be considered as MDTs.

cond1_mdts <- find_mdts(cond1_ENST_ENSG_nonredundant, cutoff = 2, min_exp = 0.2)
cond2_mdts <- find_mdts(cond2_ENST_ENSG_nonredundant, cutoff = 2, min_exp = 0.2)

Visualize count of MDTs

We provide a basic boxplot plot using ggplot to plot the number of MDTs in each condition and also bar graph to visualize number of MDTs in each sample. ### visualize the number of MDTs in each condition

# Distribution of number of MDTs in each sample
MDT_boxplot(cond1_mdts, cond2_mdts, title = "Distribution of Number of MDTs", x_axis_label = "Condition", y_axis_label = "Number of MDT")

visualize the number of MDTs in each sample

# Distribution of number of MDTs in each sample

Detection of Most Dominant Transcripts

Tülay Karakulak

2024-08-11