Acute myeloid leukemia (AML) is an aggressive malignancy of haematopoietic stem cells driven by a well-defined set of somatic mutations.21 Identifying the mutations driving individual cases is important for assigning the patient to a recognized World Health Organisation category, establishing prognostic risk and tailoring post-consolidation therapy.3 As a result, AML research and diagnostic laboratories apply diverse methodologies to detect important mutations and many are introducing next-generation sequencing (NGS) approaches to study extended panels of genes in order to refine genomic classification and prognostic category.1 Besides the implications of these developments on costs, expertise and reliance on commercial providers, they also do not capture gene expression data, which have independent prognostic value that cannot be inferred from somatic mutation profiles. The ability to detect AML gene mutations as well as gene expression profiles from a single assay, could provide a holistic tool that accelerates research, simplifies diagnostic work-up and helps develop integrated algorithms to refine individual patient prognosis. Here, we show that AML RNA sequencing (RNA-seq) data can be used to reliably detect all types of clinically important mutations and develop a bespoke fast and easy-to-use software (RNAmut) for this purpose that can be readily used by teams/laboratories without in-house bioinformatic expertise.
We focused on detection of mutations in 33 genes that are relevant to AML classification and prognosis (Table 1) and designed the software pipeline to operate in three stages: read alignment, mutation detection and an additional oncogenicity filter (Figure 1A). To ensure fast alignment of RNA-seq reads, we indexed all possible 10-mer sequences from our target genes into a look-up table (hash function) that maps the 10-mers to their locations on the 33 genes (Online Supplementary Figure S1A). We used 10-mers (instead of 9-mers or 11-mers etc.) for optimal balance between speed and memory requirements (Online Supplementary Table S1). To align RNA-seq reads, the sequence of each read is divided into consecutive 10-mers and each 10-mer is mapped to genic locus/loci using the pre-built look-up table. By examining all 10-mers in a read, RNAmut computes whether the read is perfectly aligned (Type A), aligned with mismatches (Type M) or not aligned (Type N; Online Supplementary Figure S1B). M-type reads are used to detect substitutions and small indels (Figure 1B). To detect tandem duplications, RNAmut uses the subset of N-type reads for which the 5′ end is mapped downstream of their 3′ end, and computes the location of the duplicated region (Figure 1C). Gene fusions are detected through two independent pieces of evidence: first, reads spanning the breakpoint (i.e. chimeric reads) are extracted from the N-type reads and used to report the precise location of the breakpoint (Figure 1D). Secondly, fusion genes can also be identified from paired-end RNA-seq reads when each of the two paired reads aligned to a different fusion partner (Figure 1E). All mutations covered by ≥3 unique reads are reported and these are then optionally parsed through an oncogenicity filter applying the criteria used by the largest AML sequencing study published to date1 (Online Supplementary Table S2), which could be especially useful for diagnosticians. Full details of the RNAmut pipeline are given in the Online Supplementary Materials and Methods. To benchmark read mapping, we compared RNAmut’s alignment with commonly used read aligners.64 Our alignment showed very good agreement with panel-restricted alignments by BWA (Online Supplementary Figure S12A-B) and Salmon (Online Supplementary Figure S14), and global alignment by STAR (Online Supplementary Figure S13), for all of which both Pearson correlation and gradient were very close to 1.
To test the performance of our RNAmut, we analyzed 151 RNA-seq datasets from AML bone marrow samples generated by the Cancer Genome Atlas (TCGA)2 and detected 40 NPM1, 37 FLT3-ITD, 35 DNMT3A, 17 IDH2, 13 IDH1, 17 RUNX1, 17 CEBPA, 13 TP53, 13 TET2, 10 FLT3 TKD, 7 MLL-PTD, 11 WT1, 3 ASXL1, 1 BCOR, 12 SRSF2, 3 SF3B1 and 7 U2AF1 mutations, along with 15 PML-RARA, 10 MYH11-CBFB, 7 RUNX1-RUNX1T1, 3 BCR-ABL1, 3 NUP98-NSD1 fusions and 8 MLL (KMT2A) fusions with various partners (Online Supplementary Figure S4A and Online Supplementary Materials and Methods). Notably RNAmut accurately detects the lengths and positions of duplicated regions of FLT3-ITD (Online Supplementary Materials and Methods and Online Supplementary Figure S8A-B) while also reporting the number of mutated and WT reads and allelic frequencies (Online Supplementary Figure S8C). To assess the accuracy of our software, we compared our results with the mutations detected in these samples by Ley et al.2 Our software detected 289 of the 291 reported mutations (Figure 2). The two cases that we failed to detect were an IDH1 R132C in TCGA-AB-2984 and an MLL-PTD exon2-8 in TCGA-AB-2977. IDH1 R132C was missed due to the gene’s low expression in this sample: only five good quality reads covered R132 of which only one was mutated and thus does not meet the minimum of three mutant reads required by RNAmut (Online Supplementary Figure S18). To examine the missed MLL-PTD, we constructed the nucleotide sequence of the reported exon2-8 junction and found no RNA-seq reads reporting such a junction, indicating this may have been an annotation error. Moreover, our software identified 29 samples with mutations that were not reported by Ley et al. (Figure 2). For all these samples, we found evidence at the level of RNA and, where available, also DNA (eight samples with whole exome sequencing data) to show that they are indeed true positives (Online Supplementary Figure S19-20, and Online Supplementary Table S4). To further demonstrate the robustness of RNAmut, we tested its performance on the RNA-seq data from two other sources: i) a set of 164 myelodysplastic syndrome (MDS) patients87 and ii) 437 AML patients studied by the Leucegene consortium,129 both derived from bone marrow samples. For the MDS samples, RNAmut detected all panel-gene mutations identified through targeted DNA sequencing by the authors (Online Supplementary Figure S5), as well as 34 mutations that were not (Online Supplementary Figure S7). For Leucegene where mutation data are not available on a per sample basis, RNAmut detected similar landscapes of mutations overall (Online Supplementary Figure S4B and S6) including all 27 instances of an NPM1 mutation reported in one of the consortium’s publications.11
To validate our method, we first checked and confirmed that all exon sequences of the 33 panel genes are unique in the transcriptome, ruling out the possibility that RNA fragments from non-panel genes are mistakenly aligned to the panel (Online Supplementary Figures S9-11). To benchmark mutation calling, we compared RNAmut with commonly used mutation callers. Our variant allele frequency (VAF) calculation agreed very closely with Samtools13 for substitutions (Online Supplementary Figure S15A) and with Varscan14 for both substitutions (Online Supplementary Figure S15B) and indels (Online Supplementary Figure S15C). Furthermore, RNAmut detected all gene fusions identified by Fuseq15 and displayed better sensitivity for detection of MLL fusions (Online Supplementary Table S3). Finally, we also compared the VAF detected in RNA-seq with the ones detected in whole exome DNA sequencing data and observed a good correlation for most substitutions (Online Supplementary Figure S16). Nonsense mutations in DNMT3A (n=3) and TP53 (n=1) had lower RNA than DNA VAF, possibly due to nonsense-mediated decay (NMD). Nevertheless, a scan of all gain-of-stop codon mutations showed that transcripts potentially subjected to NMD were within detectable levels in AML RNA-seq datasets (Online Supplementary Figure S17).
Whilst somatic mutation detection from RNA-seq data is not a novel concept,16 existing software packages are designed for whole-transcriptome detection, which requires significantly larger memory and long computation time.17 Also, the lack of integrated pipelines demands intensive scripting and manual adjustment of parameters. Moreover, most existing packages are restricted to the UNIX system, which excludes the Windows user base and with it, most laboratories without in-house bioinformatic expertise. In this study, we present RNAmut, a fast, memory efficient and platform-independent software, which can run on personal computers including laptops and takes less than 30 minutes to detect all types of mutations affecting the selected 33 AML genes, from a typical RNA-seq dataset of 100 million paired-end reads (Online Supplementary Table S1). RNAmut can be easily extended to other malignancies by adding or removing genes from its gene index. In addition, it has the option to operate through a graphical user interface, making it accessible to users without any programming knowledge and is freely available in GitHub as a Java application. (https://github.com/muxingu/rnamut). Users of RNAmut should be aware of its limitations such as the fact that it is not designed to detect copy number variations or indels longer than 30 bp other than FLT3-ITD or MLL-PTD, and as it relies on transcribed RNA it cannot identify intronic or intergenic single-nucleotide variants (SNV). Furthermore, users of customized gene panels will need to ensure that their genes of interest are expressed sufficiently for RNAmut to detect any mutation, which is a general limitation of all RNA-seq based mutation callers.
The current molecular diagnosis of AML relies on multidisciplinary workflows including cytogenetic, molecular and NGS tests in order to detect different types of mutations. In this study, we demonstrate that all diagnostically important somatic mutations in AML can be reliably detected from RNA-seq within one single workflow. Our bespoke software, RNAmut, greatly reduces the difficulty and time required to analyse NGS data with results that match or even out-perform current methods. As our approach can be readily combined with information such as gene expression and splicing from the same RNA-seq dataset, it can be used to generate integrated algorithms that enhance prognostication and patient treatment. Furthermore, as RNA sequencing is a relatively straightforward procedure, our approach can readily be taken up by the AML research community and also by clinical laboratories, for whom it can significantly reduce experimental costs and accelerate AML genomic diagnosis and classification.
- Papaemmanuil E, Gerstung M, Bullinger L. Genomic classification and prognosis in acute myeloid leukemia. N Engl J Med. 2016; 374(23):2209-2221. PubMedhttps://doi.org/10.1056/NEJMoa1516192Google Scholar
- Ley TJ, Miller C. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N Engl J Med. 2013; 368(22):2059-2074. PubMedhttps://doi.org/10.1056/NEJMoa1301689Google Scholar
- Bullinger L, Dohner K, Dohner H. Genomics of acute myeloid leukemia diagnosis and pathways. J Clin Oncol. 2017; 35(9):934-946. https://doi.org/10.1200/JCO.2016.71.2208Google Scholar
- Dobin A, Davis CA, Schlesinger F. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29(1):15-21. PubMedhttps://doi.org/10.1093/bioinformatics/bts635Google Scholar
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25(14):1754-1760. PubMedhttps://doi.org/10.1093/bioinformatics/btp324Google Scholar
- Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017; 14(4):417-419. PubMedhttps://doi.org/10.1038/nmeth.4197Google Scholar
- Shiozawa Y, Malcovati L, Galli A. Gene expression and risk of leukemic transformation in myelodysplasia. Blood. 2017; 130(24):2642-2653. PubMedhttps://doi.org/10.1182/blood-2017-05-783050Google Scholar
- Shiozawa Y, Malcovati L, Galli A. Aberrant splicing and defective mRNA production induced by somatic spliceosome mutations in myelodysplasia. Nat Commun. 2018; 9(1):3649. Google Scholar
- Lavallee VP, Lemieux S, Boucher G. RNA-sequencing analysis of core binding factor AML identifies recurrent ZBTB7A mutations and defines RUNX1-CBFA2T3 fusion signature. Blood. 2016; 127(20):2498-2501. PubMedhttps://doi.org/10.1182/blood-2016-03-703868Google Scholar
- Macrae T, Sargeant T, Lemieux S, Hebert J, Deneault E, Sauvageau G. RNA-Seq reveals spliceosome and proteasome genes as most consistent transcripts in human cancer cells. PLoS One. 2013; 8(9):e72884. PubMedhttps://doi.org/10.1371/journal.pone.0072884Google Scholar
- Pabst C, Bergeron A, Lavallee VP. GPR56 identifies primary human acute myeloid leukemia cells with high repopulating potential in vivo. Blood. 2016; 127(16):2018-2027. PubMedhttps://doi.org/10.1182/blood-2015-11-683649Google Scholar
- Simon C, Chagraoui J, Krosl J. A key role for EZH2 and associated genes in mouse and human adult T-cell acute leukemia. Genes Dev. 2012; 26(7):651-656. PubMedhttps://doi.org/10.1101/gad.186411.111Google Scholar
- Li H, Handsaker B, Wysoker A. The sequence alignment/map format and SAMtools. Bioinformatics. 2009; 25(16):2078-2079. PubMedhttps://doi.org/10.1093/bioinformatics/btp352Google Scholar
- Koboldt DC, Zhang Q, Larson DE. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012; 22(3):568-576. PubMedhttps://doi.org/10.1101/gr.129684.111Google Scholar
- Vu TN, Deng W, Trac QT, Calza S, Hwang W, Pawitan Y. A fast detection of fusion genes from paired-end RNA-seq data. BMC Genomics. 2018; 19(1):786. Google Scholar
- Neums L, Suenaga S, Beyerlein P. VaDiR: an integrated approach to Variant Detection in RNA. Gigascience. 2018; 7(2)Google Scholar
- Coudray A, Battenhouse AM, Bucher P, Iyer VR. Detection and benchmarking of somatic mutations in cancer genomes using RNA-seq data. PeerJ. 2018; 6:e5362. https://doi.org/10.7717/peerj.5362Google Scholar