Positive selection and heat‐response transcriptomes reveal adaptive features of the Brassicaceae desert model, Anastatica hierochuntica

Summary Plant adaptation to a desert environment and its endemic heat stress is poorly understood at the molecular level. The naturally heat‐tolerant Brassicaceae species Anastatica hierochuntica is an ideal extremophyte model to identify genetic adaptations that have evolved to allow plants to tolerate heat stress and thrive in deserts. We generated an A. hierochuntica reference transcriptome and identified extremophyte adaptations by comparing Arabidopsis thaliana and A. hierochuntica transcriptome responses to heat, and detecting positively selected genes in A. hierochuntica. The two species exhibit similar transcriptome adjustment in response to heat and the A. hierochuntica transcriptome does not exist in a constitutive heat ‘stress‐ready’ state. Furthermore, the A. hierochuntica global transcriptome as well as heat‐responsive orthologs, display a lower basal and higher heat‐induced expression than in A. thaliana. Genes positively selected in multiple extremophytes are associated with stomatal opening, nutrient acquisition, and UV‐B induced DNA repair while those unique to A. hierochuntica are consistent with its photoperiod‐insensitive, early‐flowering phenotype. We suggest that evolution of a flexible transcriptome confers the ability to quickly react to extreme diurnal temperature fluctuations characteristic of a desert environment while positive selection of genes involved in stress tolerance and early flowering could facilitate an opportunistic desert lifestyle.


Article acceptance date: 18 July 2022
The following Supporting Information is available for this article:       Methods S1 Additional information regarding methods used in this study.

Fig. S1
Transcriptome sequencing and hybrid assembly workflow. RNA was extracted from plate-grown A. hierochuntica seedlings under control conditions (Illumina sequencing), or from soil-grown plants under control or various abiotic stress conditions, and imbibed seeds (454 sequencing). To assemble both short-and long-read sequences together, a hybrid approach was taken. The Illumnia reads were first assembled into long contig sequences, using the Trinity assembler, and then shredded into 700 bp long, overlapping (at least 200 bp overlap) pseudoreads that were reassembled together with the 454 reads using the Newbler assembler.

Fig. S3 Clustering dendrogram of module eigenvalues for A. thaliana and A. hierochuntica
transcriptome profiles under heat stress conditions. Dendrograms were generated using the WGCNA package with module size minimum set at 50 genes. Dendrograms are presented after merging of modules with a cut off of 30% dissimilarity (70% similarity). These modules were assigned standard color-based names by WGCNA. Identical module names between A. thaliana and A. hierochuntica do not indicate similarity in function, expression profile or shared genes.

Validation of 'between species' RNA-seq analyses
The above comparisons of gene expression between the two species utilized DeSeq2 (Love et al., 2014) as a normalized measure of gene expression and to identify differentially expressed genes. DeSeq2 normalizes read counts for different sequencing depths between samples. However, when dealing with two different species, several other factors can affect direct comparison of expression levels between orthologs including whether a few highly expressed genes constitute a large proportion of the sequenced transcripts, as well as differences in gene numbers and orthologous transcript length (Zhou et al., 2019;Zhao et al., 2020). Therefore, we performed several further analyses to validate our results. Figure S4a shows that there was no significant difference between the species in the proportion of the top 10 most highly expressed genes out of the total transcripts sequenced across all treatments (A. thaliana, ~7% to 12%, and A. hierochuntica, ~10% to 15% of the total sequenced transcripts).
We further re-normalized our raw read count data (normalized for transcript length) using a new between-species method that applies Scale-Based Normalization (SCBN) to the most conserved orthologs, thereby obtaining a scaling factor that minimizes the false discovery rate of differentially expressed genes (Zhou et al., 2019). Applying SCBN to the 109 most conserved orthologs between A. thaliana and A. hierochuntica (Dataset S12) and using the scaling factor to correct normalized gene counts, we obtained similar comparative basal expression results as observed with DeSeq2 (Fig. S4b). Finally, QPCR analysis confirmed the RNA-seq fold-change gene expression patterns of four selected A. thaliana and A. hierochuntica genes (Fig. S4c).
These genes included AKS2, a gene found to be positively selected in the 'all extremophyte species' run (Table 1), two genes involved in abiotic stress responses (ELIP1 and HSP17.6II; Sun Heat shock (45 °C), harvested after 0.5, 1, and 2 h. Roots, shoots and flowers (where available) from these soil-grown plants, were harvested separately and snap-frozen in liquid nitrogen. In addition to these soil-grown samples, mature seeds, from the same F4 generation seed stock, were imbibed in H2O for 8.5 h, and then snap-frozen in liquid nitrogen.
For RNA-seq heat stress experiments, A. thaliana and A. hierochuntica were germinated and grown on nutrient agar plates according to Eshel et al. (2017). Seedlings were grown on plates until cotyledons were fully expanded before transfer to 7 cm x 7 cm x 8 cm pots The reference transcriptome was assembled using a hybrid assembly approach that utilized both Illumina and 454 reads. Briefly, the Illumina reads were first assembled using the Trinity software package (Grabherr et al., 2011). The resulting Trinity contigs were then shred into 700 bp fragments ('pseudo-reads'), with a 200 bp overlap between adjacent fragments, using the fasta2frag.tcl script, from the MIRA4 package (Chevreux et al., 1999;Chevreux et al., 2004). Both the shredded contigs and the 454 reads were assembled together using the Newbler assembler v.2.6 (Margulies et al., 2005) to generate the A. hierochuntica reference transcriptome comprising 36,871 assembled high-confidence transcripts ( Figure S1).  (Conesa et al., 2005). For statistical differential expression tests, count data of uniquely mapped reads were calculated per transcript from the Bowtie output using a custom python script, and were further analyzed using the R package DESeq2 (Love et al., 2014). In order to test for differences between species, a subset of 17,962 Bi-directional Best BLASTN hits (orthologs) between the two species, was used. DESeq2 normalizes the count data of each library by a size factor to account for different sequencing depths between the libraries. However, when comparing the expression levels between different species, it is essential to also normalize the read counts by the transcript length. Therefore, instead of inputting DESeq2 with the read counts per transcript, we first divided these values by the transcript length (in kilobases) and rounded the values to input DESeq2 with discrete numbers that fit a Poisson distribution.

A. thaliana and
One inherent problem in analyzing expression data (or any other type of multidimensional data) is that the variance increases with the mean. Therefore, the most highly expressed genes will dominate the differential expression analysis. One common solution is to use logarithmic transformation but this approach is prone to dominance of the lowest expressed genes, which will show the strongest differences between samples (see DESeq2 manual, http://www.bioconductor.org/help/workflows/rnaseqGene/#timecourseexperiments). Therefore, DESeq2 uses the regularized-logarithm transformation (rlog) to stabilize the variance in count data across the mean, thereby improving the dispersion estimation for both high and low expressed genes, and reducing the false discovery rate in calling differential expression (Love et al., 2014). (GO:0009408), 'resp. to temperature stimulus' (GO:0009266), 'resp. to reactive oxygen species' (GO:0000302), 'resp. to hydrogen peroxide' (GO:0042542), 'resp. to high light intensity' (GO:0009644), 'resp. to oxidative stress' (GO:0006979), 'resp. to radiation' (GO:0009314), 'resp.
For functional clustering of GO-terms in Fig. 5C, the GOMCL algorithm (Wang et al., 2020) was used to reduce redundancy of the Gene Ontology. GO terms that share the majority (>50%) of genes among them were clustered with an inflation value of 3. Enriched GO terms with more than 1,000 genes in the A. thaliana genome were excluded.

Scale-based normalization (SCBN)
To validate the differential gene expression analysis performed with DeSeq2, we re-normalized our RNA-seq read count data (normalized to transcript length) using the Scale-Based BLASTn (ver, 2.10.1) software revealed 109 highly conserved orthologs with an E-value ≤ 1e-100 , query coverage of ≥ 98%, and identical matches of ≥ 99%. The SCBN R package (http://www.bioconductor.org/packages/devel/bioc/html/SCBN.html) was then used on the 109 highly conserved genes to obtain a scaling factor of 0.9223461, which it then applied to the 17,962 orthologs to call 12,808 common differentially expressed genes (p ≤ 1e-05 ) between the two species. To generate an approximate corrected gene count, the scaling factor was applied to each individual gene and the corrected average and median basal (control) expression is depicted in Fig. 9B.

Real-time QPCR analysis
Total RNA was extracted from whole shoots with TRIzol (38% Phenol (w/v), 0.8 M guanidine thiocyanate, 0.4 M ammonium thiocyanate, 0.1 M sodium acetate pH 5, 5% glycerol (v/v)) according to Rio et al. (2010). To remove residual genomic DNA, 7 g of total RNA was treated with RNase-free PerfeCta DNase I (Quanta Biosciences, Inc., Gaithersburg, MD, USA) according to the manufacturer's instructions, and cDNA was synthesized from 1 μg of total RNA using the qScript TM cDNA Synthesis Kit (Quanta Biosciences). For amplification of PCR products, primers were designed using the NCBI Primer-BLAST tool (Ye et al., 2012), and analyzed for any secondary structure with the IDT OligoAnalyzer™ Tool (Dataset S13 To detect positive selection in the five extremophyte species, ortholog groups with sequence representation in at least 2 extremophytes, were selected to ensure sufficient statistical power (Anisimova et al., 2001). For each ortholog group, the peptide MSA was converted into the corresponding codon alignment using the pal2nal.pl program (Suyama et al., 2006), and the ML species tree was pruned using PHAST tree_doctor (Hubisz et al., 2011), to keep only sequence-represented taxa. Codon alignments together with pruned trees were further analyzed with the PAML v4.8, CODEML program (Yang, 1997;Yang, 2007), using the Branch-Site model. To test for positive selection, the tested branch(s) were labeled (foreground), and the log likelihood of two models (M1a and M2a), were calculated for each ortholog group. The difference between the two models is that in the M1a (null) model, the non-synonymous to synonymous rate ratio (dN/dS) is fixed to 1 (fix_omega = 1 and omega = 1), indicative of neutral selection, while in the M2a (alternative) model, the initial dN/dS ratio is set to 1, and is further estimated by the model (fix_omega = 0 and omega = 1). A Likelihood Ratio Test was performed (with Χ 2 distribution), to identify genes with log likelihood values significantly different between the two models, indicative of deviation from neutral selection.
Ortholog groups with portion of sites in the foreground branches, that had an estimated dN/dS ratio greater than 1, were considered under positive selection. To account for multiplicity, a Benjamini-Yekutieli false discovery rate (FDR) correction (Benjamini & Yekutieli, 2001) was applied using the 'qvalue' R package, with a q-value < 0.05 cutoff for a gene to be considered as positively selected. Sites under positive selection were identified using the empirical Bayes approach with a posterior probability p > 0.95.
The above procedure was repeated to identify positive selected genes in A.
hierochuntica, and in the other extremophyte species. For each analysis, different branches on the tree were tested (labeled as foreground) compared with all other branches (background): (i) labeling the external branches of all five extremophyte species as the foreground (4,723 ortholog groups); (ii) labeling the A. hierochuntica external branch as the foreground (3,093 ortholog groups); (iii) labeling the E. salsugineum external branch as the foreground (4,457 ortholog groups); (iv) labeling the S. parvula external branch as the foreground (4,369 ortholog groups); and (v) labeling the A. thaliana external branch as the foreground (5,513 ortholog groups). A. thaliana was considered as a control/comparator species sensitive to abiotic stresses (Kazachkova et al., 2018). The Venn diagram comparing positive selected genes ( Fig.   6B) was generated using an online tool: http://bioinformatics.psb.ugent.be/webtools/Venn/).
To further assess the functionality of the positively selected genes, Gene Ontology (GO) terms were assigned to each ortholog group based on A. thaliana GO annotation. In cases where an ortholog group did not contain an A. thaliana ortholog, the closest A. thaliana homolog (best BLASTP hit) was used. Significant positively selected genes were further tested for enriched GO terms (Fisher's exact test, with a q-value < 0.05 cutoff) using the online AgriGO tool (Du et al., 2010; http://bioinfo.cau.edu.cn/agriGO/analysis.php), where the A. thaliana genome served as the background. Enriched GO terms with more than 2,000 genes in the A.
thaliana genome were excluded, as these are broad and less informative terms.