Creation of synthetic microRNA169 gene copies using machine learning

Updated: Feb 24, 2021


Applying machine learning algorithms to synthetic biology could, in principle, generate novel hypotheses about living systems, because machine learning can help identify design rules for constructing functional network architectures and synthetic gene circuits. Synthetic biology as a discipline is concerned with creating gene networks to re-program living cells and ultimately endow them with new capabilities for an array of biotechnology applications. Proponents of 'embodied cognition', on the other hand, foresee synthetic biology as a means to endow biological systems with artificial-intelligence-like behavior, and with it the use of biological matter as a substitute for computer hardware. The ongoing dialogue between these two disciplines offers tremendous opportunities for artists interested in using biological matter and computer algorithms as means of creative expression. In this work I explored the creation of synthetic microRNA169 genes using the language modeling capabilities of recurrent neural networks (LSTMs), and in doing so I co-opted elements from artificial intelligence and synthetic biology with the sole purpose of artistic expression.


My objective in this study was to train an LSTM on all available microRNA169 gene sequences from plants in order to create synthetic copies with structural and biological properties similar to those found in nature. Overall, I gathered 419 MIR169 gene sequences from 40 different plant species (Table 1) and trained char-rnn (a previously described LSTM implementation, see refs) with the default architecture and the following parameters: sequence_length 300; batch_size 1; and number_of_epochs 50. The number of gene sequences used for training was obviously small for a machine learning application, and this is something I should address in follow-up studies by pooling related microRNA families into a single training dataset.
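A character-level model such as char-rnn learns from a plain text corpus, so the gene sequences first have to be flattened into one text file. The sketch below shows one way this could be done, assuming the sequences are available in FASTA format and that one sequence per line is an acceptable layout (the newline then acts as a delimiter the model can learn). The function names and the toy sequences are made up for illustration; this is not necessarily the preprocessing used for the 419 MIR169 genes.

```python
from collections import Counter


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Sequences may wrap over several lines; normalise to upper case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_corpus(records):
    """Join sequences one per line so the newline marks sequence boundaries."""
    return "\n".join(seq for _, seq in records) + "\n"


# Toy input with made-up sequences (NOT real MIR169 genes).
toy_fasta = """>toy_seq_1
GGAUCCAA
GGAU
>toy_seq_2
uugccgaa
"""

records = parse_fasta(toy_fasta)
corpus = build_corpus(records)
vocab = Counter(corpus)  # character vocabulary the model would see

print(f"{len(records)} sequences, {len(corpus)} characters")
print("vocabulary:", sorted(vocab))
```

With nucleotide data the vocabulary is tiny (four bases plus the newline), which is part of why a small corpus can still produce plausible-looking output.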

Table 1. A total of 419 MIR169 gene sequences from 40 plant species were included in this work: 397 sequences obtained from the miRBase database and 22 additional sequences that I discovered and described while in graduate school in 2013 [see refs].

From the LSTM's model output I took 42 synthetic sequences (10% of the total number of biological sequences used for training) and constructed a Neighbor Joining phylogenetic tree to assess the relatedness of all sequences, synthetic and biological, in terms of nucleotide similarity. I focused on the synthetic sequences most closely related to biological MIR169 gene sequences and went on to evaluate their capacity to form the hairpin-like secondary structure so characteristic of precursor microRNAs in plants (Fig 1a-c).
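A Neighbor Joining tree is built from a matrix of pairwise distances between aligned sequences. The sketch below covers only that first step, not the tree construction itself: it computes a simple p-distance (the proportion of differing sites) and ranks biological sequences by their closeness to a synthetic one. The choice of p-distance is an assumption, since the distance model used for the tree is not stated here, and the sequences and names are toy placeholders.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    # With no comparable columns, treat the pair as maximally distant.
    return differing / compared if compared else 1.0


def nearest_biological(synthetic, biological):
    """Return (name, distance) pairs sorted from closest to farthest."""
    pairs = ((name, p_distance(synthetic, seq)) for name, seq in biological.items())
    return sorted(pairs, key=lambda pair: pair[1])


# Toy aligned sequences (made up, NOT real MIR169 genes).
aligned_bio = {
    "toy_bio_1": "ACGUACGU",
    "toy_bio_2": "ACGU--GU",
    "toy_bio_3": "UUUUACGA",
}
aligned_syn = "ACGUACGA"

ranking = nearest_biological(aligned_syn, aligned_bio)
for name, dist in ranking:
    print(f"{name}: {dist:.3f}")
```

A ranking like this gives a quick shortlist of synthetic sequences worth inspecting further, although a full tree shows relationships among all sequences at once.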

Figure 1a. Diagram displaying the results of evaluating the first synthetic MIR169 sequence from the LSTM's model output. A section of the main phylogenetic tree is shown at the top left, describing the sequence relatedness of the synthetic MIR169 gene to its biological counterparts in plants. Clustal multiple sequence alignment using MUSCLE was conducted to reliably identify mature miR169 sequences. When miR169 was present in the synthetic gene, the gene sequence was fed into the RNAfold web server to predict its secondary RNA structure. The hairpin-like synthetic sequence was then assigned the name scm_MIR169a, in reference to the School of Creative Media (SCM) where I am currently working. The mature scm_miR169a was then fed into a small RNA target prediction program using the Arabidopsis thaliana library (unigene DFCI Gene Index version 15, released in 2010) for target identification (bottom right of the diagram).
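RNAfold reports predicted structures in dot-bracket notation, where matching parentheses are paired bases and dots are unpaired bases. As a rough illustration of what "hairpin-like" means in that notation, the sketch below checks whether a structure is a single stem-loop: every opening bracket must come before every closing bracket, which allows bulges and internal loops but rules out multiple branches. The function names, the min_pairs threshold and the example strings are hypothetical, and this simple check is not the evaluation performed for the figures.

```python
def is_balanced(structure):
    """Check that a dot-bracket string has properly nested parentheses."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} in structure")
    return depth == 0


def is_single_hairpin(structure, min_pairs=1):
    """Return True if the structure is one stem-loop.

    A single hairpin (possibly with bulges or internal loops) has all of its
    opening brackets before its first closing bracket.
    """
    if not is_balanced(structure):
        return False
    pairs = structure.count("(")
    if pairs == 0 or pairs < min_pairs:
        return False
    first_close = structure.index(")")
    return "(" not in structure[first_close:]


# Toy dot-bracket strings (made up for illustration).
examples = {
    "simple hairpin": "((((....))))",
    "hairpin with internal loop": "((..((...))..))",
    "two separate stems": "((...))((...))",
    "unstructured": "......",
}
for label, structure in examples.items():
    print(f"{label}: {is_single_hairpin(structure)}")
```

Raising min_pairs would make the check stricter, for example to require a stem long enough to resemble a plant microRNA precursor.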

Figure 1b. Diagram displaying the results of evaluating another synthetic MIR169 sequence from the LSTM's model output. In this case synthetic scm_MIR169b had a miR169 allele different from the canonical one found in nature and thus primarily targeted different genes in Arabidopsis.

Figure 1c. Diagram displaying the results of evaluating two further synthetic MIR169 sequences from the LSTM's model output, scm_MIR169c and scm_MIR169d.


Synthetic MIR169 genes created with machine learning algorithms could be used to create transgenic plants in those cases where the mature miR169 sequence differs from the canonical ones found in nature, with possible new phenotypes as the outcome. Although I have not done so yet, synthetic miR169 gene targets could also be created using machine learning algorithms to complement the work shown here. I was positively surprised to find that training with so few sequences could lead to any solid hairpin secondary structure reminiscent of biological microRNA hairpins. Here I presented the results for only 4 of the 42 synthetic sequences taken from the LSTM model's output, so more work is needed on my part to conclude this study. The next step would be the creation of transgenic plants harboring one of these synthetic MIR169 genes created with machine learning. Such a genetically modified organism would be made for the purpose of scientific research and, most importantly, as an art piece signifying the biological embodiment of a computer algorithm's creation as the expressive desire of the artist.


I would like to thank Dr. Antoni Chan and team members from the Computer Science Department, City University of Hong Kong (CityU) for providing valuable advice and the necessary infrastructure to run machine learning applications. I would also like to thank Dr. Tomas Laurenzo and the School of Creative Media (CityU) for their support in exploring artistic applications of machine learning and artificial intelligence.


Link to sources consulted