Auditory perception of CG and CCG sequence context variation at the VERNALIZATION 1 gene across orth

Here I present, as a work of art, the rendering of auditory and visual data associated with the distribution of CG and CCG along the genic contexts of vernalization genes contained within orthologous regions from four grass species: Brachypodium distachyon, Brachypodium stacei, Oryza sativa and Zea mays. My work focused on the following three questions:

1 Does the CG and CCG distribution pattern at vernalization genes differ between grasses that require vernalization in order to flower, such as Brachypodium distachyon and Brachypodium stacei, and those that are vernalization independent, such as Oryza sativa (rice) and Zea mays (corn)?

2 Can I sonify CG and CCG occurrences in order to add an auditory dimension to variation in genome data? If so, can this be considered a step towards the 'perceptualization' of genotypic variation?

3 Can I create a cultural artifact as an extended form of 'perceptualization' of genomic information by integrating data, visuals and sound into a single work of art? Such a work would communicate knowledge in artistic form.

This essay explores novel approaches to genome visualization, the production of functional sounds, and the creation of responsive art.


Sonification is defined as the use of non-speech audio to convey information or perceptualize data. For the purpose of this study, non-speech audio was considered functional sound because it served a specific function: the perceptualization of CG and CCG distribution along the genic context of vernalization genes from four different plant species.

Here I used a distinctive type of sonification technique known as parameter-based sonification, which involved the mapping of data values (occurrence of CG and CCG di- and tri-nucleotides) to acoustic attributes of a sound.

I created functional sound programmatically using the oscillator objects of Processing's sound library, assigning a particular sound frequency to CG and CCG occurrences depending on their context [upstream (-2 kbp), 5' UTR, gene body, 3' UTR, downstream (+2 kbp) regions].
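As a rough illustration of parameter-based mapping, the sketch below assigns one frequency to each genic context. It is written in Python rather than Processing, and the frequency values and the names REGION_FREQ_HZ, frequency_for and sonify_hits are hypothetical; the text does not state which frequencies were actually used.

```python
# Hypothetical frequencies (Hz), one per genic context. These values are
# placeholders for illustration, not the ones used in the original work.
REGION_FREQ_HZ = {
    "upstream": 220.00,    # -2 kbp
    "5'UTR": 261.63,
    "gene_body": 329.63,
    "3'UTR": 392.00,
    "downstream": 440.00,  # +2 kbp
}


def frequency_for(region: str) -> float:
    """Return the oscillator frequency assigned to a genic context."""
    return REGION_FREQ_HZ[region]


def sonify_hits(hits):
    """Map (position, region) motif hits to (position, frequency) events."""
    return [(pos, frequency_for(region)) for pos, region in hits]
```

Each CG or CCG occurrence then becomes a sound event whose pitch encodes where in the gene it was found.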

The algorithmic rules used to generate functional sound are explained below. My compositional process followed a translational model in which DNA sequence information from an existing non-audible medium (a text file containing strings of A, G, C and T characters) was translated into sound. In this manner, the sonification was true to the data, rather than using gene sequences merely as creative inspiration for audio generation.

Taking this into account, it is important to mention that I do not consider the functional sound created in this work to be algorithmic music; I leave that distinction to musicians and composers. Many efforts have previously been made to combine music with DNA and protein, including the pioneering work of scientist Ross King in collaboration with musician and artist Colin Angus in 1996, when they assigned musical notes to DNA and protein sequences and played them together as a single musical piece. My work instead explores the possibilities of genome sonification as a means to perceptualize genetic variation rather than to create algorithmic music from it.

The occurrence and distribution of CGs and CCGs in vernalization genes were placed within an evolutionary context by comparing orthologous segments among four species of the grass family. In plants, cytosine is methylated in three sequence contexts: CG, CHG and CHH (with H = A, T, or C). The frequency of methylated cytosines (mCs) differs among these three contexts, in part because different biological mechanisms are involved. The frequency of mCs also varies across regions of the genome: repetitive regions harbor mCs in all three contexts, where methylation functions to silence transposable elements. In contrast, mCs within exons of protein-coding genes are found mainly in the CG context in many plant species, and gene body methylation (gbM) is associated with higher expression levels in most cases.
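The three sequence contexts can be made concrete with a short Python sketch. It looks only at the forward strand, and the function name cytosine_context is hypothetical; it illustrates the CG / CHG / CHH definitions above and is not a methylation caller.

```python
from typing import Optional


def cytosine_context(seq: str, i: int) -> Optional[str]:
    """Classify the cytosine at index i as 'CG', 'CHG' or 'CHH'.

    Returns None if seq[i] is not C, or if too few (or ambiguous)
    bases follow it to decide. H stands for A, T or C.
    """
    s = seq.upper()
    if i >= len(s) or s[i] != "C":
        return None
    if i + 1 >= len(s):
        return None
    if s[i + 1] == "G":
        return "CG"
    if s[i + 1] not in "ATC" or i + 2 >= len(s):
        return None
    if s[i + 2] == "G":
        return "CHG"          # CCG is the CHG case with H = C
    if s[i + 2] in "ATC":
        return "CHH"
    return None
```

In the toy sequence ACGCCGCAT, the first C of the CCG trinucleotide (index 3) is classified as CHG, while the second C (index 4) is itself in a CG context.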

Because of the importance of gbM in gene expression, my work aimed to provide an understanding of the distribution of genic CGs and CCGs in four plants of the grass family by taking an evolutionary approach to CG and CCG occurrence at orthologous genes involved in the vernalization pathway. Because of the relationship between mCs in CG and CCG sequence contexts reported for Arabidopsis thaliana, I focused on CCG as a particular case of CHG occurrence and distribution. Although I do not focus on methylated cytosines, the ratio of CG and CCG relative to the total content of C + G was taken into account as a means to facilitate subsequent studies on comparative epigenomics by research groups actively working on the topic.

The mechanism underlying the vernalization response in plants is an interesting process not only from a biological point of view but also from an artistic perspective, as vernalization can be considered 'molecular memory': plants acquire the ability 'to remember' winter. Strictly speaking, vernalization is the process by which exposure to the extended cold of winter results in the capacity to flower during the next growing season. Here, I focus on the gene VERNALIZATION 1 (VRN1) for the comparative analysis of CG and CCG distribution. Although there is extensive evidence demonstrating cold-mediated chromatin changes involving histone modifications at VRN1 in wheat and barley, there has not been much focus on the role of cold-induced cytosine methylation at VRN1 and its effect on the vernalization response in plants. For this reason, comparing the distribution of CGs and CCGs at VRN1 between grass species that display a vernalization response (B. distachyon and B. stacei) and those that do not require exposure to prolonged cold in order to flower (rice and maize) presents a unique opportunity for integrating data sonification and visualization into the creation of works of art. The use of genomic data as raw material for the creation of art constitutes a new form of artistic expression that I have previously named Arte GAGAISTA or GAGAISMO (from the words Genomic And Geometric AbstracionISM), and this work represents an important step towards the construction of a coherent body of work and the theoretical framework supporting it.

Results & Discussion

Visualizing the distribution pattern of CG and CCG sequence contexts at the VRN1 gene

Here I explored a novel approach to visually representing the VRN1 gene based solely on its C+G content and, from it, the frequency distribution of CG and CCG sequence contexts. Rather than the rectangular, linear representation of a gene that is characteristic of genome browsers, I opted for concentric circles as a means to display the data associated with each genic segment: upstream region, 5'UTR, gene body, 3'UTR and downstream region. A computer algorithm was created to quantify the occurrences of C and G nucleotides for each segment, as well as the occurrences of CG and CCG sequence contexts. The approach used to visualize these data relied on size and color, and is explained in Figure 1. Five sets of concentric circles, one for each genic segment of the VRN1 genes from the four grass species, are shown in Figure 2.
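The counting step can be sketched as follows. This is a minimal Python illustration, not the original Processing algorithm; the function name count_motifs is hypothetical, and it assumes that a CG lying inside a CCG is also counted as a CG, which the text does not specify.

```python
def count_motifs(segment: str) -> dict:
    """Count C, G, CG and CCG occurrences in one genic segment.

    Neither CG nor CCG can overlap a copy of itself, so str.count,
    which counts non-overlapping matches, finds every occurrence.
    """
    s = segment.upper()
    return {
        "length": len(s),
        "C": s.count("C"),
        "G": s.count("G"),
        "CG": s.count("CG"),    # assumption: includes the CG inside each CCG
        "CCG": s.count("CCG"),
    }


# Toy sequences standing in for real genic segments.
segments = {"5'UTR": "ACCGTTCGA", "3'UTR": "TTATGCAT"}
counts = {name: count_motifs(seq) for name, seq in segments.items()}
```

Running the function once per segment yields the per-segment tallies from which circle size and color can later be derived.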

Figure 1. Sketch depicting the approach used to quantify and visualize CG and CCG frequency distribution relative to total C+G content across elements of the VRN1 gene. The C+G content relative to sequence length for each segment (upstream, 5'UTR, gene body, 3'UTR and downstream), as well as the frequency of CG and CCG occurrences, is represented proportionally by the color intensity and the diameter of each circle.

My results indicated that 5'UTRs had the highest C+G content relative to sequence length when compared to the other genic segments. Similarly, the frequency of CG and CCG contexts relative to sequence length was also higher at the 5'UTR than in the other segments. The genotypic variation in C+G content, as well as in the frequency distribution of CGs and CCGs among orthologous gene copies, is evident in the intensity of color and the size of the circles, in particular for the 5'UTR region of Brast02g311100 and the 5'UTR and 3'UTR regions of the maize GRMZM2G032339 gene copies.

Brachypodium stacei and maize have undergone whole genome duplication events, and the genotypic variation in C+G, CG and CCG content is evident for the homeologous copies. Interestingly, CCG frequency at the 5'UTRs of VRN1 orthologous gene copies appeared to be higher in grasses that require vernalization (5.4% for B. distachyon and 6.6% and 7.7% for B. stacei genes) than in those that do not require vernalization in order to flower (3.1% for rice and 3.9% and 4.8% for maize genes).

Looking at VRN1 orthologous gene copies solely in terms of C+G content gave me a new perspective on the evolutionary aspect of genotypic variation at the 5'UTR region of genes. It would be interesting to compare my results with comparative epigenomics studies that quantify the level of methylated Cs at the 5'UTR of VRN1 orthologs.

Figure 2. Visualization of C+G content and of CG and CCG frequency distribution across genic segments from VRN1 orthologous gene copies. The size and color intensity of the concentric circles are proportional to C+G content relative to total sequence length for the segment, and to the CG/CCG ratio relative to C+G content, respectively.

Sonifying the distribution pattern of CG and CCG sequence contexts at 5'UTR and 3'UTR regions of VRN1 orthologs

I was interested in the 'perceptualization' of genome data by rendering functional sounds associated with the occurrences of CGs and CCGs within the 5'UTR and 3'UTR regions of VRN1 orthologous gene copies. I approached the topic by experimenting with different oscillator objects and envelopes for the synthesis of sound, as well as with different musical notes and amplitudes. I found that, for the purpose of this study, a triangle waveform provided the most interesting auditory experience, making CG and CCG occurrences perceptually distinct from the background sound. I then created audio compositions in which CG and CCG occurrences across all VRN1 genes were audible (Figure 3).
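For readers unfamiliar with the waveform, a triangle wave can be generated sample by sample as in the Python sketch below. The original work relied on the oscillator objects of Processing's sound library; the functions triangle_sample and triangle_wave here are hypothetical stand-ins that only illustrate the shape of the signal.

```python
def triangle_sample(t: float, freq_hz: float, amplitude: float = 1.0) -> float:
    """Value of a triangle wave at time t (seconds), within +/- amplitude."""
    phase = (t * freq_hz) % 1.0  # position within the current cycle, 0..1
    # Falls linearly from +1 to -1 over the first half cycle, then rises back.
    return amplitude * (4.0 * abs(phase - 0.5) - 1.0)


def triangle_wave(freq_hz: float, duration_s: float,
                  sample_rate: int = 44100, amplitude: float = 1.0) -> list:
    """Return a list of samples for a triangle tone of the given length."""
    n_samples = int(duration_s * sample_rate)
    return [triangle_sample(n / sample_rate, freq_hz, amplitude)
            for n in range(n_samples)]
```

Compared with a sine wave, the linear ramps add odd harmonics, which may be part of why the tone stands out against a background note.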

Figure 3. Approach used to sonify the CG and CCG distribution pattern within the 5'UTRs and 3'UTRs of VRN1 orthologous genes. In order to distinguish CG and CCG occurrences in vernalization-responsive grasses from those in non-vernalized grasses, I assigned a different musical note to each gene copy, with lower-frequency sounds representing B. distachyon and B. stacei and higher-frequency sounds representing rice and maize. Nucleotides that are not part of CCG/CG sequence contexts (A, C, G, T) were assigned the musical note DO for all gene copies regardless of the species. Thus, the sound associated with DO serves as a background against which the other notes represent the occurrences of CCGs/CGs, with the highest amplitude/volume assigned to CCGs/CGs from the rice VRN1 gene copy.
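The note assignment described in Figure 3 can be sketched in Python as below. The function name sequence_to_notes and the example note MI are hypothetical, and the sketch assumes that every nucleotide covered by a motif plays the gene's note; the text does not say whether the note sounds once per motif or once per nucleotide.

```python
def sequence_to_notes(seq: str, motif: str, motif_note: str,
                      background_note: str = "DO") -> list:
    """Return one note name per nucleotide of seq.

    Nucleotides outside the motif play the background note; nucleotides
    inside any occurrence of the motif play the note chosen for this gene.
    """
    s = seq.upper()
    notes = [background_note] * len(s)
    start = s.find(motif)
    while start != -1:
        for i in range(start, start + len(motif)):
            notes[i] = motif_note
        start = s.find(motif, start + 1)
    return notes
```

For the toy sequence ACGTCCGA, sonifying CG with the note MI gives DO MI MI DO DO MI MI DO, so the two CG occurrences stand out from the DO background.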

The following audio compositions resemble music played on an organ. The sound is true to the data, and all audible variation is based exclusively on differences in CG and CCG occurrence between species.

Addressing genome sequence variation through sound gave me a different perception of the architectural composition of a gene.

Creation of responsive artwork as cultural artifact that integrates genome data visualization and sonification

In order to communicate knowledge in artistic form, I worked towards the creation of an artwork that could integrate data in visual and auditory forms. For this purpose, I adapted code previously developed by Casey Reas, Chandler McWilliams and LUST (2010), using Figure 2 from my work as raw material. I integrated this code with new code that took advantage of Processing's sound library, which allowed me to control visual aspects of the painting based on the amplitude/volume level of the auditory composition I had previously created to sonify CCG occurrences at the 3'UTR region of the VRN1 gene. I was interested in using image extrusion (creating depth from 2D images) in order to make projections along the z-axis reactive to sound (amplitude in this particular case) as a means to create dimensional surfaces. The result is presented as a series of artworks:

The colors in the paintings depend on the amplitude of the sound.
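One simple way to make extrusion react to sound is to scale each pixel's depth by both its brightness and the current amplitude. The Python sketch below illustrates that idea only; it is not the code adapted from Reas, McWilliams and LUST, and the function name extrude, the 0 to 255 brightness range and the max_depth parameter are assumptions.

```python
def extrude(brightness_rows, amplitude: float, max_depth: float = 100.0):
    """Compute a z-offset for every pixel of a grayscale image.

    brightness_rows: rows of brightness values in 0..255 (assumed range).
    amplitude: current sound level in 0..1, e.g. from an amplitude analyzer.
    Brighter pixels and louder sound both push the surface further along z.
    """
    level = max(0.0, min(1.0, amplitude))  # clamp to the expected range
    return [[b * level * max_depth / 255.0 for b in row]
            for row in brightness_rows]


# A 2 x 2 toy image at half volume.
depths = extrude([[0, 255], [51, 102]], amplitude=0.5)
```

Recomputing the offsets on every animation frame makes the surface rise and fall with the audio composition.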

From this work I conclude that an artistic approach to knowledge generation in comparative genomics is plausible because of the novel approaches that new media art and creative coding can put at the service of science. At the same time, genome data can serve as raw material for the creation of art. I have approached this work as an artist with a scientific mind, and thus my product is the creation of art and its communication to society. The creation of functional sound with the objective of communicating genotypic variation was inspired in part by reading about the Italian futurist painter Luigi Russolo, who in 1913 developed an art based on mechanically produced noise to break with the musical sounds codified by tradition. In this manner, Russolo introduced the concept of the 'noise network'. At about the same time, and also in Italy, the futurist artist Balilla Pratella advocated a 'musical poetry' based on the glorification of the machine. The audio compositions that I presented in this work could then be considered musical poetry based on the glorification of the gene.


Genomic sequences were obtained from Phytozome, and the occurrence of CGs and CCGs was determined for each segment: upstream (-2000 bp), 5'UTR, gene body (exons + introns), 3'UTR, and downstream (+2000 bp). The frequency distribution of CGs and CCGs was analyzed in relation to the C + G content of each segment, and the following ratios were used as data for the visualization and sonification procedures: %CGs = (CG/(C+G))*100 and %CCGs = (CCG/(C+G))*100.
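The two ratios can be computed as in the following Python sketch. The function name percent_of_cg_content is hypothetical, and returning 0.0 for a segment with no C or G is an assumption made here to avoid division by zero.

```python
def percent_of_cg_content(motif_count: int, c_count: int, g_count: int) -> float:
    """Motif occurrences as a percentage of the segment's C + G count.

    Implements %motif = (motif / (C + G)) * 100 for CG or CCG.
    """
    total = c_count + g_count
    if total == 0:
        return 0.0  # assumption: a segment without C or G contributes 0%
    return 100.0 * motif_count / total


# Toy segment ACCGTTCGA: C = 3, G = 2, CG = 2, CCG = 1.
pct_cg = percent_of_cg_content(2, 3, 2)    # 40.0
pct_ccg = percent_of_cg_content(1, 3, 2)   # 20.0
```

Note the parentheses around C + G: the motif count is divided by the summed count of both nucleotides, not by C alone.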

Quantification and visualization of C+G, CG and CCG occurrences were performed programmatically with an algorithm written in the Processing programming language.

Sonification of CG and CCG occurrences was performed algorithmically using Processing's sound library, whereas recording of the audio compositions was performed with code that made use of the Minim library.


Bewick et al. (2016). On the origin and evolutionary consequences of gene body DNA methylation. PNAS (113): 32

Eichten et al. (2016