Help Index

  • Analyses of the information content of DNA through sonification
  • Algorithms
  • Midi instruments

  • Analyses of the information content of DNA through sonification

    Introduction

    DNA sonification refers to the use of audio to convey the information content of DNA sequence data. It provides an interesting adjunct to standard visualization of DNA sequence data. The 4 bases (namely G, A, T and C) that make up the DNA sequence are processed from left to right in a linear fashion. To achieve this, a dynamic web tool has been created in which DNA sequences are processed to produce audio output.

    Recently there has been much interest in DNA sonification in light of advancements in DNA sequencing technology and the benefits thereof. Gene coding regions of the genome are essentially highly ordered sequences of DNA, whereby the genetic code relates the coding sequence of DNA to the amino acid residues of a protein. However, much of the DNA outside of gene coding regions has a lower sequence complexity according to our current knowledge base.

    Two vastly different approaches have previously been taken to sonify DNA, to achieve outcomes pertinent to either the arts or the sciences. One approach essentially treats DNA as a random sequence for the purpose of generative music synthesis, whereas the other assumes a non-random sequence and therefore takes into account basic chemical or biological properties during sonification. We have focused on the latter approach in this work.

    From a scientific perspective, the basic challenge of DNA sonification is to use audio cues to distinguish a highly ordered gene coding region from a region of low complexity. Towards achieving this, various algorithms have been established to map groups of nucleotide bases (motifs) to musical notes. In the most rudimentary algorithm, each of the 4 individual nucleotide bases (G, A, T or C) is considered to be a motif and is mapped to one of four musical notes. However, given the complexity of DNA sequences this mapping is ineffectual and is included only for the sake of completeness.

    The consideration of pairs of nucleotides as motifs provides for 16 notes and again does not do justice to the complexity of most DNA sequences. The most useful approach is to mirror the genetic code and treat each group of three nucleotide bases (a codon) as a motif to map to a note. In theory a total of 64 codons exist; however, in biology these typically give rise to only 20 amino acid residues of proteins. This approach of note assignment could clearly be extended to map larger groupings of nucleotides to an ever increasing range of notes; for instance, 4 or 5 nucleotide motifs could theoretically be mapped to 256 or 1024 notes, respectively. Whilst this has no basis in biology it is an interesting proposition for generative music aficionados. Given a typical hearing range and the number of discrete notes on musical instruments, this provides for more notes than can be sounded. One solution could be to map these motifs to micro-tonal scales using intervals smaller than semitones; however, this approach was not pursued at this stage.

    Motif               Number of motifs   Motif identifier (Motif ID)
    Single-nucleotide   4 (4 x 1)          G => Motif ID-1, A => Motif ID-2, T => Motif ID-3, C => Motif ID-4
    Di-nucleotides      16 (4 x 4)         GG => Motif ID-1, GA => Motif ID-2, GT => Motif ID-3, GC => Motif ID-4, AG => Motif ID-5, AA => Motif ID-6, AT => Motif ID-7, etc...
    Tri-nucleotides     64 (4 x 16)        GGG => Motif ID-1, GGA => Motif ID-2, GGT => Motif ID-3, GGC => Motif ID-4, GAG => Motif ID-5, GTG => Motif ID-6, GCT => Motif ID-7, etc...
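
    The numbering in the table can be sketched in a few lines of Python. This is only an illustration, not the code used by the web tool: it assumes motifs are enumerated in the base order G, A, T, C, which reproduces the single-nucleotide and di-nucleotide rows. The tri-nucleotide examples listed after Motif ID-5 do not follow this simple ordering, so the tool's actual numbering may differ. The function name motif_ids is hypothetical.

```python
from itertools import product

BASES = "GATC"  # base order taken from the single-nucleotide row of the table


def motif_ids(motif_length):
    """Return {motif: Motif ID} for every motif of the given length.

    Assumption: motifs are numbered from 1 in G, A, T, C order.
    """
    motifs = ("".join(p) for p in product(BASES, repeat=motif_length))
    return {motif: number for number, motif in enumerate(motifs, start=1)}


single = motif_ids(1)
pairs = motif_ids(2)
triplets = motif_ids(3)

print(single)                   # the 4 single-nucleotide motifs
print(pairs["AT"], len(pairs))  # Motif ID of AT and the number of pairs
print(len(triplets))            # number of tri-nucleotide motifs
```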

    Six DNA sonification algorithms have been scripted to associate a DNA motif with a specific motif identifier. The identifiers are then processed to produce a distinct mix of instrument and note identifiers to be assigned to musical notes. The motif identifiers are numbered from 1-4, 1-16 or 1-64 depending on the algorithm. They are further processed using additional parameters to establish a musical key, note intervals, note length, note timing and tempo. The notes are then assigned to an octave suitable for the selected instrument. All audio is generated dynamically and the audio output is streamed in real time.

    Irrespective of the algorithm used, Motif ID 1 is assigned to the root note of a musical key and the octave is set by the lower pitch range of the assigned musical instrument. These assignments are made using MIDI note numbers. For each instrument there are 128 MIDI note numbers (0-127), representing a range of just over 10 octaves. The interval between notes is governed by the scale used to sonify the motifs. For instance, the repeating semitone intervals of the natural minor scale (2, 1, 2, 2, 1, 2, 2) or the blues scale (3, 2, 1, 1, 3, 2) are used to assign sequential motif numbers to musical notes. Clearly the choice of key and scale determines the actual notes used in DNA sequence sonification.
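
    As a rough sketch of how sequential motif numbers could be walked up a scale, the Python below assigns Motif ID 1 to a root MIDI note and adds the repeating semitone intervals quoted above. The function name motif_to_midi and the choice to raise an error above note 127 are assumptions made for illustration; how the tool itself handles notes beyond the MIDI range is not described here.

```python
NATURAL_MINOR = (2, 1, 2, 2, 1, 2, 2)  # semitone steps, sums to 12
BLUES = (3, 2, 1, 1, 3, 2)             # semitone steps, sums to 12


def motif_to_midi(motif_id, root_note, intervals):
    """Map a motif ID (1 = root note) to a MIDI note number on a scale."""
    if motif_id < 1:
        raise ValueError("motif IDs start at 1")
    # Whole passes through the scale, then the remaining scale degrees.
    passes, degree = divmod(motif_id - 1, len(intervals))
    note = root_note + passes * sum(intervals) + sum(intervals[:degree])
    if not 0 <= note <= 127:
        raise ValueError(f"MIDI note {note} is outside the range 0-127")
    return note


minor_notes = [motif_to_midi(i, 57, NATURAL_MINOR) for i in range(1, 9)]
blues_notes = [motif_to_midi(i, 60, BLUES) for i in range(1, 8)]
print(minor_notes)  # [57, 59, 60, 62, 64, 65, 67, 69]
print(blues_notes)  # [60, 63, 65, 66, 67, 70, 72]
```

    With a root note of 57 (A) and the natural minor intervals, motif numbers 1 to 21 land on MIDI notes 57 to 91, the same numbers listed in the "Table to convert number to midi note" further down this page.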

    Whilst each of the algorithms produces an audio output with interesting characteristics, the most useful algorithm for DNA sequence analysis uses codons (motifs of three nucleotides) mapped to 21 musical notes. In this approach tri-nucleotides are processed in a way analogous to the biological rules of the genetic code (in which a codon consists of three consecutive bases coding for one of 20 amino acid building blocks of a protein). Each of the 64 possible codons is mapped to a musical note rather than to an amino acid: 20 notes stand in for the 20 amino acids and one further note is assigned to the Stop codons. Each of the three possible reading frames is mapped to a separate instrument. In the absence of further DNA sequence annotation to indicate the actual reading frame of the sequence, each reading frame (instrument) is voiced sequentially with equal bias.

    The information content of the DNA sequence was further sonified using two approaches. Firstly, Start and Stop codons were assigned loud and quiet volumes, respectively. This volume manipulation affects not only the specific codon but also the following notes for a period of time. This effectively silences a reading frame if a Stop codon occurs, or makes a reading frame containing a Start codon louder for a period of time. Secondly, specific DNA sequences trigger percussion instruments upon their detection in the sequence; this is applied to transcription factor binding motifs, promoter elements, Start codons and Stop codons (which also silence the frame). These methods are effective at distinguishing cDNA sequences from random DNA sequences, or AT rich DNA from GC rich DNA.
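
    The volume behaviour described above might be sketched as follows. Everything numeric here is an assumption for illustration: the velocity values (80 normal, 110 loud, 10 quiet) and the number of following notes affected (hold) are hypothetical, since the values used by the tool are not stated. The function name frame_velocities is also hypothetical.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def frame_velocities(codons, normal=80, loud=110, quiet=10, hold=4):
    """Return one MIDI velocity per codon for a single reading frame.

    A Start codon makes itself and the next `hold` notes loud; a Stop
    codon makes itself and the next `hold` notes quiet. All defaults
    are illustrative assumptions.
    """
    velocities = []
    level, remaining = normal, 0
    for codon in codons:
        codon = codon.upper()
        if codon == START_CODON:
            level, remaining = loud, hold
            velocities.append(level)
        elif codon in STOP_CODONS:
            level, remaining = quiet, hold
            velocities.append(level)
        elif remaining > 0:
            remaining -= 1          # still inside the affected period
            velocities.append(level)
        else:
            velocities.append(normal)
    return velocities


example = ["GCA", "ATG", "GCA", "GCA", "TAA", "GCA", "GCA", "GCA"]
print(frame_velocities(example, hold=2))
# [80, 110, 110, 110, 10, 10, 10, 80]
```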

     

    The human genome consists of approx. 3 billion base pairs (roughly 6 billion in a diploid cell).

    Consider approx. 1000 base pairs of DNA sequence:
    actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgt ggcaaaggtgcccttgaggttgtccaggtgagccaggccatcactaaaggcaccgagcac tttcttgccatgagccttcaccttagggttgcccataacagcatcaggagtggacagatc cccaaaggactcaaagaacctctgggtccaagggtagaccaccagcagcctaagggtggg aaaatagaccaataggcagagagagtcagtgcctatcagaaacccaagagtcttctctgt ctccacatgcccagtttctattggtctccttaaacctgtcttgtaaccttgataccaacc tgcccagggcctcaccaccaacttcatccacgttcaccttgccccacagggcagtaacgg cagacttctcctcaggagtcagatgcaccatggtgtctgtttgaggttgctagtgaacac agttgtgtcagaagcaaatgtaagcaatagatggctctgccctgacttttatgcccagcc ctggctcctgccctccctgctcctgggagtagattggccaaccctagggtgtggctccac agggtgaggtctaagtgatgacagccgtacctgtccttggctcttctggcactggcttag gagttggacttcaaaccctcagccctccctctaagatatatctcttggccccataccatc agtacaaattgctactaaaaacatcctcctttgcaagtgtatttacgtaatatttggaat cacagcttggtaagcatattgaagatcgttttcccaattttcttattacacaaataagaa gttgatgcactaaaagtggaagagttttgtctaccataattcagctttgggatatgtaga tggatctcttcctgcgtctccagaatatgcaaaatacttacaggacagaatggatgaaaa

    The above sequence contains a segment of the promoter region and coding region of the beta globin gene.

    Consider the first 60 individual bases of this sequence:
    actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgt

    In a biological context, the information content of this can be read in one of three reading frames according to the rules of the genetic code, whereby three nucleotide bases code for a specific amino acid residue in a protein.
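
    To make the three reading frames concrete, the short Python sketch below splits the 60 bases shown above into codons starting at the first, second and third base. The function name reading_frames is hypothetical, and incomplete codons at the end of a frame are simply dropped, which is an assumption rather than a description of the tool.

```python
FIRST_60 = "actcaccctgaagttctcaggatccacgtgcagcttgtcacagtgcagctcactcagtgt"


def reading_frames(sequence):
    """Split a DNA sequence into codons for each of the 3 reading frames."""
    seq = sequence.upper()
    frames = []
    for offset in range(3):
        # Step through the sequence three bases at a time; the stop value
        # len(seq) - 2 guarantees every slice is a complete codon.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames


frames = reading_frames(FIRST_60)
for number, codons in enumerate(frames, start=1):
    print(f"Frame {number}: {len(codons)} codons, starts {' '.join(codons[:4])}")
```

    In this fragment the third reading frame has TGA as its third codon, which is one of the Stop codons.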

    Only one of these reading frames is processed by the cell to make a protein. This is determined by recognition of landmarks or motifs in the sequence, such as an in-frame "atg" start codon, or stop codons such as "tga" that mark the end of the coding sequence. In addition, other motifs such as 5'-tataaa-3' mark protein binding sites approximately 25 base pairs upstream of the transcription start.

    A biological relationship (referred to as the genetic code) exists to convert each of the 64 codons to a specific amino acid residue (through the biological processes of transcription and translation). The table below also includes an arbitrary association to a number, used to reference a musical note in the MIDI file.

    Table to convert each of the 64 codons to a specific amino acid residue (Mt* = methionine, the Start codon; ST* = Stop codon)

    Codon number Codon Amino acid Note reference
    1 GCA Ala 1
    2 GCC Ala 1
    3 GCG Ala 1
    4 GCT Ala 1
    5 AGA Arg 2
    6 AGG Arg 2
    7 CGA Arg 2
    8 CGC Arg 2
    9 CGG Arg 2
    10 CGT Arg 2
    11 AAC Asn 3
    12 AAT Asn 3
    13 GAC Asp 4
    14 GAT Asp 4
    15 TGC Cys 5
    16 TGT Cys 5
    17 CAA Gln 6
    18 CAG Gln 6
    19 GAA Glu 7
    20 GAG Glu 7
    21 GGA Gly 8
    22 GGC Gly 8
    23 GGG Gly 8
    24 GGT Gly 8
    25 CAC His 9
    26 CAT His 9
    27 ATA Ile 10
    28 ATC Ile 10
    29 ATT Ile 10
    30 CTA Leu 11
    31 CTC Leu 11
    32 CTG Leu 11
    33 CTT Leu 11
    34 TTA Leu 11
    35 TTG Leu 11
    36 AAA Lys 12
    37 AAG Lys 12
    38 ATG Mt* 13
    39 TTC Phe 14
    40 TTT Phe 14
    41 CCA Pro 15
    42 CCC Pro 15
    43 CCG Pro 15
    44 CCT Pro 15
    45 AGC Ser 16
    46 AGT Ser 16
    47 TCA Ser 16
    48 TCC Ser 16
    49 TCG Ser 16
    50 TCT Ser 16
    51 TAA ST* 17
    52 TAG ST* 17
    53 TGA ST* 17
    54 ACA Thr 18
    55 ACC Thr 18
    56 ACG Thr 18
    57 ACT Thr 18
    58 TGG Trp 19
    59 TAC Tyr 20
    60 TAT Tyr 20
    61 GTA Val 21
    62 GTC Val 21
    63 GTG Val 21
    64 GTT Val 21
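
    The table above can be held as a simple lookup. The Python below is a sketch that copies the codon, amino acid and note reference columns from the table; "Met" and "Stop" are used here as labels for the Mt* and ST* rows, and the names NOTE_TABLE, CODON_TO_NOTE and note_references are hypothetical.

```python
# (amino acid, note reference, codons) copied from the table above.
NOTE_TABLE = [
    ("Ala", 1, "GCA GCC GCG GCT"),
    ("Arg", 2, "AGA AGG CGA CGC CGG CGT"),
    ("Asn", 3, "AAC AAT"),
    ("Asp", 4, "GAC GAT"),
    ("Cys", 5, "TGC TGT"),
    ("Gln", 6, "CAA CAG"),
    ("Glu", 7, "GAA GAG"),
    ("Gly", 8, "GGA GGC GGG GGT"),
    ("His", 9, "CAC CAT"),
    ("Ile", 10, "ATA ATC ATT"),
    ("Leu", 11, "CTA CTC CTG CTT TTA TTG"),
    ("Lys", 12, "AAA AAG"),
    ("Met", 13, "ATG"),
    ("Phe", 14, "TTC TTT"),
    ("Pro", 15, "CCA CCC CCG CCT"),
    ("Ser", 16, "AGC AGT TCA TCC TCG TCT"),
    ("Stop", 17, "TAA TAG TGA"),
    ("Thr", 18, "ACA ACC ACG ACT"),
    ("Trp", 19, "TGG"),
    ("Tyr", 20, "TAC TAT"),
    ("Val", 21, "GTA GTC GTG GTT"),
]

# Flatten to {codon: (amino acid, note reference)} for direct lookup.
CODON_TO_NOTE = {
    codon: (amino_acid, reference)
    for amino_acid, reference, codons in NOTE_TABLE
    for codon in codons.split()
}


def note_references(codons):
    """Return the note reference (1-21) for each codon."""
    return [CODON_TO_NOTE[codon.upper()][1] for codon in codons]


print(len(CODON_TO_NOTE))                            # 64
print(note_references(["act", "cac", "cct", "gaa"]))  # [18, 9, 15, 7]
```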

    Table to convert number to midi note (C scale)

    Note reference (codon2number)   Note (three octaves)   MIDI note number
    1                               A                      57
    2                               B                      59
    3                               C                      60
    4                               D                      62
    5                               E                      64
    6                               F                      65
    7                               G                      67
    8                               A                      69
    9                               B                      71
    10                              C                      72
    11                              D                      74
    12                              E                      76
    13                              F                      77
    14                              G                      79
    15                              A                      81
    16                              B                      83
    17                              C                      84
    18                              D                      86
    19                              E                      88
    20                              F                      89
    21                              G                      91

    Midi note numbers

    The following table lists the numbers corresponding to notes for use in note on and note off commands in the MIDI file.

    Octave #   C     C#    D     D#    E     F     F#    G     G#    A     A#    B
    0          0     1     2     3     4     5     6     7     8     9     10    11
    1          12    13    14    15    16    17    18    19    20    21    22    23
    2          24    25    26    27    28    29    30    31    32    33    34    35
    3          36    37    38    39    40    41    42    43    44    45    46    47
    4          48    49    50    51    52    53    54    55    56    57    58    59
    5          60    61    62    63    64    65    66    67    68    69    70    71
    6          72    73    74    75    76    77    78    79    80    81    82    83
    7          84    85    86    87    88    89    90    91    92    93    94    95
    8          96    97    98    99    100   101   102   103   104   105   106   107
    9          108   109   110   111   112   113   114   115   116   117   118   119
    10         120   121   122   123   124   125   126   127
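
    The table follows a simple rule: each octave row holds 12 consecutive numbers, so the octave is the note number divided by 12 and the note name is the remainder. The Python below sketches this using the octave numbering of the table above (note 60 in octave 5); other software may label octaves differently. The function names are hypothetical.

```python
NOTE_NAMES = ("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")


def describe_midi_note(number):
    """Return (octave, note name) for a MIDI note number, as in the table."""
    if not 0 <= number <= 127:
        raise ValueError("MIDI note numbers run from 0 to 127")
    octave, index = divmod(number, 12)
    return octave, NOTE_NAMES[index]


def midi_number(octave, name):
    """Return the MIDI note number for an octave row and note column."""
    number = octave * 12 + NOTE_NAMES.index(name)
    if not 0 <= number <= 127:
        raise ValueError("no such entry in the table")
    return number


print(describe_midi_note(60))   # (5, 'C')
print(describe_midi_note(57))   # (4, 'A')
print(midi_number(10, "G"))     # 127
```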


    Written by Mark Temple, School of Science and Health, Western Sydney University