# Gene assembly in Ciliates

Post-publication activity

Curator: Grzegorz Rozenberg

Gene Assembly in Ciliates refers to the process of transforming the micronuclear genome into the macronuclear genome during the sexual activity of conjugation in ciliates (single-celled eukaryotes). The special structure of the micronuclear genes closely resembles the linked list data structure from computer science, and the DNA rearrangements performed by ciliates during gene assembly have a strong computational appeal. This article is concerned with models of gene assembly in ciliates. A rich literature exists on this topic, see (Ehrenfeucht et al, 2003) for a monograph, and (Brijder et al, 2009) for a recent review.

## Elements of biology of gene assembly in ciliates

Protozoa are unicellular eukaryotes of striking complexity. They have extensive sensory capabilities including, in various species, the ability to sense light, motion, chemical and magnetic gradients. Their locomotion is mostly through swimming. Many protozoans are active hunters.

The ciliated protozoa (phylum Ciliophora) are a particularly interesting group defined by two major features. On one hand, they all possess cilia, used for locomotion and for creating currents of water driving nutrients towards their oral cavities. On the other hand, as a unique feature, ciliates contain two functionally different types of nuclei: Macronucleus (abbreviated as MAC) and Micronucleus (abbreviated as MIC). Each of the two types of nuclei may be present in multiple copies, depending on the species. The macronuclei are the somatic nuclei, responsible for producing the RNA transcripts necessary throughout the life cycle of the cell. The micronuclei are the germline nuclei; throughout the vegetative cycle, they remain dormant. Micronuclei get activated only during the sexual process of conjugation and even then, micronuclear genes are not expressed.

Conjugation is a non-reproductive sexual activity in ciliates. Its goal is to incorporate into the organism new DNA from the conjugation partner. Conjugation may be triggered as a response to environmental stress, such as prolonged lack of nutrients. In this case, the ciliate cell meiotically divides its diploid MIC into four haploid MICs. It then exchanges a haploid MICs with a conjugation partner through a cytoplasmic bridge that the two form. The cell dissociates from its partner, and combines one of its own haploid MICs with the copy received from its partner to form a novel, fused diploid MIC. The cell destroys its old MICs and MACs, as well as the remaining haploid MICs, while the new MIC divides into two MICs. One of them will be the new MIC that will further divide to reach the necessary number of copies, according to the species. The other one will develop into the new MAC.

This article focuses on the subclass Stichotrichia of the phylum Ciliphora, where the differences between the structures of the micronuclear and the macronuclear genomes are especially pronounced.

### Chromosome organization

As far as the chromosome organization is concerned, the micronucleus is diploid: each chromosome is present in it in two copies. Moreover, chromosomes may be millions of base pairs long, and the genes account for only a small part, around 3%, of their nucleotide sequences (as is the case for chromosomes of most eukaryotes). The macronucleus however contains each chromosome in large number of copies (typically more than 1000 copies in stichotrichs). In contrast to micronuclear chromosomes, most of each macronuclear chromosome (about 85%) is made of genetic material, and each of the chrosomes contains only one gene (sometimes 2 and only very rarely 3).

### Gene organization

Also on a local level, the structure of the genes is dramatically different in MIC and in MAC. Each MIC gene is composed of two sorts of alternating segments: Macronuclear Destined Sequences (MDSs in short) and Internally Eliminated Sequences (IESs in short). The MDSs are eventually spliced together during the transformation of a MIC into a MAC to form the MAC gene, while the IESs are non-genetic sequences that are eliminated in the process. Each MDS has a very special structure: it ends with a short nucleotide sequence called outgoing pointer that occurs also in the beginning of the next MDS, for which it is an incoming pointer. Thus, each MDS has two pointers, one incoming and one outgoing, with two exceptions. One is the MDS that will be eventually placed at the beginning of the macronuclear gene, which does not have an incoming pointer, and the other is the MDS that will end the macronuclear gene, which does not have an outgoing pointer. For the ease of the analysis, we consider that the first MDS of the macronuclear gene starts with a beginning marker, while last MDS of the macronuclear gene ends with an ending marker. In some ciliates, e.g. in stichotrichs, the MDSs may occur in the micronuclear gene in a scrambled order relative to the orthodox order in which they occur in the MAC gene. Moreover, some of the MDSs may also be inverted (with respect to their occurrence in the MAC gene).

During gene assembly ciliates identify all MDSs and splice them together on their common pointers in the orthodox order, to yield the transcription-able MAC gene. This process is fascinating both from the point of view of biology and from the point of view of information processing. As a matter of fact, gene assembly is the most complex instance of DNA processing currently known in nature. An amazing feature of the organization of the MIC genes is that they are essentially a linked list data structure from Computer Science. The formal studies of gene assembly as computations consist of: (i) formulating mathematical models for the structure of genes and their processing throughout gene assembly, and (ii) the analysis of these models in an effort to gain a better understanding of the gene assembly process from many points of view, such as various possible strategies of assembly, the by-products of the process, its complexity, etc.

## The intermolecular model for gene assembly

The first molecular model for gene assembly was proposed in (Landweber, Kari, 1999). The model consists of three molecular operations of DNA recombination on pointers, see Figure 1, Figure 2, Figure 3 for illustrations of each operation.

Figure 1: The first operation of the intermolecular model for gene assembly. Author of the illustration: V.Rogojin

Given a linear DNA molecule z of the form uxvxw, where u,x,v,w denote sequences of base pairs of nucleotides, the first operation is a DNA recombination of z on x, having as a result two DNA molecules. One of them is a linear DNA molecule: uxw. The other is a circular molecule: [vx], where we use the brackets as a notation indicating a circular molecule.

Figure 2: The second operation of the intermolecular model for gene assembly. Author of the illustration: V.Rogojin

The second operation of the model is the inverse of the first one. The input to the operation is made of two DNA molecules. The first one is a linear DNA molecule of the form uxw, where u,x,w denote sequences of base pairs of nucleotides. The second one is a circular DNA molecule of the form [vx], where v is a sequence of base pairs of nucleotides. The result of applying the second operation to this input is the linear DNA molecule uxvxw.

Figure 3: The third operation of the intermolecular model for gene assembly. Author of the illustration: V.Rogojin

The third operation of the model involves two linear DNA molecules of the form uxv and rxt, where u,x,v,r,t denote sequences of base pairs of nucleotides. The result of applying this operation to these two molecules consists of two linear molecules: uxt and rxv.

The idea in all three operations is that x is a pointer and by applying the operation, two MDSs get spliced together to form a longer, composite MDS. The second and the third operations of the model are intermolecular, in the sense that they involve the recombination of two separate molecules. For this reason, the whole model is generally referred to as the intermolecular model for gene assembly.

A micronuclear gene gets transformed into its macronuclear counterpart through iterative applications of the three operations. One of the characteristics of the intermolecular model is that it is reversible: the result of applying an operation can be reversed by applying another operation of the model on the same pointer sequence.

Note that inverted pointer sequences are not explicitly dealt with in the definition of the intermolecular model. However, disentangling inverted MDSs can be explained in the model by assuming that two copies of the initial linear micronuclear gene are available, and that any double stranded linear DNA molecule is available also in its inverted form, see (Harju et al, 2004).

## The intramolecular model for gene assembly

A different molecular model for gene assembly was introduced in (Prescott et all, 2001) and (Ehrenfeucht et al, 2001). It consists of three molecular operations, all taking as an input one single (linear) DNA molecule, see Figure 4, Figure 3, Figure 6 for illustrations of each operation. For this reason, the model is referred to as the intramolecular model for gene assembly.

Figure 4: The ld operation of the intramolecular model for gene assembly. Author of the illustration: V.Rogojin

The first operation is called the loop, direct-repeat excision, or in short, ld. The ld operation is applicable to a DNA molecule having two identical (non-overlapping) occurrences (direct repeat) of a pointer, i.e., to a molecule of the form upxpv, with u,x,v denoting sequence of base pairs of nucleotides, and p denoting a pointer. The molecule is then folding into a loop aligned on the two occurrences of pointer p, thus facilitating DNA recombination on p. The result of the operation is the linear molecule upv and the circular molecule [xp].

Figure 5: The hi operation of the intramolecular model for gene assembly. Author of the illustration: V.Rogojin

The second operation is called hairpin, inverted-repeat excision/reinsertion, or in short, hi. The hi operation is applicable to a DNA molecule having two (non-overlapping) occurrences of a pointer such that one occurrence is an inversion of the other one, i.e., to a molecule of the form $$upx\overline{p}v\ ,$$ where $$\{p,\overline{p}\}$$ are a pointer and its inverse, and $$u,x,v$$ denote sequences of base pairs of nucleotides. The pointer is said in this case to have an inverted repeat. The molecule folds into a hairpin so that the two occurrences of pointer $$p$$ are aligned and recombination is facilitated. The result is the DNA molecule $$u\,p\,\overline{x}\,\overline{p}\,v\ .$$

Figure 6: The DLAD operation of the intramolecular model for gene assembly. Author of the illustration: V.Rogojin

The third operation is called double loop, alternating direct-repeat excision-reinsertion, or in short, dlad. The dlad operation is applicable to a DNA molecule having two pointers, each with a direct repeat, coming in an alternate sequence, i.e., to a molecule of the form $$u_1pu_2qu_3pu_4qu_5\ ,$$ where $$p,q$$ are pointers and $$u_1,u_2,u_3,u_4,u_5$$ denote sequences of base pairs of nucleotides, with $$p,q$$ pointers. The molecule folds into a double loop so that both occurrences of $$p$$ and both occurrences of $$q$$ are simultaneously aligned, facilitating a double DNA recombination. (Equivalently, the fold may also be seen as simple loop aligned simultaneously in two different places.) The result of the operation is the DNA molecule $$u_1pu_4qu_3pu_2qu_5\ .$$

Figure 7: An animation of the LD operation of the intramolecular model for gene assembly. Author of the illustration: I.Petre and V.Rogojin
Figure 8: An animation of the HI operation of the intramolecular model for gene assembly. Author of the illustration: I.Petre and V.Rogojin
Figure 9: An animation of the DLAD operation of the intramolecular model for gene assembly. Author of the illustration: I.Petre and V.Rogojin

It is important to note that all three operations of the intramolecular model only take one molecule as an input. Consequently, for a sequence of the three operations to lead to an assembled gene, each application of the ld operation has to be simple: the excised circular molecule may contain one IES only, and no MDS. Otherwise, the excised MDSs can never be reintegrated into the assembled gene. The only exception is in the last step of an assembly, when the whole assembled gene may be placed on a circular molecule, see (Ehrenfeucht et al, 2001b) for more details.

## Template-guided recombination

An important research topic is the discovery of the physical mechanism governing molecular operations that actually implements gene assembly, i.e., it splices the MDSs of a MIC gene to produce its MAC version. The model briefly discussed in this section is based on the idea that templates are available -- these templates allow to identify MDSs in a MIC gene, and they determine the order in which these MDSs must be spliced. The template-guided recombination is a 3-way process involving two molecules to be recombined (they can be part of the same, longer molecule) and the template which by containing an already recombined molecule shows the way the recombination should be done. The main features of a template-guided recombination model are:

• the irreversibility,
• self-propagation of the template, i.e., the template is preserved/reconstructed,
• the configuration required for recombination is not sequence-specific.

The model of template-guided recombination was proposed in (Prescott et al, 2001). In this model, the template is a double stranded DNA molecule. This template could for example be a copy of the assembled gene from the old pre-conjugation MAC of the ciliate. The template is aligned in-between the two double strands of DNA to be recombined. The recombined molecule contains physical parts of the template, but the template itself is reconstructed during the process.

A modification of this model was proposed in (Angeleska et al, 2007). Here the template is a double-stranded RNA molecule, rather than a double-stranded DNA molecule. Also, the template is not positioned in-between the two molecules to be recombined, but rather above the two molecules (as a sort of a roof). In this model, no parts of the (RNA) template are incorporated in the recombined molecule: the RNA template detaches itself after it played its templating role. In the same paper, also a model using a single-stranded RNA molecule is presented.

Experimental evidence supporting the template-guided recombination in O.trifallax (also called S.histriomuscorum) is presented in (Nowacki et al, 2008).

## Modeling gene assembly through sorting permutations

### Genes as signed permutations

In modeling gene assembly one may focus on the sequence of MDSs only and ignore the transformations and the excisions of IESs. In this way, a micronuclear gene, its macronuclear form, and all the intermediate forms may be represented by the sequence of their MDSs. Since some of the MDSs may be inverted, one deals in this way with signed permutations.

The first MDS of the macronuclear gene is denoted by 1, the second MDS is denoted by 2, etc. The inverse of an MDS denoted by i, for some positive integer i, is denoted by $$\overline{i}\ .$$ A micronuclear or an intermediate gene is then formalized as a signed permutation through the sequence of its MDSs. The assembled gene is represented as a sorted permutation, i.e., either as the permutation $$12\ldots n$$ or as the permutation $$\overline{n}\ldots\overline{2}\,\overline{1}\ .$$

### Gene assembly as a sorting of signed permutations

With the genes represented as signed permutations over the alphabet $$\Pi_n=\{1,2,\ldots,n\}\ ,$$ for some $$n\geq 1\ ,$$ the operations of the intramolecular model are formalized as rewriting rules on signed permutations. Of the three operations of the intramolecular model, ld is however not explicitly modeled. The idea is that two consecutive MDSs that were either next to each to other, or were placed next to each other through the application of hi or dlad, will at some unspecified point be spliced together through an ld operation. The application of that ld can be done at any point of the assembly, independently of the other operations applied in the process. The remaining operations hi and dlad are formalized as follows.

• For each $$1\leq p < n\ ,$$ $$\mathsf{hi}_p$$ is defined as follows:
• $$\mathsf{hi}_p(x p y (\overline{p+1}) z) = x p (p+1) \overline{y} z\ ,$$
• $$\mathsf{hi}_p(x \overline{p} y (p+1) z) = x \overline{y} p (p+1) z\ ,$$
• $$\mathsf{hi}_p(x (p+1) y \overline{p} z) = x \overline{y} (\overline{p+1}) \overline{p} z\ ,$$
• $$\mathsf{hi}_p(x \overline{(p+1)} y p z) = x (\overline{p+1}) \overline{p}\, \overline{y} z\ ,$$

where $$x,y,z$$ are signed strings over $$\Pi_n\ .$$

• For each $$1 \leq p,q < n\ ,$$ where $$|p-q|>1\ ,$$ $$\mathsf{dlad}_{p,q}$$ is defined as follows:
• $$\mathsf{dlad}_{p,q}(x p'' u q'' v p' w q' z) = x w q' q'' v p' p'' u z\ ,$$
• $$\mathsf{dlad}_{p,q}(x p'' u q' v p' w q'' z) = x w v p' p'' u q' q'' z\ ,$$
• $$\mathsf{dlad}_{p,q}(x p' u q'' v p'' w q' z) = x p' p'' w q' q'' v u z\ ,$$
• $$\mathsf{dlad}_{p,q}(x p' u q' v p'' w q'' z) = x p' p'' w v u q' q'' z\ ,$$

where $$x, u, v, w, z$$ are signed strings over $$\Pi_n\ ,$$ and either $$p' = p\ ,$$ $$p'' = p+1\ ,$$ or $$p' = (\overline{p+1})\ ,$$ $$p'' = \overline{p}\ ,$$ and either $$q' = q\ ,$$ $$q'' = q+1\ ,$$ or $$q' = (\overline{q+1})\ ,$$ $$q'' = \overline{q}\ .$$ In all these case, we also denote $$\mathsf{dlad}_{q,p} = \mathsf{dlad}_{p,q}\ .$$

• For each $$1 < p < n\ ,$$ we define $$\mathsf{dlad}_{p-1,p}$$ as follows:
• $$\mathsf{dlad}_{p-1,p}(x p''' u p'' w p' z) = x w p' p'' p''' u z\ ,$$
• $$\mathsf{dlad}_{p-1,p}(x p'' v p' w p''' z) = x w v p' p'' p''' z\ ,$$
• $$\mathsf{dlad}_{p-1,p}(x p' u p''' v p'' z) = x p' p'' p''' v u z\ ,$$

where $$x, u, v, w, z$$ are signed strings over $$\Pi_n$$ and either $$p'=p-1\ ,$$ $$p''=p\ ,$$ $$p'''=p+1\ ,$$ or $$p'''=(\overline{p+1})\ ,$$ $$p''=\overline{p}\ ,$$ $$p' = (\overline{p-1})\ .$$ We denote $$\mathsf{dlad}_{p,p-1}=\mathsf{dlad}_{p-1,p}$$

The definition of both hi and dlad as rewriting rules on signed permutations required more details than the definitions of their corresponding molecular operations. The reason for this is that, while the molecular operations are defined in terms of pointer repeats and pointer overlaps, the formal operations are defined in terms of MDSs coming in a specific order and orientation. The multiple cases in the definitions of hi and dlad as rewriting rules on signed permutations come from the observation that different MDS sequences may have the same pointer sequences. This aspect is discussed in formal details in (Ehrenfeucht et al, 2003).

## Modeling gene assembly through string rewritings

### Genes as signed strings

On a higher level of abstraction, genes could also be represented through their sequence of pointers only. In this way, a gene is represented as a signed string where each letter (standing for a pointer) has exactly two occurrences (standing for the two occurrences of each pointer). The strings are signed: some of the letters could be inverted, accounting for the inversion of some of the pointers in the gene. The three operations of the model are then represented as string rewriting operations. On this level, much information about the gene is discarded: the nucleotide sequence of the MDSs and of the IESs, the markers, and the information about which pointer occurrences are incoming and which are outgoing. Remarkably though, the pointer sequence is enough as far as the investigations of strategies for gene assembly are concerned. Indeed, it has been proved in (Ehrenfeucht et al, 2003) that every assembly strategy seen as a sorting strategy of the signed permutation representing the gene corresponds to a successful rewriting of the corresponding string, as defined below. Conversely, each successful rewriting of the string associated to a micronuclear gene corresponds to a sorting strategy of the corresponding signed permutation.

The crucial observation in the modeling of gene assembly as a string rewriting process is that a pointer is defined as a nucleotide sequence that is placed at the edge of an MDS, and is adjacent to an IES. Once two MDSs are spliced together on their common pointer (as a result of applying one of the operations), the nucleotide sequence that used to act as a pointer is now placed in the middle of a composite MDS. Since the sequence is no longer adjacent to an IES, it no longer acts as a pointer. The second occurrence of the same pointer is placed in the middle of a composite IES. (In the case of an ld operation, a sequence which used to be an IES is connected by a pointer sequence into an excised circular molecule.) Since this sequence is no longer a part of an MDS, it stops acting as a pointer. Consequently, the gene assembly process can be seen as an iterative process of pointer elimination. The assembled gene does not contain pointers anymore, and therefore, it is represented as the empty string!

A signed string $$\pi$$ over the alphabet $$\Pi_n$$ is called a legal string if for every letter $$p$$ occurring in $$\pi\ ,$$ there are exactly two occurrences of letters from $$\{p,\overline{p}\}$$ in $$\pi\ .$$ Here we set $$\overline{\overline{p}}=p\ ,$$ for all $$p\in\Pi_n\cup\overline{\Pi}_n\ ,$$ where $$\overline{\Pi}_n$$ denotes the set $$\{\overline{p}\mid p\in\Pi_n\}\ .$$

### Gene assembly as a string rewriting process

The molecular operations ld, hi, and dlad are formalized as string rewriting operations as follows.

• The ld operation:
• $${\mathsf{ld}}_p(\pi_1pp\pi_2)=\pi_1\pi_2\ ,$$ for any $$p\in\Pi_n\cup\overline{\Pi_n}\ .$$
• The hi operation:
• $${\mathsf{hi}}_p(\pi_1p\pi_2\overline{p}\pi_3)=\pi_1\overline{\pi}_2\pi_3\ ,$$ for any $$p\in\Pi_n\cup\overline{\Pi_n}\ .$$
• $${\mathsf{dlad}}_{p,q}(\pi_1p\pi_2q\pi_3p\pi_4q\pi_5)=\pi_1\pi_4\pi_3\pi_2\pi_5\ ,$$ for any $$p,q\in\Pi_n\cup\overline{\Pi_n}\ .$$

For a legal string $$\pi\ ,$$ we say that a composition $$\phi$$ of ld, hi, and dlad string rewriting operations is successful for $$\pi$$ if $$\phi(\pi)=\Lambda\ ,$$ where $$\Lambda$$ denotes the empty string.

## Modeling gene assembly through graph rewritings

### Genes as signed graphs

On an even higher level of abstraction, genes could also be represented as a signed graph. The correspondence is done through the string representation: for each legal string $$\pi\ ,$$ one associates its signed overlap graph $$G_\pi\ ,$$ as follows. Let $$\pi$$ be a legal string over $$\Pi_n\ .$$ For each letter $$p\in \Pi_n\cup\overline{\Pi}_n$$ occurring in $$\pi\ ,$$ one considers a vertex labeled with $$p$$ in $$G_\pi\ .$$ The vertex is labeled with +, and we say that it is positive, if both $$p$$ and $$\overline{p}$$ have an occurrence in $$\pi\ .$$ Otherwise, if either $$p$$ is present twice in $$\pi\ ,$$ or $$\overline{p}$$ is present twice in $$\pi\ ,$$ then the vertex $$p$$ is labeled with -, and we say that it is negative. For two vertices $$p,q\ ,$$ $$G_\pi$$ has an edge between $$p$$ and $$q$$ if $$p,q$$ overlap in $$\pi\ ,$$ i.e., $$\pi=\pi_1 p_1 \pi_2 q_1 \pi_3 p_2 \pi_4 q_2 \pi_5\ ,$$ for some strings $$\pi_1,\ldots,\pi_5$$ and $$p_1,p_2\in\{p,\overline{p}\}\ ,$$ $$q_1,q_2\in\{q,\overline{q}\}\ .$$

Clearly, there is some loss of information when moving from the level of strings to that of graphs, in that the ordering of the pointers is not represented anymore. It has been shown in (Ehrenfeucht et al, 2003) that as far as the gene assembly strategies are concerned, there is almost no loss of information (as explained in the next section).

### Gene assembly as a graph rewriting process

The operations ld, hi, and dlad are formalized as three graph rewriting operations. Similarly as in the case of strings, the underlying concept is that pointers are eliminated throughout gene assembly, yielding in the end the empty graph, standing for the assembled gene.

For a (signed) graph $$G$$ and a vertex $$p\ ,$$ we denote by $$N_G(p)$$ the neighborhood of $$p$$ in $$G\ .$$

• The ld operation$\mathsf{ld}_p$ is applicable to the signed graph $$G$$ if the vertex labeled with $$p$$ is isolated and negative in $$G\ .$$ The result $$\mathsf{ld}_p(G)$$ is the graph obtained from $$G$$ by removing $$p\ .$$
• The hi operation$\mathsf{hi}_p$ is applicable to the signed graph $$G$$ if the vertex labeled with $$p$$ is positive in $$G\ .$$ The result $$\mathsf{hi}_p(G)$$ is the graph obtained from $$G$$ by:
• removing the vertex $$p$$ and
• inverting the subgraph induced in $$G$$ by the neighborhood of $$p$$ in $$G\ :$$ the signs of all vertices are switched, and the edge relationship is complemented (one adds an edge between two vertices if there was no edge between them and the other way around).
• The dlad operation$\mathsf{dlad}_{p,q}$ is applicable to the signed graph $$G$$ if $$p,q$$ are both negative and neighbors of each other in $$G\ .$$ The result $$\mathsf{dlad}_{p,q}(G)$$ is the graph obtained from $$G$$ as follows:
• remove $$p,q$$ from $$G\ ,$$
• complement the edge relationship between all vertices $$r,s$$ if:
• $$r\in (N_G(p)\setminus N_G(q))\cup(N_G(q)\setminus N_G(p))$$ and $$s\in N_G(p)\cap N_G(q)\ ,$$ or
• $$r\in N_G(p)\setminus N_G(q)$$ and $$s\in N_G(q)\setminus N_G(p)\ .$$

For a signed graph $$G\ ,$$ we say that a composition $$\phi$$ of ld, hi, and dlad graph rewriting operations is successful for $$G$$ if $$\phi(G)=\Lambda\ ,$$ where $$\Lambda$$ denotes the empty graph.

The relationship between the modeling of gene assembly as a string rewriting and as a graph rewriting has been explored in (Ehrenfeucht et al, 2002) and (Ehrenfeucht et al, 2003). It has been proved that for any legal string $$\pi$$ and any ld, hi, or dlad operation $$f\ ,$$ if $$f$$ is applicable to $$\pi\ ,$$ then $$f$$ (or rather its graph-based counterpart having the same notation) is applicable to $$G_\pi\ .$$ Moreover, the signed overlap graph corresponding to $$f(\pi)$$ is $$f(G_\pi)\ .$$

The other way around, the relationship is more complex. In the case of an operation $$f$$ that is either a hi, or a dlad we have a similar result: if $$f$$ is applicable to $$G_\pi\ ,$$ then $$f$$ (or rather its string-based counterpart having the same notation) is applicable to $$\pi\ .$$ Moreover, the signed overlap graph corresponding to $$f(\pi)$$ is $$f(G_\pi)\ .$$ In the case of ld however, an operation may be applicable to the graph, while not being applicable to the string. As an example one may take the double occurrence string $$\pi=2332$$ having the discrete graph with the negative nodes 2,3 as its overlap graph. By definition, both $$\mathsf{ld}_2$$ and $$\mathsf{ld}_3$$ are applicable to $$G_\pi\ ,$$ while only $$\mathsf{ld}_3$$ is applicable to $$\pi\ .$$

The string-based model for gene assembly and the graph-based model are however equivalent in the following sense, as proved in (Ehrenfeucht et al, 2002) and (Ehrenfeucht et al, 2003):

• For any successful strategy $$\phi$$ for a legal string $$\pi\ ,$$ its graph-based counterpart $$\phi$$ is a successful strategy for $$G_\pi\ .$$
• For any successful strategy $$\phi$$ for a signed graph $$G\ ,$$ there is another successful strategy $$\phi_1$$ for the graph $$G$$ that is obtained from $$\phi$$ by moving all ld operations in $$\phi$$ at the very end of the composition. (In other words, no hi and no dlad operation follows any ld operation in the composition $$\phi_1\ .$$) We call any composition having this property a canonical strategy for $$G\ .$$
• For any canonical strategy $$\phi_1$$ for the signed overlap graph $$G_\pi\ ,$$ there is a successful strategy $$\phi_2$$ for $$\pi$$ that only differs from $$\phi_1$$ in the order of the ld operations.

## Summary and main research lines

We have surveyed in this article some of the research done on the computational nature of gene assembly in ciliates. We have described two molecular models (the intermolecular and the intramolecular models) for gene assembly, a template-guided DNA recombination mechanism that implements gene assembly, and a mathematical model on three levels of abstraction for reasoning about the gene assembly strategies. We have entered into more details in the mathematical modeling of the intramolecular model, where the bulk of research on the gene assembly process was done. The main line of investigation in the intermolecular model concerns its computational power of the model, see (Landweber, Kari, 1999) and (Kari et al, 1999).

There is a number of research topics that we did not cover in this short overview. We will return to them in the forthcoming updates of this article. Topics include:

• Universality
• Invariants
• Gene patterns
• Parallelism
• Simple operations
• Computing with ciliates

## References

• Ehrenfeucht A, Harju T, Petre I, Prescott DM, Rozenberg G (2003) Computation in Living Cells: Gene Assembly in Ciliates. Springer, Berlin Heidelberg New York.
• Brijder R, Daley M, Harju T, Jonoska N, Petre I, Rozenberg G (2010) Computational nature of gene assembly in ciliates. In: G.Rozenberg, T.H.W.B\"ack, J.N.Kok (Eds.) Handbook of Natural Computing, Springer.
• Landweber L, Kari L (1999) Nature’s solution to a computational problem. In: Kari L, Rubin H, Wood D (Eds.) Special issue of Biosystems: Proceedings of DNA Based Computers IV. 52(1-3), 3-13.
• Harju T, Petre I, Rozenberg G (2004) Two models for gene assembly. In: J.Karhumaki, G.Paun, G.Rozenberg (Eds.), Theory is Forever, LNCS 5113, 89-101, Springer.
• Ehrenfeucht A, Prescott DM, Rozenberg G (2001) Computational aspects of gene (un)scrambling in ciliates. In: L.F.Landweber, E.Winfree (Eds.) Evolution as computation, 216-256, Springer.
• Prescott DM, Ehrenfeucht A, Rozenberg G (2001) Molecular operations for DNA processing in hypotrichous ciliates. Europ. J. Protistology 37, 241-260.
• Ehrenfeucht A, Petre I, Prescott DM, Rozenberg G (2001) Circularity and other invariants of gene assembly in ciliates. In: M.Ito, G.Paun, S.Yu (Eds.) Words, Semigroups, and Transductions, World Scientific, 81-97.
• Ehrenfeucht A, Harju T, Petre I, Prescott DM, Rozenberg G (2003) Formal systems for gene assembly in ciliates, Theoretical Computer Science 292, 199-219.
• Ehrenfeucht A, Petre I, Prescott DM, Rozenberg G (2002). String and graph reduction systems for gene assembly in ciliates. Mathematical Structures in Computer Science, 12, 113-134.
• Prescott DM, Ehrenfeucht A, Rozenberg G (2001) Template-guided recombination for IES elimination and unscrambling of genes in stichotrichous ciliates. J. Theor. Biol. 222 323–330
• Angeleska A, Jonoska N, Saito M, Landweber LF (2007) RNA-template guided DNA assembly. J. Theor. Biol. 248 706–720
• Nowacki M, Vijayan V, Zhou Y, Schotanus K, Doak TG, Landweber LF (2008) RNA mediated epigenetic programming of a genome-rearrangement pathway. Nature 451 153–158
• Kari L, Kari J, Landweber LF (1999) Reversible Molecular Computation in Ciliates. In: J.Karhumaki, H.Maurer, G.Paun, G.Rozenberg (eds.) Jewels are Forever, Springer, 353-363

Internal references