Splicing systems
Rosalba Zizza (2010), Scholarpedia, 5(7):9397. | doi:10.4249/scholarpedia.9397 | revision #137071 [link to/cite this article] |
Preliminary remark
A splicing system is a formal model of a recombinant behaviour of sets of double stranded DNA molecules when acted on by restriction enzymes and a ligase. Recombinant DNA technology has received a rapid increase in interest in Molecular Biology, aimed at the development of new biotechnologies. On the other hand, the surprising power of splicing process stimulated interest towards the design of computational models inspired by biological phenomena. In this perspective, Formal Language Theory appears to be a natural framework for formalizing and investigating DNA computing models. Additional motivation for the investigation of splicing systems has been generated by the work of several researchers adding control structures to the splicing formalism thereby creating universal computation systems.
The purpose of this contribution is to stimulate interest in this topic, introducing various concepts, as thoroughly but as simply as possible, leaving the complete definitions and results to a large number of papers, only some of which it has been possible to cite in the bibliography. The author would like to especially thank Tom Head, who were so generous with his time and expertise and whose support and guidance, in the preliminary phase of this project, were invaluable.
Contents |
The Origins
"It is hoped that the line of research initiated here will interest biologists who have not previously found reason to be concerned with formal language theory [...] The results discussed here are certainly only modest formal achievements. It is hoped that future developments will provide much stronger results following the given paradigm and yield wholly new paradigms as well" (Head (1987)).
Formal Language Theory has been positively influenced by the introduction of new models of computation, inspired by biological processes, which are contributing to its renewal. A first example of this trend is the mathematical formalism for biological phenomena of recombinant processes proposed in 1987 by Head. In his seminal work, he introduced the formalism of splicing systems to describe the generative power of some recombination processes. This is still one of the growing areas in Molecular Computing, as well as an area of Formal Language Theory. It is commonly called dry molecular computing, to underline that the aim is to develop theoretical models of computations, starting from a biological mechanism. Indeed, a DNA strand is formed by a backbone of linked molecules of deoxyribose, each one of them carrying one of four bases Adenine, Guanine, Cytosine, and Thymine. Hence, it can be viewed as a string over the four letter alphabet (or nucleotides A, G, C, T) and so a natural way to model DNA computations is within the framework of Formal Language Theory. The (single-stranded) DNA sequence has a polarity, i.e., a sequence of DNA is distinct from its reverse, and the two distinct ends of a DNA sequence are known as the 5' end and the 3' end. The four bases are related to each other by the Watson-Crick complementary rule (A=T; G=C) and two complementary (single) strands with opposite polarity will join together, forming a double helix (hybridization process). However, we can treat each of such molecules as a sequence of pairs of letters, representing two strings bounded together in parallel (i.e., our alphabet consists of four symbols [A/T],[C/G],[G/C],[T/A]).
Loosely speaking, splicing occurs in two steps. First, restriction enzymes (proteins) recognize a specific pattern in a DNA sequence and then they cut the molecule in a specific point inside the recognized pattern. The site and the shape of the cut ends are specific to each enzyme. Since DNA is double stranded, the cutting phase may produce blunt or staggered ends. This is caused by cleavage patterns which are blunt or not. It is quite evident that if no sticky ends match, no fragment previously cut may be pasted. Clearly, if such staggered ends are in the vicinity of each other, the two original molecules may be re-associated, but this is not of particular interest. Nevertheless, it is also possible that two new hybrid molecules will be formed. Ligase enzymes perform this process (second step), pasting together properly matched fragments, under chemical conditions (related to cleavage patterns structure and complementary laws). Thus, two DNA sequences are cut by two restriction enzymes and the fragments may be recombined so that new sequences are produced. A precise description of the biochemical introduction can be found in Head (1987), Head et al. (1996), Păun et al. (1998).
Splicing_systems/A short biological description.
After being investigated in a few papers as
an operation with double stranded sequences,
the splicing operation
has been considered as an operation with single
stranded strings. Loosely speaking,
even if different definitions
of splicing have been given in the literature,
the splicing operation is applied to two words and
two different words may be generated.
The linear splicing operation
concatenates a prefix of one string with a suffix of
another string,
under some conditions, represented as a (splicing) rule,
i.e.,
splicing performs a change in prefixes and suffixes.
Note that from this viewpoint, we have no mismatches.
Formal Models of DNA recombination
As a model of the above biochemical operation, Head (1987) considered the following string operation (passing from double stranded sequences to strings is allowed, due to the precise Watson-Crick complementarity of nucleotides): consider an alphabet A and two finite sets, B and C, of triples (\(\alpha\),\(\mu\),\(\beta\)) of strings in A*. These triples are called patterns and the string \(\mu\) is called the crossing of the triple. Given two patterns (with the same crossing) \(\alpha \mu \beta, \alpha' \mu \beta'\) both from B or both from C, and two strings \(u \alpha \mu \beta v, p \alpha' \mu \beta' q\ ,\) these strings can be spliced and get \(u \alpha \mu \beta' q, p \alpha' \mu \beta v\ .\)
Note the similarity with the DNA recombination, in that the crossing of triples corresponds to the common sticky end of fragments produced by restriction enzymes; the fact that there are two sets of patterns, B and C, corresponds to the fact that the double stranded DNA molecules can have 5' or 3' sticky ends. Indeed, given such a scheme, based on double-stranded DNA molecules, one must take care of given overhangs, as well as blunt-end cleavage, and that only certain recombinations are permitted.
Example: Let us consider the word \(w=cxcxc\) and the triple \((c,x,c)\ .\) The pattern \(cxc\) occurs twice in \(w\) and the triple can be applied, coupled with itself, to two copies of \(w\ .\) Thus, using all combinations, the result of splicing is\[(cx)c + (cx)^2c + (cx)^3c\ .\]
Abstracting further from this idea, Păun (1996a) considered splicing rules of the form \(r=u_{1}\#u_{2}\$u_{3}\#u_{4}\ ,\) where \(u_{1},u_{2},u_{3},u_{4}\) are strings over a given alphabet A and \(\#,\$ \not \in A\ .\) The words \(u_{1}u_{2}\ ,\) \(u_{3}u_{4}\) are called sites of r. Given such a rule \(r\ ,\) by splicing the two strings \(x_{1}u_{1}u_{2}x_{2},y_{1}u_{3}u_{4}y_{2}\ ,\) the strings \(x_{1}u_{1}u_{4}y_{2},y_{1}u_{3}u_{2}x_{2}\) are produced. It is clear that this is a generalization of Head's definition: each string \(u_{1}u_{2},u_{3}u_{4}\) corresponds to the site of a restriction enzyme; two sites stay together in the same rule if they produce matching sticky ends, while the crossing is supposed to be included in the context strings (hence it can be empty).
Example: Consider the ordered pair of words \(u=ttttggaaccttt\) and \(v=tttggaacctttt\ .\) One can construct a new word from these two by cutting each, between the two occurrences of the symbol "a" in the subsequence \(s=ggaacc\) that occurs in both \(u\) and \(v\ .\) Then build a new string by pasting together, i.e., concatenate the left portion of \(u\) and the right portion of \(v\ .\) This description is formalized by the rule \(r=gga\#acc\$gga\#acc\ .\) The cutting phase generates the fragments\[ttttgga, accttt, tttgga, acctttt\ .\] The pasting phase generate the words \(ttttggaacctttt\ ,\) \(tttggaaccttt\ .\) The rule \(r=gga\#1\$t\#t\) applied to \(u\) and \(v\) generates the words \(ttttgga ttggaacctttt, ttttgga tggaacctttt,\) \(ttttgga ttt\ ,\) \(ttttgga tt\ ,\) \(ttttgga t, ... \ .\) Other words can be generated by applying the rule to two copies of \(u\ ,\) or to two copies of \(v\ .\)
Let me play around with splicing... see Splicing_systems/Păun splicing in cartoons
A still more general definition was considered in Pixton (1996): the rules are of the form \((\alpha,\alpha', \beta)\) and by splicing two strings \( \epsilon \alpha \eta, \epsilon' \alpha' \eta'\) the strings \(\epsilon \beta \eta', \epsilon' \beta \eta\) are generated. This definition is more general than Păun's (note that the context substrings \(\alpha, \alpha'\) are substituted by \(\beta\) during the splicing).
Example: The rule \((a, xa, xa)\) applied to (two copies of) the word \(cxae\) generates \(cx xa e\) (and \(cx a e\)). Splicing allows us to "pump" the letter x.
Computing devices based on splicing
From a string operation we can go to a language operation in a very natural manner and so a computing (language generating) device can be obtained, of the form \(S= (A,I,R)\ ,\) where A is an alphabet, I is a set of strings over A (called initial language) and R is a set of splicing rules (of a given type, from the three mentioned above) over A. Such machinery generates a language in the following way: starting from the strings in I, splice them in all possible ways with respect to the rules in R, add all the resulting strings to I and iterate the process (it may be assumed that when a DNA molecule is present, arbitrarily many copies of it are present, obtained by amplification).
The language generated by \(S\ ,\) denoted by \(L(S)\) and called the splicing language generated by \(S\ ,\) consists of all strings which can be obtained in this way (iterated splicing). More precisely, let us give a formalism for the iterated splicing in the case of Păun's definition, as known in the literature. Obviously, the same has been done for the two other definitions, in a similar manner. Firstly, \((w', w''){\vdash}_r w\) denotes the fact that the word \(w\) is generated (following Păun's definition) by the rule \(r\ ,\) starting from \(w', w''\ .\) Given a splicing system S and a language \(L \subseteq A^*\ ,\) we set \(\sigma'(L)=\{w \in A^* ~|~ \exists w', w'' \in L, \exists r \in R : \; (w',w''){\vdash}_r ~w \}.\) Then, we define \(\sigma^0(L)=L, \ \sigma^{i+1}(L) = \sigma^i(L) \cup \sigma'(\sigma^i(L)), \ i \geq 0\ ,\) and \(\sigma^*(L) = \bigcup_{i \geq 0} \sigma^i(L).\) Finally, given a splicing system S, with initial language \(I \subseteq A^*\ ,\) the language \(L(S)=\sigma^*(I)\) is the language generated by S. A language L is Păun generated (or L is a splicing language) if a splicing system S exists such that L=L(S). In the following, we often call splicing systems as H systems.
Example: Consider again the Pixton rule \((a, xa, xa)\) and the word \(cxae\ .\) It is not difficult to see that by iterating the application of the rule to (copies of) the generated word, the language \(cxx^*ae\) can be generated.
Example: Consider the regular language \(L=(aa)^*\ .\) It is known that L cannot be generated by finite splicing systems (both I,R finite sets) since any finite set of rules that allow every word in L to be generated will also generate words of odd length, as well as the words of even length. The language \(L'=caba^*b\) is generated by a finite splicing system, by taking advantage of the fact that the letter "c" occurs as the leftmost letter of every word in L' and occurs nowhere else in any word of L'. The letter c is a constant for L', a classical notion given by Schutzenberger. One can check that L'=L(S), for the splicing system \(S=(A,I,R)\ ,\) with \(A=\{a,b,c\}\ ,\) \(I=\{cabb,cabab,cabaab\}\ ,\) \(R=\{caba\#a \$cab \#a\}\ .\)
Notice that a standard extension is to consider a terminal alphabet, \(T\subseteq A\) (extended H system) and to accept in \(L(S)\) only the strings consisting of symbols from T (extended splicing).
Fundamental results
Finite splicing systems \(S =(A,I,R)\ ,\) i.e., with finite sets I and R (hence without a terminal alphabet), cannot generate all regular languages, e.g. \((aa)^*, a^*ca^*ca^*.\) Conversely, the language \(L(S)\) generated by any such system is regular. For Head rules this was proved by Culik et al. (1991), using domino languages, for Păun rules it was directly proved by Pixton (1995), who then also proved in 1996 the regularity for Pixton rules (and this implies once again the regularity for Păun's case). Very recently, a new and direct proof of the regularity of the languages generated by finite splicing systems has been provided by Manca (2004) and by Head et al. (2006). Actually, Pixton (2000) also proved the general result that by iterating Pixton splicing rules on a language from a given full AFL (abstract family of languages), one also gets a language from that AFL. We have mentioned that Pixton's definition is stronger than Păun's, which in turn is stronger than Head's definition. The separations are proper: there are regular languages which can be generated by Pixton splicing systems but not by Păun splicing systems, and there are regular languages which can be generated by Păun splicing systems but not by Head splicing systems (P. Bonizzoni, C. Ferretti, G. Mauri, R. Zizza, H.J. Hoogeboom, N. Van Vugt).
I\R | Fin | Reg | LIN | CF | CS | RE |
---|---|---|---|---|---|---|
Fin | Fin,Reg | Fin,RE | Fin,RE | Fin,RE | Fin,RE | Fin,RE |
Reg | Reg | Reg,RE | Reg,RE | Reg,RE | Reg,RE | Reg,RE |
LIN | LIN, CF | LIN,RE | LIN,RE | LIN,RE | LIN,RE | LIN,RE |
CF | CF | CF,RE | CF,RE | CF,RE | CF, RE | CF,RE |
CS | CS,RE | CS,RE | CS,RE | CS,RE | CS,RE | CS,RE |
RE | RE | RE | RE | RE | RE | RE |
Because in the DNA computing literature it is mainly the Păun version of splicing that is investigated, from now on we will refer almost exclusively to this variant. Precisely, let \(F_1, F_2 \in \{Fin,Reg,LIN, CF, CS, RE\}\) (i.e., classes of finite, regular, linear, context-free, context-sensitive, recursively enumerable languages, respectively), we denote \(H(F_1, F_2)= \{L(S) ~|~ S =(A,I,R)\) with \(I \in F_1, R\in F_2 \}\) the class of languages generated by Păun splicing systems, where the initial language I belongs to \(F_1\) and the set of rules R belongs to \(F_2\) (the rules are written as strings, hence it makes sense to speak about the type of their language). The following table (see Head et al. (1996)) collects the results we already know on the level of the Chomsky hierarchy \(H(F_1, F_2)\) belongs to. These results, already mentioned partially, were obtained in separate papers. Furthermore, if we look at the table, either \(H(F_1, F_2)\) is a specific class of languages in the above hierarchy (one entry in the table), or it is strictly intermediate between two of them (two entries in the table).
One of the more interesting still open problems is the characterization of the proper subclass of the class of regular languages which can be generated by finite splicing systems, and in turn, the design of a procedure allowing us to decide whether a regular language is generated by finite splicing systems. Indeed, Păun and Pixton splicing systems of the form \(S=(A,I,R)\ ,\) with finite I and R, can generate regular languages, hence they have at most the power of finite automata (see the examples reported at the beginning of this section).
In 2012 Kari and Kopecki (DNA18 conference) provided a procedure for deciding whether a regular language \(L\) is a splicing language (i.e., \(L\) is generated by a finite splicing system). This procedure is based on the size of the syntactic monoid of \(L\). Namely, if \(L\) is a splicing language, they proved that \(L\) is also generated by a "canonical" splicing system \(S\) whose size is a function of the size of the syntactic monoid of the input language, and which can be effectively constructed. By a Pixton's result (1996), we can construct the finite state automaton accepting \(L(S)\). Finally, we test whether \(L(S)=L\), a decidable property for regular languages, in order to decide whether \(L\) is a splicing language.
From a computational point of view, the competence of finite automata is too limited. A characterization of recursively enumerable languages (hence of the power of Turing machines) is obtained in Păun's case when using a set of splicing rules which is a regular language. This does not hold for Pixton splicing, a case where even with a regular set of rules one gets only regular languages. It is worth of note that \(H(Fin, Fin)\) is not comparable with LT (locally testable) languages.
When Păun's definition is considered, two types of splicing systems may be defined, according to the fact that one word or two words are produced as a result of the splicing operation, called 1-splicing and 2-splicing operation, respectively. Thus, 1-splicing and 2-splicing languages may be defined as languages generated by splicing systems when we use 1-splicing or 2-splicing operations. While the power of splicing languages was investigated in comparison with classes in the Chomsky hierarchy, it wasn't known if 1-splicing and 2-splicing languages represent the same class. Non trivial classes of regular languages have been found which may be generated by using the 1-splicing operation (over a finite splicing system) but which cannot be generated by using the 2-splicing operation (e.g., \(L'=L+ cL + Ld\ ,\) where \(L \subseteq A^*\ ,\) \(c,d \not \in A\)). Thus, the 1-splicing operation is more powerful than the 2-splicing operation (E. Laun-Goode, D. Pixton, S. Verlan, R. Zizza).
Restricted versions of splicing rules
In order to characterize regular languages generated by finite splicing systems, some partial results have been provided by considering (realistic) additional hypotheses or suitable restrictions on splicing rules. As an example, R is reflexive if for each \(u_1 \# u_2 \$ u_3 \# u_4 \in R\ ,\) we have \(u_1 \# u_2 \$ u_1 \# u_2, u_3\#u_4 \$ u_3 \#u_4 \in R\ ,\) whereas R is symmetric if for each \(u_1 \# u_2 \$ u_3 \# u_4 \in R\ ,\) we have \( u_3\#u_4 \$ u_1 \#u_2 \in R\ .\) Observe that 2-splicing is equivalent to 1-splicing plus the symmetric hypothesis on R.
The splicing operation was explicitly linked with the concept of constant already by Head (1987), in his seminal paper. Given a language \(L \subseteq A^*\ ,\) a word \(c \in A^*\) is a constant if, whenever \(ucv, pcq \in L\) then \(ucq, pcv \in L\) (Schutzenberger, 1975). It is evident the similarity between a constant and a crossing in Head's definition. Head proved that persistent splicing languages coincide with Strictly Locally Testable (SLT) languages and may be generated by systems such that \(B=C=A^k\ ,\) for \(k \geq 1\) (uniform splicing systems or Null Context H-systems, NCH systems). To do this, he used the result of De Luca and Restivo (1980) showing that a language L is SLT if and only if there is an integer k such that all strings in \(A^k\) are constants with respect to L. Kim (1992) constructed a finite state automaton accepting the language generated by a Head system with a regular initial language and a finite set of triples. Moreover, in the same paper a characterization of SLT languages is given, based on automata properties.
Head (1998b) has given different characterizations of the family of SLT languages, pointing out that the class of SLT languages itself is the union of the families of languages generated by a special hierarchy of splicing systems (simple H systems - SH, each crossing of a triple is a letter), introduced as a subclass of NCH systems by Mateescu et al. (1998). It is decidable whether a regular language is generated by SH systems and in the case of unary languages \(L \subseteq a^*\) they have a very simple regular expression (\(L=a^*\) or \(L=a^+\) ). SH systems are generalized by considering semi-simple (Păun) H systems where all rules have the form \(a \# 1 \$ b \# 1\ ,\) \(a,b \in A\ .\) These systems are a particular case of splicing systems with one-sided context.
Head (1998a) has considered the generating power of finite splicing systems with a one-sided context, i.e., each rule has the form \(u \# 1 \$ v \# 1\) or \(1 \# u \$ 1 \#v\ ,\) where 1 is the empty word. It is also supposed that for each rule having the above form, rules \(u \# 1 \$ u \# 1\ ,\) \(v \# 1 \$ v \# 1\) (resp. \(1 \# u \$ 1 \# u\ ,\) \(1 \# v \$ 1 \# v\)) are also in R (i.e., R is reflexive). Head stated a relationship between splicing and constants and proved that it is decidable whether a regular language is generated by one-sided context splicing systems, but only when the rules are either \(u \# 1 \$ v \# 1\) or \(1 \#u \$ 1 \# v\ .\)
By applying one of Head's results, semi-simple Păun splicing languages are SLT languages (Goode et al. (2001)). Both in the initial Mateescu et al. (1998) paper about simple H systems and later by Head (who gave the name of k-fat [semi-]simple H systems) splicing rules of the form \(x\#1\$x\#1\) were considered, with \(|x|\le k\ ,\) for a given constant k. The k-fat semi-simple H systems were investigated by Ceterchi, M. Vide and Subramanian (2004) both for linear (and circular) strings. They also consider types of rules different from (1,3)-type as above (the non-null strings can be placed in several combinations in the four positions of a general splicing rule \(u_1\#u_2\$u_3\#u_4\)). The results are as expected: characterizations, incomparability of different types, decidability for regular languages of being in a given class of splicing languages, etc.
Extending Head's results Head (1998a), the structure of the class of reflexive regular splicing languages (languages generated by finite splicing systems in which the set R satisfies the reflexive hypothesis) has been completely described by means of a finite set of constants for the language and of some factorizations of such constants into two words ( Bonizzoni et al. 2005a). In (Bonizzoni, 2010) the author provides a procedure to decide whether a regular language is a reflexive splicing language, based on the above-mentioned characterization that is given in terms of a finite set of constants for the language. A different approach for the same class of reflexive languages was also presented by Goode and Pixton (2007) together with a decision procedure.
It has been conjectured that a necessary condition for a regular language L to be a splicing language is that L must have a constant in the Schützenberger’s sense. Bonizzoni and Jonoska (2011), proved this longstanding conjecture to be true, by using properties of strongly connected components of the minimal deterministic finite state automaton for a regular splicing language.
Contributors: P. Bonizzoni, R. Ceterchi, C. De Felice,
P. Harsha, T. Head, N. Jonoska, K. Krithivasan, E. Laun-Goode,
A. Mateescu, G. Mauri, G. Păun, D. Pixton, G. Rozenberg, A. Salomaa,
K.G. Subramanian,
N. Van Vugt, S. Verlan, R. Zizza.
Universality by finite splicing systems
From a computational point of view, the above mentioned results are quite frustrating: finite H systems compute only at the level of finite automata, while computational universality is obtained by using an infinite set of splicing rules. Fortunately, a deeper analysis of the proofs of universality indicates a number of ways for overcoming this drawback, leading to extended H systems with permitting contexts, whose rules have associated finite sets of symbols such that a rule is applicable only to strings which contain the associated symbols. Actually, many other types of controlled splicing systems were considered. About a dozen such systems can be found in the literature, in general imitating the types of controls known from the "classic" regulated rewriting area in formal language theory. In particular, the following controls were investigated:
- forbidding contexts: symbols are associated with rules and a string cannot be spliced if it contains such a symbol
- target languages: the splicing of two strings is allowed only if the resulting strings belong to a given regular language which is associated with the rule or with the whole set of rules (in the former case we say that we have local targets, and in the latter case we have a global target)
- programmed control: a next mapping is given on the set of rules, which indicates the sequencing of rules
- evolving sets of rules: at each step, a different set of rules is produced, by point mutation rules which act on the splicing rules themselves
- double splicing: the strings resulting from the splicing of two strings are immediately spliced again by any available rule
- considering multisets of strings: the strings are counted, by splicing they are consumed, whereas the multiplicity of strings resulting from a splicing operation is increased by one.
In all these cases one gets characterizations of recursively enumerable languages. Moreover, because the proofs are constructive, starting from universal type-0 grammars, one can obtain universal H systems, programmable splicing systems capable of computing at the level of Turing machines. Explicit universal splicing systems of various types were exhibited, even if all of them are rather complex, hence unrealistic from a practical point of view.
Contributors: G. Alford, V.T. Chakaravarthy, E. Csuhaj-Varju, K.L. Denninghoff, C. Ferretti, F. Freund, R. Freund, P. Frisco, R.W. Gatterdam, L. Kari, S. Kobayashi, S. Kopecki, K. Krithivasan, V. Manca, G. Mauri, C. Martin-Vide, V. Mitrana, M. Oswald, A. Păun, G. Păun, G. Rozenberg, A. Salomaa, G. Vaszil, T. Yokomori.
Splicing circular strings
Circular splicing is a much investigated topic, both because circular DNA is important in biology and because splicing can be used to understand properties of circular words, even if many classical problems are still open on them. Circular splicing systems were introduced in Head (1992) along with various open problems related to their computational power. In the circular context, the splicing operation acts on two circular DNA molecules by means of a pair of restriction enzymes as follows. Each of these two enzymes is able to recognize a pattern inside one of the given circular DNA molecules and to cut the molecule in the middle of such a pattern. Two linear molecules are produced and then they are pasted together by the action of ligase enzymes. Thus a new circular DNA sequence is generated. For instance, circular splicing models the integration of a plasmid into the DNA of a host bacterium. Depending on whether or not these ligase enzymes substitute the recognized pattern (in nature, both situations can happen), we have the Pixton definition or the Head and Păun definition.
Obviously a molecule of circular DNA will be represented as a circular word, i.e., as an equivalence class with respect to the conjugacy relation ~, defined by xy ~ yx, for x,y in A*. A circular language \(C=^\sim L\) is a set of circular words and the full linearization of C, denoted \(Lin(C)\ ,\) is the set of all the words in A* corresponding to the elements of C, i.e., \(Lin(C)=\{w \in A^* | ^\sim w \in C\}\ .\) There is also a notion of regularity of circular languages, where \(Reg^\sim = \{ C \subseteq ^\sim A^* ~|~ \exists L \in Reg : ~ ^\sim L=C \}.\) Indeed, looking at regular circular languages is the same as looking at regular languages closed under the conjugacy relation and so given a circular language C, we have \(C \in Reg^\sim\) if and only if its full linearization \(Lin(C)\) is regular. The structure of \(Reg^\sim\) is still unknown (except for regular languages generated by some regular group codes).
Definitions
As already said, there are three definitions of circular splicing given by Head, Păun and Pixton. In Head's definition (1992), given two circular words \(^\sim hpxq, ^\sim kuxv\) on A and two triples (p,x,q),(u,x,v), the splicing operation produces \(^\sim hpxvkuxq\ ,\) if \((p,x,q)P(u,x,v)\ ,\) where P is a binary relation.
Example: Suppose that (a,1,b)P(b,1,a). Starting with (two copies of) \(^\sim ab\ ,\) the word \(^\sim a^2b^2\) is generated, whereas \(^\sim a^3b^3\) is generated starting with \(^\sim a^2b^2\) and \(^\sim ab\ .\)
Another definition is presented in Head et al. (1996), named here Păun's definition since it is the counterpart of Păun linear splicing in the circular context. Given a rule \(r=u_1 \#u_2 \$ u_3 \# u_4\) and two circular words \(^\sim hu_1u_2, ^\sim ku_3u_4\ ,\) the rule cuts and linearizes the two strings obtaining \(u_2hu_1\) and \(u_4ku_3\ ,\) and pastes them and circularizes obtaining \(^\sim u_2hu_1u_4ku_3\ .\)
Example: The rule \(a^2 \# 1 \$ 1 \# a^2\) applied to (two copies of) \(^\sim a^2\) generates \(^\sim a^4\ ,\) and applied to \(^\sim a^2\) and \(^\sim a^4\) generates \(^\sim a^6\ .\) The rule \( a \ \# ~ b ~\$~ b ~\#~ a\) applied to (two copies of) \(^\sim ab\) generates \(^\sim a^2b^2\ ,\) and applied to \(^\sim a^2b^2\) and \(^\sim ab\) generates \(^\sim a^3b^3\ .\)
Observe that in both the Head and Păun definition, circular splicing is length increasing, i.e., starting form two circular words, the spliced word has length greater than or equal to the sum of the lengths of the two words. The same observation cannot be made for Pixton's definition (1996): given two circular words \(^\sim \alpha \epsilon, ^\sim \alpha' \epsilon'\ ,\) the two rules \(r=(\alpha, \alpha' ; \beta), \overline {r}=(\alpha', \alpha;\beta')\) cut and linearize the two strings, obtaining \(\epsilon \alpha, \epsilon' \alpha'\) and then paste, substitute and circularize them, producing \(^\sim \epsilon \beta \epsilon' \beta'\ .\)
Example: The rules \((a^2,a^2b;a^2),(a^2b, a^2;1)\) applied to (two copies of) \(^\sim a^4b\) generate \(^\sim a^6b\ .\) Indeed, in order to generate \(^\sim a^6b\ ,\) for example, the words \(^\sim a^4b, ^\sim a^4b\) can be spliced and choose \(\epsilon=a^2b, \epsilon'=a^2, \beta=a^2, \beta'=1\) and so \(^\sim a^2ba^2a^2\) is generated: the letter "b" of the second word is erased by setting \(\beta'=1\ .\)
A (finite) circular Head splicing system is a quadruple (A, I, T, P), where I is the (finite) initial circular language, T is a finite set of triples of words in A* and P is a binary relation on T such that, for each (p, x, q), (u, y, v) in T, if (p, x, q)P(u, y, v) then x = y. A (finite) circular Păun or Pixton) splicing system is once again a triple (A,I,R) where A is a finite alphabet, I is the (finite) initial circular language and R is the (finite) set of rules (or triples). The circular splicing language generated by a circular splicing system is defined as in the linear case. Of course, the computational power of these models has been investigated, but few results are known.
Let me play around again with splicing... see Splicing_systems/Păun splicing in cartoons
Computational power of the finite models
Even in the finite case (I,R finite sets), without additional hypotheses, the computational power of circular splicing systems is unknown. We do not know which class of languages they generate, we do not even know which regular languages belong to this class and both regular and non regular languages may be generated, in contrast with the linear case. Indeed the regular language \((aa)^*\) cannot be generated by finite linear splicing systems, whereas the (regular) circular language \(^\sim (aa)^*\) is a circular Păun splicing language, e.g., by choosing \(A = \{a\}, I = {^\sim aa} \cup 1, R = \{aa\#1\$1\#aa\}.\) However, consider the context-free circular language \(^\sim a^nb^n\ ,\) which is not a regular circular language (since its full linearization is not regular). It is quite easy to check that \(^\sim a^nb^n\) is generated by Head (or Păun) circular splicing systems (see the second example in the previous subsection). Moreover, one can see that \(C =^\sim(aa)^*b\) cannot be generated by a finite Head or Păun system, since the definitions force words with more than one occurrence of "b" to be generated. The same language can be generated by a finite Pixton system, by choosing \(A = \{a, b\}, I = ^\sim\{b,a^2b,a^4b\}\ ,\) \(R = \{(a^2, a^2b; a^2), (a^2b,a^2; 1)\}\) (see the last example in the previous subsection, P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza, 2000). It has been proved that the Head and Păun definitions are equivalent, whereas it has been shown that the Pixton definition is more powerful than the other two (Bonizzoni et al. 2010a).
Another natural research direction is to investigate whether the property of being a regular circular splicing language can be translated into the existence of a finite automaton recognizing its full linearization. Loosely speaking, the crux of the matter in stating whether \(^\sim L\) is Păun generated is to exhibit a finite set of rules which are able to produce words having as factors a non-bounded number of occurrences of labels c of closed paths in the finite state automaton recognizing L. In Bonizzoni et al. (2004) the authors overcome this difficulty for a special class of regular languages, generated by finite circular splicing systems, the so-called star languages \(^\sim X^*\ ,\) which include free monoids X* generated by regular group codes X. Languages over a one-letter alphabet generated by finite (Păun) circular splicing systems have also been characterized and it is decidable whether a regular language over a one-letter alphabet can be generated by finite (Păun) circular splicing systems ( Bonizzoni et al. (2005b)).
One of the first results regarding the computational power of Head circular splicing systems states that a finite Head circular splicing system with rules having the form (1,x,1), where |x|=1, always generates regular circular languages (Siromoney et al. (1992)). On the contrary, consider the Head system S' defined by \(A = \{a,b, c\}, I= ^\sim \{baca\},T = \{ (1,a,1)\}, (1,a,1)P(1,a,1).\) In Bonizzoni et al. (2010b) it is shown that S' generates a non-regular language. Various classes of circular Păun splicing systems such that for each \( u_1 \# u_2 \$ u_3 \# u_4 \in R\ ,\) we have \(u_1u_2, u_3u_4 \in A\ ,\) where A is the alphabet of I, have been defined and studied (e.g., Ceterchi et al. (2003), a survey is presented in Bonizzoni et al. (2010b), finding deep connections with properties of graphs and other classes of (context-free) languages (pure unitary languages, Bonizzoni et al. (2010a)).
In Berstel et al. (2012)) the authors proved that it is decidable, given a circular splicing language and a regular language, whether they are equal. In addition, in order to define the computational power of circular splicing systems, alphabetic splicing system have been introduced. These are a generalization of simple and semi-simple splicing systems. Precisely, we have that each hand (i.e., \(u_i\)) of a site is either a letter or the empty word. It has been proved that the language generated by an alphabetic splicing system, and consequently also by the simple and semi-simple splcing systems, is context-free. This result has been obtained by introducing a new kind of splicing operation, flat splicing, in which one word is inserted in another word, by using contexts specified by a rule.
As done for the linear case, it is natural to assume that R is reflexive or R is symmetric. In addition, the self-splicing operation has also been considered. Self-splicing introduces different semantics of how rules are used. While in the case of the circular splicing operation, two words are pasted together to form a new circular word, in the case of the self-splicing operation, a single circular word gives rise to two circular words. More precisely, let \(S = (A,I,R)\) be a splicing system. Then, given a rule \(u_1 \# u_2 \$ u_3 \#u_4\) and a circular word \(^\sim w\ ,\) we set \(^\sim w ~ {\vdash}_r ~ (^\sim w', ^\sim w'')\) if there are linearizations \(w\) of \(^\sim w\ ,\) \(w'\) of \(^\sim w', w''\) of \(^\sim w''\) such that \(w = xu_1u_2yu_3u_4, w' = u_4xu_1\) and \(w'' = u_2yu_3\ .\) Thus, \((^\sim w', ^\sim w'')\) is generated starting with \(^\sim w\) and by using self-splicing with a rule r. Obviously, these additional hypotheses can also be defined with respect to Head's or Pixton's formalism.
Pixton (1996) proved that a circular Pixton splicing system with I being a regular circular language and R being reflexive and symmetric, always generates a regular circular language, but when self-splicing is allowed. A similar closure property of splicing has been demonstrated for Păun circular splicing systems with I in a full AFL, the counterpart for circular splicing of an analogous result for linear splicing. A link between regular circular languages and automata is pointed out, along with examples of circular splicing languages which are not regular and which are generated by finite circular splicing systems (Pixton (2000)).
Further models
Other variants of these models have been described, by considering, for example, so-called mixed splicing (Head (1992)) in which both linear and circular words are allowed. Several splicing operations, either acting on circular strings only (for instance self-splicing) or dealing with mixed splicing have been proposed (Siromoney et al. (1992)). A new type of circular extended H system (CH systems) is put forward and the universality of this model is demonstrated. In particular, linear and circular strings are involved, without multiplicity, and with I,R both finite sets (Yokomori et al. (1999)). Starting from this work, CH systems have been considered as components of a communicated distributed circular H system, i.e., the circular splicing counterpart of the parallel communicating grammar systems (Siromoney, 2000).
Contributors
J. Berstel, L. Boasson, P. Bonizzoni, R. Ceterchi, V.R. Dare, C. De Felice, I. Fagnot, C. Ferretti, R. Freund, T. Head, S. Kobayashi, G. Mauri, V. Mihalache, G. Păun, D. Pixton, G. Rozenberg, A. Salomaa, R. Siromoney, K.G. Subramanian, C. M. Vide, T. Yokomori, R. Zizza.
Wet splicing
Soon after Adleman's experiment (1994) several papers were circulated/published stating how a Turing machine can be simulated using (a finite number of) restriction enzymes and ligases, working on a finite number of DNA molecules. Restriction enzymes and ligases mean splicing, and according to the regularity results mentioned above, such a process cannot surpass the level of finite automata. The long way to Turing machines is covered by the lab worker using some bio-stuff, not by the DNA. However, it is worthy of note that, even if a large amount of work remains to be done for its feasible realization, experimental work on splicing systems has also been carried out successfully by the Biomolecular Computing Group at Binghamton. This group proved the viability of Head's finite model using a single step procedure in the lab, making the predictive capability of splicing clear.
Contributors: T. Head, N. Jonoska, S.A. Karl, E. Laun-Goode, K.J. Reddy.
Other topics
For the reader's convenience, there is a list below of a number of topics which were investigated with respect to the splicing operation and splicing systems, mentioning only a few names among many contributors.
- Complexity theory for Splicing systems (the time complexity function of a splicing system is defined in terms of the number of rounds needed to generate a word of length n): R. Loos, M. Ogihara, A. Malcher, D. Wotschke.
- Splicing on arrays (extensions of Head's and Păun's splicing to two-dimensional arrays): V.T. Chakaravarthy, R. Freund, S.R. Kaushnik, S.N. Krishna, K. Krithivasan, U. Raghavan, R. Rama.
- Splicing on graphs (especially on trees): S. Bozapalidis, C. Ferretti, R. Freund, K. Krithivasan, G. Rahonis, Y. Sakakibara, R. Santhanan.
- Splicing in Membrane Computing : J. Castellanos, F. Freund, R. Freund, P. Frisco, S.N. Krishna, M. Margenstern, M. Oswald, A. Păun, Gh. Păun, R. Rama, A. Rodriguez-Paton, Y. Rogozhin, Y. Sakakibara, S. Verlan, T. Yokomori.
- Crossover operations (related to splicing: jump from one string to the other one several times, recombining the fragments): A. Atanasiu, J. Dassow, L. Ilie, K. Krithivasan, V. Mitrana, Y. Sakakibara.
- Other restricted versions of splicing rules: P. Harsha, T. Head, K. Krithivasan, E. Laun-Goode, A. Mateescu, Gh. Păun, D. Pixton, G. Rozenberg, A. Salomaa, N. Van Vugt.
- (Other) Controls on the use of splicing rules (to get universality): M.H. Begum, P.H. Chandra, J. Dassow, H.J. Hoogeboom, L. Kari, V. Mitrana, Gh. Păun, G. Rozenberg, A. Salomaa, K.G. Subramanian, D.G. Thomas, N. Van Vugt.
- Cut-and-paste operations (systematically investigated): P. Bonizzoni, R. Ceterchi, C. Ferretti, F. Freund, R. Freund, G. Mauri, F. Wachtler, R. Zizza.
- Arithmetics with splicing systems (explicit computations performed by using the splicing operation): P. Frisco.
- Relevance of splicing for natural language : G. Bel Enguix, J. Kelemen, V. Manca, S. Marcus, C. Martin-Vide, V. Mitrana, Gh. Păun.
- Learning (finding a generative device for a given language, starting from positive/negative sample strings): S.M. Kim, R. Siromoney, Y. Takada, T. Yokomori.
- Splicing on routes (as well as crossovering on routes, controlled by strings which specify the cut points): A. Mateescu.
- Mathematical properties of splicing (algebraic approaches, representations of languages by means of particular splicing operations): P. Bonizzoni, R. Dassen, C. De Felice, T. Head, H.J. Hoogeboom, N. Ishii, E. Laun-Goode, G. Mauri, D. Pixton, P. Raj, N. Van Vugt, R. Zizza.
- Distributed splicing systems (Test tube systems are distributed symbol processing mechanisms combining the above mentioned ideas with grammar systems theory, a recent field of research studying distributed symbol processing mechanisms that are based on formal language theoretic operations. The components of the system are test tubes processing strings describing DNA molecules which can communicate by merging their contents with that of the other tubes after separating and amplifying them according to certain predefined criteria. For the modelling of the chemical reactions in the tubes, the splicing operation is used.) A. Atanasiu, E. Csuhaj-Varju, J. Dassow, C. Ferretti, F. Freund, R. Freund, P. Frisco, G. Georgescu, T. Hinze, L. Kari, M. Margenstern, C. Martin-Vide, G. Mauri, Gh. Păun, L. Priese, Y. Rogozhin, G. Rozenberg, A. Salomaa, M. Sturm, G. Vaszil, S. Verlan, F. Wachtler, C. Zandron.
References
Following Scholarpedia instructions, the list below collects only major contributions to the area since the literature on splicing is very extensive. Other papers, especially those published before 2005, are enumerated in the Hyperbibliography (Section "See also"). Other resources can be retrieved by looking through the bibliography of the papers below.
- J. Berstel, L. Boasson, I. Fagnot (2012), Splicing systems and Chomsky hierarchy, Theor. Comp. Sci. 436, pp.2-22 .
- P. Bonizzoni, C. De Felice, R. Zizza (2005), The structure of reflexive regular splicing languages via Schutzenberger constants, Theor. Comp. Sci. 334, pp.71-98.
- P. Bonizzoni, C. De Felice, R. Zizza (2010), A characterization of (regular) circular languages generated by monotone complete splicing systems, Theor. Comp. Sci., DOI information: 10.1016/j.tcs.2010.05.013 .
- P. Bonizzoni, C. De Felice, G. Fici, R. Zizza (2010), On the regularity of circular splicing languages: a survey and new developments, Natural Computing 9:2, pp. 397-420.
- P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza (2004), Circular splicing and regularity, Theor. Inform. Appl. 38, pp. 189–228.
- P. Bonizzoni, C. De Felice, G. Mauri, R. Zizza (2005), On the power of circular splicing, Discr. Appl. Math. 150, pp. 51–66.
- P. Bonizzoni, N. Jonoska: Regular Splicing Languages Must Have a Constant (2011), Developments in Language Theory 2011, Lecture Notes in Computer Science 6795, pp. 82-92.
- R. Ceterchi, KG Subramanian (2003), Simple circular splicing systems, Romanian J. Inf. Sci. Technol. 6, pp. 121–134.
- K. Culik, T. Harju (1991), Splicing semigroups of dominoes and DNA, Discr. Appl. Math. 31, pp. 261-277.
- K.L. Denninghoff, R.W. Gatterdam (1989), On the undecidability of splicing systems, Intern. J. Computer Math. 27, pp. 133-145.
- R. Freund, L. Kari, G. Păun (1999), DNA Computing based on splicing: the existence of universal computers, Theory Comput. Syst. 32(1), pp. 69-112.
- R. W. Gatterdam (1992), Algorithms for splicing systems, SIAM J. of Comp. 21:3, pp. 507-520.
- E. Goode, D. Pixton (2001), Semi-simple splicing systems, in: Where Mathematics, Computer Science, Linguistics and Biology meet (C. Martin-Vide and V. Mitrana, Eds.), pp. 343-357, Kluwer Academic Publ., Dordrecht.
- T.Head (1987), Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviours, Bull. of Mathematical Biology 49, pp. 737-759.
- T. Head (1992), Splicing schemes and DNA, in Lindenmayer Systems: Impacts on Theoretical Computer Science and Developmental Biology, Springer-Verlag, Berlin, pp. 371-383.
- T. Head (1998), Splicing languages generated with one sided context, in Computing with Bio-molecules: Theory and Experiments (Gh. Păun, Ed.), Springer-Verlag, Singapore.
- T. Head (1998), Splicing representations of strictly locally testable languages, Discr. Appl. Math. 87, pp. 139-147.
- T. Head, D. Pixton (2006), Splicing and regularity, in Recent advances in Formal Languages and Applications by Z. Esik, C. M. Vide, V. Mitrana Eds., pp. 119-147, Springer.
- S.M. Kim (1997), Computational modeling for genetic splicing systems, SIAM J. of Comp. 26, pp. 1284-1309.
- A. Mateescu, G. Păun, G. Rozenberg, A. Salomaa (1998), Simple splicing systems, Discr. Appl. Math. 84, pp. 145-163.
- G. Păun (1996), On the splicing operation, Discr. Appl. Math. 70, pp. 57-79.
- G. Păun (1996), Regular extended H systems are computationally universal, Journal of Aut., Lang., Combinatorics. 1:1, pp. 27-36.
- G. Păun, G. Rozenberg, A. Salomaa (1996), Computing by splicing, Theor. Comp. Sci. 168:2, pp. 321-336.
- D. Pixton (1996), Regularity of splicing languages, Discr. Appl. Math. 69, pp. 101-124.
- D. Pixton (2000), Splicing in abstract families of languages, Theor. Comp. Sci. 234, pp. 135-166.
- R. Siromoney, K.G. Subramanian, A. Dare (1992), Circular DNA and Splicing Systems, in Proc. of ICPIA, Lecture Notes in Computer Science 654, pp. 260-273, Springer-Verlag.
- T. Yokomori, S. Kobayaski, C. Ferretti (1999), On the power of circular splicing systems and DNA computability, Proc. of IEEE Int. Conf. Evol. Comp. ICEC99.
- C. Zandron, C. Ferretti, G. Mauri (1997): A Reduced Distributed Splicing System for RE Languages, Lecture Notes in Computer Science 1218, pp. 319-329.
Internal references
- Olaf Sporns (2007) Complexity. Scholarpedia, 2(10):1623.
- Gheorghe Păun (2010) Membrane Computing. Scholarpedia, 5(1):9259.
- Paul M.B. Vitanyi (2009) Turing machine. Scholarpedia, 4(3):6240.
Recommended reading
- N. Jonoska, G. Păun, and G. Rozenberg, editors (2004), Aspects of Molecular Computing: Essays Dedicated to Tom Head, on the Occasion of His 70th Birthday, Lecture Notes in Computer Science 2950, Springer Verlag, Berlin, Heidelberg, New York.
- T. Head (2006), From biopolymers to formal language theory, in Automata Theory with Modern Applications by James A. Anderson, Ch. 7, Cambridge U.Press, 231-234.
- T. Head (2011), How the structure of DNA molecules provides tools for computation, in Biology, Computation and Linguistics - New Interdisciplinary Paradigms, Ed. by G. Bel-Enguix, V. Dahl & M.D. Jimenez-Lopez, IOS Press, Amsterdam.
- T. Head (2012), Restriction enzymes in language generation and plasmid computing, in Biomolecular Computing: From Logic Systems to Smart Sensors and Actuators, Ed. by Evgeny Katz, Wiley-VCH Verlag GmbH & Co., 245-263.
- T. Head, G. Păun, D. Pixton (1996), Language theory and molecular genetics: generative mechanisms suggested by DNA recombination, in: (G. Rozenberg, A. Salomaa, Eds.) Handbook of Formal Languages, Vol. 2, 295-360, Springer Verlag.
- G. Păun, G. Rozenberg, A. Salomaa (1998), DNA computing, New Computing Paradigms, Springer-Verlag, Berlin.
- M. Simon (2005), Emergent Computation: emphasizing bioinformatics, Springer (Biological and medical physics, biomedical engineering).
See also
- A Bibliography of Molecular Computation and Splicing Systems, created and maintained until 2005/03/22 by Pierluigi Frisco, is http://www.macs.hw.ac.uk/~pier/dna.html