# User:Eugene M. Izhikevich/Proposed/Wirth syntax notation

## Introduction

The use of a formal syntax was first introduced by Peter Naur for the definition of the language Algol in 1960. It was the first time that the syntax of a language was defined precisely. A rigorously defined syntax allows the design of compilers that rely on parsing programs, that is, on decomposing programs into their parts according to the formally defined structure of the language. The idea of a formal definition of syntax was due to N. Chomsky, a mathematical linguist. It was not successful for spoken languages, but became a cornerstone for programming languages.

## BNF and Algol 60

The basic idea is that a language consists of an infinite set of well-formed sentences, well-formed according to the rules of the syntax, and consisting of words from a finite vocabulary. Hence, a syntax consists of the rules, how to construct correct sentences, and, vice-versa, how to parse them, and thereby, if their semantics (meaning) is defined, to understand them. Indeed, the goal of parsing is not merely to decide whether or not a sequence of words is correctly constructed, but more so to grasp its meaning. This holds for humans as well as computers.

We may explain this on hand of two constructs of Algol, the identifier and the integer. The former is any sequence of letters and digits, where the first character is a letter. The latter is a non-empty sequence of digits. These rules are described in the definition of Algol as follows:

<integer> ::= <unsigned integer> | + <unsigned integer> | - <unsigned integer>
<unsigned integer> ::= <digit> | <unsigned integer> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>


The entities in angular brackets denote syntactic classes, the symbol ::= marks a definition of the preceding class, and the bar means “or”. This notation was introduced by Naur, and it is called BNF for Backus-Naur-Form. (John Backus was the author of Fortran and had introduced the ideas of N. Chomsky to the committee designing Algol.) The somewhat clumsy symbol for "equals" was chosen with due respect for his typewriter’s capabilities, as Naur later admitted.

As an aside, we note that thanks to the formal notation potential ambiguities in a language can be detected. This property is obviously of vital importance for programming languages. Indeed, Algol did contain a number of ambiguities. They were removed in a revision issued in 1962.

## The spread of notations, and EBNF

In particular the publication of Algol with its precise underpinning spawned a flurry of activities in language design and syntax analysis. Unfortunately, every author used a slightly different notation for his syntax. This called for an effort to propose a standard notation. In 1977 this author published such a suggestion. It also included a small number of notational extensions and improvements, and was called Extended BNF, or simply EBNF. Its merits are:

1. The new notation distinguishes clearly between symbols belonging to the describing notation (EBNF) and symbols belonging to the described language.
2. It does not exclude symbols used in EBNF from their use in the described language.
3. It contains an explicit construct denoting iteration, and it thereby avoids the heavy use of recursion for expressing simple repetition.
4. It avoids the need for an explicit symbol denoting the empty string.
5. It is based on the standard ASCII character set, and is therefore machine-readable.

Evidently, EBNF is itself a language, and therefore must have a syntax. If it is of any use, it should thus be capable of describing itself. Indeed, its description takes only 5 equations. For the sake of brevity, the classes identifier and string will not be defined here in specific detail. Terminal symbols are enclosed in quotes. The syntax of EBNF is:

syntax      = {production}.
definition  = identifier "=" expression "." .
expression  = term {"|" term}.
term        = factor {factor}.
factor      = identifier | string |
"(" expression ")" | "[" expression "]"| "{" expression "}".


Repetition is denoted by curly braces. For example, {a} stands for $$\varepsilon$$|a|aa|aaa|... Optionality is expressed by brackets. For instance [a] denotes $$\varepsilon$$|a. Parentheses merely serve for grouping. Thus (a|b)c stands for ac|bc.

The formulation of EBNF in terms of ASCII characters allows for simple programs for checking the formal correctness of a syntax definition, including the detection of ambiguities and other properties of a language.

EBNF has been used for defining the syntax of many languages, including Modula-2 and Oberon. It has also proved to be useful in the definition of simple communication protocols in an intuitively constructive way. A sequence of messages between partners is defined as a syntax, where the parts emitted by the various parties are written in different colors.

## References

• P. Naur. Revised Report on the Algorithmic Language ALGOL 60. Comm. ACM, 6 (Jan. 1963), pp. 1 – 17.
• N. Wirth. What can we do about the unnecessary diversity of notation for syntactic definitions? Comm. ACM 20, (Nov. 1977), 11.