6.3. Syntax Analysis.   

The syntax analyser has to determine if a given "sentence" in the source language

if B then S;
is legitimate in a given language.

Language is described by a finite number of
rules = SYNTAX of language
but can generate an infinite number of sentences.
The syntax analyser tests the sentences against the rules.

In English, sentences can be ambiguous:

The hairy-nosed wombat eats roots and leaves.

A computer program must be unambiguous if it is to produce correct, reproducible results. Therefore our syntax rules must be strict and, like our MU puzzle, we must not try to break away from these rules.

Note that in the above ambiguous sentence the use of

punctuation can remove the ambiguity. In the same way
punctuation (, ; : BEGIN END) can be used to remove ambiguities
from computer programs.
Many compilers are smart enough to be
able to supply missing semicolons, just as we are able to supply
meaning to slightly ambiguous sentences.
We can do that by noting the context of the ambiguity.

The Backus-Naur Form is a useful notation for expressing these rules.

<sentence>::=<subject><predicate>
< > Non terminal symbols
::= is defined by
> < followed by
| choice of symbols
<subject>::= <article><adjective><noun>
<predicate>::= <verb><object>
<object>::= <article><adjective><noun>
<article>::= "the"|"a" )
<adjective>::= "blue"|"hairy-nosed"|"frightened" } terminal
<noun>::= "wombat"|"fish"|"house" } symbols
<verb>::= "flew"|"hunted" )

A syntactically correct sentence is:

The hairy-nosed wombat hunted a frightened fish

A syntactically correct but semantically dubious sentence is:

A hairy-nosed fish hunted the frightened wombat

An alternative notation used by Wirth is

(1) Non terminal symbols i.e. <_> are written as
identifiers without angle brackets but in CAPITALS.
(2) Terminal symbols are in lower case letters and
their values are in quotation marks if necessary.
(3) Braces { } are used to denote zero or more occurrences.
(4) Square brackets [] are used to denote 0 or 1 occurrence.

In Java:

<statement>::=[<label>":"]<unlabelled statement>
or
<statement>::=<unlabelled statement>|<label>":"<unlabelled statement>

A language L has the following parameters

(1) A vocabulary T of terminal symbols


e.g. IF, THEN, ELSE, BEGIN, END, REPEAT

(2) A set N of non-terminal symbols ( grammar )

e.g. <expression> <factor> <term> <block>

(3) A set P of productions (syntax)

e.g. <expression>::=
<simple expr><relational operator><simple expr>
| <simple expr>

(4) A symbol S, from set of non-terminals, ( START symbol )

e.g. <program>::=

We can generate the language L (= set of sequences of symbols) by applying the syntax rules to the starting symbol S.


Starting symbol in MU puzzle was MI. We can generate one

of set of possible states by adding U, MI -> MIU,
then by doubling MIU -> MIUIU

We cannot directly go from our start symbol S to some sequence Sn unless we go through the preceding S1 ,S2 ...Sn sequences.

(1) (3)
MI --> MIU --> MIUIU etc.