Dealing with ambiguities due to indentation

Our syntax does not have ends (like the keyword ‘end’ in Pascal); and we find ourselves struggling trying to parse things like:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

data List a = Nil
            | Cons a (List a)

At the end of the first line, the parser is actually trying to look for another data constructor (possibly in a new line, like in the List). It sees the PADDING and (being look-ahead by 1 token) does not notices the KEYWORD_DATA that comes next; so it shifts the PADDING and enters a state where the KEYWORD_DATA is no longer valid.

This state of affairs makes our parser work for a single definition. But also allows visually unappealing constructions like:

data List a = Nil
| Cons a (List a)

Trying to parse indentation

The solution seems to be to issue a PADDING at the same level of indentation of the line before, an INDENT if the level of indentation increases, a DEDENT if the indentation level decreases, and a NEWLINE if more than one ‘n’ are found.

Look at this program:

fn = let a1 = aa1
         a2 = aa2
         a3 = expr where b1 = bb1
                         b2 = expr2 where c1 = cc1
                                          c2 = cc2
         a4 = aa4
     in a4

fn2 = fnn2

The current state of the parser (2018-09-10) breaks our intuitions and “captures” the a4 = ... as part of the where expression in b2 = expr2 where ....

We expect the equation a4 = aa4 to be part of the outer-most let expression.

In the program:

fn = let a1 = aa1
         a2 = aa2
         a3 = expr where b1 = bb1
                         b2 = expr2 where c1 = cc1
                                          c2 = cc2
                         a4 = aa4
     in a4

fn2 = fnn2

we expect, however, the a4 = aa4 to be part of the outer-most where expression.

These two examples demonstrate that the rule of issuing a DEDENT when the indentation level decreases is not sufficient to disambiguate both programs (they will have the same run of tokens).

If we remove where expressions from our syntax and rewrite the programs with let alone, the in keyword is enough to disambiguate.

Decision

Before using Lark

Note

This subsection is out-dated since version 0.3.0, where we introduced a parser based on Lark.

  1. Keep the where as it is. To disambiguate you’ll need parenthesis:

    fn = let a1 = aa1
             a2 = aa2
             a3 = (expr where b1 = bb1
                              b2 = expr2 where c1 = cc1
                                               c2 = cc2)
             a4 = aa4
         in a4
    
    fn2 = fnn2
    

    Indentation alone won’t do it.

    The alternative is to remove where from the syntax.

  2. Introduce ‘NEWLINE’ to divide definitions:

    1. `any chunk of more than one '\n'`:deleted:.
    2. any chunk with single ‘n’ but ending with a (possibly empty) run of spaces that set the indentation level back to the minimal indentation level (set by the first line(s) of the programs).

    A ‘NEWLINE’ is now emitted in the case explained in b). Doing otherwise would make the following program invalid because it uses new-lines to separate local definitions:

    class Eq a => Ord a where
         (<) :: a -> a -> Bool
         (<) a b = not (a >= b)
    
         (>) :: a -> a -> Bool
         (>) a b = not (a <= b)
    
         (<=) :: a -> a -> Bool
         (<=) a b = a < b `or` a == b
    
         (>=) :: a -> a -> Bool
         (>=) a b = a > b `or` a == b
    

    This change does not introduce any (new) ambiguity.

  3. A single definition (expressions, type expressions, and data types) cannot contain NEWLINE tokens. The places where a line break is allowed must be indented (but the amount of indentation is not meaningful).

  4. Definitions must be separated by NEWLINE tokens.