Why are parsers important in Haskell

Parsing in the functional programming language Haskell

Transcript

1 Wedel University of Applied Sciences - University of Applied Sciences Wedel Feldstrasse D Wedel Parsing in the functional programming language Haskell Author: Jens Kulenkamp December 10, 2003 This documentation was created as part of the computer science seminar. Student, matriculation: Jens Kulenkamp, ​​[] Department: Computer Engineering Supervisor: Prof. Dr. Schmidt submission date: December 2003

2 Table of contents Table of contents 2 1. A brief overview 4 2. Basics Parser grammar Ambiguous grammar Unambiguous grammar Left recursion Functional parser in Haskell Motivation The data type Parse Elementary parser symbol spot dig succeed parser combiners The alternative The sequencing The converter many option A parser for arithmetic expressions Die Idea The grammar The parser

3 4. Parsec What is Parsec? Examples with Parsec The sequence and selection Parser with semantics The selection A pocket calculator with Parsec Download 22 Bibliography 23 3

4 1. A brief overview This elaboration was created during the IT seminar in the 2003/2004 winter semester. In the following chapter, a few basics are first dealt with. It explains what a parser is and goes into the definition of grammars. To make the grammar definition a little more understandable, two examples follow, which are then discussed and evaluated. These examples are followed by the complete development of a non-deterministic parser in the functional programming language Haskell. First, a general and reusable data type is developed for this purpose. On the basis of this type, elementary parsers and later parser combinators are developed. Following this development phase, a pocket calculator for arithmetic expressions will be developed on the basis of the parser. 4th

5 2. Basics 2.1. Parser A parser is a program that receives a text or token as input. This input is parsed and converted into a logical structure. The text can be, for example, a programming language, an XML file, but also an arithmetic expression or database command. The output of a parser is usually a syntax tree or a program tree. This syntax tree is created on the basis of the given grammar, whereby this must include the language scope or the derivation rules of the language to be analyzed. If the input does not correspond to the given syntax, then corresponding error messages must be generated. Grammar To develop a parser, a grammar must first be defined so that the parser can recognize and structure the text to be read. A context-free grammar is a quadruple G = (T, N, P, S) where: T: is an alphabet of terminal symbols. N: is an alphabet of nonterminal symbols and describes structured units. P: is a lot of production rules. A production rule consists of a pair of non-terminal and a sequence of non-terminal and / or terminal symbols. Production rules are also the replacement rules. S: is the start symbol. The set of all terminal symbols T which can be derived from the start symbol S is the language G generated. Ambiguous grammar A grammar G1 for arithmetic expressions is given below. G1 = (T, N, P, S) T = {num, + ,,, /} N = {E} P = {E} E = {E> num E + EEEEEE / E} S = E Here is E for expression. 5

6 The following expression is considered as an example. The parser will read this expression from left to right, where, based on the start symbol, E can be derived as follows: E> E + E> num + E> num + num. The associated syntax tree can be created on the basis of this derivation sequence and represented as follows. Syntax tree 1 The syntax tree shown is initially constructed from the top down, starting from the root. This principle is called topdown analysis. Furthermore, it can be seen that the non-terminal symbols are on the nodes and the terminal symbols are on the leaves. As a further example, consider the terminal word 4 * 5-8 and also create the derivation sequence and the associated syntax tree. E> EE> num E> num EE> num num E> num num num E> EE> E num> EE num> num E num> num num num Syntax tree 2 3 Here you can see that there are several syntax trees for the same expression can be created. These syntax trees are created through the application of different production rules. The problem here lies in the ambiguous semantics. It cannot be that one expression can have different meanings. The result must be clearly defined by the grammar. Ambiguity should be avoided when developing a grammar. A possible solution approach is, for example, the definition of priorities, associativity or a naturally structured grammar, for example clear grammar through brackets. G2 = (T, N, P, S) T = {num, + ,,, /} N = {E, D, F} P = {E> E + DEDD, D> DFD / FF, F> num} S = EF stands for f actor. The additional symbol F defines left-hand associativity and excludes ambiguity. If the corresponding syntax tree is considered for the same expression 4 * 5-8, it can be seen that the child node D * F must be calculated before the expression E - D is calculated. It should also be noted that only a syntax tree can be created for the expression. Syntax tree 4 The left associativity has been determined by the rule D> D F D / F F. If you first exchange the expression with the following rule from D> F D F / D F, a right-hand associativity can be implemented. 6th

7 Left recursion With the given grammar G2 from the section, the production of the form A> Aalpha can lead to problems. What is meant here is that the non-terminal symbol A appears again on the far left on the right-hand side of the production (recursion). If you consider a section of the known rules E> E + D D, and consider a possible implementation, it could look like this: void E () {E (); Symbol(); // skip symbol +! T (); } When considering the given implementation, E () will be called recursively, which will irretrievably lead to a stack overflow. However, this implementation error can be corrected by changing the grammar E> D + E D. This change turned a left recursion into a right recursion. 7th

8 3. Functional parsers in Haskell 3.1. Motivation In this chapter different elementary parser and parser combiners are developed. For communication between the individual parsers, however, it is imperative to develop a Parse data type. This data type should be as general and reusable as possible, so that it does not matter whether strings or tokens are processed, for example. In order to achieve this goal, the data type is first gradually refined and abstracted from the most general case.The Parse data type If you first imagine a simplified parser as a black box and only consider the input and output of a parser, a parser becomes a string as input and receive a program tree as output. In this case, the string represents the program to be parsed. The output is the developed program tree, which reflects the structure of the input. In Haskell, the data type could be described as follows: type Parse = String -> Tree A parser consists of a large number of many elementary parsers that are in turn connected by combiners, each of these combinations processing only one partial structure of the entire system. The problem here is the intermediate storage of the partial results (the subtrees) and the symbols still to be processed. There is no way to save the data in a global data structure (we do not use monads here). Therefore, the determined results must be temporarily stored in the function result. A refined version of the implementation can look like this: type Parse = String -> (Tree, String) If you look again at the aspect of the mutual calling of the individual parsers, the case will arise that a parser delivers no, one or even several results. In the previous consideration, only one result is provided. A string, for example, could allow several ways of interpretation or it could happen that no result is found. For this it is necessary to allow a list of results in Parse. One possibility is that the result is saved as a list of results. Each element of the list contains the subtree and the symbols still to be processed. If an empty list is returned, no result was obtained. type Parse = String -> [(Tree, String)] 8

9 If only the first result of the list is required, only the head of the list is to be used. There will be no loss of efficiency due to lazy evaluation. The unused results are only calculated when used. The data type Parse will save the generated syntax tree in the tree and the symbols to be processed in the string. The structure of the syntax tree differs from application to application. As an example, a tree structure for arithmetic expressions or in another case for an XML data structure is required. Another abstraction process must be carried out so that the tree structure is flexible and independent of the application. type Parse a = String -> [(a, String)] So far the parser has been limited to strings. Strings are implemented in Haskell as a list of characters. However, it is conceivable that at a later point in time there will be tokens from a lexer as input instead of a character string. In this case the type has to be modified one last time. type Parse ba = [b] -> [(a, [b])] or something more readable like this: type Parse symbol result = [symbol] -> [(result, [symbol])] This data type is used in this chapter and Elementary Parser symbol is used in the following examples. After the Parse data type has been developed, the first small parser that expects the symbol a can be developed. If the first symbol does not correspond to the a, an empty list is returned. > symbola :: Parse Char Char> symbola [] = []> symbola (x: xs)> x == a = [(a, xs)]> otherwise = [] Called up in Hugs, the result looks like this. In the first example, the symbol a was recognized and stored as a result. In the second call, the first symbol corresponds to the character b, accordingly the empty list was returned.? symbola "abc" [(a, "bc")]? symbola "bcd" [] 9

10 The function is as expected. This should be generalized so that a separate parser does not have to be implemented for each symbol used. The parser symbol is shown for this. > symbol :: Eq a => a -> Parse aa> symbol s [] = []> symbol s (x: xs)> s == x = [(x, xs)]> otherwise = [] At the parser symbol the symbol to be checked is passed as a parameter. In addition, a generalization of the symbol type has been made. It no longer has to be a character as in the previous example. It only has to be ensured that there is an instance of the class Equality for the type spot Another important parse function is spot. The parser spot expects a function as a parameter. This is applied to the header of the symbol list and accordingly provides an empty list or a list of possible results. > spot :: (a-> Bool) -> Parse aa> spot f [] = []> spot f (x: xs)> fx = [(x, xs)]> otherwise = [] By implementing spot it is now possible to write a much more elegant version of symbol. > symbol :: Eq a => a -> Parse a a> symbol t = spot (== t)? symbol a "abc" [(a, "bc")] dig This parser expects a character string and checks whether the first symbol is a digit. > dig :: Parse Char Char> input = (spot isdigit) input succeed A simple but not to be despised parser is succeed. succeed always returns a parse with the given input. The parameters are not processed or changed in any way. 10

11> succeed :: b -> Parse a b> succeed val inp = [(val, inp)] This function can be used as an epsilon function, for example. The function always returns an empty program tree and the transferred symbols. This function will be used later with the parser combinators Parser combinators In the previous chapter 3.3 about elementary parsers only parsers for terminal symbols were developed. However, parsers that process terminal symbols and nonterminal symbols are much more interesting. So-called parser combiners are required for this. Some combiners are discussed below. For example a parser for sequencing two parsers (first P1 then P2) or an alternative (P1 or P2) The alternative The alternative processes two transferred parsers and provides a list of possible results as the function result. > infixr 4 <>> (<>) :: Parse from -> Parse from -> Parse from> (p1 <> p2) input = (p1 input) ++ (p2 input) The alternative is here as a <> operator with the 4th priority level and a right associativity implemented. The reason for the implementation as an operator is the better readability of the source code. The parser could of course also have been implemented as a normal function. First of all, consider the function implementation. In the function body, only the two parsers p1 and p2 are applied to the input. The results of the two parsers are linked by the list concatenation operator ++. For example:? (dig <> symbol a) "abc" - [] ++ [(a, "bc")] [(a, "bc")] In the example, the parser dig fails but the parser symbol a returns a result .? (dig <> symbol a) "123" - [(1, "23")] ++ [] [(1, "23")]? (symbol a <> symbol b) "123" - [] ++ [] [] 11

12 Sequencing Another important parser is sequencing. This parser first applies the parser p1 to the input and the rest is applied to p2. If p1 and p2 produce a result, the sequence result is a tuple of results. The result of the sequence is a parser. This allows the final result to be processed by another combiner. > infixr 6 <*>> (<*>) :: Parse ab -> Parse ac -> Parse a (b, c)> (p1 <*> p2) input = [((x1, x2), xs2)> (x1, xs1) <- p1 input>, (x2, xs2) <- p2 xs1>] The following examples illustrate how the sequence works. (symbol a <*> symbol b) "123" []? (symbol a <*> symbol b) "abc" [((a, b), "c")]? (symbol a <*> symbol b <*> symbol c) "abcd" [((a, (b, c)), "d")] In the first example, the result of the first parser symbol a is an empty list. In the second example, both parsers are successful. The last example is considered more closely. The result is a tree with tuples of type [((Char, (Char, Char)), [Char])].? (symbol a <*> (symbol b <*> symbol c)) "abcd" ~ [((a, (b, c)), "d")] = [((x1, x2), xs2) (x1 , xs1) <- [(a, "bcd")], (x2, xs2) <- [(b, c), "d")]]? (symbol b <*> symbol c) "bcd" ~ [(b, c), "d")] = [((x1, x2), xs2) (x1, xs1) <- [(b, "cd" )], (x2, xs2) <- [(c, "d")]] The converter One of the most important parser combiners is the converter. The converter is for applying a function f to a parser p. This makes it possible, for example, to convert a digit as a character into an integer. 12

13> infixr 5> :: Parse ab -> (b -> c) -> Parse ac> (pf) input = [(fx, xs)> (x, xs) <- p input>] First, a Parser p and a function f expected. The parser p is applied to the input, the function f is applied to the result x. The result is a new parser of the type Parse a c. The type c is the result type of the function f.> Digit :: Parse Char Int> digit = spot isdigit f> where f c = ord c - ord 0? digit "123" [(1, "23")] As mentioned at the beginning, a digit can be converted into an integer using the operator. The digit function is one possible use of the converter. Many numbers, for example, in rare cases only consist of one digit. A parser many is required so that this list of objects can be converted to a symbol. > many :: Parse ab -> Parse a [b]> many p = (p <*> many p list)> <>> (succeed [])> where list (x, xs) = x: xs There are two Different function results: The list can be empty (succeed []). The list is not empty and contains a symbol followed by a list of objects (p <*> many p). So that the parser many processes the list, it calls itself recursively. The result still has to be converted from type (x, xs) to type (x: xs). The converter can be used for this. The converter converts the list from curried to uncurried. The following examples clearly show that every valid combination is saved in the list. The parser many therefore delivers more than one result. This must be observed during processing.? many (spot isdigit) "123abc" [("123", "abc"), ("12", "3abc"), ("1", "23abc"), ("", "123abc")]? many (spot isalpha) "test" [("test", ""), ("tes", "t"), ("te", "st"), ("t", "est"), (" "," test ")] 13

14 option Another parser is the option. For example, a number can be negative or positive.The negative number is indicated by the preceding -. > option :: Parse a b -> Parse a [b]> option p input = ((p (: []))> <>> (succeed [])) input The option applies two parsers to the input. On the one hand the parser p and the converter On the other hand the parser succeed []. With this alternative link, more than one result is delivered if successful. See the following example :? option (symbol -) "-123" [("-", "123"), ("", "- 123")]? option (symbol -) "123" [("", "123")] 3.5. A parser for arithmetic expressions The idea Now that various elementary parser and parser combiners have been developed, these should now be used and used. A pocket calculator is to be developed for this purpose. The following example is a recursive decent parser. The calculator is designed to calculate integer operations. Numbers: 12 and -88, negative numbers are represented by - Operators: +, -, *, /,%% stands for the modulo operation Only whole-number division spaces are not allowed Every expression must be in parentheses! For example, an expression might look like this: (23- (20/2)). This means that a naturally structured grammar is used. The use of brackets for each expression avoids ambiguities. The grammar For clarity, the naturally structured grammar is given in a BN form. Expr :: = const (Expr Op Expr) Op :: = + - * /% 14

15 In Hakell the grammar is implemented as a simple arithmetic data type. An expression can consist of a natural number Lit or an expression Bin. But in turn, Bin is an operator and two numbers. data Expr = Lit Int Bin Op Expr Expr data Op = Add Sub Mul Div Mod The expression (14 + -2) is then represented as follows Bin Add Lit 14 Lit (-2). A somewhat more complicated expression is for example (120 * (20/2)) and is represented as follows: Bin Mul Lit 120 (Bin Div Lit 20 Lit 2) The parser After the grammar has been described by a simple arithmetic data type in Haskell, only the corresponding parsers still need to be adapted to the grammar. > parser :: Parse Char Expr> parser = litparse <> opexprparse The parser describes an expression. The expression consists of a literal parser or an opexpr parser. > litparse :: Parse Char Expr> litparse> = (> option (symbol -) <*> many1 (spot isdigit)>) charlisttoexpr.join> where join = uncurry (++)? litparse "14" [(Lit 14, ""), (Lit 1, "4")]? litparse "-4" [(Lit (-4), "")] The litparse parser creates the possible literals from the transferred symbol list. > opexprparse :: Parse Char Expr> opexprparse = (symbol (<*>> parser <*>> spot isop <*>> parser <*>> symbol)>) makeexpr> where> makeexpr :: (a, (expr, (char, (expr, b)))) -> Expr> makeexpr (_, (e1, (bop, (e2, _)))) = Op (chartoop bop) e1 e2 The parser opexprparse is used to build the expressions recursively . Examples: 15

16? parser "(14 + -2)" [(Bin Add (Lit 14) (Lit (-2)), "")]? parser "(14-2) + a" [(Bin Sub (Lit 14) (Lit 2), "+ a")] As the last example shows, no errors are caught yet. In a next development step, it would have to be checked whether the set of symbols still to be checked is empty. Then the last mistake would also be discovered. 16

17 4. Parsec There are a number of parsers for Haskell. Only the bottom-up parser Happy and the top-down parser Parsec should be mentioned here. The following is a brief introduction to Parsec. What is Parsec? Parsec is a powerful parser library. This is a top-down parser using monads. In general, emphparsec is operated as an LL [1] parser. Backtracking is also possible and is dealt with in detail in the chapter. Parsec enables the output of meaningful error messages in the event of syntactic errors. Furthermore, with Parsec a lexical analysis and parsing can be realized in one. It is not necessary, as is often the case, to use two different tools (e.g. Lex and Yacc). In the case of further interest, the interested reader is referred to the reference manual from Parsec. A reference to the website can be found in the appendix Examples with Parsec Before Parsec can be used in a module, the library must be made known in the Haskell module. > module Main where> import Parsec After Parsec has been imported, the first simple examples are possible. The many function is already known from the chapter. This function is also implemented in Parsec. The following parser simple1 expects a list of letters. The letter function used is predefined in hugs. > - A parser for identifier> simple1 :: Parser [Char]> simple1 = many letter The functionality is illustrated by the following call. The function parsetest is from Parsec and is used for simple testing of parsers. parsetest simple1 "abc123" "abc" 17

18 The sequence and selection As in chapter 3.4 on parser combiners, it became clear how important one operator is for sequences and one for selection. By using monads in parsecs, no extra operator needs to be created for the sequence. The monad operator do is used for sequences. The following parser openclos0 expects a pair of brackets. > openclose0 :: Parser Char> openclose0 = do {char (>; char)>} However, a pair of brackets is not particularly exciting. A far more exciting version of a bracket pair parser might look like this. openclose1 :: Parser () openclose1 = do {char (; openclose1; char); openclose1} <> return () With its own recursive call, openclose1 expects a sequence of pairs of brackets. The <> operator is the choice. If no opening bracket is found, return () is called and the recursion ends. Parser with semantics A parser similar to openclose1, only that it calculates the number of pairs of curly brackets. The result of the recursive calls is saved in the variables m and n (attention: single assignment). The alternative <> is used to abort the recursion. If no opening bracket {is found, return 0 is executed. > noofbrace = do {char {>; m <- noofbrace>; char}>; n <- noofbrace>; return (1 + m + n); >}> <> return 0? parsetest noofbrace "{} {}" 2? parsetest noofbrace "{{}} {}" 3 18

19 The selection As mentioned in the introduction to Chapter 4.1, Parsec is an LL [1] parser. But what does LL [1] mean? In the top-down analysis, a series of derivation steps is sought and carried out starting from the root. Without the use of backtracking, it must be possible to uniquely select every derivation sequence in a grammar. In the following, the problem of backtracking will be dealt with. First of all, the following excerpt from a grammar is known. Terminal symbols are capitalized here. The nonterminal symbols are written in lower case. cond :: = IF boolexpr THEN stmt FI IF boolexpr THEN stmt ELSE stmt FI In addition, the following symbol sequence (statement) must be processed: IF (a> b) THEN RETURN a ELSE RETURN b FI The following is considered as the given statement would be top-down processed. A symbol pointer points to the next symbol IF to be processed. A parser pointer is just before the production rule cond. When processing the production rules, the first alternative IF boolexpr THEN stmt FI is initially incorrectly selected. The terminal symbols are compared and syntax trees for the non-terminal symbols boolexpr and stmt are built up step by step. However, a final comparison test with FI from the grammar and the ELSE from the symbol sequence fails. The subtree created must now be discarded and the pointer to the symbol sequence must be set back to the beginning. The parser pointer will be set to the second alternative of cond. Now the second and correct sequence of derivation can be processed. The terminal symbols are compared accordingly and subtrees are created for boolexpr and stmt. With this procedure and the given grammar, the problem is that, considering the first symbol IF, no statement can be made as to which of the two derivation rules should be used. The same problem should be dealt with in Parsec with a clear grammar. A condition consists of three symbols of the opening bracket, the a or b and a closing bracket. Given a symbol sequence of (b), no statement can be made as to whether the first or second rule should be evaluated. cond :: = (a) (b) A possible implementation with parsec could look like this:> or0 = string "(a)"> <>> string "(b)" string (a) is an independent parser here Symbol sequence (a) expected. The alternative <> only tries the second parser string (b) if the first parser has not processed any symbols. 19th

20? parsetest or0 "(a)" "(a)"? parsetest or0 "(b)" parse error at (line 1, column 1): unexpected "b" expecting "(a)" The first example runs as expected. The second example ends with an error message. The parser string (a) checks and compares symbol by symbol and expects an a instead of the b. The second alternative parser string (b) is not checked because the first has already processed symbols. The second alternative is therefore never carried out. An improved version of or0 is shown as follows:> or1 = do {char (>; char a <> char b>; char)>; return "ok">}? parsetest or1 "(a)" "ok"? parsetest or1 "(b)" "ok" Function and operation of or1 is as expected. The first parser checks the first symbol, the opening bracket. The second parser is either an a or b followed by the closing bracket. For this indirect look ahead (a) or (b) there is an extra parser try in Parsec. try has the same function as the technique used above in the parser or1. To clarify, the same example is shown using try. > or2 = try (string "(a)")> <>> try (string "(b)") It should be noted at this point that backtracking can be implemented in a parser using try. It must be clear, however, that this costs time and performance. A pocket calculator with parsec For the sake of completeness, a small pocket calculator with parsec is also shown in this chapter. However, the development and functional description are not given. module MyPCalc1 where import Parsec import ParsecExpr expr :: Parser Integer 20

21 expr = buildexpressionparser table factor "Expression" table :: OperatorTable Char () Integer table = [[binary "*" (*) AssocLeft, binary "/" div AssocLeft], [binary "+" (+) AssocLeft , binary "-" (-) AssocLeft]] where binary sf assoc = Infix (do {string s; return f}) assoc factor = do {char (; x <- expr; char); return x} <> number "simple expression" number :: Parser Integer number = do {ds <- many1 digit; return (read ds)} "number" Example calls :? parsetest expr "2 + 5 * 2" - * has higher priority 12? parsetest expr "(6 + 2) * 3" 12? parsetest expr "48/2/2" - / is left associative 12 21

22 5. Download This documentation is available in the following formats: Postscript (doku.ps) Best quality for printing. Acrobat (doku.pdf) Good quality for online viewing, with linked table of contents All examples used in this documentation can be downloaded as: tar-archiv (exampleparser.tar.gz) The transparencies used for my lecture are available here: Acrobat (PDF) For projectors and Acrobat readers in full screen mode. 22nd

23 Bibliography [Bird] [Erwig] [Fokker] [Leijen] [Thompson] [Wirth] R. Bird: Introduction to Functional Programming using Haskell, Prentice Hall, ISBN, 1998 G. Erwig: Translator construction, Springer Verlag, ISBN, 1999 J Fokker: Functional Parsers. Lecture Notes of the Baastad Spring school on Functional Programming, jeroen / article / parsers / parsers.ps, May 1995 D. Leijen: Parsec, a fast combinator parser, daan / parsec.html, October 4, 2001 S. Thompson: Hakell - The Craft of Functional Programming, Springer Verlag, ISBN, 1999 N. Wirth: Compilerbau, Teubner study books, ISBN,