# parserlib
**Repository Path**: wangwangxingao/parserlib
## Basic Information
- **Project Name**: parserlib
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-11-22
- **Last Updated**: 2023-11-22
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Parserlib
A c++17 recursive-descent parser library that can parse left-recursive grammars.
## Table Of Contents
[Introduction](#introduction)
[Using the Library](#using-the-library)
[Writing a Grammar](#writing-a-grammar)
[Invoking a Parser](#invoking-a-parser)
[Non-Left Recursion](#non-left-recursion)
[Left Recursion](#left-recursion)
[Customizing a Parser](#customizing-a-parser)
[Simple Matches](#simple-matches)
[Tree Matches](#tree-matches)
[Resuming From Errors](#resuming-from-errors)
## Introduction
Parserlib allows writing of recursive-descent parsers in c++ using the language's operators in order to imitate Extended Backus-Naur Form (EBNF) syntax.
The library can handle left recursion.
Here is a Calculator grammar example:
```cpp
extern Rule<> add;
const auto val = +terminalRange('0', '9');
const auto num = val
| '(' >> add >> ')';
Rule<> mul = mul >> '*' >> num
| mul >> '/' >> num
| num;
Rule<> add = add >> '+' >> mul
| add >> '-' >> mul
| mul;
```
*(NOTE: Rule 'add' needs a forward reference because it is mentioned for the very first time in rule 'num'.)*
The above grammar is a direct translation of the following left-recursive EBNF grammar:
```
add = add + mul
| add - mul
| mul
mul = mul * num
| mul / num
| num
num = val
| ( add )
val = (0..9)+
```
## Using the Library
The library is available as headers only, since every class is templated.
In order to use it, you have to have the path to its root include folder in your project's include folder list.
Then, you have to either include the root header, like this:
```cpp
#include "parserlib.hpp"
```
Or use the various headers in the subfolder 'parserlib'.
All the code is included in the namespace `parserlib`:
```cpp
using namespace parserlib;
```
## Writing a Grammar
A grammar can be written as a series of parsing expressions, formed by operators and by functions that create parser objects.
### Terminals
The most basic parser is the `TerminalParser`, which is used to parse a terminal. In order to write a terminal expression, the following code must be written:
```cpp
terminal('x')
terminal("abc")
```
*The terminals in this library are by default of type `char`, but they can be customized to be anything.*
Other types of terminal parsers are:
```cpp
terminalRange('a', 'z') //parses all values between 'a' and 'z'.
terminalSet('+', '-') //parses '+' or '-'.
```
### Sequences
Terminals can be combined in sequences using the `operator >>`:
```cpp
const auto ab = terminal('a') >> terminal('b');
const auto abc = ab >> terminal('c');
```
In order to parse a sequence successfully, all members of that sequence shall parse successfully.
### Branches
Expressions can have branches:
```cpp
const auto this_or_that = terminal("this")
| terminal("that");
```
Branches are followed in top-to-bottom fashion.
If a branch fails to parse, then the next branch is selected, until a branch is found or no more branches exist to follow.
### Loops
- The `operator *` parses an expression 0 or more times.
- The `operator +' parses an expression 1 or more times.
```cpp
+(terminalRange('0', '9')) //parse a digit 1 or more times.
```
### Optionals
A parser can be made optional by using the `operator -`:
```cpp
-terminalSet('+', '-') >> terminalRange('0', '9') //parse a number; the sign is optional.
```
### Conditionals
- The `operator &` allows parsing an expression without consuming any tokens; it returns true if the parsing succeeds, false if it fails. It can be used to test a specific series of tokens before parsing.
- The `operator !` inverts the result of a parsing expression; it returns true if the expression returns false and vice versa.
```cpp
!terminalSet('=', '-') >> terminalRange('0', '9') //parse an integer without a sign.
```
### Matches
- The `operator ==` allows the assignment of a match id to a production; [the created match does not have any children](#simple-matches).
- The `operator >=` allows the assignment of a match id to a production; [the created match has children matches](#tree-matches).
```cpp
(-terminalSet('+', '-') >> terminalRange('0', '9')) == std::string("int")
```
## Invoking a Parser
In order to invoke a parser, the appropriate `ParseContext` instance must be created.
```cpp
//declare a grammar
const auto grammar = (-terminalSet('+', '-') >> terminalRange('0', '9')) == std::string("int");
//declare an input
std::string input = "123";
//declare a parse context over the input
ParseContext<> pc(input);
//parse
const bool ok = grammar(pc);
//iterate over recognized matches
for(const auto& match : pc.matches()) {
if (match.id() == "int") {
const auto parsedString = match.content();
//process int
}
}
```
## Non-left Recursion
Rules allow the writing of recursive grammars.
```cpp
//whitespace
const auto whitespace = terminal(' ');
//integer
const auto integer = terminalRange('0', '9');
//forward declaration of recursive rule
extern Rule<> values;
//value; it is recursive
const auto value = integer
| terminal('(') >> values >> terminal(')');
//rule
Rule<> values = value >> whitespace >> values;
```
## Left Recursion
The library can parse left recursive grammars.
```cpp
//the recursive rule
extern Rule<> expression;
//and integer is a series of digits
const auto integer = +terminalRange('0', '9');
//a value is either an integer or a parenthesized expression
Rule<> value = integer
| '(' >> expression >> ')';
//multiplication
Rule<> mul = mul >> '*' >> value
| mul >> '/' >> value
| value;
//addition
Rule<> add = add >> '+' >> mul
| add >> '-' >> mul
| mul;
//the root rule
Rule<> expression = add;
```
## Customizing a Parser
The class ParseContext is a template and has the following signature:
```cpp
template class ParseContext;
```
It allows customizing the source type, the match id type and the source position type.
### Customizing the source type
By default, a ParseContext instance will use an `std::string` as an input source. But this can be changed to accomodate any STL like container.
For example, the source can be a static array of integers:
```cpp
ParseContext> pc(input);
```
### Customizing the match id type
The default match id type is `std::string`, but usually it shall be an integer or an enumeration. It's also good for performance reasons to replace `std::string` with a numeric value, since match ids are created and destroyed as parsing is performed.
Example:
```cpp
ParseContext pc(input);
```
### Customizing character processing
The parse context's parameter named '`SourcePositionType' allows the customization of character processing:
- customizing comparison of elements, for example in order to implement case insensitive parsing.
- providing extra information regarding the source, for example line and oclumn numbers.
- customizing the newline character sequence.
The library already provides two classes for the above:
- class `SourcePosition` is the most basic class that just contains an iterator for the current position; it allows for statically using either case sensitive or case insensitive parsing.
- class `LineCountingSourcePosition` extends the class `SourcePosition` with line and column information, and it also allows the specification of newline sequence, which, by default, is implemented by class `DefaultNewlineTraits` that recognizes the character `\n` as the newline separator.
Examples:
```cpp
//case insensitive parsing
ParseContext> pc(input);
//case sensitive parsing with line counting
ParseContext> pc(input);
//case insensitive parsing with line counting and custom newline traits
ParseContext> pc(input);
```
## Simple Matches
The `operator ==` allows the creation of a match, when an expression parses successfully. The right hand side should be an expression which evaluates to the match id expected by the parse context. Example:
```cpp
enum TYPE {
A, B, C
};
const auto a = terminal('A') == A;
const auto b = terminal('B') == B;
const auto c = terminal('B') == C;
const auto grammar = a >> b >> c;
std::string input = "ABC";
ParseContext pc(input);
const bool ok = grammar(pc);
for(const auto& match : pc.matches()) {
std::cout << match.content() << " = " << match.id() << std::endl;
}
```
The above produces the output:
```
A = 0
B = 1
C = 2
```
## Tree Matches
The `operator >=` allows the creation of a match, like the `operator ==`, with a difference: all matches created within the context of the expression are placed as children matches.
This allows matches to also be trees, instead of a flat list.
In the following example, an IP4 address is returned as a tree match, with the following structure:
```
IP4_ADDRESS
HEX_BYTE
HEX_DIGIT
HEX_DIGIT
HEX_BYTE
HEX_DIGIT
HEX_DIGIT
HEX_BYTE
HEX_DIGIT
HEX_DIGIT
HEX_BYTE
HEX_DIGIT
HEX_DIGIT
```
Here is the code:
```cpp
enum TYPE {
ZERO,
ONE,
TWO,
THREE,
FOUR,
FIVE,
SIX,
SEVEN,
EIGHT,
NINE,
A,
B,
C,
D,
E,
F,
HEX_DIGIT,
HEX_BYTE,
IP4_ADDRESS
};
const auto zero = terminal('0') == ZERO ;
const auto one = terminal('1') == ONE ;
const auto two = terminal('2') == TWO ;
const auto three = terminal('3') == THREE;
const auto four = terminal('4') == FOUR ;
const auto five = terminal('5') == FIVE ;
const auto six = terminal('6') == SIX ;
const auto seven = terminal('7') == SEVEN;
const auto eight = terminal('8') == EIGHT;
const auto nine = terminal('9') == NINE ;
const auto a = terminal('A') == A;
const auto b = terminal('B') == B;
const auto c = terminal('C') == C;
const auto d = terminal('D') == D;
const auto e = terminal('E') == E;
const auto f = terminal('F') == F;
const auto hexDigit = (zero | one | two | three | four | five | six | seven | eight | nine | a | b | c | d | f) >= HEX_DIGIT;
const auto hexByte = (hexDigit >> hexDigit) >= HEX_BYTE;
const auto ip4Address = (hexByte >> terminal('.') >> hexByte >> terminal('.') >> hexByte >> terminal('.') >> hexByte) >= IP4_ADDRESS;
const std::string input = "FF.12.DC.A0";
ParseContext pc(input);
using Match = typename ParseContext::Match;
const bool ok = ip4Address(pc);
assert(ok);
assert(pc.matches().size() == 1);
const Match& match = pc.matches()[0];
std::stringstream stream;
stream << match.children()[0].children()[0].content();
stream << match.children()[0].children()[1].content();
stream << '.';
stream << match.children()[1].children()[0].content();
stream << match.children()[1].children()[1].content();
stream << '.';
stream << match.children()[2].children()[0].content();
stream << match.children()[2].children()[1].content();
stream << '.';
stream << match.children()[3].children()[0].content();
stream << match.children()[3].children()[1].content();
const std::string output = stream.str();
std::cout << output;
```
The above prints the input, which is the value `FF.12.DC.A0`.
## Resuming From Errors
In order to resume from errors, the special `operator ~()` can be used to create an `error resume point`.
An `error resume point` shall be combined with `operator >>()` to create a sequence of parsers, in which the parsers before the `error resume point` may create an error, and then the `error parser` will try to resume parsing from the `error resume point`.
Here is an example of parsing a terminal enclosed in single quotes:
```cpp
const auto ws = *terminal(' ');
const auto letter = terminalRange('a', 'z') | terminalRange('A', 'Z');
const auto digit = terminalRange('0', '9');
const auto character = letter | digit;
const auto terminal_ = ('\'' >> *(character - '\'') >> ~terminal('\'')) == "terminal";
const auto grammar = ws >> *(terminal_ >> ws);
```
If an error happens when parsing a terminal, then the parser will look for the single quote symbol `\'` in order to continue parsing.