# Pidgin **Repository Path**: tangzhiming/Pidgin ## Basic Information - **Project Name**: Pidgin - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: devirt - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 1 - **Forks**: 0 - **Created**: 2024-10-17 - **Last Updated**: 2024-12-09 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README Pidgin ====== A lightweight, fast, and flexible parsing library for C#. Installing ---------- Pidgin is [available on Nuget](https://www.nuget.org/packages/Pidgin/). API docs are hosted [on my website](https://www.benjamin.pizza/Pidgin). Tutorial -------- There's a tutorial on using Pidgin to parse a subset of Prolog [on my website](https://www.benjamin.pizza/posts/2019-12-08-parsing-prolog-with-pidgin.html). ### Getting started Pidgin is a _parser combinator library_, a lightweight, high-level, declarative tool for constructing parsers. Parsers written with parser combinators look like a high-level specification of a language's grammar, but they're expressed within a general-purpose programming language and require no special tools to produce executable code. Parser combinators are more powerful than regular expressions - they can parse a larger class of languages - but simpler and easier to use than parser generators like ANTLR. Pidgin's core type, `Parser`, represents a procedure which consumes an input stream of `TToken`s, and may either fail with a parsing error or produce a `T` as output. You can think of it as: ```csharp delegate T? Parser(IEnumerator input); ``` In order to start building parsers we need to import two classes which contain factory methods: `Parser` and `Parser`. ```csharp using Pidgin; using static Pidgin.Parser; using static Pidgin.Parser; // we'll be parsing strings - sequences of characters. For other applications (eg parsing binary file formats) TToken may be some other type (eg byte). ``` ### Primitive parsers Now we can create some simple parsers. `Any` represents a parser which consumes a single character and returns that character. ```csharp Assert.AreEqual('a', Any.ParseOrThrow("a")); Assert.AreEqual('b', Any.ParseOrThrow("b")); ``` `Char`, an alias for `Token`, consumes a _particular_ character and returns that character. If it encounters some other character then it fails. ```csharp Parser parser = Char('a'); Assert.AreEqual('a', parser.ParseOrThrow("a")); Assert.Throws(() => parser.ParseOrThrow("b")); ``` `Digit` parses and returns a single digit character. ```csharp Assert.AreEqual('3', Digit.ParseOrThrow("3")); Assert.Throws(() => Digit.ParseOrThrow("a")); ``` `String` parses and returns a particular string. If you give it input other than the string it was expecting it fails. ```csharp Parser parser = String("foo"); Assert.AreEqual("foo", parser.ParseOrThrow("foo")); Assert.Throws(() => parser.ParseOrThrow("bar")); ``` `Return` (and its synonym `FromResult`) never consumes any input, and just returns the given value. Likewise, `Fail` always fails without consuming any input. ```csharp Parser parser = Return(3); Assert.AreEqual(3, parser.ParseOrThrow("foo")); Parser parser2 = Fail(); Assert.Throws(() => parser2.ParseOrThrow("bar")); ``` ### Sequencing parsers The power of parser combinators is that you can build big parsers out of little ones. The simplest way to do this is using `Then`, which builds a new parser representing two parsers applied sequentially (discarding the result of the first). ```csharp Parser parser1 = String("foo"); Parser parser2 = String("bar"); Parser sequencedParser = parser1.Then(parser2); Assert.AreEqual("bar", sequencedParser.ParseOrThrow("foobar")); // "foo" got thrown away Assert.Throws(() => sequencedParser.ParseOrThrow("food")); ``` `Before` throws away the second result, not the first. ```csharp Parser parser1 = String("foo"); Parser parser2 = String("bar"); Parser sequencedParser = parser1.Before(parser2); Assert.AreEqual("foo", sequencedParser.ParseOrThrow("foobar")); // "bar" got thrown away Assert.Throws(() => sequencedParser.ParseOrThrow("food")); ``` `Map` does a similar job, except it keeps both results and applies a transformation function to them. This is especially useful when you want your parser to return a custom data structure. (`Map` has overloads which operate on between one and eight parsers; the one-parser version also has a postfix synonym `Select`.) ```csharp Parser parser1 = String("foo"); Parser parser2 = String("bar"); Parser sequencedParser = Map((foo, bar) => bar + foo, parser1, parser2); Assert.AreEqual("barfoo", sequencedParser.ParseOrThrow("foobar")); Assert.Throws(() => sequencedParser.ParseOrThrow("food")); ``` `Bind` uses the result of a parser to choose the next parser. This enables parsing of context-sensitive languages. For example, here's a parser which parses any character repeated twice. ```csharp /// parse any character, then parse a character matching the first character Parser parser = Any.Bind(c => Char(c)); Assert.AreEqual('a', parser.ParseOrThrow("aa")); Assert.AreEqual('b', parser.ParseOrThrow("bb")); Assert.Throws(() => parser.ParseOrThrow("ab")); ``` Pidgin parsers support LINQ query syntax. It may be easier to see what the above example does when it's written out using LINQ: ```csharp Parser parser = from c in Any from c2 in Char(c) select c2; ``` Parsers written like this look like a simple imperative script. "Run the `Any` parser and name its result `c`, then run `Char(c)` and name its result `c2`, then return `c2`." ### Choosing from alternatives `Or` represents a parser which can parse one of two alternatives. It runs the left parser first, and if it fails it tries the right parser. ```csharp Parser parser = String("foo").Or(String("bar")); Assert.AreEqual("foo", parser.ParseOrThrow("foo")); Assert.AreEqual("bar", parser.ParseOrThrow("bar")); Assert.Throws(() => parser.ParseOrThrow("baz")); ``` `OneOf` is equivalent to `Or`, except it takes a variable number of arguments. Here's a parser which is equivalent to the one using `Or` above: ```csharp Parser parser = OneOf(String("foo"), String("bar")); ``` If one of `Or` or `OneOf`'s component parsers fails _after consuming input_, the whole parser will fail. ```csharp Parser parser = String("food").Or(String("foul")); Assert.Throws(() => parser.ParseOrThrow("foul")); // why didn't it choose the second option? ``` What happened here? When a parser successfully parses a character from the input stream, it advances the input stream to the next character. `Or` only chooses the next alternative if the given parser fails _without consuming any input_; Pidgin does not perform any lookahead or backtracking by default. Backtracking is enabled using the `Try` function. ```csharp // apply Try to the first option, so we can return to the beginning if it fails Parser parser = Try(String("food")).Or(String("foul")); Assert.AreEqual("foul", parser.ParseOrThrow("foul")); ``` ### Recursive grammars Almost any non-trivial programming language, markup language, or data interchange language will feature some sort of recursive structure. C# doesn't support recursive values: a recursive referral to a variable currently being initialised will return `null`. So we need some sort of deferred execution of recursive parsers, which Pidgin enables using the `Rec` combinator. Here's a simple parser which parses arbitrarily nested parentheses with a single digit inside them. ```csharp Parser expr = null; Parser parenthesised = Char('(') .Then(Rec(() => expr)) // using a lambda to (mutually) recursively refer to expr .Before(Char(')')); expr = Digit.Or(parenthesised); Assert.AreEqual('1', expr.ParseOrThrow("1")); Assert.AreEqual('1', expr.ParseOrThrow("(1)")); Assert.AreEqual('1', expr.ParseOrThrow("(((1)))")); ``` However, Pidgin does not support left recursion. A parser must consume some input before making a recursive call. The following example will produce a stack overflow because a recursive call to `arithmetic` occurs before any input can be consumed by `Digit` or `Char('+')`: ```csharp Parser arithmetic = null; Parser addExpr = Map( (x, _, y) => x + y, Rec(() => arithmetic), Char('+'), Rec(() => arithmetic) ); arithmetic = addExpr.Or(Digit.Select(d => (int)char.GetNumericValue(d))); arithmetic.Parse("2+2"); // stack overflow! ``` ### Derived combinators Another powerful element of this programming model is that you can write your own functions to compose parsers. Pidgin contains a large number of higher-level combinators, built from the primitives outlined above. For example, `Between` runs a parser surrounded by two others, keeping only the result of the central parser. ```csharp Parser InBraces(this Parser parser, Parser before, Parser after) => before.Then(parser).Before(after); ``` ### Parsing expressions Pidgin features operator-precedence parsing tools, for parsing expression grammars with associative infix operators. The `ExpressionParser` class builds a parser from a parser to parse a single expression term and a table of operators with rules to combine expressions. ### More examples Examples, such as parsing (a subset of) JSON and XML into document structures, can be found in the `Pidgin.Examples` project. Tips ---- ### A note on variance Why doesn't this code compile? ```csharp class Base {} class Derived : Base {} Parser p = Return(new Derived()); // Cannot implicitly convert type 'Pidgin.Parser' to 'Pidgin.Parser' ``` This would be possible if `Parser` were defined as a _covariant_ in its second type parameter (ie `interface Parser`). For the purposes of efficiency, Pidgin parsers return a struct. Structs and classes aren't allowed to have variant type parameters (only interfaces and delegates); since a Pidgin parser's return value isn't variant, nor can the parser itself. In my experience, this crops up most frequently when returning a node of a syntax tree from a parser using `Select`. The least verbose way of rectifying this is to explicitly set `Select`'s type parameter to the supertype: ```csharp Parser p = Any.Select(() => new Derived()); ``` ### Speed tips Pidgin is designed to be fast and produce a minimum of garbage. A carefully written Pidgin parser can be competitive with a hand-written recursive descent parser. If you find that parsing is a bottleneck in your code, here are some tips for minimising the runtime of your parser. * Avoid LINQ query syntax. Query comprehensions are defined by translation into core C# using `SelectMany`, however, for long queries the translation can allocate a large number of anonymous objects. This generates a lot of garbage; while those objects often won't survive the nursery it's still preferable to avoid allocating them! * Avoid backtracking where possible. If consuming a streaming input like a `TextReader` or an `IEnumerable`, `Try` _buffers_ its input to enable backtracking, which can be expensive. * Use specialised parsers where possible: the provided `Skip*` parsers can be used when the result of parsing is not required. They typically run faster than their counterparts because they don't need to save the values generated. * Build your parser statically where possible. Pidgin is designed under the assumption that parser scripts are executed more than they are written; building a parser can be an expensive operation. * Avoid `Bind` and `SelectMany` where possible. Many practical grammars are _context-free_ and can therefore be written purely with `Map`. If you do have a context-sensitive grammar, it may make sense to parse it in a context-free fashion and then run a semantic checker over the result. Comparison to other tools ------------------------- ### Pidgin vs Sprache [Sprache](https://github.com/sprache/Sprache) is another parser combinator library for C# and served as one of the sources of inspiration for Pidgin. Pidgin's API is somewhat similar to that of Sprache, but Pidgin aims to improve on Sprache in a number of ways: * Sprache's input must be a string. This makes it inappropriate for parsing binary protocols or tokenised inputs. Pidgin supports input tokens of arbitrary type. * Sprache's input must be a string - an _in-memory_ array of characters. Pidgin supports streaming inputs. * Sprache automatically backtracks on failure. Pidgin uses a special combinator to enable backtracking because backtracking can be a costly operation. * Pidgin comes bundled with operator-precedence tools for parsing expression languages with associative infix operators. * Pidgin is faster and allocates less memory than Sprache. * Pidgin has more documentation coverage than Sprache. ### Pidgin vs FParsec [FParsec](https://github.com/stephan-tolksdorf/fparsec) is a parser combinator library for F# based on [Parsec](https://hackage.haskell.org/package/parsec-3.1.11). * FParsec is an F# library and consuming it from C# can be awkward. Pidgin is implemented in pure C#, and is designed for C# consumers. * FParsec only supports character input streams. * FParsec supports stateful parsing - it has an extra type parameter for an arbitrary user-defined state - which can make it easier to parse context-sensitive grammars. * FParsec is faster than Pidgin (though we're catching up!) ### Benchmarks This is how Pidgin compares to other tools in terms of performance. The benches can be found in the `Pidgin.Bench` project. ```ini BenchmarkDotNet=v0.11.5, OS=Windows 10.0.14393.3384 (1607/AnniversaryUpdate/Redstone1) Intel Core i5-4460 CPU 3.20GHz (Haswell), 1 CPU, 4 logical and 4 physical cores Frequency=3125000 Hz, Resolution=320.0000 ns, Timer=TSC .NET Core SDK=3.1.100 [Host] : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT DefaultJob : .NET Core 3.0.0 (CoreCLR 4.700.19.46205, CoreFX 4.700.19.46214), 64bit RyuJIT ``` #### `ExpressionBench` | Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | |-------------------- |-------------:|-------------:|--------------:|------:|--------:|-------:|------:|------:|----------:| | LongInfixL_Pidgin | 625,148.8 ns | 3,015.040 ns | 2,672.7541 ns | 2.25 | 0.01 | - | - | - | 128 B | | LongInfixR_Pidgin | 625,530.1 ns | 4,104.833 ns | 3,839.6633 ns | 2.25 | 0.02 | - | - | - | 128 B | | LongInfixL_FParsec | 278,035.1 ns | 1,231.538 ns | 1,151.9816 ns | 1.00 | 0.00 | - | - | - | 200 B | | LongInfixR_FParsec | 326,047.3 ns | 931.485 ns | 871.3119 ns | 1.17 | 0.01 | - | - | - | 200 B | | | | | | | | | | | | | ShortInfixL_Pidgin | 1,506.5 ns | 5.515 ns | 5.1590 ns | 2.67 | 0.01 | 0.0401 | - | - | 128 B | | ShortInfixR_Pidgin | 1,636.6 ns | 6.882 ns | 5.7467 ns | 2.90 | 0.02 | 0.0401 | - | - | 128 B | | ShortInfixL_FParsec | 564.1 ns | 1.894 ns | 1.6788 ns | 1.00 | 0.00 | 0.0629 | - | - | 200 B | | ShortInfixR_FParsec | 567.7 ns | 1.200 ns | 0.9373 ns | 1.01 | 0.00 | 0.0629 | - | - | 200 B | #### `JsonBench` | Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated | |-------------------- |-----------:|----------:|----------:|------:|--------:|----------:|---------:|------:|-----------:| | BigJson_Pidgin | 684.6 us | 2.888 us | 2.701 us | 1.00 | 0.00 | 33.2031 | - | - | 101.7 KB | | BigJson_Sprache | 3,597.5 us | 17.595 us | 16.458 us | 5.25 | 0.03 | 1726.5625 | - | - | 5291.81 KB | | BigJson_Superpower | 2,884.4 us | 6.504 us | 5.766 us | 4.21 | 0.02 | 296.8750 | - | - | 913.43 KB | | BigJson_FParsec | 750.1 us | 3.516 us | 3.289 us | 1.10 | 0.01 | 111.3281 | - | - | 343.43 KB | | | | | | | | | | | | | LongJson_Pidgin | 517.5 us | 2.418 us | 2.261 us | 1.00 | 0.00 | 33.2031 | - | - | 104.25 KB | | LongJson_Sprache | 2,858.5 us | 10.491 us | 9.300 us | 5.53 | 0.03 | 1390.6250 | - | - | 4269.33 KB | | LongJson_Superpower | 2,348.1 us | 14.194 us | 13.277 us | 4.54 | 0.03 | 230.4688 | - | - | 706.79 KB | | LongJson_FParsec | 642.5 us | 2.708 us | 2.533 us | 1.24 | 0.01 | 125.0000 | - | - | 384.3 KB | | | | | | | | | | | | | DeepJson_Pidgin | 399.3 us | 1.784 us | 1.582 us | 1.00 | 0.00 | 26.3672 | - | - | 82.24 KB | | DeepJson_Sprache | 2,983.0 us | 42.512 us | 39.765 us | 7.46 | 0.09 | 761.7188 | 191.4063 | - | 2922.46 KB | | DeepJson_FParsec | 701.8 us | 1.665 us | 1.557 us | 1.76 | 0.01 | 112.3047 | - | - | 344.43 KB | | | | | | | | | | | | | WideJson_Pidgin | 427.8 us | 1.619 us | 1.515 us | 1.00 | 0.00 | 15.6250 | - | - | 48.42 KB | | WideJson_Sprache | 1,704.2 us | 9.246 us | 8.196 us | 3.98 | 0.02 | 900.3906 | - | - | 2763.22 KB | | WideJson_Superpower | 1,494.6 us | 9.581 us | 8.962 us | 3.49 | 0.02 | 148.4375 | - | - | 459.74 KB | | WideJson_FParsec | 379.5 us | 1.597 us | 1.494 us | 0.89 | 0.00 | 41.9922 | - | - | 129.02 KB |