Patterns
Parsing Expression Grammars are built out of patterns. These begin with atomic units of recognition, and are combined into complex rules, which can call other rules, recursively, thereby recognizing context free, and some context sensitive, languages. LPeg, JLpeg's inspiration, uses a SNOBOL-style set of operator overloads as the basic tool for building up patterns, a practice we also follow.
Patterns and their combination are the building block of JLpeg recognition engines. They are immutable and may be freely recombined and reused, unlike either regular expressions or the parsers generated by typical compiler-compilers.
The API of JLpeg hews closely to LPeg, with several extensions, refinements, and a more natively Julian character.
Combination
The basic operations are as follows:
Operator | Description |
---|---|
P(string::String) | match a literal String string |
P(n::UInt) | match any n characters |
P(-n) | match if there are at least n characters remaining |
P(sym::Symbol) | match the rule named :sym |
S(s::String) | match the S et of all characters in string |
R("xy") , R('x','y') | matches any character between x and y (R ange) |
B(patt) | match patt behind the cursor, without advancing |
U8(byte::UInt8) | match one byte of the string, invalid UTF-8 or not |
patt^n | match at least n repetitions of patt |
patt^-n | match at most n repetitions of patt |
patt^[n:m] | match between n and m repetitions of patt |
patt^[n] | match exactly n repetitions of patt |
patt1 * patt2 | match the sequence patt1 , patt2 |
patt1 | patt2 | match patt1 or patt2 , in that order |
patt1 - patt2 | match patt1 if patt2 does not match |
!patt , ¬patt | negative lookahead, succeeds if patt fails |
~patt | lookahead, match patt without advancing |
patt1 >> patt2 | match patt1 , then search the string for the next patt2 . |
P(true) , ε | always succeed |
P(false) , ∅ | always fail |
In keeping with the spirit of LPeg, P"string"
is equivalent to P("string")
, and this is true for S
and R
as well. These basic operations are not recursive, and without further modification will match to the longest substring recognized by the pattern. This is sufficient to match all regular languages. Note that ordered choice in PEGs is traditionally spelled "/", but this has the wrong precedence for us. Just remember that it has different semantics from the "|" in context-free grammars.
A Note About Piracy
You will note that combining Patterns involves a great deal of operator overloading. In Julian circles, operators are presumed to have a certain contract, although this is informal and has a certain latitude. Some of our operators comply with this expectation: *
and ^
are used for concatenation and repetition for AbstractString
s, as they are with Pattern
s, although the meaning of repetition is broader for patterns. Others do not: particularly egregious is !
, which is expected to always return a Bool
. |
and -
are justifiable, in my opinion: |
is firmly grounded in tradition and a | b
would be pronounced "a or b", while subtraction has a huge variety of meanings in mathematics; our use, as one should expect, is neither commutative nor associative. ~
and >>
bear little resemblance to their ordinary meanings.
Broadly speaking, the combinator operators in JLpeg are a combination of availability, operator precedence, and mnemnonic weight, in that order. For example, &patt
is the signifier for lookahead in the PEG definition, we use ~
because it's unary, and Julia has but few unary operators.
In any case, we shadow operators, rather than overloading the ones found in Base
, and they aren't exported. We provide JLpeg.Combinators
as an easy way to bring them into scope if desired. Most users will stick to the @rule
and @grammar
macros, which don't require bringing operators into scope.
Matching
match
(pattern::
Pattern
, string
::AbstractString) will attempt to match the pattern against the string, returning a PegMatch
<:
AbstractMatch. In the event of a failure, it returns a PegFail
, with the index of the failure at .errpos
. Note that unlike regular expressions, JLpeg will not skip ahead to find a pattern in a string, unless the pattern is so constructed. We offer the shorthand "" >> patt
, the "fast-forward" operator, to convert a pattern into its searching equivalent. P""
matches the empty string, and JLPeg will convert Strings and Integers (but not Bools) into patterns when able.
julia> match(P"123", "123456")
PegMatch(["123"])
julia> match(P"abc" * "123", "abc123")
PegMatch(["abc123"])
julia> match(P"abc" | "123", "123")
PegMatch(["123"])
julia> match(P"abc"^1, "abcabcabc")
PegMatch(["abcabcabc"])
julia> match((!S"123" * R"09")^1, "0987654321")
PegMatch(["0987654"])
julia> match("" >> P"5", "0987654321")
PegMatch(["098765"])
julia> match(~P"abc", "abc123")
PegMatch([""])
julia> match(~P"abc", "123abc") # fails
PegFail("123abc", 1)
The operators introduce a pattern 'context', where any a <op> b
combination where a
or b
is a Pattern will attempt to cast the other argument to a Pattern when appropriate. Generally, a MethodError
may be repaired by using P
on the left side of the the operator, although we can't guarantee that other method overloads for those operators might apply. Notably, *
is used for concatenation of strings, although in the JLpeg context, P"abc" * P"123"
is in fact the same as P("abc" * "123")
.
This UI is adequate for light work, but the macros discussed later are cleaner to work with, defined such that P
should never be necessary, although any of the public names in the JLpeg
module may be used, and needn't be imported into your package to do so.
Note that, unlike regular expressions, PEG matching always starts with the first character. Any match returned by a call to match(patt, string)
will therefore be a prefix of the string, up to and including the entire string.
Most interesting uses of pattern recognition will call for more than matching the longest substring. For those more complex cases, we have Captures
and Actions
.
Rules and Grammars
While simple patterns may be composed by assigning to variables and using those variable names to build up more complex patterns, this doesn't allow for recursion, which is essential for matching many strings of interest, perhaps most.
For this purpose, we have rules, which are simply patterns with a name. A rule with no references to another rule within it may be used for matching directly, while those with such references (including a reference to itself) must be composed into grammars.
As is the PEG convention, a rule reduction uses the left arrow ←
, which you can type as \leftarrow
(or in fact \lefta[TAB]
), also defined as <--
. A simple grammar can look like this:
abc_and = :a <-- P"abc" * (:b | P"")
_123s = :b ← P"123"^1 * :a
abc123 = Grammar(abc_and, _123s)
match(abc123, "abc123123123abc123abc")
Although we suggest as a matter of style that a grammar use one arrow form or the other, with ←
preferred.
The first rule is the start rule, which must succeed if the match is to succeed. A grammar which is missing rules will throw a PegError
, but duplicate rules are undefined behavior. Currently JLpeg will compile the last rule of that name it encounters, but this behavior must not be relied upon.
The preferred way to create rules and grammars is with the macros @rule
and @grammar
, which avoid the tedium of decorating expressions with P
entirely. Any Integer
, String
, Symbol
, or Char
, found on its own, is converted into the pattern equivalent. While this is not true of booleans, an idiomatic way to spell true
and false
in JLpeg is ""
and S""
respectively (read: "the empty string" and "the empty set"). These are compiled into the same code as P(true)
and P(false)
. JLpeg also defines, but does not export, ε
for P(true)
and ∅
for P(false)
, and these may be used in grammars and rules as well, with \varepsilon
(\vare[TAB]
) and \emptyset
(\emp[TAB]
) respectively.
Public variable names from JLpeg
will always refer to the values they have in the module. Any other variable will be escaped, so it will have the expected meaning in context.
To give an example, this rule:
@rule :a ← "foo" * [S"123" | "abc"^1]^1
Is equivalent to this expression:
a = :a ← P("foo") * Cg(S("123") | P("abc")^1)^1
Although the definitions of the operators and string macros would allow this reduction:
a = :a ← P"foo" * Cg(S"123" | P"abc"^1)^1
Which is a bit less cumbersome (we try). Note that the @rule
form doesn't require the importation of @S_str
, or any other exported name, thanks to the nature of Julia macros. You may always use any form of a pattern in @rule
or @grammar
, all of these are equivalent: P("string")
, P"string"
, and "string"
.
A Sample Grammar
A classic example of a task forever beyond the reach of regular expressions is balancing parentheses, with JLpeg this is easy:
julia> @grammar parens begin :par ← :s * !1 :s ← (:b | (!S"()" * 1))^1 :b ← '(' * :s * ')' end;
julia> match(parens, "(these (must) balance)")
PegMatch(["(these (must) balance)"])
julia> match(parens, "these (must) balance)")
PegFail("these (must) balance)", 22)
julia> match(parens, "(these (must) balance")
PegFail("(these (must) balance", 22)
julia> match(parens, "(these (must) balance))")
PegFail("(these (must) balance))", 24)
julia> match(parens, "(these (must))) balance)")
PegFail("(these (must))) balance)", 16)
!1
is our equivalent of $
in regex, a pattern which only succeeds at the end of input. 1
will match a single Unicode codepoint, and !
is negative lookahead. The line-oriented equivalent of $
is ~"\n"
.
The @grammar
macro doesn't define variable names for the rules, only the grammar name given before the expression block. The first rule is always the start rule. As the example shows, it doesn't necessarily match the variable name, although of course it may.
@constgrammar and @construle
It's better style, and good for performance, to define global variables as const
. Julia's macro system currently doesn't offer a way to determine what scope a macro is executing in, so for use in the global scope, JLpeg has the macros @constgrammar
and @construle
, which differ from their cousins only in declaring the variable to be constant.