# regexp2 - full featured regular expressions for Go
Regexp2 is a feature-rich RegExp engine for Go.  It doesn't have the linear-time guarantees of the built-in `regexp` package, but it allows backtracking and is compatible with Perl5 and .NET.  You'll likely be better off with the RE2 engine from the `regexp` package and should only use this if you need to write very complex patterns or require compatibility with .NET.

## Basis of the engine
The engine is ported from the .NET framework's System.Text.RegularExpressions.Regex engine.  That engine was open sourced in 2015 under the MIT license.  There are some fundamental differences between .NET strings and Go strings that required a bit of borrowing from the Go framework regex engine as well.  I cleaned up a couple of the dirtier bits during the port (regexcharclass.cs was terrible), but the parse tree, code emitted, and therefore patterns matched should be identical.

## New Code Generation
For extra performance use `regexp2` with [`regexp2cg`](https://github.com/dlclark/regexp2cg). It is a code generation utility for `regexp2` and you can likely improve your regexp runtime performance by 3-10x in hot code paths. As always you should benchmark your specifics to confirm the results. Give it a try!

## Installing
This is a go-gettable library, so install is easy:

    go get github.com/dlclark/regexp2

To use the new Code Generation (while it's in beta) you'll need to use the `code_gen` branch:

    go get github.com/dlclark/regexp2@code_gen

## Usage
Usage is similar to the Go `regexp` package.  Just like in `regexp`, you start by compiling a regex into a state machine via the `Compile` or `MustCompile` functions.  They ultimately do the same thing, but `MustCompile` will panic if the regex is invalid.  You can then use the returned `Regexp` struct to find matches repeatedly.  A `Regexp` struct is safe to use across goroutines.

```go
re := regexp2.MustCompile(`Your pattern`, 0)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    // do something
}
```

The only error that the `*Match*` methods *should* return is a Timeout if you set the `re.MatchTimeout` field.  Any other error is a bug in the `regexp2` package.  If you need more details about capture groups in a match then use the `FindStringMatch` method, like so:

```go
if m, _ := re.FindStringMatch(`Something to match`); m != nil {
    // the whole match is always group 0
    fmt.Printf("Group 0: %v\n", m.String())

    // you can get all the groups too
    gps := m.Groups()

    // a group can be captured multiple times, so each cap is separately addressable
    fmt.Printf("Group 1, first capture: %v\n", gps[1].Captures[0].String())
    fmt.Printf("Group 1, second capture: %v\n", gps[1].Captures[1].String())
}
```

Group 0 is embedded in the Match.  Group 0 is an automatically-assigned group that encompasses the whole pattern.  This means that `m.String()` is the same as `m.Group.String()` and `m.Groups()[0].String()`.

The __last__ capture is embedded in each group, so `g.String()` will return the same thing as `g.Capture.String()` and `g.Captures[len(g.Captures)-1].String()`.

If you want to find multiple matches from a single input string you should use the `FindNextMatch` method.  For example, to implement a function similar to `regexp.FindAllString`:

```go
func regexp2FindAllString(re *regexp2.Regexp, s string) []string {
	var matches []string
	m, _ := re.FindStringMatch(s)
	for m != nil {
		matches = append(matches, m.String())
		m, _ = re.FindNextMatch(m)
	}
	return matches
}
```

`FindNextMatch` is optimized so that it reuses the underlying string/rune slice.

The internals of `regexp2` always operate on `[]rune` so `Index` and `Length` data in a `Match` always reference a position in `rune`s rather than `byte`s (even if the input was given as a string). This is a dramatic difference between `regexp` and `regexp2`.  It's advisable to use the provided `String()` methods to avoid having to work with indices.
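
If you do need byte offsets (for example, to slice the original string), the rune-based values have to be converted. The following is a minimal sketch using only the standard library; `runeSpanToByteSpan` is a hypothetical helper name and is not part of `regexp2`. It assumes you pass in the `Index` and `Length` reported for a match.

```go
package main

import "fmt"

// runeSpanToByteSpan converts a span expressed in runes (index, length)
// into byte offsets into s. ok is false if the span lies outside s.
// Hypothetical helper for illustration; not part of regexp2.
func runeSpanToByteSpan(s string, index, length int) (start, end int, ok bool) {
	if index < 0 || length < 0 {
		return 0, 0, false
	}
	start, end = -1, -1
	runePos := 0
	// Ranging over a string yields the byte offset of each rune.
	for bytePos := range s {
		if runePos == index {
			start = bytePos
		}
		if runePos == index+length {
			end = bytePos
			break
		}
		runePos++
	}
	// A span may begin or end exactly at the end of the string.
	if start < 0 && runePos == index {
		start = len(s)
	}
	if end < 0 && runePos == index+length {
		end = len(s)
	}
	return start, end, start >= 0 && end >= 0
}

func main() {
	s := "h\u00e9llo w\u00f6rld"
	// Suppose a match reported Index 6 and Length 5 (in runes).
	start, end, ok := runeSpanToByteSpan(s, 6, 5)
	if ok {
		fmt.Println(start, end, s[start:end])
	}
}
```

The conversion walks the string once, so for many matches on a large input it may be cheaper to keep working with the `String()` methods instead.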

## Compare `regexp` and `regexp2`
| Category | regexp | regexp2 |
| --- | --- | --- |
| Catastrophic backtracking possible | no, linear execution time guarantees | yes, if your pattern is at risk you can use the `re.MatchTimeout` field |
| Python-style capture groups `(?P<name>re)` | yes | no (yes in RE2 compat mode) |
| .NET-style capture groups `(?<name>re)` or `(?'name're)` | no | yes |
| comments `(?#comment)` | no | yes |
| branch numbering reset `(?\|a\|b)` | no | no |
| possessive match `(?>re)` | no | yes |
| positive lookahead `(?=re)` | no | yes |
| negative lookahead `(?!re)` | no | yes |
| positive lookbehind `(?<=re)` | no | yes |
| negative lookbehind `(?<!re)` | no | yes |
| back reference `\1` | no | yes |
| named back reference `\k'name'` | no | yes |
| named ascii character class `[[:foo:]]`| yes | no (yes in RE2 compat mode) |
| conditionals `(?(expr)yes\|no)` | no | yes |
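
To check a few rows of the `regexp` column yourself, you can ask the standard library whether it will compile a pattern. This is a small sketch that only exercises the built-in `regexp` package; the sample patterns and the `stdlibAccepts` helper are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// stdlibAccepts reports whether the standard library's regexp package
// can compile the given pattern. Hypothetical helper for illustration.
func stdlibAccepts(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Sample patterns, each exercising one feature from the table.
	patterns := []string{
		`(?P<year>\d{4})`, // Python-style capture group
		`[[:alpha:]]+`,    // named ascii character class
		`(?#note)a`,       // comment
		`(?>a+)b`,         // possessive match
		`a(?=b)`,          // positive lookahead
		`a(?!b)`,          // negative lookahead
		`(?<=a)b`,         // positive lookbehind
		`(?<!a)b`,         // negative lookbehind
		`(a)\1`,           // back reference
	}
	for _, p := range patterns {
		fmt.Printf("%-18s accepted by regexp: %v\n", p, stdlibAccepts(p))
	}
}
```

Patterns that `regexp` rejects here are the ones where `regexp2` is worth considering.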

## RE2 compatibility mode
The default behavior of `regexp2` is to match the .NET regexp engine; however, the `RE2` option is provided to change the parsing to increase compatibility with RE2.  Using the `RE2` option when compiling a regexp will not take away any features, but will change the following behaviors:
* add support for named ascii character classes (e.g. `[[:foo:]]`)
* add support for python-style capture groups (e.g. `(?P<name>re)`)
* change singleline behavior for `$` to only match end of string (like RE2) (see [#24](https://github.com/dlclark/regexp2/issues/24))
* change the character classes `\d` `\s` and `\w` to match the same characters as RE2. NOTE: if you also use the `ECMAScript` option then this will change the `\s` character class to match ECMAScript instead of RE2.  ECMAScript allows more whitespace characters in `\s` than RE2 (but still fewer than the default behavior).
* allow character escape sequences to have defaults. For example, by default `\_` isn't a known character escape and will fail to compile, but in RE2 mode it will match the literal character `_`

```go
re := regexp2.MustCompile(`Your RE2-compatible pattern`, regexp2.RE2)
if isMatch, _ := re.MatchString(`Something to match`); isMatch {
    // do something
}
```

This feature is a work in progress and I'm open to ideas for more things to put here (maybe more relaxed character escaping rules?).

## Catastrophic Backtracking and Timeouts

`regexp2` supports features that can lead to catastrophic backtracking.
`Regexp.MatchTimeout` can be set to limit the impact of such behavior; the
match will fail with an error after approximately `MatchTimeout`. No timeout
checks are done by default.

Timeout checking is not free. The current timeout checking implementation starts
a background worker that updates a clock value approximately once every 100
milliseconds. The matching code compares this value against the precomputed
deadline for the match. The performance impact is as follows.

1.  A match with a timeout runs almost as fast as a match without a timeout.
2.  If any live matches have a timeout, there will be a background CPU load
    (`~0.15%` currently on a modern machine). This load will remain constant
    regardless of the number of matches done including matches done in parallel.
3.  If no live matches are using a timeout, the background load will remain
    until the longest deadline (match timeout + the time when the match started)
    is reached. E.g., if you set a timeout of one minute the load will persist
    for approximately a minute even if the match finishes quickly.

See [PR #58](https://github.com/dlclark/regexp2/pull/58) for more details and
alternatives considered.
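
The general technique described above (a background worker refreshing a coarse clock that hot loops compare against a precomputed deadline) can be sketched with the standard library alone. This is an illustration of the idea only, not the code `regexp2` actually uses; `coarseClock`, `startCoarseClock`, and `runWithTimeout` are hypothetical names, and the `step` callback stands in for one unit of matching work.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

var errTimeout = errors.New("match timeout")

// coarseClock is a hypothetical stand-in for a shared timeout clock. A
// background goroutine refreshes now once per period, so hot loops can read
// the time with a cheap atomic load instead of calling time.Now.
type coarseClock struct {
	now  int64 // UnixNano of the last tick; accessed atomically
	stop chan struct{}
}

// startCoarseClock starts the background worker.
func startCoarseClock(period time.Duration) *coarseClock {
	c := &coarseClock{now: time.Now().UnixNano(), stop: make(chan struct{})}
	go func() {
		ticker := time.NewTicker(period)
		defer ticker.Stop()
		for {
			select {
			case t := <-ticker.C:
				atomic.StoreInt64(&c.now, t.UnixNano())
			case <-c.stop:
				return
			}
		}
	}()
	return c
}

// Stop ends the background goroutine.
func (c *coarseClock) Stop() { close(c.stop) }

// runWithTimeout calls step until it reports completion, comparing the
// coarse clock against a deadline that is computed once up front.
func runWithTimeout(c *coarseClock, timeout time.Duration, step func() bool) error {
	deadline := time.Now().Add(timeout).UnixNano()
	for !step() {
		if atomic.LoadInt64(&c.now) >= deadline {
			return errTimeout
		}
	}
	return nil
}

func main() {
	clock := startCoarseClock(time.Millisecond)
	defer clock.Stop()

	// A step that never finishes stands in for a runaway match.
	err := runWithTimeout(clock, 20*time.Millisecond, func() bool { return false })
	fmt.Println("runaway work:", err)

	// A step that finishes immediately never hits the deadline.
	err = runWithTimeout(clock, time.Second, func() bool { return true })
	fmt.Println("quick work:", err)
}
```

Because the clock only advances once per period, a timeout in this sketch can fire up to one period late, which is consistent with the "approximately" wording above.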

## Goroutine leak error
If you're using a library during unit tests (e.g. https://github.com/uber-go/goleak) that validates all goroutines are exited then you'll likely get an error if you or any of your dependencies use regexes with a MatchTimeout.
To remedy the problem you'll need to tell the unit test to wait until the background timeout goroutine is exited.

```go
func TestSomething(t *testing.T) {
    // deferred calls run last-in-first-out, so the clock stops before goleak checks
    defer goleak.VerifyNone(t)
    defer regexp2.StopTimeoutClock()

    // ... test
}

// or

func TestMain(m *testing.M) {
    // setup
    // ...

    // run
    code := m.Run()

    // tear down
    regexp2.StopTimeoutClock()
    // TestMain has no *testing.T, so check for leaks with goleak.Find
    if err := goleak.Find(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    os.Exit(code)
}
```

This will add ~100ms runtime to each test (or TestMain). If that's too much time you can set the clock cycle rate of the timeout goroutine in an init function in a test file. `regexp2.SetTimeoutCheckPeriod` isn't threadsafe so it must be set up before starting any regexes with timeouts.

```go
func init() {
	// speed up testing by making the timeout clock 1ms
	regexp2.SetTimeoutCheckPeriod(time.Millisecond)
}
```

## ECMAScript compatibility mode
In this mode the engine provides compatibility with the [regex engine](https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-regular-expression-objects) described in the ECMAScript specification.

Additionally a Unicode mode is provided which allows parsing of `\u{CodePoint}` syntax. That syntax is only available when both modes are enabled.

## Library features that I'm still working on
- Regex split

## Potential bugs
I've run a battery of tests against regexp2 from various sources and found the debug output matches the .NET engine, but .NET and Go handle strings very differently.  I've attempted to handle these differences, but most of my testing deals with basic ASCII with a little bit of multi-byte Unicode.  There's a chance that there are bugs in the string handling related to character sets with supplementary Unicode chars.  Right-to-Left support is coded, but not well tested either.

## Find a bug?
I'm open to new issues and pull requests with tests if you find something odd!