Legacy Regular Expressions
Canopy uses two different regular expression syntaxes, depending on whether you run a search from the command line or define custom detection rules.
Canopy’s search bar uses Apache Lucene’s regular expression syntax. A regular expression is a way to match patterns in data using placeholder characters, called operators.
Canopy supports regular expressions in the following queries:
- regexp
- query_string
To search with a regular expression, use standard Apache Lucene regular expression syntax.
Type regex: followed by the regular expression.
For example,
regex:[0-9]{3}
will find all instances of 3 digits in a row.
Lucene’s regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:
. ? + * | { } [ ] ( ) " \
Depending on the optional operators enabled, the following characters may also be reserved:
# @ & < > ~
To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:
\@ # renders as a literal '@'
\\ # renders as a literal '\'
"john@smith.com" # renders as 'john@smith.com'
Lucene’s regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.
. – Matches any character. For example:
ab. # matches 'aba', 'abb', 'abz', etc.
? – Repeat the preceding character zero or one times. Often used to make the preceding character optional. For example:
abc? # matches 'ab' and 'abc'
+ – Repeat the preceding character one or more times. For example:
ab+ # matches 'ab', 'abb', 'abbb', etc.
* – Repeat the preceding character zero or more times. For example:
ab* # matches 'a', 'ab', 'abb', 'abbb', etc.
{} – Minimum and maximum number of times the preceding character can repeat. For example:
a{2} # matches 'aa'
a{2,4} # matches 'aa', 'aaa', and 'aaaa'
a{2,} # matches 'a' repeated two or more times
| – OR operator. The match will succeed if the longest pattern on either the left side OR the right side matches. For example:
abc|xyz # matches 'abc' and 'xyz'
( … ) – Forms a group. You can use a group to treat part of the expression as a single character. For example:
abc(def)? # matches 'abc' and 'abcdef' but not 'abcd'
[ … ] – Match one of the characters in the brackets. For example:
[abc] # matches 'a', 'b', 'c'
Inside the brackets, - indicates a range unless - is the first character or escaped. For example:
[a-c] # matches 'a', 'b', or 'c'
[-abc] # '-' is first character. Matches '-', 'a', 'b', or 'c'
[abc\-] # Escapes '-'. Matches 'a', 'b', 'c', or '-'
A ^ before a character in the brackets negates the character or range. For example:
[^abc] # matches any character except 'a', 'b', or 'c'
[^a-c] # matches any character except 'a', 'b', or 'c'
[^-abc] # matches any character except '-', 'a', 'b', or 'c'
[^abc\-] # matches any character except 'a', 'b', 'c', or '-'
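The standard operators above can be tried out locally. The following is a minimal sketch that uses Python's built-in re module, which is a different engine from Lucene's; it is used here only because these particular operators behave the same way for the patterns shown. re.fullmatch stands in for Lucene's whole-term matching, and the CASES table and check helper are illustrative names, not part of Canopy.

```python
import re

# (pattern, terms that should match, terms that should not match)
CASES = [
    ("ab.", ["aba", "abz"], ["ab"]),
    ("abc?", ["ab", "abc"], ["abcc"]),
    ("ab+", ["ab", "abbb"], ["a"]),
    ("ab*", ["a", "abb"], ["b"]),
    ("a{2,4}", ["aa", "aaaa"], ["a", "aaaaa"]),
    ("abc|xyz", ["abc", "xyz"], ["abcxyz"]),
    ("abc(def)?", ["abc", "abcdef"], ["abcd"]),
    ("[^a-c]", ["d", "-"], ["a", "b"]),
    (r"[abc\-]", ["a", "-"], ["d"]),
]

def check(pattern, should_match, should_not_match):
    """Return True if the pattern accepts and rejects the expected whole terms."""
    accepted = all(re.fullmatch(pattern, t) for t in should_match)
    rejected = not any(re.fullmatch(pattern, t) for t in should_not_match)
    return accepted and rejected

results = {pattern: check(pattern, yes, no) for pattern, yes, no in CASES}
for pattern, ok in results.items():
    print(f"{pattern:12} {'ok' if ok else 'MISMATCH'}")
```

Optional operators such as ~, #, <>, & and @ have no direct equivalent in Python's re module, so this approach only covers the standard set.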
You can use the flags parameter to enable more optional operators for Lucene’s regular expression engine.
To enable multiple operators, use a | separator. For example, a flags value of COMPLEMENT|INTERVAL enables the COMPLEMENT and INTERVAL operators.
ALL – Enables all optional operators.
"" (empty string) – Alias for the ALL value.
COMPLEMENT – Enables the ~ operator. You can use ~ to negate the shortest following pattern. For example:
a~bc # matches 'adc' and 'aec' but not 'abc'
EMPTY – Enables the # (empty language) operator. The # operator doesn’t match any string, not even an empty string.
If you create regular expressions by programmatically combining values, you can pass # to specify “no string.” This lets you avoid accidentally matching empty strings or other unwanted strings. For example:
#|abc # matches 'abc' but nothing else, not even an empty string
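As a rough illustration of building an expression programmatically, the sketch below joins literal values into an alternation and falls back to # when there are no values. The helper names escape_literal and alternation are hypothetical, and the reserved character set is simply the one listed earlier in this section. Without the fallback, joining an empty list would produce an empty pattern, which is exactly the accidental match that # is meant to prevent.

```python
# Reserved characters listed in this section (standard plus optional operators).
RESERVED = set('.?+*|{}[]()"\\#@&<>~')

def escape_literal(text):
    """Backslash-escape reserved characters so the text is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

def alternation(values):
    """Join literal values with |, or return # (matches nothing) if there are none."""
    if not values:
        return "#"  # only meaningful when the empty language operator is enabled
    return "|".join(escape_literal(v) for v in values)

print(alternation(["abc", "john@smith.com"]))  # abc|john\@smith\.com
print(alternation([]))                          # #
```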
INTERVAL – Enables the <> operators. You can use <> to match a numeric range. For example:
foo<1-100> # matches 'foo1', 'foo2' ... 'foo99', 'foo100'
foo<01-100> # matches 'foo01', 'foo02' ... 'foo99', 'foo100'
INTERSECTION – Enables the & operator, which acts as an AND operator. The match succeeds if the patterns on both the left side AND the right side match. For example:
aaa.+&.+bbb # matches 'aaabbb'
ANYSTRING – Enables the @ operator. You can use @ to match any entire string.
You can combine the @ operator with the & and ~ operators to create “everything except” logic. For example:
@&~(abc.+) # matches everything except terms beginning with 'abc'
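The logic behind & and the @&~ combination can be pictured as set operations on whole-term matches. The sketch below emulates that logic in Python, whose re module has no & or ~ operators of its own, so the two sides are evaluated separately and combined with boolean logic. The function names are illustrative only.

```python
import re

def matches(pattern, term):
    """Whole-term match, mirroring how a Lucene regular expression is applied."""
    return re.fullmatch(pattern, term) is not None

def intersection(left, right, term):
    """Emulates left&right: the term must match both patterns."""
    return matches(left, term) and matches(right, term)

def everything_except(pattern, term):
    """Emulates @&~(pattern): any term that the pattern does not match."""
    return not matches(pattern, term)

print(intersection("aaa.+", ".+bbb", "aaabbb"))  # True
print(intersection("aaa.+", ".+bbb", "aaabb"))   # False
print(everything_except("abc.+", "abcdef"))      # False
print(everything_except("abc.+", "xyz"))         # True
```

Under this emulation, the bare term 'abc' is not excluded by abc.+, because .+ requires at least one character after 'abc'.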
NONE – Disables all optional operators.
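As a sketch of how a flags value might be assembled, the helper below validates flag names and joins them with the | separator. The flag names are those defined by Lucene's RegExp class. The request body layout is an assumption modeled on an Elasticsearch-style regexp query and may not match what Canopy expects; the field name hostname and the functions join_flags and regexp_query are hypothetical.

```python
import json

# Flag names as defined by Lucene's RegExp class.
KNOWN_FLAGS = {
    "ALL", "COMPLEMENT", "EMPTY", "INTERVAL",
    "INTERSECTION", "ANYSTRING", "NONE",
}

def join_flags(*flags):
    """Validate flag names and join them with the | separator."""
    unknown = [f for f in flags if f not in KNOWN_FLAGS]
    if unknown:
        raise ValueError(f"unknown flag(s): {', '.join(unknown)}")
    return "|".join(flags)

def regexp_query(field, pattern, *flags):
    """Build a query body (assumed layout, Elasticsearch-style regexp query)."""
    return {
        "query": {
            "regexp": {
                field: {"value": pattern, "flags": join_flags(*flags)}
            }
        }
    }

body = regexp_query("hostname", "web<1-20>", "INTERVAL", "COMPLEMENT")
print(json.dumps(body, indent=2))
```

Validating names before joining catches typos such as INTERVALS early, instead of sending a flags value the engine would reject or ignore.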
Lucene’s regular expression engine does not support anchor operators, such as ^ (beginning of line) or $ (end of line). To match a term, the regular expression must match the entire string.
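Because the expression must match the entire term, a pattern meant to match only part of a term needs explicit wildcards on either side. The sketch below illustrates this with Python's re.fullmatch standing in for Lucene's whole-term behavior; Python's engine is not Lucene's, and the helper names are hypothetical.

```python
import re

def whole_term_match(pattern, term):
    """Match the pattern against the entire term, as Lucene does."""
    return re.fullmatch(pattern, term) is not None

def contains(pattern):
    """Wrap a pattern so it may appear anywhere inside a term."""
    return f".*({pattern}).*"

print(whole_term_match("[0-9]{3}", "123"))               # True
print(whole_term_match("[0-9]{3}", "abc123"))            # False
print(whole_term_match(contains("[0-9]{3}"), "abc123"))  # True
```

The parentheses in contains keep an alternation such as abc|xyz grouped, so the surrounding .* applies to both branches.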