
.NET Regular Expressions Syntax

Regular expressions (also known as regex) were popularized by Perl and today are widely used in many programming languages. Although all implementations (or, as regex developers say, flavors) are very similar, there are differences in syntax, so a regular expression may behave differently or not work at all when used with a different regex engine.

 

This article explains the syntax used by the Microsoft .NET Framework regular expressions engine, located in the System.Text.RegularExpressions namespace. It covers the expression syntax and the options specified in the System.Text.RegularExpressions.RegexOptions enumeration. Bookmark this .NET regular expressions syntax reminder so you can find it quickly when needed.

Anchors

Anchor Description
^ Start of the string by default, or start of every line if RegexOptions.Multiline is used. RegexOptions.Singleline does not affect anchors.
If in the first position inside a character class [ ], it means NOT.
If inside a character class but not in the first position, it has no special meaning and matches the ^ character as a literal.
$ End of the string (or just before a final new line at the end of the string) by default, or end of every line if RegexOptions.Multiline is used.
\A Start of the string, similar to ^, but the RegexOptions.Multiline option doesn't affect its behavior. If Multiline is used, \A still matches only at the start of the complete string, not at the start of each line.
\Z End of the string (or just before a final new line at the end of the string), like $. If RegexOptions.Multiline is used its behavior stays the same: \Z still matches only at the end of the string, not at the end of each line. Use \z to match only at the very end of the string.
\b Word boundary. Start or end of a word; matches between \w and \W. The delimiter could be a dot, comma, space etc.
Inside [ ] it means backspace \u0008.
\b is often used to match whole words. For example, \bcan\b matches "can" in "I can do it" but does not match when the pattern is only a substring of a longer word, e.g. Americans.
\B Non-word boundary. \B matches everywhere \b does not match (between \w and \w or between \W and \W), for example inside a word or inside a run of spaces.
\B does the opposite of \b. So, \Bcan\B matches "can" in Americans but matches nothing in "I can do it".
\G Position where the previous match ended. This is useful when working with a MatchCollection object, using the Regex.Matches or Match.NextMatch methods.
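
The anchor rules above can be checked with a short C# sketch. The class name AnchorDemo and the helper Count are hypothetical names chosen for illustration; only Regex.Matches and Regex.IsMatch come from System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Hypothetical helper: counts matches of a pattern under the given options.
    public static int Count(string input, string pattern, RegexOptions options = RegexOptions.None)
        => Regex.Matches(input, pattern, options).Count;

    static void Main()
    {
        string text = "one\ntwo\nthree";

        Console.WriteLine(Count(text, @"^\w+"));                          // 1: start of the string only
        Console.WriteLine(Count(text, @"^\w+", RegexOptions.Multiline));  // 3: start of every line
        Console.WriteLine(Count(text, @"\A\w+", RegexOptions.Multiline)); // 1: \A ignores Multiline

        Console.WriteLine(Regex.IsMatch("I can do it", @"\bcan\b")); // True: whole word
        Console.WriteLine(Regex.IsMatch("Americans", @"\bcan\b"));   // False: inside a word
        Console.WriteLine(Regex.IsMatch("Americans", @"\Bcan\B"));   // True: inside a word
    }
}
```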

Alternation and escape

Metacharacter Description Example
| Alternation, OR. Splits regular expression into multiple alternatives. red|green|blue matches either red or green or blue.

Alternation John|Johnny will match "John" in the text "My name is Johnny" because when both alternatives could match, the first one wins. You can put the longer alternative first, like Johnny|John, but it's better to consider what you are trying to match. A word boundary ( \b ) may be a better solution. In this example, the expression \bJohn\b|\bJohnny\b or \b(John|Johnny)\b matches only if John or Johnny is a whole word.
\ Backslash changes the normal behavior of metacharacters and literals. A backslash and the following character are called an escape sequence. ^ represents the beginning of the string or line, but \^ matches only the ^ character as a literal.
Similarly, n matches the letter n, but the escaped \n means new line.
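
The ordering effect of alternation is easy to see in code. This is a minimal C# sketch; AlternationDemo and First are hypothetical names, and the sample sentence is the one used in the text.

```csharp
using System;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Hypothetical helper: returns the text of the first match, or "" if none.
    public static string First(string input, string pattern)
        => Regex.Match(input, pattern).Value;

    static void Main()
    {
        string text = "My name is Johnny";

        Console.WriteLine(First(text, "John|Johnny"));          // John: first alternative wins
        Console.WriteLine(First(text, "Johnny|John"));          // Johnny: longer alternative first
        Console.WriteLine(First(text, @"\b(John|Johnny)\b"));   // Johnny: whole words only

        // An escaped caret is a literal character, not an anchor.
        Console.WriteLine(Regex.IsMatch("a^b", @"a\^b"));       // True
    }
}
```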

Character classes

Class Description
[ ]

Character set represents a collection of characters. Dash can be used to set character range. Caret (^) is used for negation.

[aeiou] matches any vowel
[A-Z] matches any uppercase letter from A to Z
[^0-9] matches every character that is not a digit

\w Any word character: letters, digits and underscore. For ASCII text this is the same as [A-Za-z0-9_], but \w matches letters in any language, not only the Latin alphabet.
For example, \w matches the letters a, b, or c and the digits 1, 2, 3, but not spaces, brackets, tabs, new lines etc.
\W Any non-word character
\s Any whitespace character (space, tab, new line etc.)
\S Any non-whitespace character
\d Any digit
\D Any non-digit
\p{name} Any character that belongs to the specified Unicode general category or named block.
For example, \p{IsGreek} matches Greek letters.
\P{name} Any character that does not belong to the specified Unicode general category or named block.
For example, \P{IsGreek} matches every character that is not in the IsGreek Unicode block.

. Dot represents any single character except a new line (\n), unless RegexOptions.Singleline is used.
(\d){1,2}.(\d){1,2}.\d{4} matches 12/5/1980 and 2-4-2002, which is fine if a date is entered, but the expression also matches 1294402345 or 43W23@9999. Make sure you really need any character when you use a dot in a regular expression.
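
To illustrate the dot pitfall described above, the sketch below compares the loose pattern with a stricter one that allows only / or - as the separator. The stricter pattern, the class name CharClassDemo and its members are illustrative assumptions, not part of the original article.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // The dot accepts any character as a separator.
    public const string LooseDate = @"^\d{1,2}.\d{1,2}.\d{4}$";
    // Assumed alternative: a character class restricts the separator to / or -.
    public const string StrictDate = @"^\d{1,2}[/-]\d{1,2}[/-]\d{4}$";

    public static bool IsLoose(string s) => Regex.IsMatch(s, LooseDate);
    public static bool IsStrict(string s) => Regex.IsMatch(s, StrictDate);

    static void Main()
    {
        foreach (string s in new[] { "12/5/1980", "2-4-2002", "1294402345", "43W23@9999" })
        {
            // The first two inputs pass both patterns; the last two pass only the loose one.
            Console.WriteLine(s + " loose=" + IsLoose(s) + " strict=" + IsStrict(s));
        }

        // \p{IsGreek} matches a character from the Greek block (here U+03B1, alpha).
        Console.WriteLine(Regex.IsMatch("\u03B1", @"\p{IsGreek}")); // True
        Console.WriteLine(Regex.IsMatch("a", @"\p{IsGreek}"));      // False
    }
}
```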

Grouping

Group Description
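( ) Unnamed group. By default, the regex engine stores its captured value in memory so it can be used later with the $1, $2,... notation or in backreferences. If the ExplicitCapture option is used, an unnamed group doesn't save its value.
(?<name>expression)
or
(?'name'expression)
Named group. Named groups make an expression more readable and easier to maintain. Use either < > or ' ' after (? to name a group.
(?<name1-name2> ) Balancing group. Removes the most recent capture of name2 and stores in name1 the text between that capture and the current group; useful for matching nested constructs such as balanced parentheses.
(?: ) Group, but do not capture. The regex engine doesn't store the value of the group in memory. If you need a group only for a subexpression and you don't need its value, this can be good for performance. An unnamed group ( ) also acts as a noncapturing group (?: ) if RegexOptions.ExplicitCapture is used.
(?imnsx-imnsx: ) Inline mode modifiers, which change regular expression options inside the expression. Not all options can be changed inline.
(?= ) Positive lookahead. Succeeds only if the subexpression matches to the right of this position; consumes no characters.
(?! ) Negative lookahead. Succeeds only if the subexpression does not match to the right of this position.
(?<= ) Positive lookbehind. Succeeds only if the subexpression matches to the left of this position.
(?<! ) Negative lookbehind. Succeeds only if the subexpression does not match to the left of this position.
(?> ) Nonbacktracking (atomic) subexpression. Once it has matched, the engine does not backtrack into it.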
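
A few of the group constructs above, sketched in C#. GroupDemo and its methods are hypothetical names used only for this illustration; the sample inputs are made up.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Named groups: turns text containing "2002-04" into "04/2002".
    public static string MonthYear(string input)
    {
        Match m = Regex.Match(input, @"(?<year>\d{4})-(?<month>\d{2})");
        return m.Success ? m.Groups["month"].Value + "/" + m.Groups["year"].Value : "";
    }

    // Positive lookahead: digits only when followed by " USD"; the suffix is not part of the match.
    public static string Amount(string input)
        => Regex.Match(input, @"\d+(?= USD)").Value;

    // Number of groups in a pattern, including group 0 (the whole match).
    public static int GroupCount(string pattern, RegexOptions options = RegexOptions.None)
        => new Regex(pattern, options).GetGroupNumbers().Length;

    static void Main()
    {
        Console.WriteLine(MonthYear("Released 2002-04"));  // 04/2002
        Console.WriteLine(Amount("50 EUR or 100 USD"));    // 100
        Console.WriteLine(GroupCount(@"(\d+)-(?:\d+)"));   // 2: whole match + one capturing group
        Console.WriteLine(GroupCount(@"(\d+)-(\d+)", RegexOptions.ExplicitCapture)); // 1: whole match only
    }
}
```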

Greedy and lazy quantifiers

Greedy quantifiers Lazy counterparts Description
* *? Zero or more occurrences. Can also be written as {0,}.
+ +? One or more occurrences, the same as {1,}.
? ?? Zero or one occurrence; makes a character or group optional. Can also be written as {0,1}.
{n} {n}? The preceding character or group is repeated exactly n times.
{n,} {n,}? The preceding character or group is repeated n times or more.
{n,m} {n,m}? The preceding character or group is repeated from n to m times.
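
A greedy quantifier takes as many characters as it can and gives some back only if the rest of the pattern fails; its lazy counterpart takes as few as it can. The C# sketch below shows the difference; QuantifierDemo and First are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns the text of the first match, or "" if none.
    public static string First(string input, string pattern)
        => Regex.Match(input, pattern).Value;

    static void Main()
    {
        string html = "<b>bold</b>";

        Console.WriteLine(First(html, "<.+>"));      // <b>bold</b>  (greedy: longest possible)
        Console.WriteLine(First(html, "<.+?>"));     // <b>          (lazy: shortest possible)

        Console.WriteLine(First("aaaa", "a{2,3}"));  // aaa
        Console.WriteLine(First("aaaa", "a{2,3}?")); // aa
    }
}
```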

 

.NET Regular Expressions Escapes

Escape Description Example
\v Matches vertical tab \u000B
\x## ASCII character in hexadecimal format (exactly two hexadecimal digits) \x20
\# Backreference. # is a positive integer. \b(\w)\w*\B\1\b
Matches words that have the same first and last letter. The first letter is captured with (\w) and referenced later in the expression with \1.
\k<name>
or
\k'name'
Named backreference. Gets the matched value of a named group. \b(?<FirstLetter>\w)\w*\B\k<FirstLetter>\b
Same as the previous example, matches words with the same first and last letter. In this case a named group called FirstLetter is used and referenced later with the \k<FirstLetter> backreference.
\### ASCII character in octal format. I avoid it because it interferes with backreferences. To reduce confusion, use ASCII characters in hexadecimal format. Single-digit escape sequences \1 to \9 are always treated as backreferences. Two- or three-digit sequences depend on the number of groups in the expression.

\040 (space)

\11 is a backreference if there are 11 or more groups in the expression; otherwise it represents an ASCII character in octal format.

\u#### Character represented by its Unicode code point. Requires exactly four hexadecimal digits (0 to F). See www.unicode.org for character tables. \u2122 is ™, the trademark character.
\c# ASCII control character, Ctrl plus character. \cC is equal to Ctrl + C.
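
The backreference examples from the table can be run as follows. This is a sketch in C#; BackrefDemo, Words and the sample sentence are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class BackrefDemo
{
    // Hypothetical helper: returns every match of the pattern as a string.
    public static string[] Words(string input, string pattern)
        => Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string text = "dad went to that window";

        // Numbered backreference: \1 repeats whatever (\w) captured.
        Console.WriteLine(string.Join(",", Words(text, @"\b(\w)\w*\B\1\b")));
        // dad,that,window

        // Named backreference: same logic with a named group.
        Console.WriteLine(string.Join(",", Words(text, @"\b(?<FirstLetter>\w)\w*\B\k<FirstLetter>\b")));
        // dad,that,window

        Console.WriteLine(Regex.IsMatch("a b", @"a\x20b"));     // True: \x20 is a space
        Console.WriteLine(Regex.IsMatch("\u2122", @"\u2122"));  // True: trademark character
        Console.WriteLine(Regex.IsMatch("\u0003", @"\cC"));     // True: Ctrl + C is \u0003
    }
}
```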

Matching nonprintable characters

Classes Description
\a bell (alarm) \u0007
\e escape \u001B
[\b] backspace \u0008; outside [ ], \b means word boundary
\f form feed \u000C
\n new line \u000A
\r carriage return \u000D
\t horizontal tab \u0009
\v vertical tab \u000B
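
A short C# sketch of these escapes in use. NonPrintableDemo and Lines are hypothetical names; note that inside the C# string literals (without @) the sequences \r, \n, \t and \b are the actual control characters, while inside the @"..." patterns they are regex escapes.

```csharp
using System;
using System.Text.RegularExpressions;

static class NonPrintableDemo
{
    // Hypothetical helper: splits text on Windows (\r\n) or Unix (\n) line endings.
    public static string[] Lines(string input)
        => Regex.Split(input, @"\r?\n");

    static void Main()
    {
        Console.WriteLine(Lines("a\r\nb\nc").Length);               // 3

        Console.WriteLine(Regex.IsMatch("col1\tcol2", @"\w\t\w"));  // True: tab between word characters

        // The input "\b" is a single backspace character (\u0008).
        Console.WriteLine(Regex.IsMatch("\b", @"[\b]"));            // True: [\b] is backspace
        Console.WriteLine(Regex.IsMatch("\b", @"\b"));              // False: \b is a word boundary
    }
}
```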

RegEx comments

Syntax Description
# X-mode comment. The comment starts at an unescaped # character and ends at the end of the line. Expressions that use this comment must have the IgnorePatternWhitespace option turned on.
(?# ... ) Inline comment, enclosed in parentheses. The comment starts with the "(?#" sequence and ends with the first ")".
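
Both comment styles can be combined in one pattern. The sketch below is an illustration only: the pattern, the class name CommentDemo and the sample input are assumptions, not taken from the original article.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentDemo
{
    // With IgnorePatternWhitespace, unescaped white space is ignored
    // and # starts a comment that runs to the end of the line.
    public const string Pattern = @"
        (?<year>\d{4})    # four-digit year
        -                 # separator
        (?<month>\d{2})   # two-digit month
        (?# inline comment, closed by the first closing parenthesis)
    ";

    // Hypothetical helper: returns "year/month" or "" when nothing matches.
    public static string Parse(string input)
    {
        Match m = Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace);
        return m.Success ? m.Groups["year"].Value + "/" + m.Groups["month"].Value : "";
    }

    static void Main()
    {
        Console.WriteLine(Parse("Released 2002-04")); // 2002/04
    }
}
```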

 

Substitutions

Substitution Description
${GroupName} Substitutes the text captured by the named group.
$number Substitutes the text captured by the group that has the specified index.
$$ Represents the $ character as a literal. This is needed because the dollar ( $ ) character has a special meaning in a replacement string.
$& Substitutes a copy of the whole match.
$` Substitutes all text of the input string before the match.
$' Substitutes all text of the input string after the match.
$+ Substitutes the last captured group.
$_ Substitutes the entire input string.
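
The substitutions above are used in the replacement argument of Regex.Replace. A minimal C# sketch follows; SubstitutionDemo, SwapNames and the sample strings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class SubstitutionDemo
{
    // ${name}: reorders two words using named groups.
    public static string SwapNames(string input)
        => Regex.Replace(input, @"(?<first>\w+)\s(?<last>\w+)", "${last}, ${first}");

    static void Main()
    {
        Console.WriteLine(SwapNames("John Smith"));                              // Smith, John
        Console.WriteLine(Regex.Replace("2002-04", @"(\d+)-(\d+)", "$2/$1"));    // 04/2002
        Console.WriteLine(Regex.Replace("price 100", @"\d+", "$$$&"));           // price $100
        Console.WriteLine(Regex.Replace("abc", "b", "[$`|$']"));                 // a[a|c]c
        Console.WriteLine(Regex.Replace("abc", "b", "$_"));                      // aabcc
    }
}
```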

.NET Regular Expressions Options (RegexOptions)

RegEx Option Inline character Description
Compiled N/A The regular expression is compiled to MSIL (Microsoft intermediate language) code. A compiled regular expression executes faster, but the expression is compiled dynamically and the initial compilation takes some time. Because of this, use this option only when the same expression is executed many times.
CultureInvariant N/A Ignores cultural differences in language, for example when comparing characters case-insensitively with IgnoreCase.
ECMAScript N/A Enables ECMAScript-compliant behavior. Can be combined only with the IgnoreCase, Multiline and Compiled options.
ExplicitCapture n Specifies that unnamed groups are not captured. An unnamed group "(expression)" acts like "(?:expression)". Only explicitly named or numbered groups in the form "(?<name>expression)" are captured.
IgnoreCase i Matching is case insensitive.
IgnorePatternWhitespace x Used when you want to add comments to a regular expression. IgnorePatternWhitespace ignores all unescaped white space in the pattern (space, tab, new lines), but you can still use escape sequences (e.g. \s or \t).
Multiline m Changes the behavior of the caret ( ^ ) and dollar ( $ ) characters. If the Multiline option is used, ^ matches the start of every line, not just the start of the string, and $ matches the end of every line, not just the end of the string. The escapes \A and \Z always match the start and end of the string.
None N/A No regex options are specified; Regex uses the default values.
RightToLeft N/A By default, a regular expression searches text from left to right. With this option the search starts at the end of the input and moves from right to left. The RightToLeft option can't be used inline because that could lead to an infinite loop.
Singleline s Changes the behavior of the dot ( . ). If the Singleline option is used, the dot matches any character, including the new line character \n.
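
Options can be passed as RegexOptions flags (combined with |) or, where an inline character exists, set inside the pattern. The C# sketch below is illustrative; OptionsDemo and its helper Matches are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionsDemo
{
    // Hypothetical helper: true if the pattern matches under the given options.
    public static bool Matches(string input, string pattern, RegexOptions options = RegexOptions.None)
        => Regex.IsMatch(input, pattern, options);

    static void Main()
    {
        Console.WriteLine(Matches("HELLO", "hello"));                           // False
        Console.WriteLine(Matches("HELLO", "hello", RegexOptions.IgnoreCase));  // True
        Console.WriteLine(Matches("HELLO", "(?i)hello"));                       // True: inline i

        Console.WriteLine(Matches("a\nb", "a.b"));                              // False: dot stops at \n
        Console.WriteLine(Matches("a\nb", "a.b", RegexOptions.Singleline));     // True

        // Flags are combined with the | operator.
        RegexOptions both = RegexOptions.IgnoreCase | RegexOptions.Multiline;
        Console.WriteLine(Regex.Matches("Red\nred", "^red$", both).Count);      // 2

        // RightToLeft starts searching at the end of the input.
        Console.WriteLine(Regex.Match("a1 b2 c3", @"\w\d", RegexOptions.RightToLeft).Value); // c3
    }
}
```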
