
Writing Regular Expressions Character Classes

In the first tutorial, How To Write Regular Expressions In .Net, I explained the basics of the regular expression syntax used in the .NET Framework. Also, you can Test .Net Regular Expressions with four online testers: Extract Data, Search and Replace, Data Validation and Split String. In this tutorial, we'll go one step further and demystify regular expression character classes.

A character class matches any single character from a set (or, we can say, class) of characters.

Dot character as the most general character class

The simplest character class is the dot or period ( . ) metacharacter. The dot matches any single character except a new line. It works much like the ? character in Windows search or the _ character in SQL LIKE patterns. For example, the regular expression b.t matches any three-character sequence that starts with "b" and ends with "t", such as bit, bat, but or bot.
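As a minimal sketch of the b.t example, the snippet below collects every match from a made-up sample string. It assumes C# 9 or later (top-level statements); in an older project the same statements can go inside a Main method. The variable names are only illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Made-up sample text: "bt" has no character between the letters and
// "boot" has two, so only the three-letter words should match.
string text = "bit bat but bot bt boot";

var words = new List<string>();
foreach (Match m in Regex.Matches(text, "b.t"))
{
    words.Add(m.Value);
}
Console.WriteLine(string.Join(", ", words)); // bit, bat, but, bot

// By default the dot does not match a new line (\n);
// RegexOptions.Singleline changes that.
bool plainDot = Regex.IsMatch("b\nt", "b.t");
bool singlelineDot = Regex.IsMatch("b\nt", "b.t", RegexOptions.Singleline);
Console.WriteLine(plainDot);      // False
Console.WriteLine(singlelineDot); // True
```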

The dot is often used in validation scenarios together with the caret ( ^ ) and dollar ( $ ) metacharacters, where a string must have a certain length but may contain different characters. To match a dot as a literal, we need to escape it. The regular expression \. will search for dots in text.
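Both ideas from the paragraph above can be tried in a few lines. This is only a sketch with invented sample strings, written as C# 9 top-level statements: the first pattern checks for exactly four characters, and the second counts literal dots.

```csharp
using System;
using System.Text.RegularExpressions;

// ^....$ accepts a string of exactly four characters, whatever they are.
string fourChars = "^....$";
bool exact = Regex.IsMatch("a1-_", fourChars);
bool tooShort = Regex.IsMatch("abc", fourChars);
bool tooLong = Regex.IsMatch("abcde", fourChars);
Console.WriteLine(exact);    // True
Console.WriteLine(tooShort); // False
Console.WriteLine(tooLong);  // False

// \. matches only a literal dot. The verbatim string (@"...") keeps
// the backslash from being treated as a C# escape sequence.
int dotCount = Regex.Matches("www.example.com", @"\.").Count;
Console.WriteLine(dotCount); // 2
```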

More limited character classes

The dot is very general. Before using it, be sure that you really want to allow ANY character at that position. Often, developers need more limited classes. For example, if a user should enter a five-digit postal code, the regular expression should accept only five digits, not five letters or spaces. Fortunately, there are six useful regex tokens:

\d - matches any digit: 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. By default, .NET also treats decimal digits from other Unicode scripts as digits; with RegexOptions.ECMAScript, \d matches only 0 through 9. So, the expression \d\d\d will match any three digits in sequence.

\D - matches any character that is not a digit (anything not matched by \d)

\w - matches any word character (any letter or digit, plus underscore ( _ ) )

\W - matches any character that is not a letter, digit or underscore (anything not matched by \w)

\s - matches any whitespace character (for example space, tab or new line)

\S - matches any character that is not whitespace (digits, letters and so on; in general, anything not matched by \s)
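The six tokens above can be sketched as follows, using the postal code scenario and a few invented inputs. The code assumes C# 9 top-level statements, and the names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// Five digits and nothing else, anchored with ^ and $.
string postalCode = @"^\d\d\d\d\d$";
bool validCode = Regex.IsMatch("90210", postalCode);
bool hasLetter = Regex.IsMatch("9021a", postalCode);
bool hasSpace = Regex.IsMatch("90 10", postalCode);
Console.WriteLine(validCode); // True
Console.WriteLine(hasLetter); // False
Console.WriteLine(hasSpace);  // False

// The uppercase tokens match whatever the lowercase ones do not,
// so removing their matches keeps only digits or word characters.
string digitsOnly = Regex.Replace("Tel: 555-0199", @"\D", "");
string wordCharsOnly = Regex.Replace("hello, world!", @"\W", "");
string noWhitespace = Regex.Replace("a b\tc", @"\s", "");
Console.WriteLine(digitsOnly);    // 5550199
Console.WriteLine(wordCharsOnly); // helloworld
Console.WriteLine(noWhitespace);  // abc
```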

Character classes using brackets

[ ... ] is the general form of a character class. The shorthands mentioned above are just abbreviations for classes that could be written this way. Inside "[" and "]" we can place a set of characters, and the expression will match any one of those characters.

For example, m[ae]n matches man or men (inside [ ] are the letters "a" and "e", so the expression looks for either of them); the expression b[aeoui]t matches bat, bet, bot, but, bit.

Character classes can be used to make parts of an expression case insensitive even if you don't use RegexOptions.IgnoreCase. For example, to match all possible case variations of the string RegEx, you can write [rR][eE][gG][eE][xX]. This will match regex, Regex, RegEx etc.
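A short sketch of both bracket examples, with made-up input strings and C# 9 top-level statements. The last two lines show the "partial" case: only the first letter is allowed to vary, which RegexOptions.IgnoreCase cannot express on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// m[ae]n accepts only "a" or "e" between "m" and "n".
var found = new List<string>();
foreach (Match m in Regex.Matches("man men min mun", "m[ae]n"))
{
    found.Add(m.Value);
}
Console.WriteLine(string.Join(", ", found)); // man, men

// Every letter may be upper or lower case.
string anyCase = "[rR][eE][gG][eE][xX]";
bool upper = Regex.IsMatch("REGEX", anyCase);
bool mixed = Regex.IsMatch("RegEx", anyCase);
Console.WriteLine(upper); // True
Console.WriteLine(mixed); // True

// Only the first letter may vary; the rest must be lower case.
bool firstOnly = Regex.IsMatch("Regex", "[rR]egex");
bool allUpper = Regex.IsMatch("REGEX", "[rR]egex");
Console.WriteLine(firstOnly); // True
Console.WriteLine(allUpper);  // False
```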

Character class ranges

A dash ( - ) is used to describe a range of characters, so we don't need to list them all one by one. For example:

[1-6] is shorter than, but the same as, [123456], and the expression [5-8] is like [5678].

[0-9] means any digit from 0 to 9. It is equal to \d when RegexOptions.ECMAScript is used; by default, \d in .NET also matches decimal digits from other Unicode scripts.

[a-z] means any lowercase letter (reversed ranges, like [z-a] or [9-4], are not allowed)

[A-Z] means any uppercase letter

Finally, you can write [0-9a-zA-Z] to match any digit, or any lowercase letter, or any uppercase letter.

If you want to search for a dash ( - ) as a literal, place it first inside the character class, like this: [-abc] matches a or b or c or -.
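The range rules above can be checked with a small sketch like this one. The inputs are invented and the code assumes C# 9 top-level statements. The try/catch relies on the Regex constructor throwing ArgumentException (or a type derived from it) for an invalid pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// [5-8] behaves like [5678].
bool inRange = Regex.IsMatch("7", "^[5-8]$");
bool outOfRange = Regex.IsMatch("4", "^[5-8]$");
Console.WriteLine(inRange);    // True
Console.WriteLine(outOfRange); // False

// Several ranges in one class: I, D, x, 9 and Z match; ':', ' ' and '-' do not.
int alphanumerics = Regex.Matches("ID: x9-Z", "[0-9a-zA-Z]").Count;
Console.WriteLine(alphanumerics); // 5

// A dash placed first is a literal: a, -, b and c match; '_' does not.
int withDash = Regex.Matches("a-b_c", "[-abc]").Count;
Console.WriteLine(withDash); // 4

// A reversed range is rejected when the pattern is parsed.
bool reversedRejected = false;
try
{
    _ = new Regex("[z-a]");
}
catch (ArgumentException)
{
    reversedRejected = true;
}
Console.WriteLine(reversedRejected); // True
```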

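One interesting thing: \d is the negation of \D, \w is the negation of \W, and \s is the negation of \S. Thus, the expressions [\d\D], [\w\W] and [\s\S] are all equivalent: each matches any character at all, including a new line, which the dot does not match.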
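One way to see this, sketched with a tiny made-up string and C# 9 top-level statements, is to count matches. A class that combines a token with its negation matches every character, including the new line that a plain dot skips.

```csharp
using System;
using System.Text.RegularExpressions;

// Three characters: 'a', a new line, and 'b'.
string sample = "a\nb";

int dotMatches = Regex.Matches(sample, ".").Count;
int digitPair = Regex.Matches(sample, @"[\d\D]").Count;
int wordPair = Regex.Matches(sample, @"[\w\W]").Count;
int spacePair = Regex.Matches(sample, @"[\s\S]").Count;

Console.WriteLine(dotMatches); // 2 (the new line is skipped)
Console.WriteLine(digitPair);  // 3
Console.WriteLine(wordPair);   // 3
Console.WriteLine(spacePair);  // 3
```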

Negated character classes and class subtraction

A caret ( ^ ) inside square brackets, immediately after the opening bracket [, means NOT (outside the square brackets, it indicates the start of the string). Thus, the expression [^0-9] means any character that is not a digit from 0 to 9. If ^ is not in the first position inside the character class, it matches a literal ^ and has no special meaning.

Notice that most metacharacters, such as ., | or $, are not metacharacters inside a class and have no special meaning there. Because of that, the expression m[a|e]n will match the strings man and men, but also the string m|n.
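The following sketch exercises the negation and literal-metacharacter rules on a few invented strings. It assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// [^0-9] matches everything except the digits 0-9, so removing
// its matches leaves only those digits.
string digits = Regex.Replace("Room 42B", "[^0-9]", "");
Console.WriteLine(digits); // 42

// A caret that is not first in the class is an ordinary character.
bool literalCaret = Regex.IsMatch("^", "[a^]");
Console.WriteLine(literalCaret); // True

// Inside a class, | is a literal, not alternation.
bool pipeInClass = Regex.IsMatch("m|n", "m[a|e]n");
bool pipeNotListed = Regex.IsMatch("m|n", "m[ae]n");
Console.WriteLine(pipeInClass);   // True
Console.WriteLine(pipeNotListed); // False

// Inside a class, the dot matches only a literal dot.
bool dotMatchesLetter = Regex.IsMatch("x", "[.]");
Console.WriteLine(dotMatchesLetter); // False
```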

It is possible to subtract classes. For example, the expression [a-z-[o-t]] matches all characters in the a-z range except the characters in the o-t range.
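As a rough illustration of subtraction, the sketch below applies [a-z-[o-t]] to a made-up run of letters. It assumes C# 9 top-level statements. Note that this syntax is specific to .NET and is not supported by every regex engine.

```csharp
using System;
using System.Text.RegularExpressions;

string subtracted = "[a-z-[o-t]]";

// 'm' is in a-z and outside o-t; 'p' falls in the subtracted range.
bool matchesM = Regex.IsMatch("m", subtracted);
bool matchesP = Regex.IsMatch("p", subtracted);
Console.WriteLine(matchesM); // True
Console.WriteLine(matchesP); // False

// Removing every match leaves only the letters from the o-t range.
string leftover = Regex.Replace("mnopqrstuv", subtracted, "");
Console.WriteLine(leftover); // opqrst
```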

Conclusion

As you can see, character classes can make regular expressions shorter and easier to read. Check .Net Regular Expressions Syntax in case you forget the meaning of some element of regular expression syntax. To find out how to use regular expressions in C# or VB.NET, check the Using Regex Class in ASP.NET tutorial.

In the .Net Regular Expressions Quantifiers tutorial, I will cover quantifiers, one more step in creating powerful regular expressions. Happy coding!

