
How To Write Regular Expressions In .Net

Regular expressions (a.k.a. RegEx) look almost like a separate language inside the .Net Framework. Their syntax seems cryptic to beginners, so many ASP.NET and other .Net developers avoid learning regular expressions for as long as possible and rely only on classic string manipulation when working with text.

Although regular expressions can look hard to understand at first glance, keep in mind that they are just specially formatted strings with a small number of grammar rules to follow. This article explains the rules for writing regular expressions, so with a little practice you can add this powerful tool to your skill set.

To execute regular expressions against some text, we use classes from the System.Text.RegularExpressions namespace. The Regex class represents a regular expression. You can read more about how to use the Regex class in C# or VB.NET, including four common uses, in the Using Regex Class in ASP.NET tutorial. There is also an online ASP.NET web application you can use to Test .Net Regular Expressions. This small application includes four web forms: Extract Data, Search and Replace, Data Validation and Split String. It is probably best to start with the Extract Data tester and try the example expressions from this tutorial.

Metacharacters and literals

Even if the term "regular expressions" (or the short name, RegEx) sounds strange to you, you are probably already familiar with simple file search in DOS or Windows. For example, if you want to search for files in Windows, you can type something like *.pdf to find all files with the .pdf extension. In this case, the * character has a special meaning: "any sequence of characters". Thus, this search returns all files whose names end with the string ".pdf". Regular expressions are similar, but they have more rules and more special characters. Special characters in regular expressions are called metacharacters, and all other, normal characters are called literals.

Starting with Regular Expressions syntax

Regular expressions can be simple or complex. Here is one very simple regular expression:

car

Of course, this is just a simple string of three letters, but it is also a regular expression that contains three literals. This expression matches the string car regardless of its position in the text: in the middle of a bigger word, at the beginning or end of a word, or as the whole text. This is similar to the Find... function in Notepad and many other programs: type a few letters, click the Find Next button, and the application marks where the typed character sequence occurs.
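As a rough illustration of this kind of literal search, here is a minimal C# sketch. The class name LiteralDemo, the helper FindPositions and the sample sentence are made up for this example; only Regex.Matches and the Match.Index property come from the .Net Framework itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LiteralDemo
{
    // Hypothetical helper: returns the zero-based index of every
    // occurrence of the pattern in the text.
    public static int[] FindPositions(string text, string pattern)
    {
        MatchCollection matches = Regex.Matches(text, pattern);
        int[] positions = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            positions[i] = matches[i].Index;
        }
        return positions;
    }

    public static void Main()
    {
        // "car" occurs inside "carpet", inside "scar" and as a separate word.
        string text = "A carpet, a scar and a car.";
        foreach (int position in FindPositions(text, "car"))
        {
            Console.WriteLine("Found \"car\" at index " + position);
        }
    }
}
```

All three occurrences are reported, because a pattern made only of literals does not care whether the match is a whole word or part of a bigger one.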

Matching the start or the end of the string

To narrow the previous search to the start or the end of the string only, we need the ^ (caret) and $ (dollar) metacharacters. The caret ( ^ ) means "start of the text", and the dollar ( $ ) means "end of the text".

The expression ^car matches "car" only if the text starts with "car" and ignores other occurrences in the middle or at the end of the string. In addition, the Regex class accepts the RegexOptions.Multiline flag. If the Multiline option is passed to the Regex class constructor, the text is treated as separate lines, so ^car matches at the start of every line, not just at the start of the complete text.

The expression car$ matches car only if it is at the end of the string (or just before a final newline character that ends the string). If RegexOptions.Multiline is used, it matches at the end of each line too.

And if we use both the caret and dollar metacharacters to build an expression like ^car$, it matches only if the text or the line (depending on whether the Multiline option is used) is equal to "car".
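The effect of the Multiline option is easiest to see by counting matches with and without it. The following is only a sketch: the class AnchorDemo, the helper CountMatches and the three-line sample text are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Hypothetical helper: counts matches of the pattern, optionally
    // letting ^ and $ match at the start and end of every line.
    public static int CountMatches(string text, string pattern, bool multiline)
    {
        RegexOptions options = multiline ? RegexOptions.Multiline : RegexOptions.None;
        Regex regex = new Regex(pattern, options);
        return regex.Matches(text).Count;
    }

    public static void Main()
    {
        // Three lines separated by "\n" (newline) characters.
        string text = "car park\ncar wash\nused car";

        Console.WriteLine(CountMatches(text, "^car", false)); // 1: start of the text only
        Console.WriteLine(CountMatches(text, "^car", true));  // 2: first and second line
        Console.WriteLine(CountMatches(text, "car$", true));  // 1: only the last line ends with car

        Console.WriteLine(Regex.IsMatch("car", "^car$"));     // True
        Console.WriteLine(Regex.IsMatch("carpet", "^car$"));  // False
    }
}
```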

How to match special characters (metacharacters)

The \ (backslash) metacharacter is used to match metacharacters literally and also to give a special meaning to some literals. For example, the regular expression ^b matches a string that starts with the letter b, but the escaped form \^b searches the string for the literal substring "^b". For literals it works the other way around: the expression d just matches the letter d, but the escaped \d means any digit (for example, 0 to 9); the expression n matches the letter n, but the escaped \n means a new line. The backslash is itself a metacharacter, so to search a string for this character you need to use \\.

Two very useful escape sequences are \A and \Z. \A matches the start of the text and \Z matches the end of the text (or the position just before a final newline; lowercase \z matches only the very end). They are similar to ^ and $, but the difference is that the Multiline option does not affect \A and \Z. You can find a complete list of metacharacters and escape sequences on the .Net Regular Expressions Syntax summary page.
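In C# source code the backslash is also the escape character of ordinary string literals, so patterns are usually written as verbatim strings (@"..."), which pass backslashes to the regex engine unchanged. The sketch below assumes a made-up class EscapeDemo and helper ContainsLiteral; Regex.Escape is a real framework method that puts a backslash in front of every metacharacter for you.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Hypothetical helper: true when the text contains the given
    // characters literally, whatever metacharacters they include.
    public static bool ContainsLiteral(string text, string literal)
    {
        return Regex.IsMatch(text, Regex.Escape(literal));
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("a^b", "^b"));            // False: text does not start with b
        Console.WriteLine(Regex.IsMatch("a^b", @"\^b"));          // True: literal "^b" is found
        Console.WriteLine(Regex.Match("room 42", @"\d").Value);   // 4: the first digit
        Console.WriteLine(Regex.IsMatch(@"C:\temp", @"\\"));      // True: one literal backslash
        Console.WriteLine(ContainsLiteral("price: $5", "$5"));    // True

        // \A is not affected by the Multiline option, ^ is.
        string text = "bus\ncar";
        Console.WriteLine(Regex.IsMatch(text, "^car", RegexOptions.Multiline));   // True
        Console.WriteLine(Regex.IsMatch(text, @"\Acar", RegexOptions.Multiline)); // False
    }
}
```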

Finding only whole words

The expression car matches the string anywhere in the text, even if it is part of a larger word. To find only whole words, use the \b sequence, which means the start or end of a word (a word boundary). For example, the expression \bcat\b matches the whole word cat, but not cat as part of category, communication, ducat, scat or location.

The letter b without \ matches "b". Adding \ before it turns it into an escape sequence with a special meaning to the regular expression engine.
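The difference between a plain literal search and a whole-word search can be sketched like this. WordBoundaryDemo, CountOccurrences, CountWholeWords and the sample sentence are names and data invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Hypothetical helper: counts every occurrence, including parts of larger words.
    public static int CountOccurrences(string text, string word)
    {
        return Regex.Matches(text, Regex.Escape(word)).Count;
    }

    // Hypothetical helper: counts only occurrences surrounded by word boundaries.
    public static int CountWholeWords(string text, string word)
    {
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Matches(text, pattern).Count;
    }

    public static void Main()
    {
        string text = "The cat sat in a category of ducat and scat.";

        // cat, category, ducat and scat all contain "cat".
        Console.WriteLine(CountOccurrences(text, "cat")); // 4

        // Only the standalone word has a boundary on both sides.
        Console.WriteLine(CountWholeWords(text, "cat"));  // 1
    }
}
```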

Alternation - OR condition

The previous problems, like matching an exact word in the middle, at the start or at the end of the text, are very simple tasks that could be done without regular expressions, for example with common System.String class methods. But as problems get harder, regular expressions become a more useful tool and reveal their power.

Regular expressions use the | metacharacter (known as the vertical bar, or pipe) to choose between two or more alternatives, similar to Or in VB.NET or || in C#.

For example, the regular expression ^(red|blue)$ matches both strings "red" and "blue". The parentheses matter: without them, ^red|blue$ means "starts with red" or "ends with blue", so it would also match text such as "redwood".
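One detail worth checking in code is how | combines with the ^ and $ anchors. The vertical bar splits the whole expression, so ^red|blue$ is read as "^red" or "blue$"; wrapping the alternatives in parentheses keeps both anchors in force for every alternative. The class AlternationDemo and the helper IsRedOrBlue below are hypothetical names used only for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
    // Hypothetical helper: true only when the whole text is "red" or "blue".
    public static bool IsRedOrBlue(string text)
    {
        return Regex.IsMatch(text, "^(red|blue)$");
    }

    public static void Main()
    {
        // Without parentheses: "starts with red" OR "ends with blue".
        Console.WriteLine(Regex.IsMatch("red", "^red|blue$"));       // True
        Console.WriteLine(Regex.IsMatch("redwood", "^red|blue$"));   // True
        Console.WriteLine(Regex.IsMatch("navy blue", "^red|blue$")); // True

        // With parentheses: the whole text must be one of the alternatives.
        Console.WriteLine(IsRedOrBlue("red"));       // True
        Console.WriteLine(IsRedOrBlue("blue"));      // True
        Console.WriteLine(IsRedOrBlue("redwood"));   // False
        Console.WriteLine(IsRedOrBlue("navy blue")); // False
    }
}
```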

Case sensitive or case insensitive

Regular expressions are case sensitive by default. You can write an alternation like b|B to match both cases of the letter B, but the .Net RegEx engine offers an easier solution: use RegexOptions.IgnoreCase for case insensitive expressions. Be aware that this option affects the complete expression, so if you need just one part of the expression to be case insensitive, use the classic way with alternation or use character classes (more on character classes in the next tutorial).
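Here is a short sketch of these options side by side. CaseDemo and MatchesIgnoringCase are made-up names. The last two lines use the inline option group (?i:...), which this tutorial does not cover; it is shown only as one more way the .Net engine can limit case insensitivity to part of an expression.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaseDemo
{
    // Hypothetical helper: matches the pattern while ignoring letter case everywhere.
    public static bool MatchesIgnoringCase(string text, string pattern)
    {
        return Regex.IsMatch(text, pattern, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("BLUE", "blue"));      // False: case sensitive by default
        Console.WriteLine(MatchesIgnoringCase("BLUE", "blue")); // True: option covers the whole expression

        // Alternation makes only the first letter case insensitive.
        Console.WriteLine(Regex.IsMatch("Blue", "(b|B)lue"));  // True
        Console.WriteLine(Regex.IsMatch("BLUE", "(b|B)lue"));  // False: "LUE" is still case sensitive

        // Inline option group: IgnoreCase applied to the letter b only.
        Console.WriteLine(Regex.IsMatch("Blue", "(?i:b)lue")); // True
        Console.WriteLine(Regex.IsMatch("BLUE", "(?i:b)lue")); // False
    }
}
```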

Conclusion

As you can see, regular expressions are not hard at all once you understand these few grammar rules. There is a short .Net Regular Expressions Syntax summary you can use as a reminder in case you forget some metacharacter or escape sequence.

Don't forget to Test .Net Regular Expressions. For beginners it is probably best to start with the Extract Data page, which just searches for strings in the given text. To use this tester, insert some text in the "Input Text" textbox, write a regular expression in the "Regular Expression" textbox, and finally click the "Find Matches" button. All strings matched by the RegEx engine will be listed below. You can test all examples from this tutorial, like extracting complete words, matching the start or the end of a line or of the complete text, alternation, case sensitivity etc.

This is just the first step in understanding regular expressions, but you can already see that they are not so difficult. In the next tutorial, Writing Regular Expressions Character Classes, I cover regular expression classes. I hope this tutorial series will be helpful and that soon you'll impress your boss or coworkers with some "cryptic" but useful regular expressions :). Happy coding!

