::Stuff for the multi-spec coder;

Coding, formats, standards, and other practical things.


Metacharacters - iterators


As the name suggests, iterator metacharacters (also called quantifiers) deal with repetition. An iterator specifies how many times the preceding element may occur; that element can be a literal character, a metacharacter, or a group of characters and metacharacters forming a sub-expression. Iterators can be greedy or non-greedy.

Greedy iterators match as many occurrences as possible while still letting the rest of the pattern match. This can be a problem when a pattern could stop at several points in the text: the greedy match runs to the last of them and may swallow far more than you intended.

Non-greedy iterators, on the other hand, match as few occurrences as possible. A non-greedy iterator is simply a greedy iterator followed by a question mark ('?'), which tells the engine to stop at the first point where the rest of the pattern can match.

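For instance, in the string "dooooggy", o+ matches "oooo", and g* matches "gg" when it is tried at the first 'g'. For the same string, the non-greedy expression o+? matches just "o", and g*? matches an empty string. (Because g* is allowed to match nothing, a plain search from the start of the string also finds an empty match before the 'd'.)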
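The "dooooggy" example can be checked with a short sketch in Python's re module (the choice of Python and the variable names are assumptions made for illustration; the article itself is not tied to one engine). The g* patterns are tried at the first 'g', since a plain search would stop at an empty match at the start of the string.

```python
import re

text = "dooooggy"

# Greedy: take as many 'o' characters as possible.
greedy_o = re.search(r"o+", text).group()
# Non-greedy: stop after the minimum, a single 'o'.
lazy_o = re.search(r"o+?", text).group()

# Try g* exactly where the run of 'g' characters begins.
first_g = text.index("g")
greedy_g = re.compile(r"g*").match(text, first_g).group()
lazy_g = re.compile(r"g*?").match(text, first_g).group()

print(repr(greedy_o), repr(lazy_o), repr(greedy_g), repr(lazy_g))
```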

The following table lists the iterators and marks each as greedy or non-greedy:

* Matches zero or more times. "Greedy" in nature and is similar to {0,}
+ Matches one or more times. "Greedy" in nature and is similar to {1,}
? Matches zero or one time. "Greedy" in nature and is similar to {0,1}
{n} Matches exactly n times. "Greedy" in nature.
{n,} Matches at least n times. "Greedy" in nature.
{n,m} Matches at least n but not more than m times. "Greedy" in nature.
*? Matches zero or more times. "Non-greedy" in nature and is similar to {0,}?
+? Matches one or more times. "Non-greedy" in nature and is similar to {1,}?
?? Matches zero or one time. "Non-greedy" in nature and is similar to {0,1}?
{n}? Matches exactly n times. "Non-greedy" in nature; behaves the same as {n}.
{n,}? Matches at least n times. "Non-greedy" in nature.
{n,m}? Matches at least n but not more than m times. "Non-greedy" in nature.

Hence, curly brackets ({ }) let you set the minimum and maximum number of times to match an expression, by writing digits in the form {n,m}. Here n is the minimum number of times to match and m is the maximum. The form {n} is equivalent to {n,n}, meaning exactly n times. The open-ended form {n,} sets no upper limit, while the minimum remains n times.
Be wary of using large n or m values, as large numbers consume more memory and can slow down the regex.

If a curly bracket occurs in any other context, it is treated as a regular character.
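As a rough illustration of the three brace forms, the sketch below uses Python's re module (an assumption; other engines may differ in detail) to list which run lengths of the letter 'a' each pattern accepts in full. The helper accepted_lengths is a hypothetical name introduced only for this example.

```python
import re

def accepted_lengths(pattern, letter="a", longest=5):
    """Return the run lengths 0..longest that the pattern matches in full."""
    return [n for n in range(longest + 1)
            if re.fullmatch(pattern, letter * n)]

exactly_two = accepted_lengths(r"a{2}")      # only a run of 2
at_least_two = accepted_lengths(r"a{2,}")    # 2 and anything longer
two_to_three = accepted_lengths(r"a{2,3}")   # 2 or 3

# In Python, a brace that does not form a valid iterator
# is treated as an ordinary character.
literal_brace = re.fullmatch(r"x{y}", "x{y}") is not None

print(exactly_two, at_least_two, two_to_three, literal_brace)
```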

Let us examine some relevant examples of iterators to grasp the concept better.

Starting with some instances of greedy expressions:

Tentat.*ve will match strings like 'Tentative', 'Tentatove', 'Tentatsjgkve', 'Tentatve' etc.
Tentat.+ve will match everything that 'Tentat.*ve' matches except 'Tentatve'.
Tentat.?ve will match strings like 'Tentative', 'Tentatove', 'Tentatve' but not 'Tentatsjkshgve' etc.
Tentati{2}ve will match 'Tentatiive'.
Tentati{2,3}ve will match 'Tentatiive' and 'Tentatiiive'.
Tentati{2,}ve will match 'Tentatiive', 'Tentatiiive', 'Tentatiiiiive', 'Tentatiiiiiiiive' etc.
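The greedy patterns above can be tried against a small list of sample strings. This is a minimal sketch assuming Python's re module; the candidates list and the full_matches helper are hypothetical names used only here.

```python
import re

candidates = ["Tentative", "Tentatve", "Tentatsjgkve",
              "Tentatiive", "Tentatiiive", "Tentatiiiive"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"Tentat.*ve")           # any filler, including none
plus = full_matches(r"Tentat.+ve")           # at least one filler character
optional = full_matches(r"Tentat.?ve")       # at most one filler character
two = full_matches(r"Tentati{2}ve")          # exactly two 'i'
two_three = full_matches(r"Tentati{2,3}ve")  # two or three 'i'
two_plus = full_matches(r"Tentati{2,}ve")    # two or more 'i'

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", two), ("{2,3}", two_three), ("{2,}", two_plus)]:
    print(name, result)
```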

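The non-greedy forms of the above expressions accept exactly the same strings; what changes is how much text each iterator takes when more than one match is possible:

Tentat.*?ve will match 'Tentatve', 'Tentative' etc., but in 'Tentativeve' it stops at the first 've' and matches only 'Tentative', where Tentat.*ve matches all of 'Tentativeve'.
Tentat.+?ve will match 'Tentative', 'Tentatove' etc., again stopping at the first 've' that follows at least one character.
Tentat.??ve will match 'Tentatve' and 'Tentative'; it tries to skip the optional character first.
Tentati{2}?ve will match 'Tentatiive', exactly like Tentati{2}ve.
Tentati{2,3}?ve will match 'Tentatiive' and 'Tentatiiive'; the 've' that must follow forces the same result as the greedy form.
Tentati{2,}?ve will match 'Tentatiive', 'Tentatiiive' etc., for the same reason.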
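Greedy and non-greedy forms accept the same strings and differ only in how much text they take when there is a choice. The sketch below, assuming Python's re module and the made-up sample text "Tentativeve", shows one case where the two forms differ and one where the rest of the pattern forces the same result.

```python
import re

text = "Tentativeve"  # hypothetical sample with two possible endings

# Greedy .* runs to the last 've'; non-greedy .*? stops at the first.
greedy = re.search(r"Tentat.*ve", text).group()
lazy = re.search(r"Tentat.*?ve", text).group()

# On its own, a non-greedy range takes the minimum...
bare_lazy = re.search(r"i{2,3}?", "Tentatiiive").group()
bare_greedy = re.search(r"i{2,3}", "Tentatiiive").group()
# ...but when 've' must follow, it has to expand to make the match succeed.
forced = re.search(r"Tentati{2,3}?ve", "Tentatiiive").group()

print(greedy, lazy, bare_lazy, bare_greedy, forced)
```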

The difference in output is easier to see in the following, more explicit examples.

For the string "Grrreaaat!", the non-greedy expression a{2,3}? returns "aa", while the greedy expression a{2,3} returns "aaa". To put it more explicitly, a non-greedy expression returns the minimum allowed by the regex. Therefore, the regex r*? returns an empty string, since it means r{0,}?, whose minimum is zero.

Regex /D[A-Z]*?G/ is a non-greedy regular expression, which matches a 'D', followed by only as many capital letters as are needed to find a 'G'.
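A quick check of these two examples, sketched with Python's re module (an assumption, as is the sample string "DEFGHIG", which was chosen because it contains two possible 'G' endings):

```python
import re

word = "Grrreaaat!"
greedy_a = re.search(r"a{2,3}", word).group()   # takes all three 'a'
lazy_a = re.search(r"a{2,3}?", word).group()    # settles for the minimum, two
lazy_r = re.search(r"r*?", word).group()        # minimum is zero: empty match

letters = "DEFGHIG"  # hypothetical sample text
lazy_dg = re.search(r"D[A-Z]*?G", letters).group()   # stops at the first 'G'
greedy_dg = re.search(r"D[A-Z]*G", letters).group()  # runs to the last 'G'

print(greedy_a, lazy_a, repr(lazy_r), lazy_dg, greedy_dg)
```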

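Some engines provide a modifier that switches all iterators into "non-greedy" mode at once, for example the U modifier in PCRE and PHP. Note that in Perl and JavaScript the /g modifier means global matching and has nothing to do with greediness.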

Metacharacters - alternatives


There are times when you need to match any one of a series of alternatives. The target strings may share the same outer pattern and differ only in some inner part. For such cases there is the alternative metacharacter, the pipe symbol '|', which lets you list the alternatives to try. For example, suppose your target strings are 'the', 'thy', 'thee' and 'this'. You can write the|thy|thee|this, or more compactly th(e+|y|is) (note that e+ also accepts 'theee' and longer; use ee? to allow only one or two). The parentheses are important: they mark where the alternatives start and end. Without them, the+|y|is would mean 'the+' or 'y' or 'is', because the alternation extends over the whole expression.

Further, alternatives are tried from left to right, and the first one that lets the overall pattern match is used. In other words, alternation is not greedy: it does not look for the longest alternative. For example, when searching for the|thee in "thee", only the "the" part is matched, because the first alternative already succeeds. A later alternative is used only when an earlier one fails: matching th(e|ei)r against "their" first tries 'e', fails because no 'r' follows, then backtracks to 'ei' and matches the entire string.

Another noteworthy point about the alternative metacharacter (|) is that it is interpreted literally when enclosed within square brackets. So if you write [ee|y|is], you are actually matching a single character from the set [eyis|].
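The points about grouping, left-to-right ordering and the literal pipe inside square brackets can be sketched as follows, assuming Python's re module (all variable names are illustrative):

```python
import re

words = ["the", "thy", "thee", "this"]

# Both spellings of the alternation accept all four target words.
flat_ok = all(re.fullmatch(r"the|thy|thee|this", w) for w in words)
grouped_ok = all(re.fullmatch(r"th(e+|y|is)", w) for w in words)

# Alternatives are tried left to right: 'the' wins before 'thee' is tried.
first_wins = re.search(r"the|thee", "thee").group()

# A later alternative is used only after an earlier one fails.
their = re.fullmatch(r"th(e|ei)r", "their")
chosen = their.group(1)

# Inside square brackets the pipe is just another character in the set.
pipe_in_class = re.fullmatch(r"[ee|y|is]", "|") is not None

print(flat_ok, grouped_ok, first_wins, chosen, pipe_in_class)
```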

Citing a couple of examples of alternative metacharacters:

- Tentat(ive|ool) matches strings "Tentative" or "Tentatool".

- Th(is|ei|ee)r? matches "This", "Their" and "Thee".
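These two patterns can be tried in the same way. This is a small sketch assuming Python's re module; the non-matching samples "Tentat" and "The" are added here only for contrast.

```python
import re

first = [s for s in ["Tentative", "Tentatool", "Tentat"]
         if re.fullmatch(r"Tentat(ive|ool)", s)]

second = [s for s in ["This", "Their", "Thee", "The"]
          if re.fullmatch(r"Th(is|ei|ee)r?", s)]

print(first)
print(second)
```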

Metacharacters - backreferences


Metacharacters \1 through \9 are interpreted as backreferences. A backreference matches the same text that was previously matched by the corresponding sub-expression. For example, (.)\1+ matches two or more consecutive occurrences of the same character, like "cccc" or "dd". Likewise, (.+)\1+ matches repeated sequences like "cdcd" and "245245", in addition to everything (.)\1+ matches. Another pattern, (["']?)(\d+)\1, matches numbers in double quotes, in single quotes or without quotes, like "24", '7' or 88, provided the opening and closing quotes are the same.
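The three patterns can be exercised with a short sketch, assuming Python's re module, where \1 has the same meaning. The helper name matches and the negative samples are hypothetical additions for contrast.

```python
import re

def matches(pattern, text):
    """True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# (.)\1+ : one character followed by one or more copies of itself.
repeat_char = [matches(r"(.)\1+", s) for s in ["cccc", "dd", "cd"]]

# (.+)\1+ : a sequence followed by one or more copies of that sequence.
repeat_seq = [matches(r"(.+)\1+", s) for s in ["cdcd", "245245", "cdce"]]

# (["']?)(\d+)\1 : the closing quote must equal the opening one (or both absent).
quoted = r"""(["']?)(\d+)\1"""
numbers = [matches(quoted, s) for s in ['"24"', "'7'", "88", "\"24'"]]

print(repeat_char, repeat_seq, numbers)
```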

A backreference is a powerful tool because it lets a pattern require that a sub-expression matched earlier in the regular expression be matched again later in the same expression. Backreferences are named by the numbers 1 through 9 preceded by the backslash ('\') escape character. Each number refers to the corresponding group in the pattern, counted from left to right: in (abcd) (efgh) \1 \2, \1 refers to 'abcd' and \2 refers to 'efgh'.

What is noteworthy here is that a backreference matches the same literal string that the group matched the first time, even if the group's pattern could have matched other strings. A backreference therefore differs from simply repeating the same grouped subexpression later in the regular expression, which does not match the same targets. To elucidate further, let's say you have the following string to search:

"fox rabbit forest
fox forest rabbit
fox forest forest
fox rabbit rabbit"

A backreference expression like (rabbit|forest) \1 will find only the repeated words: "forest forest" on the third line and "rabbit rabbit" on the fourth.

On the other hand, if the subexpression is simply repeated within the regular expression (rabbit|forest) (rabbit|forest), the results will be:

"rabbit forest
forest rabbit
forest forest
rabbit rabbit"
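The contrast between a backreference and a repeated group can be reproduced on the four-line sample text. The sketch assumes Python's re module; finditer is used so that the whole matched text is collected rather than just the groups.

```python
import re

text = ("fox rabbit forest\n"
        "fox forest rabbit\n"
        "fox forest forest\n"
        "fox rabbit rabbit")

# \1 must repeat exactly what the group matched the first time.
with_backref = [m.group(0)
                for m in re.finditer(r"(rabbit|forest) \1", text)]

# Repeating the group lets each copy pick its alternative independently.
with_repeat = [m.group(0)
               for m in re.finditer(r"(rabbit|forest) (rabbit|forest)", text)]

print(with_backref)
print(with_repeat)
```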

When you need to insert a fixed string everywhere a pattern occurs in the target text, the 'replace all' utility of almost any application will do. But such a replacement is not context sensitive. What if the inserted text has to depend on what was matched? Backreferences are a powerful tool in such replacement patterns: they let you pick out just the parts of the match you need and rearrange them. For example, say the target string is:
"C42 E9 F112 G96670 E6658 AAA" and the substitution is (note the trailing space in both the search and the replacement pattern):

s/([A-Z])([0-9]{2,4}) /\2:\1 /g
then the output of the replacement will be:
"42:C E9 112:F G96670 6658:E AAA"

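The same substitution can be sketched with Python's re.sub (an assumption; the article uses Perl-style s/// notation). The pattern is written without stray spaces and with a trailing space after the digits, so that a letter followed by one digit or by five digits is left alone.

```python
import re

target = "C42 E9 F112 G96670 E6658 AAA"

# Group 1: one capital letter. Group 2: two to four digits.
# The trailing space ensures the digits are the whole token.
swapped = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", target)

print(swapped)
```

Only the tokens with two to four digits are rearranged; "E9", "G96670" and "AAA" pass through unchanged.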
Further, it is advisable to keep the group numbers used in a replacement pattern sequential, so that it stays readable and uncluttered. This can be achieved using "grouping without backreferencing", which some regular expression tools allow. If a group begins with a question mark and colon (?:), it is not captured and gets no backreference number. You can use the ?: syntax even when your backreferences are in the search pattern itself. For example:

Say the target string is:
"C-abc-42 # E:efgh:597 # E-ijk-11 # G-lmn-47"
and the substitution including the ?: pattern is:
s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g
the output shall be:
"C42 # E:efgh:597 # E11 # G47"
Here the second group in the search pattern is not numbered because of the ?: prefix, so \2 refers to the digits rather than to the hyphenated letters.
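A minimal sketch of this example with Python's re module (an assumption, as before) confirms both the output and the group numbering; the groups attribute of a compiled pattern reports how many capturing groups it has.

```python
import re

target = "C-abc-42 # E:efgh:597 # E-ijk-11 # G-lmn-47"
pattern = re.compile(r"([A-Z])(?:-[a-z]{3}-)([0-9]*)")

# (?:...) groups without capturing, so only two numbered groups exist.
group_count = pattern.groups

# \1 is the capital letter and \2 is the digits; the middle part is dropped.
cleaned = pattern.sub(r"\1\2", target)

print(group_count, cleaned)
```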

Conclusion


In essence, regular expressions make you the master of your data: you can specify it, validate it, manipulate it, replace it and put it to work.
They are an extremely powerful tool for handling large amounts of text, technical or not, and their relevance shows in their prevalence across tools such as editors, word processors, system tools and databases, and in programming languages and environments including Java, JScript, Visual Basic, VBScript, JavaScript, ECMAScript, C, C++, C#, elisp, Perl, Python, Tcl, Ruby, PHP, sed and awk.
Regular expressions are at the very heart of many programs written in some of these languages, which says a lot about the power they give the programmer. To conclude, regular expressions are an indispensable gift of theoretical computer science.

