
RegExp - Regular Expressions

category: Advanced
Created by: Dr.Ferrous


RegExp (Regular Expressions) is a pattern string that describes the set of strings matching that pattern, according to certain rules. Regular expressions use brackets (round, square, curly) and special characters to build those rules.
To start, let's see some simple patterns.

What is RegExp

In PHP, a RegExp pattern is usually written as a string between two forward slash characters "/" ("/regexp/"), but you can use any other non-alphanumeric, non-whitespace character (other than the backslash) as the delimiter, as long as you use the same character at both ends of the pattern. If the delimiter also appears inside the pattern it must be escaped with a backslash, so it is easier to pick one that does not occur in the pattern you are looking for (e.g. #regexp#).
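The choice of delimiter can be illustrated with a short sketch. The sample string and variable names below are made up for the illustration:

```php
<?php
// The same search written with three different delimiters.
$subject = 'path/to/file';   // hypothetical sample string

// With "/" as delimiter, a "/" inside the pattern must be escaped.
$withSlashes = preg_match('/to\/file/', $subject);

// With "#" or "~" as delimiter, the "/" needs no escaping.
$withHashes = preg_match('#to/file#', $subject);
$withTildes = preg_match('~to/file~', $subject);

var_dump($withSlashes, $withHashes, $withTildes);   // each call returns int(1)
```

A delimiter other than "/" is mostly a readability choice for patterns that contain many slashes, such as URLs or file paths.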

 

  • The following regular expression: /s[ak]y/ matches the following words: say and sky (an expression within square brackets, "[ak]", is called a class pattern, or character class, and stands for exactly one of the listed characters).

 

  • A pattern that matches a single vowel can be written as: /[aeiou]/ (by listing the possible values in a class pattern). Used on its own, it matches any string that contains at least one lowercase vowel.

 

  • If you wish to allow uppercase vowels, add them too, /[aeiouAEIOU]/ (or you can use the "i" modifier, /[aeiou]/i - modifiers are presented below).

 

  • To match any letter written in lower case, you can write: /[abcdefghijklmnopqrstuvwxyz]/. Or a more compact form: /[a-z]/, where "a-z" means "any one character in the range from 'a' to 'z'".

 

  • Similarly, the pattern /[0-9]/ matches any single digit.
    To match a certain number of characters, put the quantity between curly braces, giving the minimum and maximum number of allowed repetitions. For example, the regular expression: /[aeiou]{2,4}/ matches a run of 2, 3 or 4 consecutive vowels ("ai", "oue", "auio", etc.).

 

  • To specify that the characters within square brackets may be repeated in the string, put "+" (one or more times) or "*" (zero or more times) after the square brackets. As an example, /s[ak]+y/ would match: sky, saay, sakky, etc.

 

  • To repeat several parts of a regular expression together, enclose those parts in round brackets (an expression within round brackets is called a subpattern).

  • The following RegExp, /(s[ak]y ){2,3}/, corresponds to two or three repetitions of the strings "say " and "sky ". This pattern would match: "say sky ", "say sky say ", etc. (Notice the space character after "y" in this RegExp: the matching strings must also have a space after each "y").
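The patterns above can be tried out with preg_match(), which is described later in this lesson. This is only a sketch with made-up sample strings; the "^" and "$" characters used in some of the patterns force the whole string to match and are explained in the next part:

```php
<?php
// Character class: "a" or "k" between "s" and "y".
$r1 = preg_match('/s[ak]y/', 'the sky is blue');      // 1: "sky" is found

// Quantifier {2,4}: a run of two to four vowels.
preg_match('/[aeiou]{2,4}/', 'queue', $m);
// $m[0] is "ueue": the first run of vowels, at most four characters long

// "+" requires at least one character from the class.
$r2 = preg_match('/^s[ak]+y$/', 'saay');              // 1
$r3 = preg_match('/^s[ak]+y$/', 'sy');                // 0: no "a" or "k" at all

// Subpattern repeated two or three times (note the trailing space).
$r4 = preg_match('/^(s[ak]y ){2,3}$/', 'say sky ');   // 1
$r5 = preg_match('/^(s[ak]y ){2,3}$/', 'say ');       // 0: only one repetition

var_dump($r1, $m[0], $r2, $r3, $r4, $r5);
```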

There are several special characters that are used in forming regular expressions.

 

  • If a circumflex accent (^) is the first symbol inside square brackets, it negates the class: the class then matches any character that is NOT listed between the brackets. So, /[^aeiou]/ will match any character that is not a lowercase vowel.

 

  • /[^a-z]/ matches any character that is not a lowercase letter.

 


When this character (^) is placed outside the square brackets, it represents the beginning of the string or line.

  • The regular expression /^s[ak]y/ matches the sub-string "say" or "sky" only if it is at the beginning of the subject string.
    There is also the dollar sign ($), which marks the end of the string or line.
  • /s[ak]y$/ matches "say" or "sky" only if it is at the end of the subject string.
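The two roles of "^", and the "$" anchor, can be checked with a few calls to preg_match() (described later in this lesson). The sample strings are assumptions chosen for the illustration:

```php
<?php
// "^" inside square brackets negates the class.
$r1 = preg_match('/[^aeiou]/', 'aei');       // 0: every character is a vowel
$r2 = preg_match('/[^aeiou]/', 'aeix');      // 1: "x" is not a vowel

// "^" outside square brackets anchors the match at the start.
$r3 = preg_match('/^s[ak]y/', 'sky high');   // 1
$r4 = preg_match('/^s[ak]y/', 'the sky');    // 0: "sky" is not at the start

// "$" anchors the match at the end.
$r5 = preg_match('/s[ak]y$/', 'the sky');    // 1
$r6 = preg_match('/s[ak]y$/', 'sky high');   // 0: "sky" is not at the end

var_dump($r1, $r2, $r3, $r4, $r5, $r6);
```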

Here is a list of more special characters and their role in regular expressions:

  • ^   - Indicates the beginning of a string
  • $   - Indicates the end of a string
  • .   - Any single character except newline
  • ()   - subpattern
  • []   - class pattern (one character out of those within the square brackets)
  • [^]   - any character except those in square brackets
  • \   - Escape character (disables the special role of the character that follows it)
  • +   - The character (or expression) before this sign must occur at least once (one or more times)
  • *   - The character (or expression) before this sign may occur zero or more times
  • ?   - The character (or expression) before this sign may occur zero times or once
  • |   - Alternatives (or)
  • {x}   - Exactly "x" occurrences
  • {x,y}   - Between "x" and "y" occurrences
  • {x,}   - At least x occurrences
  • \n   - New line ("\r\n" on Windows, where "\r" is the carriage return)
  • \t   - Tab

 

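For example, /(ho|ca)me/ matches the words home and came. The alternatives must be grouped with round brackets: inside square brackets "|" is an ordinary character, so /[ho|ca]me/ would match only a single character from the set (h, o, |, c, a) followed by "me".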


To put these characters (+ , * , ? , . , ( , ) , { , [ , | , ...) in a regexp pattern as literal characters, disabling their special role, you must prefix them with a backslash character "\".
For example, /[0-9]\*[0-9]/ matches a multiplication between two single digits ( "*" is no longer a repetition factor).
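A short sketch of alternation and escaping follows. The sample strings are made up; preg_quote() is a standard PHP function that adds the backslashes for you, which can help when the literal text comes from a variable:

```php
<?php
// Alternation grouped with round brackets.
$r1 = preg_match('/^(ho|ca)me$/', 'home');   // 1
$r2 = preg_match('/^(ho|ca)me$/', 'came');   // 1
$r3 = preg_match('/^(ho|ca)me$/', 'lame');   // 0

// An escaped "*" is a literal asterisk, not a repetition factor.
$r4 = preg_match('/[0-9]\*[0-9]/', '3*4');   // 1
$r5 = preg_match('/[0-9]\*[0-9]/', '34');    // 0: there is no "*" in the subject

// preg_quote() escapes the special characters of a literal string.
// The second argument names the delimiter so that it is escaped too.
$literal = '1+1=2?';
$r6 = preg_match('/^' . preg_quote($literal, '/') . '$/', '1+1=2?');   // 1

var_dump($r1, $r2, $r3, $r4, $r5, $r6);
```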

 


Besides these characters there are special formulas for shortening regexp expressions:

  • \w   - Alphanumeric characters plus "_". Equivalent: [a-zA-Z0-9_]
  • \W   - Non-word characters. Equivalent: [^a-zA-Z0-9_]
  • \s   - Whitespace characters. Equivalent: [ \t\r\n\v\f]
  • \S   - Non-whitespace characters. Equivalent: [^ \t\r\n\v\f]
  • \d   - Digits. Equivalent: [0-9]
  • \D   - Non-digits. Equivalent: [^0-9]

For example: /[\d\s]+/ matches a sequence of one or more digits and whitespace characters.
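The shorthand classes behave as in this sketch, where "^" and "$" are added so that the whole string has to match. The sample strings are assumptions made for the illustration:

```php
<?php
// \w: letters, digits and the underscore.
$r1 = preg_match('/^\w+$/', 'user_42');        // 1
$r2 = preg_match('/^\w+$/', 'user-42');        // 0: "-" is not a word character

// \d and \s combined in one class.
$r3 = preg_match('/^[\d\s]+$/', '12 34 56');   // 1
$r4 = preg_match('/^[\d\s]+$/', '12a');        // 0: "a" is neither digit nor whitespace

// \D: anything that is not a digit.
preg_match('/\D+/', '2024-05', $m);
// $m[0] is "-", the only non-digit run in the subject

var_dump($r1, $r2, $r3, $r4, $m[0]);
```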



Here are some examples of regular expressions:

  • (.*)   - any sequence of characters except newlines (by "."), repeated as often as possible (by "*")
  • (fa|te)rms   - matches "farms" and "terms"
  • ^www.+net$   - strings that begin with "www" and end with "net"
  • ^www\.[a-z0-9]+\.com$   - matches the "www.__.com" strings, the "__" can be any word that contains lowercase letters and numbers
  • (^[-+][0-9]+)   - any number that starts with "-" or "+"
  • \<tag\>(.*?)\<\/tag\>   - represents the content within <tag>...</tag>
  • \<tag\>(.[^\<]+)   - The string from <tag> up to the next "<"
  • ^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4})$   - Regular expression for email addresses
  • ^(http://|https://)?([^/]+)   - Regular expression for domain name of a URL
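Some of the patterns in this list can be tested as in the sketch below. The subject strings (including the example.com address) are hypothetical, and "#" is used as delimiter for the URL pattern because that pattern contains slashes:

```php
<?php
$r1 = preg_match('/^(fa|te)rms$/', 'farms');                      // 1
$r2 = preg_match('/^www\.[a-z0-9]+\.com$/', 'www.example.com');   // 1
$r3 = preg_match('/^www\.[a-z0-9]+\.com$/', 'www.Example.com');   // 0: uppercase "E"

// Lazy ".*?" stops at the first closing tag.
preg_match('/\<tag\>(.*?)\<\/tag\>/', 'a <tag>one</tag> b <tag>two</tag>', $m);
// $m[1] is "one"

// Domain name of a URL: the second subpattern holds the host.
preg_match('#^(http://|https://)?([^/]+)#', 'https://www.example.com/page.html', $u);
// $u[2] is "www.example.com"

var_dump($r1, $r2, $r3, $m[1], $u[2]);
```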

 


Besides the special characters and formulas used for shortening the regular expression, there are also other special letters called modifiers. They have a special role only if they are placed after the closing delimiter ("/regexp/mods"), and alter the behavior of a regular expression.
The most used RegExp modifiers are listed below:

  • i   - (ignore-case)   - letters in the pattern match both upper and lower case letters.
  • m   - (multiline)   - changes the role of "^" and "$". If "multiline" is not specified, they indicate the beginning and the end of the whole subject string, but when this modifier is added, they also match at the beginning and the end of each line within it.
  • s   - (dotall)   - makes the dot metacharacter in the pattern match all characters, including newlines.
  • x   - (extended)   - If this modifier is set, whitespace data characters in the RegExp pattern are totally ignored except when escaped or inside a character class.

 

You can add one or more modifiers at the end of the pattern. Example: /\d{3}-[a-z]+/i - searches for "nnn-word" sub-strings, where "nnn" is a 3-digit number and "word" can contain uppercase letters too.
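The effect of each modifier can be seen by running the same pattern with and without it. This is a small sketch with made-up subject strings:

```php
<?php
// i: ignore case.
$r1 = preg_match('/\d{3}-[a-z]+/', '123-WORD');    // 0
$r2 = preg_match('/\d{3}-[a-z]+/i', '123-WORD');   // 1

// m: "^" also matches at the start of each line.
$text = "first line\nsecond line";
$r3 = preg_match('/^second/', $text);              // 0
$r4 = preg_match('/^second/m', $text);             // 1

// s: the dot also matches a newline.
$r5 = preg_match('/first.second/', "first\nsecond");    // 0
$r6 = preg_match('/first.second/s', "first\nsecond");   // 1

// x: whitespace inside the pattern is ignored.
$r7 = preg_match('/\d{3} - [a-z]+/x', '123-word');      // 1

var_dump($r1, $r2, $r3, $r4, $r5, $r6, $r7);
```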


Usually, regular expressions are used in PHP for string matching and string substituting. PHP has special functions for these operations.

String matching - preg_match

The preg_match() function searches a string for a match to the regular expression given in the pattern.
Syntax: preg_match("pattern", "string", $matches)

  • "pattern" - The RegExp pattern to search for.
  • "string" - The input string.
  • $matches - Optional. If it is provided, it will contain the results of the search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
    preg_match() returns 1 if the pattern matches and 0 if it does not (it stops searching after the first match), or FALSE if an error occurred.
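As a small illustration of the return value and of the $matches array, here is a sketch that assumes a made-up string containing a year and a month:

```php
<?php
$pattern = '/(\d{4})-(\d{2})/';   // two subpatterns: year and month

$found = preg_match($pattern, 'Released: 2015-03', $matches);
// $found is 1
// $matches[0] is "2015-03" (the full match)
// $matches[1] is "2015"    (first subpattern)
// $matches[2] is "03"      (second subpattern)

$notFound = preg_match($pattern, 'no date here', $empty);
// $notFound is 0 and $empty is an empty array

var_dump($found, $matches, $notFound, $empty);
```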

Let's see some examples with preg_match():

1) Looking for the string "courses" anywhere within the overall provided string.

The "i" after the pattern delimiter indicates a case-insensitive search.
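A minimal sketch of this example; the sample string and the messages are assumptions, not taken from the original listing:

```php
<?php
$string = 'Free PHP Courses and tutorials';   // hypothetical input

// "/courses/i" finds "courses" anywhere in the string, in any letter case.
$found = preg_match('/courses/i', $string);

if ($found) {
    echo 'The string contains "courses".';
} else {
    echo 'No match found.';
}
```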

 

2) Validate an Email address.
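One possible way to write this check reuses the email pattern listed earlier on this page. The helper name isValidEmail() and the test addresses are hypothetical:

```php
<?php
// Hypothetical helper built on the email pattern from the examples list.
function isValidEmail(string $email): bool
{
    $pattern = '/^([a-zA-Z0-9]+[a-zA-Z0-9._%-]*@([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,4})$/';
    return preg_match($pattern, $email) === 1;
}

var_dump(isValidEmail('john.doe@example.com'));   // bool(true)
var_dump(isValidEmail('john.doe@example'));       // bool(false): no top-level domain
var_dump(isValidEmail('user@site.museum'));       // bool(false): {2,4} rejects longer domains
```

As the last call shows, this pattern is stricter than real-world addresses allow. PHP's built-in filter_var($email, FILTER_VALIDATE_EMAIL) is an alternative worth considering.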

 

3) Getting the URL out of a HTML link.
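A sketch of how this could be done, assuming a simple link whose href value is wrapped in double quotes; the HTML fragment is made up:

```php
<?php
$html = '<a href="https://www.example.com/page.html" title="Example">Link</a>';

// The subpattern ([^"]*) captures everything between the quotes of href.
$url = null;
if (preg_match('/<a\s[^>]*href="([^"]*)"/i', $html, $m)) {
    $url = $m[1];
}

echo $url;   // https://www.example.com/page.html
```

A regular expression like this is fine for small, predictable fragments. For arbitrary HTML, a parser such as PHP's DOMDocument is usually more reliable.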

preg_match() stops searching after the first match. If you want to get all matching data in a string, use preg_match_all(). This function continues searching until it reaches the end of the subject, and puts all the matches in an array.


4) Example with preg_match_all(). Getting the content of all <li> tags that have class="cls":
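A minimal sketch of this example, with a made-up HTML list as input:

```php
<?php
$html = '<ul>
 <li class="cls">First item</li>
 <li>Other item</li>
 <li class="cls">Second item</li>
</ul>';

// The lazy (.*?) captures the content of each matching <li> separately.
$count = preg_match_all('/<li class="cls">(.*?)<\/li>/i', $html, $matches);

// $count is 2
// $matches[0] holds the full matches, $matches[1] the captured contents:
// ["First item", "Second item"]
print_r($matches[1]);
```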

String substituting - preg_replace

To perform pattern searching and replacing, use the preg_replace() function.
Syntax: preg_replace($pattern, $replacement, $subject)

  • $pattern - The RegExp pattern to search for. It can be either a string or an array of strings.
  • $replacement - The replacement string, or an array of replacement strings.
  • $subject - The string, or an array of strings, to search and replace in.


If both $pattern and $replacement parameters are arrays, each pattern will be replaced by the replacement counterpart.


preg_replace() returns an array if the $subject parameter is an array, or a string otherwise. If matches are found, the new $subject is returned; if there are no matches, $subject is returned unchanged. NULL is returned on error.

 

Let's see two examples with preg_replace():

1) Replacing a string (phpwin.org) with another string (porizm.com).
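A sketch of this replacement, assuming a made-up sentence as the subject. The dot is escaped so that it matches only a literal ".":

```php
<?php
$text = 'Visit phpwin.org for PHP tutorials.';   // hypothetical input

$result = preg_replace('/phpwin\.org/i', 'porizm.com', $text);

echo $result;   // Visit porizm.com for PHP tutorials.

// When nothing matches, the subject comes back unchanged.
$same = preg_replace('/phpwin\.org/i', 'porizm.com', 'No domain here');
```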

 

2) Using an array of RegExp patterns to replace two different values at the same time.
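One way this could look, assuming a made-up sentence with a date and a price. In the second replacement, "$1" refers to the text captured by the first subpattern:

```php
<?php
$text = 'Today is 2015-03-21 and the price is 25 USD.';   // hypothetical input

// Each pattern is paired with the replacement at the same array index.
$patterns     = ['/\d{4}-\d{2}-\d{2}/', '/(\d+) USD/'];
$replacements = ['[date]', '$1 dollars'];

$result = preg_replace($patterns, $replacements, $text);

echo $result;   // Today is [date] and the price is 25 dollars.
```

The patterns are applied one after another, so a later pattern sees the text already changed by the earlier ones.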