When extracting information from a web page, you typically must remove a lot of HTML tags and extraneous characters. PHP has many useful functions for this, such as trim(string), which removes whitespace characters from the start and end of its argument.
However, functions with fixed functionality such as trim are not enough for information extraction. We need more sophisticated string handling for extracting specific items of data from web pages. Regular expressions are the answer.
For a good basic tutorial on regular expressions in PHP see http://www.phpbuilder.com/columns/dario19990616.php3.
Some of the examples below are taken from that tutorial.
In HTML source code, newlines have no meaning. You may want to remove them as a first step in information extraction.
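As a minimal sketch of that first step, the snippet below replaces every run of newline characters with a single space. The sample string in $html is hypothetical, and the use of preg_replace is an assumption, since the text does not say which function to use.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = "<p>Price:\r\n<b>42</b>\n</p>";

// Replace each run of carriage returns and line feeds with one space,
// so that words on adjacent lines are not glued together.
$flat = preg_replace('/[\r\n]+/', ' ', $html);

echo $flat, "\n";   // prints <p>Price: <b>42</b> </p>
```

Replacing with a space rather than the empty string is a design choice: it keeps text such as "Price:" and "42" separated once the line break is gone.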
A regular expression pattern matches a string if the pattern can be found anywhere in the string. So for example the pattern once matches the string "There once was".
The special character ^ means "at the start of the string" and $ means "at the end." Use these to anchor a pattern so that it matches only at the start or end of the string rather than anywhere in it.
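The difference between anchored and unanchored matching can be sketched as follows. The snippet uses preg_match, PHP's PCRE matcher, which is an assumption because the text does not name a matching function. preg_match returns 1 when the pattern matches and 0 when it does not.

```php
<?php
$s = "There once was";

// Unanchored: the pattern may be found anywhere in the string.
$anywhere = preg_match('/once/', $s);    // 1

// Anchored at the start: "once" is not the first word, "There" is.
$onceFirst  = preg_match('/^once/', $s);  // 0
$thereFirst = preg_match('/^There/', $s); // 1

// Anchored at the end.
$wasLast = preg_match('/was$/', $s);      // 1

var_dump($anywhere, $onceFirst, $thereFirst, $wasLast);
```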
Escape sequences are character combinations that represent a character that cannot be written directly, or that would otherwise have a special meaning. For example, an escape sequence is needed to represent a literal period, because an unescaped period matches any character. Escape sequences begin with a backslash, e.g. \n and \t and \. and \-.
Characters inside square brackets are alternatives, e.g. [aeiou]. Inside square brackets the character - denotes a range, as in 0-9, so to represent it literally you have to escape it, as in [0-9\.\-] for example, or place it last in the brackets. You can write [ ] to indicate a space explicitly.
Immediately after [ the character ^ means "anything except." Double square brackets indicate a special character class, for example [[:alnum:]] and [[:space:]].
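A few bracket expressions in action, as a sketch only. It uses preg_match, which is an assumption since the text does not name a matching function, and the test strings are made up for illustration.

```php
<?php
// One of the listed alternatives at the start of the string.
$vowelFirst = preg_match('/^[aeiou]/', 'apple');       // 1

// ^ right after [ negates the set: "anything except a vowel".
$consonantFirst = preg_match('/^[^aeiou]/', 'apple');  // 0

// Digits, period, and hyphen, written with the escapes used in the text.
$numeric = preg_match('/^[0-9\.\-]+$/', '-12.5');      // 1

// Named character classes inside double square brackets.
$alnum = preg_match('/^[[:alnum:]]+$/', 'abc123');     // 1
$space = preg_match('/[[:space:]]/', 'no_spaces');     // 0

var_dump($vowelFirst, $consonantFirst, $numeric, $alnum, $space);
```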
Note: Does .{3} mean three of the same character, or any three characters? Try it and see.
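One way to try it is sketched below. The choice of preg_match and the three test strings are assumptions made for illustration. The pattern is anchored at both ends so that only strings of exactly the tested shape can match.

```php
<?php
$pattern = '/^.{3}$/';

// Three copies of the same character, three different characters,
// and a string that is too short.
$same  = preg_match($pattern, 'aaa');
$mixed = preg_match($pattern, 'abc');
$short = preg_match($pattern, 'ab');

// Run the script and compare the three results to answer the question.
var_dump($same, $mixed, $short);
```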
Exercise: Write a pattern that will match any real number, similar to the PHP is_double() function.
? is an abbreviation for {0,1}
* is an abbreviation for {0,}
+ is an abbreviation for {1,}
Parentheses ( ) allow "multipliers" like the above to apply to a sequence of characters. The vertical bar gives alternatives. For example:
(Nant|b)ucket
Precedence, i.e. tightness of binding, is important for regular expression operators. The anchor $ binds more tightly than |, so the first pattern below matches Fran anywhere in the string or Nan at the end, while the second matches either name only at the end:
Fran|Nan$
(Fran|Nan)$
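The effect of grouping on these two patterns can be checked with a short sketch. It uses preg_match, which is an assumption since the text does not name a matching function, and the test strings are chosen for illustration.

```php
<?php
$loose   = '/Fran|Nan$/';     // "Fran" anywhere, or "Nan" at the end
$grouped = '/(Fran|Nan)$/';   // "Fran" or "Nan", in both cases at the end

// "Frances" contains Fran but does not end with it.
$looseFrances   = preg_match($loose, 'Frances');    // 1
$groupedFrances = preg_match($grouped, 'Frances');  // 0

// "Nancy" contains Nan, but not at the end, so neither pattern matches.
$looseNancy   = preg_match($loose, 'Nancy');        // 0
$groupedNancy = preg_match($grouped, 'Nancy');      // 0

// Parentheses also limit how far an alternative reaches.
$nantucket = preg_match('/(Nant|b)ucket/', 'Nantucket');  // 1
$bucket    = preg_match('/(Nant|b)ucket/', 'bucket');     // 1

var_dump($looseFrances, $groupedFrances, $looseNancy, $groupedNancy);
```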
Exercise: What does this pattern test for: ^.+@.+\\..+$
Here \\. stands for a literal period: in a PHP string literal \\ produces a single backslash, so the regular expression engine sees \. and treats the period literally. The sequence .+ means one or more of any character. Intuitively, this matches anya@anyb.anyc, which is close to the syntax of email addresses. The pattern is not perfect because it also matches strings that are obviously not valid email addresses, for example strings containing more than one @ symbol.
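The weakness is easy to demonstrate. The sketch below uses preg_match, which is an assumption since the text does not name a matching function, and the sample addresses are invented.

```php
<?php
// In a double-quoted PHP string, \\ becomes one backslash,
// so the regular expression engine receives ^.+@.+\..+$
$pattern = "/^.+@.+\\..+$/";

$plausible = preg_match($pattern, 'anya@anyb.anyc');  // 1
$twoAts    = preg_match($pattern, 'a@@b.c');          // 1, although clearly invalid
$noAt      = preg_match($pattern, 'anyb.anyc');       // 0
$noPeriod  = preg_match($pattern, 'anya@anyb');       // 0

var_dump($plausible, $twoAts, $noAt, $noPeriod);
```

The second case matches because .+ accepts any characters, including a further @ symbol.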
The function split($pattern,$input) returns an array of strings that is the result of dividing up its second argument into pieces, using matches to the first argument as a delimiter. For example:
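list ($month, $day, $year) = split ('[/.-]', $date);
Note that in this example the slash, period, and hyphen are not escape sequences: they stand for themselves, because a period is not special inside square brackets and a hyphen placed last cannot start a range. The list construct assigns the elements of the returned array to the variables $month, $day, and $year.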
If the delimiter is not found or the delimiter is empty, then the first element of the array (with subscript 0) gets the whole input. If the delimiter is repeated consecutively, the array will include empty strings for the gaps between the repeated delimiters.
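The behaviour described above can be reproduced in current PHP with preg_split. This is a substitution, not the function discussed in the text: split() was deprecated in PHP 5.3 and removed in PHP 7.0. The sample strings are hypothetical.

```php
<?php
// The same bracket expression, wrapped in # delimiters as PCRE requires.
$delim = '#[/.-]#';

// Ordinary case: three fields.
list($month, $day, $year) = preg_split($delim, "04/30/1973");
echo "$year-$month-$day\n";                       // prints 1973-04-30

// Delimiter not found: element 0 holds the whole input.
$whole = preg_split($delim, "no delimiter here");

// Consecutive delimiters: an empty string appears between them.
$gaps = preg_split($delim, "04//1973");

print_r($whole);
print_r($gaps);
```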
An optional third argument says how many items to return. The last item then contains the whole remainder of the string. This is useful for writing before() and after() functions that take a specific part of a web page. These functions are useful for information extraction, for example:
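function after($pattern, $text) {
    // An empty pattern cannot delimit anything, so return the text unchanged.
    if ($pattern == "") return $text;
    // Split into at most two pieces around the first match of the pattern.
    $s = split($pattern, $text, 2);
    // If the pattern does not occur, there is no second piece.
    return isset($s[1]) ? $s[1] : "";
}
Note that this function returns all of $text after the first occurrence of the pattern, or an empty string if the pattern does not occur.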
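For current PHP versions, where split() no longer exists (it was removed in PHP 7.0), a pair of such helpers might look like the sketch below. The names after_first and before_first, the sample page, and the use of preg_split are all assumptions made for illustration. Unlike split(), preg_split expects the pattern to include delimiters, here #.

```php
<?php
// Hypothetical helper: everything after the first match of $pattern.
function after_first($pattern, $text) {
    if ($pattern === "") return $text;
    // A limit of 2 keeps the whole remainder in the second piece.
    $parts = preg_split($pattern, $text, 2);
    if ($parts === false || count($parts) < 2) return "";  // bad pattern or no match
    return $parts[1];
}

// Hypothetical helper: everything before the first match of $pattern.
function before_first($pattern, $text) {
    if ($pattern === "") return $text;
    $parts = preg_split($pattern, $text, 2);
    if ($parts === false) return "";                        // bad pattern
    return $parts[0];  // the whole text if the pattern does not occur
}

// Hypothetical page fragment.
$page = "<html><title>Weather</title><body>Sunny</body></html>";

// Extract the title by chaining the two helpers.
$title = before_first('#</title>#', after_first('#<title>#', $page));
echo $title, "\n";   // prints Weather
```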