
Example 1719. Find the word "web"

<?php
/* The \b in the pattern indicates a word boundary, so only the distinct
 * word "web" is matched, and not a word partial like "webbing" or "cobweb" */
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}

if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
?>

Example 1720. Getting the domain name out of a URL

<?php
// get host name from URL
preg_match('@^(?:http://)?([^/]+)@i',
    "http://www.php.net/index.php", $matches);
$host = $matches[1];

// get last two segments of host name
preg_match('/[^.]+\.[^.]+$/', $host, $matches);

echo "domain name is: {$matches[0]}\n";
?>

The above example will output:

domain name is: php.net

Code Examples / Notes » preg_match

Watch out when using C-style comments around a preg_match (or any preg_* call, for that matter). In certain situations, like the example below, the result will not be what you expect. This one is easy to catch, but worth noting.

/* we will comment out this section
if (preg_match("/anything.*/", $var)) {
    code here;
}
*/

This happens because comments are recognized first when the code is parsed (as they should be). The asterisk (*) and the ending delimiter (/) inside the preg_match pattern are therefore read as the end of the comment, and the rest of your supposedly commented-out code is interpreted as PHP.

To match a North American phone number with a regex, you can assume the form NxxNxxXXXX, where N = 2 through 9 and x = 0 through 9. North American numbers cannot start with a 0 or a 1 in either the area code or the office code. So, adapted from the other phone number regex here, you would get: /^[2-9][0-9]{2}[-][2-9][0-9]{2}[-][0-9]{4}$/

To check a Romanian landline phone number and return "Bucharest", "Proper" or "Unknown", I've used this function:

<?php
function verify_destination($destination) {
    // Default result, so the function never returns an undefined variable
    $destination_match = "Unknown";
    if (strlen($destination) == 10) {
        if (preg_match("/^021[2-7][0-9]{6}$/", $destination)) {
            $destination_match = "Bucharest";
        } elseif (preg_match("/^02[3-6][0-9][1-7][0-9]{5}$/", $destination)) {
            $destination_match = "Proper";
        }
    }
    return $destination_match;
}
?>

This is the only function in which the assertion \\G can be used in a regular expression. \\G matches only if the current position in 'subject' is the same as specified by the index 'offset'. It is comparable to the ^ assertion, but whereas ^ matches at position 0, \\G matches at position 'offset'.
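A minimal sketch of the difference between the two assertions, assuming a made-up subject string and an offset of 2. The subject and variable names are illustrative only.

```php
<?php
$subject = "xxfoo";   // hypothetical subject, chosen for illustration

// ^ asserts the real start of the subject, so this fails even with offset 2
$caret = preg_match('/^foo/', $subject, $m1, 0, 2);

// \G asserts the position given by the offset argument, so this succeeds
$g = preg_match('/\Gfoo/', $subject, $m2, 0, 2);

var_dump($caret, $g);   // int(0), int(1)
```

This is also why passing an offset is not the same as matching against substr($subject, $offset): with substr(), ^ would match at the new start of the string.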

This is a function to convert byte offsets into (UTF-8) character offsets (this is regardless of whether you use the /u modifier):

<?php
function mb_preg_match($ps_pattern, $ps_subject, &$pa_matches, $pn_flags = 0, $pn_offset = 0, $ps_encoding = NULL) {
    // WARNING! - All this function does is to correct offsets, nothing else:
    if (is_null($ps_encoding))
        $ps_encoding = mb_internal_encoding();

    // Convert the character offset into the byte offset preg_match expects
    $pn_offset = strlen(mb_substr($ps_subject, 0, $pn_offset, $ps_encoding));
    $ret = preg_match($ps_pattern, $ps_subject, $pa_matches, $pn_flags, $pn_offset);

    // Convert the captured byte offsets back into character offsets
    if ($ret && ($pn_flags & PREG_OFFSET_CAPTURE)) {
        foreach ($pa_matches as &$ha_subpattern)
            $ha_subpattern[1] = mb_strlen(substr($ps_subject, 0, $ha_subpattern[1]), $ps_encoding);
        unset($ha_subpattern); // break the reference left by foreach
    }
    return $ret;
}
?>

This constant helps validate a phone number that does not need to be in one particular format. It matches the following US phone formats and many variations of them:

(Xxx) Xxx-Xxxx
(Xxx) Xxx Xxxx
Xxx Xxx Xxxx
Xxx-Xxx-Xxxx
XxxXxxXxxx
Xxx.Xxx.Xxxx

define( "REGEXP_PHONE", "/^(\(|){1}[2-9][0-9]{2}(\)|){1}([. -]|)[2-9][0-9]{2}([. -]|)[0-9]{4}$/" );

This function (for PHP 4.3.0+) uses preg_match to return the position of a regex match (like strpos, but using a regex pattern instead):

function preg_pos($sPattern, $sSubject, &$FoundString, $iOffset = 0) {
    $FoundString = NULL;
    if (preg_match($sPattern, $sSubject, $aMatches, PREG_OFFSET_CAPTURE, $iOffset) > 0) {
        $FoundString = $aMatches[0][0];
        return $aMatches[0][1];
    } else {
        return FALSE;
    }
}

It also returns the actual string found using the pattern, via $FoundString.

The ExtractString function does not have a real error, but it can misbehave. What if it is called like this:

ExtractString($row, 'action="', '"');

It would find 'action="' correctly, but perhaps not the first " after the $start string. If $row consists of

<form method="post" action="script.php">

strpos($str_lower, $end) would return the first " in the method attribute. So I made some modifications and it seems to work fine.

function ExtractString($str, $start, $end) {
    $str_low = strtolower($str);
    $pos_start = strpos($str_low, $start);
    if ($pos_start === false) {
        return false; // $start not found
    }
    // Search for $end only after the end of $start
    $pos1 = $pos_start + strlen($start);
    $pos_end = strpos($str_low, $end, $pos1);
    if ($pos_end === false) {
        return false; // $end not found
    }
    return substr($str, $pos1, $pos_end - $pos1);
}

Test for a valid US phone number, and get it back formatted at the same time:

function getUSPhone($var) {
    $US_PHONE_PREG  = "/^(?:\+?1[\-\s]?)?(\(\d{3}\)|\d{3})[\-\s\.]?"; // area code
    $US_PHONE_PREG .= "(\d{3})[\-\.]?(\d{4})"; // seven digits
    $US_PHONE_PREG .= "(?:\s?x|\s|\s?ext(?:\.|\s)?)?(\d*)?$/"; // any extension
    if (!preg_match($US_PHONE_PREG, $var, $match)) {
        return false;
    } else {
        $tmp = "+1 ";
        if (substr($match[1], 0, 1) == "(") {
            $tmp .= $match[1];
        } else {
            $tmp .= "(" . $match[1] . ")";
        }
        $tmp .= " " . $match[2] . "-" . $match[3];
        if ($match[4] <> '') $tmp .= " x" . $match[4];
        return $tmp;
    }
}

Usage:

$phone = $_REQUEST["phone"];
if (!($phone = getUSPhone($phone))) {
    // handle the error gracefully :)
}

regex for validating emails, from Perl's RFC2822 package: http://en.wikipedia.org/wiki/Talk:E-mail_address

Quick function to filter input. It tries to filter JavaScript, HTML, SQL injection, and RFI attempts.

<?php
// Encode every character of $text as a numeric HTML entity
function entities($text) {
    $eresult = "";
    for ($i = 0; $i < strlen($text); $i++) {
        $eresult .= "&#" . ord($text[$i]) . ";";
    }
    return $eresult;
}

function filter($text) {
    if (preg_match("#(on(.*?)\=|script|xmlns|expression|javascript|\>|\<|http)#si", $text, $ntext)) {
        $re = entities($ntext[1]);
        $text = str_replace($ntext[0], $re, $text);
    }
    // mysql_real_escape_string() needs an open MySQL connection and was removed in PHP 7
    $text = mysql_real_escape_string($text);
    return $text;
}

foreach ($_POST as $x => $y) {
    $_POST[$x] = filter($y);
}
foreach ($_GET as $x => $y) {
    $_GET[$x] = filter($y);
}
foreach ($_COOKIE as $x => $y) {
    $_COOKIE[$x] = filter($y);
}
?>

Referring to the post by "internet at sourcelibre dot com": instead of listing characters such as the German umlauts explicitly in a Perl-style regex, like

<?php
$bolMatch = preg_match("/^[a-zA-ZäöüÄÖÜ]+$/", $strData);
?>

use the setlocale() function and the POSIX character class, like

<?php
setlocale(LC_ALL, 'de_DE');
$bolMatch = preg_match("/^[[:alpha:]]+$/", $strData);
?>

This works for any country-specific special character set. Remember that since domains with umlauts have been released, it is almost mandatory to change your regular expressions so that such domains (in e-mail and internet addresses) are accepted by your forms. Life can be so easy when you read the manual ;-)

Note that the PREG_OFFSET_CAPTURE flag, as far as I've tested, returns the offset in bytes not characters, which may not be what you're expecting if you're using the /u pattern modifier to make the regex UTF-8 aware (i.e. multibyte characters will result in a greater offset than you expect)
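A small sketch of the effect, assuming the mbstring extension is available for the conversion step. The subject string is made up for illustration and is written with explicit UTF-8 bytes so the result does not depend on the encoding of the source file.

```php
<?php
// "\xC3\xA4" is the UTF-8 encoding of "ä": one character, two bytes
$subject = "\xC3\xA4b";

preg_match('/b/u', $subject, $m, PREG_OFFSET_CAPTURE);

// Offset as reported by preg_match: counted in bytes
$byteOffset = $m[0][1];

// Same position counted in characters (requires mbstring)
$charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

echo "byte offset: $byteOffset, character offset: $charOffset\n";
// byte offset: 2, character offset: 1
```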

Ne'er try to verify email address by using some random regex you just invented sitting on the toilet seat. It will not work properly. The proper regex for email validation is something along the lines of "([-!#$%&'*+/=?_`{|}~a-z0-9^] +(\.[-!#$%&'*+/=?_`{|}~a-z0-9 ^]+)*|"([\x0b\x0c\x21\x01-\x08\ x0e-\x1f\x23-\x5b\x5d-\x7f]|\\[\x 0b\x0c\x01-\x09\x0e-\x7f])*")@(( [a-z0-9]([-a-z0-9]*[a-z0-9])?\.)+[ a-z0-9]([-a-z0-9]*[a-z0-9]){1,1}| \[((25[0-5]|2[0-4][0-9]|[01]?[0-9] [0-9]?)\.){3,3}(25[0-5]|2[0-4][0-9 ]|[01]?[0-9][0-9]?|[-a-z0-9]*[a-z0 -9]:([\x0b\x0c\x01-\x08\x0e-\x1f\ x21-\x5a\x53-\x7f]|\\[\x0b\x0c\x0 1-\x09\x0e-\x7f])+)\])". However, you shouldn't even try that regex. If you do not understand what that regexp does, then please do not try to write one yourself. If you need a _truly_ _valid_ e-mail address, no regexp is going to help you - just send a verification message to the user-supplied address with a link or code the user can paste to verify the address. IF you still WISH - against my recommendation - to use some validating regexp then *please* just make it warn loudly that the address may be invalid; do not write code that throws a fatal error outright. I am quite fed up with sites that do not accept my .name e-mail address, or some other valid, working forms for that matter.

Maybe it will sound obvious, but I've encountered this a few times... If you are using preg_match() to validate user input, remember to include ^ and $ in your regex, or take the input from $matches[0] after successfully matching a pattern. For example, preg_match('/[0-9]+/', '123 UNION SELECT ... --') will return 1 (a match), but when you use the input in a SQL statement, the injected code will probably be executed (if you don't escape the user-supplied argument). Note that $matches[0] == '123', so it can be used as valid input.
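A short sketch of the point above, using the same hostile input. It swaps in \A and \z instead of ^ and $ as an assumption of this example, because $ also matches just before a trailing newline.

```php
<?php
$input = '123 UNION SELECT ... --';   // the hostile input from the note above

// Unanchored: succeeds because "123" appears somewhere in the input
$loose = preg_match('/[0-9]+/', $input, $matches);

// Anchored with \A and \z: the whole input must consist of digits
$strict = preg_match('/\A[0-9]+\z/', $input);

var_dump($loose, $matches[0], $strict);   // int(1), string(3) "123", int(0)
```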

Match and replace for arrays. Useful for parsing an entire $_POST. Examples are given for array_preg_match only:

<?php
function array_preg_match(array $patterns, array $subjects, &$errors = array()) {
    $errors = array();
    foreach ($patterns as $k => $v)
        preg_match($v, $subjects[$k]) or $errors[$k] = TRUE;
    return count($errors) == 0;
}

function array_preg_replace(array $patterns, array $replacements, array $subject) {
    $r = array();
    foreach ($patterns as $k => $v)
        $r[$k] = preg_replace($v, $replacements[$k], $subject[$k]);
    return $r + $subject;
}

$arr1 = array('name' => 'Alexandre', 'phone' => '44559999');
$arr2 = array('name' => '', 'phone' => '44559999c');

array_preg_match(array(
    'name'  => '#.+#',     // Not empty
    'phone' => '#^\d*$#'   // Only digits, optional
), $arr1, $match_errors);
print_r($match_errors); // Empty, it is ok.

array_preg_match(array(
    'name'  => '#.+#',     // Not empty
    'phone' => '#^\d*$#'   // Only digits, optional
), $arr2, $match_errors);
print_r($match_errors); // Two indexes, name and phone, both not ok.
?>

Intending to use preg_match to check whether an email address is in a valid format? The following page contains some very useful information about possible formats of email addresses, some of which may surprise you: http://en.wikipedia.org/wiki/E-mail_address
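As an alternative to a hand-written pattern, PHP's filter extension (PHP 5.2 and later) ships a built-in e-mail check. The sketch below is only a format check under that assumption; the sample addresses are invented, and a passing result still says nothing about whether the mailbox exists.

```php
<?php
// Sample inputs chosen for illustration only
$good = 'user@example.com';
$bad  = 'not an address';

// filter_var() returns the filtered string on success and false on failure
$goodResult = filter_var($good, FILTER_VALIDATE_EMAIL);
$badResult  = filter_var($bad, FILTER_VALIDATE_EMAIL);

var_dump($goodResult);  // string(16) "user@example.com"
var_dump($badResult);   // bool(false)
```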

If you wonder how to check for a correct e-mail address (you can adapt it for usernames and anything else you want, but this one is for e-mail), you can use this little piece of code to validate the user's e-mail. We'll assume that the user has submitted a form with the e-mail entered in a field named "email"; PHP will take care of the rest:

$emailcheck = $_POST["email"];
if (!preg_match("/^[a-z0-9\å\ä\ö._-]+@[a-z0-9\å\ä\ö.-]+\.[a-z]{2,6}$/i", $emailcheck))
    $errors[] = "- Your e-mail is missing or is not valid.";

If we split the pattern into its parts, it looks like this:

[a-z0-9._-]+@
This is the name part of the e-mail, such as greatguy3 (followed by @domain.com). It allows dot, underscore and hyphen as well as letters and digits.

[a-z0-9.-]+\.
This is the domain part. Note that there must be a dot after the domain name, so it's harder to fake an e-mail. The same characters are allowed here: A-Z, 0-9, dot and hyphen (in case your domain has a hyphen in it, such as nintendo-wii.com).

[a-z]{2,6}$/i
This is the last part of your e-mail, the .com/.net/.info, whichever you use. The numbers between {} limit how many letters are allowed (in this case min 2 and max 6), which allows anything from "us" up to "museum"; a suffix such as "org.uk" also passes, because its "org" part is matched by the domain part. Only the letters A-Z are allowed here. The "i" is there so you can use both uppercase and lowercase characters (A-Z and a-z).

So a valid e-mail address with this code would be "coolguy3@cooldomain.com" and an invalid one would be "zûmgz^;*@hot_mail.bananá". This is only the e-mail part, not the full code. Paste it into your form processing to use it with the rest of your code! Hope this helps.

If you want the equivalents of Perl's match variables $`, $& and $' (the text before the match, the match itself, and the text after the match), here's one way to do it:

echo preg_match("/(.*?)(and)(.*)/", "this and that", $matches);
print_r($matches);

Here $` corresponds to $matches[1], $& to $matches[2], and $' to $matches[3]. Note the final (.*); without it the end won't be captured. If you only need $&, simply use $matches[0].

Here's another way, which is a bit simpler to remember:

echo preg_match("/^(.*?)(and)(.*?)$/", "this and that", $matches);
print_r($matches);

I'm not happy with any e-mail address pattern that I have seen. The following addresses are wrong:

email1..@myserver.com
email1.-@myserver.com
email1._@myserver.com
email1@2sub.myserver.com
email1@sub.sub.2sub.myserver.com

So, this is my pattern:

$pat = "/^[a-z]+[a-z0-9]*[\.|\-|_]?[a-z0-9]+@([a-z]+[a-z0-9]*[\.|\-]?[a-z]+[a-z0-9]*[a-z0-9]+){1,4}\.[a-z]{2,4}$/";

Best Regards, Elier

http://www.faqs.org/rfcs/rfc1035.html RFC 1035 - Domain names - implementation and specification

If you want to match all Scandinavian characters (æÆøØåÅöÖäÄ) in addition to those matched by \w, you might want to use this regexp: /^[\w\xe6\xc6\xf8\xd8\xe5\xc5\xf6\xd6\xe4\xc4]+$/ Remember that \w respects the current locale used in PCRE's character tables.

I just started using PHP and this section doesn't clarify whether or not you must use "/" as your regular expression delimiter. I want to clarify that you can use almost any character as your delimiter. The delimiter is automatically the first character of your regular expression string. This makes it a bit easier if you are looking for things that might contain a forward slash. For example:

preg_match('#</b>#', $string);

Instead of:

preg_match('/<\/b>/', $string);

Or:

preg_match('@/my/dir/name/@', $string);

Instead of:

preg_match('/\/my\/dir\/name\//', $string);

This can greatly boost readability. It is not quite as flexible as in Perl (you can't use control characters or \n, which can really come in handy when you aren't quite sure what characters might be in your regular expression), but switching to another delimiter can make your code a bit easier to read.

How to verify a Canadian postal code!

if (!preg_match("/^[a-z]\d[a-z] ?\d[a-z]\d$/i", $postalcode)) {
    echo "Your postal code has an incorrect format.";
}

Here's a little function to manipulate the default MySQL datetime.

// return date, time, timestamp from a MySQL datetime YYYY-MM-DD hh:mm:ss
function getMySQL_datetime($datetime) {
    if (preg_match("/(\d{4})-(\d{2})-(\d{2})\s(\d{2}):(\d{2}):(\d{2})/", $datetime, $dt)) {
        $d["year"]  = $dt[1];
        $d["month"] = $dt[2];
        $d["day"]   = $dt[3];
        $d["hour"]  = $dt[4];
        $d["min"]   = $dt[5];
        $d["sec"]   = $dt[6];
        $d["timestamp"] = mktime($d["hour"], $d["min"], $d["sec"], $d["month"], $d["day"], $d["year"]);
        return $d;
    } else
        echo "Match Not Found!";
}
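For comparison, a sketch of the same extraction without a regex, using the built-in DateTime class (PHP 5.3 and later for createFromFormat). This is a swapped-in technique, not the note author's method, and the input value is made up for illustration.

```php
<?php
// Hypothetical input value in MySQL DATETIME format (YYYY-MM-DD hh:mm:ss)
$datetime = '2003-11-22 19:02:30';

// createFromFormat() returns false when the string cannot be parsed
$dt = DateTime::createFromFormat('Y-m-d H:i:s', $datetime);

if ($dt === false) {
    echo "Match Not Found!";
} else {
    // Individual fields, formatted the same way as the regex captures
    $d = array(
        'year'  => $dt->format('Y'),
        'month' => $dt->format('m'),
        'day'   => $dt->format('d'),
        'hour'  => $dt->format('H'),
        'min'   => $dt->format('i'),
        'sec'   => $dt->format('s'),
        // Unix timestamp, interpreted in the default time zone
        'timestamp' => $dt->getTimestamp(),
    );
    print_r($d);
}
```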

Here's a pattern for matching US phone numbers in the following formats:

###-###-####
(###) ###-####
##########

It restricts the area codes to >= 200 and exchanges to >= 100, since values below these are invalid.

<?php
$pattern = "/(\([2-9]\d{2}\)\s?|[2-9]\d{2}-|[2-9]\d{2})"
         . "[1-9]\d{2}"
         . "-?\d{4}/";
?>

Here is sample code that checks for alphabetic characters only, with exceptions for space, hyphen and single quote, using preg_match():

$alpha = "some very funny string'9-2'";
/* check for letters, hyphens, quotes and spaces in the string, but no digits */
if (preg_match("/^[a-zA-Z\-\'\ ]+$/u", $alpha)) {
    return 1;
} else {
    return 0;
}

To allow another character, just add a '\' followed by that character to the class, e.g. [\@]. I hope it will be helpful to someone out there. -Erandra

Do not forget that PCRE shares many features with Perl. One that is often neglected is the ability to return the matches as an associative array (Perl's hash). For example, here's a code snippet that will parse a subset of the XML Schema 'duration' datatype:

<?php
$duration_tag = 'PT2M37.5S'; // 2 minutes and 37.5 seconds

// drop the milliseconds part
preg_match(
    '#^PT(?:(?P<minutes>\d+)M)?(?P<seconds>\d+)(?:\.\d+)?S$#',
    $duration_tag, $matches);
print_r($matches);
?>

Here is the corresponding output:

Array
(
    [0] => PT2M37.5S
    [minutes] => 2
    [1] => 2
    [seconds] => 37
    [2] => 37
)

Concerning the German umlauts (and other language-specific characters such as accented letters): if you use Unicode (UTF-8), you can match them easily with the Unicode character property \pL (match any Unicode letter) and the "u" modifier, so e.g. <?php preg_match("/[\w\pL]/u",$var); ?> would really match all "words" in $var, whether they contain umlauts or not. Took me a while to figure this out, so maybe this comment will save the day for someone else :-)
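A minimal sketch of the difference, assuming PCRE was built with Unicode property support. The sample name is invented and written with explicit UTF-8 bytes so the result does not depend on the encoding of the source file.

```php
<?php
// "M\xC3\xBCller" is "Müller" written with explicit UTF-8 bytes
$name = "M\xC3\xBCller";

$ascii   = preg_match('/^[a-zA-Z]+$/', $name);  // ASCII letters only
$unicode = preg_match('/^\pL+$/u', $name);      // any Unicode letter, UTF-8 mode

var_dump($ascii, $unicode);   // int(0), int(1)
```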

Backreferences (ala preg_replace) work within the search string if you use the backslash syntax. Consider: <?php if (preg_match("/([0-9])(.*?)(\\1)/", "01231234", $match)) { print_r($match); } ?> Result: Array ( [0] => 1231 [1] => 1 [2] => 23 [3] => 1 ) This is alluded to in the description of preg_match_all, but worth reiterating here.

As I did not find any working IPv6 regexp, I just created one. Here it is:

$pattern1 = '([A-Fa-f0-9]{1,4}:){7}[A-Fa-f0-9]{1,4}';
$pattern2 = '[A-Fa-f0-9]{1,4}::([A-Fa-f0-9]{1,4}:){0,5}[A-Fa-f0-9]{1,4}';
$pattern3 = '([A-Fa-f0-9]{1,4}:){2}:([A-Fa-f0-9]{1,4}:){0,4}[A-Fa-f0-9]{1,4}';
$pattern4 = '([A-Fa-f0-9]{1,4}:){3}:([A-Fa-f0-9]{1,4}:){0,3}[A-Fa-f0-9]{1,4}';
$pattern5 = '([A-Fa-f0-9]{1,4}:){4}:([A-Fa-f0-9]{1,4}:){0,2}[A-Fa-f0-9]{1,4}';
$pattern6 = '([A-Fa-f0-9]{1,4}:){5}:([A-Fa-f0-9]{1,4}:){0,1}[A-Fa-f0-9]{1,4}';
$pattern7 = '([A-Fa-f0-9]{1,4}:){6}:[A-Fa-f0-9]{1,4}';

Patterns 1 to 7 represent different cases. $full is the complete pattern, which should work for all correct IPv6 addresses.

$full = "/^($pattern1)$|^($pattern2)$|^($pattern3)$|^($pattern4)$|^($pattern5)$|^($pattern6)$|^($pattern7)$/";
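If a hand-built pattern is not required, PHP's filter extension (PHP 5.2 and later) can do this check. The sketch below is a swapped-in alternative rather than the note author's approach, and the sample addresses are chosen for illustration only.

```php
<?php
// Sample addresses chosen for illustration only
$candidates = array('2001:db8::1', '::1', '1.2.3.4', 'not-an-address');

$results = array();
foreach ($candidates as $ip) {
    // filter_var() returns the address on success and false on failure;
    // FILTER_FLAG_IPV6 restricts the check to IPv6 notation
    $results[$ip] = filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
    echo $ip, ' => ', ($results[$ip] ? 'valid IPv6' : 'not valid IPv6'), "\n";
}
```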

A web server log record can be parsed as follows:

$line_in = '209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"';

if (preg_match('!^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^ ]+) ([^ ]+) ([^/]+)/([^"]+)" ([^ ]+) ([^ ]+) ([^ ]+) (.+)!', $line_in, $elements)) {
    print_r($elements);
}

Array
(
    [0] => 209.6.145.47 - - [22/Nov/2003:19:02:30 -0500] "GET /dir/doc.htm HTTP/1.0" 200 6776 "http://search.yahoo.com/search?p=key+words=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
    [1] => 209.6.145.47
    [2] => -
    [3] => -
    [4] => 22/Nov/2003:19:02:30 -0500
    [5] => GET
    [6] => /dir/doc.htm
    [7] => HTTP
    [8] => 1.0
    [9] => 200
    [10] => 6776
    [11] => "http://search.yahoo.com/search?p=key+words=UTF-8"
    [12] => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)"
)

Notes:
1) For the referer field ($elements[11]), I intentionally capture the double quotes (") and don't use them as delimiters, because sometimes double quotes do appear in a referer URL. Double quotes can appear as %22 or \". Both have to be handled correctly. So, I strip off the double quotes in a second step.
2) The URLs should be further parsed using parse_url, which is quicker and more reliable than preg_match.
3) I assume the requested protocol (HTTP/1.1) always has a slash character in the middle, which might not always be the case, but I'll take the risk.
4) The agent field ($elements[12]) is the most unstructured field, so I make no assumptions about its format. If the record is truncated, the agent field will not be delimited properly with a quote at the end. So, both cases must be handled.
5) A hyphen (- or "-") means a field has no value. It is necessary to convert these to an appropriate value (such as an empty string, null, or 0).
6) Finally, there should be appropriate code to handle malformed web log entries, which are common due to junk data. I never assume I've seen all cases.
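Notes 1, 2 and 5 can be sketched as follows. The helper name normalize_field is hypothetical, and the choice of null for empty fields is an assumption of this example, not something the note prescribes.

```php
<?php
// Hypothetical helper: turn "-" or "\"-\"" into null and strip surrounding quotes
function normalize_field($field) {
    $field = trim($field, '"');          // note 1: strip the double quotes
    return $field === '-' ? null : $field; // note 5: a hyphen means "no value"
}

// Referer field as captured in $elements[11] above, quotes included
$referer = normalize_field('"http://search.yahoo.com/search?p=key+words=UTF-8"');
$ident   = normalize_field('-');

// Note 2: let parse_url() split the URL into its components
$parts = parse_url($referer);

echo $parts['host'], "\n";   // search.yahoo.com
echo $parts['path'], "\n";   // /search
echo $parts['query'], "\n";  // p=key+words=UTF-8
var_dump($ident);            // NULL
```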

A very simple phone number validation function. Returns the phone number if it is in the xxx-xxx-xxxx format, x being 0-9. Returns false if digits are missing or improper characters are included.

<?php
function VALIDATE_USPHONE($phonenumber) {
    if (preg_match("/^[0-9]{3,3}[-]{1,1}[0-9]{3,3}[-]{1,1}[0-9]{4,4}$/", $phonenumber)) {
        return $phonenumber;
    } else {
        return false;
    }
}
?>

<?php
// some may find this useful... :)
$iptables = file('/proc/net/ip_conntrack');
$services = file('/etc/services');

$GREP = '!([a-z]+) '        . // [1] protocol
        '\\s*([^ ]+) '      . // [2] protocol in decimal
        '([^ ]+) '          . // [3] time-to-live
        '?([A-Z_]|[^ ]+)?'  . // [4] state
        ' src=(.*?) '       . // [5] source address
        'dst=(.*?) '        . // [6] destination address
        'sport=(\\d{1,5}) ' . // [7] source port
        'dport=(\\d{1,5}) ' . // [8] destination port
        'src=(.*?) '        . // [9] reversed source
        'dst=(.*?) '        . // [10] reversed destination
        'sport=(\\d{1,5}) ' . // [11] reversed source port
        'dport=(\\d{1,5}) ' . // [12] reversed destination port
        '\\[([^]]+)\\] '    . // [13] status
        'use=([0-9]+)!';      // [14] use

// Build a port number => service name map from /etc/services
$ports = array();
foreach ($services as $s) {
    if (preg_match("/^([a-zA-Z-]+)\\s*([0-9]{1,5})\\//", $s, $x)) {
        $ports[ $x[2] ] = $x[1];
    }
}

for ($i = 0; $i < count($iptables); $i++) {
    if (preg_match($GREP, $iptables[$i], $x)) {
        // translate known ports...
        $x[7] = (array_key_exists($x[7], $ports)) ? $ports[$x[7]] : $x[7];
        $x[8] = (array_key_exists($x[8], $ports)) ? $ports[$x[8]] : $x[8];
        print_r($x);
    }
    // on a nice sortable-table... bon appétit!
}
?>

preg_match

Perform a regular expression match (PHP 4, PHP 5)
int preg_match ( string pattern, string subject [, array &matches [, int flags [, int offset]]] )
