Pattern Modifiers

Describes possible modifiers in regex patterns

Code Examples / Notes » reference.pcre.pattern.modifiers

ebarnard

When adding comments with the /x modifier, don't use the pattern delimiter inside a comment. It is not ignored there: the first unescaped delimiter ends the pattern, and whatever follows it is read as modifiers. Example:

<?php
$target = 'some text';
if (preg_match('/
    e # Comments here
    /x', $target)) {
    print "Target 1 hit.\n";
}
if (preg_match('/
    e # /Comments here with slash
    /x', $target)) {
    print "Target 2 hit.\n";
}
?>

This prints "Target 1 hit." but then generates a PHP warning message for the second preg_match():

Warning: preg_match() [function.preg-match]: Unknown modifier 'C' in /ebarnard/x-modifier.php on line 11
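One way around this, sketched here as a suggestion rather than anything taken from the note itself, is to pick a delimiter that does not occur in the comment text. The example assumes `~` as the delimiter; the variable names are made up for illustration.

```php
<?php
$target = 'some text';

// The tilde is the delimiter here, so a slash inside the comment is harmless.
$pattern = '~
    e   # Comments here with a / slash
~x';

// preg_match() returns 1 because "e" occurs in the target.
$hit = preg_match($pattern, $target);
```

The same caveat still applies to whichever delimiter is chosen: it must not appear unescaped inside the comment.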

varrah no_garbage_or_spam

I spent a few days trying to work out how to write a pattern for Unicode characters using their hex codes. I finally managed it after reading several manuals, none of which gave a practical, PHP-valid example, so here is one. Suppose we want to search for the Japanese-standard circled numbers 1-9 (Unicode code points 0x2460-0x2468). To do that via the hex codes, use the following call:

preg_match('/[\x{2460}-\x{2468}]/u', $str);

Here $str is the haystack string, \x{hex} is the hexadecimal code point of a Unicode character, and /u makes PCRE treat the pattern and the subject as UTF-8, so the class is read as a range of Unicode characters. Hope it's useful.
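As a minimal sketch of that call in use, the following collects every circled number in a sample string. The sample text and variable names are invented for illustration, and the code points are written with the "\u{...}" string escape, which assumes PHP 7 or later.

```php
<?php
// Hypothetical sample: circled 1, circled 5 and circled 9 mixed with ASCII text.
$str = "step \u{2460}, then \u{2464}, finally \u{2468}";

// \x{2460}-\x{2468} is a range of code points; /u switches on UTF-8 mode.
$count = preg_match_all('/[\x{2460}-\x{2468}]/u', $str, $matches);

// $matches[0] holds the matched characters in order of appearance.
foreach ($matches[0] as $circled) {
    echo $circled, "\n";
}
```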

hfuecks

Regarding the validity of a UTF-8 string when using the /u pattern modifier, some things to be aware of:

1. If the pattern itself contains an invalid UTF-8 character, you get an error (as mentioned in the docs above: "UTF-8 validity of the pattern is checked since PHP 4.3.5").
2. When the subject string contains invalid UTF-8 sequences / codepoints, it basically results in a "quiet death" for the preg_* functions: nothing is matched, and no warning is raised to indicate that the string is invalid UTF-8.
3. PCRE regards five and six octet UTF-8 character sequences as valid (both in patterns and the subject string), but these are not supported in Unicode (see section 5.9 "Character Encoding" of the "Secure Programming for Linux and Unix HOWTO", which can be found at http://www.tldp.org/ and other places).
4. For an example algorithm in PHP which tests the validity of a UTF-8 string (and discards five / six octet sequences), head to: http://hsivonen.iki.fi/php-utf8/

The following script should give you an idea of what works and what doesn't:

<?php
$examples = array(
    'Valid ASCII' => "a",
    'Valid 2 Octet Sequence' => "\xc3\xb1",
    'Invalid 2 Octet Sequence' => "\xc3\x28",
    'Invalid Sequence Identifier' => "\xa0\xa1",
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
);

// Each example is used as the pattern itself.
echo "++Invalid UTF-8 in pattern\n";
foreach ( $examples as $name => $str ) {
    echo "$name\n";
    preg_match("/".$str."/u", 'Testing');
}

// Each example is used as the subject of a fixed pattern.
echo "++ preg_match() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match("/\xf8\xa1\xa1\xa1\xa1/u", $str, $ar);
    echo "$name: ";
    if ( count($ar) == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched {$ar[0]}\n";
    }
}

// Count how many UTF-8 characters /./u finds in each example.
echo "++ preg_match_all() examples\n";
foreach ( $examples as $name => $str ) {
    preg_match_all('/./u', $str, $ar);
    echo "$name: ";
    $num_utf8_chars = count($ar[0]);
    if ( $num_utf8_chars == 0 ) {
        echo "Matched nothing!\n";
    } else {
        echo "Matched $num_utf8_chars character\n";
    }
}
?>
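On PHP versions that provide preg_last_error(), the quiet failure in point 2 can at least be detected after the fact. The sketch below assumes a current PHP, where a preg_* call on an invalid UTF-8 subject returns false and preg_last_error() then reports PREG_BAD_UTF8_ERROR. The helper name is_valid_utf8 is made up for this example, and using an empty pattern as a validity probe is one possible idiom, not the note author's method.

```php
<?php
// Hypothetical helper: true when $str is valid UTF-8 as far as PCRE is concerned.
function is_valid_utf8(string $str): bool
{
    // The empty pattern matches any valid subject; /u forces the UTF-8 check.
    return preg_match('//u', $str) === 1;
}

$valid   = "\xc3\xb1";   // valid 2 octet sequence
$invalid = "\xc3\x28";   // invalid 2 octet sequence

$validOk   = is_valid_utf8($valid);
$invalidOk = is_valid_utf8($invalid);

// After the failed call, preg_last_error() says why nothing matched.
$errorIsBadUtf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Comparing the return value with `=== 1` matters here, because a plain truthiness test cannot tell "no match" (0) apart from "error" (false).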

csaba

Extracting lines of text: You might want to grab a line of text within a multiline piece of text. For example, suppose you want to replace the first and last lines within the <body> portion of a web $page with your own $lineFirst and $lineLast. Here's one possible way:

<?php
$lineFirst = "This is a new first line \r\n";
$lineLast = "This is a new last line \r\n";
$page = <<<EOD
<html><head>
<title>This is a test page</title>
</head><body>
This is the first line
Hi Fred
Hi Bill
This is the last line
</body>
</html>
EOD;
$re = "/<body>.*^(.+)(^.*?^)(.+)(^<\\/body>.*?)/smU";
if (preg_match($re, $page, $aMatch, PREG_OFFSET_CAPTURE)) {
    // Keep everything before the first body line, then splice in the new lines.
    $newPage = substr($page, 0, $aMatch[1][1]) . $lineFirst . $aMatch[2][0] . $lineLast . $aMatch[4][0];
    print $newPage;
}
?>

The two (.+) groups are meant to match the first and last lines within the <body> tag. The /s option (dot all) is needed so that .* can also match newlines. The /m option (multiline) is needed so that ^ can match at the start of every line, that is, immediately after a newline. The /U option (ungreedy) is needed so that .* and .+ only gobble up the minimum number of characters necessary to reach whatever follows the * or +.

The exception is .*? because under /U the trailing ? inverts the setting for that one quantifier, turning it from non-greedy back to greedy. In the middle, this ensures that all the lines except the first and last (within the <body> tag) are put into $aMatch[2]. At the end, it ensures that all the remaining characters in the string are gobbled up, which could also have been achieved by .*)\\z/ instead of .*?)/

Csaba Gabor from Vienna
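The effect of each modifier described in this note can be checked in isolation. The following is a small sketch on made-up sample strings, not part of the original script:

```php
<?php
$text = "aXbYb";

// Default: .* is greedy and runs to the last b.
preg_match('/a.*b/', $text, $greedy);
// With /U the same .* becomes lazy and stops at the first b.
preg_match('/a.*b/U', $text, $lazy);
// Under /U a trailing ? flips the quantifier back to greedy.
preg_match('/a.*?b/U', $text, $flipped);

$lines = "one\ntwo";

// /m lets ^ match at the start of every line, not only at the start of the subject.
$lineStarts = preg_match_all('/^\w+/m', $lines, $words);
// Without /s the dot refuses to match the newline between "one" and "two".
$withoutDotAll = preg_match('/e.t/', $lines);
// /s lets the dot match a newline as well.
$withDotAll = preg_match('/e.t/s', $lines);
```

Here $greedy[0] and $flipped[0] both hold "aXbYb", $lazy[0] holds "aXb", and $words[0] holds the two words "one" and "two".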
