strtok
Tokenize string
(PHP 4, PHP 5)
Example 2467. strtok() example
<?php
Example 2468. Old strtok() behavior
<?php
Output: string(0) ""
Example 2469. New strtok() behavior
<?php
Output: string(9) "something"
Code Examples / Notes » strtok
tysonlt
Why use strtok at all? If it's so flaky, why not just use split? eg. $token_array = split("$delim", $string); Then you can use all the nice array functions on it! :) mac.com@nemo
This function takes a string and returns an array of words (delimited by spaces), also taking into account quotes, double quotes, backticks and backslashes (for escaping). So

$string = "cp 'my file' to `Judy's file`";
var_dump(parse_cli($string));

would yield:

array(4) {
  [0]=> string(2) "cp"
  [1]=> string(7) "my file"
  [2]=> string(2) "to"
  [3]=> string(11) "Judy's file"
}

How it works: it runs through the string character by character, for each character looking up the action to take, based on that character and the current $state. Actions can be (one or more of) adding the character/string to the current word, adding the word to the output array, and changing or (re)storing the state. For example, a space will become part of the current 'word' (or 'token') if $state is 'doublequoted', but it will start a new token if $state was 'unquoted'. I was later told it's a "tokeniser using a finite state automaton". Who knew :-)

<?php
#_____________________
# parse_cli($string) /
function parse_cli($string)
{
    $state = 'space';
    $previous = '';  // stores current state when encountering a backslash (which changes $state to 'escaped', but has to fall back into the previous $state afterwards)
    $out = array();  // the return value
    $word = '';
    $type = '';      // type of character
    // array[states][chartypes] => actions
    $chart = array(
        'space'        => array('space'=>'',   'quote'=>'q',  'doublequote'=>'d',  'backtick'=>'b',  'backslash'=>'ue', 'other'=>'ua'),
        'unquoted'     => array('space'=>'w ', 'quote'=>'a',  'doublequote'=>'a',  'backtick'=>'a',  'backslash'=>'e',  'other'=>'a'),
        'quoted'       => array('space'=>'a',  'quote'=>'w ', 'doublequote'=>'a',  'backtick'=>'a',  'backslash'=>'e',  'other'=>'a'),
        'doublequoted' => array('space'=>'a',  'quote'=>'a',  'doublequote'=>'w ', 'backtick'=>'a',  'backslash'=>'e',  'other'=>'a'),
        'backticked'   => array('space'=>'a',  'quote'=>'a',  'doublequote'=>'a',  'backtick'=>'w ', 'backslash'=>'e',  'other'=>'a'),
        'escaped'      => array('space'=>'ap', 'quote'=>'ap', 'doublequote'=>'ap', 'backtick'=>'ap', 'backslash'=>'ap', 'other'=>'ap'));
    for ($i=0; $i<=strlen($string); $i++) {
        $char = substr($string, $i, 1);
        $type = array_search($char, array('space'=>' ', 'quote'=>'\'', 'doublequote'=>'"', 'backtick'=>'`', 'backslash'=>'\\'));
        if (! $type) $type = 'other';
        if ($type == 'other') {
            // grabs all characters that are also 'other' following the current one in one go
            preg_match("/[ \'\"\`\\\]/", $string, $matches, PREG_OFFSET_CAPTURE, $i);
            if ($matches) {
                $matches = $matches[0];
                $char = substr($string, $i, $matches[1]-$i); // yep, $char length can be > 1
                $i = $matches[1] - 1;
            } else {
                // no more match on special characters, that must mean this is the last word!
                // the .= hereunder is because we *might* be in the middle of a word that just contained special chars
                $word .= substr($string, $i);
                break; // jumps out of the for() loop
            }
        }
        $actions = $chart[$state][$type];
        for ($j=0; $j<strlen($actions); $j++) {
            $act = substr($actions, $j, 1);
            if ($act == ' ') $state = 'space';
            if ($act == 'u') $state = 'unquoted';
            if ($act == 'q') $state = 'quoted';
            if ($act == 'd') $state = 'doublequoted';
            if ($act == 'b') $state = 'backticked';
            if ($act == 'e') { $previous = $state; $state = 'escaped'; }
            if ($act == 'a') $word .= $char;
            if ($act == 'w') { $out[] = $word; $word = ''; }
            if ($act == 'p') $state = $previous;
        }
    }
    if (strlen($word)) $out[] = $word;
    return $out;
}
?>
cs2xz
Here is a way to strip all punctuation and collect only the words into an array called $token. The variable $invalid lists the punctuation characters, written as octal escapes ("\xxx"). At the end, the code displays the total number of words in $string and the 4th word in the string.

$string = "Hello, $%^\n\\\"jeff!!!!\"/. 'How are you!'";
// space and ASCII punctuation as octal escapes; 8 and 9 are not octal digits,
// so escapes such as \48 or \138 would turn the digits 8 and 9 into separators
$invalid = "\40\41\42\43\44\45\46\47\50\51\52\53\54\55\56\57\72\73\74\75\76\77\100\133\134\135\136\137\140\173\174\175\176\n\r\t";
$token = array();
$tok = strtok($string, $invalid);
while ($tok !== false) {
    echo "Word=$tok ";
    $token[] = $tok;
    $tok = strtok($invalid);
}
// displays the number of words in the string and the 4th word
echo "Number of token: " . count($token) . " ";
echo $token[3];
11-dec-2001 08:57
The example is unnecessarily confusing for beginners.

1) It is NOT strtok that fails when the returned string evaluates to false in a conditional expression; it is the loop test. A correct test is while ($tok !== false).
2) The same functionality (AS THE EXAMPLE) can be obtained with explode. Note that if you only need the first few tokens you can put a limit on explode!! Read the manual :)

array explode (string separator, string string [, INT LIMIT])

What you can NOT do with explode (or split) is change the separator after a token is returned, as for example when parsing a string along a simple format:

$styleStr = "color:#FFFFFF;font-size:12";
$key = strtok($styleStr, ":");
while ($key !== false) {
    $styleTab[$key] = strtok(";"); // do not change the target
    $key = strtok(":");            // string, just the separator list
}

$styleTab is array("color"=>"#FFFFFF","font-size"=>"12")

If you need the rest of the string, do:

$remaining = strtok(''); // (empty separator)
Ivan soletan
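As an illustration of points 1 and 2 of the 11-dec-2001 note above, here is a minimal sketch contrasting the loose and the strict loop test, plus explode() with a limit. The helper names tokens_strict() and tokens_loose() and the sample strings are made up for this illustration; they are not part of PHP or of the manual's example.

```php
<?php
// Hypothetical helper: collects tokens and stops only when strtok() returns false.
function tokens_strict(string $text, string $delims): array
{
    $out = [];
    for ($tok = strtok($text, $delims); $tok !== false; $tok = strtok($delims)) {
        $out[] = $tok;
    }
    return $out;
}

// Hypothetical helper: the same loop with the loose test, which also stops
// at any token that evaluates to false, such as the string "0".
function tokens_loose(string $text, string $delims): array
{
    $out = [];
    $tok = strtok($text, $delims);
    while ($tok) {
        $out[] = $tok;
        $tok = strtok($delims);
    }
    return $out;
}

$strict = tokens_strict("1 0 2", " "); // keeps all three tokens
$loose  = tokens_loose("1 0 2", " ");  // gives up when it reaches "0"
$head   = explode(" ", "a b c d", 2);  // the limit leaves the tail in one piece
```

The strict version is the one to copy; the loose one is shown only to make the failure visible.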
strtok's new behaviour isn't more correct than the old one. Example: when parsing a string for a quoted-string (e.g. an RFC822 header, without wanting to install mailparse from PECL), I walk char by char, and whenever I encounter a double quote I use strtok to find the matching closing double quote quite easily; this is done for improved performance. But what if there's an empty quoted-string? Another example is lines like

name="quoted-value"; second="another one";

I get the name using strtok with '=', then check whether the value is quoted, which it is, so I use the method described above to get the quoted string. All that's left is

; second="another one";

Now I advance and drop any whitespace after the current value assignment. Users can't be expected never to put whitespace before the semicolon, which is why I drop it with strtok (using ';') again, and then get to the next optional assignment with another $s = strtok( '' ).

I KNOW there are ways to work around this using trim and the like. But that doesn't explain why strtok is now working "correctly" when it supposedly didn't before.
geert
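To make Ivan soletan's complaint above concrete: the newer strtok() behaviour skips empty fields, which is exactly what makes an empty quoted string hard to detect. A small sketch, assuming PHP 4.1.0 or later; the sample string is invented for this illustration.

```php
<?php
$line = 'a,,b';

// strtok() silently skips the empty field between the two commas.
$viaStrtok = [];
for ($tok = strtok($line, ','); $tok !== false; $tok = strtok(',')) {
    $viaStrtok[] = $tok;
}

// explode() keeps the empty field as an empty string.
$viaExplode = explode(',', $line);
```

Here $viaStrtok holds two elements and $viaExplode three, so code that needs to see empty fields is probably better served by explode().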
Shaun's function needs a little update because it produces an error message that the variables $text and $words were not defined. Written like this it won't produce an error:

<?php
function summarize($paragraph, $limit)
{
    $tok = strtok($paragraph, " ");
    $text = "";
    $words = 0;
    while ($tok !== false) {
        $text .= " ".$tok;
        $words++;
        if (($words >= $limit) && ((substr($tok, -1) == "!") || (substr($tok, -1) == ".")))
            break;
        $tok = strtok(" ");
    }
    return ltrim($text);
}
?>
david dot mazur
If you want to tokenize only part of the string, and store the "untokenized" part in some variable, you have to call strtok one last time with separator "" (i.e. the empty string). dethmetal jeff
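A short sketch of david dot mazur's tip above about calling strtok one last time with the empty string. The request line is an invented sample, not something from the manual.

```php
<?php
$request = "GET /index.html HTTP/1.1";

// Take only the first token...
$method = strtok($request, " ");

// ...then ask for everything that has not been tokenized yet
// by passing the empty string as the separator list.
$rest = strtok("");
```

After this, $method holds the first word and $rest holds the remainder of the string, spaces included.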
If you need to parse through a very large delimited text file (such as a word list), combine strtok with file_get_contents. It is much faster than all of the other alternatives I have found (using file() to parse the file into an array, reading the file line by line using fgets()).

$dictionary = file_get_contents('path/to/dictionary', true);
// check that the file was read properly
if ($dictionary === false) {
    exit("read error");
}
// dictionary is \n delimited
$tok = strtok($dictionary, "\n");
// loop through until we reach the end of the string
while ($tok !== false) {
    // do whatever it is you need to do with the $tok string here
    $tok = strtok("\n"); // get next string
}
rawat dot arun
I was trying to compare two strings of equal length using strtok. However, tokenizing both at the same time leads to erratic output. Therefore the output of each strtok run should first be stored in an array and then be used for the comparison. Here is a small piece of code for it.

<?php
$string = "This is an XYZ example string";
$toks = array();
$tok = strtok($string, ' ');
while ($tok !== false) {
    $toks[] = $tok;
    $tok = strtok(' ');
}

$string_1 = "This is an unknown example string";
$toks1 = array();
$tok1 = strtok($string_1, ' ');
while ($tok1 !== false) {
    $toks1[] = $tok1;
    $tok1 = strtok(' ');
}

// print the words that match at the same position in both strings
$count = min(count($toks), count($toks1));
for ($ctr = 0; $ctr < $count; $ctr++) {
    if ($toks[$ctr] == $toks1[$ctr]) {
        echo "W=$toks[$ctr]<br />";
        echo "W1=$toks1[$ctr]<br />";
    }
}
?>

Thanks, Arun
brian dot cairns dot remove dot this
I was looking for a function to tokenize a string, taking double-quoted inline strings into account (for breaking up search queries, for example), but none of the ones I found seemed particularly efficient or elegant, so I wrote my own. Here it is:

<?php
// split a string into an array of space-delimited tokens, taking double-quoted strings into account
function tokenizeQuoted($string)
{
    for ($tokens = array(), $nextToken = strtok($string, ' '); $nextToken !== false; $nextToken = strtok(' ')) {
        if ($nextToken[0] == '"')
            $nextToken = $nextToken[strlen($nextToken)-1] == '"'
                ? substr($nextToken, 1, -1)
                : substr($nextToken, 1) . ' ' . strtok('"');
        $tokens[] = $nextToken;
    }
    return $tokens;
}
?>

Example:

$tokens = tokenizeQuoted('this is "my test string" single "words" work too');

Results in $tokens containing:

Array
(
    [0] => this
    [1] => is
    [2] => my test string
    [3] => single
    [4] => words
    [5] => work
    [6] => too
)

Hope this helps someone.
shaun
Here's some code to extract the first part of a long paragraph, e.g. to use as a summary. Starting at the beginning of the paragraph it gets as many complete sentences as are necessary to contain $limit words. For example, with $limit at 20 it would return the first two sentences of the paragraph you're reading right now (the first 20 words plus the rest of the sentence in which the limit was hit).

function summarize($paragraph, $limit){
    $tok = strtok($paragraph, " ");
    while($tok){
        $text .= " $tok";
        $words++;
        if(($words >= $limit) && ((substr($tok, -1) == "!")||(substr($tok, -1) == ".")))
            break;
        $tok = strtok(" ");
    }
    return ltrim($text);
}

Might be a better way to do this, but it worked for me. Hope you find it useful!
desolate19
Here is yet another explanation of strtok for the explode/split comments. You can do things with strtok that you can't do with explode/split. explode breaks a string using another string; split breaks a string using a regular expression. strtok breaks a string using single _characters_, but the best part is you can use multiple characters at the same time. For example, if you are accepting user input and aren't sure how the user will decide to divide up their data, you could choose to tokenize on spaces, hyphens, slashes and backslashes ALL AT THE SAME TIME:

<?php
$teststr = "blah1 blah2/blah3-blah4\\blah5";
$toks = array();
$tok = strtok($teststr, " /-\\");
while ($tok !== FALSE) {
    $toks[] = $tok;
    $tok = strtok(" /-\\");
}
foreach ($toks as $k => $v) {
    print ("$k => $v<BR>\n");
}
/* OUTPUT:
0 => blah1
1 => blah2
2 => blah3
3 => blah4
4 => blah5
*/
?>

You can't do that with explode, and this should be faster than using split because split uses regular expressions. And for the comments about explode/split putting your output into an array... as you can see, it's not hard to work with arrays in PHP.
jrust
I had a website which relied on far too much of the old strtok behaviour to convert to the new (PHP 4.1.0 and later) way, so I wrote this function to mimic the way strtok worked prior to 4.1.0:

function strtok_old($string, $delim = null)
{
    static $origDelim, $origString, $origPos;
    if (!isset($origDelim)) {
        $origDelim = null;
    }
    if (!isset($origString)) {
        $origString = null;
    }
    if (!isset($origPos)) {
        $origPos = null;
    }
    // continuing an already started strtok
    if ($string == $origDelim) {
        $string = $origString;
        $delim = $origDelim;
    }
    // else starting from scratch
    else {
        $origString = $string;
        $origDelim = $delim;
        $origPos = 0;
    }
    if ($origPos !== false && $origPos < strlen($string)) {
        $newPos = strpos($string, $delim, $origPos);
    }
    else {
        $newPos = false;
    }
    // the token wasn't found, go to end of string
    if ($newPos === false) {
        $newPos = strlen($string);
    }
    $return = substr($string, $origPos, ($newPos - $origPos));
    $origPos = ++$newPos;
    return $return;
}
torsten
Beware! This function cannot be used to start a recursion during the loop. You have to collect the results in an array and then cycle the recursion through that array. Example:

$word = strtok($line, TOKENS);
while ($word !== false) {
    // DO NOT START RECURSION HERE USING $word PARAMETER
    $words[] = $word;
    $word = strtok(TOKENS);
}
foreach ($words as $word) {
    *RECURSE*($word);
}
// This seems very silly but as the function is not instantiated between recursions it cannot work directly.
james
Be very careful with using strtok if there's any chance that you may be calling other functions that may use strtok as well. If any other function that you call while parsing the string decides to call strtok as well, it will clobber the internal string pointer being used by strtok and you may get unexpected results. Here's some code to explain what I mean:

function parse_string2($string2)
{
    for ($tok = strtok($string2, '.'); $tok !== false; $tok = strtok(".")) {
        echo $tok;
    }
}

$string1 = "1.2.3.4.!.8.9";
$string2 = "5.6.7";
for ($word = strtok($string1, '.'); $word !== false; $word = strtok(".")) {
    if ($word == '!') {
        echo parse_string2($string2);
    } else {
        echo $word;
    }
}

If I didn't know the internals of the function parse_string2 (say someone else develops that), but all I know is that parse_string2 should print out 567, then my expected output might be: 123456789

Instead, you only get: 1234567.

It would be interesting if they could implement a strtok_r where you could explicitly denote which string to tokenize.
slt
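One way to get what james asks for above is to keep the tokenizer state in an object instead of inside strtok. The following is only a sketch under that assumption: the Tokenizer class and the digits_of() helper are hypothetical names invented here, not part of PHP, and the class mimics only the basic strtok behaviour (skip leading separators, return false at the end).

```php
<?php
// Hypothetical helper class: every instance keeps its own string and position,
// so nested tokenizers cannot disturb one another.
final class Tokenizer
{
    private $text;
    private $pos = 0;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    // Returns the next token, or false when none are left (mirrors strtok()).
    public function next(string $delims)
    {
        $end = strlen($this->text);
        if ($this->pos >= $end) {
            return false;
        }
        // Skip any leading separator characters.
        $this->pos += strspn($this->text, $delims, $this->pos);
        if ($this->pos >= $end) {
            return false;
        }
        // Measure the run of non-separator characters and cut it out.
        $len = strcspn($this->text, $delims, $this->pos);
        $tok = substr($this->text, $this->pos, $len);
        // Step past the token and the single separator that ended it.
        $this->pos += $len + 1;
        return $tok;
    }
}

// Hypothetical inner function that tokenizes on its own, like parse_string2().
function digits_of(string $s): string
{
    $out = '';
    $inner = new Tokenizer($s);
    while (($tok = $inner->next('.')) !== false) {
        $out .= $tok;
    }
    return $out;
}

$outer = new Tokenizer("1.2.3.4.!.8.9");
$result = '';
while (($word = $outer->next('.')) !== false) {
    $result .= ($word === '!') ? digits_of("5.6.7") : $word;
}
// $result is "123456789": the inner tokenizer did not clobber the outer one.
```

The design choice is simply to trade strtok's hidden global pointer for explicit per-object state, at the cost of a few more lines.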
As 'mckay' wrote, strtok's 2nd argument is a list of tokens, not a string delimiter. It's not as obvious as one may think, and it may be confusing for beginners like me. So the docs should state something like strtok(string where2search, char token2cut). And for the split-lover above =) 'tysonlt': it's better to use explode because it's lighter than split (to quote the original manual: "(...) use explode(), which doesn't incur the overhead of the regular expression engine"). Regards, StarLight
manicdepressive
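To illustrate slt's point above that strtok's second argument is a set of single characters rather than one delimiter string, here is a small comparison with explode(). The input string is an invented sample.

```php
<?php
$input = "a, b,c d";

// strtok(): the comma and the space each act as a separator on their own.
$viaStrtok = [];
for ($tok = strtok($input, ", "); $tok !== false; $tok = strtok(", ")) {
    $viaStrtok[] = $tok;
}

// explode(): the two-character string ", " has to match as a whole.
$viaExplode = explode(", ", $input);
```

With this input strtok() yields four tokens, while explode() splits only at the one place where a comma is directly followed by a space.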
<pre><?php
/** get leading, trailing, and embedded separator tokens that were 'skipped'
    if for some ungodly reason you are using php to implement a simple parser
    that needs to detect nested clauses as it builds a parse tree */
$str = "(((alpha(beta))(gamma))";
$seps = '()';
$tok = strtok( $str,$seps ); // return false on empty string or null
$cur = 0;
$dumbDone = FALSE;
$done = (FALSE===$tok);
while (!$done) {
    // process skipped tokens (if any at first iteration) (special for last)
    $posTok = $dumbDone ? strlen($str) : strpos($str, $tok, $cur );
    $skippedMany = substr( $str, $cur, $posTok-$cur ); // false when 0 width
    $lenSkipped = strlen($skippedMany); // 0 when false
    if (0!==$lenSkipped) {
        $last = strlen($skippedMany) -1;
        for($i=0; $i<=$last; $i++){
            $skipped = $skippedMany[$i];
            $cur += strlen($skipped);
            echo "skipped: $skipped\n";
        }
    }
    if ($dumbDone) break; // this is the only place the loop is terminated
    // process current tok
    echo "curr tok: ".$tok."\n";
    // update cursor
    $cur += strlen($tok);
    // get any next tok
    if (!$dumbDone){
        $tok = strtok($seps);
        $dumbDone = (FALSE===$tok); // you're not really done till you check for trailing skipped
    }
};
?></pre>