mb_detect_encoding

Detect character encoding (PHP 4 >= 4.0.6, PHP 5)
string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

Example 1396. mb_detect_encoding() example

<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list character encoding by comma separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use array to specify encoding_list  */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";

echo mb_detect_encoding($str, $ary);
?>
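
The return value is worth checking: when the third argument (strict) is true, mb_detect_encoding() returns false if none of the listed encodings fits the string. A minimal sketch of handling that case, assuming a hypothetical helper named detect_or_default() (not part of mbstring) and a caller-chosen fallback label:

```php
<?php
// Hypothetical helper, not part of mbstring: returns the detected
// encoding name, or $default when strict detection fails.
function detect_or_default($str, $encodings, $default)
{
    $detected = mb_detect_encoding($str, $encodings, true);
    return $detected === false ? $default : $detected;
}

// "caf\xC3\xA9" is 'café' in UTF-8; it is not valid ASCII.
echo detect_or_default("caf\xC3\xA9", array('ASCII', 'UTF-8'), 'unknown'), "\n";

// "\xFF\xFE" is neither valid ASCII nor valid UTF-8.
echo detect_or_default("\xFF\xFE", array('ASCII', 'UTF-8'), 'unknown'), "\n";
```

The first call should print UTF-8 and the second the fallback label, since strict mode rejects both candidates for the second string.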

Code Examples / Notes » mb_detect_encoding

maarten

Sometimes mb_detect_encoding() is not what you need. When using PDFlib, for example, you want to VERIFY the correctness of UTF-8, and mb_detect_encoding() reports some ISO-8859-1 encoded text as UTF-8. To verify UTF-8, use the following:

<?php
//
// utf8 encoding validation developed based on Wikipedia entry at:
// http://en.wikipedia.org/wiki/UTF-8
//
// Implemented as a recursive descent parser based on a simple state machine
// copyright 2005 Maarten Meijer
//
// This cries out for a C-implementation to be included in PHP core
//
function valid_1byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0x80) == 0x00;
}

function valid_2byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xE0) == 0xC0;
}

function valid_3byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF0) == 0xE0;
}

function valid_4byte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xF8) == 0xF0;
}

function valid_nextbyte($char) {
    if (!is_int($char)) return false;
    return ($char & 0xC0) == 0x80;
}

function valid_utf8($string) {
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $char = ord(substr($string, $i++, 1));
        if (valid_1byte($char)) { // continue
            continue;
        } else if (valid_2byte($char)) { // check 1 byte
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if (valid_3byte($char)) { // check 2 bytes
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } else if (valid_4byte($char)) { // check 3 bytes
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
            if (!valid_nextbyte(ord(substr($string, $i++, 1))))
                return false;
        } // goto next char
    }
    return true; // done
}
?>

For a drawing of the state machine see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png

php-note-2005

Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php
// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string);
} // function is_utf8
?>

jaaks

The valid_utf8() example for verifying UTF-8 has one little bug. If a 10xxxxxx byte occurs alone, i.e. not inside a multibyte character, it is accepted although that is against the UTF-8 rules. Make the following replacement to repair it. Replace

    } // goto next char

with

    } else {
        return false; // 10xxxxxx occurring alone
    } // goto next char
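
Where the installed mbstring provides mb_check_encoding(), the same validation can probably be delegated to the extension instead of hand-written byte checks. A small sketch, assuming that function is available in your build; the wrapper name is_valid_utf8() is made up for this example:

```php
<?php
// Hypothetical wrapper: asks mbstring whether the bytes form valid UTF-8.
function is_valid_utf8($string)
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(is_valid_utf8("abc"));        // plain ASCII
var_dump(is_valid_utf8("\xC3\xA9"));   // two-byte sequence for 'é'
var_dump(is_valid_utf8("\x80"));       // lone continuation byte (10xxxxxx)
var_dump(is_valid_utf8("\xC0\x80"));   // overlong encoding of NUL
```

The first two inputs are valid UTF-8 and the last two are not, so this also covers the lone 10xxxxxx byte case described above.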

chrigu

If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:

mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.

I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to ISO-8859-1, which works fine. I did have a problem with the iconv conversion that follows: converting to ISO-8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice that \x80 is used as the euro sign in the 8859-1 charset. I could not use 8859-15 since that mangled some other characters, so I added two str_replace() calls:

<?php
if (detectUTF8($str)) {
    $str = str_replace("\xE2\x82\xAC", "&euro;", $str);
    $str = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
    $str = str_replace("&euro;", "\x80", $str);
}
?>

If HTML output is needed, the last str_replace() line is not necessary (and even unwanted).

sunggsun

From PHPDIG:

<?php
function isUTF8($str) {
    // The string counts as UTF-8 if it survives a round trip
    // through UTF-32 unchanged.
    return $str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");
}
?>

mark

For: rl at itfigures dot nl

Just note that using \x80 as your Euro symbol is NOT standard for ISO-8859-1 or ISO-8859-15, as \x80 is a reserved character. It is however "common practice" for Windows developers to mix Windows-1252 and ISO-8859-1. Just convert to Windows-1252 instead of ISO-8859-1 and you'll get your € symbol at the right place.
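
The difference between the charsets can be seen by converting a single euro sign. This is only an illustrative sketch using mb_convert_encoding(), assuming your mbstring build knows the encoding names Windows-1252 and ISO-8859-15; the variable names are arbitrary:

```php
<?php
$euro = "\xE2\x82\xAC"; // U+20AC EURO SIGN encoded as UTF-8

// Windows-1252 places the euro sign at byte 0x80.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');

// ISO-8859-15 places the euro sign at byte 0xA4.
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

echo bin2hex($cp1252), "\n"; // 80
echo bin2hex($latin9), "\n"; // a4
```

ISO-8859-1 has no euro sign at all, which is why a conversion to it has to transliterate or substitute the character.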

telemach

Beware: even if you need to distinguish between UTF-8 and ISO-8859-1 and you use the following detection order (as chrigu suggests),

mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1')

returns UTF-8. Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding().
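
The optional third argument (strict) from the signature at the top of this page may help with this case. The sketch below assumes the strings in the note above were ISO-8859-1 encoded, so the trailing 'é' is the single byte \xE9, which looks like the start of an unfinished UTF-8 sequence; the variable names are arbitrary:

```php
<?php
$latin1 = "accentu\xE9";     // 'accentué' as ISO-8859-1 bytes
$utf8   = "accentu\xC3\xA9"; // 'accentué' as UTF-8 bytes

// With strict set to true, a string that is not valid UTF-8
// is not reported as UTF-8.
var_dump(mb_detect_encoding($latin1, 'UTF-8, ISO-8859-1', true));
var_dump(mb_detect_encoding($utf8, 'UTF-8, ISO-8859-1', true));
```

In strict mode the first call should report ISO-8859-1 and the second UTF-8.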

chris

Based on the is_utf8() snippet using preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch character encoding from ISO-8859-1 to UTF-8. I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php
function detectUTF8($string) {
    return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )+%xs', $string);
}
?>
