html_entity_decode

Convert all HTML entities to their applicable characters (PHP 4 >= 4.3.0, PHP 5)
string html_entity_decode ( string string [, int quote_style [, string charset]] )

Example 2412. Decoding HTML entities

<?php
$orig = "I'll \"walk\" the <b>dog</b> now";

$a = htmlentities($orig);
$b = html_entity_decode($a);

echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $b; // I'll "walk" the <b>dog</b> now

// For users prior to PHP 4.3.0 you may do this:
function unhtmlentities($string)
{
    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}

$c = unhtmlentities($a);

echo $c; // I'll "walk" the <b>dog</b> now

?>
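
The `e` (eval) modifier used by unhtmlentities() above was deprecated in PHP 5.5 and removed in PHP 7.0. The following is a minimal sketch of the same fallback written with preg_replace_callback(), assuming PHP 5.3 or later for closures. The name unhtmlentities_cb is hypothetical, and like the original it only emits single bytes through chr(), so numeric references above 255 are not decoded correctly.

```php
<?php
// Hypothetical callback-based variant of unhtmlentities(); not part of the manual.
function unhtmlentities_cb($string)
{
    // replace hexadecimal numeric entities, e.g. &#x42;
    $string = preg_replace_callback(
        '~&#x([0-9a-f]+);~i',
        function ($m) { return chr(hexdec($m[1])); },
        $string
    );
    // replace decimal numeric entities, e.g. &#65;
    $string = preg_replace_callback(
        '~&#([0-9]+);~',
        function ($m) { return chr((int) $m[1]); },
        $string
    );
    // replace named entities using the flipped translation table
    $trans_tbl = array_flip(get_html_translation_table(HTML_ENTITIES));
    return strtr($string, $trans_tbl);
}

echo unhtmlentities_cb('&lt;b&gt;dog&lt;/b&gt; &#65;&#x42;'), "\n";
```

The (int) cast also sidesteps the leading-zero problem discussed in the notes, because the captured digits are parsed as a decimal string rather than evaluated as PHP source.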

Code Examples / Notes » html_entity_decode

hayley watson

To go further with Fabian's comment: The XML specification (production 66) says that (decimal) numeric character references start with '&#', followed by one or more digits [0-9], and end with a ';' - just as the documented regular expression states. Hex references start with "&#x" and the allowed digits are [0-9a-fA-F]. And indeed, &#000000000000000000039; is a legitimate reference for an apostrophe (but don't tell Internet Explorer). So Fabian's alteration to the expression is necessary. It's still insufficient, however, as chr() does not handle multibyte characters such as "€".
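
A minimal sketch of what this note asks for: accept any number of leading zeros and produce UTF-8 instead of relying on chr() alone. The helper names codepoint_to_utf8 and decode_numeric_refs are hypothetical, closures assume PHP 5.3 or later, and the sketch does not reject surrogate code points.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
function codepoint_to_utf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x110000) {
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return ''; // outside the Unicode range
}

// Hypothetical helper: decode decimal and hexadecimal numeric references.
function decode_numeric_refs($text)
{
    return preg_replace_callback(
        '~&#(x[0-9a-f]+|[0-9]+);~i',
        function ($m) {
            $ref = $m[1];
            // The (int) cast parses the digits as decimal, so leading zeros are harmless.
            $cp = ($ref[0] === 'x' || $ref[0] === 'X')
                ? hexdec(substr($ref, 1))
                : (int) $ref;
            return codepoint_to_utf8($cp);
        },
        $text
    );
}

echo decode_numeric_refs('&#000000000000000000039; &#8364; &#x20AC;'), "\n";
```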

marius

To convert HTML entities into Unicode (UTF-8) characters, use the following:

$trans_tbl = get_html_translation_table(HTML_ENTITIES);
foreach ($trans_tbl as $k => $v) {
    $ttr[$v] = utf8_encode($k);
}
$text = strtr($text, $ttr);

aidan

This functionality is now implemented in the PEAR package PHP_Compat. More information about using this function without upgrading your version of PHP can be found at the link below: http://pear.php.net/package/PHP_Compat

daniel

This function seems to have two limitations (at least in PHP 4.3.8):

a) it does not work with multibyte character encodings, such as UTF-8
b) it does not decode numeric entity references

a) can be solved by using iconv to convert to ISO-8859-1, then decoding the entities, then converting to UTF-8 again. But that's quite ugly and destroys all characters not present in Latin-1.

b) can be solved rather nicely using the following code:

<?php
function decode_entities($text) {
    $text = html_entity_decode($text, ENT_QUOTES, "ISO-8859-1"); # NOTE: UTF-8 does not work!
    $text = preg_replace('/&#(\d+);/me', "chr(\\1)", $text); # decimal notation
    $text = preg_replace('/&#x([a-f0-9]+);/mei', "chr(0x\\1)", $text); # hex notation
    return $text;
}
?>

HTH

jojo

The decipherment does the character encoded by the escape function of JavaScript. When the multi byte is used on the page, it is effective. javascript escape('aaã‚ã‚aa') ..... 'aa%u3042%u3042aa' php jsEscape_decode('aa%u3042%u3042aa')..'aaã‚ã‚aa' <? function jsEscape_decode($jsEscaped,$outCharCode='SJIS'){ $arrMojis = explode("%u",$jsEscaped); for ($i = 1;$i < count($arrMojis);$i++){ $c = substr($arrMojis[$i],0,4); $cc = mb_convert_encoding(pack('H*',$c),$outCharCode,'UTF-16'); $arrMojis[$i] = substr_replace($arrMojis[$i],$cc,0,4); } return implode('',$arrMojis); } ?>

php dot net

Quick & dirty code that translates numeric entities to UTF-8.

<?php
function replace_num_entity($ord)
{
    $ord = $ord[1];
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match)) {
        $ord = hexdec($match[1]);
    } else {
        $ord = intval($ord);
    }

    $no_bytes = 0;
    $byte = array();

    if ($ord < 128) {
        return chr($ord);
    } elseif ($ord < 2048) {
        $no_bytes = 2;
    } elseif ($ord < 65536) {
        $no_bytes = 3;
    } elseif ($ord < 1114112) {
        $no_bytes = 4;
    } else {
        return;
    }

    switch ($no_bytes) {
        case 2: {
            $prefix = array(31, 192);
            break;
        }
        case 3: {
            $prefix = array(15, 224);
            break;
        }
        case 4: {
            $prefix = array(7, 240);
        }
    }

    for ($i = 0; $i < $no_bytes; $i++) {
        $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
    }

    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

    $ret = '';
    for ($i = 0; $i < $no_bytes; $i++) {
        $ret .= chr($byte[$i]);
    }
    return $ret;
}

$test = 'This is a &#269;&#x5d0; test&#39;';

echo $test . "<br />\n";
echo preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', $test);
?>

silvan

Passing NULL or FALSE as a string will generate a '500 Internal Server Error' (or break the script when inside a function). So always test your string first before passing it to html_entity_decode().
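
One way to follow this advice is a small guard around the call. This is only a sketch: the wrapper name safe_entity_decode is hypothetical, returning an empty string for non-string input is an assumption about what the caller wants, and the 'UTF-8' default assumes PHP 5 or later.

```php
<?php
// Hypothetical wrapper: decode only real, non-empty strings.
function safe_entity_decode($value, $flags = ENT_QUOTES, $charset = 'UTF-8')
{
    if (!is_string($value) || $value === '') {
        return ''; // NULL, FALSE and other non-strings never reach html_entity_decode()
    }
    return html_entity_decode($value, $flags, $charset);
}

var_dump(safe_entity_decode(null));
var_dump(safe_entity_decode('&lt;p&gt;'));
```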

florianborn

Note that <?php echo urlencode(html_entity_decode("&nbsp;")); ?> will output "%A0" instead of "+".
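
The reason is that &nbsp; decodes to the non-breaking space (byte 0xA0 in ISO-8859-1), not to an ordinary space. A short sketch, with the charset passed explicitly so the result does not depend on the PHP version's default; replacing the byte with a plain space is just one possible normalization.

```php
<?php
// Decode to ISO-8859-1 so that &nbsp; becomes the single byte 0xA0.
$decoded = html_entity_decode('a&nbsp;b', ENT_QUOTES, 'ISO-8859-1');

// Encoding the result directly keeps the non-breaking space.
$raw = urlencode($decoded);

// Replacing 0xA0 with an ordinary space first gives the "+" form.
$normalized = urlencode(str_replace("\xA0", ' ', $decoded));

echo $raw, "\n", $normalized, "\n";
```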

laurynas dot butkus

In PHP 4, html_entity_decode() does not work well with UTF-8, emitting "Warning: cannot yet handle MBCS in html_entity_decode()!". Here is a working solution combining several workarounds:

<?php
function html_entity_decode_utf8($string)
{
    static $trans_tbl;

    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'code2utf(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'code2utf(\\1)', $string);

    // replace literal entities
    if (!isset($trans_tbl)) {
        $trans_tbl = array();
        foreach (get_html_translation_table(HTML_ENTITIES) as $val => $key) {
            $trans_tbl[$key] = utf8_encode($val);
        }
    }
    return strtr($string, $trans_tbl);
}

// Returns the UTF-8 string corresponding to the Unicode value (from php.net, courtesy - romans@void.lv)
function code2utf($num)
{
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    return '';
}
?>

akniep

In answer to "laurynas dot butkus at gmail dot com" and "romans@void.lv" and their great code2utf function: I added handling for the entries in [128, 160[, which are not ASCII but are the same for all major western encodings such as ISO-8859-X and UTF-8, as has been mentioned before. Now the following function should in fact convert any number (table entry) into a UTF-8 character. Thus, the return value of code2utf( <number> ) equals the character that is represented by the XML entity &#<number>; (exceptions: #129, #141, #143, #144, #157). To give an example, the function may be useful for creating a UTF-8-compatible html_entity_decode function, or for determining the entry position of UTF-8 characters in order to find the correct entity replacement, or similar.

function code2utf($number)
{
    if ($number < 0) return FALSE;
    if ($number < 128) return chr($number);

    // Removing / replacing Windows illegal characters
    if ($number < 160) {
        if ($number==128) $number=8364;
        elseif ($number==129) $number=160; // (Rayo:) #129 has no relevant sign, thus mapped to the non-breaking space #160
        elseif ($number==130) $number=8218;
        elseif ($number==131) $number=402;
        elseif ($number==132) $number=8222;
        elseif ($number==133) $number=8230;
        elseif ($number==134) $number=8224;
        elseif ($number==135) $number=8225;
        elseif ($number==136) $number=710;
        elseif ($number==137) $number=8240;
        elseif ($number==138) $number=352;
        elseif ($number==139) $number=8249;
        elseif ($number==140) $number=338;
        elseif ($number==141) $number=160; // (Rayo:) #141 has no relevant sign, thus mapped to the non-breaking space #160
        elseif ($number==142) $number=381;
        elseif ($number==143) $number=160; // (Rayo:) #143 has no relevant sign, thus mapped to the non-breaking space #160
        elseif ($number==144) $number=160; // (Rayo:) #144 has no relevant sign, thus mapped to the non-breaking space #160
        elseif ($number==145) $number=8216;
        elseif ($number==146) $number=8217;
        elseif ($number==147) $number=8220;
        elseif ($number==148) $number=8221;
        elseif ($number==149) $number=8226;
        elseif ($number==150) $number=8211;
        elseif ($number==151) $number=8212;
        elseif ($number==152) $number=732;
        elseif ($number==153) $number=8482;
        elseif ($number==154) $number=353;
        elseif ($number==155) $number=8250;
        elseif ($number==156) $number=339;
        elseif ($number==157) $number=160; // (Rayo:) #157 has no relevant sign, thus mapped to the non-breaking space #160
        elseif ($number==158) $number=382;
        elseif ($number==159) $number=376;
    } //if

    if ($number < 2048) return chr(($number >> 6) + 192) . chr(($number & 63) + 128);
    if ($number < 65536) return chr(($number >> 12) + 224) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);
    if ($number < 2097152) return chr(($number >> 18) + 240) . chr((($number >> 12) & 63) + 128) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);

    return FALSE;
} //code2utf()

gaui

if( !function_exists( 'html_entity_decode' ) )
{
    function html_entity_decode( $given_html, $quote_style = ENT_QUOTES )
    {
        $trans_table = array_flip(get_html_translation_table( HTML_SPECIALCHARS, $quote_style ));
        $trans_table['&#39;'] = "'";
        return ( strtr( $given_html, $trans_table ) );
    }
}

loufoque

If you want to decode NCRs to UTF-8, use this function instead of chr().

function utf8_chr($code)
{
    if ($code < 128) return chr($code);
    else if ($code < 2048) return chr(($code >> 6) + 192) . chr(($code & 63) + 128);
    else if ($code < 65536) return chr(($code >> 12) + 224) . chr((($code >> 6) & 63) + 128) . chr(($code & 63) + 128);
    // the shifts and masks are parenthesized because + binds tighter than >> and &
    else if ($code < 2097152) return chr(($code >> 18) + 240) . chr((($code >> 12) & 63) + 128) . chr((($code >> 6) & 63) + 128) . chr(($code & 63) + 128);
}

emilianomartinezluque

I've been using the great replace_num_entity function posted in these notes. But there seem to be some problems with the 128 to 160 character range. For example, try:

<?php header("Content-type: text/html; charset=utf-8"); ?>
<html><body>
<?php
for ($x = 128; $x < 161; $x++) {
    echo('&#' . $x . '; -- ' . preg_replace_callback('/&#([0-9a-fx]+);/mi', 'replace_num_entity', '&#' . $x . ';') . '<br />');
}
?>
</body></html>

I really don't know the reason for this (since according to the UTF-8 specs the function should have worked), but I made a modified version of the function to address it. Hope it helps.

function replace_num_entity($ord)
{
    $ord = $ord[1];
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match)) {
        $ord = hexdec($match[1]);
    } else {
        $ord = intval($ord);
    }

    $no_bytes = 0;
    $byte = array();

    // special cases for the 128 to 160 range
    if ($ord == 128) { return chr(226).chr(130).chr(172); }
    elseif ($ord == 129) { return chr(239).chr(191).chr(189); }
    elseif ($ord == 130) { return chr(226).chr(128).chr(154); }
    elseif ($ord == 131) { return chr(198).chr(146); }
    elseif ($ord == 132) { return chr(226).chr(128).chr(158); }
    elseif ($ord == 133) { return chr(226).chr(128).chr(166); }
    elseif ($ord == 134) { return chr(226).chr(128).chr(160); }
    elseif ($ord == 135) { return chr(226).chr(128).chr(161); }
    elseif ($ord == 136) { return chr(203).chr(134); }
    elseif ($ord == 137) { return chr(226).chr(128).chr(176); }
    elseif ($ord == 138) { return chr(197).chr(160); }
    elseif ($ord == 139) { return chr(226).chr(128).chr(185); }
    elseif ($ord == 140) { return chr(197).chr(146); }
    elseif ($ord == 141) { return chr(239).chr(191).chr(189); }
    elseif ($ord == 142) { return chr(197).chr(189); }
    elseif ($ord == 143) { return chr(239).chr(191).chr(189); }
    elseif ($ord == 144) { return chr(239).chr(191).chr(189); }
    elseif ($ord == 145) { return chr(226).chr(128).chr(152); }
    elseif ($ord == 146) { return chr(226).chr(128).chr(153); }
    elseif ($ord == 147) { return chr(226).chr(128).chr(156); }
    elseif ($ord == 148) { return chr(226).chr(128).chr(157); }
    elseif ($ord == 149) { return chr(226).chr(128).chr(162); }
    elseif ($ord == 150) { return chr(226).chr(128).chr(147); }
    elseif ($ord == 151) { return chr(226).chr(128).chr(148); }
    elseif ($ord == 152) { return chr(203).chr(156); }
    elseif ($ord == 153) { return chr(226).chr(132).chr(162); }
    elseif ($ord == 154) { return chr(197).chr(161); }
    elseif ($ord == 155) { return chr(226).chr(128).chr(186); }
    elseif ($ord == 156) { return chr(197).chr(147); }
    elseif ($ord == 157) { return chr(239).chr(191).chr(189); }
    elseif ($ord == 158) { return chr(197).chr(190); }
    elseif ($ord == 159) { return chr(197).chr(184); }
    elseif ($ord == 160) { return chr(194).chr(160); }

    if ($ord < 128) {
        return chr($ord);
    } elseif ($ord < 2048) {
        $no_bytes = 2;
    } elseif ($ord < 65536) {
        $no_bytes = 3;
    } elseif ($ord < 1114112) {
        $no_bytes = 4;
    } else {
        return;
    }

    switch ($no_bytes) {
        case 2: {
            $prefix = array(31, 192);
            break;
        }
        case 3: {
            $prefix = array(15, 224);
            break;
        }
        case 4: {
            $prefix = array(7, 240);
        }
    }

    for ($i = 0; $i < $no_bytes; $i++) {
        $byte[$no_bytes - $i - 1] = (($ord & (63 * pow(2, 6 * $i))) / pow(2, 6 * $i)) & 63 | 128;
    }

    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

    $ret = '';
    for ($i = 0; $i < $no_bytes; $i++) {
        $ret .= chr($byte[$i]);
    }
    return $ret;
}

hurricane

I shortened the function replace_num_entity a bit to make it more understandable and clean. Maybe now someone will see the problem it possibly has (as mentioned in another note).

<?php
function replace_num_entity($ord)
{
    $ord = $ord[1];
    if (preg_match('/^x([0-9a-f]+)$/i', $ord, $match))
        $ord = hexdec($match[1]);
    else
        $ord = intval($ord);

    $no_bytes = 0;
    $byte = array();

    if ($ord < 128) return chr($ord);
    if ($ord < 2048) $no_bytes = 2;
    else if ($ord < 65536) $no_bytes = 3;
    else if ($ord < 1114112) $no_bytes = 4;
    else return;

    switch ($no_bytes) {
        case 2: $prefix = array(31, 192); break;
        case 3: $prefix = array(15, 224); break;
        case 4: $prefix = array(7, 240);
    }

    for ($i = 0; $i < $no_bytes; ++$i)
        $byte[$no_bytes-$i-1] = (($ord & (63 * pow(2,6*$i))) / pow(2,6*$i)) & 63 | 128;

    $byte[0] = ($byte[0] & $prefix[0]) | $prefix[1];

    $ret = '';
    for ($i = 0; $i < $no_bytes; ++$i)
        $ret .= chr($byte[$i]);
    return $ret;
}
?>

elektronaut gmx.net

I made my own fix to allow numerical entities in UTF-8 in PHP 4...

<?php
function utf8_replaceEntity($result) {
    $value = (int)$result[1];
    $string = '';
    $len = round(pow($value, 1/8)); // rough estimate of the number of UTF-8 bytes
    for ($i = $len; $i > 0; $i--) {
        $part = ($value & (255 >> 2)) | pow(2, 7);
        if ($i == 1) $part |= 255 << (8 - $len);
        $string = chr($part) . $string;
        $value >>= 6;
    }
    return $string;
}

function utf8_html_entity_decode($string) {
    return preg_replace_callback('/&#([0-9]+);/u', 'utf8_replaceEntity', $string);
}

$string = '&#8217;&#8216; &#8211; &#8220; &#8221;'
        . ' &#263; &#324; &#345;';
$string = utf8_html_entity_decode($string);

header('Content-Type: text/html; charset=UTF-8');
echo '<li>' . $string;
?>

teecee a teecee pont hu

Hi! The main problem when you try to unhtmlentities UTF-8 strings is that get_html_translation_table() gives back a non-UTF-8 conversion table. So the idea is to get the translation table and then translate the needed non-UTF-8 strings to UTF-8. I have this code working; it is actually the code sent by 'daviscabral', just with an extra foreach in it ( http://hu.php.net/manual/en/function.htmlentities.php#68479 ). The code is:

<?php
function unhtmlentitiesUtf8($string) {
    // replace numeric entities
    $string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
    $string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
    // replace literal entities
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    // changing translation table to UTF-8
    foreach ($trans_tbl as $key => $value) {
        $trans_tbl[$key] = iconv('ISO-8859-1', 'UTF-8', $value);
    }
    return strtr($string, $trans_tbl);
}
?>

If you need this in production code, I suggest putting $trans_tbl into a commonly included file; I think it should be faster. (Maybe the easiest way to do this is to write, after the translation, die(var_export($trans_tbl, true)); and copy and paste the source of the displayed text.) And don't forget to check that the browser uses the UTF-8 code page! ;)

romekt

Here's a simple workaround for the UTF-8 support problem:

$var = iconv("UTF-8", "ISO-8859-1", $var);
$var = html_entity_decode($var, ENT_QUOTES, 'ISO-8859-1');
$var = iconv("ISO-8859-1", "UTF-8", $var);
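
The round trip through ISO-8859-1 dates from PHP 4 and loses every character that Latin-1 cannot represent. On PHP 5 and later, where html_entity_decode() accepts multibyte charsets, passing 'UTF-8' directly should be enough. A short sketch; the sample string is made up for illustration.

```php
<?php
// Named, decimal and hexadecimal references decoded straight to UTF-8.
$var = 'Price: 5&nbsp;&euro; &ndash; &#8364; &#x20AC;';
$decoded = html_entity_decode($var, ENT_QUOTES, 'UTF-8');

// Show the raw bytes so the result does not depend on the terminal encoding.
echo bin2hex($decoded), "\n";
```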

grvg

Here are the ultimate functions to convert HTML entities to UTF-8. The main function is htmlentities2utf8; the others are helper functions.

function chr_utf8($code)
{
    if ($code < 0) return false;
    elseif ($code < 128) return chr($code);
    elseif ($code < 160) // Remove Windows illegal chars
    {
        if ($code==128) $code=8364;
        elseif ($code==129) $code=160; // not affected
        elseif ($code==130) $code=8218;
        elseif ($code==131) $code=402;
        elseif ($code==132) $code=8222;
        elseif ($code==133) $code=8230;
        elseif ($code==134) $code=8224;
        elseif ($code==135) $code=8225;
        elseif ($code==136) $code=710;
        elseif ($code==137) $code=8240;
        elseif ($code==138) $code=352;
        elseif ($code==139) $code=8249;
        elseif ($code==140) $code=338;
        elseif ($code==141) $code=160; // not affected
        elseif ($code==142) $code=381;
        elseif ($code==143) $code=160; // not affected
        elseif ($code==144) $code=160; // not affected
        elseif ($code==145) $code=8216;
        elseif ($code==146) $code=8217;
        elseif ($code==147) $code=8220;
        elseif ($code==148) $code=8221;
        elseif ($code==149) $code=8226;
        elseif ($code==150) $code=8211;
        elseif ($code==151) $code=8212;
        elseif ($code==152) $code=732;
        elseif ($code==153) $code=8482;
        elseif ($code==154) $code=353;
        elseif ($code==155) $code=8250;
        elseif ($code==156) $code=339;
        elseif ($code==157) $code=160; // not affected
        elseif ($code==158) $code=382;
        elseif ($code==159) $code=376;
    }

    if ($code < 2048) return chr(192 | ($code >> 6)) . chr(128 | ($code & 63));
    elseif ($code < 65536) return chr(224 | ($code >> 12)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
    else return chr(240 | ($code >> 18)) . chr(128 | (($code >> 12) & 63)) . chr(128 | (($code >> 6) & 63)) . chr(128 | ($code & 63));
}

// Callback for preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $str);
function html_entity_replace($matches)
{
    if ($matches[2]) {
        return chr_utf8(hexdec($matches[3]));
    } elseif ($matches[1]) {
        return chr_utf8($matches[3]);
    }
    switch ($matches[3]) {
        case "nbsp": return chr_utf8(160);
        case "iexcl": return chr_utf8(161);
        case "cent": return chr_utf8(162);
        case "pound": return chr_utf8(163);
        case "curren": return chr_utf8(164);
        case "yen": return chr_utf8(165);
        //... etc with all named HTML entities
    }
    return false;
}

function htmlentities2utf8($string) // because of the html_entity_decode() bug with UTF-8
{
    $string = preg_replace_callback('~&(#(x?))?([^;]+);~', 'html_entity_replace', $string);
    return $string;
}

hayley watson

Fabian's observation that chr(039) returns "a heart character" is explained by the fact that numeric literals that start with '0' are interpreted in base 8, which doesn't have a digit '9'. So 039==3 and hence chr(039) is equivalent to chr(3), NOT chr(39).
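
A small sketch of the difference between an octal literal in source code and digits captured as a string. Note that on PHP 7 and later writing 039 literally is rejected as an invalid numeric literal rather than silently truncated, so the sketch uses a valid octal literal instead.

```php
<?php
// A leading 0 marks an octal literal: 047 (octal) is 39 (decimal).
$fromOctalLiteral = chr(047);

// Digits captured by a regular expression arrive as a string; casting to int
// parses them as decimal, so the leading zero is simply ignored.
$fromCapturedDigits = chr((int) '039');

var_dump($fromOctalLiteral, $fromCapturedDigits);
```

Casting the captured digits this way is an alternative to stripping the zeros in the pattern with 0*.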

derernst

Combining the suggestions by buraks78 at gmail dot com, gaui at gaui dot is, daniel at brightbyte dot de, and the version in PEAR_PHP_Compat, I arrived at the following, which should work in a UTF-8 environment, with PHP < or > 4.3:

<?php
function decode_entities($text, $quote_style = ENT_COMPAT) {
    if (function_exists('html_entity_decode')) {
        $text = html_entity_decode($text, $quote_style, 'ISO-8859-1'); // NOTE: UTF-8 does not work!
    } else {
        $trans_tbl = get_html_translation_table(HTML_ENTITIES, $quote_style);
        $trans_tbl = array_flip($trans_tbl);
        $text = strtr($text, $trans_tbl);
    }
    $text = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $text);
    $text = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $text);
    return $text;
}
?>

Note that I omitted the line $trans_table['&#39;'] = "'"; as it would override the quote_style setting and thus lead to unexpected results for quote_styles ENT_NOQUOTES and ENT_COMPAT.

matt robinson

Bafflingly, html_entity_decode() only converts the 100 most common named entities, whereas the HTML 4.01 Recommendation lists over 250. This wrapper function converts all known named entities to numeric ones before handing over to the original html_entity_decode, and hopefully isn't too insufferably slow (am I right in thinking that making the conversion table static will prevent it being reinitialised on each call?) Unfortunately it's just a little too long for this documentation. You can see the code at http://www.lazycat.org/software/html_entity_decode_full.phps
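
On the question about static: a static local variable is initialised once and keeps its value between calls, so a table built inside the function is not rebuilt each time. A minimal sketch with a made-up two-entry table; the function name entity_table and the $builds counter are hypothetical and exist only to make the behaviour visible.

```php
<?php
// Hypothetical stand-in for a large entity table; $builds counts how often it is built.
function entity_table(&$builds)
{
    static $table = null;
    if ($table === null) {
        $builds++;
        // named entity => numeric entity
        $table = array('&hearts;' => '&#9829;', '&larr;' => '&#8592;');
    }
    return $table;
}

$builds = 0;
entity_table($builds);
entity_table($builds);
$table = entity_table($builds);

echo $builds, "\n"; // number of times the table was actually built
```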

fabian

Actually I am not sure about the regex replacements from numeric entities back. If you give &#39; to a browser, it turns into a single quote, and &#039; will also turn into a single quote. But if I do

<?php chr(039); ?>

I get not a single quote but a heart character (haven't seen it since DOS days :)). However,

<?php chr(39); ?>

gives the correct result. This makes the correct preg something like this:

<?php
$string = preg_replace('~&#x0*([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#0*([0-9]+);~e', 'chr(\\1)', $string);
?>

The reason is also already found on the preg_replace manual page: http://de.php.net/manual/en/function.preg-replace.php#69478 (039 is interpreted as octal).

inco

@ romekt: if iconv is not available, you can alternatively use utf8_decode and utf8_encode to solve the UTF-8 / ISO-8859-1 problem.
