HTML-Encoding UTF-8 Characters
It happens sometimes that a web page isn’t using UTF-8, but there’s a need to display UTF-8 data. Thankfully HTML offers numeric character references, an encoding that allows displaying any arbitrary Unicode character (of course, only if the font supports the character, but that’s another topic). Sadly, though, there don’t seem to be any quick helpers that do this encoding. The Apache Commons Lang StringEscapeUtils will translate double-quotes into &quot; and ampersands into &amp; and other common entities, but it doesn’t seem to touch the rest of the non-ASCII characters.
For example, there may be a need to offer a language drop-down, and the decided-upon best way to do that is to offer each language in that language. Then rather than seeing “Japanese” in English, the user would see 日本語 and can recognize their desired language. While it displays in the browser as “Japanese” in Japanese, in the HTML source it’s presented as the encoded string &#x65e5;&#x672c;&#x8a9e;.
If the HTML page isn’t being delivered in UTF-8, and the font used has the Unicode characters, the HTML-encoded string will display properly. Even if the page is delivered in UTF-8, the encoded characters will still be displayed, so it’s a nice safety net. Plus it allows storage of these characters in databases, on file systems, or in file types that don’t support UTF-8 (since technically it’s all ASCII when encoded). Of course, the HTML-encoding is really only useful if the end target is HTML, but that covers any file that ends up being served as HTML, such as, well, HTML files.
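To make the "it's all ASCII when encoded" point concrete, here is a minimal sketch that asks the JDK's US-ASCII encoder whether a string fits. The class and method names (AsciiCheck, isPureAscii) are made up for illustration and are not part of the original snippet.

```java
import java.nio.charset.StandardCharsets;

final class AsciiCheck {
    // True when every character of the text can be written as plain US-ASCII.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String raw = "\u65e5\u672c\u8a9e";            // the three characters of "Japanese" in Japanese
        String encoded = "&#x65e5;&#x672c;&#x8a9e;";  // the same text as numeric references
        System.out.println(isPureAscii(raw));      // false
        System.out.println(isPureAscii(encoded));  // true
    }
}
```

A check like this could run before writing to a store that is known to mangle non-ASCII text.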
Since all strings in Java are Unicode (UTF-16 internally, not UTF-8), it’s easy to forget that a string may have characters that aren’t going to be displayed correctly once it reaches the browser. This little snippet will close that gap. It can be used to encode strings going to a database or file, too. There’s no corresponding decode mechanism, but it’s pretty simple to pull apart the ampersand-octothorpe-number-semicolon strings to return to the original characters; plus, curiously, these strings usually seem to be decoded when received by a Servlet, if that’s where the application is working.
import org.apache.commons.lang3.CharUtils;
import org.apache.commons.lang3.StringEscapeUtils;

/**
 * Takes a string and encodes non-ASCII characters as
 * ampersand-octothorpe-digits-semicolon
 * HTML-encoded characters (hexadecimal form, e.g. &#x65e5;)
 *
 * @param string the text to encode
 * @return HTML-encoded String
 */
private String htmlEncode(final String string) {
    final StringBuilder stringBuilder = new StringBuilder();
    // Note: charAt() steps through UTF-16 code units, so characters
    // stored as surrogate pairs are not handled correctly here.
    for (int i = 0; i < string.length(); i++) {
        final char character = string.charAt(i);
        if (CharUtils.isAscii(character)) {
            // Encode common HTML equivalent characters
            stringBuilder.append(
                StringEscapeUtils.escapeHtml4(String.valueOf(character)));
        } else {
            // Why isn't this done in escapeHtml4()?
            stringBuilder.append(
                String.format("&#x%x;", Character.codePointAt(string, i)));
        }
    }
    return stringBuilder.toString();
}
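The post mentions that there is no matching decoder but that pulling the references apart is simple. A minimal sketch of that idea, using only the JDK, might look like the following. The class name NumericReferenceDecoder is hypothetical, and the sketch assumes only the hexadecimal &#x...; form that the snippet above produces; decimal references and named entities such as &amp; are deliberately left alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericReferenceDecoder {
    // Matches hexadecimal references such as &#x65e5; (one to six hex digits).
    private static final Pattern HEX_REFERENCE =
            Pattern.compile("&#x([0-9a-fA-F]{1,6});");

    static String decode(String encoded) {
        Matcher matcher = HEX_REFERENCE.matcher(encoded);
        StringBuilder decoded = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the untouched text between the previous match and this one.
            decoded.append(encoded, last, matcher.start());
            int codePoint = Integer.parseInt(matcher.group(1), 16);
            if (Character.isValidCodePoint(codePoint)) {
                decoded.appendCodePoint(codePoint);
            } else {
                // Out-of-range value: keep the reference exactly as it was.
                decoded.append(matcher.group());
            }
            last = matcher.end();
        }
        decoded.append(encoded, last, encoded.length());
        return decoded.toString();
    }
}
```

Calling NumericReferenceDecoder.decode("&#x65e5;&#x672c;&#x8a9e;") gives back 日本語. Because appendCodePoint writes a surrogate pair when needed, references above U+FFFF are restored as well.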
Jeff, I spent all afternoon looking for code to do exactly this. Thanks very much.
Solved my problem, thanks!
Thanks. It helped me a lot
I just wrote the Scala version of this to answer my own StackOverflow question on how to do this inside Play — http://stackoverflow.com/questions/31417718/scala-play-2-4-x-handling-extended-characters-through-anorm-mysql-to-java-mail/31438899#31438899
Great post and thank you!
Unfortunately, this code is broken for surrogate pairs, e.g. emojis such as smiley face (U+1F600). See this answer for handling them correctly: http://stackoverflow.com/a/37040891/305973
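The surrogate-pair problem in the comment above comes from walking the string one char at a time: at the high surrogate, Character.codePointAt() returns the full code point, but the loop then visits the low surrogate too and emits a second, bogus reference. A minimal sketch of a code-point-based loop follows. It is not the code from the linked answer; the class name CodePointHtmlEncoder is hypothetical, and to stay dependency-free it escapes the four ASCII markup characters by hand instead of calling escapeHtml4, which is an assumption of this sketch.

```java
final class CodePointHtmlEncoder {
    static String htmlEncode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read a whole code point, then advance by one or two chars.
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint > 0x7F) {
                // Non-ASCII: emit a hexadecimal numeric character reference.
                out.append("&#x").append(Integer.toHexString(codePoint)).append(';');
            } else {
                // ASCII: escape the markup characters, copy everything else.
                switch (codePoint) {
                    case '&': out.append("&amp;"); break;
                    case '<': out.append("&lt;"); break;
                    case '>': out.append("&gt;"); break;
                    case '"': out.append("&quot;"); break;
                    default:  out.append((char) codePoint);
                }
            }
        }
        return out.toString();
    }
}
```

With this loop, a string holding U+1F600 (stored as the pair \uD83D\uDE00) becomes the single reference &#x1f600;. An unpaired surrogate would still be emitted as its own reference, so input that may contain malformed UTF-16 would need extra handling.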