Apr 24, 2013

HTML-Encoding UTF-8 Characters

Sometimes a web page isn’t served as UTF-8, yet it still needs to display characters that its own encoding can’t represent. Thankfully HTML offers numeric character references, which allow any Unicode character to be written in plain ASCII (provided the font supports the character, but that’s another topic). Sadly, though, there don’t seem to be any quick helpers for the encoding, not even in the Apache Commons Lang StringEscapeUtils. StringEscapeUtils will translate double-quotes into &quot; and ampersands into &amp; and other characters that have named entities, but it doesn’t seem to touch the rest of the Unicode range.

For example, there may be a need to offer a language drop-down, and the decided-upon best way to do that is to offer each language in that language. Then rather than seeing “Japanese” in English, the user sees 日本語 and can recognize their desired language. While it displays in the browser as “Japanese” in Japanese, in the HTML source it’s written as the encoded string &#x65e5;&#x672c;&#x8a9e;.
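
As a rough illustration of where that encoded string comes from, the sketch below turns each code point of 日本語 into a hexadecimal numeric character reference using only the JDK. The class and method names (NcrExample, toReferences) are made up for this example, and the string literal is written with Unicode escapes only so the source file stays ASCII.

```java
import java.util.stream.Collectors;

// Hypothetical demo class; the names are not from any library.
class NcrExample {

  // Writes every code point as &#x<hex>; whether or not it is ASCII.
  static String toReferences(final String text) {
    return text.codePoints()
        .mapToObj(cp -> String.format("&#x%x;", cp))
        .collect(Collectors.joining());
  }

  public static void main(final String[] args) {
    // The three escapes below spell the Japanese word for "Japanese".
    System.out.println(toReferences("\u65e5\u672c\u8a9e"));
    // prints &#x65e5;&#x672c;&#x8a9e;
  }
}
```

Unlike the snippet later in this post, this sketch encodes ASCII characters too, which is valid HTML but makes the output harder to read.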

Even if the HTML page isn’t delivered in UTF-8, the HTML-encoded string displays properly as long as the font has the glyphs. If the page is delivered in UTF-8, the encoded characters still display, so it’s a nice safety net. Plus it allows storing Unicode text in databases, on file systems, or in file types that don’t support UTF-8 (since technically it’s all ASCII once encoded). Of course, the HTML encoding is really only useful if the end target is HTML, but that is often the case for files that exist to serve HTML, like, well, HTML files.
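
To make the “it’s all ASCII” point concrete, here is a small sketch using only the JDK. It checks whether a string fits in US-ASCII and simulates what happens to raw versus encoded text when it is forced through ISO-8859-1, standing in for a store that doesn’t support UTF-8. The class name AsciiSafetyExample and both helper names are invented for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are not from any library.
class AsciiSafetyExample {

  // True when every character of the text fits in US-ASCII.
  static boolean isPureAscii(final String text) {
    return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
  }

  // Simulates writing text to a Latin-1-only store and reading it back.
  // getBytes() silently replaces unmappable characters with '?'.
  static String roundTripLatin1(final String text) {
    final byte[] stored = text.getBytes(StandardCharsets.ISO_8859_1);
    return new String(stored, StandardCharsets.ISO_8859_1);
  }

  public static void main(final String[] args) {
    final String raw = "\u65e5\u672c\u8a9e";
    final String encoded = "&#x65e5;&#x672c;&#x8a9e;";
    System.out.println(isPureAscii(raw));       // prints false
    System.out.println(isPureAscii(encoded));   // prints true
    System.out.println(roundTripLatin1(raw));   // prints ???
    System.out.println(roundTripLatin1(encoded).equals(encoded)); // prints true
  }
}
```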

Since all strings in Java are Unicode internally (UTF-16, to be precise), it’s easy to forget that a string may have characters that aren’t going to be displayed correctly once it reaches the browser in some other encoding. This little snippet will correct that gap. It can be used to encode strings going to a database or file, too. The snippet has no corresponding decoder, but it’s pretty simple to pull apart the ampersand-octothorpe-number-semicolon strings to get the original characters back, and StringEscapeUtils.unescapeHtml4() understands numeric references as well; plus, curiously, these strings usually seem to arrive already decoded when received by a Servlet, if that’s where the application is working.

import org.apache.commons.lang3.StringEscapeUtils;

/**
 * Encodes every non-ASCII code point as a hexadecimal
 * ampersand-octothorpe-x-digits-semicolon reference and
 * escapes the ASCII characters that are special in HTML.
 *
 * @param string the text to encode
 * @return HTML-encoded String containing only ASCII
 */
private String htmlEncode(final String string) {
  final StringBuilder builder = new StringBuilder();
  // Step by code point, not by char, so that a surrogate pair
  // (an emoji, for example) becomes one reference, not two.
  for (int i = 0; i < string.length(); ) {
    final int codePoint = string.codePointAt(i);
    if (codePoint < 0x80) {
      // Encode common HTML equivalent characters
      builder.append(
          StringEscapeUtils.escapeHtml4(
              String.valueOf((char) codePoint)));
    } else {
      // Why isn't this done in escapeHtml4()?
      builder.append(String.format("&#x%x;", codePoint));
    }
    // Advance by two chars when the code point was a pair
    i += Character.charCount(codePoint);
  }
  return builder.toString();
}
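
The post notes that pulling the references apart again is simple. A minimal sketch of one way to do that with a regular expression and only the JDK follows. The class name NcrDecoder, its decode method, and the digit limits in the pattern are assumptions made for this example, not part of the original snippet or of Commons Lang.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder; handles numeric references only.
class NcrDecoder {

  // Matches &#xHEX; or &#DECIMAL;. The digit counts are bounded
  // so that parsing can never overflow an int.
  private static final Pattern REFERENCE =
      Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

  static String decode(final String html) {
    final Matcher matcher = REFERENCE.matcher(html);
    final StringBuffer out = new StringBuffer();
    while (matcher.find()) {
      final int codePoint = matcher.group(1) != null
          ? Integer.parseInt(matcher.group(1), 16)
          : Integer.parseInt(matcher.group(2), 10);
      // Leave anything outside the Unicode range untouched.
      final String replacement = Character.isValidCodePoint(codePoint)
          ? new String(Character.toChars(codePoint))
          : matcher.group();
      // quoteReplacement keeps '$' and '\' in the output literal.
      matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(out);
    return out.toString();
  }
}
```

Named entities such as &amp; are deliberately left alone here, so text that went through htmlEncode() would still need those unescaped separately.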

Thoughts on “HTML-Encoding UTF-8 Characters”

  1. Michael says:

    Jeff, I spent all afternoon looking for code to do exactly this. Thanks very much.

  2. Nick says:

    Solved my problem, thanks!

  3. Vishnu says:

    Thanks. It helped me a lot

  4. Gary Hewett says:

    I just wrote the Scala version of this to answer my own StackOverflow question on how to do this inside Play — http://stackoverflow.com/questions/31417718/scala-play-2-4-x-handling-extended-characters-through-anorm-mysql-to-java-mail/31438899#31438899

    Great post and thank you!

  5. Robin says:

    Unfortunately, this code is broken for surrogate pairs, e.g. emojis such as smiley face (U+1F600). See this answer for handling them correctly: http://stackoverflow.com/a/37040891/305973
