Apr 24, 2013

HTML-Encoding UTF-8 Characters

It happens sometimes that a web page isn’t using UTF-8, but there’s a need to display characters that the page’s encoding can’t represent. Thankfully HTML offers numeric character references, which allow displaying any arbitrary Unicode character (of course, if the font supports the character, but that’s another topic). Sadly, though, there aren’t any quick helpers, like the Apache Commons Lang StringEscapeUtils, to do the encoding. StringEscapeUtils will translate double-quotes into &quot; and ampersands into &amp; and other common entities, but it doesn’t seem to touch characters that have no named entity, such as Japanese text.

For example, there may be a need to offer a language drop-down, and the decided-upon best way to do that is to offer each language in that language. Then rather than seeing “Japanese” in English, the user would see 日本語 and can recognize their desired language. While it displays in the browser as “Japanese” in Japanese, in the HTML source it’s presented as the encoded string &#x65e5;&#x672c;&#x8a9e;.
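As a rough illustration of the drop-down case, here is a minimal sketch that uses only the JDK. The class and method names (LanguageOptionExample, toNumericEntities, option) are made up for this example, and it assumes the language name contains no markup characters such as < or &, since it leaves ASCII untouched.

```java
class LanguageOptionExample {

    // Replaces every non-ASCII code point with a hexadecimal
    // numeric character reference; ASCII passes through unchanged.
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    // Builds one <option> element for a hypothetical language drop-down.
    static String option(String code, String nativeName) {
        return "<option value=\"" + code + "\">"
                + toNumericEntities(nativeName) + "</option>";
    }

    public static void main(String[] args) {
        // \u65e5\u672c\u8a9e is "Japanese" written in Japanese; the
        // escapes keep this source file itself pure ASCII.
        System.out.println(option("ja", "\u65e5\u672c\u8a9e"));
    }
}
```

The browser renders the resulting option as 日本語 regardless of the charset the page is served in.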

If the HTML page isn’t being delivered in UTF-8, and the font used has the Unicode characters, the HTML-encoded string will display properly. Even if the page is delivered in UTF-8, the encoded characters will still be displayed, so it’s a nice safety net. Plus it allows storage of these characters in databases, on file systems, or in file types that don’t support UTF-8 (since technically it’s all ASCII when encoded). Of course, the HTML-encoding is really only useful if the end-target is HTML, but that is often exactly the case, for example when the stored files are themselves HTML files.
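The “it’s all ASCII when encoded” point can be checked with a short sketch. The class and method names here are hypothetical; the only API used is String.getBytes with StandardCharsets.US_ASCII, which substitutes a question mark for anything it cannot represent.

```java
import java.nio.charset.StandardCharsets;

class AsciiRoundTripExample {

    // Pushes a string through an ASCII-only byte representation,
    // the way an ASCII-only file or database column would store it.
    static String throughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String raw = "\u65e5\u672c\u8a9e";             // the raw characters
        String encoded = "&#x65e5;&#x672c;&#x8a9e;";   // the HTML-encoded form

        // The raw characters are lost; each one becomes '?'.
        System.out.println(throughAscii(raw));
        // The encoded form survives unchanged.
        System.out.println(throughAscii(encoded));
    }
}
```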

Since all strings in Java are Unicode (UTF-16 internally), it’s easy to forget that a string may have characters that aren’t going to be displayed correctly once it reaches the browser. This little snippet will correct that gap. It can be used to encode strings going to a database or file, too. There’s no corresponding decode mechanism, but it’s pretty simple to pull apart the ampersand-octothorpe-number-semicolon strings to return to the original characters; plus, curiously, these strings are usually decoded back into Unicode by the time a Servlet receives them, if that’s where the application is working.

// Assumes Apache Commons Lang 3 on the classpath
import org.apache.commons.lang3.StringEscapeUtils;

/**
* Takes a string and encodes non-ASCII characters as
* ampersand-octothorpe-x-hex-digits-semicolon
* HTML-encoded characters
*
* @param string the text to encode
* @return HTML-encoded String
*/
private String htmlEncode(final String string) {
  final StringBuilder stringBuilder = new StringBuilder();
  // Step by code point rather than by char so that surrogate
  // pairs (for example U+1F600) are encoded as one character
  for (int i = 0; i < string.length(); ) {
    final int codePoint = string.codePointAt(i);
    if (codePoint < 128) {
      // Encode common HTML equivalent characters
      stringBuilder.append(
          StringEscapeUtils.escapeHtml4(
              String.valueOf((char) codePoint)));
    } else {
      // Why isn't this done in escapeHtml4()?
      stringBuilder.append(
          String.format("&#x%x;", codePoint));
    }
    // Advance by one or two chars, depending on the code point
    i += Character.charCount(codePoint);
  }
  return stringBuilder.toString();
}
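The post mentions that there is no corresponding decode mechanism. One possible way to pull the numeric references apart again is sketched below, using only the JDK. The class name, the regular expression, and the decision to leave named entities such as &amp; and any out-of-range numbers untouched are all assumptions of this sketch, not part of the original snippet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumericEntityDecoder {

    // Matches hexadecimal (&#x65e5;) and decimal (&#26085;) references.
    private static final Pattern ENTITY = Pattern.compile(
            "&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String html) {
        Matcher matcher = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Copy the text between the previous match and this one.
            out.append(html, last, matcher.start());
            int codePoint = matcher.group(1) != null
                    ? Integer.parseInt(matcher.group(1), 16)
                    : Integer.parseInt(matcher.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                // appendCodePoint writes a surrogate pair when needed.
                out.appendCodePoint(codePoint);
            } else {
                // Not a Unicode code point: keep the text as written.
                out.append(matcher.group());
            }
            last = matcher.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#x65e5;&#x672c;&#x8a9e;"));
    }
}
```

Because it works on code points, the decoder also restores characters outside the Basic Multilingual Plane, such as the U+1F600 emoji, as a proper surrogate pair.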

Thoughts on “HTML-Encoding UTF-8 Characters”

  1. Michael says:

    Jeff, I spent all afternoon looking for code to do exactly this. Thanks very much.

  2. Nick says:

    Solved my problem, thanks!

  3. Vishnu says:

    Thanks. It helped me a lot

  4. Gary Hewett says:

    I just wrote the Scala version of this to answer my own StackOverflow question on how to do this inside Play — http://stackoverflow.com/questions/31417718/scala-play-2-4-x-handling-extended-characters-through-anorm-mysql-to-java-mail/31438899#31438899

    Great post and thank you!

  5. Robin says:

    Unfortunately, this code is broken for surrogate pairs, e.g. emojis such as smiley face (U+1F600). See this answer for handling them correctly: http://stackoverflow.com/a/37040891/305973
