Feb 14, 2012

Manipulating HTML with Java and jsoup

Have you ever needed to manipulate HTML in your Java code? Maybe you are working with HTML fragments that need decorating, you need to clean up possibly malformed markup, or you have some screen scraping to do. A handy little library named jsoup is just what you need.

It’s easy to set up. Just download the jar from the jsoup download area and include it in your classpath. There is even a Maven artifact if you are into that sort of thing. jsoup has no dependencies other than Java 5 or higher.

Let’s start with a basic example. Let’s read in an HTML fragment:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Note: the second <p> has no closing tag.
String fragment =
    "<div id='div1'>" +
    "<p id='para1'>This is the first paragraph</p>" +
    "<p id='para2'>Second paragraph here!" +
    "</div>";
Document doc = Jsoup.parseBodyFragment(fragment);
System.out.println(doc.toString());

The output from this is:
<html>
<head></head>
<body>
<div id="div1">
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
</div>
</body>
</html>

The first thing you’ll notice is that jsoup wraps your fragment in the html, head, and body tags needed to make a complete HTML document. This can be a help or a hindrance, depending on what you are doing. (You can also read in a complete HTML document using Jsoup.parse().) Notice, too, that the closing p tag missing from the second paragraph in the source HTML has been added in the output. Jsoup does its best to clean up invalid HTML to make it valid. If you want to get back to your (now valid) fragment without the added html, head, and body tags, you can do this:

doc.body().children().toString();
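
To make the difference between the two entry points concrete, here is a minimal sketch. The class name ParseSketch and its helper methods are made up for illustration; only Jsoup.parse(), Jsoup.parseBodyFragment(), title(), and body().html() are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class ParseSketch {

    // Parse a complete HTML document and read its <title>.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Parse a fragment and return only the repaired markup inside <body>.
    static String repairedFragment(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(
            "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"));
        System.out.println(repairedFragment("<p>Unclosed paragraph"));
    }
}
```

body().html() returns the inner HTML as a String, whereas body().children() hands back an Elements collection you can keep working with.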

Now you probably want to manipulate the document a little. Say you want to add in a third paragraph. Using the same HTML fragment as above, add in a paragraph like this:

doc.select("p").last().after("<p id='para3'>Third paragraph I just added</p>");

Output:
<div id="div1">
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
<p id="para3">Third paragraph I just added</p>
</div>

Does that look familiar? Hint:

$("p").last().after("<p id='para3'>Third paragraph I just added</p>");

If you are used to jQuery, jsoup should be an easy transition. Many of the same methods and selectors are available in jsoup. You can select on id, tag name (e.g. “p” or “div”), class name, or elements with specific attributes. Just as in jQuery, you can retrieve children, siblings, and parents, insert and remove elements, and get values of elements or attributes. For example, to select an element by id:

System.out.println(doc.select("#para1").toString());

This will get you what you’d expect:

<p id="para1">This is the first paragraph</p>

Similarly, to find all p elements:

Elements elements = doc.select("p");
System.out.println(elements.toString());

Output:

<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
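
The other selector types and traversal methods mentioned above work along the same lines. Here is a rough sketch; the fragment, its class names, the example.com link, and the SelectorSketch class are all invented for this illustration and are not part of the earlier example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class SelectorSketch {

    // Made-up fragment with a class name and a link to select on.
    static final String HTML =
        "<div id='div1' class='box'>" +
        "<p id='para1' class='intro'>This is the first paragraph</p>" +
        "<p id='para2'>Second <a href='http://example.com/'>link</a> here!</p>" +
        "</div>";

    static Document parse() {
        return Jsoup.parseBodyFragment(HTML);
    }

    // Select by tag and class name, then read the text content.
    static String introText() {
        return parse().select("p.intro").text();
    }

    // Select elements that carry an href attribute and read its value.
    static String linkHref() {
        return parse().select("a[href]").attr("href");
    }

    // Walk up to the parent element.
    static String parentId() {
        Element para = parse().select("#para1").first();
        return para.parent().id();
    }

    // Walk sideways to the next sibling element.
    static String nextSiblingId() {
        Element para = parse().select("#para1").first();
        return para.nextElementSibling().id();
    }

    // Count the p elements that are direct children of a div.
    static int paragraphCount() {
        return parse().select("div > p").size();
    }

    public static void main(String[] args) {
        System.out.println(introText());
        System.out.println(linkHref());
        System.out.println(parentId());
        System.out.println(nextSiblingId());
        System.out.println(paragraphCount());
    }
}
```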

To remove an element:

// remove() detaches the matched elements from the document and returns them
Elements removed = doc.select("#para1").remove();
System.out.println(doc.body().children().toString());
System.out.println("---------------------------------");
System.out.println(removed.toString());

Output:
<div id="div1">
<p id="para2">Second paragraph here!</p>
</div>
---------------------------------
<p id="para1">This is the first paragraph</p>

The removed elements are returned in an Elements object, but no longer exist in the Document.
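
Inserting and removing are not the only edits available. As a small sketch of changing an element in place (the fragment and the EditSketch class are invented for this example), you can replace a paragraph's text, set an attribute, add a class, and append a child element:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class EditSketch {

    static Document edited() {
        Document doc = Jsoup.parseBodyFragment(
            "<div id='div1'><p id='para1'>hello world</p></div>");

        Element para = doc.select("#para1").first();
        para.text("hi sam");              // replace the paragraph's text
        para.attr("title", "greeting");   // set (or overwrite) an attribute
        para.addClass("updated");         // add a CSS class

        // Append a new child element at the end of the div.
        doc.select("#div1").first().append("<p id='para2'>Appended paragraph</p>");
        return doc;
    }

    static String para1Text() {
        return edited().select("#para1").text();
    }

    static String para1Title() {
        return edited().select("#para1").attr("title");
    }

    static boolean para1Updated() {
        return edited().select("#para1").hasClass("updated");
    }

    static int paragraphCount() {
        return edited().select("#div1 > p").size();
    }

    public static void main(String[] args) {
        System.out.println(edited().body().children());
    }
}
```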

A powerful feature of jsoup is its ability to scrub HTML. You may be accepting HTML from users on your website, but you don’t want them injecting potentially harmful tags or code. The clean() method on the Jsoup class takes a Whitelist as one of its parameters. Jsoup comes with several Whitelists, and you can create your own if you need something customized. Here’s an example of cleaning the example HTML from above with the “basic” Whitelist:

System.out.println(Jsoup.clean(fragment, Whitelist.basic()));

Output:

<p>This is the first paragraph</p>
<p>Second paragraph here!</p>

Notice the missing <div> tags. The basic Whitelist does not allow <div> tags, and as the output shows, it stripped the id attributes from the p tags as well. The built-in Whitelists range from allowing no tags (only text) to a pretty wide variety of tags. You can even limit protocols (e.g. http and ftp) and allowed attributes on specific tags. Here is a custom Whitelist that allows div and p tags:

Whitelist myWhitelist = new Whitelist();
myWhitelist.addTags("div", "p");
myWhitelist.addAttributes("div", "class");
myWhitelist.addAttributes("p", "id");
System.out.println(Jsoup.clean(fragment, myWhitelist));

Output:

<div>
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
</div>

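Notice that the id attribute is missing from the div tag: the custom Whitelist allows only the class attribute on div, while the p tags keep their id.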

Other nice features of jsoup include reading directly from a URL (Jsoup.connect(url)), testing a string of HTML against a Whitelist to check for validity, CSS selectors, and more.
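
The validity check can be sketched roughly as follows. One assumption to flag loudly: as far as I know, jsoup 1.14.2 renamed Whitelist to Safelist and later releases dropped the old name, so this sketch uses org.jsoup.safety.Safelist. On the older versions this post was written against, substitute Whitelist. The ValiditySketch class and its method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist; // assumption: jsoup 1.14.2 or newer; use Whitelist on older versions

// Hypothetical helper class, for illustration only.
final class ValiditySketch {

    // True only if the HTML uses nothing beyond what the basic list allows.
    static boolean isAcceptable(String html) {
        return Jsoup.isValid(html, Safelist.basic());
    }

    // Strip anything the basic list does not allow.
    static String scrub(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<p>Hello</p><script>alert(1)</script>";
        if (!isAcceptable(untrusted)) {
            // Reject the input outright, or fall back to the cleaned version.
            System.out.println(scrub(untrusted));
        }
    }
}
```

Whether to reject invalid input or silently clean it is a design choice: rejecting tells users their markup was refused, while cleaning keeps whatever content is safe.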

If you need to manipulate HTML in your Java code, you need jsoup!

About the Author

Brendon Anderson

Sr. Consultant

Brendon has over 15 years of software development experience at organizations large and small.  He craves learning new technologies and techniques and lives in and understands large enterprise application environments with complex software and hardware architectures.

Two thoughts on “Manipulating HTML with Java and jsoup”

  1. Diva says:

    you can find some more details in the below link,”http://javadomain.in/parse-div-using-jsoup-in-java/”

  2. Daruka Roshan RajKumar says:

    how to replace a p tag’s text like..
    hello world
    to
    hi sam
