Manipulating HTML with Java and jsoup

Have you ever needed to manipulate HTML in your Java code? Maybe you have HTML fragments that need decorating, markup with possibly bad syntax that needs cleaning up, or a page you want to screen-scrape. A handy little library named jsoup is just what you need.

It’s easy to set up. Just download the jar from the jsoup download area and include it in your classpath. There is also a Maven artifact if you prefer to manage dependencies that way. jsoup has no other dependencies. At the time of writing it needed only Java 5 or higher; newer releases require a more recent Java version.

Let’s start with a basic example that reads in an HTML fragment:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Note that the second <p> is never closed.
String fragment =
    "<div id='div1'>" +
    "<p id='para1'>This is the first paragraph</p>" +
    "<p id='para2'>Second paragraph here!" +
    "</div>";
Document doc = Jsoup.parseBodyFragment(fragment);
System.out.println(doc.toString());

The output from this is:

<html>
<head></head>
<body>
<div id="div1">
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
</div>
</body>
</html>

The first thing you’ll notice is that jsoup wraps your fragment in the html, head, and body tags needed to make a complete HTML document. This can be a help or a hindrance, depending on what you need. (To read in a complete HTML document, use Jsoup.parse() instead.) Notice also that the closing p tag missing from the source HTML has been added to the output. jsoup does its best to repair invalid HTML. If you want your (now valid) fragment back without the added html, head, and body tags, you can do this:

doc.body().children().toString();
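For reference, here is a minimal, self-contained sketch of the parse-and-unwrap steps above. The class name FragmentDemo and the helper bodyOnly are made up for illustration. It uses doc.body().html(), which returns the inner HTML of the body and should be close to what the children() call above prints.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; only Jsoup.parseBodyFragment and Element.html are jsoup API.
class FragmentDemo {
    // Same fragment as in the article; the second <p> is left unclosed.
    static final String FRAGMENT =
        "<div id='div1'>" +
        "<p id='para1'>This is the first paragraph</p>" +
        "<p id='para2'>Second paragraph here!" +
        "</div>";

    // Parses the fragment and returns only the markup inside <body>.
    static String bodyOnly(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Prints the repaired fragment without the html/head/body wrapper.
        System.out.println(bodyOnly(FRAGMENT));
    }
}
```

The exact whitespace of the printed markup depends on jsoup's pretty-printing, so it is safer to compare parsed documents than raw strings.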

Now you probably want to manipulate the document a little. Say you want to add a third paragraph. Using the same HTML fragment as above, add one after the last existing paragraph like this:

doc.select("p").last().after("<p id='para3'>Third paragraph I just added</p>");

Output:

<div id="div1">
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
<p id="para3">Third paragraph I just added</p>
</div>

Does that look familiar? Hint:

$("p").last().after("<p id='para3'>Third paragraph I just added</p>");

If you are used to jQuery, jsoup should be an easy transition. Many of the same methods and selectors are available in jsoup. You can select on id, tag name (e.g. “p” or “div”), class name, or elements with specific attributes. Just as in jQuery, you can retrieve children, siblings, and parents, insert and remove elements, and get values of elements or attributes. For example, to select by id:

System.out.println(doc.select("#para1").toString());

This will get you what you’d expect:

<p id="para1">This is the first paragraph</p>

Similarly, to find all p elements:

Elements elements = doc.select("p");
System.out.println(elements.toString());

Output:

<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
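The other selector types and traversal calls mentioned earlier (class names, attributes, parents, siblings, attribute values) can be sketched as follows. The markup, the class name SelectorDemo, and both helper methods are invented for this example; the jsoup calls themselves (select, first, attr, parent, siblingElements, text) are standard.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class and markup, made up for this sketch.
class SelectorDemo {
    static final String HTML =
        "<div id='box' class='outer'>" +
        "<p class='intro'>Hello</p>" +
        "<a href='/about' title='About us'>About</a>" +
        "<p>World</p>" +
        "</div>";

    static Document parse() {
        return Jsoup.parseBodyFragment(HTML);
    }

    // Selects by attribute, then reads an attribute value, the parent, and the siblings.
    static String linkSummary(Document doc) {
        Element link = doc.select("a[title]").first();
        return link.attr("href") + " in #" + link.parent().id()
            + " with " + link.siblingElements().size() + " siblings";
    }

    // Selects by tag plus class name and replaces the element's text.
    static void renameIntro(Document doc, String newText) {
        doc.select("p.intro").first().text(newText);
    }

    public static void main(String[] args) {
        Document doc = parse();
        System.out.println(linkSummary(doc)); // /about in #box with 2 siblings
        renameIntro(doc, "Hi there");
        System.out.println(doc.select("p.intro").text()); // Hi there
    }
}
```

Calling text(String) on an element is also the usual way to replace a paragraph's text in place.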

To remove an element:

Elements elements = doc.select("#para1").remove();
System.out.println(doc.body().children().toString());
System.out.println("---------------------------------");
System.out.println(elements.toString());

Output:

<div id="div1">
<p id="para2">Second paragraph here!</p>
</div>
---------------------------------
<p id="para1">This is the first paragraph</p>

The removed elements are returned in an Elements object, but no longer exist in the Document.
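Because the removed element is still a live object, it can be put back somewhere else. The sketch below moves the first paragraph to the end of its div. MoveDemo and moveFirstParagraphToEnd are hypothetical names, and the code assumes the input contains elements with the ids div1 and para1.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; assumes the input has #div1 and #para1.
class MoveDemo {
    static Document moveFirstParagraphToEnd(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        Element div = doc.select("#div1").first();
        Element para1 = doc.select("#para1").first();

        // Detach the paragraph; the Element object keeps its content.
        para1.remove();

        // Re-attach it as the last child of the div.
        div.appendChild(para1);
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div id='div1'><p id='para1'>One</p><p id='para2'>Two</p></div>";
        System.out.println(moveFirstParagraphToEnd(html).body().html());
    }
}
```

The explicit remove() is there to mirror the example above; appendChild would also re-parent an element that is still attached.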

A powerful feature of jsoup is its ability to scrub HTML. You may be accepting HTML from users on your website, but you don’t want them injecting potentially harmful tags or code. The clean() method on the Jsoup class takes a Whitelist as one of its parameters. jsoup comes with several Whitelists, and you can create your own if you need something customized. (Recent jsoup releases rename the Whitelist class to Safelist; the methods used here are otherwise the same.) Here’s an example of cleaning the example HTML from above with the “basic” Whitelist:

System.out.println(Jsoup.clean(fragment, Whitelist.basic()));

Output:

<p>This is the first paragraph</p>
<p>Second paragraph here!</p>

Notice the missing <div> tags. The basic Whitelist does not allow them. The built-in Whitelists range from allowing no tags (text only) to a fairly wide variety of tags. You can also restrict which attributes are allowed on specific tags and which protocols (e.g. http and ftp) are allowed in URL attributes. For example, here is a custom Whitelist:

Whitelist myWhitelist = new Whitelist();
myWhitelist.addTags("div", "p");
myWhitelist.addAttributes("div", "class");
myWhitelist.addAttributes("p", "id");
System.out.println(Jsoup.clean(fragment, myWhitelist));

Output:

<div>
<p id="para1">This is the first paragraph</p>
<p id="para2">Second paragraph here!</p>
</div>

Notice that the id attribute is missing from the div tag: only the class attribute was allowed on div.
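The protocol restriction mentioned above can be sketched in the same way. This example assumes a recent jsoup release, where the class is named Safelist; on older versions, substitute Whitelist. CleanDemo, linksOnly, and the sample inputs are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; assumes a jsoup version that ships org.jsoup.safety.Safelist.
class CleanDemo {
    // Allows only <p> and <a>, and only http/https URLs in a[href].
    static Safelist linksOnly() {
        return new Safelist()
            .addTags("p", "a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https");
    }

    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, linksOnly());
    }

    public static void main(String[] args) {
        // The script element and its content are dropped.
        System.out.println(clean("<p>Hi<script>alert(1)</script></p>"));
        // The href is dropped because javascript: is not an allowed protocol.
        System.out.println(clean("<a href='javascript:alert(1)'>x</a>"));
        // The href is kept because https is allowed.
        System.out.println(clean("<a href='https://example.com/'>ok</a>"));
    }
}
```

Starting from an empty list and adding only what you need keeps the allowed surface small, at the cost of having to list every tag explicitly.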

Some other nice features of jsoup are its ability to read directly from a URL (Jsoup.connect(url)), the option to test a string of HTML against a Whitelist to check for validity, CSS selectors, and more.
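As a rough sketch of the first two features: Jsoup.isValid checks body HTML against a list of allowed tags without modifying it, and Jsoup.connect(url).get() fetches and parses a page. The example assumes a recent jsoup release with the Safelist class name (use Whitelist on older versions). ValidateDemo, isAcceptable, and fetchTitle are hypothetical names, and fetchTitle needs network access and a URL that you supply.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical demo class; assumes a jsoup version that ships org.jsoup.safety.Safelist.
class ValidateDemo {
    // Returns true only if the HTML uses nothing outside the basic list.
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    // Fetches a page over the network and returns its <title> text.
    static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<p>ok</p>"));      // true
        System.out.println(isAcceptable("<div>no</div>"));  // false: div is not in the basic list
    }
}
```

Validation is useful when you would rather reject bad input and tell the user than silently strip it with clean().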

If you need to manipulate HTML in your Java code, you need jsoup!
