HTML Navigation – Java Examples

Data extraction, also known as web data extraction, is the process of collecting data from websites. Data extraction software automates this process according to your requirements. With the Aspose.HTML class library, you can build your own extraction application: the API provides tools to analyze HTML documents and collect information from them.

HTML navigation

There are several ways to navigate an HTML document. The properties listed below are the simplest way to reach any node in the DOM:

| Property | Description |
|---|---|
| FirstChild | Returns a reference to the first child node of the element. |
| LastChild | Returns a reference to the last child node of the element. |
| NextSibling | Returns a reference to the sibling node that immediately follows the element. |
| PreviousSibling | Returns a reference to the sibling node that immediately precedes the element. |
| ChildNodes | Returns a list that contains all children of the element. |

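The Java API exposes these properties as getter methods. The example below navigates through an HTML document with the element-only variants getFirstElementChild() and getNextElementSibling(), which skip the whitespace text node between the two SPAN elements: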

// Prepare an HTML code
String html_code = "<span>Hello</span> <span>World!</span>";

// Initialize a document from the prepared code
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(html_code, ".");
try {
    // Get the reference to the first child (first SPAN) of the BODY
    com.aspose.html.dom.Element element = document.getBody().getFirstElementChild();
    System.out.println(element.getTextContent()); // output: Hello

    // Get the reference to the second SPAN element
    element = element.getNextElementSibling();
    System.out.println(element.getTextContent()); // output: World!
} finally {
    if (document != null) {
        document.dispose();
    }
}
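
To see why the element-only accessors matter, the following sketch walks the same markup with plain FirstChild / NextSibling navigation. It is a minimal illustration that assumes the JDK's built-in XML DOM (org.w3c.dom) rather than Aspose.HTML, so the markup is wrapped in a well-formed body element; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch based on the JDK XML DOM, not on Aspose.HTML; all names here are hypothetical.
class SiblingNavigationSketch {

    // Parses a well-formed XML string with the JDK's default DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample markup", e);
        }
    }

    // ChildNodes counts every child, including whitespace text nodes.
    static int childCount(String xml) {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    // FirstChild/NextSibling visit all node types, so non-element nodes are skipped by hand.
    static List<String> elementChildTexts(String xml) {
        List<String> texts = new ArrayList<>();
        Node node = parse(xml).getDocumentElement().getFirstChild();
        for (; node != null; node = node.getNextSibling()) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                texts.add(node.getTextContent());
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String xml = "<body><span>Hello</span> <span>World!</span></body>";
        System.out.println(childCount(xml));        // 3: span, whitespace text, span
        System.out.println(elementChildTexts(xml)); // [Hello, World!]
    }
}
```

The space between the two SPAN elements is a text node of its own, which is why the sibling loop needs the node-type check.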

For more complicated scenarios, when you need to find nodes that match a specific pattern (e.g., all headers or all links), you can use a TreeWalker or NodeIterator object with a custom NodeFilter implementation.

The next example shows how to implement your own NodeFilter to skip all elements except images:

package com.aspose.html.examples.java;

public class OnlyImageFilter extends com.aspose.html.dom.traversal.filters.NodeFilter {

    @Override
    public short acceptNode(com.aspose.html.dom.Node n) {
        // The current filter skips all elements, except IMG elements.
        return "img".equals(n.getLocalName()) ? FILTER_ACCEPT : FILTER_SKIP;
    }

}

Once you have implemented the filter, you can use it for HTML navigation as follows:

// Prepare an HTML code
String code = "<p>Hello</p>\n" +
              "<img src='image1.png'>\n" +
              "<img src='image2.png'>\n" +
              "<p>World!</p>\n";

// Initialize a document based on the prepared code
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(code, ".");
try {
    // To start HTML navigation, create an instance of TreeWalker.
    // The parameters mean: start walking from the root of the document,
    // consider all node types, and apply our custom filter implementation.
    com.aspose.html.dom.traversal.ITreeWalker walker = document.createTreeWalker(document, com.aspose.html.dom.traversal.filters.NodeFilter.SHOW_ALL, new OnlyImageFilter());
    // Visit every node accepted by the filter
    while (walker.nextNode() != null) {
        // Since we are using our own filter, the current node is always an instance of HTMLImageElement,
        // so no additional validation is needed here.
        com.aspose.html.HTMLImageElement image = (com.aspose.html.HTMLImageElement) walker.getCurrentNode();

        // Prints the src of each image in turn: image1.png, then image2.png
        System.out.println(image.getSrc());
    }
} finally {
    if (document != null) {
        document.dispose();
    }
}
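
The text above mentions NodeIterator as an alternative to TreeWalker without showing it. The following sketch illustrates the same "images only" filter with a NodeIterator. It assumes the JDK's built-in org.w3c.dom.traversal package rather than Aspose.HTML, and it relies on the JDK's default DOM implementation supporting the DOM Traversal feature; the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Sketch based on the JDK XML DOM, not on Aspose.HTML; all names here are hypothetical.
class ImageIteratorSketch {

    // Collects the src attribute of every img element, in document order.
    static List<String> imageSources(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Assumption: the default JDK document object implements DocumentTraversal.
            DocumentTraversal traversal = (DocumentTraversal) doc;

            // Accept img elements, skip everything else (children of skipped nodes are still visited).
            NodeFilter onlyImages = node -> "img".equals(node.getNodeName())
                    ? NodeFilter.FILTER_ACCEPT
                    : NodeFilter.FILTER_SKIP;

            // SHOW_ELEMENT pre-filters by node type before the custom filter runs.
            NodeIterator iterator = traversal.createNodeIterator(
                    doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, onlyImages, false);

            List<String> sources = new ArrayList<>();
            for (Node node = iterator.nextNode(); node != null; node = iterator.nextNode()) {
                sources.add(((Element) node).getAttribute("src"));
            }
            iterator.detach();
            return sources;
        } catch (Exception e) {
            throw new IllegalStateException("Could not traverse the sample markup", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><p>Hello</p><img src='image1.png'/>"
                + "<img src='image2.png'/><p>World!</p></body>";
        System.out.println(imageSources(xml)); // [image1.png, image2.png]
    }
}
```

A NodeIterator presents the accepted nodes as a flat sequence, while a TreeWalker additionally lets you move to parents, children, and siblings within the filtered view.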

XPath

XPath (XML Path Language) is an alternative to HTML navigation. XPath expressions have a compact syntax and, more importantly, are easy to read and maintain.

The following example shows how to use XPath queries with the Aspose.HTML API:

// Prepare an HTML code
String code = "<div class='happy'>\n" +
              "    <div>\n" +
              "        <span>Hello</span>\n" +
              "    </div>\n" +
              "</div>\n" +
              "<p class='happy'>\n" +
              "    <span>World!</span>\n" +
              "</p>\n";

// Initialize a document based on the prepared code
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(code, ".");
try {
    // Evaluate an XPath expression that selects all descendant SPAN elements
    // of elements whose 'class' attribute equals 'happy'
    com.aspose.html.dom.xpath.IXPathResult result = document.evaluate("//*[@class='happy']//span",
            document,
            null,
            com.aspose.html.dom.xpath.XPathResultType.Any,
            null
    );

    // Iterate over the resulting nodes
    for (com.aspose.html.dom.Node node; (node = result.iterateNext()) != null; ) {
        // Prints the text of each SPAN in turn: Hello, then World!
        System.out.println(node.getTextContent());
    }
} finally {
    if (document != null) {
        document.dispose();
    }
}
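
An XPath expression can produce more than a node list; the result type passed to the evaluation decides what comes back. The following sketch illustrates this with the same expression, once as a node set and once wrapped in count(). It assumes the JDK's built-in javax.xml.xpath API rather than Aspose.HTML, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch based on the JDK XPath API, not on Aspose.HTML; all names here are hypothetical.
class XPathResultTypesSketch {

    static final String EXPRESSION = "//*[@class='happy']//span";

    // Evaluates the expression as a node set and returns the text of each match.
    static List<String> spanTexts(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(EXPRESSION, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    // Evaluates count(...) over the same expression and returns the number of matches.
    static int spanCount(String xml) {
        try {
            Double count = (Double) XPathFactory.newInstance().newXPath()
                    .evaluate("count(" + EXPRESSION + ")",
                            new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return count.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><div class='happy'><div><span>Hello</span></div></div>"
                + "<p class='happy'><span>World!</span></p></body>";
        System.out.println(spanTexts(xml)); // [Hello, World!]
        System.out.println(spanCount(xml)); // 2
    }
}
```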

CSS Selector

Along with HTML navigation and XPath, you can use the CSS Selector API, which the library also supports. It lets you build a search pattern that matches elements in the document tree using CSS selector syntax.

// Prepare an HTML code
String code = "<div class='happy'>\n" +
              "    <div>\n" +
              "        <span>Hello</span>\n" +
              "    </div>\n" +
              "</div>\n" +
              "<p class='happy'>\n" +
              "    <span>World!</span>\n" +
              "</p>\n";

// Initialize a document based on the prepared code
com.aspose.html.HTMLDocument document = new com.aspose.html.HTMLDocument(code, ".");
try {
    // Create a CSS selector that extracts all SPAN elements nested inside
    // elements that have the class 'happy'
    com.aspose.html.collections.NodeList elements = document.querySelectorAll(".happy span");

    // Iterate over the resulting list of elements
    elements.forEach(element -> {
        // Prints the inner HTML of each SPAN in turn: Hello, then World!
        System.out.println(((com.aspose.html.HTMLElement) element).getInnerHTML());
    });
} finally {
    if (document != null) {
        document.dispose();
    }
}
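
The selector `.happy span` and the XPath expression `//*[@class='happy']//span` from the previous section are close but not identical: a CSS class selector matches one token of the class attribute, while the XPath test compares the whole attribute value. The following sketch illustrates the difference and a commonly used token-matching XPath equivalent. It assumes the JDK's built-in javax.xml.xpath API rather than Aspose.HTML (the JDK has no CSS selector engine), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch based on the JDK XPath API, not on Aspose.HTML; all names here are hypothetical.
class ClassTokenSketch {

    // Compares the whole attribute value: misses class='happy big'.
    static final String EXACT_VALUE = "//*[@class='happy']//span";

    // Matches 'happy' as one whitespace-separated token, like the CSS selector ".happy span".
    static final String CLASS_TOKEN =
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' happy ')]//span";

    // Returns the text content of every node selected by the expression.
    static List<String> select(String xml, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<body><div class='happy big'><span>Hello</span></div>"
                + "<p class='happy'><span>World!</span></p></body>";
        System.out.println(select(xml, EXACT_VALUE)); // [World!]
        System.out.println(select(xml, CLASS_TOKEN)); // [Hello, World!]
    }
}
```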