yari-html-parser

HTML parser that builds on the XML parser and additionally processes embedded JavaScript and CSS found in <script>, <style>, onclick, style attributes, and more. Produces a single unified AST where each sub-language is parsed into its own typed tree. Never throws on malformed input.

Installation

// Gradle (Groovy DSL)
implementation 'com.easyparsingapi:yari-html-parser:VERSION'

// Maven
<dependency>
    <groupId>com.easyparsingapi</groupId>
    <artifactId>yari-html-parser</artifactId>
    <version>VERSION</version>
</dependency>

Pulls in yari-xml-parser, yari-css-parser, and yari-javascript-parser transitively.

HtmlParser — Entry Points

HtmlParser — not instantiable, all methods are static
Signature | Returns | Description
parseUnit(String html, HtmlConfig htmlConfig) | AstResult<Html> | The single public entry point. Parses the HTML string, then identifies JavaScript and CSS tags/attributes according to htmlConfig and re-parses their content with the dedicated parsers. Returns the Html AST paired with the consolidated full token list.

AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();

HtmlConfig — Parser Configuration

The static factory defaulConfig() returns a pre-built instance that covers standard HTML out of the box — use it unless you need custom tag/attribute recognition.

defaulConfig() — default values

The default configuration is pre-built with the following settings:

HtmlConfig.defaulConfig() — pre-built instance
Builder call | Value
javascriptTag(…) | <script> with no type, or type="text/javascript" / type="module"
cssTag(String) | "style"
cssAttribute(String) | "style"
acceptUnclosedTag(boolean) | true
tagAsPlainText(String) | "script"

Builder methods

HtmlConfig.Builder — obtained via HtmlConfig.builder()
Signature | Returns | Description
javascriptTag(String name) | Builder | Mark the tag with the given local name as a JavaScript block
javascriptTag(String namespace, String name) | Builder | Same with an explicit namespace
javascriptTag(Function<TagEntity, Boolean>) | Builder | Custom predicate: return true to mark a tag as JavaScript
javascriptAttribute(String name) | Builder | Mark the attribute with the given name as containing JavaScript (e.g. onclick)
javascriptAttribute(Function<Attribute, Boolean>) | Builder | Custom predicate for JavaScript attributes
cssTag(String name) | Builder | Mark the tag with the given local name as a CSS block (e.g. style)
cssTag(String namespace, String name) | Builder | Same with an explicit namespace
cssTag(Function<TagEntity, Boolean>) | Builder | Custom predicate for CSS tags
cssAttribute(String name) | Builder | Mark the attribute with the given name as containing CSS (e.g. style)
cssAttribute(Function<Attribute, Boolean>) | Builder | Custom predicate for CSS attributes
tagAsPlainText(String name) | Builder | Force the content of the named tag to be tokenised as raw text, not XML markup
tagAsPlainText(Function<TagEntity, Boolean>) | Builder | Custom predicate for plain-text tags
acceptUnclosedTag(boolean) | Builder | Default: true in defaulConfig(). When true, void tags without /> are accepted; when false, they produce an error node
build() | HtmlConfig | Materialise the configuration

Predicate types

The Function<…, Boolean> overloads above receive lightweight records declared in com.easyparsingapi.yari.parser.xml.lexer.TagEntity. These are the values a custom predicate inspects to decide whether a tag or attribute holds JavaScript or CSS.

TagEntity — record (Markup markup, List<Attribute> attributes)

A parsed tag: its qualified name plus its attribute list. Passed to javascriptTag, cssTag and tagAsPlainText predicates.

Accessor | Returns | Description
markup() | Markup | The tag's qualified name (namespace + local name)
attributes() | List<Attribute> | The tag's attributes, in source order

TagEntity.Attribute — record (Markup markup, String value)

A single attribute. Passed to javascriptAttribute and cssAttribute predicates.

Accessor | Returns | Description
markup() | Markup | The attribute's qualified name
value() | String | The unquoted attribute value, or null if the attribute has no value

TagEntity.Markup — record (String namespace, String name)

A namespace-qualified name shared by tags and attributes. toString() renders it as namespace:name (or just name when there is no namespace).

Accessor | Returns | Description
namespace() | String | The namespace prefix, or null if none
name() | String | The local name

// Treat every <x-script> tag, in any namespace, as JavaScript,
// and every attribute whose name starts with "css-" as CSS
HtmlConfig config = HtmlConfig.builder()
    .javascriptTag(tag -> "x-script".equals(tag.markup().name()))
    .cssAttribute(attr -> attr.markup().name().startsWith("css-"))
    .build();
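
The default javascriptTag rule listed earlier (<script> with no type, or type="text/javascript" / type="module") can also be written as a custom predicate. The sketch below is one possible reading of that rule, not the library's implementation: TagEntity, Attribute and Markup are local stand-in records that only mirror the documented shapes so the snippet compiles without the library, and the case-insensitive matching and the handling of an empty type value are assumptions.

```java
import java.util.List;
import java.util.function.Function;

final class ScriptTagPredicateSketch {

    // Stand-ins mirroring the documented record shapes (assumption: in the
    // library, Attribute and Markup are nested inside TagEntity).
    record Markup(String namespace, String name) {}
    record Attribute(Markup markup, String value) {}
    record TagEntity(Markup markup, List<Attribute> attributes) {}

    // True for <script> with no type attribute, or with a JavaScript type.
    static final Function<TagEntity, Boolean> IS_JAVASCRIPT = tag -> {
        if (!"script".equalsIgnoreCase(tag.markup().name())) {
            return false;
        }
        for (Attribute attr : tag.attributes()) {
            if ("type".equalsIgnoreCase(attr.markup().name())) {
                String type = attr.value();   // null when the attribute has no value
                return type == null           // assumption: empty type counts as "no type"
                        || type.isBlank()
                        || type.equalsIgnoreCase("text/javascript")
                        || type.equalsIgnoreCase("module");
            }
        }
        return true;                          // no type attribute at all
    };

    // Small helper for building un-namespaced stand-in tags.
    static TagEntity tag(String name, Attribute... attrs) {
        return new TagEntity(new Markup(null, name), List.of(attrs));
    }

    public static void main(String[] args) {
        Attribute jsonType = new Attribute(new Markup(null, "type"), "application/json");
        Attribute moduleType = new Attribute(new Markup(null, "type"), "module");

        System.out.println(IS_JAVASCRIPT.apply(tag("script")));             // true
        System.out.println(IS_JAVASCRIPT.apply(tag("script", moduleType))); // true
        System.out.println(IS_JAVASCRIPT.apply(tag("script", jsonType)));   // false
    }
}
```

With the real library on the classpath, the same lambda body could be passed to HtmlConfig.builder().javascriptTag(...), using the library's own TagEntity instead of the stand-ins.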

Html — Root AST Node

Root of a parsed HTML document. Extends Xml (which implements AstUnit), so its child nodes are XmlNode instances — the typed HTML nodes (ScriptTag, StyleTag) all implement XmlNode.

Html — extends Xml
getNodes() → List<XmlNode> Top-level nodes (inherited from Xml)
walkChildren(Consumer<Handler>) Walk direct children with a typed Handler
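astComments() → List<AstComment> All comments in the document
astCommentsOf(AstNode, Position…) → List<AstComment> Comments attached to the given node at the given position(s), e.g. AstUnit.Position.before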

Unified Tree — HTML Nodes

The HTML parser reuses the XML AST node types and adds JavaScript/CSS-aware nodes. Every node in the tree is an XmlNode; the types below are the ones you will branch on. The parser always produces a complete tree — even from malformed input.

Node type | Description
Tag | A regular HTML element with a head, body and foot. getHead() → TagHead, getBody() → TagBody
ScriptTag | A <script> element (extends TagAbstract<Script>). getBody() → Script
StyleTag | A <style> element (extends TagAbstract<Style>). getBody() → Style
ScriptAttributeValue | The value of a JavaScript attribute (e.g. onclick). Extends TagAttribute.Value. getNodes() → List<JavascriptNode>
StyleAttributeValue | The value of a CSS attribute (e.g. style). Extends TagAttribute.Value. getNodes() → List<CssNode>
XmlComment | An HTML/XML comment. getComment() → String
XmlError | Error node for malformed input. getFailureMessage() → String

Embedded CSS & JS

ScriptTag and StyleTag carry the parsed sub-language AST directly: Script extends Javascript and Style extends Css. So getBody() hands you a node that is the JavaScript or CSS tree — no extra parsing step, no unwrapping.

Html html = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig()).unit();

html.walkChildren(handler -> {
    switch (handler.node()) {
        case Tag tag ->
            System.out.println("tag: " + tag.getHead().getName().getValue());
        case ScriptTag scriptTag -> {
            Script script = scriptTag.getBody();  // Script extends Javascript
            System.out.println("JS statements: " + script.getNodes().size());
        }
        case StyleTag styleTag -> {
            Style style = styleTag.getBody();     // Style extends Css
            System.out.println("CSS rules: " + style.getNodes().size());
        }
        default -> {}
    }
});

Script — extends Javascript, implements XmlNode
getNodes() → List<JavascriptNode> The script's JavaScript statements (inherited from Javascript)
astComments() → List<AstComment> Comments inside the script

Style — extends Css, implements XmlNode
getNodes() → List<CssNode> The style's CSS rules (inherited from Css)
astComments() → List<AstComment> Comments inside the style block

Inline event handlers (onclick, onchange…) and style attributes are also parsed. The matching TagAttribute's value becomes a ScriptAttributeValue or StyleAttributeValue, each exposing getNodes() for the embedded JavaScript or CSS.
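
When handler attributes should be recognised by a rule rather than listed name by name, the javascriptAttribute(Function<Attribute, Boolean>) overload accepts a predicate. The following is a minimal sketch under stated assumptions: Attribute and Markup are local stand-in records mirroring the documented shapes (so the snippet compiles without the library), and "any un-namespaced attribute whose name starts with on and that carries a value" is a heuristic chosen for illustration, not a rule documented by the library.

```java
import java.util.Locale;
import java.util.function.Function;

final class EventHandlerPredicateSketch {

    // Stand-ins mirroring the documented TagEntity.Markup / TagEntity.Attribute records.
    record Markup(String namespace, String name) {}
    record Attribute(Markup markup, String value) {}

    // Heuristic (assumption): onclick, onchange, onload... but not a bare "on",
    // not namespaced attributes, and not attributes without a value.
    static final Function<Attribute, Boolean> IS_EVENT_HANDLER = attr -> {
        String name = attr.markup().name().toLowerCase(Locale.ROOT);
        return attr.markup().namespace() == null
                && name.length() > 2
                && name.startsWith("on")
                && attr.value() != null;
    };

    // Small helper for building un-namespaced stand-in attributes.
    static Attribute attribute(String name, String value) {
        return new Attribute(new Markup(null, name), value);
    }

    public static void main(String[] args) {
        System.out.println(IS_EVENT_HANDLER.apply(attribute("onclick", "save()")));   // true
        System.out.println(IS_EVENT_HANDLER.apply(attribute("style", "color: red"))); // false
        System.out.println(IS_EVENT_HANDLER.apply(attribute("onchange", null)));      // false
    }
}
```

A prefix rule like this also matches non-handler attributes that happen to start with on, so an explicit allow-list may be the safer choice for untrusted markup.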

Comments & Source Locations

parseUnit returns an AstResult<Html> that bundles the root node with the full token list. The token list makes it possible to extract the exact source substring of any node in the tree.

AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();

// All HTML comments
List<AstComment> htmlComments = html.astComments();

// Comments before a specific node
html.getNodes().forEach(node ->
    html.astCommentsOf(node, AstUnit.Position.before)
        .forEach(c -> System.out.println(((XmlComment) c).getComment())));

// Source substring for any node
// (someNode is a placeholder for any node taken from the tree)
System.out.println(result.substring(someNode.getSourceLocation()));

Error Recovery — Malformed Pages

The parser is fault-tolerant: it never throws on malformed HTML. Real-world pages are rarely well-formed, so the parser is built to keep going — it always returns a complete, usable tree and marks the broken spots instead of aborting. It recovers from, among others:

  • Unclosed tags — an element that is opened but never closed (e.g. <li>First<li>Second, or a bare <br> / <img>).
  • Closing-only tags — a closing tag with no matching opening tag (e.g. a stray </div>).
  • Broken nesting — overlapping or mis-ordered tags.

With acceptUnclosedTag(true) (the default in defaulConfig()) unclosed tags are accepted silently; any genuinely unrecoverable fragment becomes an XmlError node carrying its source location and a failure message — never an exception.

// A deliberately malformed page:
//  • <li> items are never closed   (unclosed tags)
//  • a stray </section> has no opening tag   (closing-only tag)
String malformed = """
    <ul>
      <li>First
      <li>Second
    </ul>
    </section>
    <p>Trailing text
    """;

// defaulConfig() already sets acceptUnclosedTag(true)
AstResult<Html> result = HtmlParser.parseUnit(malformed, HtmlConfig.defaulConfig());
Html html = result.unit();

// The whole document still parsed into a usable tree — no exception thrown
System.out.println(html.getNodes().size() + " top-level nodes");

// Inspect the recovered error markers
html.astStream()
    .filter(n -> n instanceof XmlError)
    .forEach(n -> System.out.println(
        "error at " + n.getSourceLocation()));

To flag unclosed tags instead of accepting them silently, build a config with acceptUnclosedTag(false): unclosed tags then surface as XmlError nodes too, so you can validate strictly without ever having to catch an exception.

For the full code-level reference, see the README on GitHub and the Javadoc under yari-html-parser/src/main/java/.