yari-html-parser

HTML parser that builds on the XML parser and additionally parses the JavaScript and CSS embedded in <script> and <style> tags and in attributes such as onclick and style. It produces a single unified AST in which each sub-language is parsed into its own typed tree, and it never throws on malformed input.

Installation

// Gradle (Groovy DSL)
implementation 'com.easyparsingapi:yari-html-parser:VERSION'

// Maven
<dependency>
    <groupId>com.easyparsingapi</groupId>
    <artifactId>yari-html-parser</artifactId>
    <version>VERSION</version>
</dependency>

Pulls in yari-xml-parser, yari-css-parser, and yari-javascript-parser transitively.

HtmlParser — Entry Points

Signature	Returns	Description
`parseUnit(String html, HtmlConfig htmlConfig)`	→ AstResult<Html>	The single public entry point. Parses the HTML string, then identifies JavaScript and CSS tags/attributes according to `htmlConfig` and re-parses their content with the dedicated parsers. Returns the `Html` AST paired with the consolidated full token list.

AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();

HtmlConfig — Parser Configuration

The static factory defaulConfig() returns a pre-built instance that covers standard HTML out of the box — use it unless you need custom tag/attribute recognition.

defaulConfig() — default values

The default configuration is pre-built with the following settings:

Builder call	Value
`javascriptTag(…)`	`<script>` with no `type`, or `type="text/javascript"` / `type="module"`
`cssTag(String)`	`"style"`
`cssAttribute(String)`	`"style"`
`acceptUnclosedTag(boolean)`	`true`
`tagAsPlainText(String)`	`"script"`

Builder methods

Signature	Returns	Description
`javascriptTag(String name)`	Builder	Mark the tag with the given local name as a JavaScript block
`javascriptTag(String namespace, String name)`	Builder	Same with an explicit namespace
`javascriptTag(Function<TagEntity, Boolean>)`	Builder	Custom predicate — return `true` to mark a tag as JavaScript
`javascriptAttribute(String name)`	Builder	Mark the attribute with the given name as containing JavaScript (e.g. `onclick`)
`javascriptAttribute(Function<Attribute, Boolean>)`	Builder	Custom predicate for JavaScript attributes
`cssTag(String name)`	Builder	Mark the tag with the given local name as a CSS block (e.g. `style`)
`cssTag(String namespace, String name)`	Builder	Same with an explicit namespace
`cssTag(Function<TagEntity, Boolean>)`	Builder	Custom predicate for CSS tags
`cssAttribute(String name)`	Builder	Mark the attribute with the given name as containing CSS (e.g. `style`)
`cssAttribute(Function<Attribute, Boolean>)`	Builder	Custom predicate for CSS attributes
`tagAsPlainText(String name)`	Builder	Force the content of the named tag to be tokenised as raw text, not XML markup
`tagAsPlainText(Function<TagEntity, Boolean>)`	Builder	Custom predicate for plain-text tags
`acceptUnclosedTag(boolean)`	Builder	Default: `true` in `defaulConfig()`. When `true`, void tags without `/>` are accepted; when `false`, they produce an error node
`build()`	HtmlConfig	Materialise the configuration

Predicate types

The Function<…, Boolean> overloads above receive lightweight records declared in com.easyparsingapi.yari.parser.xml.lexer.TagEntity. These are the values a custom predicate inspects to decide whether a tag or attribute holds JavaScript or CSS.

TagEntity

Accessor	Returns	Description
`markup()`	Markup	The tag's qualified name (namespace + local name)
`attributes()`	List<Attribute>	The tag's attributes, in source order

Attribute

Accessor	Returns	Description
`markup()`	Markup	The attribute's qualified name
`value()`	String	The unquoted attribute value, or `null` if the attribute has no value

Markup

Accessor	Returns	Description
`namespace()`	String	The namespace prefix, or `null` if none
`name()`	String	The local name

// Treat every <x-script> tag, in any namespace, as JavaScript,
// and every attribute whose name starts with "css-" as CSS
HtmlConfig config = HtmlConfig.builder()
    .javascriptTag(tag -> "x-script".equals(tag.markup().name()))
    .cssAttribute(attr -> attr.markup().name().startsWith("css-"))
    .build();
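
As a further illustration, the default javascriptTag rule listed earlier (a <script> with no type, or with type="text/javascript" / type="module") can be written as a Function<TagEntity, Boolean>. The sketch below is not the library's own implementation. It uses local stand-in records that mirror only the documented accessors so that it compiles on its own, and the case-insensitive matching and the "on" prefix test for event-handler attributes are assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

final class PredicateSketch {

    // Hypothetical stand-ins mirroring only the accessors documented above;
    // the real records ship with the library.
    record Markup(String namespace, String name) {}
    record Attribute(Markup markup, String value) {}
    record TagEntity(Markup markup, List<Attribute> attributes) {}

    // true for <script> with no type attribute, or with
    // type="text/javascript" / type="module" (the rule from the defaults table).
    static final Function<TagEntity, Boolean> IS_JAVASCRIPT_TAG = tag -> {
        if (!"script".equalsIgnoreCase(tag.markup().name())) {
            return false;
        }
        for (Attribute attr : tag.attributes()) {
            if ("type".equalsIgnoreCase(attr.markup().name())) {
                String type = attr.value() == null
                        ? ""
                        : attr.value().trim().toLowerCase(Locale.ROOT);
                return type.equals("text/javascript") || type.equals("module");
            }
        }
        return true; // no type attribute found
    };

    // true for any attribute whose name starts with "on" (onclick, onchange, ...).
    static final Function<Attribute, Boolean> IS_EVENT_HANDLER = attr ->
            attr.markup().name().toLowerCase(Locale.ROOT).startsWith("on");

    public static void main(String[] args) {
        TagEntity module = new TagEntity(new Markup(null, "script"),
                List.of(new Attribute(new Markup(null, "type"), "module")));
        TagEntity json = new TagEntity(new Markup(null, "script"),
                List.of(new Attribute(new Markup(null, "type"), "application/json")));
        System.out.println(IS_JAVASCRIPT_TAG.apply(module)); // true
        System.out.println(IS_JAVASCRIPT_TAG.apply(json));   // false
        System.out.println(IS_EVENT_HANDLER.apply(
                new Attribute(new Markup(null, "onclick"), "go()"))); // true
    }
}
```

With the library on the classpath, predicates of this shape would be written against the real TagEntity and Attribute records and passed to javascriptTag(Function<TagEntity, Boolean>) and javascriptAttribute(Function<Attribute, Boolean>).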

Html — Root AST Node

Root of a parsed HTML document. Extends Xml (which implements AstUnit), so its child nodes are XmlNode instances — the typed HTML nodes (ScriptTag, StyleTag) all implement XmlNode.

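Method	Returns	Description
getNodes()	→ List<XmlNode>	Top-level nodes (inherited from `Xml`)
walkChildren(Consumer<Handler>)		Walk direct children with a typed Handler
astComments()	→ List<AstComment>	All comments in the document
astCommentsOf(AstNode, Position…)	→ List<AstComment>	Comments attached to the given node at the given position(s), e.g. `AstUnit.Position.before`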

Unified Tree — HTML Nodes

The HTML parser reuses the XML AST node types and adds JavaScript/CSS-aware nodes. Every node in the tree is an XmlNode; the types below are the ones you will branch on. The parser always produces a complete tree — even from malformed input.

Node type	Description
`Tag`	A regular HTML element with a head, body and foot. `getHead() → TagHead`, `getBody() → TagBody`
`ScriptTag`	A `<script>` element (extends `TagAbstract<Script>`). `getBody() → Script`
`StyleTag`	A `<style>` element (extends `TagAbstract<Style>`). `getBody() → Style`
`ScriptAttributeValue`	The value of a JavaScript attribute (e.g. `onclick`). Extends `TagAttribute.Value`. `getNodes() → List<JavascriptNode>`
`StyleAttributeValue`	The value of a CSS attribute (e.g. `style`). Extends `TagAttribute.Value`. `getNodes() → List<CssNode>`
`XmlComment`	An HTML/XML comment. `getComment() → String`
`XmlError`	Error node for malformed input. `getFailureMessage() → String`

Embedded CSS & JS

ScriptTag and StyleTag carry the parsed sub-language AST directly: Script extends Javascript and Style extends Css. So getBody() hands you a node that is the JavaScript or CSS tree — no extra parsing step, no unwrapping.

Html html = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig()).unit();

html.walkChildren(handler -> {
    switch (handler.node()) {
        case Tag tag ->
            System.out.println("tag: " + tag.getHead().getName().getValue());
        case ScriptTag scriptTag -> {
            Script script = scriptTag.getBody();  // Script extends Javascript
            System.out.println("JS statements: " + script.getNodes().size());
        }
        case StyleTag styleTag -> {
            Style style = styleTag.getBody();     // Style extends Css
            System.out.println("CSS rules: " + style.getNodes().size());
        }
        default -> {}
    }
});

Script

Method	Returns	Description
getNodes()	→ List<JavascriptNode>	The script's JavaScript statements (inherited from `Javascript`)
astComments()	→ List<AstComment>	Comments inside the script

Style

Method	Returns	Description
getNodes()	→ List<CssNode>	The style's CSS rules (inherited from `Css`)
astComments()	→ List<AstComment>	Comments inside the style block

Inline event handlers (onclick, onchange…) and style attributes are also parsed. The matching TagAttribute's value becomes a ScriptAttributeValue or StyleAttributeValue, each exposing getNodes() for the embedded JavaScript or CSS.

Comments & Source Locations

parseUnit returns an AstResult<Html>, which bundles the root node with the full token list. The token list makes it possible to extract the exact source substring of any node in the tree.

AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();

// All HTML comments
List<AstComment> htmlComments = html.astComments();

// Comments before a specific node
html.getNodes().forEach(node ->
    html.astCommentsOf(node, AstUnit.Position.before)
        .forEach(c -> System.out.println(((XmlComment) c).getComment())));

// Source substring for any node (here: every top-level node)
html.getNodes().forEach(node ->
    System.out.println(result.substring(node.getSourceLocation())));

Error Recovery — Malformed Pages

The parser is fault-tolerant: it never throws on malformed HTML. Real-world pages are rarely well-formed, so it always returns a complete, usable tree and marks the broken spots instead of aborting. Among other things, it recovers from:

Unclosed tags — an element that is opened but never closed (e.g. <li>First<li>Second, or a bare <br> / <img>).
Closing-only tags — a closing tag with no matching opening tag (e.g. a stray </div>).
Broken nesting — overlapping or mis-ordered tags.

With acceptUnclosedTag(true) (the default in defaulConfig()) unclosed tags are accepted silently; any genuinely unrecoverable fragment becomes an XmlError node carrying its source location and a failure message — never an exception.

// A deliberately malformed page:
//  • <li> items are never closed   (unclosed tags)
//  • a stray </section> has no opening tag   (closing-only tag)
String malformed = """
    <ul>
      <li>First
      <li>Second
    </ul>
    </section>
    <p>Trailing text
    """;

// defaulConfig() already sets acceptUnclosedTag(true)
AstResult<Html> result = HtmlParser.parseUnit(malformed, HtmlConfig.defaulConfig());
Html html = result.unit();

// The whole document still parsed into a usable tree — no exception thrown
System.out.println(html.getNodes().size() + " top-level nodes");

// Inspect the recovered error markers
html.astStream()
    .filter(n -> n instanceof XmlError)
    .forEach(n -> System.out.println(
        "error at " + n.getSourceLocation()));

To validate strictly instead, build a config with acceptUnclosedTag(false): unclosed tags then surface as XmlError nodes too, so you can detect malformed input without ever having to catch an exception.
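
Since errors arrive as nodes and not as exceptions, strict validation comes down to collecting the error nodes and checking whether any exist. The helper below is a minimal sketch of that pattern, not part of the library. It is generic over the node and error types, and PlainNode and ErrorNode are hypothetical stand-ins that exist only so the example runs on its own.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StrictValidationSketch {

    // Hypothetical stand-ins for tree nodes, present only so the sketch runs alone.
    record PlainNode(String text) {}
    record ErrorNode(String failureMessage) {}

    // Collects every node of the given error type from a stream of nodes.
    static <E> List<E> collectErrors(Stream<?> nodes, Class<E> errorType) {
        return nodes.filter(errorType::isInstance)
                .map(errorType::cast)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Object> nodes = List.of(
                new PlainNode("<ul>"),
                new ErrorNode("unclosed <li>"),
                new PlainNode("</ul>"));
        List<ErrorNode> errors = collectErrors(nodes.stream(), ErrorNode.class);
        if (errors.isEmpty()) {
            System.out.println("document is valid");
        } else {
            errors.forEach(e -> System.out.println("invalid: " + e.failureMessage()));
        }
    }
}
```

Assuming html.astStream() yields a java.util.stream.Stream (the filter/forEach chain in the example above suggests it does), the call site would presumably be collectErrors(html.astStream(), XmlError.class), with getFailureMessage() and getSourceLocation() available on each result for reporting.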

For the full code-level reference, see the README on GitHub and Javadoc under yari-html-parser/src/main/java/.