yari-html-parser
HTML parser that builds on the XML parser and additionally processes embedded JavaScript and CSS found in <script>, <style>, onclick, style attributes, and more. Produces a single unified AST where each sub-language is parsed into its own typed tree. Never throws on malformed input.
Installation
// Gradle (Groovy DSL)
implementation 'com.easyparsingapi:yari-html-parser:VERSION'
// Maven
<dependency>
    <groupId>com.easyparsingapi</groupId>
    <artifactId>yari-html-parser</artifactId>
    <version>VERSION</version>
</dependency>
Pulls in yari-xml-parser, yari-css-parser, and yari-javascript-parser transitively.
HtmlParser — Entry Points
| Signature | Returns | Description |
|---|---|---|
| parseUnit(String html, HtmlConfig htmlConfig) | → AstResult<Html> | The single public entry point. Parses the HTML string, then identifies JavaScript and CSS tags/attributes according to htmlConfig and re-parses their content with the dedicated parsers. Returns the Html AST paired with the consolidated full token list. |
AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();
HtmlConfig — Parser Configuration
The static factory defaulConfig() returns a pre-built instance that covers
standard HTML out of the box — use it unless you need custom tag/attribute recognition.
defaulConfig() — default values
The default configuration is pre-built with the following settings:
| Builder call | Value |
|---|---|
| javascriptTag(…) | <script> with no type, or type="text/javascript" / type="module" |
| cssTag(String) | "style" |
| cssAttribute(String) | "style" |
| acceptUnclosedTag(boolean) | true |
| tagAsPlainText(String) | "script" |
Builder methods
HtmlConfig.builder()
| Signature | Returns | Description |
|---|---|---|
| javascriptTag(String name) | Builder | Mark the tag with the given local name as a JavaScript block |
| javascriptTag(String namespace, String name) | Builder | Same, with an explicit namespace |
| javascriptTag(Function<TagEntity, Boolean>) | Builder | Custom predicate: return true to mark a tag as JavaScript |
| javascriptAttribute(String name) | Builder | Mark the attribute with the given name as containing JavaScript (e.g. onclick) |
| javascriptAttribute(Function<Attribute, Boolean>) | Builder | Custom predicate for JavaScript attributes |
| cssTag(String name) | Builder | Mark the tag with the given local name as a CSS block (e.g. style) |
| cssTag(String namespace, String name) | Builder | Same, with an explicit namespace |
| cssTag(Function<TagEntity, Boolean>) | Builder | Custom predicate for CSS tags |
| cssAttribute(String name) | Builder | Mark the attribute with the given name as containing CSS (e.g. style) |
| cssAttribute(Function<Attribute, Boolean>) | Builder | Custom predicate for CSS attributes |
| tagAsPlainText(String name) | Builder | Force the content of the named tag to be tokenised as raw text, not XML markup |
| tagAsPlainText(Function<TagEntity, Boolean>) | Builder | Custom predicate for plain-text tags |
| acceptUnclosedTag(boolean) | Builder | Default: true in defaulConfig(). When true, void tags without /> are accepted; when false, they produce an error node |
| build() | HtmlConfig | Materialise the configuration |
Predicate types
The Function<…, Boolean> overloads above receive lightweight
records declared in com.easyparsingapi.yari.parser.xml.lexer.TagEntity.
These are the values a custom predicate inspects to decide whether a tag or
attribute holds JavaScript or CSS.
TagEntity(Markup markup, List<Attribute> attributes)
A parsed tag: its qualified name plus its attribute list. Passed to javascriptTag, cssTag and tagAsPlainText predicates.
| Accessor | Returns | Description |
|---|---|---|
| markup() | Markup | The tag's qualified name (namespace + local name) |
| attributes() | List<Attribute> | The tag's attributes, in source order |
Attribute(Markup markup, String value)
A single attribute. Passed to javascriptAttribute and cssAttribute predicates.
| Accessor | Returns | Description |
|---|---|---|
| markup() | Markup | The attribute's qualified name |
| value() | String | The unquoted attribute value, or null if the attribute has no value |
Markup(String namespace, String name)
A namespace-qualified name shared by tags and attributes. toString() renders it as namespace:name (or just name when there is no namespace).
| Accessor | Returns | Description |
|---|---|---|
| namespace() | String | The namespace prefix, or null if none |
| name() | String | The local name |
// Treat every <x-script> tag, in any namespace, as JavaScript,
// and every attribute whose name starts with "css-" as CSS
HtmlConfig config = HtmlConfig.builder()
    .javascriptTag(tag -> "x-script".equals(tag.markup().name()))
    .cssAttribute(attr -> attr.markup().name().startsWith("css-"))
    .build();
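A predicate can also inspect the attribute list, not just the tag name. The sketch below approximates the default <script> rule from the defaults table (no type, or type="text/javascript" / type="module"). So that it compiles without the library on the classpath, it declares local stand-in records that mirror the documented TagEntity, Attribute and Markup shapes; real code would use the library's records instead. The class name ScriptTagPredicateSketch and the constant IS_JAVASCRIPT_TAG are hypothetical, and the case-sensitive comparisons are an assumption rather than the library's documented behaviour.

```java
import java.util.List;
import java.util.function.Function;

class ScriptTagPredicateSketch {

    // Stand-ins for the documented records (assumption: same accessors as the real ones).
    record Markup(String namespace, String name) {
        @Override
        public String toString() {
            // Documented rendering: namespace:name, or just name without a namespace
            return namespace == null ? name : namespace + ":" + name;
        }
    }

    record Attribute(Markup markup, String value) {}

    record TagEntity(Markup markup, List<Attribute> attributes) {}

    // Hypothetical predicate: <script> with no type attribute,
    // or with type="text/javascript" / type="module".
    static final Function<TagEntity, Boolean> IS_JAVASCRIPT_TAG = tag -> {
        if (!"script".equals(tag.markup().name())) {
            return false;
        }
        for (Attribute attribute : tag.attributes()) {
            if ("type".equals(attribute.markup().name())) {
                String type = attribute.value(); // null when the attribute has no value
                return "text/javascript".equals(type) || "module".equals(type);
            }
        }
        return true; // no type attribute at all
    };

    public static void main(String[] args) {
        Markup script = new Markup(null, "script");
        Markup type = new Markup(null, "type");

        TagEntity plain = new TagEntity(script, List.of());
        TagEntity json = new TagEntity(script, List.of(new Attribute(type, "application/json")));

        System.out.println(plain.markup() + " is JavaScript: " + IS_JAVASCRIPT_TAG.apply(plain));
        System.out.println(json.markup() + " (JSON) is JavaScript: " + IS_JAVASCRIPT_TAG.apply(json));
    }
}
```

With the library available, a lambda of the same shape could be passed to the javascriptTag(Function<TagEntity, Boolean>) overload listed in the builder table.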
Html — Root AST Node
Root of a parsed HTML document. Extends Xml (which implements AstUnit), so its child nodes are XmlNode instances — the typed HTML nodes (ScriptTag, StyleTag) all implement XmlNode.
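| Signature | Returns | Description |
|---|---|---|
| getNodes() | → List<XmlNode> | Top-level nodes (inherited from Xml) |
| walkChildren(Consumer<Handler>) | | Walk direct children with a typed Handler |
| astComments() | → List<AstComment> | All comments in the document |
| astCommentsOf(AstNode, Position…) | → List<AstComment> | Comments at the given position(s) relative to a node (e.g. AstUnit.Position.before) |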
Unified Tree — HTML Nodes
The HTML parser reuses the XML AST node types and adds JavaScript/CSS-aware nodes. Every node in the tree is an XmlNode; the types below are the ones you will branch on. The parser always produces a complete tree — even from malformed input.
| Node type | Description |
|---|---|
| Tag | A regular HTML element with a head, body and foot. getHead() → TagHead, getBody() → TagBody |
| ScriptTag | A <script> element (extends TagAbstract<Script>). getBody() → Script |
| StyleTag | A <style> element (extends TagAbstract<Style>). getBody() → Style |
| ScriptAttributeValue | The value of a JavaScript attribute (e.g. onclick). Extends TagAttribute.Value. getNodes() → List<JavascriptNode> |
| StyleAttributeValue | The value of a CSS attribute (e.g. style). Extends TagAttribute.Value. getNodes() → List<CssNode> |
| XmlComment | An HTML/XML comment. getComment() → String |
| XmlError | Error node for malformed input. getFailureMessage() → String |
Embedded CSS & JS
ScriptTag and StyleTag carry the parsed sub-language AST directly: Script extends Javascript and Style extends Css. So getBody() hands you a node that is the JavaScript or CSS tree — no extra parsing step, no unwrapping.
Html html = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig()).unit();
html.walkChildren(handler -> {
    switch (handler.node()) {
        case Tag tag ->
            System.out.println("tag: " + tag.getHead().getName().getValue());
        case ScriptTag scriptTag -> {
            Script script = scriptTag.getBody(); // Script extends Javascript
            System.out.println("JS statements: " + script.getNodes().size());
        }
        case StyleTag styleTag -> {
            Style style = styleTag.getBody(); // Style extends Css
            System.out.println("CSS rules: " + style.getNodes().size());
        }
        default -> {}
    }
});
Script
| Signature | Returns | Description |
|---|---|---|
| getNodes() | → List<JavascriptNode> | The script's JavaScript statements (inherited from Javascript) |
| astComments() | → List<AstComment> | Comments inside the script |
Style
| Signature | Returns | Description |
|---|---|---|
| getNodes() | → List<CssNode> | The style's CSS rules (inherited from Css) |
| astComments() | → List<AstComment> | Comments inside the style block |
Inline event handlers (onclick, onchange…) and style attributes are also parsed. The matching TagAttribute's value becomes a ScriptAttributeValue or StyleAttributeValue, each exposing getNodes() for the embedded JavaScript or CSS.
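If a project needs to recognise handler or style attributes beyond what its configuration already covers, attribute predicates for javascriptAttribute(Function<Attribute, Boolean>) and cssAttribute(Function<Attribute, Boolean>) might look like the sketch below. It uses local stand-in records with the documented Attribute and Markup shapes so it runs on its own. The class name AttributePredicateSketch and both constants are hypothetical, and the "name starts with on" rule is an illustrative assumption, not a statement of how the library detects event handlers.

```java
import java.util.Locale;
import java.util.function.Function;

class AttributePredicateSketch {

    // Stand-ins for the documented records (assumption: same accessors as the real ones).
    record Markup(String namespace, String name) {}

    record Attribute(Markup markup, String value) {}

    // Hypothetical rule: any attribute named on<something> that carries a value
    // (onclick, onchange, ...) is treated as JavaScript.
    static final Function<Attribute, Boolean> IS_EVENT_HANDLER = attribute -> {
        String name = attribute.markup().name().toLowerCase(Locale.ROOT);
        return name.length() > 2
                && name.startsWith("on")
                && attribute.value() != null;
    };

    // Hypothetical rule: a style attribute with a value is treated as CSS.
    static final Function<Attribute, Boolean> IS_INLINE_STYLE = attribute ->
            "style".equalsIgnoreCase(attribute.markup().name())
                    && attribute.value() != null;

    public static void main(String[] args) {
        Attribute onclick = new Attribute(new Markup(null, "onclick"), "save()");
        Attribute style = new Attribute(new Markup(null, "style"), "color: red");
        Attribute href = new Attribute(new Markup(null, "href"), "/home");

        for (Attribute attribute : new Attribute[] {onclick, style, href}) {
            System.out.println(attribute.markup().name()
                    + " js=" + IS_EVENT_HANDLER.apply(attribute)
                    + " css=" + IS_INLINE_STYLE.apply(attribute));
        }
    }
}
```

Checking value() for null matters because the documented accessor returns null for an attribute without a value, and such an attribute has nothing to hand to the JavaScript or CSS parser.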
Comments & Source Locations
Use parseUnit to get back an AstResult<Html> which bundles the root node together with the full token list. The token list enables accurate source substring extraction for any node in the tree.
AstResult<Html> result = HtmlParser.parseUnit(source, HtmlConfig.defaulConfig());
Html html = result.unit();
// All HTML comments
List<AstComment> htmlComments = html.astComments();
// Comments before a specific node
html.getNodes().forEach(node ->
    html.astCommentsOf(node, AstUnit.Position.before)
        .forEach(c -> System.out.println(((XmlComment) c).getComment())));
// Source substring for any node; here, the first top-level node (assumes a non-empty document)
XmlNode someNode = html.getNodes().get(0);
System.out.println(result.substring(someNode.getSourceLocation()));
Error Recovery — Malformed Pages
The parser is fault-tolerant: it never throws on malformed HTML. Real-world pages are rarely well-formed, so the parser is built to keep going — it always returns a complete, usable tree and marks the broken spots instead of aborting. It recovers from, among others:
- Unclosed tags: an element that is opened but never closed (e.g. <li>First<li>Second, or a bare <br> / <img>).
- Closing-only tags: a closing tag with no matching opening tag (e.g. a stray </div>).
- Broken nesting: overlapping or mis-ordered tags.
With acceptUnclosedTag(true) (the default in defaulConfig())
unclosed tags are accepted silently; any genuinely unrecoverable fragment becomes an
XmlError node carrying its source location and a failure message — never an exception.
// A deliberately malformed page:
// • <li> items are never closed (unclosed tags)
// • a stray </section> has no opening tag (closing-only tag)
String malformed = """
<ul>
<li>First
<li>Second
</ul>
</section>
<p>Trailing text
""";
// defaulConfig() already sets acceptUnclosedTag(true)
AstResult<Html> result = HtmlParser.parseUnit(malformed, HtmlConfig.defaulConfig());
Html html = result.unit();
// The whole document still parsed into a usable tree — no exception thrown
System.out.println(html.getNodes().size() + " top-level nodes");
// Inspect the recovered error markers
html.astStream()
    .filter(n -> n instanceof XmlError)
    .forEach(n -> System.out.println("error at " + n.getSourceLocation()));
To reject malformed input instead, build a config with acceptUnclosedTag(false): unclosed tags then surface as XmlError nodes too, so you can validate strictly while still never having to catch an exception.
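To make the "error marker instead of exception" idea concrete, here is a minimal, self-contained sketch of stack-based tag recovery. It is not the library's implementation: RecoverySketch, ErrorMarker and check are hypothetical names, the regex tokenizer is a deliberate simplification, and the acceptUnclosedTag parameter only imitates the lenient/strict switch described above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecoverySketch {

    // Hypothetical error marker: where the problem starts and what went wrong.
    record ErrorMarker(int offset, String message) {}

    private record OpenTag(String name, int offset) {}

    // Simplified tag matcher: optional '/', a tag name, then anything up to '>'.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9-]*)[^>]*>");

    // Scans the markup and returns error markers in discovery order; never throws on bad nesting.
    static List<ErrorMarker> check(String html, boolean acceptUnclosedTag) {
        List<ErrorMarker> errors = new ArrayList<>();
        Deque<OpenTag> open = new ArrayDeque<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            String name = matcher.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !matcher.group(1).isEmpty();
            if (!closing) {
                // Self-closing tags (<br/>) need no closing tag; everything else is tracked.
                if (!matcher.group().endsWith("/>")) {
                    open.push(new OpenTag(name, matcher.start()));
                }
                continue;
            }
            boolean hasOpener = open.stream().anyMatch(tag -> tag.name().equals(name));
            if (!hasOpener) {
                // Closing-only tag: mark it and keep going.
                errors.add(new ErrorMarker(matcher.start(), "closing-only tag </" + name + ">"));
                continue;
            }
            // Tags opened after the matching opener were never closed.
            while (!open.peek().name().equals(name)) {
                reportUnclosed(open.pop(), acceptUnclosedTag, errors);
            }
            open.pop(); // the matching opener itself
        }
        // Whatever is still open at end of input was never closed either.
        while (!open.isEmpty()) {
            reportUnclosed(open.pop(), acceptUnclosedTag, errors);
        }
        return errors;
    }

    private static void reportUnclosed(OpenTag tag, boolean accept, List<ErrorMarker> errors) {
        if (!accept) {
            errors.add(new ErrorMarker(tag.offset(), "unclosed tag <" + tag.name() + ">"));
        }
    }

    public static void main(String[] args) {
        String malformed = "<ul>\n<li>First\n<li>Second\n</ul>\n</section>\n<p>Trailing text\n";
        System.out.println("lenient: " + check(malformed, true));
        System.out.println("strict:  " + check(malformed, false));
    }
}
```

In this sketch the lenient run reports only the stray </section>, while the strict run also reports each unclosed <li> and the trailing <p>. In both modes the caller inspects a list of markers rather than catching an exception.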
For the full code-level reference, see the README on GitHub and Javadoc under yari-html-parser/src/main/java/.