Yari — Fast, resilient Java parsers: HTML, CSS, JS, XML

Built for AI-driven code understanding

AI coding assistants and static-analysis tools are most useful when they can reason about source code at a structural level — not as raw text, but as a typed, located tree. Yari is designed to be that building block: parse any web source in one line, walk the result with a typed API, query nodes by type, and read exact locations. A clean, typed, fully-located AST gives an LLM a much stronger signal than raw text, improving analyses, refactoring suggestions, and overall code comprehension.

Each AST node is fully serializable to JSON out of the box via Jackson — persist a parsed tree, send it over the wire, or reload it without re-parsing. The unified tree model means one analysis pipeline covers HTML structure, embedded scripts, stylesheets, and standalone JS/CSS/XML files — all in the same shape.

A philosophy of resilience

Real-world web code is messy. HTML pages in the wild have unclosed tags, broken nesting, and mixed casing. JavaScript in the wild has syntax errors and unusual patterns. Yari was built around one constraint: never throw, never stop. When the parser encounters input it cannot recognise, it emits an error node into the AST and resumes parsing, so you still get the maximum number of recognisable nodes from a broken or partially malformed source.

This degraded-parse behaviour is not an afterthought bolted onto each parser. It is a first-class primitive of yari-parsec, the combinator engine underneath, which makes it composable across every language in the framework. That makes Yari well suited to linting, static analysis, or any pipeline that must keep going on imperfect inputs.
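To make the "error node instead of exception" idea concrete, here is a minimal, self-contained sketch. It does not use Yari's API: the grammar (a comma-separated list of integers) and every name in it (`ToyNode`, `NumNode`, `ErrorNode`, `RecoveringListParser`) are hypothetical and chosen only for illustration. The comma acts as the synchronisation point where parsing resumes after a bad item.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node types for illustration only; these are not Yari's classes.
interface ToyNode { int start(); int end(); }
record NumNode(int value, int start, int end) implements ToyNode {}
record ErrorNode(String skipped, int start, int end) implements ToyNode {}

final class RecoveringListParser {
    // Parses a comma-separated list of integers and never throws on bad input.
    static List<ToyNode> parse(String src) {
        List<ToyNode> out = new ArrayList<>();
        int pos = 0;
        while (pos < src.length()) {
            int start = pos;
            // Each item runs up to the next comma, the synchronisation point
            // of this toy grammar, or to the end of the input.
            int end = src.indexOf(',', pos);
            if (end < 0) end = src.length();
            String item = src.substring(start, end);
            String trimmed = item.trim();
            boolean isNumber = !trimmed.isEmpty()
                    && trimmed.length() <= 9 // keeps Integer.parseInt from overflowing
                    && trimmed.chars().allMatch(Character::isDigit);
            if (isNumber) {
                out.add(new NumNode(Integer.parseInt(trimmed), start, end));
            } else {
                // Record what could not be recognised, then keep going.
                out.add(new ErrorNode(item, start, end));
            }
            pos = end + 1; // resume right after the comma
        }
        return out;
    }
}
```

With this sketch, `RecoveringListParser.parse("1,x?,3")` yields a number node, an error node covering `x?`, and a second number node, so the valid items on both sides of the bad one are still recovered.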

Build your own parser

Yari is more than four language parsers. At its core is yari-parsec, a general parser-combinator engine: compose small, well-typed Parser<T> values to tokenize, build expression grammars with operator precedence, track source locations, and recover from errors. It is the exact machinery the JavaScript, HTML, CSS and XML parsers are themselves built on.
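The general shape of a parser-combinator engine can be sketched in a few lines of plain Java. This is not yari-parsec: `MiniParser`, `Parsed`, `takeWhile1`, `map`, `or`, and `many` are hypothetical names invented for this example, and the real `Parser<T>` may look quite different. The point is only to show how small typed parsers compose into a tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.IntPredicate;

// A successful parse: the value plus the offset where parsing stopped.
record Parsed<T>(T value, int next) {}

// Hypothetical combinator interface: returns null when the parser does not match.
@FunctionalInterface
interface MiniParser<T> {
    Parsed<T> parse(String in, int pos);

    // Transforms the parsed value without changing what is consumed.
    default <R> MiniParser<R> map(Function<? super T, ? extends R> f) {
        return (in, pos) -> {
            Parsed<T> r = parse(in, pos);
            return r == null ? null : new Parsed<>(f.apply(r.value()), r.next());
        };
    }

    // Tries this parser first, then the alternative at the same offset.
    default MiniParser<T> or(MiniParser<T> other) {
        return (in, pos) -> {
            Parsed<T> r = parse(in, pos);
            return r != null ? r : other.parse(in, pos);
        };
    }

    // Applies this parser zero or more times, stopping at the first failure
    // or at the first match that consumes no input.
    default MiniParser<List<T>> many() {
        return (in, pos) -> {
            List<T> items = new ArrayList<>();
            int p = pos;
            for (Parsed<T> r = parse(in, p); r != null && r.next() > p; r = parse(in, p)) {
                items.add(r.value());
                p = r.next();
            }
            return new Parsed<>(items, p);
        };
    }

    // Consumes one or more characters that satisfy the predicate.
    static MiniParser<String> takeWhile1(IntPredicate p) {
        return (in, pos) -> {
            int end = pos;
            while (end < in.length() && p.test(in.charAt(end))) end++;
            return end == pos ? null : new Parsed<>(in.substring(pos, end), end);
        };
    }
}

final class MiniParserDemo {
    // token = number | identifier | whitespace, each rendered as a tagged string.
    static List<String> tokens(String src) {
        MiniParser<String> number = MiniParser.takeWhile1(Character::isDigit).map(s -> "NUM(" + s + ")");
        MiniParser<String> ident = MiniParser.takeWhile1(Character::isLetter).map(s -> "ID(" + s + ")");
        MiniParser<String> space = MiniParser.takeWhile1(Character::isWhitespace).map(s -> "WS");
        return number.or(ident).or(space).many().parse(src, 0).value();
    }
}
```

In this sketch a failed match is a plain `null` rather than an error node; a recovery-aware engine would instead record the failure in the tree and continue.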

Use it to implement your own fast, resilient parser for any language or DSL — one linear pass, error nodes instead of exceptions, and an OperatorTable for precedence and associativity. The Quick start walks through a complete mini-language built entirely on the engine.
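For readers unfamiliar with table-driven precedence, the following sketch shows the classic precedence-climbing technique in plain Java. It is an assumption-laden illustration, not Yari's `OperatorTable`: the `Op` record, the `TABLE` map, and the `PrecedenceDemo` class are hypothetical. The parser returns a fully parenthesised string so that precedence and associativity are visible, and it substitutes an `<error>` placeholder for a missing operand instead of throwing.

```java
import java.util.Map;

final class PrecedenceDemo {
    // Hypothetical operator entry: binding power and associativity.
    record Op(int precedence, boolean rightAssoc) {}

    static final Map<Character, Op> TABLE = Map.of(
            '+', new Op(1, false),
            '-', new Op(1, false),
            '*', new Op(2, false),
            '^', new Op(3, true));

    private final String src;
    private int pos = 0;

    private PrecedenceDemo(String src) { this.src = src.replace(" ", ""); }

    // Returns the expression with explicit parentheses around every binary operation.
    static String parse(String src) { return new PrecedenceDemo(src).expr(0); }

    private String expr(int minPrec) {
        String left = atom();
        while (pos < src.length()) {
            Op op = TABLE.get(src.charAt(pos));
            // Stop on a non-operator or on an operator that binds too loosely.
            if (op == null || op.precedence() < minPrec) break;
            char symbol = src.charAt(pos++);
            // Left-associative operators require a strictly tighter right-hand side.
            int nextMin = op.rightAssoc() ? op.precedence() : op.precedence() + 1;
            String right = expr(nextMin);
            left = "(" + left + " " + symbol + " " + right + ")";
        }
        return left;
    }

    // An operand is a run of letters or digits; a missing one becomes a placeholder.
    private String atom() {
        int start = pos;
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
        return start == pos ? "<error>" : src.substring(start, pos);
    }
}
```

Here `PrecedenceDemo.parse("1 + 2 * 3")` returns `(1 + (2 * 3))`, `PrecedenceDemo.parse("8 - 3 - 2")` returns `((8 - 3) - 2)`, and `PrecedenceDemo.parse("2 ^ 3 ^ 2")` returns `(2 ^ (3 ^ 2))`.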

Why Yari

A small set of focused, fast and resilient parsers that compose into a unified AST, with Jackson and SLF4J as the only external dependencies.

One AST, all sub-languages

The HTML parser also parses the CSS and JavaScript embedded inside the HTML (<script>, <style>, onclick, style attributes…) and exposes everything as a single unified AST. Stream over the nodes of a full web page — HTML, CSS and JS — in one pass.
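As a rough picture of what "one pass over a unified tree" can mean, the sketch below models a page as a single node shape and streams over it. The `UniNode` record, its `language` and `kind` fields, and the sample tree are all assumptions made for this example; Yari's actual node model is not shown here.

```java
import java.util.List;
import java.util.stream.Stream;

// Hypothetical unified node: one shape for every language. Not Yari's model.
record UniNode(String language, String kind, List<UniNode> children) {
    // Depth-first, pre-order stream over this node and all of its descendants.
    Stream<UniNode> stream() {
        return Stream.concat(Stream.of(this), children.stream().flatMap(UniNode::stream));
    }
}

final class UnifiedTreeDemo {
    // Hand-built stand-in for a parsed page with an embedded script and stylesheet.
    static UniNode samplePage() {
        UniNode call = new UniNode("js", "CallExpression", List.of());
        UniNode script = new UniNode("html", "ScriptElement", List.of(call));
        UniNode rule = new UniNode("css", "Rule", List.of());
        UniNode style = new UniNode("html", "StyleElement", List.of(rule));
        return new UniNode("html", "Document", List.of(script, style));
    }

    // Collects the kinds of all nodes that belong to the given language.
    static List<String> kindsIn(UniNode root, String language) {
        return root.stream()
                .filter(n -> n.language().equals(language))
                .map(UniNode::kind)
                .toList();
    }
}
```

Because every node has the same shape, a single `stream()` call visits the HTML, CSS and JS nodes together, and filtering by language or kind is an ordinary stream operation.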

Degraded parsing — never throw

When the parser encounters input it cannot recognise, it emits an error node and resumes, so you always get the maximum number of recognisable nodes. Ideal for linting, static analysis, or any pipeline that must handle imperfect inputs.

Exact source locations

Every AST node carries its precise location in the source: offsets, lines, and columns. Critical for diagnostics, refactoring tools, code-intel features, and any downstream consumer that needs to point back at the original text.
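Offsets, lines and columns are related by a simple index of line starts. The sketch below shows one common way to derive a line and column from an offset; it is not taken from Yari. `LineIndex` and `Position` are hypothetical names, and the 1-based line and column numbering is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineIndex {
    // Hypothetical position type; lines and columns are 1-based in this sketch.
    record Position(int offset, int line, int column) {}

    // Offsets at which each line begins, in ascending order.
    private final int[] lineStarts;

    LineIndex(String src) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < src.length(); i++) {
            if (src.charAt(i) == '\n') starts.add(i + 1);
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    // Maps a non-negative offset to its line and column with a binary search.
    Position at(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        // When the offset is not itself a line start, binarySearch returns
        // -(insertionPoint) - 1, and the containing line is insertionPoint - 1.
        int line = idx >= 0 ? idx : -idx - 2;
        return new Position(offset, line + 1, offset - lineStarts[line] + 1);
    }
}
```

For the source `"ab\ncd"`, `new LineIndex(src).at(4)` gives offset 4, line 2, column 2, which is the position of `d`.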

Comments are first-class

Comments are parsed in every supported language and kept with their own source locations. A dedicated API lets you look them up relative to any AST node (leading, trailing, inside…), so you never lose the connection between code and its documentation.
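Once comments and nodes both carry locations, a "leading comments" lookup reduces to comparing spans. The sketch below is one simple way to do that and is not Yari's comment API: `SrcSpan`, `SrcComment` and `CommentLookup.leading` are hypothetical, and the rule used (a comment is leading if only whitespace or other leading comments separate it from the node) is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical location and comment types; end offsets are exclusive.
record SrcSpan(int start, int end) {}
record SrcComment(String text, SrcSpan span) {}

final class CommentLookup {
    // Returns the comments that sit directly before the node, in source order.
    // Expects `comments` to be sorted by start offset.
    static List<SrcComment> leading(String src, List<SrcComment> comments, SrcSpan node) {
        List<SrcComment> result = new ArrayList<>();
        int cursor = node.start();
        // Walk backwards from the node, accepting comments as long as nothing
        // but whitespace lies between each comment and what follows it.
        for (int i = comments.size() - 1; i >= 0; i--) {
            SrcComment c = comments.get(i);
            if (c.span().end() > cursor) continue; // comment is not before the cursor
            if (!src.substring(c.span().end(), cursor).isBlank()) break; // code in between
            result.add(0, c);
            cursor = c.span().start();
        }
        return result;
    }
}
```

A fuller implementation would also classify trailing and inner comments; under this simplified rule, an end-of-line comment is reported as leading for whatever node follows it.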

Simple one-liner API

Parsing a source is a single method call. The resulting AST is a plain typed tree you can walk, query, or stream over. Easy to plug into existing JVM tooling — no configuration, no setup.

JSON serialization

Every AST node is fully serializable and deserializable to JSON out of the box (via Jackson). Persist a parsed tree, send it over the wire, or reload it later without re-parsing the original source.

AI-friendly

A clean, typed, located AST gives an LLM a much stronger signal than raw source text. Yari is designed to be a building block for AI-driven code understanding — improving analyses, refactoring suggestions, and code comprehension.

Parser-combinator engine

The framework ships its own parser-combinator library (yari-parsec), which makes degraded-parse behaviour composable across every language. Error recovery is a first-class primitive of the combinators, not an afterthought.

Heavily tested

Each module ships with an extensive test suite covering as many edge cases as possible — and it keeps growing as new patterns surface. Contributions of untested real-world inputs are welcome.

The modules

Pick the modules you need. Each one is documented page-by-page with its full API.

yari-parsec

Why choose Yari

What the framework brings, compared to conventional approaches.

Aspect	Conventional approach	With Yari
Broken input	Exception thrown, parsing aborts	Error node inserted, parsing continues — maximum AST recovered
Multi-language page	Three separate parsers, three separate trees to juggle	One unified AST — HTML, embedded CSS and JS in one pass
Source locations	Often absent or limited to line numbers	Every node carries exact start/end offsets, lines and columns
Comments	Discarded during lexing	Preserved with location, queryable relative to any AST node
Serialization	Manual mapping or custom serializer	Full Jackson JSON serialization/deserialization out of the box
Error recovery	Ad-hoc per parser, hard to extend	First-class combinator primitive — composable across all parsers
AI compatibility	Raw text or opaque tokens	Typed, located AST — strong structural signal for LLMs and agents
Dependencies	Multiple external parser libraries, version conflicts	One framework, one shared engine — only Jackson and SLF4J
