diff --git a/packages/html-tools/README.md b/packages/html-tools/README.md
new file mode 100644
index 0000000000..2b663b9b8a
--- /dev/null
+++ b/packages/html-tools/README.md
@@ -0,0 +1,137 @@
+# html-tools
+
+A lightweight standards-based HTML tokenizer and parser which outputs
+to HTMLjs. Special hooks allow the syntax to be extended to parse an
+HTML-like template language like Spacebars. Used by the Spacebars
+compiler, which normally only runs at bundle time but can also be used
+at runtime on the client or server.
+
+## HTML Dialect
+
+HTML has many dialects and potential degrees of permissiveness. We
+use the WHATWG syntax spec and are pretty strict, failing on any
+"parse error" cases, which basically means the input has to be
+valid "HTML5" (except for the template tags).
+
+HTML syntax references:
+
+* [Human-readable syntax guide](http://developers.whatwg.org/syntax.html#syntax)
+* [Tokenization state machine](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html)
+
+The WHATWG parser without error recovery is strict compared to
+browsers (which will recover from almost anything), but lenient
+compared to the now-defunct XHTML spec (which required lowercase tag
+names and lots more escaping of special characters).
+
+The following are examples of **errors**:
+
+* A stray or unclosed `<` character
+* An unknown character reference like `&asdf;`
+* Self-closing tags like `` (except for BR, HR, INPUT, and other "void" elements)
+* End tags for void elements (BR, HR, INPUT, etc.)
+* Missing end tags, in most cases (e.g. missing ``)
+
+The following are **permitted**:
+
+* Bare `>` characters
+* Bare `&` that can't be confused with a character reference
+* Uppercase or lowercase tag and attribute names (case insensitive)
+* Unquoted and valueless attributes - ``
+* Most characters in attribute values - `