diff --git a/packages/html-tools/README.md b/packages/html-tools/README.md new file mode 100644 index 0000000000..2b663b9b8a --- /dev/null +++ b/packages/html-tools/README.md @@ -0,0 +1,137 @@ +# html-tools + +A lightweight standards-based HTML tokenizer and parser which outputs +to HTMLjs. Special hooks allow the syntax to be extended to parse an +HTML-like template language like Spacebars. Used by the Spacebars +compiler, which normally only runs at bundle time but can also be used +at runtime on the client or server. + +## HTML Dialect + +HTML has many dialects and potential degrees of permissiveness. We +use the WHATWG syntax spec and are pretty strict, failing on any +"parse error" cases, which basically means the input has to be +valid "HTML5" (except for the template tags). + +HTML syntax references: + +* [Human-readable syntax guide](http://developers.whatwg.org/syntax.html#syntax) +* [Tokenization state machine](http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html) + +The WHATWG parser without error recovery is strict compared to +browsers (which will recover from almost anything), but lenient +compared to the now-defunct XHTML spec (which required lowercase tag +names and lots more escaping of special characters). + +The following are examples of **errors**: + +* A stray or unclosed `<` character +* An unknown character reference like `&asdf;` +* Self-closing tags like `
` (except for BR, HR, INPUT, and other "void" elements) +* End tags for void elements (BR, HR, INPUT, etc.) +* Missing end tags, in most cases (e.g. missing `
`) + +The following are **permitted**: + +* Bare `>` characters +* Bare `&` that can't be confused with a character reference +* Uppercase or lowercase tag and attribute names (case insensitive) +* Unquoted and valueless attributes - `` +* Most characters in attribute values - `x,y` +* Embedded SVG elements + +**XXX Currently you have to close your Ps, LIs, and other tags for which the spec allows the end tag to be omitted in many cases** + +## Invoking the Parser + +`HTML.parseFragment(input, options)` - Takes an input string or Scanner object and returns HTMLjs. + +In the basic case, where no options are passed, `parseFragment` will consume the entire input (the full string or the rest of the Scanner). + +The options are as follows: + +### getSpecialTag + +`getSpecialTag: function (scanner, templateTagPosition) { ... }` - A function for the parser to invoke to possibly parse a template tag, like `{{foo}}`, say. If the function returns a non-null value, that value is wrapped in an `HTML.Special` node which is inserted into the HTMLjs tree at the appropriate location. The function is expected to advance the scanner if it succeeds at parsing a template tag. + +The `getSpecialTag` function may invoke a nested `HTML.parseFragment`. In this case, the same `getSpecialTag` function must be passed to the nested invocation of `parseFragment`. + +At the moment, template tags must begin with `{`. The parser does not try calling `getSpecialTag` for every character of an HTML document, only at token boundaries, and it knows to always end a token at `{`. + +It's expected that there are four possible outcomes when `getSpecialTag` is called: + +* Not a template tag - Leave the scanner as is, and return `null`. A quick peek at the next character should bail to this case if the start of a template tag is not seen. +* Bad template tag - Call `scanner.fatal`, which aborts parsing completely. 
Once the beginning of a template tag is seen, `getSpecialTag` will generally want to commit, and either succeed or fail trying. +* Good template tag - Advance the scanner to the end of the template tag and return an object. +* Comment tag - Advance the scanner and return `null`. For example, a Spacebars comment is `{{! foo}}`. + +The `templateTagPosition` argument to `getSpecialTag` is one of: + +* `HTML.TEMPLATE_TAG_POSITION.ELEMENT` - At "element level," meaning somewhere an HTML tag could be. +* `HTML.TEMPLATE_TAG_POSITION.IN_START_TAG` - Inside a start tag, as in `
`, where you might otherwise find `name=value`. +* `HTML.TEMPLATE_TAG_POSITION.IN_ATTRIBUTE` - Inside the value of an HTML attribute, as in `
`. +* `HTML.TEMPLATE_TAG_POSITION.IN_RCDATA` - Inside a TEXTAREA or a block helper inside an attribute, where character references are allowed ("replaced character data") but not tags. +* `HTML.TEMPLATE_TAG_POSITION.IN_RAWTEXT` - In a context where character references are not parsed, such as a script tag, style tag, or markdown helper. + +**XXX Better error message for `
`.** + +**XXX Do something with ``** + +**XXX Why both IN_ATTRIBUTE and IN_RCDATA?** + +**XXX Fix Markdown** + +### textMode + +The `textMode` option, if present, causes the parser to parse text (such as the contents of a `