# The Parser Model

The parser model here follows the model in section
[8.2.1](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#parsing)
of the HTML5 specification, though we do not assume a networking layer.

     [ InputStream ]    // Generic support for reading input.
           ||
      [ Scanner ]       // Breaks down the stream into characters.
           ||
     [ Tokenizer ]      // Groups characters into syntactic units.
           ||
    [ Tree Builder ]    // Organizes units into a tree of objects.
           ||
     [ DOM Document ]   // The final state of the parsed document.

## InputStream

This is an interface with at least two concrete implementations:

- StringInputStream: Reads an HTML5 string.
- FileInputStream: Reads an HTML5 file.

## Scanner

This is a mechanical piece of the parser: it breaks the input stream
down into characters for the tokenizer.

## Tokenizer

This follows section 8.2.4 of the HTML5 spec. It is (roughly) a recursive
descent parser, though there are plenty of optimizations that are less
than purely functional.

## EventHandler and DOMTree

EventHandler is the interface for tree builders. Since not all
implementations will necessarily build trees, we've chosen a more
generic name.

The tokenizer passes tokens to the event handler as it tokenizes the
input.

The DOMTree is an event handler that builds a DOM tree. The output of
the DOMTree builder is a DOMDocument.

## DOMDocument

PHP has a built-in DOMDocument class (technically, it is part of the DOM
extension, which is built on libxml). We use that, which makes the
output of this process compatible with SimpleXML, QueryPath, and many
other XML/HTML processing tools.
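
As a rough illustration of that compatibility, the sketch below hands a
DOMDocument to SimpleXML. The document here is built by hand as a
stand-in for parser output (an assumption made so the example runs on
its own); only PHP's built-in DOM and SimpleXML functions are used.

```php
<?php
// Stand-in for the parser's output: a small DOMDocument built by hand.
$dom = new DOMDocument('1.0', 'UTF-8');
$html = $dom->createElement('html');
$dom->appendChild($html);
$body = $dom->createElement('body');
$html->appendChild($body);
$body->appendChild($dom->createElement('p', 'Hello'));

// SimpleXML can wrap an existing DOM tree without reparsing it.
// The returned element corresponds to the document's root (<html>).
$sx = simplexml_import_dom($dom);
$text = (string) $sx->body->p;

// The same tree can also be queried with DOMXPath.
$xpath = new DOMXPath($dom);
$paragraphs = $xpath->query('//p')->length;
```

Because both views share one underlying tree, no serialization step is
needed between the parser and downstream tools.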

For cases where the HTML5 is a fragment of an HTML5 document, a
DOMDocumentFragment is returned instead. This is another built-in class.
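
A minimal sketch of how a DOMDocumentFragment behaves once you have one.
The fragment is again built by hand rather than produced by the parser
(an assumption for the sake of a self-contained example), and the
variable names are illustrative only.

```php
<?php
$dom = new DOMDocument('1.0', 'UTF-8');

// A fragment is a lightweight container for a list of sibling nodes.
$fragment = $dom->createDocumentFragment();
$fragment->appendChild($dom->createElement('b', 'bold'));
$fragment->appendChild($dom->createTextNode(' text'));

// Appending a fragment inserts its children, not the fragment itself.
$container = $dom->createElement('div');
$dom->appendChild($container);
$container->appendChild($fragment);

$childCount = $container->childNodes->length; // the <b> element and the text node
$combinedText = $container->textContent;
```

This is why a fragment is a convenient return type for partial markup:
it can be dropped into any position of an existing document in a single
call.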