comparison vendor/masterminds/html5/src/HTML5/Parser/README.md @ 0:4c8ae668cc8c

Initial import (non-working)
author Chris Cannam
date Wed, 29 Nov 2017 16:09:58 +0000
-1:000000000000 0:4c8ae668cc8c
1 # The Parser Model
2
3 The parser model here follows the model in section
4 [8.2.1](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#parsing)
5 of the HTML5 specification, though we do not assume a networking layer.
6
7 [ InputStream ] // Generic support for reading input.
8 ||
9 [ Scanner ] // Breaks down the stream into characters.
10 ||
11 [ Tokenizer ] // Groups characters into syntactic units.
12 ||
13 [ Tree Builder ] // Organizes units into a tree of objects
14 ||
15 [ DOM Document ] // The final state of the parsed document.
16
17
18 ## InputStream
19
20 This is an interface with at least two concrete implementations:
21
22 - StringInputStream: Reads an HTML5 string.
23 - FileInputStream: Reads an HTML5 file.
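
As a rough illustration of this split, the sketch below defines a much-simplified stream interface with one string-backed and one file-backed implementation. The names `SketchInputStream`, `SketchStringInputStream`, `SketchFileInputStream`, and the methods `next()` and `peek()` are hypothetical and are not the library's actual API; only the "one interface, two implementations" shape comes from the text above. The sketch is byte-based and does not handle multi-byte encodings.

```php
<?php
// Hypothetical, simplified interface; not the library's actual API.
interface SketchInputStream
{
    // Returns the next character, or false when the input is exhausted.
    public function next();

    // Returns the upcoming character without consuming it, or false at the end.
    public function peek();
}

// Reads from a string held in memory.
class SketchStringInputStream implements SketchInputStream
{
    private $data;
    private $pos = 0;

    public function __construct($data)
    {
        $this->data = (string) $data;
    }

    public function next()
    {
        $char = $this->peek();
        if ($char !== false) {
            $this->pos++;
        }
        return $char;
    }

    public function peek()
    {
        return $this->pos < strlen($this->data) ? $this->data[$this->pos] : false;
    }
}

// Reads a whole file up front, then behaves like the string stream.
class SketchFileInputStream extends SketchStringInputStream
{
    public function __construct($path)
    {
        $contents = file_get_contents($path);
        if ($contents === false) {
            throw new RuntimeException("Cannot read $path");
        }
        parent::__construct($contents);
    }
}

// Usage: drain a string stream one character at a time.
$stream = new SketchStringInputStream('<p>');
$chars = array();
while (($c = $stream->next()) !== false) {
    $chars[] = $c;
}
```

Because both classes satisfy the same interface, later stages would not need to know whether the markup came from a string or from disk.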
24
25 ## Scanner
26
27 This is a mechanical piece of the parser: it reads the input stream one character at a time on behalf of the tokenizer.
28
29 ## Tokenizer
30
31 This follows section 8.2.4 (Tokenization) of the HTML5 spec. It is (roughly) a recursive
32 descent parser, though there are plenty of optimizations that are less
33 than purely functional.
34
35 ## EventHandler and DOMTree
36
37 EventHandler is the interface for tree builders. Since not all
38 implementations will necessarily build trees, we've chosen a more
39 generic name.
40
41 The tokenizer emits tokens to the event handler during tokenization.
42
43 The DOMTree is an event handler that builds a DOM tree. The output of
44 the DOMTree builder is a DOMDocument.
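
To make the distinction between a generic event handler and a tree builder concrete, here is a minimal sketch with two handlers behind one interface: one that only counts tags and one that builds a DOM. The interface `SketchEventHandler`, its three methods, the classes `TagCounter` and `SketchDomBuilder`, and the `replayEvents()` stand-in for a tokenizer are all assumptions made for illustration; the library's real interface is not shown here. Only `DOMDocument` and its methods are PHP built-ins.

```php
<?php
// Hypothetical, simplified handler interface; method names are assumptions.
interface SketchEventHandler
{
    public function startTag($name, array $attributes = array());
    public function endTag($name);
    public function text($data);
}

// A handler that builds no tree: it only counts how often each tag opens.
class TagCounter implements SketchEventHandler
{
    public $counts = array();

    public function startTag($name, array $attributes = array())
    {
        $this->counts[$name] = isset($this->counts[$name]) ? $this->counts[$name] + 1 : 1;
    }

    public function endTag($name) {}
    public function text($data) {}
}

// A handler that does build a tree, using PHP's built-in DOM classes.
class SketchDomBuilder implements SketchEventHandler
{
    private $doc;
    private $current;

    public function __construct()
    {
        $this->doc = new DOMDocument();
        $this->current = $this->doc;
    }

    public function startTag($name, array $attributes = array())
    {
        $el = $this->doc->createElement($name);
        foreach ($attributes as $key => $value) {
            $el->setAttribute($key, $value);
        }
        $this->current->appendChild($el);
        $this->current = $el;
    }

    public function endTag($name)
    {
        // Step back up to the parent; a real tree builder must also cope with misnested tags.
        if ($this->current->parentNode !== null) {
            $this->current = $this->current->parentNode;
        }
    }

    public function text($data)
    {
        $this->current->appendChild($this->doc->createTextNode($data));
    }

    public function document()
    {
        return $this->doc;
    }
}

// Stand-in for a tokenizer: replays a fixed event sequence to any handler.
function replayEvents(SketchEventHandler $handler)
{
    $handler->startTag('p', array('class' => 'intro'));
    $handler->text('Hello');
    $handler->endTag('p');
}

$counter = new TagCounter();
replayEvents($counter);

$builder = new SketchDomBuilder();
replayEvents($builder);
$root = $builder->document()->documentElement;
```

The same event sequence drives both handlers, which is the reason for preferring the more generic name over "tree builder".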
45
46 ## DOMDocument
47
48 PHP has a built-in DOMDocument class (technically, it is provided by the DOM extension, which is built on libxml2).
49 We use that, which makes the output of this process compatible with
50 SimpleXML, QueryPath, and many other XML/HTML processing tools.
51
52 For cases where the HTML5 is a fragment of an HTML5 document, a
53 DOMDocumentFragment is returned instead. This is another built-in class.
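
The following sketch shows how the two built-in classes relate, using only PHP's DOM and SimpleXML extensions. The document is assembled by hand as a stand-in for parser output, so nothing here reflects how the parser itself constructs nodes.

```php
<?php
// Build a small document by hand to stand in for parser output.
$doc = new DOMDocument();
$root = $doc->createElement('html');
$doc->appendChild($root);
$body = $doc->createElement('body');
$root->appendChild($body);

// A fragment is a lightweight container for nodes that lack a single root.
$fragment = $doc->createDocumentFragment();
$fragment->appendChild($doc->createElement('p', 'first'));
$fragment->appendChild($doc->createElement('p', 'second'));

// Appending a fragment moves its children into the target node.
$body->appendChild($fragment);

// A DOMDocument can be handed to SimpleXML without re-parsing.
$xml = simplexml_import_dom($doc);
$paragraphs = $xml->body->p;
```

After the append, the fragment itself is empty and both paragraphs live under `body`, where SimpleXML can reach them.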