# The Parser Model

The parser model here follows the model in section
[8.2.1](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#parsing)
of the HTML5 specification, though we do not assume a networking layer.

    [ InputStream ]     // Generic support for reading input.
           ||
      [ Scanner ]       // Breaks down the stream into characters.
           ||
     [ Tokenizer ]      // Groups characters into syntactic units.
           ||
    [ Tree Builder ]    // Organizes units into a tree of objects.
           ||
    [ DOM Document ]    // The final state of the parsed document.


## InputStream

This is an interface with at least two concrete implementations:

- StringInputStream: Reads an HTML5 string.
- FileInputStream: Reads an HTML5 file.

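The split between one interface and two implementations could look roughly like the sketch below. The interface name, the class names with the `Sketch` suffix, and the `next()` method are all hypothetical and chosen for illustration; they are not the library's actual API. The sketch also reads byte by byte and ignores character encoding.

```php
<?php
// Hypothetical sketch: names and methods are illustrative assumptions,
// not the library's actual InputStream API.
interface InputStreamSketch
{
    /** Returns the next byte as a string, or false when input is exhausted. */
    public function next();
}

// Reads from a string held in memory.
class StringInputStreamSketch implements InputStreamSketch
{
    private $data;
    private $pos = 0;

    public function __construct($data)
    {
        $this->data = $data;
    }

    public function next()
    {
        if ($this->pos >= strlen($this->data)) {
            return false;
        }
        return $this->data[$this->pos++];
    }
}

// Reads a whole file up front, then behaves like the string stream.
class FileInputStreamSketch extends StringInputStreamSketch
{
    public function __construct($path)
    {
        // file_get_contents() returns false on failure; this sketch
        // simply treats that case as empty input.
        $contents = file_get_contents($path);
        parent::__construct($contents === false ? '' : $contents);
    }
}

// Consumers depend only on the interface, not on where the input came from.
$stream = new StringInputStreamSketch('<p>');
$chars = array();
while (($c = $stream->next()) !== false) {
    $chars[] = $c;
}
```

Because the rest of the pipeline would see only the interface, adding another source of input would not require changes further down.
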
## Scanner

This is a mechanical piece of the parser: it breaks the input stream
down into characters for the tokenizer.

## Tokenizer

This follows section 8.2.4 of the HTML5 spec. It is (roughly) a recursive
descent parser, though there are plenty of optimizations that are less
than purely functional.

## EventHandler and DOMTree

EventHandler is the interface for tree builders. Since not all
implementations will necessarily build trees, we've chosen a more
generic name.

The tokenizer emits tokens to the event handler during tokenization.

The DOMTree is an event handler that builds a DOM tree. The output of
the DOMTree builder is a DOMDocument.

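To illustrate why the interface is named for events rather than trees, here is a minimal sketch with two handlers: one builds a DOM tree, the other builds nothing and only counts tags. The interface, its three methods, and the `emitSample()` stand-in for the tokenizer are assumptions made for this example, not the library's actual API.

```php
<?php
// Hypothetical sketch: the interface and its method names are assumptions
// for illustration, not the library's actual EventHandler API.
interface EventHandlerSketch
{
    public function startTag($name);
    public function text($data);
    public function endTag($name);
}

// A handler that builds a tree using PHP's built-in DOM classes.
class DomTreeSketch implements EventHandlerSketch
{
    private $doc;
    private $current;

    public function __construct()
    {
        $this->doc = new DOMDocument();
        $this->current = $this->doc;
    }

    public function startTag($name)
    {
        $el = $this->doc->createElement($name);
        $this->current->appendChild($el);
        $this->current = $el;
    }

    public function text($data)
    {
        $this->current->appendChild($this->doc->createTextNode($data));
    }

    public function endTag($name)
    {
        // Step back up to the parent; no error recovery in this sketch.
        if ($this->current->parentNode !== null) {
            $this->current = $this->current->parentNode;
        }
    }

    public function document()
    {
        return $this->doc;
    }
}

// A handler that builds no tree at all: it only counts start tags.
class TagCounterSketch implements EventHandlerSketch
{
    public $count = 0;

    public function startTag($name)
    {
        $this->count++;
    }

    public function text($data)
    {
    }

    public function endTag($name)
    {
    }
}

// Stand-in for the tokenizer: replays the events for "<p>Hi</p>".
function emitSample(EventHandlerSketch $handler)
{
    $handler->startTag('p');
    $handler->text('Hi');
    $handler->endTag('p');
}

$tree = new DomTreeSketch();
emitSample($tree);

$counter = new TagCounterSketch();
emitSample($counter);
```

Both handlers receive the same event sequence; only the first one ends up holding a DOMDocument.
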
## DOMDocument

PHP has a built-in DOMDocument class (technically, it is provided by the
DOM extension, which is built on libxml2). We use that, which makes the
output of this process compatible with SimpleXML, QueryPath, and many
other XML/HTML processing tools.

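As a small illustration of that compatibility, the sketch below hands a DOMDocument to SimpleXML and to DOMXPath. The document is built by hand as a stand-in for parser output, since this section does not show how the parser is invoked; the functions and classes used are all part of PHP's standard DOM and SimpleXML extensions.

```php
<?php
// Stand-in for the parser's output: a DOMDocument built by hand.
$doc = new DOMDocument();
$body = $doc->appendChild($doc->createElement('body'));
$body->appendChild($doc->createElement('h1', 'Hello'));

// SimpleXML can wrap the same tree without re-parsing any markup.
$xml = simplexml_import_dom($doc);
$title = (string) $xml->h1;

// XPath queries run against the same document as well.
$xpath = new DOMXPath($doc);
$count = $xpath->query('//h1')->length;
```
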
For cases where the HTML5 is a fragment of an HTML5 document, a
DOMDocumentFragment is returned instead. This is another built-in class.

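A fragment differs from a document in that it has no single root element. The sketch below shows this using only PHP's built-in DOM classes; the fragment is again built by hand as a stand-in for parser output.

```php
<?php
$doc = new DOMDocument();

// Stand-in for a parsed fragment: two sibling elements with no common root.
$fragment = $doc->createDocumentFragment();
$fragment->appendChild($doc->createElement('li', 'one'));
$fragment->appendChild($doc->createElement('li', 'two'));
$siblings = $fragment->childNodes->length;

// Appending a fragment moves its children into the target node,
// leaving the fragment itself empty.
$list = $doc->appendChild($doc->createElement('ul'));
$list->appendChild($fragment);
$moved = $list->childNodes->length;
$left = $fragment->childNodes->length;
```

This makes a fragment convenient for splicing parsed markup into an existing document.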