# HTML5-PHP

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP.
It is stable and used in many production websites, and has
well over [one million downloads](https://packagist.org/packages/masterminds/html5).

HTML5 provides the following features.

- An HTML5 serializer
- Support for PHP namespaces
- Composer support
- Event-based (SAX-like) parser
- A DOM tree builder
- Interoperability with [QueryPath](https://github.com/technosophos/querypath)
- Runs on **PHP** 5.3.0 or newer and **HHVM** 3.2 or newer

## Installation

Install HTML5-PHP using [composer](http://getcomposer.org/).

To install, add `masterminds/html5` to your `composer.json` file:

```json
{
  "require" : {
    "masterminds/html5": "2.*"
  }
}
```

(You may replace `2.*` with a more specific release tag, of
course.)

From there, use the `composer install` or `composer update` commands to
install.

## Basic Usage

HTML5-PHP has a high-level API and a low-level API.

Here is how you use the high-level `HTML5` library API:

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";
use Masterminds\HTML5;

// An example HTML document:
$html = <<< 'HERE'
<html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
</html>
HERE;

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

// Or save it to a file:
$html5->save($dom, 'out.html');

?>
```

The `$dom` created by the parser is a full `DOMDocument` object, and the
`save()` and `saveHTML()` methods accept any `DOMDocument`.

### Options

It is possible to pass in an array of configuration options when loading
an HTML5 document.

```php
// An associative array of options
$options = array(
  'option_name' => 'option_value',
);

// Provide the options to the constructor
$html5 = new HTML5($options);

$dom = $html5->loadHTML($html);
```

The following options are supported:

* `encode_entities` (boolean): Indicates that the serializer should aggressively
  encode characters as entities. Without this, it only encodes the bare
  minimum.
* `disable_html_ns` (boolean): Prevents the parser from automatically
  assigning the HTML5 namespace to the DOM document. This is for
  DOM tools that are not namespace-aware.
* `target_document` (\DOMDocument): A DOM document that will be used as the
  destination for the parsed nodes.
* `implicit_namespaces` (array): An associative array of namespaces that should be
  used by the parser. Each key is a tag prefix and each value is a namespace URI.

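As a rough illustration of the options listed above, the following sketch parses the same markup with and without `disable_html_ns`, then serializes with `encode_entities` switched on. The sample markup is made up for the example, and the script assumes a Composer install as in the Basic Usage section.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";
use Masterminds\HTML5;

// Sample markup invented for this sketch.
$html = '<!DOCTYPE html><html><body><p>Caf&eacute; &amp; bar</p></body></html>';

// Default behaviour: parsed elements are placed in the HTML namespace.
$default = new HTML5();
$dom = $default->loadHTML($html);
var_dump($dom->documentElement->namespaceURI);

// With disable_html_ns the elements carry no namespace, which suits
// DOM tools that are not namespace-aware.
$plain = new HTML5(array('disable_html_ns' => true));
$plainDom = $plain->loadHTML($html);
var_dump($plainDom->documentElement->namespaceURI);

// With encode_entities the serializer encodes more characters as entities
// than the bare minimum.
$encoder = new HTML5(array('encode_entities' => true));
print $encoder->saveHTML($plainDom);
```

Comparing the two `var_dump()` results is a quick way to check which mode a given `HTML5` instance is running in.
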
## The Low-Level API

This library provides the following low-level APIs that you can use to
create more customized HTML5 tools:

- An `InputStream` abstraction that can work with different kinds of
  input sources (not just files and strings).
- A SAX-like event-based parser that you can hook into for special kinds
  of parsing.
- A flexible error-reporting mechanism that can be tuned to document
  syntax checking.
- A DOM implementation that uses PHP's built-in DOM library.

The unit tests exercise each piece of the API, and every public function
is well-documented.

### Parser Design

The parser is designed as follows:

- The `InputStream` portion handles direct I/O.
- The `Scanner` handles scanning on behalf of the parser.
- The `Tokenizer` requests data off of the scanner, parses it, classifies
  it, and sends it to an `EventHandler`. It is a *recursive descent parser.*
- The `EventHandler` receives notifications and data for each specific
  semantic event that occurs during tokenization.
- The `DOMTreeBuilder` is an `EventHandler` that listens for tokenizing
  events and builds a document tree (`DOMDocument`) based on the events.

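The division of labour described above can be pictured with a toy pipeline. This is only a sketch: `SketchEventHandler`, `SketchDomBuilder`, and `sketchTokenize()` are hypothetical names invented here, the tokenizer is a single regular expression rather than a recursive descent parser, and none of it reflects the library's real interfaces. It only shows a tokenizer emitting events and a handler turning them into a `DOMDocument`.

```php
<?php
// Hypothetical, simplified stand-in for an event handler interface.
interface SketchEventHandler
{
    public function startTag($name);
    public function endTag($name);
    public function text($data);
}

// Plays the tree builder's role: listens for events and grows a DOM tree.
class SketchDomBuilder implements SketchEventHandler
{
    private $doc;
    private $current;

    public function __construct()
    {
        $this->doc = new DOMDocument();
        $this->current = $this->doc;
    }

    public function startTag($name)
    {
        // Append the new element and descend into it.
        $el = $this->doc->createElement($name);
        $this->current->appendChild($el);
        $this->current = $el;
    }

    public function endTag($name)
    {
        // Climb back to the parent (the toy version does not check $name).
        if ($this->current->parentNode !== null) {
            $this->current = $this->current->parentNode;
        }
    }

    public function text($data)
    {
        $this->current->appendChild($this->doc->createTextNode($data));
    }

    public function document()
    {
        return $this->doc;
    }
}

// Plays the tokenizer's role: classify chunks of input and emit events.
// Only attribute-less tags and plain text are recognised.
function sketchTokenize($html, SketchEventHandler $events)
{
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)~', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        if (isset($m[3]) && $m[3] !== '') {
            $events->text($m[3]);
        } elseif ($m[1] === '/') {
            $events->endTag($m[2]);
        } else {
            $events->startTag($m[2]);
        }
    }
}

$builder = new SketchDomBuilder();
sketchTokenize('<p>Hello <b>World</b></p>', $builder);
$dom = $builder->document();
print $dom->saveXML($dom->documentElement) . "\n";
```

Because the tokenizer only talks to the handler interface, swapping in a different handler (for example one that counts tags instead of building a tree) needs no change to the tokenizing code.
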
### Serializer Design

The serializer takes a data structure (the `DOMDocument`) and transforms
it into a character representation -- an HTML5 document.

The serializer is broken into three parts:

- The `OutputRules` contain the rules to turn DOM elements into strings. The
  rules are an implementation of the interface `RulesInterface` allowing for
  different rule sets to be used.
- The `Traverser`, which is a special-purpose tree walker. It visits
  each node in the tree and uses the `OutputRules` to transform the node
  into a string.
- `HTML5` manages the `Traverser` and stores the resultant data
  in the correct place.

The serializer (`save()`, `saveHTML()`) follows
[section 8.9 of the HTML 5.0 spec](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#serializing-html-fragments).
So tags are serialized according to these rules:

- A tag with children: `<foo>CHILDREN</foo>`
- A tag that cannot have content: `<foo>` (no closing tag)
- A tag that could have content, but doesn't: `<foo></foo>`

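The three tag rules above can be sketched with nothing but PHP's built-in DOM classes. The function `sketchSerialize()` and its abbreviated list of void elements are assumptions made for this illustration; this is not the library's `OutputRules` or `Traverser`, only a minimal walk that applies the same three cases.

```php
<?php
// Hypothetical helper for illustration; the void-element list is abbreviated.
function sketchSerialize(DOMNode $node)
{
    static $void = array('br', 'hr', 'img', 'input', 'link', 'meta');

    if ($node instanceof DOMText) {
        return htmlspecialchars($node->data, ENT_NOQUOTES);
    }
    if (!$node instanceof DOMElement) {
        return ''; // comments and other node types are skipped in this sketch
    }

    $name = strtolower($node->nodeName);
    $attrs = '';
    foreach ($node->attributes as $attr) {
        $attrs .= ' ' . $attr->name . '="' . htmlspecialchars($attr->value, ENT_COMPAT) . '"';
    }
    $open = '<' . $name . $attrs . '>';

    // A tag that cannot have content: no closing tag.
    if (in_array($name, $void, true)) {
        return $open;
    }

    // A tag with children, or one that could have content but doesn't:
    // serialize the children (possibly none) and always close the tag.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= sketchSerialize($child);
    }
    return $open . $inner . '</' . $name . '>';
}

$doc = new DOMDocument();
$div = $doc->appendChild($doc->createElement('div'));
$div->appendChild($doc->createElement('p', 'Hi'));
$div->appendChild($doc->createElement('br'));
$div->appendChild($doc->createElement('span'));

// prints <div><p>Hi</p><br><span></span></div>
print sketchSerialize($div) . "\n";
```
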
## Known Issues (Or, Things We Designed Against the Spec)

Please check the issue queue for a full list, but the following are
known issues that are not presently on the roadmap:

- Namespaces: HTML5 only [supports a selected list of namespaces](http://www.w3.org/TR/html5/infrastructure.html#namespaces)
  and they do not operate in the same way as XML namespaces. A `:` has no special
  meaning.
  By default the parser does not support XML style namespaces via `:`;
  to enable XML namespaces, see the [XML Namespaces section](#xml-namespaces).
- Scripts: This parser does not contain a JavaScript or a CSS
  interpreter. While one may be supplied, not all features will be
  supported.
- Reentrance: The current parser is not re-entrant. (Thus you can't pause
  the parser to modify the HTML string mid-parse.)
- Validation: The current tree builder is **not** a validating parser.
  While it will correct some HTML, it does not check that the HTML
  conforms to the standard. (Should you wish, you can build a validating
  parser by extending `DOMTreeBuilder` or building your own `EventHandler`
  implementation.)
  * There is limited support for insertion modes.
  * Some autocorrection is done automatically.
  * Per the spec, many legacy tags are admitted and correctly handled,
    even though they are technically not part of HTML5.
- Attribute names and values: Due to the implementation details of the
  PHP implementation of DOM, attribute names that do not follow the
  XML 1.0 standard are not inserted into the DOM. (Effectively, they
  are ignored.) If you've got a clever fix for this, jump in!
- Processing Instructions: The HTML5 spec does not allow processing
  instructions. We do. Since this is a server-side library, we think
  this is useful. And that means, dear reader, that in some cases you
  can parse the HTML from a mixed PHP/HTML document. This, however,
  is an incidental feature, not a core feature.
- HTML manifests: Unsupported.
- PLAINTEXT: Unsupported.
- Adoption Agency Algorithm: Not yet implemented. (8.2.5.4.7)

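The attribute-name limitation noted in the list above comes from PHP's own DOM extension and can be observed without this library. The sketch below uses only built-in classes; the attribute names are arbitrary samples picked for the illustration.

```php
<?php
$doc = new DOMDocument();
$el = $doc->createElement('div');

// Sample names: the first is a valid XML 1.0 name, the other two are not.
$names = array('data-ok', '@click', '1st');
$rejected = array();

foreach ($names as $name) {
    try {
        $el->setAttribute($name, 'value');
    } catch (DOMException $e) {
        // PHP's DOM refuses names that are not valid XML names.
        $rejected[] = $name;
    }
}

print 'Rejected: ' . implode(', ', $rejected) . "\n";
```

A parser built on top of this DOM has little choice but to skip such attributes, or to rename them, when it builds the tree.
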
## XML Namespaces

To use XML style namespaces, you have to configure the main `HTML5` instance accordingly.

```php
use Masterminds\HTML5;
$html = new HTML5(array(
    "xmlNamespaces" => true
));

$dom = $html->loadHTML('<t:tag xmlns:t="http://www.example.com"/>');

$dom->documentElement->namespaceURI; // http://www.example.com
```

You can also add default prefixes that do not require a namespace declaration;
their elements will still be namespaced.

```php
use Masterminds\HTML5;
$html = new HTML5(array(
    "implicitNamespaces" => array(
        "t" => "http://www.example.com"
    )
));

$dom = $html->loadHTML('<t:tag/>');

$dom->documentElement->namespaceURI; // http://www.example.com
```

## Thanks to...

The dedicated (and patient) contributors of patches small and large,
who have already made this library better. See the CREDITS file for
a list of contributors.

We owe a huge debt of gratitude to the original authors of html5lib.

While not much of the original parser remains, we learned a lot from
reading the html5lib library. And some pieces remain here. In
particular, much of the UTF-8 and Unicode handling is derived from the
html5lib project.

## License

This software is released under the MIT license. The original html5lib
library was also released under the MIT license.

See LICENSE.txt

Certain files contain copyright assertions by specific individuals
involved with html5lib. Those have been retained where appropriate.