annotate vendor/masterminds/html5/README.md @ 19:fa3358dc1485 tip

Add ndrum files
author Chris Cannam
date Wed, 28 Aug 2019 13:14:47 +0100
parents 129ea1e6d783
children
rev   line source
Chris@0 1 # HTML5-PHP
Chris@0 2
Chris@0 3 HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP.
Chris@0 4 It is stable and used in many production websites, and has
Chris@17 5 well over [five million downloads](https://packagist.org/packages/masterminds/html5).
Chris@0 6
Chris@0 7 HTML5 provides the following features.
Chris@0 8
Chris@0 9 - An HTML5 serializer
Chris@0 10 - Support for PHP namespaces
Chris@0 11 - Composer support
Chris@0 12 - Event-based (SAX-like) parser
Chris@0 13 - A DOM tree builder
Chris@0 14 - Interoperability with [QueryPath](https://github.com/technosophos/querypath)
Chris@0 15 - Runs on **PHP** 5.3.0 or newer and **HHVM** 3.2 or newer
Chris@0 16
Chris@0 17 [![Build Status](https://travis-ci.org/Masterminds/html5-php.png?branch=master)](https://travis-ci.org/Masterminds/html5-php)
Chris@0 18 [![Latest Stable Version](https://poser.pugx.org/masterminds/html5/v/stable.png)](https://packagist.org/packages/masterminds/html5)
Chris@0 19 [![Code Coverage](https://scrutinizer-ci.com/g/Masterminds/html5-php/badges/coverage.png?b=master)](https://scrutinizer-ci.com/g/Masterminds/html5-php/?branch=master)
Chris@0 20 [![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/Masterminds/html5-php/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/Masterminds/html5-php/?branch=master)
Chris@0 21 [![Stability: Sustained](https://masterminds.github.io/stability/sustained.svg)](https://masterminds.github.io/stability/sustained.html)
Chris@0 22
Chris@0 23 ## Installation
Chris@0 24
Chris@0 25 Install HTML5-PHP using [composer](http://getcomposer.org/).
Chris@0 26
Chris@17 27 By adding the `masterminds/html5` dependency to your `composer.json` file:
Chris@0 28
Chris@0 29 ```json
Chris@0 30 {
Chris@0 31 "require" : {
Chris@17 32 "masterminds/html5": "^2.0"
Chris@0 33 }
Chris@0 34 }
Chris@0 35 ```
Chris@0 36
Chris@17 37 By invoking the `require` command via the Composer executable:
Chris@0 38
Chris@17 39 ```bash
Chris@17 40 composer require masterminds/html5
Chris@17 41 ```
Chris@0 42
Chris@0 43 ## Basic Usage
Chris@0 44
Chris@0 45 HTML5-PHP has a high-level API and a low-level API.
Chris@0 46
Chris@0 47 Here is how you use the high-level `HTML5` library API:
Chris@0 48
Chris@0 49 ```php
Chris@0 50 <?php
Chris@0 51 // Assuming you installed from Composer:
Chris@0 52 require "vendor/autoload.php";
Chris@17 53
Chris@0 54 use Masterminds\HTML5;
Chris@0 55
Chris@0 56 // An example HTML document:
Chris@0 57 $html = <<< 'HERE'
Chris@0 58 <html>
Chris@0 59 <head>
Chris@0 60 <title>TEST</title>
Chris@0 61 </head>
Chris@0 62 <body id='foo'>
Chris@0 63 <h1>Hello World</h1>
Chris@0 64 <p>This is a test of the HTML5 parser.</p>
Chris@0 65 </body>
Chris@0 66 </html>
Chris@0 67 HERE;
Chris@0 68
Chris@0 69 // Parse the document. $dom is a DOMDocument.
Chris@0 70 $html5 = new HTML5();
Chris@0 71 $dom = $html5->loadHTML($html);
Chris@0 72
Chris@0 73 // Render it as HTML5:
Chris@0 74 print $html5->saveHTML($dom);
Chris@0 75
Chris@0 76 // Or save it to a file:
Chris@0 77 $html5->save($dom, 'out.html');
Chris@0 78 ```
Chris@0 79
Chris@0 80 The `$dom` created by the parser is a full `DOMDocument` object. And the
Chris@0 81 `save()` and `saveHTML()` methods will take any DOMDocument.
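
Because the result is a regular `DOMDocument`, PHP's built-in DOM calls can be used on it before serializing again. The following is a minimal sketch, assuming a Composer install; the markup and the `title` class name are made up for illustration.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

$html5 = new HTML5();
$dom = $html5->loadHTML('<html><body><h1>Hello World</h1><p>First.</p></body></html>');

// Standard DOM methods work on the parsed document.
$heading = $dom->getElementsByTagName('h1')->item(0);
$heading->setAttribute('class', 'title'); // illustrative attribute value

// Read text back out of the tree.
print $heading->textContent . "\n";

// Serialize the modified document as HTML5.
print $html5->saveHTML($dom);
```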
Chris@0 82
Chris@0 83 ### Options
Chris@0 84
Chris@0 85 It is possible to pass in an array of configuration options when loading
Chris@0 86 an HTML5 document.
Chris@0 87
Chris@0 88 ```php
Chris@0 89 // An associative array of options
Chris@0 90 $options = array(
Chris@0 91 'option_name' => 'option_value',
Chris@0 92 );
Chris@0 93
Chris@0 94 // Provide the options to the constructor
Chris@0 95 $html5 = new HTML5($options);
Chris@0 96
Chris@0 97 $dom = $html5->loadHTML($html);
Chris@0 98 ```
Chris@0 99
Chris@0 100 The following options are supported:
Chris@0 101
Chris@0 102 * `encode_entities` (boolean): Indicates that the serializer should aggressively
Chris@0 103 encode characters as entities. Without this, it only encodes the bare
Chris@0 104 minimum.
Chris@0 105 * `disable_html_ns` (boolean): Prevents the parser from automatically
Chris@0 106 assigning the HTML5 namespace to the DOM document. This is for
Chris@0 107 non-namespace aware DOM tools.
Chris@0 108 * `target_document` (\DOMDocument): A DOM document that will be used as the
Chris@0 109 destination for the parsed nodes.
Chris@0 110 * `implicit_namespaces` (array): An assoc array of namespaces that should be
Chris@0 111 used by the parser. Name is tag prefix, value is NS URI.
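
As a rough illustration of two of these options, the sketch below turns on `disable_html_ns` so that a plain XPath expression matches without registering a namespace, and `encode_entities` so that the serializer encodes more characters as entities. The sample markup is invented for the example.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

$html5 = new HTML5(array(
    'disable_html_ns' => true, // elements are created without the HTML namespace
    'encode_entities' => true, // serializer encodes characters as entities more aggressively
));

$dom = $html5->loadHTML('<html><body><p>Caf&eacute; &amp; bar</p></body></html>');

// Without the HTML namespace, an unprefixed XPath query finds the element.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//p') as $p) {
    print $p->textContent . "\n";
}

print $html5->saveHTML($dom);
```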
Chris@0 112
Chris@0 113 ## The Low-Level API
Chris@0 114
Chris@0 115 This library provides the following low-level APIs that you can use to
Chris@0 116 create more customized HTML5 tools:
Chris@0 117
Chris@0 118 - A SAX-like event-based parser that you can hook into for special kinds
Chris@0 119 of parsing.
Chris@0 120 - A flexible error-reporting mechanism that can be tuned to document
Chris@0 121 syntax checking.
Chris@0 122 - A DOM implementation that uses PHP's built-in DOM library.
Chris@0 123
Chris@0 124 The unit tests exercise each piece of the API, and every public function
Chris@0 125 is well-documented.
Chris@0 126
Chris@0 127 ### Parser Design
Chris@0 128
Chris@0 129 The parser is designed as follows:
Chris@0 130
Chris@0 131 - The `Scanner` handles scanning on behalf of the parser.
Chris@0 132 - The `Tokenizer` requests data off of the scanner, parses it, classifies
Chris@0 133 it, and sends it to an `EventHandler`. It is a *recursive descent parser.*
Chris@0 134 - The `EventHandler` receives notifications and data for each specific
Chris@0 135 semantic event that occurs during tokenization.
Chris@0 136 - The `DOMTreeBuilder` is an `EventHandler` that listens for tokenizing
Chris@0 137 events and builds a document tree (`DOMDocument`) based on the events.
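
The pieces above can be wired together by hand. The following is only a sketch: `TagCollector` is a hypothetical class, the method signatures are assumed to match the `EventHandler` interface of the installed release, and passing a string straight to `Scanner` is assumed to be supported (older releases expect an input stream object instead). Check the interface in your vendored copy before relying on it.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5\Parser\EventHandler;
use Masterminds\HTML5\Parser\Scanner;
use Masterminds\HTML5\Parser\Tokenizer;

// Hypothetical handler that records the name of every start tag it is told about.
class TagCollector implements EventHandler
{
    public $tags = array();

    public function doctype($name, $idType = 0, $id = null, $quirks = false) {}

    public function startTag($name, $attributes = array(), $selfClosing = false)
    {
        $this->tags[] = $name;
    }

    public function endTag($name) {}
    public function comment($cdata) {}
    public function text($cdata) {}
    public function eof() {}
    public function parseError($msg, $line, $col) {}
    public function cdata($data) {}
    public function processingInstruction($name, $data = null) {}
}

$events = new TagCollector();
$scanner = new Scanner('<html><body><p>Hi</p></body></html>'); // assumption: string input
$tokenizer = new Tokenizer($scanner, $events);
$tokenizer->parse();

print implode(', ', $events->tags) . "\n";
```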
Chris@0 138
Chris@0 139 ### Serializer Design
Chris@0 140
Chris@0 141 The serializer takes a data structure (the `DOMDocument`) and transforms
Chris@0 142 it into a character representation -- an HTML5 document.
Chris@0 143
Chris@0 144 The serializer is broken into three parts:
Chris@0 145
Chris@0 146 - The `OutputRules` contain the rules to turn DOM elements into strings. The
Chris@0 147 rules are an implementation of the interface `RulesInterface` allowing for
Chris@0 148 different rule sets to be used.
Chris@0 149 - The `Traverser`, which is a special-purpose tree walker. It visits
Chris@0 150 each node in the tree and uses the `OutputRules` to transform the node
Chris@0 151 into a string.
Chris@0 152 - `HTML5` manages the `Traverser` and stores the resultant data
Chris@0 153 in the correct place.
Chris@0 154
Chris@0 155 The serializer (`save()`, `saveHTML()`) follows
Chris@0 156 [section 8.9 of the HTML 5.0 spec](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#serializing-html-fragments).
Chris@0 157 So tags are serialized according to these rules:
Chris@0 158
Chris@0 159 - A tag with children: &lt;foo&gt;CHILDREN&lt;/foo&gt;
Chris@0 160 - A tag that cannot have content: &lt;foo&gt; (no closing tag)
Chris@0 161 - A tag that could have content, but doesn't: &lt;foo&gt;&lt;/foo&gt;
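
A small way to see the three cases side by side, as a sketch with made-up markup: `<p>` has children, `<br>` cannot have content, and `<div>` could have content but is empty here.

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

$html5 = new HTML5();
$dom = $html5->loadHTML(
    '<!DOCTYPE html><html><body><p>text</p><br><div></div></body></html>'
);

// Serialize and inspect how each of the three elements is written out.
$out = $html5->saveHTML($dom);
print $out;
```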
Chris@0 162
Chris@0 163 ## Known Issues (Or, Things We Designed Against the Spec)
Chris@0 164
Chris@0 165 Please check the issue queue for a full list, but the following are
Chris@0 166 known issues that are not presently on the roadmap:
Chris@0 167
Chris@0 168 - Namespaces: HTML5 only [supports a selected list of namespaces](http://www.w3.org/TR/html5/infrastructure.html#namespaces)
Chris@0 169 and they do not operate in the same way as XML namespaces. A `:` has no special
Chris@0 170 meaning.
Chris@0 171 By default the parser does not support XML style namespaces via `:`;
Chris@0 172 to enable XML namespaces, see the [XML Namespaces section](#xml-namespaces).
Chris@0 173 - Scripts: This parser does not contain a JavaScript or a CSS
Chris@0 174 interpreter. While one may be supplied, not all features will be
Chris@0 175 supported.
Chris@0 176 - Reentrance: The current parser is not re-entrant. (Thus you can't pause
Chris@0 177 the parser to modify the HTML string mid-parse.)
Chris@0 178 - Validation: The current tree builder is **not** a validating parser.
Chris@0 179 While it will correct some HTML, it does not check that the HTML
Chris@0 180 conforms to the standard. (Should you wish, you can build a validating
Chris@0 181 parser by extending `DOMTreeBuilder` or building your own `EventHandler`
Chris@0 182 implementation.)
Chris@0 183 * There is limited support for insertion modes.
Chris@0 184 * Some autocorrection is done automatically.
Chris@0 185 * Per the spec, many legacy tags are admitted and correctly handled,
Chris@0 186 even though they are technically not part of HTML5.
Chris@0 187 - Attribute names and values: Due to the implementation details of the
Chris@0 188 PHP implementation of DOM, attribute names that do not follow the
Chris@0 189 XML 1.0 standard are not inserted into the DOM. (Effectively, they
Chris@0 190 are ignored.) If you've got a clever fix for this, jump in!
Chris@0 191 - Processing Instructions: The HTML5 spec does not allow processing
Chris@0 192 instructions. We do. Since this is a server-side library, we think
Chris@0 193 this is useful. And that means, dear reader, that in some cases you
Chris@0 194 can parse the HTML from a mixed PHP/HTML document. This, however,
Chris@0 195 is an incidental feature, not a core feature.
Chris@0 196 - HTML manifests: Unsupported.
Chris@0 197 - PLAINTEXT: Unsupported.
Chris@0 198 - Adoption Agency Algorithm: Not yet implemented. (8.2.5.4.7)
Chris@0 199
Chris@17 200 ## XML Namespaces
Chris@0 201
Chris@0 202 To use XML-style namespaces, you have to configure the main `HTML5` instance accordingly.
Chris@0 203
Chris@0 204 ```php
Chris@0 205 use Masterminds\HTML5;
Chris@0 206 $html = new HTML5(array(
Chris@0 207 "xmlNamespaces" => true
Chris@0 208 ));
Chris@0 209
Chris@0 210 $dom = $html->loadHTML('<t:tag xmlns:t="http://www.example.com"/>');
Chris@0 211
Chris@0 212 $dom->documentElement->namespaceURI; // http://www.example.com
Chris@0 213
Chris@0 214 ```
Chris@0 215
Chris@0 216 You can also add default prefixes that do not require a namespace declaration;
Chris@17 217 elements that use them will still be namespaced.
Chris@0 218
Chris@0 219 ```php
Chris@0 220 use Masterminds\HTML5;
Chris@0 221 $html = new HTML5(array(
Chris@0 222 "implicitNamespaces"=>array(
Chris@0 223 "t"=>"http://www.example.com"
Chris@0 224 )
Chris@0 225 ));
Chris@0 226
Chris@0 227 $dom = $html->loadHTML('<t:tag/>');
Chris@0 228
Chris@0 229 $dom->documentElement->namespaceURI; // http://www.example.com
Chris@0 230
Chris@0 231 ```
Chris@0 232
Chris@0 233 ## Thanks to...
Chris@0 234
Chris@0 235 The dedicated (and patient) contributors of patches small and large,
Chris@0 236 who have already made this library better. See the CREDITS file for
Chris@0 237 a list of contributors.
Chris@0 238
Chris@0 239 We owe a huge debt of gratitude to the original authors of html5lib.
Chris@0 240
Chris@17 241 While not much of the original parser remains, we learned a lot from
Chris@0 242 reading the html5lib library. And some pieces remain here. In
Chris@0 243 particular, much of the UTF-8 and Unicode handling is derived from the
Chris@0 244 html5lib project.
Chris@0 245
Chris@0 246 ## License
Chris@0 247
Chris@0 248 This software is released under the MIT license. The original html5lib
Chris@0 249 library was also released under the MIT license.
Chris@0 250
Chris@0 251 See LICENSE.txt
Chris@0 252
Chris@0 253 Certain files contain copyright assertions by specific individuals
Chris@0 254 involved with html5lib. Those have been retained where appropriate.