# HTML5-PHP

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP.
It is stable and used in many production websites, and has
well over [five million downloads](https://packagist.org/packages/masterminds/html5).

HTML5 provides the following features.

- An HTML5 serializer
- Support for PHP namespaces
- Composer support
- Event-based (SAX-like) parser
- A DOM tree builder
- Interoperability with [QueryPath](https://github.com/technosophos/querypath)
- Runs on **PHP** 5.3.0 or newer and **HHVM** 3.2 or newer

[![Build Status](https://travis-ci.org/Masterminds/html5-php.png?branch=master)](https://travis-ci.org/Masterminds/html5-php)
[![Latest Stable Version](https://poser.pugx.org/masterminds/html5/v/stable.png)](https://packagist.org/packages/masterminds/html5)
[![Code Coverage](https://scrutinizer-ci.com/g/Masterminds/html5-php/badges/coverage.png?b=master)](https://scrutinizer-ci.com/g/Masterminds/html5-php/?branch=master)
[![Scrutinizer Code Quality](https://scrutinizer-ci.com/g/Masterminds/html5-php/badges/quality-score.png?b=master)](https://scrutinizer-ci.com/g/Masterminds/html5-php/?branch=master)
[![Stability: Sustained](https://masterminds.github.io/stability/sustained.svg)](https://masterminds.github.io/stability/sustained.html)

## Installation

Install HTML5-PHP using [composer](http://getcomposer.org/).

Either add the `masterminds/html5` dependency to your `composer.json` file:

```json
{
  "require": {
    "masterminds/html5": "^2.0"
  }
}
```

Or run the `require` command with the Composer executable:

```bash
composer require masterminds/html5
```

## Basic Usage

HTML5-PHP has a high-level API and a low-level API.

Here is how you use the high-level `HTML5` library API:

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";

use Masterminds\HTML5;

// An example HTML document:
$html = <<<'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

// Or save it to a file:
$html5->save($dom, 'out.html');
```

The `$dom` created by the parser is a full `DOMDocument` object, and the
`save()` and `saveHTML()` methods accept any `DOMDocument`.

### Options

It is possible to pass in an array of configuration options when loading
an HTML5 document.

```php
// An associative array of options
$options = array(
  'option_name' => 'option_value',
);

// Provide the options to the constructor
$html5 = new HTML5($options);

$dom = $html5->loadHTML($html);
```

The following options are supported:

* `encode_entities` (boolean): Indicates that the serializer should aggressively
  encode characters as entities. Without this, it only encodes the bare
  minimum.
* `disable_html_ns` (boolean): Prevents the parser from automatically
  assigning the HTML5 namespace to the DOM document. This is for
  non-namespace aware DOM tools.
* `target_document` (\DOMDocument): A DOM document that will be used as the
  destination for the parsed nodes.
* `implicit_namespaces` (array): An associative array of namespaces that should be
  used by the parser. The key is the tag prefix, the value is the namespace URI.

## The Low-Level API

This library provides the following low-level APIs that you can use to
create more customized HTML5 tools:

- A SAX-like event-based parser that you can hook into for special kinds
  of parsing.
- A flexible error-reporting mechanism that can be tuned to document
  syntax checking.
- A DOM implementation that uses PHP's built-in DOM library.

The unit tests exercise each piece of the API, and every public function
is well-documented.

### Parser Design

The parser is designed as follows:

- The `Scanner` handles scanning on behalf of the parser.
- The `Tokenizer` requests data from the scanner, parses it, classifies
  it, and sends it to an `EventHandler`. It is a *recursive descent parser.*
- The `EventHandler` receives notifications and data for each specific
  semantic event that occurs during tokenization.
- The `DOMBuilder` is an `EventHandler` that listens for tokenizing
  events and builds a document tree (`DOMDocument`) based on the events.

### Serializer Design

The serializer takes a data structure (the `DOMDocument`) and transforms
it into a character representation -- an HTML5 document.

The serializer is broken into three parts:

- The `OutputRules` contain the rules to turn DOM elements into strings. The
  rules are an implementation of the interface `RulesInterface`, allowing for
  different rule sets to be used.
- The `Traverser`, which is a special-purpose tree walker. It visits
  each node in the tree and uses the `OutputRules` to transform the node
  into a string.
- `HTML5` manages the `Traverser` and stores the resultant data
  in the correct place.

The serializer (`save()`, `saveHTML()`) follows
[section 8.9 of the HTML 5.0 spec](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#serializing-html-fragments).
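
The serializer can be exercised end to end through the high-level API. The following is a minimal sketch, not part of the library's own documentation: it assumes a Composer install (so that `vendor/autoload.php` exists), and the sample markup is invented for illustration. It parses a document containing an element with children, a void element, and an empty element, then prints the serialized result so you can compare it with the spec's serialization rules.

```php
<?php
// Assumption: the library was installed with Composer.
require 'vendor/autoload.php';

use Masterminds\HTML5;

// Hypothetical sample markup covering three cases:
// an element with children, a void element, and an empty element.
$markup = '<!DOCTYPE html><html><body>'
        . '<p>Some text</p><br><span></span>'
        . '</body></html>';

$html5 = new HTML5();

// Parse into a DOMDocument, then serialize it back to a string.
$dom = $html5->loadHTML($markup);
$out = $html5->saveHTML($dom);

print $out;
```

Because `saveHTML()` accepts any `DOMDocument`, the same call should also work on a tree built by hand with PHP's DOM extension.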
Following that section, tags are serialized according to these rules:

- A tag with children: `<foo>CHILDREN</foo>`
- A tag that cannot have content: `<foo>` (no closing tag)
- A tag that could have content, but doesn't: `<foo></foo>`

## Known Issues (Or, Things We Designed Against the Spec)

Please check the issue queue for a full list, but the following are
known issues that are not presently on the roadmap:

- Namespaces: HTML5 only [supports a selected list of namespaces](http://www.w3.org/TR/html5/infrastructure.html#namespaces)
  and they do not operate in the same way as XML namespaces. A `:` has no special
  meaning.
  By default the parser does not support XML-style namespaces via `:`;
  to enable XML namespaces, see the [XML Namespaces section](#xml-namespaces).
- Scripts: This parser does not contain a JavaScript or a CSS
  interpreter. While one may be supplied, not all features will be
  supported.
- Reentrance: The current parser is not re-entrant. (Thus you can't pause
  the parser to modify the HTML string mid-parse.)
- Validation: The current tree builder is **not** a validating parser.
  While it will correct some HTML, it does not check that the HTML
  conforms to the standard. (Should you wish, you can build a validating
  parser by extending DOMTree or building your own EventHandler
  implementation.)
  * There is limited support for insertion modes.
  * Some autocorrection is done automatically.
  * Per the spec, many legacy tags are admitted and correctly handled,
    even though they are technically not part of HTML5.
- Attribute names and values: Due to the implementation details of the
  PHP implementation of DOM, attribute names that do not follow the
  XML 1.0 standard are not inserted into the DOM. (Effectively, they
  are ignored.) If you've got a clever fix for this, jump in!
- Processing Instructions: The HTML5 spec does not allow processing
  instructions. We do. Since this is a server-side library, we think
  this is useful. And that means, dear reader, that in some cases you
  can parse the HTML from a mixed PHP/HTML document. This, however,
  is an incidental feature, not a core feature.
- HTML manifests: Unsupported.
- PLAINTEXT: Unsupported.
- Adoption Agency Algorithm: Not yet implemented. (8.2.5.4.7)

## XML Namespaces

To use XML-style namespaces, enable them when you configure the main `HTML5` instance.

```php
use Masterminds\HTML5;
$html = new HTML5(array(
    "xmlNamespaces" => true
));

$dom = $html->loadHTML('<t:tag xmlns:t="http://www.example.com"/>');

$dom->documentElement->namespaceURI; // http://www.example.com

```

You can also register default prefixes that do not require a namespace declaration;
elements using those prefixes are still namespaced.

```php
use Masterminds\HTML5;
$html = new HTML5(array(
    "implicitNamespaces" => array(
        "t" => "http://www.example.com"
    )
));

$dom = $html->loadHTML('<t:tag/>');

$dom->documentElement->namespaceURI; // http://www.example.com

```

## Thanks to...

The dedicated (and patient) contributors of patches small and large,
who have already made this library better. See the CREDITS file for
a list of contributors.

We owe a huge debt of gratitude to the original authors of html5lib.

While not much of the original parser remains, we learned a lot from
reading the html5lib library. And some pieces remain here. In
particular, much of the UTF-8 and Unicode handling is derived from the
html5lib project.

## License

This software is released under the MIT license. The original html5lib
library was also released under the MIT license.

See LICENSE.txt

Certain files contain copyright assertions by specific individuals
involved with html5lib. Those have been retained where appropriate.