# Theory of Operation

zend-escaper provides methods for escaping output data, dependent on the context
in which the data will be used. Each method is based on peer-reviewed rules and
is in compliance with the current OWASP recommendations.

The escaping follows a well-known and fixed set of encoding rules defined by
OWASP for each key HTML context. These rules cannot be impacted or negated by
browser quirks or edge-case HTML parsing unless the browser suffers a
catastrophic bug in its HTML parser or Javascript interpreter; both of
these are unlikely.

The contexts in which zend-escaper should be used are **HTML Body**, **HTML
Attribute**, **Javascript**, **CSS**, and **URL/URI** contexts.

Every escaper method will take the data to be escaped, make sure it is UTF-8
encoded data (or try to convert it to UTF-8), perform context-based escaping,
encode the escaped data back to its original encoding, and return the data to
the caller.

The actual escaping of the data differs between each method; they all have their
own set of rules according to which escaping is performed.
An example will allow
us to clearly demonstrate the difference, and how the same characters are
escaped differently between contexts:

```php
$escaper = new Zend\Escaper\Escaper('utf-8');

// &lt;script&gt;alert(&quot;zf2&quot;)&lt;/script&gt;
echo $escaper->escapeHtml('<script>alert("zf2")</script>');

// &lt;script&gt;alert&#x28;&quot;zf2&quot;&#x29;&lt;&#x2F;script&gt;
echo $escaper->escapeHtmlAttr('<script>alert("zf2")</script>');

// \x3Cscript\x3Ealert\x28\x22zf2\x22\x29\x3C\x2Fscript\x3E
echo $escaper->escapeJs('<script>alert("zf2")</script>');

// \3C script\3E alert\28 \22 zf2\22 \29 \3C \2F script\3E
echo $escaper->escapeCss('<script>alert("zf2")</script>');

// %3Cscript%3Ealert%28%22zf2%22%29%3C%2Fscript%3E
echo $escaper->escapeUrl('<script>alert("zf2")</script>');
```

More detailed examples will be given in later chapters.

## The Problem with Inconsistent Functionality

At present, programmers orient towards the following PHP functions for each
common HTML context:

- **HTML Body**: `htmlspecialchars()` or `htmlentities()`
- **HTML Attribute**: `htmlspecialchars()` or `htmlentities()`
- **Javascript**: `addslashes()` or `json_encode()`
- **CSS**: n/a
- **URL/URI**: `rawurlencode()` or `urlencode()`

In practice, these decisions appear to depend more on what PHP offers, and
whether it can be interpreted as offering sufficient escaping safety, than they
do on what is actually recommended to defend against XSS. While these functions
can prevent some forms of XSS, they do not cover all use cases or risks and are
therefore insufficient defenses.

Using `htmlspecialchars()` in a perfectly valid HTML5 unquoted attribute value,
for example, is completely useless since the value can be terminated by a space
(among other things), which is never escaped.
Thus, in this instance, we have a
conflict between a widely used HTML escaper and a modern HTML specification,
with no specific function available to cover this use case. While it's tempting
to blame users, or the HTML specification authors, escaping just needs to deal
with whatever HTML and browsers allow.

Using `addslashes()`, custom backslash escaping, or `json_encode()` will
typically ignore HTML special characters such as ampersands, which may be used
to inject entities into Javascript. Under the right circumstances, the browser
will convert these entities into their literal equivalents before interpreting
Javascript, thus allowing attackers to inject arbitrary code.

Inconsistencies with valid HTML, insecure default parameters, lack of character
encoding awareness, and misrepresentations by some programmers of what functions
are capable of: these all make escaping in PHP an unnecessarily
convoluted quest.

To make up for the lack of escaping methods in PHP, zend-escaper addresses the
need to apply context-specific escaping in web applications. It implements
methods that specifically target XSS and offers programmers a tool to secure
their applications without misusing other inadequate methods or relying on
home-grown solutions that are most likely incomplete.

## Why Contextual Escaping?

The following quick points explain why multiple standardised escaping methods
are needed; they are by no means a complete set of reasons, however!

### HTML escaping of unquoted HTML attribute values still allows XSS

This is probably the best known way to defeat `htmlspecialchars()` when used on
attribute values, since any space (or character interpreted as a space, and
there are a lot) lets you inject new attributes whose content can't be
neutralised by HTML escaping. The solution (where this is possible) is
additional escaping as defined by the OWASP ESAPI codecs. The point here can be
extended further: escaping only works if a programmer or designer knows
what they're doing. In many contexts, there are additional practices and gotchas
that need to be carefully monitored since escaping sometimes needs a little
extra help to protect against XSS, even if that means ensuring all
attribute values are properly double quoted despite this not being required for
valid HTML.

### HTML escaping of CSS, Javascript or URIs is often reversed when passed to non-HTML interpreters by the browser

HTML escaping is just that: it's designed to escape a string for HTML
(i.e. prevent tag or attribute insertion), but not alter the underlying meaning
of the content, whether it be text, Javascript, CSS, or URIs. Consequently,
a fully HTML-escaped version of any other context may still have its unescaped
form extracted before it's interpreted or executed. For this reason we need
separate escapers for Javascript, CSS, and URIs, and developers or designers
writing templates **must** know which escaper to apply to which context. Of
course, this means you need to be able to identify the correct context before
selecting the right escaper!
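
The reversal described above can be sketched in plain PHP. This is only an illustration under stated assumptions: the payload, the `doSomething()` handler name, and the use of `html_entity_decode()` as a stand-in for the browser's HTML parser are all hypothetical and not part of zend-escaper.

```php
<?php
// Hypothetical attacker-controlled value that a template places inside a
// Javascript string literal within an onclick attribute.
$input = "'); alert(1); //";

// HTML escaping alone: the single quote becomes an entity, nothing else changes.
$escaped = htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// The attribute value as it would appear in the rendered template.
$attribute = "doSomething('" . $escaped . "')";

// Stand-in for the browser (an assumption for illustration): entities in an
// attribute value are decoded before the value reaches the Javascript engine.
$seenByJs = html_entity_decode($attribute, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;   // &#039;); alert(1); //
echo $seenByJs, PHP_EOL;  // doSomething(''); alert(1); //')
```

After decoding, the quote that HTML escaping neutralised is a literal quote again, so the string is closed early and `alert(1)` runs as a separate statement. HTML escaping on its own protects the HTML context only, not the Javascript nested inside it.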

### DOM-based XSS requires a defence using at least two levels of different escaping in many cases

DOM-based XSS has become increasingly common as Javascript has taken off in
popularity for large scale client-side coding. A simple example is Javascript
defined in a template which inserts a new piece of HTML text into the DOM. If
the string is only HTML escaped, it may still contain Javascript that will
execute in that context. If the string is only Javascript-escaped, it may
contain HTML markup (new tags and attributes) which will be injected into the
DOM and parsed once the inserting Javascript executes. Damned either way? The
solution is to escape twice: first escape the string for HTML (make it
safe for DOM insertion), and then for Javascript (make it safe for the current
Javascript context). Nested contexts are a common means of bypassing naive
escaping habits (e.g. you can inject Javascript into a CSS expression within an
HTML attribute).

### PHP has no known anti-XSS escape functions (only those kidnapped from their original purposes)

A simple, widely seen example is `json_encode()` being used to escape
Javascript, or worse, some kind of mutant `addslashes()` implementation. These
were never designed to eliminate XSS, yet PHP programmers use them as such. For
example, `json_encode()` does not escape the ampersand or semicolon characters
by default. That means you can easily inject HTML entities which could then be
decoded before the Javascript is evaluated in an HTML document. This lets you
break out of strings, add new JS statements, close tags, etc. In other words,
using `json_encode()` is insufficient and naive.
The same, arguably, could be
said for `htmlspecialchars()` which has its own well known limitations that make
a singular reliance on it a questionable practice.
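
The `json_encode()` behaviour described above can be checked with a short sketch. It assumes the encoded value ends up somewhere the HTML parser decodes entities (an inline event-handler attribute, for instance). The payload is hypothetical, and `html_entity_decode()` again only stands in for the browser. The `JSON_HEX_*` flags are shown purely to contrast with the default behaviour, not as a complete defence.

```php
<?php
// Hypothetical attacker-controlled value containing HTML entities.
$input = '&quot;;alert(1);&quot;';

// Default json_encode(): ampersands and semicolons pass through untouched.
$default = json_encode($input);

// Stand-in for the HTML parser decoding entities before Javascript runs.
$afterHtmlParser = html_entity_decode($default, ENT_QUOTES, 'UTF-8');

// With the JSON_HEX_* flags, the ampersand is emitted as \u0026 instead,
// so no entity survives for the HTML parser to decode.
$hexFlags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;
$hardened = json_encode($input, $hexFlags);

echo $default, PHP_EOL;          // "&quot;;alert(1);&quot;"
echo $afterHtmlParser, PHP_EOL;  // "";alert(1);""
echo $hardened, PHP_EOL;         // "\u0026quot;;alert(1);\u0026quot;"
```

Once the entities are decoded, the original string literal has collapsed to `""` and `alert(1)` has become a statement of its own, which is the break-out the text describes.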