Lexer component documentation
=============================

The lexer is responsible for providing tokens to the parser. The project comes with two lexers: `PhpParser\Lexer` and
`PhpParser\Lexer\Emulative`. The latter is an extension of the former, which adds the ability to emulate tokens of
newer PHP versions and thus allows parsing of new code on older versions.

This documentation discusses options available for the default lexers and explains how lexers can be extended.

Lexer options
-------------

The two default lexers accept an `$options` array in the constructor. Currently only the `'usedAttributes'` option is
supported, which allows you to specify which attributes will be added to the AST nodes. The attributes can then be
accessed using the `$node->getAttribute()`, `$node->setAttribute()`, `$node->hasAttribute()` and
`$node->getAttributes()` methods. A sample options array:

```php
$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine'
    )
));
```

The attributes used in this example match the default behavior of the lexer. The following attributes are supported:

 * `comments`: Array of `PhpParser\Comment` or `PhpParser\Comment\Doc` instances, representing all comments that occurred
   between the previous non-discarded token and the current one. Use of this attribute is required for the
   `$node->getComments()` and `$node->getDocComment()` methods to work. The attribute is also needed if you wish the pretty
   printer to retain comments present in the original code.
 * `startLine`: Line in which the node starts. This attribute is required for `$node->getLine()` to work.
   It is also required if syntax errors should contain line number information.
 * `endLine`: Line in which the node ends. Required for `$node->getEndLine()`.
 * `startTokenPos`: Offset into the token array of the first token in the node. Required for `$node->getStartTokenPos()`.
 * `endTokenPos`: Offset into the token array of the last token in the node. Required for `$node->getEndTokenPos()`.
 * `startFilePos`: Offset into the code string of the first character that is part of the node. Required for `$node->getStartFilePos()`.
 * `endFilePos`: Offset into the code string of the last character that is part of the node. Required for `$node->getEndFilePos()`.

### Using token positions

> **Note:** The example in this section is outdated in that this information is directly available in the AST: While
> `$property->isPublic()` does not distinguish between `public` and `var`, directly checking whether
> `($property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0` allows making this distinction without resorting to
> tokens. However, the general idea behind the example still applies in other cases.

The token offset information is useful if you wish to examine the exact formatting used for a node.
For example, the AST
does not distinguish whether a property was declared using `public` or using `var`, but you can retrieve this
information based on the token position:

```php
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
    // Index of the first token that belongs to the property declaration.
    $i = $prop->getAttribute('startTokenPos');
    return $tokens[$i][0] === T_VAR;
}
```

In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using
code similar to the following:

```php
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
    private $tokens;
    public function setTokens(array $tokens) {
        $this->tokens = $tokens;
    }

    public function leaveNode(PhpParser\Node $node) {
        if ($node instanceof PhpParser\Node\Stmt\Property) {
            var_dump(isDeclaredUsingVar($this->tokens, $node));
        }
    }
}

// Request the token position attributes in addition to the defaults.
$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos'
    )
));
$parser = (new PhpParser\ParserFactory)->create(PhpParser\ParserFactory::ONLY_PHP7, $lexer);

$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser();
$traverser->addVisitor($visitor);

try {
    // $code holds the PHP source to analyze, including the opening tag.
    $stmts = $parser->parse($code);
    // The tokens only become available once parse() has run the lexer.
    $visitor->setTokens($lexer->getTokens());
    $stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
    echo 'Parse Error: ', $e->getMessage();
}
```

The same approach can also be used to perform specific modifications in the code, without changing the formatting in
other places (which is the case when using the pretty printer).
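
To make the idea of recovering exact formatting more concrete, the following is a minimal sketch of a helper that rebuilds the original text covered by a range of token positions. The name `getTokenRangeText()` is hypothetical and not part of PhpParser. The demonstration calls PHP's built-in `token_get_all()` directly so that it runs without a lexer; it assumes the array returned by `$lexer->getTokens()` has the same shape.

```php
// Hypothetical helper (not part of PhpParser): concatenate the text of all
// tokens between two positions, inclusive. Entries in token_get_all() format
// are either plain strings (e.g. ';') or arrays of [id, text, line].
function getTokenRangeText(array $tokens, int $startPos, int $endPos): string {
    $text = '';
    for ($i = $startPos; $i <= $endPos; $i++) {
        $token = $tokens[$i];
        $text .= is_array($token) ? $token[1] : $token;
    }
    return $text;
}

// Stand-alone demonstration on the raw output of token_get_all().
$source = '<?php class A { var   $x = 0x0A; }';
$tokens = token_get_all($source);

// Whitespace is kept as tokens, so the full range reproduces the source text.
$roundTrip = getTokenRangeText($tokens, 0, count($tokens) - 1);
var_dump($roundTrip === $source);
```

Inside a visitor such as `MyNodeVisitor`, the two positions would come from `$node->getAttribute('startTokenPos')` and `$node->getAttribute('endTokenPos')`, provided both attributes were requested via `'usedAttributes'`.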

Lexer extension
---------------

A lexer has to define the following public interface:

```php
function startLexing(string $code, ErrorHandler $errorHandler = null): void;
function getTokens(): array;
function handleHaltCompiler(): string;
function getNextToken(string &$value = null, array &$startAttributes = null, array &$endAttributes = null): int;
```

The `startLexing()` method is invoked whenever the `parse()` method of the parser is called and is passed the source
code that is to be lexed (including the opening tag). It can be used to reset state or preprocess the source code or
tokens. The passed `ErrorHandler` should be used to report lexing errors.

The `getTokens()` method returns the current token array, in the usual `token_get_all()` format. This method is not
used by the parser (which uses `getNextToken()`), but is useful in combination with the token position attributes.

The `handleHaltCompiler()` method is called whenever a `T_HALT_COMPILER` token is encountered. It has to return the
remaining string after the construct (not including `();`).

The `getNextToken()` method returns the ID of the next token (as defined by the `Parser::T_*` constants). If no more
tokens are available it must return `0`, which is the ID of the `EOF` token. Furthermore the string content of the
token should be written into the by-reference `$value` parameter (which will then be available as `$n` in the parser).
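
Since `getTokens()` uses the `token_get_all()` format, each entry is either a three-element array or a bare one-character string, and code that inspects the array has to handle both. The sketch below shows one way to do that using only PHP built-ins. `describeToken()` is a hypothetical helper for illustration, not a PhpParser function. Entries in this array carry PHP's built-in `T_*` IDs, which is why the `isDeclaredUsingVar()` example compares against `T_VAR`.

```php
// Hypothetical helper (not part of PhpParser): render one entry of a
// token_get_all()-style array in a readable form.
function describeToken($token): string {
    if (is_array($token)) {
        // Array form: [token id, token text, line number].
        list($id, $text, $line) = $token;
        return sprintf('%s %s (line %d)', token_name($id), var_export($text, true), $line);
    }
    // String form: a single-character token such as ';' or '{'.
    return var_export($token, true);
}

// Print every token of a tiny script, one per line.
foreach (token_get_all('<?php echo 1;') as $token) {
    echo describeToken($token), "\n";
}
```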

### Attribute handling

The other two by-ref variables `$startAttributes` and `$endAttributes` define which attributes will eventually be
assigned to the generated nodes: The parser will take the `$startAttributes` from the first token which is part of the
node and the `$endAttributes` from the last token that is part of the node.

E.g. if the tokens `T_FUNCTION T_STRING ... '{' ... '}'` constitute a node, then the `$startAttributes` from the
`T_FUNCTION` token will be taken and the `$endAttributes` from the `'}'` token.

An application of custom attributes is storing the exact original formatting of literals: While the parser does retain
some information about the formatting of integers (like decimal vs. hexadecimal) or strings (like used quote type), it
does not preserve the exact original formatting (e.g. leading zeros for integers or escape sequences in strings). This
can be remedied by storing the original value in an attribute:

```php
use PhpParser\Lexer;
use PhpParser\Parser\Tokens;

class KeepOriginalValueLexer extends Lexer // or Lexer\Emulative
{
    // The ": int" return type matches the interface listed under "Lexer extension".
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null): int {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);

        if ($tokenId == Tokens::T_CONSTANT_ENCAPSED_STRING   // non-interpolated string
            || $tokenId == Tokens::T_ENCAPSED_AND_WHITESPACE // interpolated string
            || $tokenId == Tokens::T_LNUMBER                 // integer
            || $tokenId == Tokens::T_DNUMBER                 // floating point number
        ) {
            // could also use $startAttributes, doesn't really matter here
            $endAttributes['originalValue'] = $value;
        }

        return $tokenId;
    }
}
```
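
The rule from this section (start attributes from the first token of a node, end attributes from the last) can be pictured with plain arrays. The following sketch is an illustration only and is not the parser's actual implementation. The `$tokenAttributes` data and the `attributesForNode()` function are hypothetical, and merging the two arrays with the `+` operator is an assumption made for the example.

```php
// Hypothetical data: start/end attributes that a lexer might report for three
// consecutive tokens, each on its own line. Not produced by PhpParser here.
$tokenAttributes = [
    ['start' => ['startLine' => 1, 'startTokenPos' => 0], 'end' => ['endLine' => 1, 'endTokenPos' => 0]],
    ['start' => ['startLine' => 2, 'startTokenPos' => 1], 'end' => ['endLine' => 2, 'endTokenPos' => 1]],
    ['start' => ['startLine' => 3, 'startTokenPos' => 2], 'end' => ['endLine' => 3, 'endTokenPos' => 2]],
];

// Hypothetical helper: combine the start attributes of the first token with
// the end attributes of the last token of a node.
function attributesForNode(array $tokenAttributes, int $firstToken, int $lastToken): array {
    return $tokenAttributes[$firstToken]['start'] + $tokenAttributes[$lastToken]['end'];
}

// A node spanning all three tokens starts on line 1 and ends on line 3.
$attributes = attributesForNode($tokenAttributes, 0, 2);
var_dump($attributes);
```

Under this picture, a custom attribute such as `originalValue` ends up on the node whether it is written to `$startAttributes` or `$endAttributes`, as long as the literal is a single token.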