Lexer component documentation
=============================

The lexer is responsible for providing tokens to the parser. The project comes with two lexers: `PhpParser\Lexer` and
`PhpParser\Lexer\Emulative`. The latter is an extension of the former, which adds the ability to emulate tokens of
newer PHP versions and thus allows parsing of new code on older versions.

This documentation discusses options available for the default lexers and explains how lexers can be extended.

Lexer options
-------------

The two default lexers accept an `$options` array in the constructor. Currently only the `'usedAttributes'` option is
supported, which allows you to specify which attributes will be added to the AST nodes. The attributes can then be
accessed using `$node->getAttribute()`, `$node->setAttribute()`, `$node->hasAttribute()` and `$node->getAttributes()`
methods. A sample options array:

```php
$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine'
    )
));
```

The attributes used in this example match the default behavior of the lexer. The following attributes are supported:

 * `comments`: Array of `PhpParser\Comment` or `PhpParser\Comment\Doc` instances, representing all comments that occurred
   between the previous non-discarded token and the current one. Use of this attribute is required for the
   `$node->getComments()` and `$node->getDocComment()` methods to work. The attribute is also needed if you wish the pretty
   printer to retain comments present in the original code.
 * `startLine`: Line in which the node starts. This attribute is required for `$node->getLine()` to work. It is also
   required if syntax errors should contain line number information.
 * `endLine`: Line in which the node ends. Required for `$node->getEndLine()`.
 * `startTokenPos`: Offset into the token array of the first token in the node. Required for `$node->getStartTokenPos()`.
 * `endTokenPos`: Offset into the token array of the last token in the node. Required for `$node->getEndTokenPos()`.
 * `startFilePos`: Offset into the code string of the first character that is part of the node. Required for `$node->getStartFilePos()`.
 * `endFilePos`: Offset into the code string of the last character that is part of the node. Required for `$node->getEndFilePos()`.

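Because `endFilePos` refers to the last character of the node rather than the position after it, the length of a node's source text is `endFilePos - startFilePos + 1`. The following is a minimal sketch of that arithmetic, runnable without the parser. The helper name `sliceNodeSource()` and the hand-written offsets are hypothetical; in real code the offsets would be read from a node parsed with `startFilePos` and `endFilePos` listed in `usedAttributes`.

```php
/**
 * Returns the source text covered by a node, given its file position attributes.
 * Hypothetical helper: the offsets are assumed to come from
 * $node->getAttribute('startFilePos') and $node->getAttribute('endFilePos').
 */
function sliceNodeSource(string $code, int $startFilePos, int $endFilePos): string {
    // endFilePos is inclusive, hence the "+ 1".
    return substr($code, $startFilePos, $endFilePos - $startFilePos + 1);
}

$code = '<?php echo 1 + 2;';
// Offsets of the expression "1 + 2", written out by hand for illustration.
echo sliceNodeSource($code, 11, 15);
```
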
### Using token positions

> **Note:** The example in this section is outdated in that this information is directly available in the AST: While
> `$property->isPublic()` does not distinguish between `public` and `var`, directly checking whether
> `($property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0` allows making this distinction without resorting to
> tokens. However, the general idea behind the example still applies in other cases.

The token offset information is useful if you wish to examine the exact formatting used for a node. For example, the AST
does not distinguish whether a property was declared using `public` or using `var`, but you can retrieve this
information based on the token position:

```php
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
    $i = $prop->getAttribute('startTokenPos');
    return $tokens[$i][0] === T_VAR;
}
```

In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using
code similar to the following:

```php
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
    private $tokens;
    public function setTokens(array $tokens) {
        $this->tokens = $tokens;
    }

    public function leaveNode(PhpParser\Node $node) {
        if ($node instanceof PhpParser\Node\Stmt\Property) {
            var_dump(isDeclaredUsingVar($this->tokens, $node));
        }
    }
}

$lexer = new PhpParser\Lexer(array(
    'usedAttributes' => array(
        'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos'
    )
));
$parser = (new PhpParser\ParserFactory)->create(PhpParser\ParserFactory::ONLY_PHP7, $lexer);

$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser();
$traverser->addVisitor($visitor);

try {
    $stmts = $parser->parse($code);
    // Fetch the tokens only after parse() has run, so they belong to $code.
    $visitor->setTokens($lexer->getTokens());
    $stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
    echo 'Parse Error: ', $e->getMessage();
}
```

The same approach can also be used to perform specific modifications in the code, without changing the formatting in
other places (which is the case when using the pretty printer).

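A minimal sketch of such a targeted modification follows. It uses PHP's built-in `token_get_all()` directly so that it runs without the parser, and it assumes the token index of the `var` keyword is already known. With the setup above that index would be the property's `startTokenPos` attribute; here a simple scan stands in for it. The helper name `replaceVarWithPublic()` is hypothetical.

```php
/**
 * Rebuilds the source from a token_get_all()-style array, replacing the `var`
 * keyword at index $pos with `public`. All other tokens, including
 * whitespace and comments, are emitted unchanged.
 */
function replaceVarWithPublic(array $tokens, int $pos): string {
    $code = '';
    foreach ($tokens as $i => $token) {
        // Tokens are either [id, text, line] arrays or single-character strings.
        $text = is_array($token) ? $token[1] : $token;
        if ($i === $pos && is_array($token) && $token[0] === T_VAR) {
            $text = 'public';
        }
        $code .= $text;
    }
    return $code;
}

$source = "<?php\nclass A {\n    var   \$x; // keep me\n}\n";
$tokens = token_get_all($source);

// Stand-in for $prop->getAttribute('startTokenPos'): locate the T_VAR token by scanning.
$pos = null;
foreach ($tokens as $i => $token) {
    if (is_array($token) && $token[0] === T_VAR) {
        $pos = $i;
        break;
    }
}

// The unusual spacing and the trailing comment survive the rewrite.
echo replaceVarWithPublic($tokens, $pos);
```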

Lexer extension
---------------

A lexer has to define the following public interface:

```php
function startLexing(string $code, ErrorHandler $errorHandler = null): void;
function getTokens(): array;
function handleHaltCompiler(): string;
function getNextToken(string &$value = null, array &$startAttributes = null, array &$endAttributes = null): int;
```

The `startLexing()` method is invoked whenever the `parse()` method of the parser is called and is passed the source
code that is to be lexed (including the opening tag). It can be used to reset state or preprocess the source code or
tokens. The passed `ErrorHandler` should be used to report lexing errors.

The `getTokens()` method returns the current token array, in the usual `token_get_all()` format. This method is not
used by the parser (which uses `getNextToken()`), but is useful in combination with the token position attributes.

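For reference, entries in the `token_get_all()` format are either a three-element array (token ID, text, line number) or a plain string for single-character tokens. The sketch below uses only PHP built-ins to show both shapes; the helper name `describeToken()` and the `'CHAR'` label are hypothetical.

```php
/**
 * Normalizes one token_get_all() entry to a [name, text] pair.
 */
function describeToken($token): array {
    if (is_array($token)) {
        // [0] => token ID, [1] => text, [2] => line number
        return [token_name($token[0]), $token[1]];
    }
    // Single-character tokens such as ';' or '=' are plain strings.
    return ['CHAR', $token];
}

foreach (token_get_all('<?php $a = 1;') as $i => $token) {
    [$name, $text] = describeToken($token);
    printf("%d: %s %s\n", $i, $name, var_export($text, true));
}
```

The array indexes printed here are the kind of offsets that `startTokenPos` and `endTokenPos` refer to.
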
The `handleHaltCompiler()` method is called whenever a `T_HALT_COMPILER` token is encountered. It has to return the
remaining string after the construct (not including `();`).

The `getNextToken()` method returns the ID of the next token (as defined by the `Parser::T_*` constants). If no more
tokens are available, it must return `0`, which is the ID of the `EOF` token. Furthermore, the string content of the
token should be written into the by-reference `$value` parameter (which will then be available as `$n` in the parser).

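The calling convention of `getNextToken()` can be illustrated without the parser. The sketch below uses a hypothetical `ReplayLexer` stub that replays a fixed list of `[id, value]` pairs instead of lexing real code. The token IDs `1` and `2` are made up and do not correspond to real `Parser::T_*` constants; the stub only demonstrates the by-reference `$value` parameter and the `0` return value at end of input.

```php
/** Hypothetical stub: replays pre-made tokens instead of lexing real code. */
class ReplayLexer {
    private $tokens;
    private $pos = 0;

    public function __construct(array $tokens) {
        $this->tokens = $tokens;
    }

    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null): int {
        if (!isset($this->tokens[$this->pos])) {
            return 0; // ID of the EOF token
        }
        // Hand the token text back through the by-reference parameter.
        [$id, $value] = $this->tokens[$this->pos++];
        $startAttributes = [];
        $endAttributes = [];
        return $id;
    }
}

$lexer = new ReplayLexer([[1, 'echo'], [2, '42']]);
$values = [];
// Consume tokens the way a caller would: until the EOF ID 0 is returned.
while (($id = $lexer->getNextToken($value)) !== 0) {
    $values[] = $value;
}
```
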
### Attribute handling

The other two by-ref variables `$startAttributes` and `$endAttributes` define which attributes will eventually be
assigned to the generated nodes: The parser will take the `$startAttributes` from the first token which is part of the
node and the `$endAttributes` from the last token that is part of the node.

E.g. if the tokens `T_FUNCTION T_STRING ... '{' ... '}'` constitute a node, then the `$startAttributes` from the
`T_FUNCTION` token will be taken and the `$endAttributes` from the `'}'` token.

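The rule can be modelled with plain arrays. This is a simplified illustration under stated assumptions, not the parser's actual code: the helper `attributesForNode()`, the `'start'`/`'end'` keys and the sample line numbers are all hypothetical.

```php
/**
 * Combines the start attributes of a node's first token with the end
 * attributes of its last token.
 *
 * @param array $tokens List of ['start' => array, 'end' => array] entries, one per token
 * @param int   $first  Index of the first token that is part of the node
 * @param int   $last   Index of the last token that is part of the node
 */
function attributesForNode(array $tokens, int $first, int $last): array {
    return $tokens[$first]['start'] + $tokens[$last]['end'];
}

// Attributes a lexer might have reported for three tokens of a function declaration.
$tokens = [
    ['start' => ['startLine' => 3], 'end' => ['endLine' => 3]], // T_FUNCTION
    ['start' => ['startLine' => 3], 'end' => ['endLine' => 3]], // T_STRING
    ['start' => ['startLine' => 7], 'end' => ['endLine' => 7]], // '}'
];

// The node starts at T_FUNCTION (index 0) and ends at '}' (index 2).
var_dump(attributesForNode($tokens, 0, 2));
```
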
An application of custom attributes is storing the exact original formatting of literals: While the parser does retain
some information about the formatting of integers (like decimal vs. hexadecimal) or strings (like used quote type), it
does not preserve the exact original formatting (e.g. leading zeros for integers or escape sequences in strings). This
can be remedied by storing the original value in an attribute:

```php
use PhpParser\Lexer;
use PhpParser\Parser\Tokens;

class KeepOriginalValueLexer extends Lexer // or Lexer\Emulative
{
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);

        if ($tokenId == Tokens::T_CONSTANT_ENCAPSED_STRING   // non-interpolated string
            || $tokenId == Tokens::T_ENCAPSED_AND_WHITESPACE // interpolated string
            || $tokenId == Tokens::T_LNUMBER                 // integer
            || $tokenId == Tokens::T_DNUMBER                 // floating point number
        ) {
            // could also use $startAttributes, doesn't really matter here
            $endAttributes['originalValue'] = $value;
        }

        return $tokenId;
    }
}
```