Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a
`{type: String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.
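
To make the one-level limit concrete, here is a minimal sketch. The
`interpolation` pattern and the `matchesInterpolation` helper are hypothetical,
simplified stand-ins written for this illustration; they are not the pattern
js-tokens actually uses. The pattern accepts a `${...}` interpolation whose
braces nest at most one level deep, and rejects a second level, because a regex
without recursion has no way to count braces.

```js
// Hypothetical, simplified pattern (NOT the js-tokens source).
// Inside "${" ... "}" it allows either a non-brace character or one
// "{...}" group that itself contains no further braces.
var interpolation = /^\$\{(?:[^{}]|\{[^{}]*\})*\}$/

// Returns true if `source` is a complete interpolation the pattern can balance.
function matchesInterpolation(source) {
  return interpolation.test(source)
}

console.log(matchesInterpolation("${a}"))               // true
console.log(matchesInterpolation("${fn({a: 1})}"))      // true
console.log(matchesInterpolation("${fn({a: {b: 1}})}")) // false
```

Supporting one more level would mean writing out another copy of the inner
group by hand, so a fixed depth is the practical compromise for a single regex.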

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section).

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` lines:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the amount of
ambiguous cases.
So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex literals
  apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.
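
As a rough illustration of the lookbehind idea, the sketch below guesses
whether a line contains the start of a regex literal by looking at what
precedes each `/`. The `regexStart` pattern and the `looksLikeRegexStart`
helper are assumptions made up for this example, not part of js-tokens, and the
heuristic is deliberately naive: it ignores strings, comments and keywords such
as `return`. It needs a runtime with ES2018 lookbehind support.

```js
// Hypothetical heuristic, NOT the js-tokens implementation.
// A "/" is taken to start a regex literal when, ignoring spaces and tabs,
// it is preceded by the start of the input or by punctuation after which an
// operand is expected. A "/" followed by "/" or "*" starts a comment instead.
var regexStart = /(?<=(?:^|[(,=:\[!&|?{;])[ \t]*)\/(?![*\/])/

// Returns true if some "/" in `line` looks like the start of a regex literal.
function looksLikeRegexStart(line) {
  return regexStart.test(line)
}

console.log(looksLikeRegexStart("var regex = / 2/g"))      // true
console.log(looksLikeRegexStart("var number = bar / 2/g")) // false
console.log(looksLikeRegexStart("foo(/= 2/g)"))            // true
console.log(looksLikeRegexStart("foo /= 2/g"))             // false
```

The guess is right for the four ambiguous lines used earlier in this section,
but a real tokenizer would need many more rules than this one class of
preceding characters.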

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html


License
=======

[MIT](LICENSE).