cscg24-guacamole

CSCG 2024 Challenge 'Guacamole Mashup'
git clone https://git.sinitax.com/sinitax/cscg24-guacamole

README.md (7388B)


Overview [![Build Status](https://travis-ci.org/lydell/js-tokens.svg?branch=master)](https://travis-ci.org/lydell/js-tokens)
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default

var jsString = "var foo=opts.foo;\n..."

jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```


Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```


Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.
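
Because each match starts where the previous one ended, joining all matches
gives back the input. The following is a minimal sketch of that property. Both
`standIn` (a deliberately crude regex used so the snippet runs without the
package installed) and `collectMatches` are hypothetical names for
illustration, not part of js-tokens.

```js
// Hypothetical stand-in for jsTokens: whitespace runs, word runs, any other
// single character, or the empty match at the end of input.
var standIn = /\s+|\w+|[^\s\w]|$/g

// Illustrative helper (not part of js-tokens): collect every match in order.
function collectMatches(regex, source) {
  var values = []
  var match
  regex.lastIndex = 0
  while ((match = regex.exec(source)) !== null) {
    // A zero-length match would never advance lastIndex, so stop there.
    if (match[0] === "") break
    values.push(match[0])
  }
  return values
}

var source = "var foo=opts.foo;\n"
var values = collectMatches(standIn, source)

// Contiguous matches concatenate back to the original string.
console.log(values.join("") === source) // true
```

The same loop should work with the real `jsTokens` regex in place of `standIn`.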

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a `{type:
String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.
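
Telling the flavors apart could look roughly like the sketch below.
`describeToken` is a hypothetical helper, not part of js-tokens, and the sample
tokens are written by hand in the `{type, value}` shape described above so the
snippet runs on its own.

```js
// Illustrative helper (not part of js-tokens). `token` is assumed to have the
// `{type, value}` shape described above.
function describeToken(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line comment" : "multi-line comment"
  }
  if (token.type === "string") {
    if (value[0] === "`") return "template string"
    return value[0] === "'" ? "single-quoted string" : "double-quoted string"
  }
  // All other types have a single flavor.
  return token.type
}

// Hand-written tokens, standing in for real matchToToken output.
var samples = [
  {type: "comment", value: "// note"},
  {type: "comment", value: "/* note */"},
  {type: "string", value: "`a${b}`"},
  {type: "string", value: "'a'"},
  {type: "name", value: "foo"}
]

console.log(samples.map(describeToken))
```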

[is-keyword-js]: https://github.com/crissdev/is-keyword-js


ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.


Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.
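
The behavior for unterminated strings and comments can be sketched with two
simplified patterns. These are assumptions made for illustration and are not
the patterns js-tokens uses. `matchWithClosed` is likewise a hypothetical
helper that mimics the `closed` property.

```js
// Simplified, hypothetical patterns (not the ones js-tokens uses).
// Group 1 captures the closing delimiter when it is present.
var doubleQuoted = /"(?:[^"\\\r\n]|\\[\s\S])*(")?/
var multiLineComment = /\/\*(?:[^*]|\*(?!\/))*(\*\/)?/

// Illustrative helper: build a token-like object with a `closed` property.
function matchWithClosed(regex, type, source) {
  var match = regex.exec(source)
  if (match === null) return null
  return {type: type, value: match[0], closed: match[1] !== undefined}
}

// The string token stops before the newline and is marked as not closed.
var openString = matchWithClosed(doubleQuoted, "string", '"abc\nrest')
console.log(openString)

// The comment token runs to the end of the input.
var openComment = matchWithClosed(multiLineComment, "comment", "/* open\nmore")
console.log(openComment)
```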

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except Unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.


Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.
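
To give a feel for what one level of nesting means in a regex, here is a rough
sketch. The `template` pattern is a simplified assumption written for this
example, not the pattern js-tokens uses.

```js
// Simplified, hypothetical pattern (not the one js-tokens uses).
// Inside `${ ... }` it accepts non-brace characters, or one nested `{ ... }`
// group whose content contains no `}`.
var template = /`(?:[^`\\$]|\\[\s\S]|\$(?!\{)|\$\{(?:[^{}]|\{[^}]*\})*\})*`/

// The interpolation contains an object literal: one level of nested braces.
var source = "`a ${ {b: 1}.b } c` + x"
var match = template.exec(source)

// The whole template string, interpolation included, is a single match.
console.log(match[0])
```

Every additional level of nesting would need another hand-written layer in the
pattern, which is why the depth has to be capped somewhere.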

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g

var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section).

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` rows:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.
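
The ambiguity is easy to reproduce with any forward-only pattern. In the sketch
below, `regexLiteral` is a simplified, hypothetical regex-literal pattern (not
the one js-tokens uses).

```js
// Simplified, hypothetical regex-literal pattern (not the one js-tokens uses).
var regexLiteral = /\/(?![*\/])(?:[^\/\\\r\n]|\\.)+\/[gmiyus]{0,6}/

var divisionLine = "var number = bar / 2/g"
var regexLine = "var regex = / 2/g"

// Both lines yield the same match, although only the second one actually
// contains a regex literal.
console.log(regexLiteral.exec(divisionLine)[0]) // "/ 2/g"
console.log(regexLiteral.exec(regexLine)[0]) // "/ 2/g"
```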

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.
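
A rough sketch of the flag restriction follows. `strictFlags` is a simplified,
hypothetical pattern (not the one js-tokens uses) that allows up to six letters
from `gmiyus` and rejects the candidate if another identifier character comes
right after them.

```js
// Simplified, hypothetical pattern (not the one js-tokens uses): up to six
// flag letters, which must not be followed by another identifier character.
var strictFlags = /\/(?![*\/])(?:[^\/\\\r\n]|\\.)+\/[gmiyus]{0,6}(?![A-Za-z0-9_$])/

// `e` is not a flag letter, so nothing regex-like is found: division.
console.log(strictFlags.exec("var number = bar / 2/e")) // null

// `g` is a flag letter, so the ambiguity remains.
console.log(strictFlags.exec("var number = bar / 2/g")[0]) // "/ 2/g"
```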

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.
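
The two rules above can be approximated with a negative lookahead. The
`flaglessRegex` pattern below is a simplified assumption for illustration (not
the one js-tokens uses). It only handles flagless candidates and only a few
sample operators.

```js
// Simplified, hypothetical pattern (not the one js-tokens uses): a flagless
// regex-like candidate is rejected when the next token starts a string,
// number or name, or is one of a few sample operators.
var flaglessRegex = /\/(?![*\/])(?:[^\/\\\r\n]|\\.)+\/(?!\s*(?:["'`\w$]|[+*]|&&|==))/

// Followed by a name: treated as division.
console.log(flaglessRegex.exec("a / b / c")) // null

// Followed by an operator: treated as division.
console.log(flaglessRegex.exec("x = /ab/ + 1")) // null

// Followed by a member access: kept as a regex literal.
console.log(flaglessRegex.exec("x = /ab/.test(s)")[0]) // "/ab/"
```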

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex
  literals apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.
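
As a rough idea of what the first two features could offer, consider the
following sketch. Both patterns are hypothetical and not part of js-tokens, and
the list of punctuators in the lookbehind is only a small sample. The snippet
needs a runtime with ES2018 regex support.

```js
// Hypothetical sketches, not part of js-tokens.

// Unicode property escapes: an identifier-like pattern without large
// hand-written character ranges.
var identifierName = /[\p{ID_Start}$_][\p{ID_Continue}$\u200c\u200d]*/u

// Lookbehind: only accept a regex-like candidate when it follows one of a few
// sample punctuators (optionally separated by whitespace).
var regexAfterPunctuator = /(?<=[=(,:;!&|?]\s*)\/(?![*\/])(?:[^\/\\\r\n]|\\.)+\/[gmiyus]{0,6}/

// The arrow is not an identifier character, so the match starts at "h".
console.log(identifierName.exec("→ héllo")[0]) // "héllo"

// With the preceding context available, the examples from the division
// section can be told apart.
console.log(regexAfterPunctuator.exec("var number = bar / 2/g")) // null
console.log(regexAfterPunctuator.exec("var regex = / 2/g")[0]) // "/ 2/g"
console.log(regexAfterPunctuator.exec("foo /= 2/g")) // null
console.log(regexAfterPunctuator.exec("foo(/= 2/g)")[0]) // "/= 2/g"
```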

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html


License
=======

[MIT](LICENSE).