Using regex and terminal attributes

The #[teleparse(regex(...))] and #[teleparse(terminal(...))] attributes define the pattern the lexer matches for a token type and the terminals of that type in the syntax tree.

The simplest example defines a single terminal struct along with a regex. The lexer produces the token type when the remaining source code matches the regex.

use teleparse::prelude::*;

#[derive_lexicon]
#[teleparse(terminal_parse)]
pub enum MyToken {
    #[teleparse(regex(r"\w+"), terminal(Ident))]
    Ident,
}

fn main() {
    assert_eq!(Ident::parse("hell0"), Ok(Some(Ident::from_span(0..5))));
}

You can also add terminals that must match a specific literal value.

use teleparse::prelude::*;

#[derive_lexicon]
#[teleparse(terminal_parse)]
pub enum MyToken {
    #[teleparse(regex(r"\w+"), terminal(Ident, KwClass = "class"))]
    Word,
}

fn main() {
    let source = "class";
    // can be parsed as Ident and KwClass
    assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..5))));
    assert_eq!(KwClass::parse(source), Ok(Some(KwClass::from_span(0..5))));
    // other words can only be parsed as Ident
    let source = "javascript";
    assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..10))));
    assert_eq!(KwClass::parse(source), Ok(None));
}

Note that there's no "conflict" here! The regex is for the lexer, and the literals are for the parser. When seeing "class" in the source, the lexer will produce a Word token with the content "class". It is up to the parsing context whether an Ident or a KwClass is expected.
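The split between lexer and parser can be illustrated with a small model that does not use teleparse at all. This is only a sketch: lex_word, accepts_as_ident and accepts_as_kw_class are hypothetical helpers invented for this illustration, and the word rule is simplified to ASCII letters, digits and underscores rather than the full \w+ regex.

```rust
// Hypothetical stand-in for the lexer rule r"\w+" (ASCII only): returns the
// word at the start of `input`, if any. Every such token has type `Word`.
fn lex_word(input: &str) -> Option<&str> {
    let len = input
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
        .count();
    // All counted bytes are ASCII, so `len` is a valid char boundary.
    if len > 0 { Some(&input[..len]) } else { None }
}

// Hypothetical parser-side checks: both accept a `Word` token, but the
// keyword check additionally requires the token's content to be "class".
fn accepts_as_ident(_word: &str) -> bool {
    true
}

fn accepts_as_kw_class(word: &str) -> bool {
    word == "class"
}

fn main() {
    // The lexer side produces the same kind of token for both inputs.
    let word = lex_word("class Foo").expect("input starts with a word");
    assert_eq!(word, "class");
    // The parser side decides which terminal the token can stand for.
    assert!(accepts_as_ident(word));
    assert!(accepts_as_kw_class(word));

    let word = lex_word("javascript").expect("input starts with a word");
    assert!(accepts_as_ident(word));
    assert!(!accepts_as_kw_class(word));
}
```

In this model the lexer never needs to know about "class"; only the parser-side check looks at the token's content.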

When such literals are specified for the terminals along with the regex for the variant, derive_lexicon performs some checks at compile time to make sure the literals make sense.

For each literal, the regex must satisfy two conditions:

  • it has a match in the literal that starts at the beginning (position 0)
  • that match is not a proper prefix of the literal

For the first condition, suppose the regex is board and the literal is keyboard. The lexer only produces this token when the rest of the input starts with board, so it can never emit a token with the content keyboard.

use teleparse::prelude::*;

#[derive_lexicon]
pub enum MyToken {
    #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
    DoesNotMatchTerminal,
}

fn main() {}

error: This regex does not match the beginning of `keyboard`. This is likely a mistake, because the terminal will never be matched
 --> tests/ui/lex_regex_not_match_start.rs:5:23
  |
5 |     #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
  |                       ^^^^^^^

For the second condition, suppose the regex is key and the literal is keyboard. The lexer will again never be able to emit keyboard:

  • If it were to emit keyboard of this token type, the rest of the input must start with keyboard
  • However, if so, the lexer would emit key instead

use teleparse::prelude::*;
#[derive_lexicon]
pub enum MyToken {
    #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
    DoesNotMatchTerminal,
}

fn main() {}

error: This regex matches a proper prefix of `keyboard`. This is likely a mistake, because the terminal will never be matched (the prefix will instead)
 --> tests/ui/lex_regex_not_match_is_prefix.rs:4:23
  |
4 |     #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
  |                       ^^^^^
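The two conditions can be summarized in a small standalone model. This is a sketch under assumptions, not how derive_lexicon is implemented: LiteralCheck, check_literal, plain and word are hypothetical names, and the regex is replaced by a function that returns the length of its match at position 0 of the input (if there is one).

```rust
// Outcome of checking one terminal literal against the variant's pattern.
#[derive(Debug, PartialEq)]
enum LiteralCheck {
    Ok,
    NoMatchAtStart, // first condition violated
    ProperPrefix,   // second condition violated
}

// `match_len_at_start` stands in for the regex: it returns the length of
// the match that starts at position 0 of its input, or None.
fn check_literal(
    match_len_at_start: impl Fn(&str) -> Option<usize>,
    literal: &str,
) -> LiteralCheck {
    match match_len_at_start(literal) {
        None => LiteralCheck::NoMatchAtStart,
        Some(len) if len < literal.len() => LiteralCheck::ProperPrefix,
        Some(_) => LiteralCheck::Ok,
    }
}

// Stand-in for a regex that is a plain string, such as "board" or "key".
fn plain(pattern: &'static str) -> impl Fn(&str) -> Option<usize> {
    move |input: &str| {
        if input.starts_with(pattern) { Some(pattern.len()) } else { None }
    }
}

// Stand-in for r"\w+", simplified to ASCII letters, digits and underscores.
fn word(input: &str) -> Option<usize> {
    let len = input
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
        .count();
    if len > 0 { Some(len) } else { None }
}

fn main() {
    assert_eq!(check_literal(word, "class"), LiteralCheck::Ok);
    assert_eq!(check_literal(plain("board"), "keyboard"), LiteralCheck::NoMatchAtStart);
    assert_eq!(check_literal(plain("key"), "keyboard"), LiteralCheck::ProperPrefix);
}
```

The three assertions correspond to the three examples on this page: the word pattern covers all of class, board does not match at the start of keyboard, and key matches only a proper prefix of keyboard.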