Using regex and terminal attributes
The #[teleparse(regex(...))] and #[teleparse(terminal(...))] attributes define the pattern the lexer matches for a token type, and the terminals of that token type in the syntax tree.
The simplest example defines a single terminal struct, along with a regex. The lexer produces the token type when the remaining source code matches the regex.
use teleparse::prelude::*;

#[derive_lexicon]
#[teleparse(terminal_parse)]
pub enum MyToken {
    #[teleparse(regex(r"\w+"), terminal(Ident))]
    Ident,
}

fn main() {
    assert_eq!(Ident::parse("hell0"), Ok(Some(Ident::from_span(0..5))));
}
You can also add terminals that must match a specific literal value.
use teleparse::prelude::*;

#[derive_lexicon]
#[teleparse(terminal_parse)]
pub enum MyToken {
    #[teleparse(regex(r"\w+"), terminal(Ident, KwClass = "class"))]
    Word,
}

fn main() {
    let source = "class";
    // can be parsed as Ident and KwClass
    assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..5))));
    assert_eq!(KwClass::parse(source), Ok(Some(KwClass::from_span(0..5))));
    // other words can only be parsed as Ident
    let source = "javascript";
    assert_eq!(Ident::parse(source), Ok(Some(Ident::from_span(0..10))));
    assert_eq!(KwClass::parse(source), Ok(None));
}
Note that there's no "conflict" here! The regex is for the lexer, and the literals are for the parser. When seeing "class" in the source, the lexer will produce a Word token with the content "class".
It is up to the parsing context whether an Ident or a KwClass is expected.
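The split between lexer and parser can be modeled with a few lines of plain Rust. This is only a sketch to illustrate the idea: the types `TokenType` and `Token` and the functions `lex_word`, `accepts_ident`, and `accepts_kw_class` are hypothetical names, not part of teleparse, and the word matcher only approximates the regex `\w+`.

```rust
// Illustrative model only: these types and functions are hypothetical
// and are not teleparse's API.

/// The single token type the lexer knows about (mirrors the `Word` variant).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenType {
    Word,
}

/// A lexed token: its type plus the slice of source it covers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token<'s> {
    ty: TokenType,
    content: &'s str,
}

/// Lexer side: take the longest run of word characters at the start of `rest`.
/// This approximates the regex `\w+`; the lexer never looks at literals.
fn lex_word(rest: &str) -> Option<Token<'_>> {
    let len = rest
        .char_indices()
        .find(|(_, c)| !(c.is_alphanumeric() || *c == '_'))
        .map_or(rest.len(), |(i, _)| i);
    if len == 0 {
        None
    } else {
        Some(Token { ty: TokenType::Word, content: &rest[..len] })
    }
}

/// Parser side: an `Ident` accepts any `Word` token.
fn accepts_ident(token: &Token<'_>) -> bool {
    token.ty == TokenType::Word
}

/// Parser side: a `KwClass` accepts a `Word` token only if its content is "class".
fn accepts_kw_class(token: &Token<'_>) -> bool {
    token.ty == TokenType::Word && token.content == "class"
}
```

In this model, `lex_word("class")` yields one `Word` token that both `accepts_ident` and `accepts_kw_class` accept, while the token from `lex_word("javascript")` is accepted only by `accepts_ident`.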
When literals are specified for the terminals along with the regex for the variant, derive_lexicon will do some checks at compile time to make sure the literals make sense.
For each literal, the regex must satisfy two conditions:
- it must have a match in the literal that starts at the beginning (position 0)
- that match must not be a proper prefix of the literal
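The two conditions can be sketched as a small check in plain Rust. This is a hypothetical model, not the code derive_lexicon actually runs: the regex is replaced by a closure that returns the length of its match at position 0, and the names `LiteralCheckError`, `check_literal`, `literal_pattern`, and `word_pattern` are made up for illustration.

```rust
// Hypothetical sketch of the two checks; not the code `derive_lexicon` runs.

#[derive(Debug, PartialEq, Eq)]
enum LiteralCheckError {
    /// The regex has no match starting at position 0 of the literal.
    NoMatchAtStart,
    /// The regex matches at position 0, but only a proper prefix of the literal.
    MatchesProperPrefix,
}

/// `match_len_at_start` stands in for the regex: given some text, it returns
/// the length of the match that begins at position 0, or `None` if there is none.
fn check_literal(
    match_len_at_start: impl Fn(&str) -> Option<usize>,
    literal: &str,
) -> Result<(), LiteralCheckError> {
    match match_len_at_start(literal) {
        None => Err(LiteralCheckError::NoMatchAtStart),
        Some(len) if len < literal.len() => Err(LiteralCheckError::MatchesProperPrefix),
        Some(_) => Ok(()),
    }
}

/// Stand-in for a regex that is a plain string, such as `board` or `key`.
fn literal_pattern(pattern: &'static str) -> impl Fn(&str) -> Option<usize> {
    move |text: &str| text.starts_with(pattern).then(|| pattern.len())
}

/// Stand-in for the regex `\w+` (approximated with alphanumerics and `_`).
fn word_pattern(text: &str) -> Option<usize> {
    let len = text
        .char_indices()
        .find(|(_, c)| !(c.is_alphanumeric() || *c == '_'))
        .map_or(text.len(), |(i, _)| i);
    (len > 0).then_some(len)
}
```

Under these assumptions, `check_literal(word_pattern, "class")` passes, `check_literal(literal_pattern("board"), "keyboard")` fails the first condition, and `check_literal(literal_pattern("key"), "keyboard")` fails the second.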
For the first condition, suppose the regex is board and the literal is keyboard.
The lexer only emits this token type when the rest of the input starts with board, so it will never be able to emit a token with the content keyboard.
use teleparse::prelude::*;

#[derive_lexicon]
pub enum MyToken {
    #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
    DoesNotMatchTerminal,
}

fn main() {}
error: This regex does not match the beginning of `keyboard`. This is likely a mistake, because the terminal will never be matched
 --> tests/ui/lex_regex_not_match_start.rs:5:23
  |
5 |     #[teleparse(regex("board"), terminal(Board, Keyboard = "keyboard"))]
  |                       ^^^^^^^
For the second condition, suppose the regex is key and the literal is keyboard.
The lexer will again never be able to emit keyboard:
- If it were to emit keyboard of this token type, the rest of the input must start with keyboard
- However, if so, the lexer would emit key instead
use teleparse::prelude::*;
#[derive_lexicon]
pub enum MyToken {
    #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
    DoesNotMatchTerminal,
}

fn main() {}
error: This regex matches a proper prefix of `keyboard`. This is likely a mistake, because the terminal will never be matched (the prefix will instead)
 --> tests/ui/lex_regex_not_match_is_prefix.rs:4:23
  |
4 |     #[teleparse(regex("key"), terminal(Key, Keyboard = "keyboard"))]
  |                       ^^^^^