Lexical tokenizationis conversion of a text into (semantically or syntactically) meaningfullexical tokensbelonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols anddata types.Lexical tokenization is related to the type of tokenization used inlarge language models(LLMs) but with two differences. First, lexical tokenization is usually based on alexical grammar,whereas LLM tokenizers are usuallyprobability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.

Rule-based programs

edit

A rule-based program, performing lexical tokenization, is calledtokenizer,[1]orscanner,althoughscanneris also a term for the first stage of a lexer. A lexer forms the first phase of acompiler frontendin processing. Analysis generally occurs in one pass. Lexers and parsers are most often used for compilers, but can be used for other computer language tools, such asprettyprintersorlinters.Lexing can be divided into two stages: thescanning,which segments the input string into syntactic units calledlexemesand categorizes these into token classes, and theevaluating,which converts lexemes into processed values.

Lexers are generally quite simple, with most of the complexity deferred to thesyntactic analysisorsemantic analysisphases, and can often be generated by alexer generator,notablylexor derivatives. However, lexers can sometimes include some complexity, such asphrase structureprocessing to make input easier and simplify the parser, and may be written partly or fully by hand, either to support more features or for performance.

Disambiguation of "lexeme"

edit

What is called "lexeme" in rule-basednatural language processingis not equal to what is calledlexemein linguistics. What is called "lexeme" in rule-based natural language processing can be equal to the linguistic equivalent only inanalytic languages,such as English, but not in highlysynthetic languages,such asfusional languages.What is called a lexeme in rule-based natural language processing is more similar to what is called awordin linguistics (not to be confused with aword in computer architecture), although in some cases it may be more similar to amorpheme.

Lexical token and lexical tokenization

edit

Alexical tokenis astringwith an assigned and thus identified meaning, in contrast to the probabilistic token used inlarge language models.A lexical token consists of atoken nameand an optionaltoken value.The token name is a category of a rule-based lexical unit.[2]

Examples of common tokens
Token name

(Lexical category)

Explanation Sample token values
identifier Names assigned by the programmer. x,color,UP
keyword Reserved words of the language. if,while,return
separator/punctuator Punctuation characters and paired delimiters. },(,;
operator Symbols that operate on arguments and produce results. +,<,=
literal Numeric, logical, textual, and reference literals. true,6.02e23,"music"
comment Line or block comments. Usually discarded. /* Retrieves user data */,// must be negative
whitespace Groups of non-printable characters. Usually discarded.

Consider this expression in theCprogramming language:

x=a+b*2;

The lexical analysis of this expression yields the following sequence of tokens:

[(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator,;)]

A token name is what might be termed apart of speechin linguistics.

Lexical tokenizationis the conversion of a raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task ofparsinginput.

For example, in the textstring:

The quick brown fox jumps over the lazy dog

the string is not implicitly segmented on spaces, as anatural languagespeaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string""orregular expression/\s{1}/).

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used insemantic analysis.The parser typically retrieves this information from the lexer and stores it in theabstract syntax tree.This is necessary in order to avoid information loss in the case where numbers may also be valid identifiers.

Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens includeregular expressions,specific sequences of characters termed aflag,specific separating characters calleddelimiters,and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages. A lexical analyzer generally does nothing with combinations of tokens, a task left for aparser.For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".

When a lexer feeds tokens to the parser, the representation used is typically anenumerated type,which is a list of number representations. For example, "Identifier" can be represented with 0, "Assignment operator" with 1, "Addition operator" with 2, etc.

Tokens are often defined byregular expressions,which are understood by a lexical analyzer generator such aslex,or handcoded equivalentfinite state automata.The lexical analyzer (generated automatically by a tool like lex or hand-crafted) reads in a stream of characters, identifies thelexemesin the stream, and categorizes them into tokens. This is termedtokenizing.If the lexer finds an invalid token, it will report an error.

Following tokenizing isparsing.From there, the interpreted data may be loaded into data structures for general use, interpretation, orcompiling.

Lexical grammar

edit

The specification of aprogramming languageoften includes a set of rules, thelexical grammar,which defines the lexical syntax. The lexical syntax is usually aregular language,with the grammar rules consisting ofregular expressions;they define the set of possible character sequences (lexemes) of a token. A lexer recognizes strings, and for each kind of string found, the lexical program takes an action, most simply producing a token.

Two important common lexical categories arewhite spaceandcomments.These are also defined in the grammar and processed by the lexer but may be discarded (not producing any tokens) and considerednon-significant,at most separating two tokens (as inif xinstead ofifx). There are two important exceptions to this. First, inoff-side rulelanguages that delimitblockswith indenting, initial whitespace is significant, as it determines block structure, and is generally handled at the lexer level; seephrase structure,below. Secondly, in some uses of lexers, comments and whitespace must be preserved – for examples, aprettyprinteralso needs to output the comments and some debugging tools may provide messages to the programmer showing the original source code. In the 1960s, notably forALGOL,whitespace and comments were eliminated as part of theline reconstructionphase (the initial phase of thecompiler frontend), but this separate phase has been eliminated and these are now handled by the lexer.

Details

edit

Scanner

edit

The first stage, thescanner,is usually based on afinite-state machine(FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are termedlexemes). For example, anintegerlexeme may contain any sequence ofnumerical digitcharacters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is termed themaximal munch,orlongest match,rule). In some languages, the lexeme creation rules are more complex and may involvebacktrackingover previously read characters. For example, in C, one 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.

Evaluator

edit

Alexeme,however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, theevaluator,which goes over the characters of the lexeme to produce avalue.The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing: Only the type is needed. Similarly, sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments. The evaluators for identifiers are usually simple (literally representing the identifier), but may include someunstropping.The evaluators forinteger literalsmay pass the string on (deferring evaluation to the semantic analysis phase), or may perform evaluation themselves, which can be involved for different bases or floating point numbers. For a simple quoted string literal, the evaluator needs to remove only the quotes, but the evaluator for anescaped string literalincorporates a lexer, which unescapes the escape sequences.

For example, in the source code of a computer program, the string

net_worth_future=(assetsliabilities);

might be converted into the following lexical token stream; whitespace is suppressed and special characters have no value:

IDENTIFIER net_worth_future
EQUALS
OPEN_PARENTHESIS
IDENTIFIER assets
MINUS
IDENTIFIER liabilities
CLOSE_PARENTHESIS
SEMICOLON

Lexers may be written by hand. This is practical if the list of tokens is small, but lexers generated by automated tooling as part of acompiler-compilertoolchainare more practical for a larger number of potential tokens. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with aproduction rulein the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed or construct astate transition tablefor afinite-state machine(which is plugged into template code for compiling and executing).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for anEnglish-based language, an IDENTIFIER token might be any English alphabetic character or an underscore, followed by any number of instances of ASCII alphanumeric characters and/or underscores. This could be represented compactly by the string[a-zA-Z_][a-zA-Z_0-9]*.This means "any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or 0-9".

Regular expressions and the finite-state machines they generate are not powerful enough to handle recursive patterns, such as "nopening parentheses, followed by a statement, followed bynclosing parentheses. "They are unable to keep count, and verify thatnis the same on both sides, unless a finite set of permissible values exists forn.It takes a full parser to recognize such patterns in their full generality. A parser can push parentheses on a stack and then try to pop them off and see if the stack is empty at the end (see example[3]in theStructure and Interpretation of Computer Programsbook).

Obstacles

edit

Typically, lexical tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Often, a tokenizer relies on simple heuristics, for example:

  • Punctuation and whitespace may or may not be included in the resulting list of tokens.
  • All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
  • Tokens are separated bywhitespacecharacters, such as a space or line break, or by punctuation characters.

In languages that use inter-word spaces (such as most that use the Latin alphabet, and most programming languages), this approach is fairly straightforward. However, even here there are many edge cases such ascontractions,hyphenatedwords,emoticons,and larger constructs such asURIs(which for some purposes may count as single tokens). A classic example is "New York-based", which a naive tokenizer may break at the space even though the better break is (arguably) at the hyphen.

Tokenization is particularly difficult for languages written inscriptio continua,which exhibit no word boundaries, such asAncient Greek,Chinese,[4]orThai.Agglutinative languages,such as Korean, also make tokenization tasks complicated.

Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to alanguage modelthat identifies collocations in a later processing step.

Lexer generator

edit

Lexers are often generated by alexer generator,analogous toparser generators,and such tools often come together. The most established islex,paired with theyaccparser generator, or rather some of their many reimplementations, likeflex(often paired withGNU Bison). These generators are a form ofdomain-specific language,taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer.

These tools yield very fast development, which is very important in early development, both to get a working lexer and because a language specification may change often. Further, they often provide advanced features, such as pre- and post-conditions which are hard to program by hand. However, an automatically generated lexer may lack flexibility, and thus may require some manual modification, or an all-manually written lexer.

Lexer performance is a concern, and optimizing is worthwhile, more so in stable languages where the lexer is run very often (such as C or HTML). lex/flex-generated lexers are reasonably fast, but improvements of two to three times are possible using more tuned generators. Hand-written lexers are sometimes used, but modern lexer generators produce faster lexers than most hand-coded ones. The lex/flex family of generators uses a table-driven approach which is much less efficient than the directly coded approach.[dubiousdiscuss]With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements. Tools likere2c[5]have proven to produce engines that are between two and three times faster than flex produced engines.[citation needed]It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.

Phrase structure

edit

Lexical analysis mainly segments the input stream of characters into tokens, simply grouping the characters into pieces and categorizing them. However, the lexing may be significantly more complex; most simply, lexers may omit tokens or insert added tokens. Omitting tokens, notably whitespace and comments, is very common when these are not needed by the compiler. Less commonly, added tokens may be inserted. This is done mainly to group tokens intostatements,or statements into blocks, to simplify the parser.

Line continuation

edit

Line continuationis a feature of some languages where a newline is normally a statement terminator. Most often, ending a line with a backslash (immediately followed by anewline) results in the line beingcontinued– the following line isjoinedto the prior line. This is generally done in the lexer: The backslash and newline are discarded, rather than the newline being tokenized. Examples includebash,[6]other shell scripts and Python.[7]

Semicolon insertion

edit

Many languages use the semicolon as a statement terminator. Most often this is mandatory, but in some languages the semicolon is optional in many contexts. This is mainly done at the lexer level, where the lexer outputs a semicolon into the token stream, despite one not being present in the input character stream, and is termedsemicolon insertionorautomatic semicolon insertion.In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer. Optional semicolons or other terminators or separators are also sometimes handled at the parser level, notably in the case oftrailing commasor semicolons.

Semicolon insertion is a feature ofBCPLand its distant descendantGo,[8]though it is absent in B or C.[9]Semicolon insertion is present inJavaScript,though the rules are somewhat complex and much-criticized; to avoid bugs, some recommend always using semicolons, while others use initial semicolons, termeddefensive semicolons,at the start of potentially ambiguous statements.

Semicolon insertion (in languages with semicolon-terminated statements) and line continuation (in languages with newline-terminated statements) can be seen as complementary: Semicolon insertion adds a token even though newlines generally donotgenerate tokens, while line continuation prevents a token from being generated even though newlines generallydogenerate tokens.

Off-side rule

edit

Theoff-side rule(blocks determined by indenting) can be implemented in the lexer, as inPython,where increasing the indenting results in the lexer emitting an INDENT token and decreasing the indenting results in the lexer emitting one or more DEDENT tokens.[10]These tokens correspond to the opening brace{and closing brace}in languages that use braces for blocks and means that the phrase grammar does not depend on whether braces or indenting are used. This requires that the lexer hold state, namely a stack of indent levels, and thus can detect changes in indenting when this changes, and thus the lexical grammar is notcontext-free:INDENT–DEDENT depend on the contextual information of prior indent levels.

Context-sensitive lexing

edit

Generally lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation. This also allows simple one-way communication from lexer to parser, without needing any information flowing back to the lexer.

There are exceptions, however. Simple examples include semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python,[7]which requires holding one token in a buffer before emitting it (to see if the next token is another string literal); and the off-side rule in Python, which requires maintaining a count of indent level (indeed, a stack of each indent level). These examples all only require lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases.

A more complex example isthe lexer hackin C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase sincetypedefnames and variable names are lexically identical but constitute different token classes. Thus in the hack, the lexer calls the semantic analyzer (say, symbol table) and checks if the sequence requires a typedef name. In this case, information must flow back not from the parser only, but from the semantic analyzer back to the lexer, which complicates design.

See also

edit

References

edit
  1. ^"Anatomy of a Compiler and The Tokenizer".www.cs.man.ac.uk.
  2. ^page 111, "Compilers Principles, Techniques, & Tools, 2nd Ed." (WorldCat) by Aho, Lam, Sethi and Ullman, as quoted inhttps://stackoverflow.com/questions/14954721/what-is-the-difference-between-token-and-lexeme
  3. ^"Structure and Interpretation of Computer Programs".mitpress.mit.edu.Archived fromthe originalon 2012-10-30.Retrieved2009-03-07.
  4. ^Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007)Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification
  5. ^Bumbulis, P.; Cowan, D. D. (Mar–Dec 1993)."RE2C: A more versatile scanner generator".ACM Letters on Programming Languages and Systems.2(1–4): 70–84.doi:10.1145/176454.176487.S2CID14814637.
  6. ^Bash Reference Manual,3.1.2.1 Escape Character
  7. ^ab"3.6.4 Documentation".docs.python.org.
  8. ^Effective Go,"Semicolons"
  9. ^"Semicolons in Go",golang-nuts, Rob 'Commander' Pike, 12/10/09
  10. ^"Lexical analysis > Indentation".The Python Language Reference.Retrieved21 June2023.

Sources

edit
  • Compiling with C# and Java,Pat Terry, 2005,ISBN032126360X
  • Algorithms + Data Structures = Programs,Niklaus Wirth, 1975,ISBN0-13-022418-9
  • Compiler Construction,Niklaus Wirth, 1996,ISBN0-201-40353-6
  • Sebesta, R. W. (2006). Concepts of programming languages (Seventh edition) pp. 177. Boston: Pearson/Addison-Wesley.
edit