This abstract class contains a DSL for hand-coding tokenizers. Subclass it to implement tokenizers for specific grammars.
Tokenizers are state machines. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).
The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit character, it shifts to :get_integer_literal and pushes a token onto the stack, on which it accumulates the value of the literal. When it next encounters a non-digit character, it shifts back to the idle state. Whitespace is treated as a delimiter, but no token is created for it.
  class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer
    digits = ('0'..'9').to_a
    parenths = ['(', ')']
    operators = ['-', '+', '/', '*', '^']
    functions = ['h', 'l']
    arg_separator = [',']
    whitespace = [' ']

    all_characters = digits + parenths + operators + functions + arg_separator + whitespace

    for_state Dhaka::TOKENIZER_IDLE_STATE do
      for_characters(all_characters - (digits + whitespace)) do
        create_token(curr_char, nil)
        advance
      end
      for_characters digits do
        create_token('n', '')
        switch_to :get_integer_literal
      end
      for_character whitespace do
        advance
      end
    end

    for_state :get_integer_literal do
      for_characters all_characters - digits do
        switch_to Dhaka::TOKENIZER_IDLE_STATE
      end
      for_characters digits do
        curr_token.value << curr_char
        advance
      end
    end
  end
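The control flow of this two-state machine can be sketched in plain Ruby, without the Dhaka DSL. The names below (Token, tokenize_arithmetic) are illustrative and not part of Dhaka; unlike the DSL version, non-digit characters seen in the integer-literal state are handled inline rather than by switching back to the idle state without advancing.

```ruby
# A minimal two-state tokenizer mirroring the class above:
# single-character tokens in the idle state, digit runs
# accumulated into one 'n' token, whitespace skipped.
Token = Struct.new(:symbol_name, :value)

def tokenize_arithmetic(input)
  tokens = []
  state  = :idle
  input.each_char do |char|
    case state
    when :idle
      if char =~ /\d/
        tokens << Token.new('n', char)   # start an integer literal
        state = :get_integer_literal
      elsif char == ' '
        # whitespace delimits tokens but is not emitted
      else
        tokens << Token.new(char, nil)   # single-character token
      end
    when :get_integer_literal
      if char =~ /\d/
        tokens.last.value << char        # accumulate the literal
      elsif char == ' '
        state = :idle
      else
        tokens << Token.new(char, nil)
        state = :idle
      end
    end
  end
  tokens
end

tokenize_arithmetic('10 + 2*3').map(&:symbol_name)  # => ["n", "+", "n", "*", "n"]
```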
For languages where the lexical structure is very complicated, it may be too tedious to implement a Tokenizer by hand. In such cases, it's a lot easier to write a LexerSpecification using regular expressions and create a Lexer from that.
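Dhaka's LexerSpecification has its own API; purely to illustrate what regular-expression-driven lexing looks like, here is a comparable sketch in plain Ruby using the standard library's StringScanner (the RULES table and lex method are illustrative names, not Dhaka API):

```ruby
require 'strscan'

# Regex-driven tokenizing: each pattern maps to a token symbol,
# in the spirit of a lexer specification.
RULES = [
  [/\d+/,         'n'],            # integer literals
  [/[-+*\/^(),]/, :literal_char],  # token symbol is the character itself
  [/\s+/,         nil]             # whitespace: matched but not emitted
]

def lex(input)
  scanner = StringScanner.new(input)
  tokens  = []
  until scanner.eos?
    matched = RULES.find do |pattern, symbol|
      next false unless scanner.scan(pattern)
      if symbol
        name = (symbol == :literal_char) ? scanner.matched : symbol
        tokens << [name, scanner.matched]
      end
      true
    end
    raise "unexpected character at #{scanner.pos}" unless matched
  end
  tokens
end

lex('3 + 4*5')  # => [["n", "3"], ["+", "+"], ["n", "4"], ["*", "*"], ["n", "5"]]
```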
Define the action for the state named state_name.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 122
  def for_state(state_name, &blk)
    states[state_name].instance_eval(&blk)
  end
Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 127
  def tokenize(input)
    new(input).run
  end
Advance to the next character.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 156
  def advance
    @curr_char_index += 1
  end
Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 170
  def create_token(symbol_name, value)
    new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
    tokens << new_token
  end
The character currently being processed.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 151
  def curr_char
    @input[@curr_char_index] and @input[@curr_char_index].chr
  end
The token currently on top of the stack.

  # File lib/dhaka/tokenizer/tokenizer.rb, line 165
  def curr_token
    tokens.last
  end
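The interplay of create_token and curr_token can be mimicked with a plain array acting as the token stack. This is a sketch using a hypothetical DemoToken struct that mirrors the three fields Dhaka::Token is constructed with above:

```ruby
# Hypothetical stand-in for Dhaka::Token and the token stack,
# showing how create_token pushes and curr_token reads the top.
DemoToken = Struct.new(:symbol_name, :value, :input_position)

tokens = []
curr_char_index = 0

# create_token('n', '') pushes a fresh token with the current position...
tokens << DemoToken.new('n', '', curr_char_index)

# ...and curr_token (tokens.last) is where digit characters accumulate:
tokens.last.value << '4'
tokens.last.value << '2'

tokens.last.value  # => "42"
```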