Dhaka::Tokenizer

This abstract class contains a DSL for hand-coding tokenizers. Subclass it to implement tokenizers for specific grammars.

Tokenizers are state machines. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).

The following is a tokenizer for arithmetic expressions with integer terms. It starts in the idle state, where it creates a single-character token for every character except digits and whitespace. When it encounters a digit, it pushes a new token onto the stack and switches to :get_integer_literal, where it accumulates the digits of the literal in that token's value. When it encounters a non-digit character, it switches back to the idle state. Whitespace is treated as a delimiter and is skipped without creating a token.

class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

  digits = ('0'..'9').to_a
  parenths = ['(', ')']
  operators = ['-', '+', '/', '*', '^']
  functions = ['h', 'l']
  arg_separator = [',']
  whitespace = [' ']

  all_characters = digits + parenths + operators + functions + arg_separator + whitespace

  for_state Dhaka::TOKENIZER_IDLE_STATE do
    for_characters(all_characters - (digits + whitespace)) do
      create_token(curr_char, nil)
      advance
    end
    for_characters digits do
      create_token('n', '')
      switch_to :get_integer_literal
    end
    for_character whitespace do
      advance
    end
  end

  for_state :get_integer_literal do
    for_characters all_characters - digits do
      switch_to Dhaka::TOKENIZER_IDLE_STATE
    end
    for_characters digits do
      curr_token.value << curr_char
      advance
    end
  end

end
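
To make the state transitions concrete, here is a plain-Ruby sketch of the same state machine, written without Dhaka. The method name sketch_tokenize and the [symbol_name, value] pair representation are hypothetical, chosen only for illustration. They are not how Dhaka represents tokens or reports errors.

```ruby
# Hypothetical stand-alone sketch of the state machine above; not part of Dhaka.
DIGITS     = ('0'..'9').to_a
WHITESPACE = [' ']
SINGLES    = ['(', ')', '-', '+', '/', '*', '^', 'h', 'l', ',']

def sketch_tokenize(input)
  tokens = []      # each token is a [symbol_name, value] pair
  state  = :idle
  index  = 0

  while index < input.length
    char = input[index]
    case state
    when :idle
      if DIGITS.include?(char)
        tokens << ['n', ''.dup]      # start an integer literal, do not advance
        state = :get_integer_literal
      elsif WHITESPACE.include?(char)
        index += 1                   # skip the delimiter
      elsif SINGLES.include?(char)
        tokens << [char, nil]        # single-character token
        index += 1
      else
        raise ArgumentError, "unexpected character #{char.inspect} at index #{index}"
      end
    when :get_integer_literal
      if DIGITS.include?(char)
        tokens.last[1] << char       # accumulate the digits of the literal
        index += 1
      else
        state = :idle                # re-examine this character in the idle state
      end
    end
  end
  tokens
end

p sketch_tokenize('12 + 3*(4)')
# prints [["n", "12"], ["+", nil], ["n", "3"], ["*", nil], ["(", nil], ["n", "4"], [")", nil]]
```

Note that neither state advances on the character that triggers a state switch, so the same character is examined again in the new state. This mirrors the example above, where switch_to is called without advance.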

For languages where the lexical structure is very complicated, it may be too tedious to implement a Tokenizer by hand. In such cases, it's a lot easier to write a LexerSpecification using regular expressions and create a Lexer from that.

Attributes

tokens[R]

The tokens shifted so far.

Public Class Methods

for_state(state_name, &blk)

Define the action for the state named state_name.

# File lib/dhaka/tokenizer/tokenizer.rb, line 122
def for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
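
Because the block is evaluated with instance_eval, calls such as for_characters inside it are sent to the state object rather than to the tokenizer class. The following is a minimal sketch of that pattern. SketchState and SketchTokenizer are hypothetical stand-ins, and the lazily populated states hash is an assumption; Dhaka's own classes may differ in detail.

```ruby
# Hypothetical stand-ins illustrating the instance_eval pattern; not Dhaka's classes.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same block for every character in the list.
  def for_characters(characters, &blk)
    characters.each { |c| @actions[c] = blk }
  end
end

class SketchTokenizer
  # One state object per name, created on first use (an assumption of this sketch).
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = SketchState.new }
  end

  # Evaluate the block with the state object as self.
  def self.for_state(state_name, &blk)
    states[state_name].instance_eval(&blk)
  end
end

SketchTokenizer.for_state :idle do
  for_characters(['+', '-']) { :operator }
end

p SketchTokenizer.states[:idle].actions.keys   # prints ["+", "-"]
```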

tokenize(input)

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.

# File lib/dhaka/tokenizer/tokenizer.rb, line 127
def tokenize(input)
  new(input).run
end

Public Instance Methods

advance()

Advance to the next character.

# File lib/dhaka/tokenizer/tokenizer.rb, line 156
def advance
  @curr_char_index += 1
end

create_token(symbol_name, value)

Push a new token onto the stack, with the symbol named by symbol_name and the given value.

# File lib/dhaka/tokenizer/tokenizer.rb, line 170
def create_token(symbol_name, value)
  new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
  tokens << new_token
end

curr_char()

The character currently being processed.

# File lib/dhaka/tokenizer/tokenizer.rb, line 151
def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr 
end
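
The `and` guard makes curr_char return nil once the index runs past the end of the input, instead of calling chr on nil. On Ruby 1.8, where indexing a string with an integer returned a character code, chr converted that code back to a one-character string; on later versions it leaves the character unchanged. The standalone sketch below reproduces the expression outside the class, and its variable names are illustrative only.

```ruby
# Standalone illustration of the guard used by curr_char; names are illustrative.
input = 'a+'

char_at = lambda do |index|
  input[index] and input[index].chr
end

p char_at.call(0)   # prints "a"
p char_at.call(1)   # prints "+"
p char_at.call(2)   # prints nil (past the end of the input)
```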

curr_token()

The token currently on top of the stack.

# File lib/dhaka/tokenizer/tokenizer.rb, line 165
def curr_token
  tokens.last
end

inspect()

# File lib/dhaka/tokenizer/tokenizer.rb, line 160
def inspect
  "<Dhaka::Tokenizer grammar : #{grammar}>"
end

switch_to(state_name)

Change the active state of the tokenizer to the state identified by the symbol state_name.

# File lib/dhaka/tokenizer/tokenizer.rb, line 176
def switch_to state_name
  @current_state = self.class.states[state_name]
end
