The Tokenizer converts a given input text into a sequence of tokens (words, punctuation marks, and other symbols), with a sentence marker after the end of each sentence of the input text. The output is produced in SSF format. The input to the Tokenizer can also be CML text. The sentence marker depends on the character representation used for the input text: it can be a full stop, an exclamation mark, a question mark, or a PURNA VIRAM (।), depending on the character encoding.
The Tokenizer will be configured to handle special tokens for each language. For example, for Hindi, tokens such as डॉ. and प्रो. will not be split.
Roman symbols inside the text will be left as they are and shall be preceded by an "@" sign in the output.
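The behaviour described above can be illustrated with a minimal Python sketch. It is not the actual Tokenizer implementation: the function name `tokenize`, the sets `SENTENCE_END` and `PROTECTED`, and the regular expression are hypothetical, "Roman symbols" is assumed to mean runs of Roman letters, and the result is returned as plain Python lists rather than in SSF format.

```python
import re

# Hypothetical configuration, for illustration only.
# U+0964 is the Devanagari danda (purna viram).
SENTENCE_END = {".", "!", "?", "\u0964"}
# Language-specific tokens that must not be split (Hindi examples from the text).
PROTECTED = {"डॉ.", "प्रो."}

ROMAN_RE = re.compile(r"[A-Za-z]+")
# Alternatives, tried left to right: a run of Roman letters, a run of
# Devanagari characters (excluding the dandas U+0964 and U+0965),
# a run of ASCII digits, or any other single non-space character.
PIECE_RE = re.compile(r"[A-Za-z]+|[\u0900-\u0963\u0966-\u097F]+|[0-9]+|\S")


def tokenize(text):
    """Return a list of sentences, each a list of token strings."""
    sentences = []
    current = []
    for chunk in text.split():
        if chunk in PROTECTED:
            # Keep the special token whole; its full stop is not a sentence marker.
            current.append(chunk)
            continue
        for piece in PIECE_RE.findall(chunk):
            if ROMAN_RE.fullmatch(piece):
                # Roman text is kept unchanged but prefixed with "@".
                current.append("@" + piece)
            else:
                current.append(piece)
            if piece in SENTENCE_END:
                # A sentence marker closes the current sentence.
                sentences.append(current)
                current = []
    if current:
        # Trailing tokens without a final sentence marker.
        sentences.append(current)
    return sentences
```

Under these assumptions, `tokenize("डॉ. शर्मा Delhi गए। क्या आप आएंगे?")` yields two sentences: `["डॉ.", "शर्मा", "@Delhi", "गए", "।"]` and `["क्या", "आप", "आएंगे", "?"]`. The Devanagari range is spelled out explicitly because `\w` in Python's `re` module does not match Devanagari vowel signs, which would break words apart.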