The Tokenizer converts input text into a sequence of tokens (words, punctuation marks, and other symbols) and places a sentence marker after the end of each sentence. The output is produced in SSF. The input can also be CML text. The character that ends a sentence depends on the character encoding of the input: it may be a full stop, an exclamation mark, a question mark, or a purna viram.
The Tokenizer can be configured to handle special tokens for each language; for Hindi, for example, tokens such as डॉ. and प्रो. are not split.
Roman symbols in the text are left as they are and are preceded by an "@" sign in the output.
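
The behaviour described above can be illustrated with a minimal Python sketch. It is not the ILMT Tokenizer and does not produce real SSF output: the function name `tokenize`, the `ABBREVIATIONS` set, the punctuation list, and the `<EOS>` placeholder marker are all assumptions made for illustration only.

```python
import re

# Hypothetical per-language configuration: tokens that must not be split.
ABBREVIATIONS = {"डॉ.", "प्रो."}

# Characters treated as separate tokens; "\u0964" is the purna viram.
PUNCT = ".!?,;:\"'()\u0964"
SENTENCE_END = {".", "!", "?", "\u0964"}

# Placeholder sentence marker (an assumption, not the SSF marker).
SENTENCE_MARKER = "<EOS>"

# Match either one punctuation character or a run of non-punctuation text.
_escaped = re.escape(PUNCT)
SPLITTER = re.compile("[" + _escaped + "]|[^\\s" + _escaped + "]+")

# A token made only of Roman letters gets an "@" prefix.
ROMAN = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Return a flat list of tokens with a marker after each sentence."""
    tokens = []
    for chunk in text.split():
        # Configured special tokens are kept whole, including their full stop.
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        for piece in SPLITTER.findall(chunk):
            if ROMAN.fullmatch(piece):
                tokens.append("@" + piece)
            else:
                tokens.append(piece)
            # A sentence-ending character is followed by the marker.
            if piece in SENTENCE_END:
                tokens.append(SENTENCE_MARKER)
    return tokens


if __name__ == "__main__":
    for token in tokenize("डॉ. शर्मा IIIT गए\u0964"):
        print(token)
```

The sketch splits on whitespace first so that configured special tokens can be matched as whole chunks before any punctuation is separated. A special token immediately followed by other punctuation (for example a comma) is not handled here.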

Added on January 12, 2015


  More Details
  • Contributed by : ILMT Consortium, IIIT Hyd
  • Product Type : Linguistic Resources
  • License Type : Research
  • System Requirement : Linux