This module takes free text and produces tokens with sentence boundaries marked.
A token may be a word, an abbreviation, a punctuation mark, a real number, a special symbol, and so on. No token contains white space. Special symbols such as ‘|’ and ‘.’, as well as two consecutive new lines, are treated as end-of-sentence markers. Each period is analyzed to decide whether it is an end-of-sentence marker or not. Abbreviations such as Mr. or Dr. are kept as a single token. When a period is found, a list of acronyms is consulted; based on that list and some rules, the module decides whether the period belongs to an abbreviation. By the end of processing, each sentence contains all the tokens that make it up.
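The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the names `ABBREVIATIONS`, `TOKEN_RE` and `split_sentences` are hypothetical, the abbreviation list is a tiny stand-in for the real acronym list, and the additional rules the module applies to periods are not specified in the text, so only the list lookup is shown.

```python
import re

# Hypothetical stand-in for the module's acronym/abbreviation list.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc."}

# One token per match: a real number, a word with an optional trailing
# period, or a single non-space symbol. White space is never matched,
# so no token can contain it.
TOKEN_RE = re.compile(r"\d+\.\d+|\w+\.?|[^\w\s]")

# Symbols that close a sentence once they stand alone as a token.
SENTENCE_END = {".", "|"}


def split_sentences(text):
    """Return a list of sentences, each a list of tokens."""
    sentences = []
    # Two consecutive new lines always end the current sentence.
    for block in re.split(r"\n\s*\n", text):
        current = []
        for match in TOKEN_RE.finditer(block):
            tok = match.group()
            if len(tok) > 1 and tok.endswith("."):
                if tok.lower() in ABBREVIATIONS:
                    # Known abbreviation: keep "Dr." as one token and
                    # do not treat its period as a sentence end.
                    current.append(tok)
                    continue
                # Otherwise the period is a separate token.
                current.append(tok[:-1])
                tok = "."
            current.append(tok)
            if tok in SENTENCE_END:
                sentences.append(current)
                current = []
        if current:
            sentences.append(current)
    return sentences
```

Under these assumptions, calling `split_sentences("Dr. Smith paid 3.50 dollars. He left | Bye")` keeps `Dr.` and `3.50` as single tokens and yields three sentences, the first ending in `.` and the second in `|`. A list lookup alone cannot tell an abbreviation that also ends a sentence from one that does not, which is presumably where the module's extra rules come in.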