Active learning and crowdsourcing are becoming increasingly popular in the machine learning community for fast, cost-effective generation of labels for large volumes of data. However, such labels may be noisy, so it becomes important to ignore the noisy labels when building a good classifier. We propose a framework for finding the best possible augmentation of a classifier for the character recognition problem using a minimum number of crowd-labeled samples. The approach inherently rejects the noisy data and tries to accept a subset of correctly labeled data to maximize classifier performance.
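The idea of accepting only those crowd labels that help the classifier can be illustrated with a small sketch. This is not the framework proposed in the abstract: it swaps in a plain greedy filter that keeps a crowd-labeled sample only if validation accuracy does not drop, and it uses a toy nearest-centroid classifier on one-dimensional features. All names (`greedy_augment`, `centroids`, `accuracy`) and the data are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, int]  # (1-D feature, class label); toy representation


def centroids(train: List[Sample]) -> Dict[int, float]:
    """Fit a nearest-centroid model: one mean feature value per class."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model: Dict[int, float], x: float) -> int:
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def accuracy(train: List[Sample], validation: List[Sample]) -> float:
    """Train on `train` and report the fraction of correct validation predictions."""
    model = centroids(train)
    hits = sum(1 for x, y in validation if predict(model, x) == y)
    return hits / len(validation)


def greedy_augment(seed: List[Sample], crowd: List[Sample],
                   validation: List[Sample]) -> List[Sample]:
    """Keep a crowd sample only if adding it does not lower validation accuracy.

    Assumption: a trusted validation set exists; the abstract does not say so.
    """
    accepted = list(seed)
    best = accuracy(accepted, validation)
    kept: List[Sample] = []
    for sample in crowd:
        score = accuracy(accepted + [sample], validation)
        if score >= best:
            accepted.append(sample)
            kept.append(sample)
            best = score
    return kept


seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
validation = [(2.0, 0), (4.0, 0), (6.0, 1), (8.0, 1)]
crowd = [(1.5, 0), (9.5, 0)]  # the second label is deliberately wrong
kept = greedy_augment(seed, crowd, validation)
```

In this toy run the mislabeled sample pulls the class-0 centroid toward class 1 and lowers validation accuracy, so the filter discards it while keeping the correctly labeled one.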
The paper presents a novel, script-independent inference framework for character recognition based on conditional random fields (CRFs). In this framework we consider a word as a sequence of connected components. The connected components are obtained using different binarization schemes, and the different possible sequences are represented in a tree structure. The CRF uses contextual information to learn perfect primitive sequences and, using the multiple-hypothesis tree, finds the most probable labeling of the sequence of primitives to form the correct sequence of letters. This approach is particularly suitable for degraded printed document images, as it considers multiple alternative hypotheses before reaching a decision.
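Finding the most probable labeling of a sequence in a linear-chain CRF is commonly done with Viterbi decoding. The sketch below shows that step only, for a single chain of primitives; it does not model the multiple-hypothesis tree or the binarization schemes from the abstract. The function name `viterbi`, the score tables, and the toy labels are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def viterbi(unary: List[Dict[str, float]],
            pairwise: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence for one chain.

    unary[t][label] is the score of `label` at position t (log-potential).
    pairwise[(prev, cur)] is the transition score; missing pairs count as 0.
    """
    best = [dict(unary[0])]           # best[t][label]: best score ending in label
    back: List[Dict[str, str]] = []   # back[t-1][label]: best predecessor label
    for t in range(1, len(unary)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur, u in unary[t].items():
            prev_label, prev_score = max(
                ((p, s + pairwise.get((p, cur), 0.0)) for p, s in best[-1].items()),
                key=lambda item: item[1],
            )
            scores[cur] = prev_score + u
            pointers[cur] = prev_label
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    label = max(best[-1], key=best[-1].get)
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


unary = [{"a": 1.0, "b": 0.8}, {"a": 0.5, "b": 0.6}]
with_context = viterbi(unary, {("b", "b"): 1.0})
without_context = viterbi(unary, {})
```

With the transition bonus for `("b", "b")` the decoder prefers `b` at the first position even though `a` has the higher local score, which is the sense in which context can override an ambiguous individual primitive.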
The paper proposes a novel multi-modal document image retrieval framework that exploits information from both text and graphics regions. The framework applies a hashing formulation based on multiple kernel learning to generate composite document indexes from the different modalities. Existing multimedia management methods for imaged text documents have not addressed the requirements of old and degraded documents. In a subsequent contribution, we propose a novel multi-modal document indexing framework for retrieval of old and degraded text documents that combines OCR'ed text and image-based representations using learning. The proposed concepts are evaluated on sampled magazine cover pages and on documents in Devanagari script.
We propose a new technique for impulse noise filtering that can remove impulse noise from color as well as grayscale images. We operate on the HSI (Hue-Saturation-Intensity) color model. Our algorithm has three phases. In the first phase, we take a window W of size N×N (say, 3×3) and form two groups: a group of color pixels and a group of colorless pixels. We select the group that has the higher count of pixels in W. This allows us to remove the noise due to the colorless pixels from the color pixels and vice versa. In the second phase, if the selected group is a collection of colorless pixels, we find the median pixel in increasing order of intensity values and call it the candidate pixel.
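The first phase and the colorless branch of the second phase can be sketched as follows. The abstract does not say how a pixel is judged colorless, so the sketch assumes a saturation threshold (`SAT_THRESHOLD`, a made-up value); the function names are likewise hypothetical. The color branch and the third phase are not described here and are left out.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]  # (hue, saturation, intensity), each in [0, 1]

SAT_THRESHOLD = 0.1  # assumed cut-off for "colorless"; not given in the source


def is_colorless(pixel: Pixel, sat_threshold: float = SAT_THRESHOLD) -> bool:
    """Assumption: a pixel with very low saturation counts as colorless."""
    return pixel[1] < sat_threshold


def select_group(window: List[Pixel]) -> Tuple[str, List[Pixel]]:
    """Phase 1: split the window into two groups and return the larger one.

    An N x N window with odd N holds an odd number of pixels, so ties cannot occur.
    """
    colorless = [p for p in window if is_colorless(p)]
    color = [p for p in window if not is_colorless(p)]
    if len(colorless) > len(color):
        return "colorless", colorless
    return "color", color


def colorless_candidate(group: List[Pixel]) -> Pixel:
    """Phase 2 (colorless branch): median pixel by increasing intensity.

    For an even-sized group this picks the upper of the two middle pixels.
    """
    ordered = sorted(group, key=lambda p: p[2])
    return ordered[len(ordered) // 2]


# A flattened 3x3 window: five colorless pixels and four color pixels.
window = [
    (0.0, 0.02, 0.2), (0.0, 0.01, 0.9), (0.0, 0.03, 0.5),
    (0.0, 0.00, 0.4), (0.0, 0.05, 0.1), (0.30, 0.80, 0.60),
    (0.30, 0.70, 0.60), (0.31, 0.90, 0.50), (0.29, 0.85, 0.55),
]
name, group = select_group(window)
candidate = colorless_candidate(group) if name == "colorless" else None
```

Here the colorless group wins with five pixels, and the candidate is the pixel with intensity 0.4, the middle value among 0.1, 0.2, 0.4, 0.5 and 0.9.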
In this paper we propose an approach to separate non-text from text in a manuscript. The non-text is mainly in the form of doodles and drawings by some exceptional thinkers and writers. These have enormous historical value for the study of those writers’ subconscious as well as productive minds. We also propose a computational approach to recover struck-out text, reducing human effort. The proposed technique has a preprocessing stage, which removes noise using a median filter and segments the object region using fuzzy c-means clustering. Connected component analysis then finds the major portions of non-text, and window examination eliminates partially attached text. The struck-out text is extracted by eliminating straight lines, measuring the degree of continuity, and applying morphological operations.
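The connected component analysis step can be sketched on a binary object mask as below. The abstract does not state the criterion used to decide which components are non-text, so the sketch assumes a simple area threshold (`min_area`), which is purely illustrative; the function names and the toy mask are hypothetical as well. The median filtering, fuzzy c-means segmentation, and window examination stages are not shown.

```python
from collections import deque
from typing import List, Tuple

Component = List[Tuple[int, int]]  # list of (row, col) pixel coordinates


def connected_components(mask: List[List[int]]) -> List[Component]:
    """Label 8-connected foreground regions of a binary mask with a BFS flood fill."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components: List[Component] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels: Component = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def split_by_area(components: List[Component],
                  min_area: int) -> Tuple[List[Component], List[Component]]:
    """Assumed criterion: components with at least min_area pixels are 'large'."""
    large = [comp for comp in components if len(comp) >= min_area]
    small = [comp for comp in components if len(comp) < min_area]
    return large, small


# Toy mask: one 3x3 blob (stand-in for a drawing) and one isolated pixel.
mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
]
components = connected_components(mask)
large, small = split_by_area(components, min_area=5)
```

In practice a size test alone would confuse large text blocks with drawings, which is presumably why the abstract follows this step with a window examination.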