Chapter 1: INTRODUCTION

1. Overview

Text Segmentation

A text is not just a set of words; it has a coherent structure. The meaning of each word cannot be determined until it is placed in the structure of the text. Recognizing the structure of a text is an essential task in the process of text segmentation. One of the constituents of the text structure is the text segment. A text segment, whether or not it is explicitly marked, as sentences and paragraphs are, is defined as a sequence of clauses or sentences that display local coherence. It resembles a scene in a movie, which describes the same objects in the same situation. Segmentation is also an important process in digital video libraries. The segmentation task can be conducted on video, speech, and text.
Much research has been done on English-language texts, where C99 (presented by Choi) and TopicTiling are used extensively to segment various kinds of documents, but this has been done on very few document images. Combining these two approaches leads to the text segmentation algorithm called TopicTiling, which is based on TextTiling but is conceptually simpler; the smallest basic unit it considers is the sentence. A coherence score c_p is calculated between each pair of adjacent sentences, using the topic IDs assigned to the words by inference. Assuming an LDA model with T topics, each block is represented as a T-dimensional vector. The t-th element of each vector contains the frequency of topic ID t obtained from the respective block. The coherence score is calculated as the cosine similarity of each pair of adjacent topic vectors; values close to zero indicate marginal relatedness between two adjacent blocks, whereas values close to one denote a substantial
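The coherence score described above can be illustrated with a minimal Python sketch. It assumes that topic IDs have already been assigned to the words of each block by some LDA inference step, which is not shown here. The function names (topic_vector, cosine_similarity, coherence_scores) and the toy topic IDs are hypothetical and chosen only for illustration; they are not taken from any TopicTiling implementation.

```python
import math


def topic_vector(topic_ids, num_topics):
    """Build a T-dimensional vector whose t-th element counts topic ID t."""
    vec = [0] * num_topics
    for t in topic_ids:
        vec[t] += 1
    return vec


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence_scores(blocks, num_topics):
    """Return one score per pair of adjacent blocks (position p and p + 1)."""
    vectors = [topic_vector(block, num_topics) for block in blocks]
    return [
        cosine_similarity(vectors[p], vectors[p + 1])
        for p in range(len(vectors) - 1)
    ]


# Toy input (assumed, not real LDA output): each block is the list of
# topic IDs assigned to its words, with T = 3 topics.
blocks = [[0, 0, 1], [0, 0, 1], [2, 2, 2]]
scores = coherence_scores(blocks, num_topics=3)
print(scores)
```

In this toy input the first two blocks share the same topic distribution, so their score is close to one, while the third block uses a different topic entirely, so the second score is zero. A low score between adjacent blocks is what would suggest a candidate segment boundary.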

