Clustering of Near Duplicate Images in the Web Search

The overall objective is to cluster near-duplicate images. The user first submits a query to the search engine, which returns a set of query-related images. This result set contains both duplicate and near-duplicate images. The main aim of this paper is to detect the near-duplicate images and cluster them. This is achieved in three steps: Image Preprocessing, Feature Extraction and Clustering.

Image preprocessing, the initial step, consists of noise removal and image enhancement. Feature extraction then covers the extraction of key points and key point matching. The matched key points are used to estimate an affine transform based on an affine-invariant ratio of normalized lengths. Finally, clustering is performed, which includes Supervised and Unsupervised Clustering. The result is a set of image clusters. Each cluster has one image as its representative, and the other images in the cluster are called its near-duplicates. Lastly, a performance measure is calculated to evaluate the accuracy of the algorithm.

Figure 1 shows the block diagram of the proposed system. The final output consists of many clusters, each containing the near-duplicates of that cluster's representative image.

Fig. 1. Block Diagram of the Proposed System

3.1 Image Preprocessing

Pre-processing methods use a small neighborhood of a pixel in the input image to compute a new brightness value in the output image; this is also called filtration. Local pre-processing methods can be divided into two groups according to the goal of the processing: Smoothing suppresses noise or other small fluctuations in the image; equivalent to the suppression of high...

... middle of paper ...

...o cut. The basic idea is that Hierarchical clustering is applied to about half of the data, followed by K-means on the remainder.
To create super-rules, the Hierarchical clustering is terminated when it generates the largest number of clusters.

Algorithm
1. Run a complete agglomerative Hierarchical clustering on the data and record the number of clusters generated during the process.
2. Run the agglomerative Hierarchical clustering again and stop the process when the largest number of clusters has been generated.
3. Execute k-means clustering on the remaining data that was not processed in step 2, using the centroids of the clusters from step 2 as the initial centroids.

After the clustering process is over, a set of clusters is obtained. Each cluster represents a set of near-duplicates with one representative image.
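
The three steps of the algorithm can be illustrated with a minimal Python sketch. It is not the paper's implementation and rests on several assumptions that the text does not confirm: each image is reduced to a numeric feature vector, a "cluster" counts only once it holds at least two images (so the count rises and then falls during merging), single linkage with Euclidean distance is used, and the recorded merge history is reused instead of running the hierarchy a second time. The names `centroid`, `merge_history` and `hybrid_cluster` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def merge_history(points):
    """Step 1: full agglomerative pass, recording the clustering after each merge.

    Assumption: single linkage with Euclidean distance (naive, slow for large inputs).
    """
    clusters = [[i] for i in range(len(points))]
    history = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        history.append([c[:] for c in clusters])
    return history


def hybrid_cluster(points, max_iter=20):
    """Steps 2 and 3: stop at the peak cluster count, then k-means on the rest."""
    history = merge_history(points)

    def grouped(state):
        # Assumption: singletons do not count as clusters.
        return [c for c in state if len(c) > 1]

    # Step 2: earliest state with the largest number of multi-member clusters.
    # Picking it from the history is equivalent to re-running and stopping there.
    peak = max(history, key=lambda state: len(grouped(state)))
    seeds = grouped(peak)
    remaining = [c[0] for c in peak if len(c) == 1]
    if not seeds:
        return [[i] for i in remaining]

    # Step 3: k-means on the remaining points, seeded with the step 2 centroids.
    centroids = [centroid([points[i] for i in c]) for c in seeds]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            i: min(range(len(centroids)),
                   key=lambda k: math.dist(points[i], centroids[k]))
            for i in remaining
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [points[i] for i, label in assignment.items() if label == k]
            if members:  # a centroid with no assigned points keeps its position
                centroids[k] = centroid(members)

    clusters = [list(c) for c in seeds]
    for i, k in assignment.items():
        clusters[k].append(i)
    return clusters
```

Calling `hybrid_cluster` on a list of feature vectors returns lists of indices into that list, one list per cluster. How the representative image of each cluster is chosen is not specified in the text, so the sketch leaves it out.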