In the NExT Live Web Observatory database, we have collected close to 300 million social images after only 5 months of active crawling, and this number is increasing rapidly every day. Given this huge number of social images and the need to access them by their visual content, there is an urgent need to develop techniques for large-scale image indexing and retrieval.
The key issues in image indexing are image representation and the indexing method. We are currently working on a hashing-based system in which we extract a spatial pyramid visual feature encoded with a dictionary learned by sparse coding. Spectral Hashing is then employed to model the images, and a hash code extension step is incorporated to generate hash codes that differ by a Hamming distance of up to 2, in order to speed up the visual search during retrieval. Finally, a re-ranking step derives the final ranked search results based on detailed visual feature analysis. Fig. 5.1 presents the pipeline of the system. We are also investigating other image representation and hashing approaches, as well as the tuning of parameters as the size of the image collection increases. We aim to derive guidelines for efficient large-scale image indexing and retrieval.
Figure 5.1: The flowchart of the large-scale image indexing system.
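The hash code extension and re-ranking steps described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the function names (`hamming_ball`, `build_index`, `search`), the choice to expand the code on the query side, and the use of Euclidean distance for re-ranking are all assumptions made for illustration, and the hash codes are given directly as toy integers rather than produced by Spectral Hashing.

```python
from collections import defaultdict
from itertools import combinations
import math


def hamming_ball(code, n_bits, radius=2):
    """Yield every n_bits-wide code within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p  # flip bit p
            yield flipped


def build_index(codes):
    """Map each hash code to the list of image ids that carry it."""
    index = defaultdict(list)
    for image_id, code in codes.items():
        index[code].append(image_id)
    return index


def search(query_code, query_feature, index, features, n_bits,
           radius=2, top_k=10):
    """Collect candidates by exact bucket lookups, then re-rank them."""
    candidates = []
    for code in hamming_ball(query_code, n_bits, radius):
        candidates.extend(index.get(code, []))
    # Re-ranking: order candidates by distance between full feature vectors
    # (Euclidean distance is an assumption of this sketch).
    candidates.sort(key=lambda i: math.dist(query_feature, features[i]))
    return candidates[:top_k]


# Toy data: 4-bit codes and 2-D features, both hypothetical.
codes = {"a": 0b0000, "b": 0b0011, "c": 0b0111, "d": 0b0001}
features = {"a": (0.0, 0.0), "b": (0.1, 0.0), "c": (0.0, 0.0), "d": (0.5, 0.5)}
index = build_index(codes)

# "c" is 3 bits away from the query code, so it is never retrieved,
# even though its feature vector matches the query exactly.
result = search(0b0000, (0.0, 0.0), index, features, n_bits=4)
print(result)  # ['a', 'b', 'd']
```

Expanding to radius 2 replaces a linear scan over all codes with a fixed number of exact bucket lookups, 1 + n + n(n-1)/2 for n-bit codes, which is the sense in which the extension step can speed up retrieval. The toy example also shows the trade-off: items whose codes lie outside the radius are missed regardless of their true visual similarity.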
In parallel with image indexing and search, we also study microblogs, such as Twitter and Sina Weibo, which play an important role in people's daily lives. Microblog data contains rich information, such as text descriptions, images, tweets, user comments, and user relations. How to exploit these information sources within the huge volume of microblog data is an important but challenging problem. Here we work on analyzing Twitter/Weibo content, especially the visual content. Based on our preliminary study, only half of the images on Twitter/Weibo have meaningful text descriptions. We therefore first explore concept annotation and geo-location analysis of images in microblogs. This paves the way for research on discovering and tracking hot or specific events in the real world. We also work on generating event summaries and visualizations through microblog data analysis along a timeline.