A Real-World Web Image Database from National University of Singapore

Quick Guide to NUS-WIDE

Here we introduce a web image dataset created by the Lab for Media Search at the National University of Singapore. The dataset includes: (1) 269,648 images and their associated tags from Flickr, with a total of 5,018 unique tags; (2) six types of low-level features extracted from these images, including 64-D color histogram, 144-D color correlogram, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments and 500-D bag of words based on SIFT descriptors; and (3) ground-truth for 81 concepts that can be used for evaluation. Based on this dataset, we identify several research issues in web image annotation and retrieval. We also provide baseline results for web image annotation by learning from the tags using the traditional k-NN algorithm. The benchmark results show that it is possible to learn models from these data to help general image retrieval.
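As a rough illustration of how the six feature types listed above could sit side by side in one vector, the sketch below records their dimensions and computes the slice each would occupy after concatenation. The feature labels, the concatenation order, and the helper name `feature_slices` are assumptions made for illustration only; the distributed files may well store each feature type separately.

```python
# Dimensions are taken from the description above; the names and the
# concatenation order are hypothetical labels chosen for this sketch.
FEATURE_DIMS = {
    "color_histogram": 64,
    "color_correlogram": 144,
    "edge_direction_histogram": 73,
    "wavelet_texture": 128,
    "block_color_moments": 225,
    "bag_of_words_sift": 500,
}


def feature_slices(dims):
    """Map each feature name to its slice in a concatenated vector."""
    slices = {}
    start = 0
    for name, dim in dims.items():
        slices[name] = slice(start, start + dim)
        start += dim
    return slices, start


SLICES, TOTAL_DIM = feature_slices(FEATURE_DIMS)
print(TOTAL_DIM)  # 1134
```

With such a layout, `vector[SLICES["wavelet_texture"]]` would pick out the 128 wavelet texture values from a concatenated 1134-D vector.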

NUS-WIDE Citation:

Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng. "NUS-WIDE: A Real-World Web Image Database from National University of Singapore", ACM International Conference on Image and Video Retrieval. Greece. Jul. 8-10, 2009. [pdf] [BibTeX entry]

For any questions regarding the NUS-WIDE dataset, please contact Dr. Yue Gao, Mr. Xiangyu Chen, or Dr. Jinhui Tang.

To facilitate your research, we have uploaded the URLs of all the images except those that have been removed or are inaccessible now. [ URL ]

The EXIF and geo-information for a fraction of the images in the dataset are now available. [ URL ]


For more detailed descriptions of the dataset, see the specification.


1.   Low-Level Features

       [Link1]       [Link2]        (1.1 GB file)

2.   Tags

       [Link1]       [Link2]       (25 MB file)

3.   Groundtruth

       [Link1]       [Link2]       (1.1 MB file)

4.   Image List

       [Link1]       [Link2]       (4.6 MB file)

5.   Concept List

       [Link1]       [Link2]       (448 KB file)

We provide a light version of the NUS-WIDE dataset, named NUS-WIDE-LITE, along with one object image dataset, named NUS-WIDE-OBJECT, and one scene image dataset, named NUS-WIDE-SCENE. The image dataset can be obtained by sending us a request email. Specifically, researchers interested in the dataset should download and fill in the Agreement and Disclaimer Form and send it back to us. We will then, at our discretion, email you instructions for downloading the dataset.


1.   NUS-WIDE-LITE

       [Link1]       [Link2]        (110 MB file)  (specification)

2.   NUS-WIDE-OBJECT

       [Link1]       [Link2]        (62 MB file)  (specification)

3.   NUS-WIDE-SCENE

       [Link1]       [Link2]        (70 MB file)  (specification)


Digital images have become more easily accessible following the rapid advances in digital photography, networking and storage technologies. Some photo sharing websites, such as Flickr and Picasa, are popular in daily life. For example, there are more than 2,000 images being uploaded to Flickr every minute. During peak times, up to 12,000 photos are being served per second, and the record for the number of photos uploaded per day exceeds 2 million photos. When users share their photos, they will typically give several tags to describe the contents of these images. Out of these archives, several questions naturally arise for multimedia research. For example, what can we do with millions of images and their related tags? How can general image indexing and search benefit from the community shared images and tags?

In fact, how to improve the performance of existing image annotation and retrieval approaches by using machine learning and other artificial intelligence techniques has attracted much attention in the multimedia research community. However, for learning-based methods to be effective, a large number of balanced labeled samples is required, which typically come from users through an interactive manual process. This is very time-consuming and labor-intensive. To reduce this manual effort, many semi-supervised learning and active learning approaches have been proposed. Nevertheless, there is still a need to manually annotate many images to train the learning models. On the other hand, image sharing sites offer us a great opportunity to "freely" acquire a large number of images with annotated tags. The tags for the images are collectively annotated by a large group of heterogeneous users. It is believed that although most tags are correct, there are many noisy and missing tags. Thus, if we can learn accurate models from these user-shared images together with their associated noisy tags, much of the manual effort in image annotation can be eliminated. In this case, content-based image annotation and retrieval can benefit greatly from the community-contributed images and tags.

In this paper, we present four research issues on mining community-contributed images and tags for image annotation and retrieval. The issues are: (1) How to utilize the community-contributed images and tags to annotate non-tagged images. (2) How to leverage the models learned from these images and associated tags to improve the retrieval of web images with tags or surrounding text. (3) How to achieve tag completion, which means removing the noise in the tag set and enriching it with missing tags. (4) How to construct an effective training set for each concept, and the overall concept network, from the available information sources. To these ends, we construct a benchmark dataset to focus research efforts on these issues. The dataset includes a set of images crawled from Flickr, together with the associated tags of these images, as well as the semantic ground-truth of 81 concepts for these images. We also extract six low-level visual features: 64-D color histogram in LAB color space, 144-D color correlogram in HSV color space, 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise LAB-based color moments extracted over 5×5 fixed grid partitions, and 500-D bag of visual words. For the image annotation task, we also provide a baseline based on the k-NN method. The set of low-level features for images, their associated tags, ground-truth, and baseline results can be downloaded at
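To make the idea of a k-NN annotation baseline concrete, here is a minimal sketch in Python with NumPy. It is not necessarily the configuration behind the reported baseline: the function name `knn_annotate`, the Euclidean distance, the default `k`, and the scoring rule (fraction of neighbours carrying a tag) are all assumptions chosen for illustration.

```python
import numpy as np


def knn_annotate(train_feats, train_tags, query_feats, k=5):
    """Score tags for each query image from its k nearest training images.

    train_feats: (n_train, d) feature vectors of tagged images.
    train_tags:  (n_train, n_tags) binary tag indicator matrix.
    query_feats: (n_query, d) feature vectors of untagged images.
    Returns an (n_query, n_tags) array; each entry is the fraction of the
    k nearest neighbours (Euclidean distance) that carry the tag.
    """
    train_feats = np.asarray(train_feats, dtype=float)
    train_tags = np.asarray(train_tags, dtype=float)
    query_feats = np.asarray(query_feats, dtype=float)

    # Brute-force squared Euclidean distances, shape (n_query, n_train).
    # Fine for a toy example; far too memory-hungry for the full dataset.
    diff = query_feats[:, None, :] - train_feats[None, :, :]
    dists = np.einsum("qtd,qtd->qt", diff, diff)

    # Indices of the k closest training images for every query.
    nearest = np.argsort(dists, axis=1)[:, :k]

    # Average the neighbours' binary tag vectors to get per-tag scores.
    return train_tags[nearest].mean(axis=1)
```

For example, with four training images in two tight clusters, each cluster carrying a different tag, a query placed inside the first cluster and `k=2` receives score 1.0 for the first tag and 0.0 for the second. Because user tags are noisy, such scores would normally be thresholded or ranked rather than read as ground truth.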

To our knowledge, this is the largest real-world web image dataset comprising over 269,000 images with over 5,000 user-provided tags, and ground-truth of 81 concepts for the entire dataset. The dataset is much larger than the popularly available Corel and Caltech 101 datasets. Though some datasets comprise over 3 million images, they only have ground-truth for a small fraction of images. Our proposed NUS-WIDE dataset has the ground-truth for the entire dataset.


1.       Y. Liu, D. Xu, I.W. Tsang, and J. Luo. Using large-scale web data to facilitate textual query based retrieval of consumer photos. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

2.       J. Tang, S. Yan, R. Hong, G.-J. Qi, and T.-S. Chua. Inferring semantic concepts from community-contributed images and noisy tags. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

3.       G.-J. Qi, X.-S. Hua, and H.-J. Zhang. Learning semantic distance from community-tagged media collection. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

4.       X. Liu, B. Cheng, S. Yan, J. Tang, T.-S. Chua, and H. Jin. Label to region by bi-layer sparsity priors. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

5.       F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, and K. Ross. Video interactions in online video social networks. ACM Trans. Multimedia Comput. Commun. Appl. 5(4), Oct. 2009.

6.       X. Zhang, Y.-C. Song, J. Cao, Y.-D. Zhang, and J.-T. Li. Large scale incremental web video categorization. In Proceedings of the 1st Workshop on Web-Scale Multimedia Corpus, ACM, 2009.

7.       A. Sun and S.S. Bhowmick. Image tag clarity: in search of visual-representative tags for social images. In Proceedings of the First SIGMM Workshop on Social Media, ACM, 2009.

8.       M. Wang, K. Yang, X.-S. Hua, and H.-J. Zhang. Visual tag dictionary: interpreting tags with visual words. In Proceedings of the 1st Workshop on Web-Scale Multimedia Corpus, ACM, 2009.

9.       Y. Song, Y. Zhang, X. Zhang, J. Cao, and J.-T. Li. Google challenge: incremental-learning for web video categorization on robust semantic feature space. In Proceedings of the 17th ACM International Conference on Multimedia, 2009.

10.    T.-S. Chua, S. Tang, R. Trichet, H.-K. Tan, and Y. Song. MovieBase: a movie database for event detection and behavioral analysis. In Proceedings of the 1st Workshop on Web-Scale Multimedia Corpus, ACM, 2009.

11.    S. Gao, L.-T. Chia, and X. Cheng. Understanding tag-cloud and visual features for better annotation of concepts in NUS-WIDE dataset. In Proceedings of the 1st Workshop on Web-Scale Multimedia Corpus, ACM, 2009.

12.    J. Cao, Y.-D. Zhang, Y.-C. Song, Z.-N. Chen, X. Zhang, and J.-T. Li. MCG-WEBV: A Benchmark Dataset for Web Video Analysis. Technical report, ICT, CAS, 2009.

13.    S. Si, D. Tao, K.-P. Chan. Transfer Discriminative Logmaps. Pacific-Rim Conference on Multimedia, 2009.

14.    R. Hong, J. Tang, Z.-J. Zha, Z. Luo, T.-S. Chua. Mediapedia: Mining Web Knowledge to Construct Multimedia Encyclopedia. The 16th International Multimedia Modeling Conference, 2010.

15.    Z. Luo, H. Li, J. Tang, R. Hong, T.-S. Chua. Estimating Poses of World's Photos with Geographic Metadata. The 16th International Multimedia Modeling Conference, 2010.


Last updated: Jan. 28, 2009