A huge amount of manual effort is required to annotate large image/video archives with text. Several recent works have attempted to automate this task by employing supervised learning approaches that associate visual information extracted from segmented images with semantic concepts provided by the associated text. The main limitation of such approaches, however, is that a large labeled training corpus is still needed for effective learning, and semantically meaningful segmentation of images is generally unavailable. This paper explores the use of a bootstrapping approach to tackle this problem. The idea is to start from a small set of labeled training examples and successively annotate a larger set of unlabeled examples. This is done using the co-training approach, in which two "statistically independent" classifiers are used to co-train and co-annotate the unlabeled examples. An active learning approach is used to select the best examples to label at each stage of learning in order to maximize the learning objective. To accomplish this, we break the task of annotating images into the sub-tasks of: (a) segmenting images into meaningful units, (b) extracting appropriate features for the units, and (c) associating these features with text. Because of the uncertainty in sub-tasks (a) and (b), we adopt two independent segmentation methods (task a) and two independent sets of features (task b) to support co-training. We carried out experiments using a mid-sized image collection (comprising about 6,000 images from CorelCD, PhotoCD and the Web) and demonstrated that our bootstrapping approach significantly improves annotation performance, by about 10% in terms of the F1 measure, compared with the best results obtained from the traditional supervised learning approach. Moreover, the bootstrapping approach has the key advantage of requiring far fewer labeled examples for training.
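To make the co-training and active learning loop concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: two classifiers, each working on one of two independent feature views (e.g., features derived from the two segmentation methods), pseudo-label the unlabeled pool with their most confident predictions, while the least confident examples are flagged for manual labeling. The function names, parameters, and the choice of scikit-learn logistic regression are assumptions made purely for illustration.

```python
# Illustrative co-training sketch (assumes two precomputed feature "views" and
# integer concept labels; NOT the paper's actual implementation).
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
            n_rounds=10, n_confident=5, n_query=2):
    """Co-train two classifiers on two independent feature views.

    Returns the two fitted classifiers and the indices of unlabeled
    examples flagged for manual annotation (active learning queries).
    """
    clf1 = LogisticRegression(max_iter=1000)
    clf2 = LogisticRegression(max_iter=1000)
    pool = np.arange(len(X1_unlab))   # indices of still-unlabeled examples
    queries = []                      # examples routed to a human annotator

    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        clf1.fit(X1_lab, y_lab)
        clf2.fit(X2_lab, y_lab)

        # Each classifier scores the remaining unlabeled pool on its own view.
        p1 = clf1.predict_proba(X1_unlab[pool]).max(axis=1)
        p2 = clf2.predict_proba(X2_unlab[pool]).max(axis=1)
        conf = np.maximum(p1, p2)

        # Co-annotation: the most confident predictions (from either view)
        # are added to the shared labeled set with their predicted labels.
        top = np.argsort(conf)[-n_confident:]
        pred = np.where(p1[top] >= p2[top],
                        clf1.predict(X1_unlab[pool[top]]),
                        clf2.predict(X2_unlab[pool[top]]))
        X1_lab = np.vstack([X1_lab, X1_unlab[pool[top]]])
        X2_lab = np.vstack([X2_lab, X2_unlab[pool[top]]])
        y_lab = np.concatenate([y_lab, pred])

        # Active learning: among the examples not pseudo-labeled this round,
        # flag the least confident ones for manual annotation.
        rest = np.setdiff1d(np.arange(len(pool)), top)
        low = rest[np.argsort(conf[rest])[:n_query]]
        queries.extend(pool[low].tolist())

        # Remove the examples consumed in this round from the pool.
        pool = np.delete(pool, np.concatenate([top, low]))

    return clf1, clf2, queries
```

In the setting described above, the two views would correspond to the two independent segmentation methods and feature sets, and the queried examples are the ones presented for manual labeling at each stage of bootstrapping.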