Current dataset version: v1.0 (released in October 2021)

Click here to download v1.0 annotations (ZIP file)

This ZIP file contains JSON files with image post metadata for RedCaps instances, along with the CC-BY 4.0 license and the Reddit API terms of use. The expected file structure after extraction is as follows:

- annotations/
  ├── abandoned_2017.json
  ├── abandoned_2018.json
  ├── ...
  ├── itookapicture_2019.json
  ├── itookapicture_2020.json
  ├── ...
  ├── <subreddit>_<year>.json
  └── ...
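
As a minimal sketch of working with this layout, the snippet below lists the extracted files and splits each filename into its subreddit and year. The helper names (`AnnotationFile`, `parse_annotation_name`, `list_annotation_files`) are hypothetical and not part of any RedCaps tooling; the only assumption taken from this page is the `<subreddit>_<year>.json` naming pattern.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class AnnotationFile(NamedTuple):
    """Hypothetical record describing one annotation file."""
    subreddit: str
    year: int
    path: Path


def parse_annotation_name(path: Path) -> AnnotationFile:
    # Split on the LAST underscore, so a subreddit name that itself
    # contains underscores would still be parsed correctly.
    subreddit, year = path.stem.rsplit("_", 1)
    return AnnotationFile(subreddit, int(year), path)


def list_annotation_files(root: Path) -> Iterator[AnnotationFile]:
    # Assumes `root` is the extracted "annotations/" directory.
    for path in sorted(root.glob("*_*.json")):
        yield parse_annotation_name(path)
```

For example, `list(list_annotation_files(Path("annotations")))` would yield one record per JSON file, which can then be filtered by subreddit or year before loading.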

Note: We do not distribute image files as we do not legally own them. The annotation files contain image URLs – we recommend downloading images using the redcaps-downloader tool.

Note for dataset users

Terms of use: Uses of RedCaps are subject to Reddit API terms. Users must comply with the Reddit User Agreement, Content Policy, and Privacy Policy.

Usage Restrictions: RedCaps should only be used for non-commercial research. RedCaps should not be used for any tasks that involve identifying features related to people (facial recognition, gender, age, ethnicity identification, etc.) or that make decisions that impact people (mortgages, job applications, criminal sentences; or moderation decisions about user-uploaded data that could result in bans from a website). Any commercial and for-profit uses of RedCaps are restricted – it should not be used to train models that will be deployed in production systems as part of a product offered by businesses or government agencies.

Refer to the datasheet in the paper for more details.

Image removal request

Did you find a problematic image in RedCaps that should be removed? Report it to us using this link! We will review all requests and remove problematic images in the next version release.

Image removal request form

Annotation format

Each JSON file contains metadata of image posts from a single subreddit, submitted by Reddit users in one or more years. For example, itookapicture_2018.json contains image posts from r/itookapicture submitted between Jan 1 and Dec 31, 2018. Annotation files named *_2017.json contain posts from 2008 through 2017. Each JSON file follows this schema:

{
  "info": {
    "start_date": "YYYY-MM-DD",  // start date of image posts in this file
    "end_date": "YYYY-MM-DD",    // end date of image posts in this file
    "url": "",
    "version": 1.0,
  },
  "annotations": [
    {
      // Example (metadata is not real)
      "image_id": "ab3d5f",
      "author": "johndoe",
      "url": "",
      "raw_caption": "ITAP of my cat [4000x4000]",
      "caption": "itap of my cat",  // after applying text filtering to 'raw_caption'
      "subreddit": "itookapicture",
      "score": 15,  // upvotes - downvotes
      "created_utc": 160256789,  // UTC epoch when the post was submitted
      "permalink": "/r/itookapicture/itap_of_my_cat",
      // ...
    }
  ]
}
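
A minimal sketch of reading one annotation file with the Python standard library is shown below. It assumes the distributed files are plain JSON (the `//` comments in the schema above are explanatory only) and that each file has the top-level "info" and "annotations" keys shown. The function names `load_annotations` and `summarize` are hypothetical, chosen for illustration.

```python
import json
from datetime import datetime, timezone


def load_annotations(path):
    """Return the (info, annotations) pair from one annotation file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["info"], data["annotations"]


def summarize(annotation):
    """Pick a few fields and convert the epoch timestamp to a UTC date."""
    created = datetime.fromtimestamp(annotation["created_utc"], tz=timezone.utc)
    return {
        "image_id": annotation["image_id"],
        "caption": annotation["caption"],
        "url": annotation["url"],
        "created": created.date().isoformat(),
    }


if __name__ == "__main__":
    # The path is an example; adjust it to wherever the ZIP was extracted.
    info, annotations = load_annotations("annotations/itookapicture_2018.json")
    print(info["start_date"], info["end_date"], len(annotations))
    for ann in annotations[:3]:
        print(summarize(ann))
```

Only metadata is read here; fetching the images themselves is left to the redcaps-downloader tool mentioned above.
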
Maintained by: Karan Desai