COCOHub is a crowdsourcing platform through which volunteers can support students and researchers interested in natural language processing by translating sentences from the MS-COCO dataset (hence the "COCO" in the name).

COCOHub is a crowdsourcing website developed by Brian Muhia. It enables people to contribute to future African NLP research in image captioning and machine translation. It helps researchers and developers by providing aligned sentences in more than 40 languages, along with captions in the same basic format as the MS-COCO 2015 Image Captioning Task.

How COCOHub differs from existing efforts:

  • This is a 100% open-data project, aimed at supporting students and researchers
  • Unlike, say, the Google Translate Community or OBtranslate, we aim to expose all verified language pairs as downloadable datasets, split into training and test sets, for use by students and researchers of machine translation and image captioning. Aligned sentences are available in every supported language, so you can combine any two languages to get a brand new dataset.
  • This is designed as a platform that a student/researcher can use to take responsibility for building a dataset, since all sentences have unique identifiers.
  • We are inspired by CrowdCrafting.org
  • This way you can spend your time on the more substantial parts of your project, such as language modeling and building tokenizers.
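
The "mix and match" idea above can be sketched in a few lines of Python. This is only an illustration, not COCOHub's actual export format or code: the `translations` layout (language code mapped to sentences keyed by unique identifier), the function names `build_pairs` and `train_test_split`, and the toy sentences are all assumptions made for the example.

```python
import random

# Hypothetical layout: language code -> {sentence identifier: text}.
# The identifiers and sentences are toy data made up for this sketch.
translations = {
    "en": {101: "A man rides a bicycle.", 102: "Two dogs are playing.",
           103: "A cat is sleeping on the bed.", 104: "A child eats an apple."},
    "fr": {101: "Un homme fait du vélo.", 102: "Deux chiens jouent.",
           103: "Un chat dort sur le lit.", 104: "Un enfant mange une pomme."},
    "sw": {101: "Mwanamume anaendesha baiskeli.", 102: "Mbwa wawili wanacheza.",
           103: "Paka amelala kitandani."},  # 104 not translated yet
}


def build_pairs(data, src, tgt):
    """Align two languages on the sentence identifiers they share."""
    shared_ids = sorted(set(data[src]) & set(data[tgt]))
    return [(sid, data[src][sid], data[tgt][sid]) for sid in shared_ids]


def train_test_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then cut the list into (train, test)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Any two languages can be combined, not only pairs involving English.
fr_sw = build_pairs(translations, "fr", "sw")
train, test = train_test_split(fr_sw, test_fraction=0.34)
```

Because every sentence carries a unique identifier, a sentence missing from one language (104 in the toy Swahili data) simply drops out of that pair without affecting any other pair.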

Represent your community

In our first project series, you can go to Projects and search for a language you'd like to contribute to. When you find it, join in and translate a few sentences. If you don't find the language you're looking for, contact us at info.

While this project is maintained by a few people, we believe in the power that communities have over their common problems, and lack of access to aligned sentences is a big one for people who want to do research in machine translation. The problem is even bigger in the field of image captioning. Join us and represent your community by creating a dataset that will be used by the next few generations of researchers in NLP!


COCOHub is built on top of PyBossa, an open-source Python framework for creating crowdsourcing projects that uses the Flask micro-framework. Each project on the platform is written in a combination of HTML, CSS and JavaScript. The COCOHub theme and any plugins developed to provide additional functionality are open-source and available on GitHub.


Contact information