Building Datasets, Part 3

By David Nadeau

In the first of these posts, we covered the (now) conventional wisdom that a bigger dataset is better for training machine learning algorithms. The second post in the series detailed a few rules of thumb for creating quality datasets. This time around, we’ll look at how to start building datasets.

This slide from @NathanBenaich summarizes the available options quite well:

https://docs.google.com/presentation/d/11Ki7j7oTxI8Y9pZxyrxk3m3dTdjz4t1ir4QRCWnY7-g/edit#slide=id.g1d25704031_0_210

Adding a crowdsourcing category to that slide, with companies such as CrowdFlower, Spare5/MightyAI and Mechanical Turk, would give a more complete picture of the process and the options available.

Let’s narrow the field of data acquisition strategies and highlight some well-known and lesser-known public datasets.

In no particular order, here are some large initiatives:

  • The Common Crawl is an open repository of web crawl data that can be accessed and analyzed by anyone.

  • UCI Machine Learning Datasets is the most widely used source of datasets for ML algorithm benchmarking.

  • DBPedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.

  • Kaggle, the Machine Learning competition platform, hosts a large inventory of more than 400 public datasets.

  • Dataverse connects a community that shares close to 50,000 datasets, mostly for reproducible science.

  • Data.gov and Open Data in Canada are two of the many open governmental data initiatives.

  • Non-governmental entities are also seizing the Open Data opportunity. NYC-based Enigma Public, for instance, claims to maintain the world’s broadest collection of public data.

  • AI2 has a great list of datasets for AI benchmarks.

  • Ever heard of ‘Linked data’? More and more datasets are being published in linked data-compatible formats. Here’s a visualization:

http://lod-cloud.net/
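
If linked data is new to you, the core idea is that facts are published as subject, predicate, object triples that point to one another through URIs. Here is a minimal Python sketch of what reading such data can look like. It assumes a simplified N-Triples-like line format, and the example.org URIs and the parse_triples helper are made up for illustration; for real datasets you would want a proper RDF library instead of a regular expression.

```python
import re

# Simplified pattern (an assumption, not the full N-Triples grammar):
# <subject> <predicate> followed by either <object-URI> or "plain literal", then " ."
TRIPLE_RE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+(?:<([^>]+)>|"([^"]*)")\s*\.$'
)

# Hypothetical sample data using example.org URIs; not taken from any real dataset.
SAMPLE = """
# a comment line
<http://example.org/city/Ottawa> <http://example.org/prop/country> <http://example.org/country/Canada> .
<http://example.org/city/Ottawa> <http://example.org/prop/name> "Ottawa" .
"""


def parse_triples(text):
    """Return a list of (subject, predicate, object) tuples from the text."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match is None:
            continue  # skip anything outside the simplified format
        subject, predicate, uri_obj, literal_obj = match.groups()
        obj = uri_obj if uri_obj is not None else literal_obj
        triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    for subject, predicate, obj in parse_triples(SAMPLE):
        print(subject, predicate, obj)
```

Because the object of one triple can be the subject of another, datasets published this way can be joined across publishers, which is what the cloud visualization above depicts.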

Some more specific datasets:

Dataset and its sourcing platform

  • Mozilla Common Voice is an online platform to crowdsource and validate an open corpus of training data for voice recognition. You can contribute your voice, verify contributions from others and … “Mozilla aims to […] release the open source database later in 2017.”

Want more?

And that’s just scratching the surface…

Written on March 20, 2017