Building Datasets, Part 3
In the first of these posts, we covered the (now) conventional wisdom that having a bigger dataset is better for training machine learning algorithms. The second of the series detailed a few rules of thumb for creating quality datasets. This time around, we’ll look at how to start building datasets.
This slide from @NathanBenaich summarizes the available options quite well:
Adding a crowdsourcing tab to the slide, with a list of companies such as Crowdflower, Spare5/MightyAI and Mechanical Turk, would give a more comprehensive picture of the process and the options available.
Let’s narrow down all the potential data acquisition strategies available and highlight some of the well- and lesser-known public datasets.
In no particular order, here are some large initiatives:
- The Common Crawl is an open repository of web crawl data that can be accessed and analyzed by anyone.
- UCI Machine Learning Datasets is the most widely used source of datasets for ML algorithm benchmarking.
- DBPedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.
- Kaggle, the Machine Learning competition platform, hosts a large inventory of more than 400 public datasets.
- Dataverse connects a community that shares close to 50,000 datasets, mostly for reproducible science.
- Data.gov and Open Data in Canada are two of the many open governmental data initiatives.
- Non-governmental entities are also seizing the Open Data opportunity. NYC-based Enigma Public, for instance, claims to maintain the world’s broadest collection of public data.
- AI2 has a great list of datasets for AI benchmarks.
- Ever heard of ‘Linked data’? Well, there are more and more datasets published in linked data-compatible formats. Here’s a visualization:
Some more specific datasets:
- The Ubuntu Dialogue Corpus is a massive body of 1 million multi-turn dialogs, often FAQ-style.
- Quora Question Pairs is 400,000 lines of potential duplicate question pairs.
- Microsoft Concept Graph contains 85 million “is a” relations for 5 million concepts and 12 million instances.
- The Instacart Online Grocery Shopping Dataset 2017 is an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users.
- CV Datasets on the web covers more computer vision (CV) datasets than one can handle.
- Princeton researchers released the biggest database of sarcasm ever, with 1.4 million sarcastic remarks taken from Reddit. Problem solved.
- Here’s a curated list of medical datasets for machine learning.
- The LibPostal project released 100 GB of openly licensed labeled data for sequence modeling research (addresses and other geographic strings from OSM/OpenAddresses).
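Whichever of these datasets you pick up, a quick sanity pass before training is usually worthwhile. Here is a minimal sketch of what that could look like for a question-pair file, using only Python's standard library. The column names (`question1`, `question2`, `is_duplicate`), the tab-separated layout, and the `summarize_pairs` helper are assumptions made for illustration, and the three sample rows are invented stand-ins rather than real data.

```python
import csv
import io

# Invented, in-memory stand-in for a question-pair file.
# The column names and the tab-separated layout are assumptions.
SAMPLE_TSV = (
    "id\tquestion1\tquestion2\tis_duplicate\n"
    "0\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\tWhat is machine learning?\tHow do I bake bread?\t0\n"
    "2\tWhat is a dataset?\t\t0\n"
)


def summarize_pairs(handle):
    """Count total rows, labeled duplicates, and rows missing a question."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = duplicates = incomplete = 0
    for row in reader:
        total += 1
        # Rows with an empty question are counted separately and skipped.
        if not row["question1"].strip() or not row["question2"].strip():
            incomplete += 1
            continue
        if row["is_duplicate"] == "1":
            duplicates += 1
    return {"total": total, "duplicates": duplicates, "incomplete": incomplete}


summary = summarize_pairs(io.StringIO(SAMPLE_TSV))
print(summary)
```

To run the same check on a downloaded file, pass an open file handle instead of the `io.StringIO` wrapper, after confirming that the column names match the actual header.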
A dataset and its sourcing platform:
- Mozilla Common Voice is an online platform to crowdsource and validate an open corpus of training data for voice recognition. You can contribute your voice, verify contributions from others and … “Mozilla aims to […] release the open source database later in 2017.”
Want more?
- In Medium’s Fueling the Gold Rush, Luke de Oliveira lists over 30 more datasets specific to AI.
- a16z’s AI microsite lists a few image and text datasets.
- Awesome Public Datasets is an absolutely awesome collection of datasets gathered and tidied from blogs, answers, and user responses.
And that’s just scratching the surface…