Collecting additional data

While collecting internal data is useful for some data science projects, it's only one piece of the puzzle. Often, you need to gather data from external sources as well.

Even more data

There are many ways that you can collect additional data for your organization. A few common ways include APIs, public records, and Mechanical Turk, all of which we'll discuss here.

Data APIs

Let's begin with APIs. API stands for Application Programming Interface. It's an easy way of requesting data from a third party over the internet. Many companies have APIs to let your team access their data.

Some notable APIs include Twitter, Wikipedia, Yahoo! Finance, and Google Maps, but there are many, many more.

If you work with a partner and think that they might have useful data, do a quick web search and see if an API exists!
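To make "requesting data from a third party over the internet" a little more concrete, here is a minimal sketch using only the Python standard library. The base URL, the `q` and `limit` parameters, and the bearer-token header are hypothetical placeholders, not any real provider's API; the provider's documentation defines the actual names.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the URL from your provider's documentation.
BASE_URL = "https://api.example.com/v1/search"


def build_request(query, api_key, limit=100):
    """Build (but do not send) an authenticated GET request."""
    params = urlencode({"q": query, "limit": limit})
    return Request(
        f"{BASE_URL}?{params}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def fetch_json(request):
    """Send the request and decode the JSON body of the response."""
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the request is built here; fetch_json(request) would perform the call.
request = build_request("festival sale", api_key="YOUR_KEY")
print(request.full_url)
```

Building the request separately from sending it makes it easy to inspect the URL before any data is actually requested.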

Tracking a hashtag

Let's look at an example using the Twitter API. Suppose we want to track Tweets with the hashtag AmazonGreatIndianFestival. We can use the Twitter API to request all Tweets with this hashtag.

At this point, we have many options for analysis. We could perform a sentiment analysis on the text of each Tweet to get an idea of how people are responding to the event. We could simply track how often this hashtag appears over a period of time. We could also combine this data with app download data to see if positive Tweets are correlated with more app downloads during this period.
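As one illustration of the simplest option, tracking how often the hashtag appears over time, here is a small sketch. The Tweets below are made up for the example, and the `created_at` and `text` field names are assumptions about how the records might be stored, not a guaranteed API response format.

```python
from collections import Counter
from datetime import datetime

# Made-up records standing in for data returned by an API.
tweets = [
    {"created_at": "2023-10-08T09:15:00", "text": "Great deals at #AmazonGreatIndianFestival"},
    {"created_at": "2023-10-08T18:40:00", "text": "Checkout failed again #AmazonGreatIndianFestival"},
    {"created_at": "2023-10-09T11:05:00", "text": "Love my new phone! #amazongreatindianfestival"},
    {"created_at": "2023-10-09T12:00:00", "text": "Unrelated post"},
]

HASHTAG = "#amazongreatindianfestival"


def daily_hashtag_counts(records, hashtag):
    """Count, per calendar day, the records whose text contains the hashtag."""
    counts = Counter()
    for record in records:
        # Compare in lowercase so the match ignores capitalization.
        if hashtag in record["text"].lower():
            day = datetime.fromisoformat(record["created_at"]).date()
            counts[day.isoformat()] += 1
    return dict(sorted(counts.items()))


counts = daily_hashtag_counts(tweets, HASHTAG)
print(counts)  # {'2023-10-08': 2, '2023-10-09': 1}
```

Sentiment analysis and the correlation with downloads would build on the same records, but typically need a dedicated library and a second data source.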

Public records

Public records are another great way of gathering additional data. In India, data.gov.in has data available for a wide array of indicators which can be downloaded for free. Other countries also have similar sites.

This can be a great source for understanding population-level trends or gathering location and economic data.
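Public datasets are often distributed as CSV files. As a rough sketch of a first step after downloading one, the example below totals a numeric column by region. The column names and rows are invented for illustration and do not come from any real dataset.

```python
import csv
import io
from collections import defaultdict

# Invented rows standing in for a downloaded file; in practice you would
# read the real file with open("your_file.csv", newline="").
SAMPLE_CSV = """state,district,population
State A,District 1,1200
State A,District 2,800
State B,District 3,1500
"""


def total_by(rows, key, value):
    """Sum the numeric `value` column for each distinct entry in `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
totals = total_by(rows, "state", "population")
print(totals)  # {'State A': 2000, 'State B': 1500}
```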

Building a training set

Previously, we discussed image recognition as a type of data science problem. In order to build a good image recognition algorithm, we need a set of pictures that have already been labelled, which is called our training set. But we don't need just one or two pictures. We need hundreds or thousands of pictures. Getting these images labelled can be really difficult and time-consuming, and the lack of a training set is often what keeps good data science projects from being completed.

Mechanical Turk

Depending on what kind of training set is needed, Mechanical Turk, also called MTurk, can be a great option. Mechanical Turk means asking humans to complete a task that we eventually plan on computerizing.

In our previous example, this would mean labelling a handful of pictures to create a training set for image recognition. Rather than asking one person to label thousands of images, we recruit thousands of people to label a few images.

To ensure quality, we might ask two or three people to review the same image and then take the most common answer. Many platforms, such as AWS MTurk, exist to help you set up your Mechanical Turk task and recruit helpers.
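The "take the most common answer" step can be sketched as a simple majority vote. The file names and labels below are hypothetical, and returning `None` on a tie (so the image can be sent for another review) is an assumption of this sketch rather than a rule of any particular platform.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None  # nobody labelled this item
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie for first place
    return ranked[0][0]


# Hypothetical answers from several helpers for each image.
answers = {
    "img_001.jpg": ["street sign", "street sign", "no sign"],
    "img_002.jpg": ["no sign", "street sign"],
}

consensus = {image: majority_label(votes) for image, votes in answers.items()}
print(consensus)  # {'img_001.jpg': 'street sign', 'img_002.jpg': None}
```

Asking an odd number of people per image reduces the chance of ties when there are only two possible labels.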

Mechanical Turk isn't just for image recognition. You can also use it to label customer reviews as positive or negative, extract text from a form, or highlight keywords in a sentence.

In the example above, users are asked to identify which sections of the image contain a street sign.