Data sources and risks

Previously, you learned about the data science workflow. Here, we will look at the first step: data collection.

Common sources of data

Data is everywhere and almost every business process can generate mountains of data. Common sources of data include:

  1. Web events
  2. Customer data
  3. Logistics data
  4. Financial transactions

It's possible that a company is already collecting all of this information. It's best to ask your data engineers what is collected and what isn't, and to emphasize the importance of starting a centralized data collection process sooner rather than later.

Web data

Let's dive a bit deeper into web data. When a user visits a web page or clicks on a link, it can be helpful to track this information in order to calculate conversion rates or monitor the popularity of different pieces of content.
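As a rough illustration, a conversion rate can be computed from tracked events as the share of visiting users who went on to perform a target action. The sketch below assumes hypothetical event names (`page_view`, `purchase`) and a simple list of `(user_id, event_name)` pairs; neither comes from a real tracking system.

```python
def conversion_rate(events, visit_event="page_view", target_event="purchase"):
    """Return the fraction of visiting users who also performed the target event."""
    # Event names are assumptions chosen for illustration.
    visitors = {user for user, name in events if name == visit_event}
    converters = {user for user, name in events if name == target_event}
    if not visitors:
        return 0.0  # avoid dividing by zero when nobody visited
    return len(visitors & converters) / len(visitors)


# Made-up sample data: four visitors, one of whom purchased.
sample_events = [
    (1, "page_view"),
    (1, "purchase"),
    (2, "page_view"),
    (3, "page_view"),
    (4, "page_view"),
]
rate = conversion_rate(sample_events)
print(rate)  # 0.25
```

Counting distinct users rather than raw events keeps one very active visitor from skewing the rate.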

At a minimum, you'll want to collect three things: the name of the event, which could be the URL of the page visited or an identifier for the element that was clicked; the timestamp of the event; and an identifier for the user who performed the action.
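One way to picture this minimal event record is as a small data structure with exactly those three fields. This is only a sketch: the class name `WebEvent`, the helper `record_event`, the example URL, and the user id are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebEvent:
    name: str            # URL visited or identifier of the clicked element
    timestamp: datetime  # when the event happened (UTC)
    user_id: int         # identifier for the user, not their name


def record_event(name, user_id, when=None):
    """Build an event, stamping it with the current UTC time if none is given."""
    when = when or datetime.now(timezone.utc)
    return WebEvent(name=name, timestamp=when, user_id=user_id)


# Hypothetical page and user id, for illustration only.
event = record_event("/products/blue-widget", user_id=1001)
print(event)
```

Storing timestamps in UTC is a design choice assumed here; it avoids ambiguity when users visit from different time zones.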

Personally Identifiable Information

Suppose Jane Doe is a customer who visits your company website and likes one of your products. You might choose to track her name, the timestamp, and the object she clicked on.

It's important to remember that Jane Doe's name is Personally Identifiable Information, or PII. PII includes a person's name, location, email address, and any other piece of information that could be used to tie a web event back to a real human. PII should be treated with extreme sensitivity and caution.

Data pseudonymization

One of the easiest ways to protect Jane's identity is to split this information into two separate entries. We can assign Jane a user id, in this case 185477, and store that information in a users table. We can then identify her event using this id.

We call the data in the events table pseudonymized because Jane can't be identified by that table alone, but she can be identified if we combine information from the users table with the events table.
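The split into a users table and an events table can be sketched with an in-memory SQLite database. The column names, the event name, and the timestamp below are placeholders chosen for illustration; only Jane's name and user id come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE events (user_id INTEGER, event_name TEXT, timestamp TEXT)")

# PII lives only in the users table.
conn.execute("INSERT INTO users VALUES (?, ?)", (185477, "Jane Doe"))
# The events table refers to Jane only by her id (placeholder event and time).
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?)",
    (185477, "like_button_click", "2024-01-15T10:30:00Z"),
)

# On its own, the events table does not reveal who 185477 is.
events_only = conn.execute("SELECT * FROM events").fetchall()

# Joining the two tables re-identifies her, which is why this is
# pseudonymized rather than anonymized data.
joined = conn.execute(
    "SELECT u.name, e.event_name "
    "FROM events e JOIN users u ON u.user_id = e.user_id"
).fetchall()

print(events_only)
print(joined)
```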

To protect Jane, we'll want to make sure that access to the users table is restricted to only folks who need to know Jane's identity, such as senior customer service representatives or members of the legal team. We'll also want to periodically audit who has accessed this data and how they have used it to ensure that Jane's data is respected.
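A minimal sketch of this idea is a lookup function that checks the requester's role and records every attempt in an access log that can be audited later. The role names, the in-memory `USERS` dictionary, and the log format are all assumptions; a production system would rely on the database's own permission and auditing features.

```python
from datetime import datetime, timezone

ALLOWED_ROLES = {"senior_customer_service", "legal"}  # assumed role names
USERS = {185477: "Jane Doe"}  # stands in for the restricted users table
access_log = []               # every attempt is recorded for later audits


def lookup_user(user_id, requester, role, reason):
    """Return a user's name if the role is allowed; log the attempt either way."""
    granted = role in ALLOWED_ROLES
    access_log.append({
        "when": datetime.now(timezone.utc),
        "requester": requester,
        "role": role,
        "user_id": user_id,
        "reason": reason,
        "granted": granted,
    })
    if not granted:
        raise PermissionError(f"role {role!r} may not read the users table")
    return USERS[user_id]


# Hypothetical requesters: one allowed, one denied.
name = lookup_user(185477, "alice", "legal", "data deletion request")
try:
    lookup_user(185477, "bob", "marketing_analyst", "campaign targeting")
    denied = False
except PermissionError:
    denied = True

print(name, denied, len(access_log))
```

Logging denied attempts as well as granted ones gives the periodic audit a complete picture of who tried to reach the data and why.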

Data anonymization

The best way to protect Jane's privacy is to destroy the information in the users table after assigning Jane's user id. Without the users table, the events table is fully anonymized data.

For many analysis purposes, anonymized data is sufficient. We need to know that Jane is a unique individual, but we don't need to know her name or any other PII.
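To illustrate, typical questions such as "how many unique users were active?" or "how many events did each user generate?" can be answered from user ids alone. The events below are made up for the example; the second user id and the event names are hypothetical.

```python
from collections import Counter

# Anonymized events: user ids only, no names or other PII.
events = [
    (185477, "like_button_click"),
    (185477, "page_view"),
    (203918, "page_view"),
]

# Each distinct id counts as one unique individual.
unique_users = len({user_id for user_id, _ in events})

# Activity per user, still without knowing who anyone is.
events_per_user = Counter(user_id for user_id, _ in events)

print(unique_users)
print(events_per_user[185477])
```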

General Data Protection Regulation (GDPR)

You might have heard the term GDPR recently. GDPR stands for General Data Protection Regulation, and it applies to the personal data of individuals in the European Union.

The purpose of GDPR is to give individuals control over their personal data. Among other things, GDPR regulates how long data can be stored, mandates appropriate anonymization, and requires data collection to be disclosed and consent to be obtained.

It's always best to consult a lawyer when dealing with any data inside of the EU to ensure that you comply with GDPR.