# De-identifying

A colleague of mine is collecting longitudinal data on social networks. To do this, she’s asking everyone in the study who their friends are. Thus, her participants are providing her with a series of names at regular time intervals. Unfortunately, this poses a few problems.

1. The data are not de-identified. When working with data, it’s almost always a problem when you see people’s actual names next to other information about them. All identifying information should be stripped.

2. People provide partially matching names. Sometimes, when someone tells you that they are friends with ‘Jane Doe’, you later find that this person’s real name is ‘Janet Doe’.

To solve these problems, she wondered whether it was possible to create a function that takes as input a series of individual names and ID numbers, along with a series of names provided by the participants, and returns the latter series of names converted to ID numbers.

In other words, if you have one file which has names and ID numbers like this:

And then a second vector of names that you’d like to replace with ID numbers, where possible:
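To make the two inputs concrete, here is a hypothetical pair sketched in Python. The names, the ID numbers, and the variable names `roster`, `reported`, and `wanted` are all invented for illustration, and using `None` for an unmatched name is an assumption rather than something the study specifies.

```python
# Hypothetical roster: each participant's name paired with a study ID.
roster = [
    ("Janet Doe", 101),
    ("John Smith", 102),
    ("Maria de la Cruz", 103),
]

# Hypothetical friend nominations, typed the way participants might write them.
reported = ["Jane Doe", "John Smith", "Bob Jones"]

# What we would like back: an ID wherever a roster match exists,
# and a placeholder (None here, by assumption) wherever it does not.
wanted = [101, 102, None]
```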

Can we get a function which accomplishes this?

Obviously, the real question is not whether we can, but rather how we can do it.

I’ve included below the function that accomplishes this. What we’re doing is matching on the first few letters of the person’s first name, paired with the last name (or last $n$ names in cases where someone has more than one ‘name’ after their first), and then replacing each matched name with the corresponding ID number.
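The matching idea can be sketched in a few lines of Python. This is not the R/plyr function from this post, just a minimal illustration under some assumptions of my own: the helper names `name_key` and `names_to_ids` are hypothetical, matching is case-insensitive, unmatched names come back as `None`, and a key shared by two different IDs is treated as ambiguous and never matched.

```python
def name_key(full_name, n_letters=3):
    """Build a match key: first n letters of the first name, plus every name after it."""
    parts = full_name.lower().split()
    if not parts:
        return None
    first, rest = parts[0], parts[1:]
    return (first[:n_letters], " ".join(rest))


def names_to_ids(roster, reported, n_letters=3):
    """Replace each reported name with its roster ID where a unique match exists.

    roster   -- list of (name, id_number) pairs (hypothetical layout)
    reported -- list of names supplied by participants
    """
    lookup = {}
    for name, id_number in roster:
        key = name_key(name, n_letters)
        if key in lookup and lookup[key] != id_number:
            # Two different people share this key: mark it ambiguous.
            lookup[key] = None
        else:
            lookup[key] = id_number
    # Names with no key in the roster (or an ambiguous key) become None.
    return [lookup.get(name_key(name, n_letters)) for name in reported]


# Invented example data.
roster = [("Janet Doe", 101), ("John Smith", 102), ("Maria de la Cruz", 103)]
reported = ["Jane Doe", "John Smith", "Maria de la Cruz", "Bob Jones"]

print(names_to_ids(roster, reported, n_letters=3))
# [101, 102, 103, None]
```

Returning `None` rather than the original name is a deliberate choice in this sketch: leaving an unmatched name in place would keep identifying information in the output.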

For instance, matching based on the first 3 letters in the first name:

Versus the first 4 letters:

Matching on 3 letters grabs more of them, so why would we ever want 4? The stricter match helps in case you happen to have a Rick Williams and a Rich Williams, or something of the sort.
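The trade-off is easy to see with a small Python sketch (again, not the R function from this post). The helper names `make_key` and `candidates` and the roster below are hypothetical, invented only to show how the prefix length changes which roster entries a reported name could match.

```python
def make_key(full_name, n_letters):
    """First n letters of the first name, plus all remaining names, lowercased."""
    parts = full_name.lower().split()
    return (parts[0][:n_letters], " ".join(parts[1:]))


def candidates(roster, reported_name, n_letters):
    """Return every roster ID whose key equals the reported name's key."""
    key = make_key(reported_name, n_letters)
    return [id_number for name, id_number in roster
            if make_key(name, n_letters) == key]


# Invented roster with a nickname case and a near-collision.
roster = [("Jonathan Smith", 1), ("Rick Williams", 2), ("Rich Williams", 3)]

for n in (3, 4):
    print(n, [candidates(roster, name, n)
              for name in ["Jon Smith", "Rick Williams", "Rich Williams"]])
# 3 [[1], [2, 3], [2, 3]]
# 4 [[], [2], [3]]
```

With 3 letters, ‘Jon Smith’ finds Jonathan Smith, but Rick and Rich Williams become indistinguishable. With 4 letters the two Williamses separate cleanly, at the cost of losing the ‘Jon’ match.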

And the function itself is below. Note that plyr is a dependency, so make sure you’ve got that package installed if you want this to work:

Happy de-identifying!

Written on February 13, 2015