Almost all data sets include IDs, be they of individuals, groups, or events. Often, these IDs are not fully anonymized, and researchers may want to make them untraceable to individuals before sharing data within or beyond their research group.
In this short tutorial, I share how I recently solved this issue by creating randomized IDs in R and matching them to the original IDs in my data set.
The starting point
Imagine you have collected data on individuals with IDs. Each of these individuals chose a partner to talk to, who also had a unique ID. After partners were chosen, each choosing individual rated their satisfaction with the conversation (this is a really trivial example of network data). Now, to make things a little more complicated, imagine that the partners to choose from included new individuals outside of the original group. Also, imagine that individuals could refuse to choose a partner, resulting in missing values for the ID of the chosen partner. Here is a sample data set representing this situation:
library(dplyr, warn.conflicts = FALSE)
set.seed(123)  # make the example reproducible
d <- data.frame(
  id = 1:10 %>% as.character(),
  id_partner = sample(1:20, 10, replace = FALSE) %>% as.character(),
  satisfaction = sample(1:5, 10, replace = TRUE)
)
# individuals 4, 7 and 9 did not choose a partner
d$id_partner[c(4, 7, 9)] <- NA
d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
id | id_partner | satisfaction |
---|---|---|
1 | 15 | 1 |
2 | 19 | 2 |
3 | 14 | 3 |
4 | NA | 5 |
5 | 10 | 3 |
6 | 2 | 3 |
7 | NA | 1 |
8 | 11 | 4 |
9 | NA | 1 |
10 | 4 | 1 |
Now, the situation is the following: We want to consistently substitute all IDs representing individuals in this data set with pseudo-randomized numbers. For this, we need something like a dictionary, or hash-table, which temporarily matches old IDs to new, randomized IDs.
For this, I tried out the R package {hash}, but I found it rather slow on large collections of IDs.
Therefore, I simply created a data frame in which each row represents a “dictionary entry”. It looks like this (adjust the sample range of possible new ID numbers as convenient):
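# every ID that occurs in either column, without duplicates or missing values
all_possible_ids <- c(d$id, d$id_partner) %>% unique() %>% na.omit()
# draw one distinct new ID per old ID
random_ids <- sample(500:600, length(all_possible_ids), replace = FALSE) %>% as.character()
# one row per dictionary entry: the old ID and its replacement
id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()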
id_dictionary %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = T)
old | new |
---|---|
1 | 592 |
2 | 598 |
3 | 571 |
4 | 525 |
5 | 506 |
6 | 541 |
7 | 508 |
8 | 582 |
9 | 535 |
10 | 577 |
15 | 580 |
19 | 542 |
14 | 575 |
11 | 514 |
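The dictionary step above can be wrapped in a small helper. The following is only a sketch: the function name make_id_dictionary, its pool argument, and the toy IDs are assumptions made for illustration and are not part of the original workflow. It uses base R only, stops if the pool of candidate IDs is too small, and additionally drops candidates that equal an existing ID, so that old and new IDs cannot be confused.

```r
# Hypothetical helper: build an old -> new dictionary from one or more ID vectors.
make_id_dictionary <- function(..., pool = 500:600) {
  old <- unique(c(...))
  old <- old[!is.na(old)]                    # missing values get no entry
  pool <- setdiff(as.character(pool), old)   # never reuse an existing ID as a new ID
  if (length(pool) < length(old)) {
    stop("The pool of new IDs is smaller than the number of old IDs.")
  }
  data.frame(
    old = old,
    new = sample(pool, length(old)),         # sampling without replacement
    stringsAsFactors = FALSE
  )
}

# Toy input: two ID columns that share the ID "2" and contain one missing value
toy_dictionary <- make_id_dictionary(c("1", "2", "3"), c("2", NA, "7"))
toy_dictionary
```

Because the new IDs are drawn without replacement, every old ID receives a distinct new ID, which is what makes the later joins unambiguous.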
What we can do is rename the column old to the name of the variable we want to replace and then perform a left join on that variable. That way, we match the corresponding new, randomized IDs to the old IDs we want to replace:
d_reference <- d # copy of the original data, only needed for the final comparison
id_dictionary <- id_dictionary %>% rename(id = old)
d <- d %>% left_join(id_dictionary, by = "id")
d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = T)
id | id_partner | satisfaction | new |
---|---|---|---|
1 | 15 | 1 | 592 |
2 | 19 | 2 | 598 |
3 | 14 | 3 | 571 |
4 | NA | 5 | 525 |
5 | 10 | 3 | 506 |
6 | 2 | 3 | 541 |
7 | NA | 1 | 508 |
8 | 11 | 4 | 582 |
9 | NA | 1 | 535 |
10 | 4 | 1 | 577 |
Now, we can just overwrite the column of old IDs with the new, randomized IDs and drop the column new, as it is now a duplicate. Afterwards, we rename the column in our dictionary back to “old” for later reference and for additional left joins that consistently replace IDs appearing in other variables.
d <- d %>% mutate(id = new) %>% select(-new)
d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = T)
id | id_partner | satisfaction |
---|---|---|
592 | 15 | 1 |
598 | 19 | 2 |
571 | 14 | 3 |
525 | NA | 5 |
506 | 10 | 3 |
541 | 2 | 3 |
508 | NA | 1 |
582 | 11 | 4 |
535 | NA | 1 |
577 | 4 | 1 |
id_dictionary <- id_dictionary %>% rename(old = id)
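As an aside, the same lookup can be sketched without a join by using match() from base R. This is a different technique from the left join used in this tutorial, shown only for comparison; the toy vectors below are made up for illustration.

```r
# Toy data (hypothetical): three observations, two distinct IDs
toy_ids <- c("1", "2", "1")
toy_dictionary <- data.frame(
  old = c("1", "2"),
  new = c("501", "502"),
  stringsAsFactors = FALSE
)

# match() returns, for each ID, the row of its entry in the dictionary;
# indexing the new column with those positions yields the replacement IDs
toy_new_ids <- toy_dictionary$new[match(toy_ids, toy_dictionary$old)]
toy_new_ids
```

The join-based approach keeps everything inside a dplyr pipeline, while the match() variant avoids renaming the dictionary column back and forth.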
Now, there is just one final catch: the missing values in our data, which we do not want to overwrite when matching new IDs to old ones. To make sure they stay missing, we specify na_matches = "never" in the call to left_join, so that NA is never treated as a matching key.
id_dictionary <- id_dictionary %>% rename(id_partner = old)
d <- d %>%
  left_join(id_dictionary, by = "id_partner", na_matches = "never") %>%
  mutate(id_partner = new) %>%
  select(-new)
id_dictionary <- id_dictionary %>% rename(old = id_partner)
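To see what na_matches changes, here is a minimal sketch on made-up toy data. By default, dplyr joins treat NA in one table as equal to NA in the other (na_matches = "na"). The dictionary built in this tutorial contains no NA entry because of na.omit(), so there the option mainly acts as a safeguard; the toy dictionary below deliberately includes an NA entry to make the difference visible.

```r
library(dplyr, warn.conflicts = FALSE)

toy <- data.frame(id_partner = c("2", NA), stringsAsFactors = FALSE)

# Hypothetical dictionary that accidentally contains an entry for NA
toy_dictionary <- data.frame(
  id_partner = c("2", NA),
  new = c("502", "999"),
  stringsAsFactors = FALSE
)

# Default behaviour: the missing ID is matched to the NA entry
joined_default <- left_join(toy, toy_dictionary, by = "id_partner")
joined_default$new   # "502" "999"

# With na_matches = "never" the missing ID stays missing
joined_never <- left_join(toy, toy_dictionary, by = "id_partner", na_matches = "never")
joined_never$new     # "502" NA
```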
And… there you have it: we successfully replaced all occurrences of IDs in our data set with new, randomized IDs, handling both IDs that occur in several variables and missing values within them. To summarize what we did, we can call:
cbind(d_reference[,1:2], d) %>%
`colnames<-`(c("id_old", "id_partner_old", "id_new", "id_partner_new", "satisfaction")) %>%
knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = T)
id_old | id_partner_old | id_new | id_partner_new | satisfaction |
---|---|---|---|---|
1 | 15 | 592 | 580 | 1 |
2 | 19 | 598 | 542 | 2 |
3 | 14 | 571 | 575 | 3 |
4 | NA | 525 | NA | 5 |
5 | 10 | 506 | 577 | 3 |
6 | 2 | 541 | 598 | 3 |
7 | NA | 508 | NA | 1 |
8 | 11 | 582 | 514 | 4 |
9 | NA | 535 | NA | 1 |
10 | 4 | 577 | 525 | 1 |
Notice how the old ID “2” got replaced with “598” in both id_new and id_partner_new.
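If several columns need the same treatment, the rename, join, overwrite sequence can be wrapped in a function. The following is a sketch under stated assumptions, not a definitive implementation: the name replace_ids and the toy data are hypothetical, and the dictionary is expected to have the columns old and new, as in this tutorial. Instead of renaming the dictionary column, it passes a named by argument to left_join.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: replace the IDs in several columns using one dictionary
# that has the columns `old` and `new`.
replace_ids <- function(data, columns, dictionary) {
  stopifnot(!anyDuplicated(dictionary$old))  # exactly one entry per old ID
  for (col in columns) {
    joined <- left_join(
      data[col], dictionary,
      by = setNames("old", col),   # match data[[col]] against dictionary$old
      na_matches = "never"         # keep missing IDs missing
    )
    data[[col]] <- joined$new      # overwrite the old IDs with the new ones
  }
  data
}

# Toy data (made up): the ID "2" appears in both columns, one partner is missing
toy <- data.frame(
  id = c("1", "2", "3"),
  id_partner = c("2", NA, "1"),
  stringsAsFactors = FALSE
)
toy_dictionary <- data.frame(
  old = c("1", "2", "3"),
  new = c("501", "502", "503"),
  stringsAsFactors = FALSE
)

toy_replaced <- replace_ids(toy, c("id", "id_partner"), toy_dictionary)
toy_replaced
```

The check on duplicated old IDs matters because a dictionary with repeated keys would make the left join return more rows than the original data.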
Final thoughts
Pseudo-randomized sequences are an easy way to anonymize your data. Still, consider carefully, in light of your specific situation, how you store your original data and the script that creates your anonymized data set, since together they allow the new IDs to be traced back to the original ones.