Consistently Substituting IDs with Randomized Numbers Using R

tutorial R

Almost all data sets include IDs, whether they refer to individuals, groups, or events. Often, these IDs are not fully anonymized, even though researchers may want to make them untraceable to individuals before sharing data within or beyond their research group.

In this short tutorial, I share how I recently solved this issue by creating randomized IDs in R and matching them to the original IDs in my data set.

The starting point

Imagine you have collected data on individuals with IDs. Each of these individuals chose a partner to talk to, who also had a unique ID. After partners were chosen, each choosing individual rated their satisfaction with the conversation (this is a really trivial example of network data). Now, to make things a little more complicated, imagine that the pool of possible partners included new individuals from outside the original group. Also, imagine that individuals could refuse to choose a partner, resulting in missing values for the ID of the chosen partner. Here is a sample data set representing this situation:

library(dplyr, warn.conflicts = FALSE)

set.seed(123)

d <- data.frame(
    id = 1:10 %>% as.character(),
    id_partner = sample(1:20, 10, replace = FALSE) %>% as.character(),
    satisfaction = sample(1:5, 10, replace = TRUE)
)

# Three individuals refused to choose a partner
d$id_partner[c(4, 7, 9)] <- NA

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
id id_partner satisfaction
1 15 1
2 19 2
3 14 3
4 NA 5
5 10 3
6 2 3
7 NA 1
8 11 4
9 NA 1
10 4 1

Now, the situation is the following: we want to consistently substitute all IDs representing individuals in this data set with pseudo-randomized numbers. To do so, we need something like a dictionary, or hash table, which temporarily maps old IDs to new, randomized IDs.

I first tried the R package {hash}, but found it rather slow on large collections of IDs. Therefore, I simply created a data frame in which each row represents a “dictionary entry”, which looks like this (adjust the range from which new ID numbers are sampled as convenient; it must contain at least as many values as there are IDs):

all_possible_ids <- c(d$id, d$id_partner) %>% unique() %>% na.omit()

# Draw one new ID per old ID, without replacement so that no new ID repeats
random_ids <- sample(500:600, length(all_possible_ids), replace = FALSE) %>% as.character()

id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()

id_dictionary %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
old new
1 592
2 598
3 571
4 525
5 506
6 541
7 508
8 582
9 535
10 577
15 580
19 542
14 575
11 514
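
Before using a dictionary like this, it can be worth checking that it really is one-to-one. The following is a minimal sketch on toy values, not the code used above: the names old_ids and id_pool are made up for illustration, and the pool 500:600 is simply the range assumed in this tutorial.

```r
set.seed(123)

# Toy IDs standing in for the unique, non-missing IDs collected above (assumption)
old_ids <- c("1", "2", "3", "15", "19")

# Assumed pool of candidate IDs; sampling without replacement
# requires at least as many candidates as there are old IDs
id_pool <- 500:600
stopifnot(length(id_pool) >= length(old_ids))

id_dictionary <- data.frame(
  old = old_ids,
  new = as.character(sample(id_pool, length(old_ids), replace = FALSE)),
  stringsAsFactors = FALSE
)

# Each old ID appears once and each new ID appears once
stopifnot(
  anyDuplicated(id_dictionary$old) == 0,
  anyDuplicated(id_dictionary$new) == 0
)
```

If one of the checks fails, the script stops before any ID is replaced, which is usually preferable to discovering duplicated IDs later.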

Next, we rename the dictionary column old to the name of the variable we want to replace and perform a left join on that variable. This way, each old ID is matched with its corresponding new, randomized ID:

d_reference <- d  # copy of the original data, only needed for the final comparison

id_dictionary <- id_dictionary %>% rename(id = old)
d <- d %>% left_join(id_dictionary, by = "id")

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
id id_partner satisfaction new
1 15 1 592
2 19 2 598
3 14 3 571
4 NA 5 525
5 10 3 506
6 2 3 541
7 NA 1 508
8 11 4 582
9 NA 1 535
10 4 1 577

Now, we can simply overwrite the column of old IDs with the new, randomized IDs and drop the column new, which is now a duplicate. Afterwards, we rename the column in our dictionary back to “old” so that we can reuse it in additional left joins and consistently replace IDs appearing in other variables.

d <- d %>% mutate(id = new) %>% select(-new)

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
id id_partner satisfaction
592 15 1
598 19 2
571 14 3
525 NA 5
506 10 3
541 2 3
508 NA 1
582 11 4
535 NA 1
577 4 1

id_dictionary <- id_dictionary %>% rename(old = id)
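
As an aside, the same substitution can be sketched in base R with a named character vector instead of a join. This is an alternative technique, not the approach used in this post, and I have not benchmarked it against the join. The objects d_toy, dict_toy and lookup are made up for illustration.

```r
# Toy data and dictionary mirroring the structure used above (assumption)
d_toy <- data.frame(id = c("1", "2", "3"), stringsAsFactors = FALSE)
dict_toy <- data.frame(
  old = c("1", "2", "3"),
  new = c("592", "598", "571"),
  stringsAsFactors = FALSE
)

# Named vector: the names are the old IDs, the values are the new IDs
lookup <- setNames(dict_toy$new, dict_toy$old)

# Indexing by the old IDs returns the new ones; unname() drops the old IDs again
d_toy$id <- unname(lookup[d_toy$id])
```

IDs that do not appear in the names of lookup come back as NA, so it is worth checking the result for unexpected missing values.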

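Now, there is just one final catch: the partner IDs contain missing values, and these must stay missing rather than being matched to a new ID. By default, left_join treats NA values as equal to each other, so an NA key in the dictionary would match them. We already removed NA from our dictionary with na.omit(), but to rule this out explicitly we specify na_matches = "never" in the call to left_join.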

id_dictionary <- id_dictionary %>% rename(id_partner = old)
d <- d %>%
  left_join(id_dictionary, by = "id_partner", na_matches = "never") %>%
  mutate(id_partner = new) %>%
  select(-new)
id_dictionary <- id_dictionary %>% rename(old = id_partner)
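
To see what na_matches changes, here is a small sketch with a hypothetical dictionary that accidentally contains a missing key. The objects x and dict_with_na and the value "999" are made up for illustration and are not part of the data above.

```r
library(dplyr, warn.conflicts = FALSE)

# One real partner ID and one missing partner ID
x <- data.frame(id_partner = c("15", NA), stringsAsFactors = FALSE)

# Hypothetical dictionary that accidentally contains an NA key
dict_with_na <- data.frame(
  id_partner = c("15", NA),
  new = c("580", "999"),
  stringsAsFactors = FALSE
)

# Default (na_matches = "na"): the NA in x matches the NA key in the dictionary
joined_default <- left_join(x, dict_with_na, by = "id_partner")

# na_matches = "never": NA keys never match, so the new ID stays missing
joined_never <- left_join(x, dict_with_na, by = "id_partner", na_matches = "never")
```

In joined_default, the missing partner silently receives the ID "999", whereas in joined_never it stays NA.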

And… there you have it: we successfully replaced all occurrences of IDs in our data set with new, randomized IDs, handling both IDs that occur in several variables and missing values. To summarize what we did, we can call:

cbind(d_reference[, 1:2], d) %>%
  `colnames<-`(c("id_old", "id_partner_old", "id_new", "id_partner_new", "satisfaction")) %>%
  knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)
id_old id_partner_old id_new id_partner_new satisfaction
1 15 592 580 1
2 19 598 542 2
3 14 571 575 3
4 NA 525 NA 5
5 10 506 577 3
6 2 541 598 3
7 NA 508 NA 1
8 11 582 514 4
9 NA 535 NA 1
10 4 577 525 1

Notice how the old ID “2” got replaced with “598” in both id_new and id_partner_new.
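
If IDs appear in more than two variables, the rename, join, overwrite and drop steps could be wrapped in a small function. The following is only a sketch: replace_ids is a hypothetical helper, and it assumes that the dictionary has the columns old and new and that the data has no column called new. Instead of renaming the dictionary back and forth, it passes a named by vector to left_join.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: replace the IDs in every listed column via the dictionary
replace_ids <- function(data, columns, dictionary) {
  for (column in columns) {
    # Join the data column `column` against the dictionary column "old";
    # missing IDs are never matched and therefore stay missing
    data <- data %>%
      left_join(dictionary, by = setNames("old", column), na_matches = "never")
    data[[column]] <- data$new  # overwrite old IDs with new ones
    data$new <- NULL            # drop the helper column again
  }
  data
}

# Toy data and dictionary for illustration only
d_small <- data.frame(
  id = c("1", "2"),
  id_partner = c("2", NA),
  stringsAsFactors = FALSE
)
dict <- data.frame(
  old = c("1", "2"),
  new = c("592", "598"),
  stringsAsFactors = FALSE
)

d_anon <- replace_ids(d_small, c("id", "id_partner"), dict)
```

In d_anon, the old ID “2” becomes “598” in both columns, and the missing partner stays missing.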

Final thoughts

Pseudo-randomized sequences are an easy way to anonymize your data. Still, anyone with access to the original data and to the script (including its seed) can reproduce the dictionary, so carefully consider how you store both in light of your specific situation.
