Consistently Substituting IDs with Randomized Numbers Using R

tutorial R

Almost all data sets include IDs, whether of individuals, groups, or events. Often these IDs are not fully anonymized, and researchers may want to make them untraceable to individuals before sharing data within or beyond their research group.

In this short tutorial, I share how I recently solved this issue by creating randomized IDs in R and matching them to the original IDs in my data set.

The starting point

Imagine you have collected data on individuals with IDs. Each of these individuals chose a partner to talk to, who also had a unique ID. After partners were chosen, each choosing individual rated their satisfaction with the conversation (this is a really trivial example of network data). Now, to make things a little more complicated, imagine that the partners to choose from included new individuals from outside the original group. Also, imagine that individuals could refuse to choose a partner, resulting in missing values for the ID of the chosen partner. Here is a sample data set representing this situation:

library(dplyr, warn.conflicts = FALSE)

set.seed(123)  # fix the seed so the example is reproducible

d <- data.frame(
    id = 1:10 %>% as.character(),
    id_partner = sample(1:20, 10, replace = FALSE) %>% as.character(),
    satisfaction = sample(1:5, 10, replace = TRUE)
)

# Individuals 4, 7 and 9 did not choose a partner
d$id_partner[c(4, 7, 9)] <- NA

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

id	id_partner	satisfaction
1	15	1
2	19	2
3	14	3
4	NA	5
5	10	3
6	2	3
7	NA	1
8	11	4
9	NA	1
10	4	1

Now, the situation is as follows: we want to consistently substitute all IDs representing individuals in this data set with pseudo-randomized numbers. To do so, we need something like a dictionary, or hash table, that temporarily maps old IDs to new, randomized IDs.

I first tried the R package {hash}, but found it rather slow on large collections of IDs. Instead, I created a data frame in which each row represents a “dictionary entry” (adjust the range of possible new ID numbers as convenient; it must contain at least as many numbers as there are unique IDs):

# All unique IDs that appear in either column, without missing values
all_possible_ids <- c(d$id, d$id_partner) %>% unique() %>% na.omit()

# Draw one distinct new ID per old ID
random_ids <- sample(500:600, length(all_possible_ids), replace = FALSE) %>% as.character()

id_dictionary <- cbind(all_possible_ids, random_ids) %>% `colnames<-`(c("old", "new")) %>% as.data.frame()

id_dictionary %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

old	new
1	592
2	598
3	571
4	525
5	506
6	541
7	508
8	582
9	535
10	577
15	580
19	542
14	575
11	514
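The hint about adjusting the sample range matters because sampling without replacement fails when the pool of candidate numbers is smaller than the number of IDs. As a minimal sketch, the dictionary creation could be wrapped in a function that checks this up front. The name `make_id_dictionary` and its arguments are hypothetical and not part of the original workflow.

```r
# Hypothetical helper: build an old -> new ID dictionary as a data frame.
make_id_dictionary <- function(ids, pool = 500:600) {
  # Keep each non-missing ID exactly once, as character
  old <- unique(as.character(ids[!is.na(ids)]))
  # Sampling without replacement needs at least as many candidates as IDs
  if (length(pool) < length(old)) {
    stop("The pool of new IDs is smaller than the number of IDs to replace.")
  }
  # Index into the pool so that a pool of length one is handled correctly
  new <- as.character(pool[sample.int(length(pool), length(old))])
  data.frame(old = old, new = new, stringsAsFactors = FALSE)
}

set.seed(123)
dict <- make_id_dictionary(c("1", "2", "3", NA, "2", "15"))
dict
```

The resulting data frame has one row per distinct, non-missing ID, so it can be used in the same joins as `id_dictionary`.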

After renaming the dictionary column old to the name of the variable we want to replace, we can perform a left join on that variable. This matches each old ID we want to replace with its new, randomized ID:

d_reference <- d  # copy of the original data, used only for the final comparison

id_dictionary <- id_dictionary %>% rename(id = old)
d <- d %>% left_join(id_dictionary, by = "id")

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

id	id_partner	satisfaction	new
1	15	1	592
2	19	2	598
3	14	3	571
4	NA	5	525
5	10	3	506
6	2	3	541
7	NA	1	508
8	11	4	582
9	NA	1	535
10	4	1	577

Now, we can simply overwrite the column of old IDs with the new, randomized IDs and drop the column new, as it is now a duplicate. Afterwards, we rename the column in our dictionary back to “old” for later reference and for additional left joins that consistently replace the IDs appearing in other variables.

d <- d %>% mutate(id = new) %>% select(-new)

d %>% knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

id	id_partner	satisfaction
592	15	1
598	19	2
571	14	3
525	NA	5
506	10	3
541	2	3
508	NA	1
582	11	4
535	NA	1
577	4	1

id_dictionary <- id_dictionary %>% rename(old = id)
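The join-and-overwrite step amounts to looking up each old ID in the dictionary. For comparison only, here is a minimal sketch of the same lookup in base R with `match()`. It is an alternative to the join used in this tutorial, not the approach taken here, and it runs on a small toy data set built from a few values in the tables.

```r
# Toy dictionary and data, using a few values from the tables
id_dictionary <- data.frame(
  old = c("1", "2", "3"),
  new = c("592", "598", "571"),
  stringsAsFactors = FALSE
)
d <- data.frame(
  id = c("1", "2", "3"),
  satisfaction = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# match() returns, for each old ID, its row position in the dictionary;
# indexing `new` with these positions yields the replacement IDs in data order
d$id <- id_dictionary$new[match(d$id, id_dictionary$old)]
d
```

Any ID that is not found in the dictionary, including `NA`, ends up as `NA` after this lookup.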

Now, there is just one final catch: the missing values in id_partner, which should stay missing when we match new IDs to old ones. By default, left_join treats NA values as matching each other, so to be safe we specify na_matches="never" in the call to left_join.

id_dictionary <- id_dictionary %>% rename(id_partner = old)
d <- d %>%
  left_join(id_dictionary, by = "id_partner", na_matches = "never") %>%
  mutate(id_partner = new) %>%
  select(-new)
id_dictionary <- id_dictionary %>% rename(old = id_partner)
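To see what na_matches changes, here is a small sketch comparing the default behaviour with na_matches = "never". The dictionary in this tutorial contains no NA entry because of na.omit(), so the option acts as a safeguard there. The toy dictionary below deliberately contains an NA row, and the value "999" is made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Toy data with one missing partner ID
x <- data.frame(id_partner = c("2", NA), stringsAsFactors = FALSE)

# A dictionary that (accidentally) contains an NA entry; "999" is a made-up value
dict <- data.frame(
  id_partner = c("2", NA),
  new = c("598", "999"),
  stringsAsFactors = FALSE
)

# Default (na_matches = "na"): NA keys match each other,
# so the missing partner would receive an ID
default_join <- left_join(x, dict, by = "id_partner")

# na_matches = "never": the missing partner stays missing
strict_join <- left_join(x, dict, by = "id_partner", na_matches = "never")

default_join
strict_join
```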

And… there you have it: we successfully replaced all occurrences of IDs in our data set with new, randomized IDs, handling both IDs that occur across several variables and missing values within them. To summarize what we did, we can call:

# Put the original ID columns next to the anonymized data for comparison
cbind(d_reference[, 1:2], d) %>%
  `colnames<-`(c("id_old", "id_partner_old", "id_new", "id_partner_new", "satisfaction")) %>%
  knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

id_old	id_partner_old	id_new	id_partner_new	satisfaction
1	15	592	580	1
2	19	598	542	2
3	14	571	575	3
4	NA	525	NA	5
5	10	506	577	3
6	2	541	598	3
7	NA	508	NA	1
8	11	582	514	4
9	NA	535	NA	1
10	4	577	525	1

Notice how the old ID “2” got replaced with “598” in both id_new and id_partner_new.
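Since the rename, join and overwrite steps are the same for every ID variable, they could be collected in one function and applied column by column. The following is only a sketch under a few assumptions: `replace_ids` is a hypothetical name, the dictionary has the columns `old` and `new`, and the data has no column called `new`. The toy values come from the tables in this tutorial.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: replace the IDs in one column using the dictionary.
# Assumes `dictionary` has columns `old` and `new`, and `data` has no column `new`.
replace_ids <- function(data, column, dictionary) {
  data %>%
    # Join the data column named in `column` against the dictionary column `old`
    left_join(dictionary, by = setNames("old", column), na_matches = "never") %>%
    # Overwrite the old IDs with the new ones and drop the helper column
    mutate(!!column := new) %>%
    select(-new)
}

dict <- data.frame(
  old = c("1", "2", "15"),
  new = c("592", "598", "580"),
  stringsAsFactors = FALSE
)
toy <- data.frame(
  id = c("1", "2"),
  id_partner = c("15", NA),
  stringsAsFactors = FALSE
)

# Apply the same dictionary to every ID column
anonymized <- toy
for (column in c("id", "id_partner")) {
  anonymized <- replace_ids(anonymized, column, dict)
}
anonymized
```

Joining with a named `by` vector avoids renaming the dictionary column back and forth for each variable.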

Final thoughts

Pseudo-randomized sequences are an easy way to anonymize your data. Still, carefully consider how you store your original data as well as the script that creates your anonymized data set in light of your specific situation: because the seed is fixed, rerunning the script on the original data reproduces exactly the same mapping between old and new IDs.
