Extracting Facebook IDs and User Names from Uncleaned Facebook URLs with R

crowdtangle R

CrowdTangle is a service owned by Facebook that recently started to grant researchers access to public Facebook data (for a richer introduction see this blog post by Facebook Research).

Researchers investigating public institutions that are represented through a Facebook account are often interested in merging in data about these institutions from other public sources.

However, it can be difficult to algorithmically match Facebook pages with institutional names or other identifiers on a larger scale. One way of doing this is to collect the links these institutions publish to their own Facebook pages, which are typically found on their websites.

While matching data through Facebook URLs seems to be the most intuitive solution, there is one problem with this approach: Facebook URLs have changed in structure over time, and they can be reported in many different forms (links to specific posts, videos, …).

The most reliable approach to merging on Facebook URLs is to extract two identifiers of Facebook pages from these URLs: Facebook IDs and user names. The following code showcases how this can be done for a variety of different Facebook URL types.

Disclaimer: The following procedure is still a work in progress, and I am sure that there are edge cases it does not handle. So far, I have tested the code on a set of 30k URLs with a success rate of 100%.

Facebook URL types

A typical Facebook page URL consists of a user name and a Facebook ID, which is a sequence of at least 9 digits.

https://www.facebook.com/user-name-with-hyphens-123456789

Often, institutions link to their Facebook page through a link to a specific post:

https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789

Additionally, older versions of page links featured URL chunks indicating that the link belongs to a page:

https://www.facebook.com/pages/user-name-with-hyphens-123456789

Sometimes, these chunks are also two-fold:

https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789
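To make the four patterns above easier to tell apart, here is a minimal base R sketch that labels a URL with the pattern it matches. The helper name `classify_facebook_url` and its labels are my own illustration and not part of the cleaning code below; it assumes the URLs follow the shapes shown in this section.

```r
# Hypothetical helper: label each URL with the pattern it matches.
# Assumes URLs of the form [protocol][www.]facebook.<tld>/<path>.
classify_facebook_url <- function(urls) {
  # strip protocol, www prefix and domain so only the path is left
  path <- sub("^(https?://)?(w{3,}\\.)?facebook\\.[a-z]{2,3}/+", "", tolower(urls))
  ifelse(grepl("^category/", path), "category",
  ifelse(grepl("^(pages|pg)/", path), "old page prefix",
  ifelse(grepl("/(posts|videos|events)/", path), "content link",
         "plain page")))
}

url_types <- classify_facebook_url(c(
  "https://www.facebook.com/user-name-with-hyphens-123456789",
  "https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789",
  "https://www.facebook.com/pages/user-name-with-hyphens-123456789",
  "https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789"
))
url_types
```

For these four inputs the labels come out as "plain page", "content link", "old page prefix" and "category", in that order. A tabulation of such labels can be a quick way to see which URL shapes dominate a data set before cleaning it.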

Finally, URLs generally have to be cleaned in terms of the following aspects:

Percent-encoding (URL encoding of non-ASCII characters)

utils::URLdecode("Here we have a pl%E5ceholder")

## [1] "Here we have a pl\xe5ceholder"

Redundant forward slashes

This link still works: 

cborchers.com///tags

Optional forward slashes at the end of links

These link to the same page.

cborchers.com/publications/
cborchers.com/publications
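The three clean-ups above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name `tidy_url` is hypothetical, and I assume that decoded bytes which are not valid UTF-8 (like the `\xe5` above) were Latin-1 in the source, which may not hold for every URL.

```r
# Hypothetical helper, base R only: apply the three clean-ups to a character vector.
tidy_url <- function(urls) {
  # 1) decode percent-encoded characters, one URL at a time
  urls <- vapply(urls, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # Assumption: bytes that are not valid UTF-8 were Latin-1 in the source
  bad <- !validUTF8(urls)
  urls[bad] <- iconv(urls[bad], from = "latin1", to = "UTF-8")
  # 2) collapse repeated slashes, but keep the "//" that follows a protocol
  urls <- gsub("(?<!:)/{2,}", "/", urls, perl = TRUE)
  # 3) drop an optional trailing slash
  sub("/$", "", urls)
}

tidied <- tidy_url(c(
  "cborchers.com///tags",
  "cborchers.com/publications/",
  "https://cborchers.com//publications",
  "Here we have a pl%E5ceholder"
))
tidied
```

The first three inputs come back as `cborchers.com/tags`, `cborchers.com/publications` and `https://cborchers.com/publications`. The negative lookbehind `(?<!:)` is only needed when the protocol is still present; the cleaning function below removes the protocol first, so it can collapse slashes without that guard.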

The cleaning process

First, we reduce each URL to the chunk that contains the Facebook ID and user name as they appear in the URL:

library(magrittr)
library(stringr, warn.conflicts = FALSE)

# Reduce raw Facebook URLs to the page chunk (user name plus optional Facebook ID)
clean_url <- function(strings) {
  strings %>%
    vapply(utils::URLdecode, character(1), USE.NAMES = FALSE) %>%  # decode percent-encoded characters
    tolower() %>%
    str_remove_all("^https?://") %>%   # optional transfer protocol specification at beginning
    str_remove_all("^w{3,}\\.") %>%    # optional www. or wwww{...}. at beginning (dot escaped)
    str_replace_all("/{2,}", "/") %>%  # redundant forward slashes
    str_remove_all("facebook\\.[a-z]{2,3}/") %>%  # international endings, .de, .fr, .it, ...
    str_remove_all("posts/.*|videos/.*|timeline.*|events/.*") %>%  # content chunks
    str_remove_all("/$") %>%           # forward slashes at the end of URLs
    str_remove_all("pages/|pg/|hashtag/|people/") %>%  # old page identifiers
    str_remove_all("category/(?=\\S*['-])([a-zA-Z'-]+)/")  # category names with dashes
}

test_cases <- c(
  "http://www.facebook.com/user-name-example-2412412412412",
  "wwwww.facebook.com/user-name-short-987654321",
  "facebook.de/german-page-name-example-987654321/videos/12345",
  "https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/",
  "https://www.facebook.com//pages/user-name-with-hyphens-123456789",
  "https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789",
  "https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789",
  "https://www.facebook.com/user-name-with-dot.dot-123456789",
  "https://www.facebook.com/user-without-Facebook-ID"
)

d <- test_cases %>% clean_url()
d

## [1] "user-name-example-2412412412412"                
## [2] "user-name-short-987654321"                      
## [3] "german-page-name-example-987654321"             
## [4] "user-name-with-hyphens-123456789"               
## [5] "user-name-with-hyphens-123456789"               
## [6] "user-name-with-hyphens-123456789"               
## [7] "user-name-with-a-number-2-and-hyphens-123456789"
## [8] "user-name-with-dot.dot-123456789"               
## [9] "user-without-facebook-id"

Then, all that's left is extracting the Facebook IDs and the user names without hyphens (which is how CrowdTangle returns them):

# Facebook ID: the first run of at least 9 digits; str_extract() returns NA if there is none
facebook_id <- str_extract(d, "[[:digit:]]{9,}")

# User name: drop the ID, then strip all punctuation except dots
user_name <- d %>% 
  str_remove_all("[[:digit:]]{9,}") %>%
  str_replace_all("([.])|[[:punct:]]", "\\1")    # group 1 keeps dots; other punctuation is removed
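For readers who prefer to avoid the stringr dependency, the same split can be sketched in base R. The helper name `split_slug` is hypothetical, and the character class `[^[:alnum:].]` is a swapped-in simplification: it removes everything that is not a letter, digit or dot, which is slightly broader than the punctuation rule used above (it also drops whitespace, for instance).

```r
# Hypothetical helper: split cleaned URL chunks into Facebook ID and user name (base R only).
split_slug <- function(slugs) {
  id_pattern <- "[[:digit:]]{9,}"          # Facebook ID: at least 9 consecutive digits
  has_id <- grepl(id_pattern, slugs)
  facebook_id <- rep(NA_character_, length(slugs))
  # regmatches() returns one match per element that has an ID
  facebook_id[has_id] <- regmatches(slugs, regexpr(id_pattern, slugs))
  user_name <- gsub(id_pattern, "", slugs)            # drop the ID
  user_name <- gsub("[^[:alnum:].]", "", user_name)   # keep letters, digits and dots only
  data.frame(slug = slugs, facebook_id = facebook_id, user_name = user_name,
             stringsAsFactors = FALSE)
}

ids <- split_slug(c(
  "user-name-with-a-number-2-and-hyphens-123456789",
  "user-name-with-dot.dot-123456789",
  "user-without-facebook-id"
))
ids
```

Returning a data frame keeps each ID next to the user name it was extracted with, which is convenient for the merge step later on.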

Result

cbind(test_cases, facebook_id, user_name) %>%
  knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = TRUE)

test_cases	facebook_id	user_name
http://www.facebook.com/user-name-example-2412412412412	2412412412412	usernameexample
wwwww.facebook.com/user-name-short-987654321	987654321	usernameshort
facebook.de/german-page-name-example-987654321/videos/12345	987654321	germanpagenameexample
https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/	123456789	usernamewithhyphens
https://www.facebook.com//pages/user-name-with-hyphens-123456789	123456789	usernamewithhyphens
https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789	123456789	usernamewithhyphens
https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789	123456789	usernamewithanumber2andhyphens
https://www.facebook.com/user-name-with-dot.dot-123456789	123456789	usernamewithdot.dot
https://www.facebook.com/user-without-Facebook-ID	NA	userwithoutfacebookid
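With both identifiers in hand, the merge that motivated this post could look roughly like the sketch below: match on the Facebook ID where one exists and fall back to the user name otherwise. Both data frames are toy data, and every column name and value (`institution`, `n_posts`, and so on) is made up for illustration; a real CrowdTangle export will look different.

```r
# Toy data; all column names and values are made up for illustration.
institutions <- data.frame(
  institution = c("Institution A", "Institution B", "Institution C"),
  facebook_id = c("123456789", NA, "987654321"),
  user_name   = c("usernamewithhyphens", "userwithoutfacebookid", "usernameshort"),
  stringsAsFactors = FALSE
)
crowdtangle <- data.frame(
  facebook_id = c("123456789", "555555555", "987654321"),
  user_name   = c("usernamewithhyphens", "userwithoutfacebookid", "usernameshort"),
  n_posts     = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# 1) match on the Facebook ID where one could be extracted
with_id <- institutions[!is.na(institutions$facebook_id), c("institution", "facebook_id")]
by_id <- merge(with_id, crowdtangle, by = "facebook_id")

# 2) fall back to the user name for rows without an ID
no_id <- institutions[is.na(institutions$facebook_id), c("institution", "user_name")]
by_name <- merge(no_id, crowdtangle, by = "user_name")

# stack both sets of matches with a common column order
cols <- c("institution", "facebook_id", "user_name", "n_posts")
matched <- rbind(by_id[, cols], by_name[, cols])
matched
```

Matching on the ID first is a design choice: IDs are less likely to collide than user names, so the name-based fallback is only used where no ID is available.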

Final thoughts

This post is a work in progress. If you have any questions or find a Facebook URL that is not handled appropriately by the proposed functions, do not hesitate to get in touch with me.
