Extracting Facebook IDs and User Names from Uncleaned Facebook URLs with R

crowdtangle R

CrowdTangle is a service owned by Facebook that recently started to grant researchers access to public Facebook data (for a richer introduction see this blog post by Facebook Research).

Researchers investigating public institutions that are represented through a Facebook account often are interested in merging data from other public sources about these institutions.

However, it can be difficult to algorithmically match Facebook pages with institutional names or other identifiers at scale. One way of doing this is to find the links these institutions publish to their own Facebook pages, typically on their websites.

While matching data through Facebook URLs seems to be the most intuitive solution, there is one problem with this approach: Facebook URLs

  1. have changed over time in terms of their structure
  2. can be reported in many different forms (links to specific posts, videos, …)

The most reliable approach to merging on Facebook URLs is to extract two identifiers of Facebook pages from these URLs: Facebook IDs and user names. The following code showcases how this can be done for a variety of different Facebook URL types.

Disclaimer: The following procedure is still a work in progress, and I am sure there are still edge cases it does not handle. So far, I have tested the code on a set of 30k URLs with a success rate of 100%.

Facebook URL types

A typical Facebook page URL consists of a user name and a Facebook ID, which is a sequence of at least 9 digits.

https://www.facebook.com/user-name-with-hyphens-123456789

Often, institutions link to their Facebook page through a link to a specific post:

https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789

Additionally, older versions of page links featured URL chunks indicating that the link belongs to a page:

https://www.facebook.com/pages/user-name-with-hyphens-123456789

Sometimes, these chunks also come in two parts:

https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789
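To make the problem concrete, the following base R sketch compares the four URL shapes above, first as raw strings and then by their first run of at least 9 digits. The variable names are illustrative, and the pattern simply encodes the assumption stated above that a Facebook ID has at least 9 digits.

```r
# The four URL shapes shown above, all pointing to the same (made-up) page
urls <- c(
  "https://www.facebook.com/user-name-with-hyphens-123456789",
  "https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789",
  "https://www.facebook.com/pages/user-name-with-hyphens-123456789",
  "https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789"
)

# As raw strings, none of the URLs match each other
n_distinct_urls <- length(unique(urls))

# First run of at least 9 digits in each URL (assumed to be the Facebook ID)
ids <- regmatches(urls, regexpr("[0-9]{9,}", urls))
n_distinct_ids <- length(unique(ids))
```

Matching on the raw strings would treat these as four different pages, while the extracted ID is identical for all of them.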

Finally, URLs generally have to be cleaned in terms of the following aspects:

  • Percent-encoded characters (URL encoding)
utils::URLdecode("Here we have a pl%E5ceholder")
## [1] "Here we have a pl\xe5ceholder"
  • Redundant forward slashes
This link still works: 

cborchers.com///tags
  • Optional forward slashes at the end of links
These link to the same page.

cborchers.com/publications/
cborchers.com/publications
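The three cleaning aspects above can be sketched in base R as follows. This is only an illustration on made-up inputs, not the function used later in this post; the variable names are hypothetical.

```r
# Percent-encoded characters: %2D is the encoded form of a hyphen
decoded <- utils::URLdecode("user%2Dname")

# Hypothetical raw links in the forms listed above
raw <- c(
  "cborchers.com///tags",
  "cborchers.com/publications/",
  "cborchers.com/publications"
)

# Collapse runs of forward slashes into one, then drop a single trailing slash
normalized <- sub("/$", "", gsub("/+", "/", raw))
```

Note that collapsing slashes would also alter the `//` in `https://`, so the protocol should be stripped before this step.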

The cleaning process

First, we extract all Facebook IDs and user names as they appear in the URLs:

library(magrittr)
library(stringr, warn.conflicts = FALSE)

# Reduce raw Facebook URLs to the chunk that holds the user name and Facebook ID
clean_url <- function(strings) {
  strings %>%
    vapply(utils::URLdecode, character(1), USE.NAMES = FALSE) %>%  # decode percent-encoded characters
    tolower() %>%
    str_remove_all("^https?://") %>%   # optional transfer protocol specification at beginning
    str_remove_all("^w{3,}\\.") %>%    # optional www. or wwww{...}. at beginning (dot escaped)
    str_replace_all("/+", "/") %>%     # redundant forward slashes
    str_remove_all("facebook\\.[a-z]{2,3}/") %>%  # domain incl. international endings, .de, .fr, .it, ...
    str_remove_all("posts/.*|videos/.*|timeline.*|events/.*") %>%  # content chunks
    str_remove_all("/$") %>%           # forward slash at the end of URLs
    str_remove_all("pages/|pg/|hashtag/|people/") %>%  # old page identifiers
    str_remove_all("category/(?=\\S*['-])([a-zA-Z'-]+)/")  # category names with dashes
}

test_cases <- c(
  "http://www.facebook.com/user-name-example-2412412412412",
  "wwwww.facebook.com/user-name-short-987654321",
  "facebook.de/german-page-name-example-987654321/videos/12345",
  "https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/",
  "https://www.facebook.com//pages/user-name-with-hyphens-123456789",
  "https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789",
  "https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789",
  "https://www.facebook.com/user-name-with-dot.dot-123456789",
  "https://www.facebook.com/user-without-Facebook-ID"
)

d <- test_cases %>% clean_url()
d
## [1] "user-name-example-2412412412412"                
## [2] "user-name-short-987654321"                      
## [3] "german-page-name-example-987654321"             
## [4] "user-name-with-hyphens-123456789"               
## [5] "user-name-with-hyphens-123456789"               
## [6] "user-name-with-hyphens-123456789"               
## [7] "user-name-with-a-number-2-and-hyphens-123456789"
## [8] "user-name-with-dot.dot-123456789"               
## [9] "user-without-facebook-id"
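The category pattern is the least obvious step in `clean_url()`, so here is a small base R sketch of how it behaves in isolation. The pattern is copied from the function above; the second input is a hypothetical case that I have not seen in real data and only serves to show when the lookahead fails.

```r
# Pattern copied from clean_url(); perl = TRUE enables the lookahead in base R
category_pattern <- "category/(?=\\S*['-])([a-zA-Z'-]+)/"

examples <- c(
  "category/category-name-with-hyphens/user-name-with-hyphens-123456789",
  "category/school/username"  # hypothetical: no hyphen or apostrophe anywhere
)

stripped <- gsub(category_pattern, "", examples, perl = TRUE)
```

The lookahead only requires a hyphen or apostrophe somewhere in the rest of the string, so the first input loses its `category/.../` prefix while the second one is left unchanged.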

Then, all that's left is extracting the Facebook IDs and the user names without hyphens (CrowdTangle returns user names in that form):

# Facebook ID: first run of at least 9 digits; str_extract() returns NA if there is none
facebook_id <- str_extract(d, "[[:digit:]]{9,}")

# User name: remove the ID, then all punctuation except dots
user_name <- d %>%
  str_remove_all("[[:digit:]]{9,}") %>%
  str_replace_all("([.])|[[:punct:]]", "\\1")    # only allow dots as punctuation
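For readers who prefer to avoid the stringr dependency, the same two extraction steps can be approximated in base R. The helper name `split_facebook_chunk` is hypothetical and this is a sketch under the same 9-digit assumption, not a drop-in replacement tested on the 30k URLs mentioned above.

```r
# Hypothetical helper: split cleaned URL chunks into Facebook ID and user name
split_facebook_chunk <- function(chunks) {
  # Facebook ID: first run of at least 9 digits, NA if there is none
  id <- rep(NA_character_, length(chunks))
  hit <- regexpr("[0-9]{9,}", chunks)
  id[hit > 0] <- regmatches(chunks, hit)

  # User name: drop long digit runs, then all punctuation except dots
  name <- gsub("[0-9]{9,}", "", chunks)
  name <- gsub("(?![.])[[:punct:]]", "", name, perl = TRUE)

  data.frame(facebook_id = id, user_name = name, stringsAsFactors = FALSE)
}

parsed <- split_facebook_chunk(c(
  "user-name-with-dot.dot-123456789",
  "user-without-facebook-id"
))
```

The negative lookahead `(?![.])` plays the same role as the capture group in the stringr version: dots survive, all other punctuation is removed.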

Result

cbind(test_cases, facebook_id, user_name) %>%
  knitr::kable(format = "html") %>% kableExtra::kable_styling(full_width = T)
| test_cases | facebook_id | user_name |
|---|---|---|
| http://www.facebook.com/user-name-example-2412412412412 | 2412412412412 | usernameexample |
| wwwww.facebook.com/user-name-short-987654321 | 987654321 | usernameshort |
| facebook.de/german-page-name-example-987654321/videos/12345 | 987654321 | germanpagenameexample |
| https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/ | 123456789 | usernamewithhyphens |
| https://www.facebook.com//pages/user-name-with-hyphens-123456789 | 123456789 | usernamewithhyphens |
| https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789 | 123456789 | usernamewithhyphens |
| https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789 | 123456789 | usernamewithanumber2andhyphens |
| https://www.facebook.com/user-name-with-dot.dot-123456789 | 123456789 | usernamewithdot.dot |
| https://www.facebook.com/user-without-Facebook-ID | NA | userwithoutfacebookid |
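Once extracted, the identifiers can serve as merge keys, which was the motivation at the start of this post. The sketch below uses two made-up data frames; the column names, institutions, and follower counts are all hypothetical and do not reflect the actual CrowdTangle export format.

```r
# Hypothetical lookup table built from links found on institutional websites
institutions <- data.frame(
  institution = c("Example University", "Example Museum"),
  facebook_id = c("123456789", "987654321"),
  stringsAsFactors = FALSE
)

# Hypothetical page-level data, e.g. derived from a CrowdTangle export
pages <- data.frame(
  facebook_id = c("987654321", "123456789", "555555555"),
  followers = c(1200, 45000, 300),
  stringsAsFactors = FALSE
)

# Inner join on the shared identifier; pages without a match are dropped
merged <- merge(institutions, pages, by = "facebook_id")
```

Keeping the IDs as character strings is a deliberate choice here, since long IDs do not fit into R's 32-bit integers. For pages without a Facebook ID, the same join could be attempted on `user_name` instead.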

Final thoughts

This post is a work in progress. If you have any questions or find a Facebook URL that is not handled appropriately by the proposed functions, do not hesitate to get in touch with me.
