CrowdTangle is a service owned by Facebook that recently started to grant researchers access to public Facebook data (for a richer introduction see this blog post by Facebook Research).
Researchers investigating public institutions that are represented through a Facebook account are often interested in merging in data about these institutions from other public sources.
However, it can be difficult to algorithmically match Facebook pages with institutional names or other identifiers at scale. One way of doing this is to find the links from these institutions to their Facebook pages, which are typically published on their websites.
While matching data through Facebook URLs seems to be the most intuitive solution, this approach has a problem: Facebook URLs
- have changed over time in terms of their structure
- can be reported in many different forms (links to specific posts, videos, …)
The most reliable approach to merging on Facebook URLs is to extract two identifiers of Facebook pages from them: Facebook IDs and user names. The following code shows how this can be done for a variety of Facebook URL types.
Disclaimer: The following procedure is still in the making and I am sure that there are still edge-cases for which it does not work. So far, I tested the code on a set of 30k URLs with a success rate of 100%.
Facebook URL types
A typical Facebook page URL consists of a user name and a Facebook ID, which is a sequence of at least 9 digits.
https://www.facebook.com/user-name-with-hyphens-123456789
Often, institutions link to their Facebook page through a link to a specific post:
https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789
Additionally, older versions of page links featured URL chunks indicating that the link belongs to a page:
https://www.facebook.com/pages/user-name-with-hyphens-123456789
Sometimes these chunks come in two parts, a marker followed by a category name:
https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789
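To see what these variants have in common, here is a minimal sketch in base R. It is not the approach used in the rest of this post: it simply splits each URL on forward slashes and keeps the first path segment that ends in a hyphen followed by at least nine digits. The helper name `find_slug` is made up for illustration, and the sketch assumes lower-case URLs that contain such a name-plus-ID segment.

```r
# The four URL shapes shown above (illustrative only).
urls <- c(
  "https://www.facebook.com/user-name-with-hyphens-123456789",
  "https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789",
  "https://www.facebook.com/pages/user-name-with-hyphens-123456789",
  "https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789"
)

# Hypothetical helper: return the first path segment that looks like
# "<name>-<at least nine digits>", or NA if there is none.
find_slug <- function(url) {
  segments <- strsplit(url, "/", fixed = TRUE)[[1]]
  hits <- grep("[a-z]-[0-9]{9,}$", segments, value = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

slugs <- vapply(urls, find_slug, character(1), USE.NAMES = FALSE)
unique(slugs)
```

All four URLs should reduce to the same chunk, `user-name-with-hyphens-123456789`. URLs without a Facebook ID would come back as `NA` here, which is one reason the cleaning function below takes a different route.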
Finally, URLs generally have to be cleaned in terms of the following aspects:
- Percent-encoding (special characters written as %XX sequences)
utils::URLdecode("Here we have a pl%E5ceholder")
## [1] "Here we have a pl\xe5ceholder"
- Redundant forward slashes
This link still works:
cborchers.com///tags
- Optional forward slashes at the end of links
These link to the same page.
cborchers.com/publications/
cborchers.com/publications
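The two slash-related points can be sketched in a few lines of base R. This is only an illustration of the idea; the variable names are my own. Note that collapsing repeated slashes would also damage a leading `https://`, so this assumes the protocol has already been stripped.

```r
# Example links from above, without a protocol prefix.
urls <- c(
  "cborchers.com///tags",
  "cborchers.com/publications/",
  "cborchers.com/publications"
)

# Collapse any run of forward slashes into a single slash.
collapsed <- gsub("/+", "/", urls)

# Drop one optional forward slash at the end of the link.
normalized <- sub("/$", "", collapsed)

normalized
```

After these two steps, the second and third link are identical strings, so they can be matched against each other directly.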
The cleaning process
First, we reduce each URL to the chunk that holds the user name and, where present, the Facebook ID:
library(magrittr)
library(stringr, warn.conflicts = FALSE)

clean_url <- function(strings) {
  strings %>%
    vapply(utils::URLdecode, character(1), USE.NAMES = FALSE) %>% # decode percent-encoded characters
    tolower() %>%
    str_remove_all("^https?://") %>% # optional transfer protocol specification at beginning
    str_remove_all("^w{3,}\\.") %>% # optional www. or wwww{...}. at beginning (dot escaped)
    str_replace_all("/{2,}", "/") %>% # redundant forward slashes
    str_remove_all("facebook\\.[a-z]{2,3}/") %>% # international endings, .de, .fr, .it, ...
    str_remove_all("posts/.*|videos/.*|timeline.*|events/.*") %>% # content chunks
    str_remove_all("/$") %>% # forward slash at the end of URLs
    str_remove_all("pages/|pg/|hashtag/|people/") %>% # old page identifiers
    str_remove_all("category/(?=\\S*['-])([a-zA-Z'-]+)/") # category names with dashes
}
test_cases <- c(
"http://www.facebook.com/user-name-example-2412412412412",
"wwwww.facebook.com/user-name-short-987654321",
"facebook.de/german-page-name-example-987654321/videos/12345",
"https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/",
"https://www.facebook.com//pages/user-name-with-hyphens-123456789",
"https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789",
"https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789",
"https://www.facebook.com/user-name-with-dot.dot-123456789",
"https://www.facebook.com/user-without-Facebook-ID"
)
d <- test_cases %>% clean_url()
d
## [1] "user-name-example-2412412412412"
## [2] "user-name-short-987654321"
## [3] "german-page-name-example-987654321"
## [4] "user-name-with-hyphens-123456789"
## [5] "user-name-with-hyphens-123456789"
## [6] "user-name-with-hyphens-123456789"
## [7] "user-name-with-a-number-2-and-hyphens-123456789"
## [8] "user-name-with-dot.dot-123456789"
## [9] "user-without-facebook-id"
Then, all that's left is to extract the Facebook IDs and the user names without hyphens (this is the form in which CrowdTangle returns them):
# If possible, extract the Facebook ID; str_extract() returns NA when nothing matches
facebook_id <- d %>%
  str_extract("[[:digit:]]{9,}")
user_name <- d %>%
str_remove_all("[[:digit:]]{9,}") %>%
str_replace_all("([.])|[[:punct:]]", "\\1") # only allow dots as punctuation
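If you would rather not depend on stringr for this step, the same two extractions can be sketched in base R. This is a rough equivalent and not a tested replacement: the function name `extract_keys` is made up, and the sketch assumes its input is a character vector of already cleaned chunks without missing values.

```r
# Hypothetical helper: split cleaned chunks into Facebook ID and user name.
extract_keys <- function(cleaned) {
  # Position of the first run of nine or more digits (-1 if there is none).
  m <- regexpr("[0-9]{9,}", cleaned)
  hit <- m > 0

  # Fill in the ID where a match exists, leave NA otherwise.
  id <- rep(NA_character_, length(cleaned))
  id[hit] <- regmatches(cleaned, m)

  # Remove the ID, then all punctuation except dots.
  name <- gsub("[0-9]{9,}", "", cleaned)
  name <- gsub("(?![.])[[:punct:]]", "", name, perl = TRUE)

  data.frame(facebook_id = id, user_name = name, stringsAsFactors = FALSE)
}

keys <- extract_keys(c(
  "user-name-with-dot.dot-123456789",
  "user-without-facebook-id"
))
keys
```

The first row should carry the ID `123456789` and the name `usernamewithdot.dot`, the second an `NA` ID and the name `userwithoutfacebookid`.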
Result
cbind(test_cases, facebook_id, user_name) %>%
  knitr::kable(format = "html") %>%
  kableExtra::kable_styling(full_width = TRUE)
test_cases | facebook_id | user_name |
---|---|---|
http://www.facebook.com/user-name-example-2412412412412 | 2412412412412 | usernameexample |
wwwww.facebook.com/user-name-short-987654321 | 987654321 | usernameshort |
facebook.de/german-page-name-example-987654321/videos/12345 | 987654321 | germanpagenameexample |
https://www.facebook.com/user-name-with-hyphens-123456789/posts/123456789/ | 123456789 | usernamewithhyphens |
https://www.facebook.com//pages/user-name-with-hyphens-123456789 | 123456789 | usernamewithhyphens |
https://www.facebook.com/category/category-name-with-hyphens/user-name-with-hyphens-123456789 | 123456789 | usernamewithhyphens |
https://www.facebook.com/user-name-with-a-number-2-and-hyphens-123456789 | 123456789 | usernamewithanumber2andhyphens |
https://www.facebook.com/user-name-with-dot.dot-123456789 | 123456789 | usernamewithdot.dot |
https://www.facebook.com/user-without-Facebook-ID | NA | userwithoutfacebookid |
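With both identifiers in hand, the actual matching could look roughly like the sketch below. Everything in it is assumed for illustration: the data frames `institutions` and `ct_pages`, their column names, and the toy values are made up, and a real CrowdTangle export will likely use different column names. The idea is to match on the Facebook ID first and fall back to the user name only where no ID could be extracted.

```r
# Hypothetical institutions with keys extracted from their website links.
institutions <- data.frame(
  institution = c("City Library", "Town Hall"),
  facebook_id = c("123456789", NA),
  user_name   = c("citylibrary", "townhall"),
  stringsAsFactors = FALSE
)

# Hypothetical page-level data; column names are assumptions.
ct_pages <- data.frame(
  facebook_id = c("123456789", "987654321"),
  user_name   = c("citylibrary", "townhall"),
  followers   = c(1200, 340),
  stringsAsFactors = FALSE
)

# First pass: match on Facebook ID where one was found.
with_id <- institutions[!is.na(institutions$facebook_id), ]
by_id <- merge(with_id, ct_pages[, c("facebook_id", "followers")], by = "facebook_id")

# Second pass: fall back to the user name for the remaining institutions.
no_id <- institutions[is.na(institutions$facebook_id), c("institution", "user_name")]
by_name <- merge(no_id, ct_pages, by = "user_name")

# Stack both passes with a common column order.
cols <- c("institution", "facebook_id", "user_name", "followers")
matched <- rbind(by_id[, cols], by_name[, cols])
matched
```

User names can change over time while IDs are more stable, which is why the ID is tried first in this sketch.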
Final thoughts
This post is a work in progress. If you have any questions or find a Facebook URL that is not handled appropriately by the proposed functions, do not hesitate to get in touch with me.