Using Bash to Query the CrowdTangle API and Parsing Outputs to CSV with R
Similar to my post on querying the Twitter API through Twurl, I recently found that automating queries to Facebook's CrowdTangle API considerably improved my understanding of its data and functionality. Notably, the returned post data also contains more variables than the GUI download interface provides. Still, since there are a few obstacles to overcome before a shell data pipeline for CrowdTangle runs smoothly, I am documenting my work in progress here.
This time, I moved from Windows Batch to Bash, which can easily be run on a Windows machine through Cygwin. Additionally, I used cURL, which is what "Twurl" for accessing the Twitter API is built on. For a quick install, you can also type choco install curl in your shell.
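To check that everything is in place afterwards, you can print the installed version:
curl --version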
The puzzle pieces
The Token
First, you will need a token string, which you can access in CrowdTangle via Settings (the gear icon in the top right corner) and then API Access.
I recommend storing this token in a file called "token.txt". That way, the token can be read in at the beginning of a script and stored in a variable (e.g., through token=$(cat token.txt)). This comes in handy when you want to share your code with multiple researchers or in public repositories.
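A minimal sketch of this setup, assuming your code lives in a Git repository:
echo "token.txt" >> .gitignore # keep the token out of the shared repository
token=$(cat token.txt)         # read the token into a variable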
The Query
Next, you need to have an idea of what kind of data you want to download and what the corresponding URL query looks like. I highly recommend checking out CrowdTangle's "API Cheat Sheet". CrowdTangle also offers a JSON file including query templates that can be read into Postman, which is nice for a start. Ultimately, though, you might want to learn how to send and automate queries yourself.
In my use case, I am downloading public post data from a list, which is like a collection of topic-specific public accounts. This list has a specific identifier, which is located in the URL connected to the list in your CrowdTangle dashboard.
Of course, we also need a timeframe within which to download the data. This is tricky because each query returns up to 100 posts (10 by default) but does not warn you when this limit is exceeded (read: posts are missed and not downloaded). All timestamps are in the format yyyy-mm-ddThh:mm:ss in UTC/GMT time. Furthermore, no timeframe in requests should be greater than a year.
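If you have jq installed (an extra tool, not used elsewhere in this post), a quick way to spot affected responses afterwards is to count the posts per file; a count at the limit suggests the timeframe should be split up:
jq '.result.posts | length' response.json # response.json is a placeholder filename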
Additionally, the /posts endpoint currently allows for 6 queries per minute, so if your download speed is fast (unlike mine), you might want to incorporate a sleep <number of seconds> command into your bash script.
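Six queries per minute amounts to one query every ten seconds, so a minimal (hedged) pattern is to follow each request with a fixed delay:
curl "<your query>" > response.json # placeholder request
sleep 10                            # stay below 6 queries per minute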
Finally, we arrive at the following query for downloading posts in list 123456 in 2019:
curl "https://api.crowdtangle.com/posts?token=<your token>&startDate=2019-01-01T00:00:00&endDate=2020-01-01T00:00:00&listIds=123456"
For comprehensive documentation of the /posts endpoint, check out CrowdTangle's GitHub documentation.
The Download Script
First, we need a bunch of timeframes to download data within. Handling timestamps in Unix can be tough, so I created the timeframes in R and wrote them to a file called timeframes.txt. Imagine we want to send queries for each hour in January 2020; then we could run the following R script:
library(stringr)
library(magrittr) # provides the %>% pipe used below

times <- seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2020-02-01"), by = "60 min")
times <- times %>%
  as.character() %>%
  str_replace_all(" CET", "") %>% # attention, your timezone might not be CET
  str_replace_all(" ", "T")

sink("timeframes.txt")
for (time in times)
  cat(time, "\n")
sink()
Then, we can read in timeframes.txt line by line using Bash. This requires a bit of text cleaning, so we arrive at the following script, which reads in each timestamp as the end of a timeframe and then pushes that timestamp to the start position of the next timeframe. Additionally, we save each query result into a time-stamped .json file inside a folder called json/. The timestamped filenames might come in handy later: if any file hits the 10k post limit, you can re-download subsets of the corresponding timeframe.
token=$(cat token.txt)
start="2019-12-31T23:00:00" # make sure you start _before_ the first entry in timeframes.txt
input="timeframes.txt"

while IFS= read -r end
do
  end=$(echo ${end} | tr -d '\r') # strip Windows carriage returns
  end=$(echo ${end} | tr -d ' ')  # strip the trailing space that cat() in R added
  curl "https://api.crowdtangle.com/posts?token=$token&startDate=$start&endDate=$end&listIds=1461358" > json/"$start"-"$end".json
  start=$end
done < "$input"
If your filenames look weird, don't worry: I already adapted the following code from StackOverflow:
cd json
for fn in *; do mv -- "$fn" $(echo $fn | sed -e 's/[^A-Za-z0-9._-]/_/g'); done
cd ..
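For example, 2019-12-31T23:00:00-2020-01-01T00:00:00.json becomes 2019-12-31T23_00_00-2020-01-01T00_00_00.json, as the colons (which Windows does not allow in filenames) are replaced by underscores.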
The Parsing Script (to CSV using R)
JSON files are essentially lists of lists of lists of …, which is not really what a data scientist accustomed to data frames and tibbles wishes for. Luckily, there are helper functions in R that do the job. Notably, as the CrowdTangle API only returns variables that apply to a given post, we also need a function such as plyr::rbind.fill() that fills variables not present in certain objects with NAs.
filenames <- list.files(path = "./json", pattern = "\\.json$", full.names = TRUE)

responses <- list()
i <- 1
for (file in filenames) {
  d <- rjson::fromJSON(file = file)
  # Skip empty responses
  # One entry in the "responses" object is one API response as a data frame
  if (length(d$result$posts) > 0) {
    # Extract post information and save each post as a data frame
    posts <- lapply(d$result$posts, as.data.frame)
    # Bind posts to a data frame, fill missing columns
    responses[[i]] <- plyr::rbind.fill(posts)
    i <- i + 1
  }
}

# Optionally bind all responses into one data frame
d <- plyr::rbind.fill(responses); rm(responses)
write.csv(d, "yourfilename.csv")
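As an optional sanity check (assuming each post carries a date variable, which the shell script in the update below also relies on), you can inspect the dimensions and the covered time range:
dim(d)                                  # number of posts and variables
range(as.POSIXct(d$date), na.rm = TRUE) # timeframe actually covered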
Summary
There you have it. Each time you plan a new data collection, you need to specify:
- API endpoint and URL request parameters
- Timeframes to download (can be tricky due to return and rate limits)
- Download script in shell reading in the timeframes
- A parsing script for the JSON files
Update, Jan 2021
I recently improved the shell script so that it handles timeouts and reads in the next timeframe from the last returned value, rather than creating timeframes externally.
# assumes the token variable is set, e.g., via token=$(cat token.txt)
get_year ()
{
  printf "\n*** Downloading all post data in $1 ***\n\n"
  printf "*** Cleaning json/ directory ***\n"
  cd json
  sleep 2
  find . -size 0 -delete # drop empty files from aborted runs
  touch .gitkeep # if you have a GitHub repo
  cd ..
  printf "\n*** Checking for previous file of year $1 in json/ for obtaining timeframe ***\n\n"
  sleep 2
  lastfile=$(ls -v json | grep "^$1-" | tail -n 1)
  if ! [[ -z $lastfile ]]
  then
    printf "*** Found file: $lastfile ***\n"
    count=$(echo ${lastfile:5} | sed 's/.json//g') # file number, e.g., 7 from 2020-7.json
    let count+=1
    printf "*** Will count files starting at $count ***\n"
    # take the oldest post date in the last file as the new end of the timeframe
    end=$(tail -c 10000 json/$lastfile | grep -oP '\"date\":\"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1 | sed 's/[a-z\"]*//g' | cut -c2- | sed 's/ /T/g')
    printf "*** End date from last session found at $end ***\n"
    sleep 2
  else
    printf "*** Found NO previous file for $1, will set default start value ***\n\n"
    end="$1-12-31T23:59:59"
    let count=1
    printf "*** End date set to $end ***\n"
    sleep 2
  fi
  start="$1-01-01T00:00:00"
  printf "*** Initializing download... ***\n\n"
  printf "*** Files will be stored in directory json/ with signature $1-{1,2,...n} ***\n\n"
  sleep 1
  let returned=999
  while [[ returned -gt 149 ]]
  do
    touch json/$1-$count.json
    printf "*** Trying to download $start until $end into json/$1-$count.json... ***\n\n"
    # retry until the response starts with status 200
    while [[ $(head -c 50 json/$1-$count.json | grep -oP '^[^0-9]*\K[0-9]+') -ne 200 ]]
    do
      curl --max-time 90 "https://api.crowdtangle.com/posts?token=$token&startDate=$start&endDate=$end&listIds=1461358&sortBy=date&count=10000" > json/$1-$count.json
      if [[ $(head -c 50 json/$1-$count.json | grep -oP '^[^0-9]*\K[0-9]+') -ne 200 ]]
      then
        printf "\n*** Last download returned bad status or failed. Setting console to sleep for 10 seconds and retrying ***\n\n"
        sleep 10
      fi
    done
    # move the end of the timeframe back to the oldest post date just downloaded
    end=$(tail -c 10000 json/$1-$count.json | grep -oP '\"date\":\"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1 | sed 's/[a-z\"]*//g' | cut -c2- | sed 's/ /T/g')
    returned=$(head -c 150 json/$1-$count.json | wc -c) # empty returns have < 150 bytes
    let count+=1
  done
  let count-=1
  rm json/$1-$count.json # remove last return as it is empty
  printf "*** YEAR $1 DOWNLOAD COMPLETE ***\n\n"
}
get_year 2020
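The function assumes that the token variable has been set beforehand (e.g., token=$(cat token.txt) as above) and that the json/ directory exists. Downloading several years then becomes a simple loop, for example:
for year in 2018 2019 2020; do get_year $year; done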
Final thoughts
Taking a bird's-eye view of this post, I acknowledge that using the CrowdTangle API from your shell can seem daunting. Still, it is worth the journey: along the way, you will very likely discover CrowdTangle endpoints and variables you had overlooked and that could inform your research questions or your specific use case.