Using Bash to Query the CrowdTangle API and Parsing Outputs to CSV with R

tutorial crowdtangle bash R

Similar to my post on querying the Twitter API through Twurl, I recently found that automating queries to Facebook’s CrowdTangle API considerably improved my understanding of its data and functionality. Notably, the returned post data also contains more variables than the GUI download interface provides. Still, as there are a few obstacles to clear before a shell data pipeline for CrowdTangle data runs successfully, I am going to document my work in progress here.

This time, I moved from Windows Batch to Bash, which can easily be run on a Windows machine through Cygwin. Additionally, I used cURL, the tool that “Twurl” for accessing the Twitter API is modeled on. For a quick install, you can also type choco install curl in your shell.

The puzzle pieces

The Token

First, you will need a token string, which you can access in CrowdTangle via Settings (the gear icon in the top right corner) > API Access.

I recommend storing this token in a file called “token.txt”. That way, the token can be read in at the beginning of a script and stored in a variable (e.g., through token=$(cat token.txt)). This comes in handy when you want to share your code with other researchers or in public repositories, as the token itself never appears in the script.
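
A minimal sketch of such a script header (the guard is optional, but it fails fast if the file is missing):

# Read the API token; abort early if token.txt does not exist
if [[ ! -f token.txt ]]; then
  echo "token.txt not found, please store your CrowdTangle API token there" >&2
  exit 1
fi
token=$(cat token.txt)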

The Query

Next, you need to have an idea of what kind of data you want to download and what the corresponding URL query looks like. I highly recommend checking out CrowdTangle’s “API Cheat Sheet”. CrowdTangle also offers a JSON file with query templates that can be read into Postman, which is nice for a start. Ultimately, though, you might want to learn how to send and automate queries yourself.

In my use case, I am downloading public post data from a list, which is a collection of topic-specific public accounts. Each list has a numeric identifier, which you can find in the URL of the list in your CrowdTangle dashboard.

Of course, we also need a timeframe to download the data within. This is kinda tricky because each query returns up to 100 posts (10 per default) but does not send a warning when this limit is exceeded (read: posts are silently missed, not downloaded). All timestamps are in the format yyyy-mm-ddThh:mm:ss in UTC/GMT time. Furthermore, no timeframe in a request should be greater than a year. Additionally, the /posts endpoint currently allows for 6 queries per minute, so if your download speed is fast (unlike mine), you might want to incorporate a sleep <number of seconds> command into your Bash script.
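
As a side note: if you ever need to construct such a timestamp directly in the shell, GNU date (which Cygwin ships) can produce the required format. A small sketch:

# Current time and an arbitrary date in yyyy-mm-ddThh:mm:ss, UTC
date -u +"%Y-%m-%dT%H:%M:%S"
date -u -d "2019-01-01 12:00:00" +"%Y-%m-%dT%H:%M:%S"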

Finally, we arrive at the following query for downloading posts in list 123456 in 2019:

curl "https://api.crowdtangle.com/posts?token=<your token>&startDate=2019-01-01T00:00:00&endDate=2020-01-01T00:00:00&listIds=123456"

For comprehensive documentation of the /posts endpoint, check out CrowdTangle’s GitHub documentation.

The Download Script

First, we need a bunch of timeframes to download data within. Handling timestamps in Unix can be tough, so I created the timeframes in R and wrote them to a file called timeframes.txt.

Imagine we want to send queries for each hour in January 2020. Then we could run the following R script:

# One timestamp per hour, including both endpoints
times <- seq.POSIXt(as.POSIXct("2020-01-01"), as.POSIXct("2020-02-01"), by = "60 min")

# Format explicitly: this avoids a timezone suffix in the output and
# guarantees the time part is always present (as.character() drops it at midnight)
times <- format(times, "%Y-%m-%dT%H:%M:%S")

# Write one timestamp per line to timeframes.txt
sink("timeframes.txt")
for (time in times) {
    cat(time, "\n")
}
sink()

Then, we can read in timeframes.txt line by line using Bash. This requires a bit of text cleaning (Windows carriage returns and trailing spaces), after which we arrive at the following script: it reads in each timestamp as the end of a timeframe and then pushes that timestamp to the “start position” of the next timeframe.

Additionally, we save each response into a time-stamped .json file inside a folder called json/, with the filename indicating the downloaded timeframe. This might come in handy later: if any file hits the return limit, you can re-download subsets of the corresponding timeframe.

token=$(cat token.txt)
start="2019-12-31T23:00:00"  # make sure you start _before_ the first entry in timeframes.txt

mkdir -p json  # make sure the output folder exists

input="timeframes.txt"
while IFS= read -r end
do
  # Strip Windows carriage returns and trailing spaces from the timestamp
  end=$(echo "${end}" | tr -d '\r ')
  curl "https://api.crowdtangle.com/posts?token=$token&startDate=$start&endDate=$end&listIds=1461358" > json/"$start"-"$end".json
  # sleep 10  # uncomment if you run into the rate limit
  # The end of this timeframe becomes the start of the next one
  start=$end
done < "$input"

If your filenames look weird (for example, because of the colons in the timestamps), don’t worry; I adapted the following cleanup from StackOverflow:

cd json
for fn in *; do mv -- "$fn" "$(echo "$fn" | sed -e 's/[^A-Za-z0-9._-]/_/g')"; done
cd ..

The Parsing Script (to CSV using R)

JSON files are essentially lists of lists of lists of …, which is not really what a data scientist accustomed to data frames and tibbles wishes for. Luckily, there are helper functions in R that do the job. Notably, as the CrowdTangle API only returns variables that apply to a given post, we also need a function such as plyr::rbind.fill() that fills variables missing from certain objects with NAs.

filenames <- list.files(path = "./json", pattern = "\\.json$", full.names = TRUE)

responses <- list()
i <- 1

for (file in filenames){

    d <- rjson::fromJSON(file=file)  
    
    # Skip empty responses
    # One entry in "responses" object is one API response as df

    if (length(d$result$posts) > 0){
        # Extract post information and save each post as data frame
        posts <- lapply(d$result$posts, as.data.frame) 
        # Bind posts to data frame, fill missing columns 
        responses[[i]] <- plyr::rbind.fill(posts)
        i <- i+1
    }

}

# Optionally bind all responses into one data frame
d <- plyr::rbind.fill(responses); rm(responses) 

write.csv(d, "yourfilename.csv")
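
One optional post-processing step: judging from the API responses, the date column comes back as a plain “yyyy-mm-dd hh:mm:ss” string in UTC, so converting it to a proper timestamp makes filtering by time easier (a sketch; adjust the column name if your data differs):

# Parse the post date into POSIXct (CrowdTangle timestamps are UTC)
d$date <- as.POSIXct(d$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")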

Summary

There you have it. Each time you plan a new data collection, you need to specify the following (a minimal driver sketch follows after the list):

  1. API endpoint and URL request parameters
  2. Timeframes to download (can be tricky due to return and rate limits)
  3. Download script in shell reading in the timeframes
  4. A parsing script for the JSON files
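
Glued together, the whole pipeline can then be kicked off from a single driver script. Here is a sketch, assuming the R snippets above are stored in the hypothetical files timeframes.R and parse.R and the download loop in download.sh:

#!/bin/bash
# Hypothetical driver; the file names are placeholders for the scripts shown above
Rscript timeframes.R   # 1. create timeframes.txt
bash download.sh       # 2. download all timeframes into json/
Rscript parse.R        # 3. parse the JSON files and write the CSV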

Update, Jan 2021

I recently improved the shell script so that it handles timeouts and reads the next timeframe from the last returned post, rather than relying on externally created timeframes.

# Assumes token=$(cat token.txt) has been run beforehand.
# Downloads all post data of year $1 backwards in time: the oldest
# post date of each response becomes the end date of the next request.
get_year ()
{
  printf "\n*** Downloading all post data in $1 ***\n\n"
  printf "*** Cleaning json/ directory ***\n"
  cd json
  sleep 2
  find . -size 0 -delete
  touch .gitkeep # if you have a GitHub repo
  cd ..
  printf "\n*** Checking for previous file of year $1 in json/ for obtaining timeframe ***\n\n"
  sleep 2
  lastfile=$(ls -v json | grep "^$1-" | tail -n 1)
  if [[ -n $lastfile ]]
  then
    printf "*** Found file: $lastfile ***\n"
    count=$(echo ${lastfile:5} | sed 's/\.json//')
    let count+=1
    printf "*** Will count files starting at $count ***\n"
    # Extract the oldest post date from the last file of the previous session
    end=$(tail -c 10000 json/$lastfile | grep -oP '\"date\":\"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1 | sed 's/[a-z\"]*//g' | cut -c2- | sed 's/ /T/g')
    printf "*** Start date from last session found at $end ***\n"
    sleep 2
  else
    printf "*** Found NO previous file for $1, will set default start value ***\n\n"
    end="$1-12-31T23:59:59"
    let count=1
    printf "*** Start date set to $end ***\n"
    sleep 2
  fi
  start="$1-01-01T00:00:00"
  printf "*** Initializing download... ***\n\n"
  printf "*** Files will be stored in directory json/ with signature $1-{1,2,...n} ***\n\n"
  sleep 1
  let returned=999
  while [[ $returned -gt 149 ]]
  do
    touch json/$1-$count.json
    printf "*** Trying to download $start until $end into json/$1-$count.json... ***\n\n"
    # Retry until the response contains status 200
    while [[ $(head -c 50 json/$1-$count.json | grep -oP '^[^0-9]*\K[0-9]+') -ne 200 ]]
    do
      curl --max-time 90 "https://api.crowdtangle.com/posts?token=$token&startDate=$start&endDate=$end&listIds=1461358&sortBy=date&count=10000" > json/$1-$count.json
      if [[ $(head -c 50 json/$1-$count.json | grep -oP '^[^0-9]*\K[0-9]+') -ne 200 ]]
      then
        printf "\n*** Last download returned bad status or failed. Setting console to sleep for 10 seconds and retrying ***\n\n"
        sleep 10
      fi
    done
    # The oldest post date in this response becomes the end date of the next request
    end=$(tail -c 10000 json/$1-$count.json | grep -oP '\"date\":\"[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' | tail -n 1 | sed 's/[a-z\"]*//g' | cut -c2- | sed 's/ /T/g')
    returned=$(head -c 150 json/$1-$count.json | wc -c) # empty returns have < 150 bytes
    let count+=1
  done
  let count-=1
  rm json/$1-$count.json # remove last return as it is empty
  printf "*** YEAR $1 DOWNLOAD COMPLETE ***\n\n"
}

get_year 2020
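
Since the function takes the year as its only argument, downloading multiple years boils down to a simple loop:

for year in 2018 2019 2020; do
  get_year "$year"
done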

Final thoughts

Taking a bird’s-eye view of this post, I acknowledge that using the CrowdTangle API from your shell can seem daunting. Still, it is worth the journey: along the way, you will very likely discover CrowdTangle endpoints and variables you had overlooked and which could inform your research questions or your specific use case.
