How to effectively use Twitter's full archive API with Twurl

tutorial twitterAPI batch

When I first started using Twitter’s search APIs to access research data, I found comprehensive online tutorials on downloading Twitter data effectively to be sparse.

Many researchers aim to download complete hashtags or search terms, which often involves chaining several of Twitter’s premium requests. In this short tutorial, I suggest a framework that handles these kinds of downloads automatically, without the user having to worry about incomplete data or the unnecessary use of requests.

Why use Twurl?

Using Twurl instead of popular libraries for Python and R comes with two great advantages:

  1. API requests can be edited directly as URL requests, so there will never be compatibility problems when Twitter updates its API services.

  2. Twurl will always output all currently available variables as JSON files, so no data ever gets lost by being parsed into library-specific file formats.

While you might think that the JSON file format is a drawback, you can easily re-download and parse the status IDs you have collected into the external library of your choice if you prefer to work with an existing file format rather than parsing the JSON files yourself. This process, called “hydrating”, is a great way of sharing Twitter data between researchers who have Twitter API access and might be covered in another post.
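As a small, hypothetical example of what hydrating boils down to, previously stored status IDs can simply be passed to the v1.1 lookup endpoint again (the IDs below are placeholders; the endpoint accepts up to 100 comma-separated IDs per request):

twurl "/1.1/statuses/lookup.json?id=20,1050118621198921728" > hydrated.json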

The building blocks

After you have opened your terminal and installed Twurl

gem install twurl

make sure that you have created a new API app as well as a full archive dev environment. Write down the name of the dev environment for later. Now, create the keys connected to your app and paste them into the following command

twurl authorize --bearer --consumer-key xyz --consumer-secret xyz --access-token xyz --token-secret xyz

This command should now have created a bearer token for this session, which you can use together with the dev label you have set up.
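If you want to double-check that the credentials were stored, twurl can list the profiles it knows about (the exact output depends on your twurl version):

twurl accounts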

A standard full archive URL request has the following form

twurl --bearer "/1.1/tweets/search/fullarchive/<devlabel>.json" -A "Content-Type:application/json" -d '{"query":"#example OR #anotherexample","maxResults":"500","fromDate":"200603220000","toDate":"202001010000"}' > tweets.json

There are a few things to note here:

  1. /1.1/tweets/search/fullarchive/<devlabel>.json is the URL connected to your dev label. Here the flexibility of Twurl becomes apparent, as 1.1 indicates the API version we are using in this example.

  2. Each request in the current version of Twitter’s premium search API returns up to 500 tweets.

  3. Timeframes are in the format YYYYMMDDhhmm. What you cannot see here is that time is always standardized to GMT/UTC and that the API searches backwards in time (just as if you were scrolling through Twitter yourself), such that we need to update “toDate” after each request. (Note: The “fromDate” used here roughly corresponds to the time the first tweet on Twitter was posted.)

  4. > tweets.json writes the results into the file tweets.json. I would not recommend appending each result to the same file; I will cover later what we do with this file once we want to start our next query.

Creating an automatic loop

Now that we can use Twitter’s full archive API from our terminal, we need the following ingredients to have a full, automated download loop:

  1. A script that updates “toDate” in our download file based on our search result.

  2. A script that renames (and moves) tweets.json such that no result gets overwritten by its successor.

  3. A stopping criterion that checks if the full search term is downloaded.

As I am a Windows user who mainly works with R, here is one way of handling these three points with CMD, PowerShell, and R:

  1. I read tweets.json into R with
json_data <- rjson::fromJSON(file="tweets.json")

Then I locate the variable created_at in json_data, apply min() to it, and save the result in the variable toDate (a sketch of this step follows the snippet below).

For writing the updated toDate into my download script, I like to use the base R functions cat() and sink(). Additionally, I store my current query in a .txt file which I read into the variable q.

sink("download.cmd")
paste("twurl --bearer \"/1.1/tweets/search/fullarchive/<devenv>.json\" -A \"Content-Type:application/json\" -d \'{\"query\":\"", q, "\",\"maxResults\":\"500\",\"fromDate\":\"200603220000\",\"toDate\":\"", toDate, "\"}\' > tweets.json", sep="") %>% 
  cat()
sink()
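To make this first step a bit more concrete, here is a minimal sketch of how toDate could be derived from the previous batch. It assumes the parsed premium response keeps its tweets in a results field; the variable names follow the ones used above, so adjust them to your own setup.

# Sketch: derive the next "toDate" from the oldest tweet of the previous batch
json_data <- rjson::fromJSON(file = "tweets.json")

# created_at strings look like "Wed Mar 21 20:50:14 +0000 2006";
# parsing them as POSIX times (UTC) is safer than taking min() on the raw strings
# (assumes an English locale for the day/month abbreviations)
created <- sapply(json_data$results, function(tweet) tweet$created_at)
created <- as.POSIXct(created, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")

# the API expects YYYYMMDDhhmm, so reformat the oldest timestamp accordingly
toDate <- format(min(created), "%Y%m%d%H%M")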
  2. Before sending the next query, I run the following PowerShell script to add a “_YYYYMMDD_hhmmss” timestamp to tweets.json and move it to another directory
# $BackupFolder can be left undefined or empty here; the renamed files are moved to the json_files folder in the next step
Get-ChildItem "*.json" | ForEach-Object {Rename-Item $_.FullName "$BackupFolder$($_.BaseName -replace " ", "_" -replace '\..*?$')$(Get-Date -Format "_yyyyMMdd_hhmmss").json"}

mv *.json json_files
  3. Lastly, I use cat() and sink() to write the number of tweets that I extracted from rjson::fromJSON(file="tweets.json") into another .txt file that my terminal reads in and checks before each download. Typically, a return value of less than 500 indicates that all archived tweets have been downloaded. Here is one possible implementation in Windows CMD batch, employing an endless for-loop (a sketch of the corresponding parse.R step follows further below).
@echo off
SetLocal EnableDelayedExpansion

rem infinite loop (step size 0); the exit inside the loop is the only stopping criterion
for /L %%n in (1,0,10) do (
  cmd /C download.cmd
  echo *** Now initiating R file routine ***
  rscript parse.R
  rem read the tweet count that parse.R wrote to returned_n.txt
  set /p returned_n=<returned_n.txt
  echo *** Current n is: ***
  echo !returned_n!
  if !returned_n! LSS 500 (
      echo All tweets for query downloaded.
      echo Press any key to continue and close this process.
      PAUSE
      exit
      )
)

There are a few catches to CMD, such as the option SetLocal EnableDelayedExpansion, which has to be enabled in order to update variables inside for-loops. Additionally, I experienced difficulties when the number of returned tweets had only two digits, which I fixed by setting n = 100 whenever n < 100 in parse.R. Therefore, I might switch to PowerShell scripts in the future.
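For completeness, here is a rough sketch of what this counting step in parse.R could look like, including the padding workaround just mentioned. Again, the results field and the file name returned_n.txt are assumptions based on the snippets in this post:

# Sketch: count the tweets in the latest batch and write the count for the CMD loop
json_data <- rjson::fromJSON(file = "tweets.json")
returned_n <- length(json_data$results) # assumes the tweets sit in a "results" field

# padding workaround described above: two-digit counts confused the CMD comparison,
# and since 100 is still below 500, the loop stops correctly either way
if (returned_n < 100) returned_n <- 100

sink("returned_n.txt")
cat(returned_n)
sink()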

Final thoughts

If you are not very familiar with using your terminal (like me), Twurl might have a steep learning curve at the beginning. Still, the flexibility Twurl offers has clear advantages over conventional external libraries for downloading Twitter data, and mastering Twurl will make Twitter’s API seem less like a black box.

