When I first started using Twitter’s search APIs to access research data, I found comprehensive online tutorials on effectively downloading Twitter data to be sparse.
Many researchers aim to download complete hashtags or search terms, which often involves using several of Twitter’s premium requests at once. In this short tutorial, I suggest a framework which handles these kinds of downloads automatically, without the user having to worry about incomplete data or the unnecessary use of requests.
Why use Twurl?
Using Twurl instead of popular libraries for Python and R comes with two great advantages:
- API requests can be edited directly as URL requests, such that there will never be compatibility problems when Twitter updates their API services.
- Twurl will always output all currently available variables as JSON files, such that no data ever gets lost when parsed into library-specific file formats.
While you might think that the JSON file format is a drawback, you can easily re-download and parse the status IDs you have downloaded into the external library of your choice if you want to work with an existing file format rather than parsing the JSON files yourself. This process, called “hydrating”, is a great way of sharing Twitter data between researchers who have Twitter API access and might be covered in another post.
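If you want to prepare your data for sharing in this way, here is a minimal sketch of how the status IDs could be collected from the downloaded JSON files with base R and rjson. The folder name json_files and the results field are assumptions based on the setup described later in this post and on the premium API’s usual response structure.
# Sketch only: collect status IDs from all downloaded JSON files so that
# other researchers can "hydrate" them with their own API access.
# Assumes the responses live in ./json_files and follow the premium
# search API's {"results": [...]} structure.
library(rjson)
files <- list.files("json_files", pattern = "\\.json$", full.names = TRUE)
status_ids <- unlist(lapply(files, function(f) {
  json_data <- fromJSON(file = f)
  vapply(json_data$results, function(tweet) tweet$id_str, character(1))
}))
writeLines(unique(status_ids), "status_ids.txt")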
The building blocks
After you have opened your terminal and installed Twurl via
gem install twurl
make sure that you have created a new API app as well as a full archive dev environment. Write down the name of the dev environment for later. Now, create the keys connected to your app and paste them into the following command:
twurl authorize --bearer --consumer-key xyz --consumer-secret xyz --access-token xyz --token-secret xyz
This command should now have created a bearer token for this session with which you can use the dev label you have set up.
A standard full archive URL request has the following form
twurl --bearer "/1.1/tweets/search/fullarchive/<devlabel>.json" -A "Content-Type:application/json" -d '{"query":"#example OR #anotherexample","maxResults":"500","fromDate":"200603220000","toDate":"202001010000"}' > tweets.json
There are a few things to note here:
- /1.1/tweets/search/fullarchive/<devlabel>.json is the URL connected to your dev label. Here the flexibility of Twurl becomes apparent, as 1.1 indicates the API version we are using in this example.
- Each request in the current version of Twitter’s premium search API returns up to 500 tweets.
- Timeframes are in the format YYYYMMDDhhmm. What you cannot see here is the fact that time is always standardized to GMT/UTC (London) and that the API searches backwards in time (just as if you were scrolling through Twitter yourself), such that we need to update “toDate” after each request. (Note: The “fromDate” roughly corresponds to the time at which the first tweet on Twitter was posted.)
- The redirection > tweets.json writes the results into the file tweets.json. I would not recommend appending each result to the same file and will later cover what we do with this file once we want to start our next query.
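To illustrate the timestamp format, here is a small example (an illustration only, using base R) of how a local point in time could be converted into the UTC string the API expects:
# Illustration: convert a local date-time into the YYYYMMDDhhmm
# format expected by the premium search API, standardized to UTC.
some_time <- as.POSIXct("2020-01-01 01:00:00", tz = "Europe/Berlin")
format(some_time, "%Y%m%d%H%M", tz = "UTC")
# [1] "202001010000"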
Creating an automatic loop
Now that we can use Twitter’s full archive API from our terminal, we need the following ingredients to have a full, automated download loop:
- A script that updates “toDate” in our download file based on our search result.
- A script that renames (and moves) tweets.json such that no result gets overwritten by its successor.
- A stopping criterion that checks if the full search term is downloaded.
As I am a Windows user who mainly works with R, here is one way of handling these three points with CMD, PowerShell, and R:
- I read tweets.json into R with
json_data <- rjson::fromJSON(file="tweets.json")
Then I locate the variable created_at in json_data, apply min() to it, and save the result in the variable toDate. For writing the updated toDate into my download script, I like to use the base R functions cat() and sink(). Additionally, I store my current query in a .txt file which I read into the variable q. (A rough sketch of this step follows below the code.)
sink("download.cmd")
paste("twurl --bearer \"/1.1/tweets/search/fullarchive/<devenv>.json\" -A \"Content-Type:application/json\" -d \'{\"query\":\"", q, "\",\"maxResults\":\"500\",\"fromDate\":\"200603220000\",\"toDate\":\"", toDate, "\"}\' > tweets.json", sep="") %>%
cat()
sink()
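The created_at/min() step itself is not shown above, so here is a rough sketch of how it could look. This is not the exact parse.R; the results field, the time format of created_at, and the file name query.txt are assumptions based on the premium API’s usual responses and on the setup described in this post.
# Sketch of the toDate update (illustration, not the exact parse.R).
# Assumes the tweets are stored under json_data$results with a
# created_at field like "Wed Mar 22 05:00:00 +0000 2006".
json_data <- rjson::fromJSON(file = "tweets.json")
created_at <- vapply(json_data$results,
                     function(tweet) tweet$created_at, character(1))
# Parse the UTC timestamps and keep the oldest one.
# Note: parsing %a/%b assumes an English LC_TIME locale.
parsed <- as.POSIXct(created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
toDate <- format(min(parsed), "%Y%m%d%H%M", tz = "UTC")
# Read the stored search query, e.g. "#example OR #anotherexample"
# (query.txt is a hypothetical file name).
q <- readLines("query.txt")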
- Before sending the next query, I run the following PowerShell script to add a “_YYYYMMDD_hhmmss” timestamp to tweets.json and move it to another directory:
Get-ChildItem "*.json" | ForEach-Object {Rename-Item $_.FullName "$BackupFolder$($_.BaseName -replace " ", "_" -replace '\..*?$')$(Get-Date -Format "_yyyyMMdd_hhmmss").json"}
mv *.json json_files
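This step does not have to happen in PowerShell. If you prefer to stay in R, a base-R equivalent could look roughly like this (the folder name json_files is again an assumption):
# R alternative to the PowerShell step above (illustration only):
# timestamp tweets.json and move it into a backup folder.
if (file.exists("tweets.json")) {
  stamp <- format(Sys.time(), "_%Y%m%d_%H%M%S")
  dir.create("json_files", showWarnings = FALSE)
  file.rename("tweets.json",
              file.path("json_files", paste0("tweets", stamp, ".json")))
}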
- Lastly, I use cat() and sink() to write the number of tweets that I extracted from rjson::fromJSON(file="tweets.json") into another .txt file that my terminal reads in and checks before each download. Typically, a return value of less than 500 indicates that all archived tweets have been downloaded (a sketch of this part of parse.R follows after the notes below). Here is one possible implementation in Win CMD batch, employing an endless for-loop.
@echo off
SetLocal EnableDelayedExpansion
for /L %%n in (1,0,10) do (
cmd /C download.cmd
echo *** Now initiating R file routine ***
rscript parse.R
set /p returned_n=<returned_n.txt
echo *** Current n is: ***
echo !returned_n!
if !returned_n! LSS 500 (
echo All tweets for query downloaded.
echo Press any key to continue and close this process.
PAUSE
exit
)
)
There are a few catches to CMD, such as the option SetLocal EnableDelayedExpansion that has to be enabled in order to be able to update variables inside of for-loops. Additionally, I experienced difficulties if the number of returned tweets only had two digits, which I fixed by letting n = 100 if n < 100 in parse.R. Therefore, I might change to PowerShell scripts in the future.
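To make the stopping criterion and this workaround concrete, here is one way the end of parse.R could look. Again, this is a hedged sketch rather than the exact script, and the results field is an assumption about the response structure.
# Sketch of the stopping-criterion part of parse.R (illustration only).
# json_data comes from rjson::fromJSON(file = "tweets.json") as above.
returned_n <- length(json_data$results)
# Workaround mentioned above: replace two-digit counts with 100 so the
# CMD comparison behaves; any value below 500 still ends the loop.
if (returned_n < 100) returned_n <- 100
sink("returned_n.txt")
cat(returned_n)
sink()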
Final thoughts
If you are not very familiar with using your terminal (like me), using Twurl might have a steep learning curve at the beginning. Still, the flexibility Twurl offers has clear advantages over conventional external libraries for downloading Twitter data and mastering Twurl will make Twitter’s API seem less like a black box.