Twitter recently granted academic researchers free access to vast amounts of Twitter data, which motivated me to write another tutorial on how to query Twitter’s (new) full-archive endpoint through shell scripts.
One of the advantages is that researchers can now delete data sets after conducting a study without worrying about paying to re-download the data. Importantly, users might delete their posts or change their profile information over time, and they have the right to have their data deleted by researchers who previously scraped it.
If you work on a Windows machine, you will need to download Cygwin to run Bash shell scripts. Additionally, I used cURL to query Twitter’s API endpoints. For a quick install, you can type choco install curl in your shell.
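If you are unsure whether the tools are in place, a quick optional check like the following prints the installed versions:
command -v curl >/dev/null || echo "curl not found - install it first"
curl --version | head -n 1   # print the installed cURL version
bash --version | head -n 1   # print the installed Bash version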
After applying for access (currently open to graduate students and researchers), you will need to set up a project and an app connected to that project, which holds a bearer token. If you have a public Git repository, you do not want other people to see this token, as they could use it to scrape data in your name. Therefore, I recommend saving the bearer token to a file called token.txt and loading it into your script environment through BEARER_TOKEN=$(cat token.txt).
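To make sure the token file never ends up in a public repository, you could additionally add it to your .gitignore (just a suggestion, not required for the download routine itself):
echo "token.txt" >> .gitignore   # keep the bearer token out of version control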
Caveats
1. Be aware of rate limits
Twitter documents its rate limits for each API endpoint as they differ across endpoints. You definitely do not want to download more data than the rate limits allow because, among other things, you might end up breaking your download routine by obtaining empty return values.
Fortunately, Bash has a built-in variable called SECONDS which can be used to check whether you are still within the rate limit. In the following example, the rate limit is one request every 3 seconds.
SECONDS=0 # reset rate limit
# send query...
while ! [[ $SECONDS -ge 3 ]]; do sleep 0.1; done
# go to next iteration ...
2. Know your search terms
Searching for tweets with the Twitter API essentially works the same way as searching for tweets on the Twitter platform itself. This means that, by default, the order of search terms does not matter as long as all of them occur within the same tweet. Additionally, searching for “rstats” returns tweets containing “#rstats” as well as tweets containing just “rstats”. Still, there is a rich selection of search operators to tailor your query to your specific use case.
When in doubt, you might want to search for tweets on Twitter manually and take a look at which tweets are returned. For example, Twitter’s search functions automatically account for German umlauts, such that “baden_wuerttemberg” also returns “baden_württemberg”.
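To give a concrete idea, the operators below are documented by Twitter, but this particular combination is only an illustration, not part of the routine in this post:
# Example query: exclude retweets and restrict results to German-language tweets
q='#rstats -is:retweet lang:de'
# remember to URL-encode '#' and spaces before sending the query (see the Download section below)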
3. Changes to the data structure in the new v2 API
From what I have seen so far, Twitter made their API responses much more efficient compared to their old API endpoints. At the same time, this makes more sophisticated data aggregation scripts necessary.
One of the changes is that the API now returns separate tables for post data (“main”), user data (“users”), referenced (retweeted, mentioned, replied to) tweet data (“tweets”), and poll data (“polls”).
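To give a rough impression of that layout, here is a minimal sketch using jq, which the original routine does not rely on; response.json stands for any single API response file:
# Split one API response into its separate tables (assumes jq is installed)
jq '.data'            response.json > main.json    # post data
jq '.includes.users'  response.json > users.json   # user data
jq '.includes.tweets' response.json > tweets.json  # referenced tweet data
jq '.includes.polls'  response.json > polls.json   # poll data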
I might write another post in the future describing how I recently built a script that aggregates data given this new form of API responses. For now, a GitHub repo connected to a study in which I applied such a script can be found here (go to /download).
4. Check your output
Sometimes you want to download amounts of data that take a couple of hours to scrape, so you will not monitor each and every API response. I found that around 0.05% of API responses contained an unexpected error: they started with “Service Unavailable”, while the remaining part of the response still contained data. This made parsing the JSON responses into other data formats impossible (at least with the R package {jsonlite}), so I circumvented the problem by finding out which files featured such a header and deleting the header manually:
let count=1
# print a progress count, and print the name of every file that starts with "Service Unavailable"
for f in *.json; do echo $count; let count+=1; if [[ $(head -c 30 "$f") == *"Service"* ]]; then echo "$f"; sleep 10; fi; done
Naturally, this is just one way of doing this (you could also append names of erroneous files to a text file) and not everyone needs a progress count when iterating through all files (which, by the way, still might take a few minutes).
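If you would rather not delete the header by hand, one possible variation is to strip everything before the first opening brace in the flagged files. This sketch assumes the “Service Unavailable” text and the start of the JSON sit on the same first line, which you should verify on your own data:
# Remove the leading "Service Unavailable" text up to the first '{' in flagged files
for f in *.json; do
  if [[ $(head -c 30 "$f") == *"Service"* ]]; then
    sed -i '1s/^[^{]*{/{/' "$f"   # keep everything from the first '{' onwards
  fi
done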
Download routine
Setup
First, you want to define a couple of things before starting your download routine: From when to when would you like to obtain data? Which variables would you like to have returned? As of now, Twitter requires you to explicitly request almost all the variables you would like to have returned. I used the following settings to download the complete history of hashtags up to 01/01/2021, including, to the best of my knowledge, all currently available variables.
# Setup
BEARER_TOKEN=$(cat token.txt)
from="2008-01-01T00%3A00%3A00Z"
to="2020-12-31T23%3A59%3A59Z"
#max="10" # testing
max="500" # you can download up to 500 tweets at once
# Variables
tweet="&tweet.fields=attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld"
user="&user.fields=created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld"
expansions="&expansions=attachments.poll_ids,attachments.media_keys,author_id,entities.mentions.username,geo.place_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id"
media="&media.fields=duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics"
geo="&place.fields=contained_within,country,country_code,full_name,geo,id,name,place_type"
poll="&poll.fields=duration_minutes,end_datetime,id,options,voting_status"
vars="$tweet$user$expansions$media$geo$poll"
For comprehensive documentation of the /tweets/search/all endpoint, check out Twitter’s official documentation.
Download
First of all, you need to encode hashtags and space characters to a URL-compatible format:
encode_hashtag () {
  echo "$1" | sed 's/#/%23/' | sed 's/ /%20/g'
}
Then, you might set your initial query to the following format:
q="#rstats"
q=$(encode_hashtag "$q")
query=$(echo "?query=$q&max_results=$max&start_time=$from&end_time=$to$vars" | sed 's/[\r\n]//g')
Then you send the query with the following command (modify the curl flags as appropriate; $filename is the output file for this request and is defined further below):
curl \
-k \
--connect-timeout 30 \
--retry 100 \
--retry-delay 8 \
--retry-connrefused \
"https://api.twitter.com/2/tweets/search/all$query" -H "Authorization: Bearer $BEARER_TOKEN" > $filename.json
Updating query parameters
After the initial query, you have to update the until_id parameter of the query. The API searches tweets chronologically backwards, so you set until_id to the status ID of the oldest tweet returned, which you can extract with this function…
get_oldest_id () {
  tail -c 250 "$1" | grep -oP '\"oldest_id\":\"(.+?)\"' | sed 's/oldest_id//g;s/[\":]//g'
}
until=$(get_oldest_id $filename)
…and then update the query with the following function:
update_query () {
  query="?query=$q&max_results=$max&until_id=$until$vars"
}
Additionally, you have to update the output filename in order to save each response to a different file (appending to one file is a bad idea, considering you later have to parse the JSON data), for example through a variable called count:
q="#rstats"
let count=1
update_filename () {
filename="json/$(echo "$q" | sed 's/%23/#/' | sed 's/ /+/g' | sed 's/%20/+/g')-$count.json"
}
# send query and save file...
let count+=1
update_filename
# go to next iteration ...
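Note that update_filename writes into a json/ subdirectory (for an encoded q="%23rstats" and count=1, it would produce json/#rstats-1.json), so make sure that directory exists before the first download:
mkdir -p json   # create the output directory once, if it does not exist yet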
Everything put together
Here is an example of how I currently download hashtags with Bash. Keep in mind that the functions assume a couple of pre-existing variables in the global environment.
# Additional functions not introduced before
get_result_count () {
  tail -c 250 "$1" | grep -oP '\"result_count\":[0-9]{1,}' | sed 's/\"result_count\"://g'
}
get_sample_return_date () {
  head -c 5000 "$1" | grep -oP '\"created_at\":\"(.+?)T' | sed 's/created_at//g;s/[\":T]//g' | head -n 1
}
# Main download function
download_current_q () {
  printf "\n\n*** Downloading %s ***\n\n" "$(decode_hashtag "$q")"
  let count=1
  update_filename
  update_initial_query
  while true
  do
    SECONDS=0
    send_url_save_to $query $filename
    if ! [[ $(get_result_count $filename) -gt 0 ]]
    then
      rm $filename
      printf "\n\n*** Download $(decode_hashtag "$q") complete ***\n\n"
      break
    fi
    printf "\n\n*** Just downloaded $(decode_hashtag "$q"). If found, reference date is %s ***\n\n" $(get_sample_return_date $filename)
    until=$(get_oldest_id $filename)
    let count+=1
    update_query
    update_filename
    while ! [[ $SECONDS -ge 3 ]]; do sleep 0.1; done
  done
}
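The main function also calls decode_hashtag, update_initial_query, and send_url_save_to, which are not spelled out in this post. A minimal sketch of what they could look like, pieced together from the snippets above, might be:
# Sketches of the remaining helpers (assumptions based on the snippets shown earlier)
decode_hashtag () {
  echo "$1" | sed 's/%23/#/' | sed 's/%20/ /g'   # inverse of encode_hashtag
}
update_initial_query () {
  query=$(echo "?query=$q&max_results=$max&start_time=$from&end_time=$to$vars" | sed 's/[\r\n]//g')
}
send_url_save_to () {
  # $1: query string, $2: output file
  curl -k --connect-timeout 30 --retry 100 --retry-delay 8 --retry-connrefused \
    "https://api.twitter.com/2/tweets/search/all$1" \
    -H "Authorization: Bearer $BEARER_TOKEN" > "$2"
}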
Downloading multiple hashtags
If you save multiple hashtags in queries.txt (one per line), you can run the above functions and routines for a series of hashtags like this:
#! /usr/bin/env bash
# Windows encoding fix
sed -i 's/\r$//g' queries.txt
touch downloaded.txt
while [[ $(wc -l < queries.txt) -gt 0 ]]
do
q=$(head -n 1 queries.txt)
bash main.sh "$q"
echo "$q" >> downloaded.txt
sed -i 1d queries.txt # delete first line of queries.txt
sleep 3 # rate limit
done
printf "\n\n*** All queries in queries.txt downloaded ***\n\n"
In addition to your setup (download timeframe, variables, …), you could import the functions in main.sh like this:
# Functions
. functions.sh
# Download parsed query
q="$1"
download_current_q
Summary
Here is an enumeration of the steps you need to take in order to download a set of hashtags with the new Twitter V2 API:
1. Apply for researcher access to Twitter’s API services
2. Set up a project with a connected app and save the bearer token to token.txt
3. Define a download timeframe, a set of variables, and a set of search terms to download
4. Send an initial query for these parameters and update the until_id parameter of the query based on the oldest tweet returned by the API response
5. Repeat step 4, updating the until_id parameter each time, until the response returns 0 tweets
6. Repeat steps 4 and 5 for each search term (e.g., hashtag)
Final thoughts
Becoming familiar with Twitter’s new API might take some time, but the effort will be worthwhile given the new capabilities the API provides to researchers.
Despite this learning curve, being able to query API endpoints from your own shell makes you independent of software packages that might become outdated as Twitter (and other providers) continuously update and improve their API endpoints.
If you would like to get started with Bash, I recommend The Unix Workbench by Sean Kross.