Twitter recently granted academic researchers free access to vast amounts of Twitter data, which motivated me to write another tutorial on how to query Twitter’s (new) full-archive endpoint through shell scripts.
One of the advantages is that researchers can now delete data sets after conducting a study without worrying about paying to re-download the data. Importantly, users might delete their posts or change their profile information over time, and they have the right to have their data deleted by researchers who previously scraped it.
If you work on a Windows machine, you will need to download Cygwin to run Bash shell scripts. Additionally, I used cURL to query Twitter’s API endpoints. For a quick install, you can type choco install curl in your shell.
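If you are unsure whether the tools are in place, a quick optional check like the following prints the installed versions:
command -v curl >/dev/null || echo "curl not found - install it first"
curl --version | head -n 1   # print the installed cURL version
bash --version | head -n 1   # print the installed Bash version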
After applying for access (currently open to graduate students and researchers), you will need to set up a project and an app connected to that project, which holds a bearer token. If you have a public Git repository, you do not want other people to see this token, as they could use it to scrape data in your name. Therefore, I recommend saving the bearer token to a file called token.txt and loading it into your script environment through BEARER_TOKEN=$(cat token.txt).
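To make sure the token file never ends up in a public repository, you could additionally add it to your .gitignore (just a suggestion, not required for the download routine itself):
echo "token.txt" >> .gitignore   # keep the bearer token out of version control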
Caveats
1. Be aware of rate limits
Twitter documents its rate limits for each API endpoint as they differ across endpoints. You definitely do not want to download more data than the rate limits allow because, among other things, you might end up breaking your download routine by obtaining empty return values.
Fortunately, Bash has a built-in variable called SECONDS which can be used to check whether you are still within the rate limit. In the following example, the rate limit is one request every 3 seconds.
SECONDS=0 # reset rate limit
# send query...
while ! [[ $SECONDS -ge 3 ]]; do sleep 0.1; done
# go to next iteration ...
2. Know your search terms
Searching for tweets with the Twitter API essentially works the same way as searching for tweets on the Twitter platform itself. This means that, by default, the order of search terms does not matter as long as all of them occur within the same tweet. Additionally, searching for “rstats” returns tweets containing “#rstats” as well as tweets containing just “rstats”. Still, there is a rich selection of search operators to tailor your query to your specific use case.
When in doubt, you might want to search for tweets on Twitter manually and take a look at which tweets are returned. For example, Twitter’s search functions automatically account for German umlauts, such that “baden_wuerttemberg” also returns “baden_württemberg”.
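To give a concrete idea, the operators below are documented by Twitter, but this particular combination is only an illustration, not part of the routine in this post:
# Example query: exclude retweets and restrict results to German-language tweets
q='#rstats -is:retweet lang:de'
# remember to URL-encode '#' and spaces before sending the query (see the Download section below)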
3. Changes to the data structure in the new v2 API
From what I have seen so far, Twitter made their API responses much more efficient compared to their old API endpoints. At the same time, this makes more sophisticated data aggregation scripts necessary.
One of the changes is that the API now returns separate tables for post data (“main”), user data (“users”), referenced (retweeted, mentioned, replied to) tweet data (“tweets”), and poll data (“polls”).
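To give a rough impression of that layout, here is a minimal sketch using jq, which the original routine does not rely on; response.json stands for any single API response file:
# Split one API response into its separate tables (assumes jq is installed)
jq '.data'            response.json > main.json    # post data
jq '.includes.users'  response.json > users.json   # user data
jq '.includes.tweets' response.json > tweets.json  # referenced tweet data
jq '.includes.polls'  response.json > polls.json   # poll data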
I might write another post in the future describing how I recently built a script that aggregates data given this new form of API responses. For now, a GitHub repo connected to a study in which I applied such a script can be found here (go to /download).
4. Check your output
Sometimes you want to download amounts of data that take a couple of hours to scrape, so you will not monitor each and every API response. I found that around 0.05% of API responses contained an unexpected error: they started with “Service Unavailable”, while the remaining part of the response still contained data. This made parsing the JSON responses into other data formats impossible (at least with the R package {jsonlite}), so I circumvented the problem by finding out which files featured such a header and deleting the header manually:
let count=1
# print a progress count, and print the name of every file that starts with "Service Unavailable"
for f in *.json; do echo $count; let count+=1; if [[ $(head -c 30 "$f") == *"Service"* ]]; then echo "$f"; sleep 10; fi; done
Naturally, this is just one way of doing this (you could also append names of erroneous files to a text file) and not everyone needs a progress count when iterating through all files (which, by the way, still might take a few minutes).
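If you would rather not delete the header by hand, one possible variation is to strip everything before the first opening brace in the flagged files. This sketch assumes the “Service Unavailable” text and the start of the JSON sit on the same first line, which you should verify on your own data:
# Remove the leading "Service Unavailable" text up to the first '{' in flagged files
for f in *.json; do
  if [[ $(head -c 30 "$f") == *"Service"* ]]; then
    sed -i '1s/^[^{]*{/{/' "$f"   # keep everything from the first '{' onwards
  fi
done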
Download routine
Setup
First, you want to define a couple of things before starting your download routine: From when to when would you like to obtain data? Which variables would you like to have returned? As of now, Twitter requires you to explicitly request almost all the variables you would like to have returned. I used the following settings to download the complete history of hashtags up to 01/01/2021, including, to the best of my knowledge, all currently available variables.
# Setup
BEARER_TOKEN=$(cat token.txt)
from="2008-01-01T00%3A00%3A00Z"
to="2020-12-31T23%3A59%3A59Z"
#max="10" # testing
max="500" # you can download up to 500 tweets at once
# Variables
tweet="&tweet.fields=attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld"
user="&user.fields=created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld"
expansions="&expansions=attachments.poll_ids,attachments.media_keys,author_id,entities.mentions.username,geo.place_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id"
media="&media.fields=duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics"
geo="&place.fields=contained_within,country,country_code,full_name,geo,id,name,place_type"
poll="&poll.fields=duration_minutes,end_datetime,id,options,voting_status"
vars="$tweet$user$expansions$media$geo$poll"
For comprehensive documentation of the /tweets/search/all endpoint, check out Twitter’s official documentation.
Download
First of all, you need to encode hashtags and space characters to a URL-compatible format:
encode_hashtag () {
  echo "$1" | sed 's/#/%23/' | sed 's/ /%20/g'
}
Then, you might set your initial query to the following format:
q="#rstats"
q=$(encode_hashtag "$q")
query=$(echo "?query=$q&max_results=$max&start_time=$from&end_time=$to$vars" | sed 's/[\r\n]//g')
Then you send the query with the following command (modify the curl flags as appropriate; $filename is the output file for this request and is defined further below):
curl \
-k \
--connect-timeout 30 \
--retry 100 \
--retry-delay 8 \
--retry-connrefused \
"https://api.twitter.com/2/tweets/search/all$query" -H "Authorization: Bearer $BEARER_TOKEN" > $filename.json
Updating query parameters
After the initial query, you have to update the until_id parameter of the query. The API searches tweets chronologically backwards, so you set until_id to the status ID of the oldest tweet returned, which you can extract with this function…
get_oldest_id () {
  tail -c 250 "$1" | grep -oP '\"oldest_id\":\"(.+?)\"' | sed 's/oldest_id//g;s/[\":]//g'
}
until=$(get_oldest_id $filename)
…and then update the query with the following function:
update_query () {
  query="?query=$q&max_results=$max&until_id=$until$vars"
}
Additionally, you have to update the output filename in order to save each response to a different file (appending to one file is a bad idea, considering you later have to parse the JSON data), for example through a variable called count:
q="#rstats"
let count=1
update_filename () {
filename="json/$(echo "$q" | sed 's/%23/#/' | sed 's/ /+/g' | sed 's/%20/+/g')-$count.json"
}
# send query and save file...
let count+=1
update_filename
# go to next iteration ...
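Note that update_filename writes into a json/ subdirectory (for an encoded q="%23rstats" and count=1, it would produce json/#rstats-1.json), so make sure that directory exists before the first download:
mkdir -p json   # create the output directory once, if it does not exist yet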
Everything put together
Here is an example of how I currently download hashtags with Bash. Keep in mind that the functions assume a couple of pre-existing variables in the global environment.
# Additional functions not introduced before
get_result_count () {
  tail -c 250 "$1" | grep -oP '\"result_count\":[0-9]{1,}' | sed 's/\"result_count\"://g'
}
get_sample_return_date () {
  head -c 5000 "$1" | grep -oP '\"created_at\":\"(.+?)T' | sed 's/created_at//g;s/[\":T]//g' | head -n 1
}
# Main download function
download_current_q () {
  printf "\n\n*** Downloading %s ***\n\n" "$(decode_hashtag "$q")"
  let count=1
  update_filename
  update_initial_query
  while true
  do
    SECONDS=0
    send_url_save_to $query $filename
    if ! [[ $(get_result_count $filename) -gt 0 ]]
    then
      rm $filename
      printf "\n\n*** Download $(decode_hashtag "$q") complete ***\n\n"
      break
    fi
    printf "\n\n*** Just downloaded $(decode_hashtag "$q"). If found, reference date is %s ***\n\n" $(get_sample_return_date $filename)
    until=$(get_oldest_id $filename)
    let count+=1
    update_query
    update_filename
    while ! [[ $SECONDS -ge 3 ]]; do sleep 0.1; done
  done
}
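The main function also calls decode_hashtag, update_initial_query, and send_url_save_to, which are not spelled out in this post. A minimal sketch of what they could look like, pieced together from the snippets above, might be:
# Sketches of the remaining helpers (assumptions based on the snippets shown earlier)
decode_hashtag () {
  echo "$1" | sed 's/%23/#/' | sed 's/%20/ /g'   # inverse of encode_hashtag
}
update_initial_query () {
  query=$(echo "?query=$q&max_results=$max&start_time=$from&end_time=$to$vars" | sed 's/[\r\n]//g')
}
send_url_save_to () {
  # $1: query string, $2: output file
  curl -k --connect-timeout 30 --retry 100 --retry-delay 8 --retry-connrefused \
    "https://api.twitter.com/2/tweets/search/all$1" \
    -H "Authorization: Bearer $BEARER_TOKEN" > "$2"
}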
Downloading multiple hashtags
If you save multiple hashtags in queries.txt (one per line), you can run the above functions and routines for a series of hashtags like this:
#! /usr/bin/env bash
# Windows encoding fix
sed -i 's/\r$//g' queries.txt
touch downloaded.txt
while [[ $(wc -l < queries.txt) -gt 0 ]]
do
q=$(head -n 1 queries.txt)
bash main.sh "$q"
echo "$q" >> downloaded.txt
sed -i 1d queries.txt # delete first line of queries.txt
sleep 3 # rate limit
done
printf "\n\n*** All queries in queries.txt downloaded ***\n\n"
In addition to your setup (download timeframe, variables, …), you could import the functions in main.sh like this:
# Functions
. functions.sh
# Download parsed query
q="$1"
download_current_q
Summary
Here is an enumeration of the steps you need to take in order to download a set of hashtags with the new Twitter V2 API:
1. Apply for researcher access to Twitter’s API services
2. Set up a project with a connected app and save the bearer token to token.txt
3. Define a download timeframe, a set of variables, and a set of search terms to download
4. Send an initial query for these parameters and update the until_id parameter of the query based on the oldest tweet returned by the API response
5. Repeat step 4, updating the until_id parameter each time, until the response returns 0 tweets
6. Repeat steps 4 and 5 for each search term (e.g., hashtag)
Final thoughts
Becoming familiar with Twitter’s new API might take some time, but the effort will be worthwhile given the new capabilities the API provides to researchers.
Despite this learning curve, being able to query API endpoints from your own shell makes you independent of software packages that might become outdated as Twitter (and other providers) continuously update and improve their API endpoints.
If you would like to get started with Bash, I recommend The Unix Workbench by Sean Kross.