One promising feature of Twitter's new, free V2 API for academic researchers is the possibility to capture conversations, or reply threads, through a variable called conversation_id. I wanted to explore how to download such conversations and what the Twitter API returns when searching for them.
With the help of my colleague Fitore Morina, I was able to publish the following example thread to Twitter, with indentation indicating layers of interaction:
head of conversation test
    reply 1
        reply to reply 1
    *reply 2 to head #hashtagreallynobodyuses*
        reply to hashtag
    reply 3 to head
        reply to reply 3 forward
            reply to reply to reply
        _quoted reply 3 to head_
    final reply to head
        _retweeted final reply to head_
Downloading conversations related to communities
Imagine we would like to download all tweets related to the community #hashtagreallynobodyuses. To gain a better understanding of the community, we do not only want to look at tweets that use the hashtag, but also at tweets that are part of conversations around that hashtag.
First, we would download all tweets with #hashtagreallynobodyuses via the /tweets/search/all endpoint:
BEARER_TOKEN=$(cat token.txt)
from="2021-03-23T00%3A00%3A00Z"   # start_time (URL-encoded RFC 3339)
to="2021-03-23T10%3A02%3A59Z"     # end_time
max="500"
vars="&tweet.fields=conversation_id"   # also return the conversation_id field
q="%23hashtagreallynobodyuses"         # URL-encoded #hashtagreallynobodyuses
query="?query=$q&max_results=$max&start_time=$from&end_time=$to$vars"
curl "https://api.twitter.com/2/tweets/search/all$query" -H "Authorization: Bearer $BEARER_TOKEN"
Which returns:
{"data":[{"conversation_id":"1374299459614609409","id":"1374299647028649986","text":"reply 2 to head #hashtagreallynobodyuses"}],"meta":{"newest_id":"1374299647028649986","oldest_id":"1374299647028649986","result_count":1}}
Then, we would search for all tweets affiliated with the conversation ID 1374299459614609409 through the /tweets/search/recent?query=conversation_id:<ID> endpoint:
max="100" # limits differ across endpoints
q="1374299459614609409"
query="?query=conversation_id:$q&max_results=$max&start_time=$from&end_time=$to$vars"
curl "https://api.twitter.com/2/tweets/search/recent$query" -H "Authorization: Bearer $BEARER_TOKEN"
Which returns:
{"data":[{"conversation_id":"1374299459614609409","id":"1374299963589599234","text":"reply to reply to reply"},{"conversation_id":"1374299459614609409","id":"1374299919821967361","text":"final reply to head"},{
"conversation_id":"1374299459614609409","id":"1374299851068997632","text":"reply to reply 3 forward"},{"conversation_id":"1374299459614609409","id":"1374299787705659398","text":"reply to hashtag"},{"conversatio
n_id":"1374299459614609409","id":"1374299753828261891","text":"reply 3 to head"},{"conversation_id":"1374299459614609409","id":"1374299647028649986","text":"reply 2 to head #hashtagreallynobodyuses"},{"conversa
tion_id":"1374299459614609409","id":"1374299539272785924","text":"reply to reply 1"},{"conversation_id":"1374299459614609409","id":"1374299482402209796","text":"reply 1"}],"meta":{"newest_id":"13742999635895992
34","oldest_id":"1374299482402209796","result_count":8}}
I admit that the latter response is not too readable, so I wrote a short parsing script in R:
#!/usr/bin/env Rscript
# conv.json holds the JSON response of the conversation_id search saved to disk
d <- jsonlite::fromJSON("conv.json", simplifyDataFrame = TRUE, flatten = TRUE)
tibble::as_tibble(d$data)
Which prints the following result:
# A tibble: 8 x 3
conversation_id id text
<chr> <chr> <chr>
1 1374299459614609409 1374299963589599234 reply to reply to reply
2 1374299459614609409 1374299919821967361 final reply to head
3 1374299459614609409 1374299851068997632 reply to reply 3 forward
4 1374299459614609409 1374299787705659398 reply to hashtag
5 1374299459614609409 1374299753828261891 reply 3 to head
6 1374299459614609409 1374299647028649986 reply 2 to head #hashtagreallynobodyu~
7 1374299459614609409 1374299539272785924 reply to reply 1
8 1374299459614609409 1374299482402209796 reply 1
There are a couple of things to note here:
- Retweets and mentions of tweets in the reply chain do not count as part of the conversation
This might be a limitation for social network analyses since retweets and mentions can also be interpreted as being part of interactions.
The referenced_tweets tweet field, requested via the expansions parameter, allows you to obtain data about the tweet a retweet was retweeting or the tweet a mention was mentioning (a minimal request sketch follows this list). Still, this method only points to tweets posted prior to the downloaded tweet connected to a community and is not able to capture chains of retweets (e.g., retweets of retweets).
- The head of the conversation is missing and needs to be downloaded separately
You would download those missing tweets with the /tweets/:id endpoint as described in the official API documentation. As the status ID of the tweet that sparked a conversation is equal to the conversation ID itself, we can simply look up the tweet with the status ID 1374299459614609409:
curl "https://api.twitter.com/2/tweets/1374299459614609409" -H "Authorization: Bearer $BEARER_TOKEN"
# Output: {"data":{"id":"1374299459614609409","text":"head of conversation test"}}
Notice that if we were only interested in conversations sparked by tweets with #hashtagreallynobodyuses, we would be able to skip this step since the conversation starters would already be part of the data returned in the context of our initial search.
- Conversations cannot reliably distinguish forward and backward search
Originally, we searched for #hashtagreallynobodyuses in order to discover conversations related to that hashtag. What we obtained was a complete reply chain, which includes both the tweet that reply 2 to head #hashtagreallynobodyuses replied to and other sub-threads of the conversation started by the tweet head of conversation test (backward search), as well as the tweets that replied to reply 2 to head #hashtagreallynobodyuses itself (forward search).
Notice that there might be qualitative differences between backward search and forward search. Tweets obtained through backward search are the tweets that made the community #hashtagreallynobodyuses contribute a reply to the conversation, for example when the content of a conversation is worth sharing inside the community. Tweets obtained through forward search, on the other hand, might capture discussions about something inside the community #hashtagreallynobodyuses.
The variable created_at might be a good proxy for filtering tweets in the sense of a forward search by only including tweets posted after the tweet with #hashtagreallynobodyuses. Still, tweets replying to unrelated parts of the “forward conversation” (e.g., additional replies to the conversation starter) might have been posted later than the tweet with #hashtagreallynobodyuses as well.
The only robust solution I can think of is to iteratively identify reply chains of tweets that replied to the tweet reply 2 to head #hashtagreallynobodyuses through the referenced_tweets field, which can be requested via the expansions parameter of the /tweets endpoints (a rough sketch follows this list).
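To make the first note more concrete, here is a minimal request sketch in R (in the style of the wrapper attached in the comments below). It is only an illustration: the token file name is a placeholder, and the conversation ID is the example one from above.
library(httr)
# Sketch: request the referenced tweets alongside the conversation search.
bearer_token <- readLines("token.txt")   # placeholder; adjust to your token file
response <- GET(
  "https://api.twitter.com/2/tweets/search/recent",
  add_headers(Authorization = paste("Bearer", bearer_token)),
  query = list(
    query        = "conversation_id:1374299459614609409",
    max_results  = "100",
    tweet.fields = "conversation_id,created_at,referenced_tweets",
    expansions   = "referenced_tweets.id"
  )
)
dat <- jsonlite::fromJSON(content(response, as = "text"), flatten = TRUE)
# dat$data holds the conversation tweets; dat$includes$tweets holds the tweets
# they retweeted, quoted, or replied to -- but only one hop back, so chains of
# retweets are still not captured.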
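And here is a rough sketch of the iterative forward-search idea from the last note, assuming the conversation was downloaded with the referenced_tweets field (as in the sketch above) and parsed into dat$data; the variable names (conv, seed_id, keep) are mine, not part of the API.
# Sketch: isolate the "forward" part of the conversation by walking replied_to
# references, starting from the tweet that carried the hashtag.
conv <- dat$data
seed_id <- "1374299647028649986"   # "reply 2 to head #hashtagreallynobodyuses"

# The crude created_at heuristic would be
#   conv[conv$created_at > conv$created_at[conv$id == seed_id], ]
# (ISO 8601 UTC timestamps sort chronologically as strings), but it also keeps
# later replies to unrelated branches of the thread.

# Iterative alternative: repeatedly keep tweets whose replied_to reference
# points to a tweet we have already kept.
parent_of <- vapply(conv$referenced_tweets, function(refs) {
  if (is.null(refs)) return(NA_character_)   # tweets without any references
  ids <- refs$id[refs$type == "replied_to"]
  if (length(ids) > 0) ids[1] else NA_character_
}, character(1))

keep <- seed_id
repeat {
  new_ids <- conv$id[!is.na(parent_of) & parent_of %in% keep & !(conv$id %in% keep)]
  if (length(new_ids) == 0) break
  keep <- c(keep, new_ids)
}
forward <- conv[conv$id %in% keep, ]   # the seed tweet plus the replies to it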
Summary
Here is a summary of how I would currently go about downloading and investigating conversations inside a given community:
- Download all tweets with a certain hashtag (e.g., “#EDchat”) via the /tweets/search/all endpoint
- Extract all unique conversation IDs of these tweets via the variable conversation_id (see the sketch below)
- Download all tweets affiliated with these conversation IDs via /tweets/search/recent?query=conversation_id:<ID>
- Optionally, download the tweets that sparked the conversations via the /tweets/:id endpoint
- Clean and merge your data, then perform network analyses based on the variables conversation_id and user_id
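Steps 2 and 3 are easy to wire together. The snippet below is only a sketch (the file names hashtag-search.json and conversation-ids.txt are placeholders): it extracts the unique conversation IDs from a saved hashtag search and writes them to a text file, one ID per line, which is the input format the batch scripts attached in the comments below expect.
# Sketch for step 2: collect the unique conversation IDs of the hashtag tweets.
# "hashtag-search.json" stands for the saved response of the initial
# /tweets/search/all query; adjust the file names to your own setup.
hashtag_tweets <- jsonlite::fromJSON("hashtag-search.json", flatten = TRUE)$data
conv_ids <- unique(hashtag_tweets$conversation_id)
writeLines(conv_ids, "conversation-ids.txt")   # one ID per line

# For step 5, the downloaded conversation tweets can then be merged back onto
# the hashtag tweets by joining on conversation_id (and aggregated per user_id).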
Final thoughts and current limitations of searching by conversation_id
Searching for conversations is a very promising avenue for Twitter researchers interested in social network analysis. Still, as of the time of this post, there are a couple of limitations to the current functionality Twitter's API provides when searching for conversations. Most notably, there is no option to distinguish replies that followed tweets affiliated with a certain community from tweets sparking interaction in a community (forward search vs. backward search).
Additionally, the multitude of download steps necessary to obtain conversations inside a given community (#hashtag) might be overly complicated, and I would like to see Twitter give researchers access to an endpoint that downloads replies automatically when searching for tweets via the /tweets/search/all endpoint. Not only would this make the analysis of conversations on Twitter more accessible, but also less error-prone.
Comments
On 4/8/2021, 19:43:19 (GMT+0100) Andrew wrote: Title: run for multiple conversation_ids
Hi Great post, thanks for this. Question though, is there any easy way to run this for say 50 conversation_ids in one go? I have lots of conversation_ids and want to get back all the replies
On 8/8/2021, 13:55:28 (GMT+0200) Conrad wrote: In reply to: run for multiple conversation_ids Title: RE: run for multiple conversation_ids
Thanks! There are multiple ways of achieving this. They involve having a .txt file in which all conversation IDs are stored with each line in the document holding one ID. Then, you can run a while-loop in Bash or R which extracts the IDs and performs a download routine until the .txt file is empty. For each ID, you also want to include a condition which checks whether there are any more tweets left to download (i.e. the number of returned tweets is 0) and updates the date until which the search looks up tweets otherwise. I implemented such a routine in this blog post: https://cborchers.com/2021/02/15/using-bash-to-query-the-new-twitter-api-2.0/ and will also paste the relevant code snippet here for demonstration purposes. I will also attach a recent R script from our research group.
Attachments to Comments
Title: run for multiple conversation_ids
#! /usr/bin/env bash
# Windows encoding fix
sed -i 's/\r$//g' queries.txt
touch downloaded.txt
while [[ $(wc -l < queries.txt) -gt 0 ]]
do
q=$(head -n 1 queries.txt)
bash main.sh "$q" # main.sh performs the actual download for one query/ID
echo "$q" >> downloaded.txt
sed -i 1d queries.txt # delete first line of queries.txt
sleep 3 # rate limit
done
printf "\n\n*** All queries in queries.txt downloaded ***\n\n"
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
# Parsing arguments: 1: token reference
# 2: conv ID filename reference holding conv IDs line by line
# Example usage
# rscript R/conversations-search.R mario x00
########################################################################
### Twitter API Conversations Wrapper in R ###
# https://developer.twitter.com/en/docs/twitter-api/conversation-id
token <- paste0("token-", args[1], ".txt") # edit depending on your file and folder names
cid_file <- paste0("data/", args[2], ".txt") # edit depending on your file and folder names
library(httr)
library(tidyverse)
## API Authorization
bearer_token <- read_file(token)
headers <- c(
Authorization = sprintf("Bearer %s", bearer_token)
)
## List of Twitter API Params
params <- list(
query = NULL, # empty to fill up
max_results = "500", # range 10-100 / 500
start_time = "2008-01-01T00:00:00Z", # (YYYY-MM-DDTHH:mm:ssZ) -> RFC3339 date-time
end_time = "2020-12-31T23:59:59Z", # (YYYY-MM-DDTHH:mm:ssZ)
tweet.fields = "attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,possibly_sensitive,public_metrics,referenced_tweets,reply_settings,source,text,withheld",
expansions = "attachments.poll_ids,attachments.media_keys,author_id,geo.place_id,in_reply_to_user_id,referenced_tweets.id,entities.mentions.username,referenced_tweets.id.author_id",
media.fields = "duration_ms,height,media_key,preview_image_url,public_metrics,type,url,width",
place.fields = "contained_within,country,country_code,full_name,geo,id,name,place_type",
poll.fields = "duration_minutes,end_datetime,id,options,voting_status",
user.fields = "created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld",
next_token = NULL # https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/paginate
)
# Get Conversation by conversation_id from API (optionally add next_token for pagination)
get_conversation <- function(conversation_id, next_token = NULL, .headers = headers, .params = params) {
cat("\nQuery:", conversation_id, "| next_token:", next_token, file = log_file)
.params$query <- paste0("conversation_id:", conversation_id)
.params$next_token <- next_token
response <- httr::GET(
url = "https://api.twitter.com/2/tweets/search/all",
httr::add_headers(.headers = .headers), query = .params
)
cat(" | Status:", status_code(response), file = log_file)
return(response)
}
# Return data from API Response (based on jsonlite::fromJSON)
parse_response <- function(response) {
dat <- content(
response,
as = "parsed",
type = "application/json",
simplifyDataFrame = TRUE
# TODO: flatten??
)
return(dat)
}
# Parsing method compatible with our existing data cleaning methods
parse_json_response <- function(response) {
dat <- httr::content(response, as = "text") %>%
jsonlite::fromJSON(simplifyDataFrame = TRUE, flatten = TRUE)
return(dat)
}
### Test ------------------------------------------------------------
# dat <- list()
# dat[[1]] <- get_conversation() %>% parse_json_response()
# View(dat)
### Basic Pagination -----------------------------------------------------------
# open appendable log file, optional
log_file <- file(paste0("logs/download_", Sys.Date(), "_", args[2], ".log"), open = "a")
while (length(readLines(cid_file)) > 0) {
cat("\nWaiting 3 s because of rate limits...")
Sys.sleep(3)
conv_id <- readLines(cid_file)[1] %>% str_remove_all(" ")
cat("\nQuery conversation:", conv_id)
i <- 1
results <- list()
# first API call
results[[i]] <- get_conversation(conv_id, next_token = NULL) %>% parse_json_response()
# get next token
token <- results[[i]][["meta"]][["next_token"]]
# as long as a next_token exists, download next "page"
while (!is.null(token)) {
cat("\nGoing to next page...")
cat("\nWaiting 3 s because of rate limits...")
Sys.sleep(3) # Rate Limiting for 15 min
cat("\nQuery next_token", token, "for conversation", conv_id, ":")
i <- i + 1
results[[i]] <- get_conversation(conv_id, next_token = token) %>%
parse_json_response()
# get next token and loop
token <- results[[i]][["meta"]][["next_token"]]
}
cat("\nSaving conversation...")
saveRDS(results, paste0("conversations-data/", conv_id, ".rds"))
cat("saved!")
# Update reference file
writeLines(readLines(cid_file)[-1], cid_file)
cat("\nReference file", cid_file, "updated!\n")
}