One promising feature of Twitter's new, free V2 API for academic researchers is the possibility to capture conversations, or reply threads, through a variable called conversation_id. I wanted to explore how to download such conversations and what the Twitter API returns when searching for them.
With the help of my colleague Fitore Morina, I was able to publish the following example thread to Twitter, with indentation indicating layers of interaction:
head of conversation test
    reply 1
        reply to reply 1
    *reply 2 to head #hashtagreallynobodyuses*
        reply to hashtag
    reply 3 to head
        reply to reply 3 forward
            reply to reply to reply
        _quoted reply 3 to head_
    final reply to head
        _retweeted final reply to head_
Downloading conversations related to communities
Imagine we would like to download all tweets related to the community #hashtagreallynobodyuses. To gain a better understanding of the community, we do not only want to look at tweets that use the hashtag, but also at tweets that are part of conversations around that hashtag.
First, we would download all tweets with #hashtagreallynobodyuses via the /tweets/search/all endpoint:
BEARER_TOKEN=$(cat token.txt)
from="2021-03-23T00%3A00%3A00Z"   # start_time (URL-encoded RFC 3339)
to="2021-03-23T10%3A02%3A59Z"     # end_time
max="500"
vars="&tweet.fields=conversation_id"   # also return the conversation_id field
q="%23hashtagreallynobodyuses"         # URL-encoded #hashtagreallynobodyuses
query="?query=$q&max_results=$max&start_time=$from&end_time=$to$vars"
curl "https://api.twitter.com/2/tweets/search/all$query" -H "Authorization: Bearer $BEARER_TOKEN"
Which returns:
{"data":[{"conversation_id":"1374299459614609409","id":"1374299647028649986","text":"reply 2 to head #hashtagreallynobodyuses"}],"meta":{"newest_id":"1374299647028649986","oldest_id":"1374299647028649986","result_count":1}}
Then, we would search for all tweets affiliated with the conversation ID 1374299459614609409 through the /tweets/search/recent?query=conversation_id:<ID> endpoint:
max="100" # limits differ across endpoints
q="1374299459614609409"
query="?query=conversation_id:$q&max_results=$max&start_time=$from&end_time=$to$vars"
curl "https://api.twitter.com/2/tweets/search/recent$query" -H "Authorization: Bearer $BEARER_TOKEN"
Which returns:
{"data":[{"conversation_id":"1374299459614609409","id":"1374299963589599234","text":"reply to reply to reply"},{"conversation_id":"1374299459614609409","id":"1374299919821967361","text":"final reply to head"},{
"conversation_id":"1374299459614609409","id":"1374299851068997632","text":"reply to reply 3 forward"},{"conversation_id":"1374299459614609409","id":"1374299787705659398","text":"reply to hashtag"},{"conversatio
n_id":"1374299459614609409","id":"1374299753828261891","text":"reply 3 to head"},{"conversation_id":"1374299459614609409","id":"1374299647028649986","text":"reply 2 to head #hashtagreallynobodyuses"},{"conversa
tion_id":"1374299459614609409","id":"1374299539272785924","text":"reply to reply 1"},{"conversation_id":"1374299459614609409","id":"1374299482402209796","text":"reply 1"}],"meta":{"newest_id":"13742999635895992
34","oldest_id":"1374299482402209796","result_count":8}}
I admit that the latter response is not too readable, so I wrote a short parsing script in R:
#!/usr/bin/env Rscript
# conv.json holds the JSON response of the conversation_id search saved to disk
d <- jsonlite::fromJSON("conv.json", simplifyDataFrame = TRUE, flatten = TRUE)
tibble::as_tibble(d$data)
Which prints the following result:
# A tibble: 8 x 3
conversation_id id text
<chr> <chr> <chr>
1 1374299459614609409 1374299963589599234 reply to reply to reply
2 1374299459614609409 1374299919821967361 final reply to head
3 1374299459614609409 1374299851068997632 reply to reply 3 forward
4 1374299459614609409 1374299787705659398 reply to hashtag
5 1374299459614609409 1374299753828261891 reply 3 to head
6 1374299459614609409 1374299647028649986 reply 2 to head #hashtagreallynobodyu~
7 1374299459614609409 1374299539272785924 reply to reply 1
8 1374299459614609409 1374299482402209796 reply 1
There are a couple of things to note here:
- Retweets and mentions of tweets in the reply chain do not count as part of the conversation
This might be a limitation for social network analyses since retweets and mentions can also be interpreted as being part of interactions.
The referenced_tweets tweet field, requested via the expansions parameter, allows you to obtain data about the tweet a retweet was retweeting or the tweet a mention was mentioning (a minimal request sketch follows this list). Still, this method only points to tweets posted prior to the downloaded tweet connected to a community and is not able to capture chains of retweets (e.g., retweets of retweets).
- The head of the conversation is missing and needs to be downloaded separately
You would download those missing tweets with the /tweets/:id endpoint as described in the official API documentation. As the status ID of the tweet that sparked a conversation is equal to the conversation ID itself, we can simply look up the tweet with the status ID 1374299459614609409:
curl "https://api.twitter.com/2/tweets/1374299459614609409" -H "Authorization: Bearer $BEARER_TOKEN"
# Output: {"data":{"id":"1374299459614609409","text":"head of conversation test"}}
Notice that if we were only interested in conversations sparked by tweets with #hashtagreallynobodyuses, we would be able to skip this step since the conversation starters would already be part of the data returned in the context of our initial search.
- Conversations cannot reliably distinguish forward and backward search
Originally, we searched for #hashtagreallynobodyuses in order to discover conversations related to that hashtag. What we obtained was a complete reply chain, which includes both the tweet that reply 2 to head #hashtagreallynobodyuses replied to and other sub-threads of the conversation started by the tweet head of conversation test (backward search), as well as the tweets that replied to reply 2 to head #hashtagreallynobodyuses itself (forward search).
Notice that there might be qualitative differences between backward search and forward search. Tweets obtained through backward search are the tweets that made the community #hashtagreallynobodyuses contribute a reply to the conversation, for example when the content of a conversation is worth sharing inside the community. Tweets obtained through forward search, on the other hand, might capture discussions about something inside the community #hashtagreallynobodyuses.
The variable created_at might be a good proxy for filtering tweets in the sense of a forward search by only including tweets posted after the tweet with #hashtagreallynobodyuses. Still, tweets replying to unrelated parts of the “forward conversation” (e.g., additional replies to the conversation starter) might have been posted later than the tweet with #hashtagreallynobodyuses as well.
The only robust solution I can think of is to iteratively identify reply chains of tweets that replied to the tweet reply 2 to head #hashtagreallynobodyuses through the referenced_tweets field, which can be requested via the expansions parameter of the /tweets endpoints (a rough sketch follows this list).
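To make the first note more concrete, here is a minimal request sketch in R (in the style of the wrapper attached in the comments below). It is only an illustration: the token file name is a placeholder, and the conversation ID is the example one from above.
library(httr)
# Sketch: request the referenced tweets alongside the conversation search.
bearer_token <- readLines("token.txt")   # placeholder; adjust to your token file
response <- GET(
  "https://api.twitter.com/2/tweets/search/recent",
  add_headers(Authorization = paste("Bearer", bearer_token)),
  query = list(
    query        = "conversation_id:1374299459614609409",
    max_results  = "100",
    tweet.fields = "conversation_id,created_at,referenced_tweets",
    expansions   = "referenced_tweets.id"
  )
)
dat <- jsonlite::fromJSON(content(response, as = "text"), flatten = TRUE)
# dat$data holds the conversation tweets; dat$includes$tweets holds the tweets
# they retweeted, quoted, or replied to -- but only one hop back, so chains of
# retweets are still not captured.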
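And here is a rough sketch of the iterative forward-search idea from the last note, assuming the conversation was downloaded with the referenced_tweets field (as in the sketch above) and parsed into dat$data; the variable names (conv, seed_id, keep) are mine, not part of the API.
# Sketch: isolate the "forward" part of the conversation by walking replied_to
# references, starting from the tweet that carried the hashtag.
conv <- dat$data
seed_id <- "1374299647028649986"   # "reply 2 to head #hashtagreallynobodyuses"

# The crude created_at heuristic would be
#   conv[conv$created_at > conv$created_at[conv$id == seed_id], ]
# (ISO 8601 UTC timestamps sort chronologically as strings), but it also keeps
# later replies to unrelated branches of the thread.

# Iterative alternative: repeatedly keep tweets whose replied_to reference
# points to a tweet we have already kept.
parent_of <- vapply(conv$referenced_tweets, function(refs) {
  if (is.null(refs)) return(NA_character_)   # tweets without any references
  ids <- refs$id[refs$type == "replied_to"]
  if (length(ids) > 0) ids[1] else NA_character_
}, character(1))

keep <- seed_id
repeat {
  new_ids <- conv$id[!is.na(parent_of) & parent_of %in% keep & !(conv$id %in% keep)]
  if (length(new_ids) == 0) break
  keep <- c(keep, new_ids)
}
forward <- conv[conv$id %in% keep, ]   # the seed tweet plus the replies to it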
Summary
Here is a summary of how I would currently go about downloading and investigating conversations inside a given community:
- Download all tweets with a certain hashtag (e.g., “#EDchat”) via the /tweets/search/all endpoint
- Extract all unique conversation IDs of these tweets via the variable conversation_id (see the sketch below)
- Download all tweets affiliated with these conversation IDs via /tweets/search/recent?query=conversation_id:<ID>
- Optionally, download the tweets that sparked the conversations via the /tweets/:id endpoint
- Clean and merge your data, then perform network analyses based on the variables conversation_id and user_id
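Steps 2 and 3 are easy to wire together. The snippet below is only a sketch (the file names hashtag-search.json and conversation-ids.txt are placeholders): it extracts the unique conversation IDs from a saved hashtag search and writes them to a text file, one ID per line, which is the input format the batch scripts attached in the comments below expect.
# Sketch for step 2: collect the unique conversation IDs of the hashtag tweets.
# "hashtag-search.json" stands for the saved response of the initial
# /tweets/search/all query; adjust the file names to your own setup.
hashtag_tweets <- jsonlite::fromJSON("hashtag-search.json", flatten = TRUE)$data
conv_ids <- unique(hashtag_tweets$conversation_id)
writeLines(conv_ids, "conversation-ids.txt")   # one ID per line

# For step 5, the downloaded conversation tweets can then be merged back onto
# the hashtag tweets by joining on conversation_id (and aggregated per user_id).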
Final thoughts and current limitations of searching by conversation_id
Searching for conversations is a very promising avenue for Twitter researchers interested in social network analysis. Still, as of the time of this post, there are a couple of limitations to the current functionality Twitter's API provides when searching for conversations. Most notably, there is no option to distinguish replies that followed tweets affiliated with a certain community from tweets sparking interaction in a community (forward search vs. backward search).
Additionally, the multitude of download steps necessary to obtain conversations inside a given community (#hashtag) might be overly complicated, and I would like to see Twitter give researchers access to an endpoint that downloads replies automatically when searching for tweets via the /tweets/search/all endpoint. Not only would this make the analysis of conversations on Twitter more accessible, but also less error-prone.
Comments
On 4/8/2021, 19:43:19 (GMT+0100) Andrew wrote: Title: run for multiple conversation_ids
Hi Great post, thanks for this. Question though, is there any easy way to run this for say 50 conversation_ids in one go? I have lots of conversation_ids and want to get back all the replies
On 8/8/2021, 13:55:28 (GMT+0200) Conrad wrote: In reply to: run for multiple conversation_ids Title: RE: run for multiple conversation_ids
Thanks! There are multiple ways of achieving this. They involve having a .txt file in which all conversation IDs are stored with each line in the document holding one ID. Then, you can run a while-loop in Bash or R which extracts the IDs and performs a download routine until the .txt file is empty. For each ID, you also want to include a condition which checks whether there are any more tweets left to download (i.e. the number of returned tweets is 0) and updates the date until which the search looks up tweets otherwise. I implemented such a routine in this blog post: https://cborchers.com/2021/02/15/using-bash-to-query-the-new-twitter-api-2.0/ and will also paste the relevant code snippet here for demonstration purposes. I will also attach a recent R script from our research group.
Attachments to Comments
Title: run for multiple conversation_ids
#! /usr/bin/env bash
# Windows encoding fix
sed -i 's/\r$//g' queries.txt
touch downloaded.txt
while [[ $(wc -l < queries.txt) -gt 0 ]]
do
q=$(head -n 1 queries.txt)
bash main.sh "$q" # main.sh performs the actual download for one query/ID
echo "$q" >> downloaded.txt
sed -i 1d queries.txt # delete first line of queries.txt
sleep 3 # rate limit
done
printf "\n\n*** All queries in queries.txt downloaded ***\n\n"
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
# Parsing arguments: 1: token reference
# 2: conv ID filename reference holding conv IDs line by line
# Example usage
# rscript R/conversations-search.R mario x00
########################################################################
### Twitter API Conversations Wrapper in R ###
# https://developer.twitter.com/en/docs/twitter-api/conversation-id
token <- paste0("token-", args[1], ".txt") # edit depending on your file and folder names
cid_file <- paste0("data/", args[2], ".txt") # edit depending on your file and folder names
library(httr)
library(tidyverse)
## API Authorization
bearer_token <- read_file(token)
headers <- c(
Authorization = sprintf("Bearer %s", bearer_token)
)
## List of Twitter API Params
params <- list(
query = NULL, # empty to fill up
max_results = "500", # range 10-100 / 500
start_time = "2008-01-01T00:00:00Z", # (YYYY-MM-DDTHH:mm:ssZ) -> RFC3339 date-time
end_time = "2020-12-31T23:59:59Z", # (YYYY-MM-DDTHH:mm:ssZ)
tweet.fields = "attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,possibly_sensitive,public_metrics,referenced_tweets,reply_settings,source,text,withheld",
expansions = "attachments.poll_ids,attachments.media_keys,author_id,geo.place_id,in_reply_to_user_id,referenced_tweets.id,entities.mentions.username,referenced_tweets.id.author_id",
media.fields = "duration_ms,height,media_key,preview_image_url,public_metrics,type,url,width",
place.fields = "contained_within,country,country_code,full_name,geo,id,name,place_type",
poll.fields = "duration_minutes,end_datetime,id,options,voting_status",
user.fields = "created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld",
next_token = NULL # https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/paginate
)
# Get Conversation by conversation_id from API (optionally add next_token for pagination)
get_conversation <- function(conversation_id, next_token = NULL, .headers = headers, .params = params) {
cat("\nQuery:", conversation_id, "| next_token:", next_token, file = log_file)
.params$query <- paste0("conversation_id:", conversation_id)
.params$next_token <- next_token
response <- httr::GET(
url = "https://api.twitter.com/2/tweets/search/all",
httr::add_headers(.headers = .headers), query = .params
)
cat(" | Status:", status_code(response), file = log_file)
return(response)
}
# Return data from API Response (based on jsonlite::fromJSON)
parse_response <- function(response) {
dat <- content(
response,
as = "parsed",
type = "application/json",
simplifyDataFrame = TRUE
# TODO: flatten??
)
return(dat)
}
# Parsing method compatible with our existing data cleaning methods
parse_json_response <- function(response) {
dat <- httr::content(response, as = "text") %>%
jsonlite::fromJSON(simplifyDataFrame = TRUE, flatten = TRUE)
return(dat)
}
### Test ------------------------------------------------------------
# dat <- list()
# dat[[1]] <- get_conversation() %>% parse_json_response()
# View(dat)
### Basic Pagination -----------------------------------------------------------
# open appendable log file, optional
log_file <- file(paste0("logs/download_", Sys.Date(), "_", args[2], ".log"), open = "a")
while (length(readLines(cid_file)) > 0) {
cat("\nWaiting 3 s because of rate limits...")
Sys.sleep(3)
conv_id <- readLines(cid_file)[1] %>% str_remove_all(" ")
cat("\nQuery conversation:", conv_id)
i <- 1
results <- list()
# first API call
results[[i]] <- get_conversation(conv_id, next_token = NULL) %>% parse_json_response()
# get next token
token <- results[[i]][["meta"]][["next_token"]]
# as long as a next_token exists, download next "page"
while (!is.null(token)) {
cat("\nGoing to next page...")
cat("\nWaiting 3 s because of rate limits...")
Sys.sleep(3) # Rate Limiting for 15 min
cat("\nQuery next_token", token, "for conversation", conv_id, ":")
i <- i + 1
results[[i]] <- get_conversation(conv_id, next_token = token) %>%
parse_json_response()
# get next token and loop
token <- results[[i]][["meta"]][["next_token"]]
}
cat("\nSaving conversation...")
saveRDS(results, paste0("conversations-data/", conv_id, ".rds"))
cat("saved!")
# Update reference file
writeLines(readLines(cid_file)[-1], cid_file)
cat("\nReference file", cid_file, "updated!\n")
}