handling twitter entities

this article was written by: andrew

they started writing it: Oct 02, 2022

it was last updated: Oct 02, 2022

I had a lot of fun working on this! Since I did this work, Twitter was taken over by a person whose values and goals for the platform are deeply incompatible with my own and so you can now find me on Mastodon!


The goal for the day: build out a tweet view with clickable links using Twitter API data.

Getting started

There are a bunch of ways to pull down tweets. For my purposes, I was trying to grab bookmarks, but this works for any API response that returns the tweet object. Rather than work with the raw API, I opted for a community client: twitter-api-v2, a TypeScript library for Node.js.

Fetching tweets via the Twitter API gives you a bunch of things back, including an initially confusing entities field. According to the documentation, it's:

Entities which have been parsed out of the text of the Tweet. Additionally see entities in Twitter Objects. Entities are JSON objects that provide additional information about hashtags, urls, user mentions, and cashtags associated with a Tweet. Reference each respective entity for further details. Please note that all start indices are inclusive. The majority of end indices are exclusive, except for entities.annotations.end, which is currently inclusive. We will be changing this to exclusive with our v3 bump since it is a breaking change.

An example payload sheds some additional light:

"entities": {
    "annotations": [
        {
            "start": 144,
            "end": 150,
            "probability": 0.626,
            "type": "Product",
            "normalized_text": "Twitter"
        }
    ],
    "cashtags": [
        {
            "start": 18,
            "end": 23,
            "tag": "twtr"
        }
    ],
    "hashtags": [
        {
            "start": 0,
            "end": 17,
            "tag": "blacklivesmatter"
        }
    ],
    "mentions": [
        {
            "start": 24,
            "end": 35,
            "tag": "TwitterDev"
        }
    ],
    "urls": [
        {
            "start": 44,
            "end": 67,
            "url": "https://t.co/crkYRdjUB0",
            "expanded_url": "https://twitter.com",
            "display_url": "twitter.com",
            "status": "200",
            "title": "bird",
            "description": "From breaking news and entertainment...",
            "unwound_url": "https://twitter.com"
        }
    ]
}
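
The inclusive/exclusive caveat in the docs is easy to miss, so here is a minimal sketch of normalizing every entity to an exclusive end before slicing. The names (IndexedEntity, EntityKind, toExclusiveEnd, entitySlice) are made up for illustration and are not part of the Twitter API or any library, and the sample text is invented so the indices are easy to check by hand. It assumes ASCII-only text, which sidesteps the gotcha described next.

```typescript
// Hypothetical shape for illustration; only start/end are taken from the payload above.
interface IndexedEntity {
  start: number;
  end: number;
}

type EntityKind = "annotations" | "cashtags" | "hashtags" | "mentions" | "urls";

// Per the docs quoted above: annotation ends are inclusive, all other ends are exclusive.
function toExclusiveEnd(kind: EntityKind, entity: IndexedEntity): number {
  return kind === "annotations" ? entity.end + 1 : entity.end;
}

// Slice an entity's text out of a tweet (assumes ASCII-only text).
function entitySlice(text: string, kind: EntityKind, entity: IndexedEntity): string {
  return text.slice(entity.start, toExclusiveEnd(kind, entity));
}

// Invented sample text whose first three entities line up with the payload above.
const sampleText = "#blacklivesmatter $twtr @TwitterDev loves Twitter";

const hashtagText = entitySlice(sampleText, "hashtags", { start: 0, end: 17 });
const cashtagText = entitySlice(sampleText, "cashtags", { start: 18, end: 23 });
const mentionText = entitySlice(sampleText, "mentions", { start: 24, end: 35 });
// Inclusive end: "Twitter" occupies indices 42 through 48 in sampleText.
const annotationText = entitySlice(sampleText, "annotations", { start: 42, end: 48 });
```

Normalizing up front means the rest of the code can treat every entity the same way, whatever the docs end up doing with annotations later.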

Twitter has processed pretty much everything we need, BUT there's one gotcha. The start and end indices are based on a more raw version of the characters, so they don't always line up with the text string as your code sees it. Certain escaped characters, like & (which arrives HTML-escaped as &amp;), and a good chunk of emojis end up throwing the count off if you index directly into the raw text from the API response.
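
To make the emoji half of that mismatch concrete, here is a small sketch. It assumes the entity indices count Unicode code points, while JavaScript strings are indexed in UTF-16 code units (so most emoji take up two positions). The helper codePointToUtf16Index and the sample tweet are hypothetical, written only for this example.

```typescript
// Hypothetical helper: convert a code point index into a UTF-16 code unit index.
function codePointToUtf16Index(text: string, codePointIndex: number): number {
  let utf16Index = 0;
  for (let seen = 0; seen < codePointIndex && utf16Index < text.length; seen++) {
    const unit = text.charCodeAt(utf16Index);
    const next = text.charCodeAt(utf16Index + 1); // NaN when past the end
    // A high surrogate followed by a low surrogate is a single code point.
    const isPair =
      unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
    utf16Index += isPair ? 2 : 1;
  }
  return utf16Index;
}

// Invented tweet: the emoji is one code point but two UTF-16 code units.
const emojiTweet = "🎉 #launch";
// Hashtag indices as they would look when counted in code points.
const hashtagEntity = { start: 2, end: 9 };

// Indexing the string directly is off by one after the emoji.
const naiveSlice = emojiTweet.slice(hashtagEntity.start, hashtagEntity.end); // " #launc"

// Converting the indices first gives the expected text.
const fixedSlice = emojiTweet.slice(
  codePointToUtf16Index(emojiTweet, hashtagEntity.start),
  codePointToUtf16Index(emojiTweet, hashtagEntity.end)
); // "#launch"
```

Every emoji before an entity shifts it by one more position, so the error grows with each one.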

Digging into tweet content parsing

There's a whole writeup in the docs on what counts as a character, which makes sense for a company that prides itself on being only 140 (now 280) characters per tweet.

The Good News™️: Twitter has a library, twitter-text, devoted to processing text in a tweet. The not-so-good news: it hasn't been updated in a while and is relatively under-documented. I suspected it was being sunset in favor of their new API SDK, but that SDK is marked as "not ready for production" and, more importantly, seems to be just a thin OpenAPI wrapper around the Twitter API that does not provide any text parsing.

Auto-parsing with twitter-text

Step one was to try auto-parsing using twitter-text's autoLink. I didn't love the options provided. For one, I would expect a urlClass option that mirrors the class options for its sibling entity types. It also seemed to be missing a hashtagIncludeSymbol option, which would mirror usernameIncludeSymbol (the option that controls whether the @ is included in the link when replacing the text).

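// Example using autoLink
import TwitterText from "twitter-text";

// parseTwitterApiResponse() is a placeholder for however you fetch the tweet;
// renderTweetHtml is just an illustrative wrapper name.
function renderTweetHtml(): string {
  const { text, entities } = parseTwitterApiResponse();
  return TwitterText.autoLink(text, {
    ...entities,
    targetBlank: true,
    hashtagClass: "text-red-700",
    cashtagClass: "text-red-700",
    usernameClass: "text-red-700",
    // I would expect a urlClass to mirror the entity types
    usernameIncludeSymbol: true,
    // No hashtagIncludeSymbol??
  });
}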

Lastly, on my first go-through, it wasn't apparent that I could mix in the entities I had pulled in through the API. Most notably, I wanted to link directly to full URLs instead of the default t.co URLs that the tweet text field starts with.

In the spirit of "get something working", I ended up iterating through the entities, updating their indices using twitter-text's modifyIndicesFromUnicodeToUTF16, and then doing the replacement myself. An added benefit was that I could generate valid JSX instead of dangerously setting inner HTML, which is necessary when going the autoLink route (although it wouldn't really be the end of the world to do so).
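
A rough sketch of that kind of manual replacement, under some loud assumptions: rather than calling modifyIndicesFromUnicodeToUTF16, it swaps in a different technique and splits the text into code points so it stays self-contained, and it returns plain segment objects rather than JSX. The names (LinkableEntity, Segment, toCodePoints, splitIntoSegments), the sample tweet, and the profile URL format are all hypothetical.

```typescript
// Hypothetical shapes for illustration, loosely following the payload shown earlier.
interface LinkableEntity {
  start: number; // inclusive, counted in code points
  end: number; // exclusive, counted in code points
  href: string; // where the rendered link should point
  label?: string; // optional replacement text, e.g. a URL's display_url
}

type Segment =
  | { kind: "text"; text: string }
  | { kind: "link"; text: string; href: string };

// Split into code points so an emoji counts as one character.
function toCodePoints(text: string): string[] {
  return text.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) ?? [];
}

function splitIntoSegments(text: string, entities: LinkableEntity[]): Segment[] {
  const chars = toCodePoints(text);
  const sorted = entities.slice().sort((a, b) => a.start - b.start);
  const segments: Segment[] = [];
  let cursor = 0;
  for (const entity of sorted) {
    if (entity.start < cursor) continue; // skip entities that overlap a previous one
    if (entity.start > cursor) {
      segments.push({ kind: "text", text: chars.slice(cursor, entity.start).join("") });
    }
    const original = chars.slice(entity.start, entity.end).join("");
    segments.push({ kind: "link", text: entity.label ?? original, href: entity.href });
    cursor = entity.end;
  }
  if (cursor < chars.length) {
    segments.push({ kind: "text", text: chars.slice(cursor).join("") });
  }
  return segments;
}

// Invented tweet text with a t.co link and a mention.
const demoText = "🎉 Read https://t.co/crkYRdjUB0 via @TwitterDev";
const demoSegments = splitIntoSegments(demoText, [
  // Mention first on purpose, to show that entities get sorted by start index.
  { start: 35, end: 46, href: "https://twitter.com/TwitterDev" },
  // Link to expanded_url and show display_url instead of the t.co URL.
  { start: 7, end: 30, href: "https://twitter.com", label: "twitter.com" },
]);
```

Each segment can then be mapped to either a plain string or an anchor element in JSX, so nothing has to go through dangerouslySetInnerHTML.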

After sleeping on it and revisiting, it appears it should be possible to address all of my aforementioned concerns, minus the hashtag one. However, I'd want to take the time to also update the types with some documentation, since it tripped me up and I want to make sure I understand what this lawless untyped JavaScript code is doing 😉.