Twitter wants to limit snark and get “healthy.” So far, it’s gone nowhere.

There are few things the internet loves more than a vicious Twitter dunk.

Whether it’s a celebrity ripping on President Donald Trump or a politician roasting her political rivals, Twitter is the perfect forum for the kind of wisecracks, sarcasm, and snappy one-liners that can go viral just like a LeBron James windmill slam. In many ways, Twitter was built for this. The brevity, the ease of virality, and the general snarkiness of the platform have turned dunking into, as Slate called it, a “delicious sport.”

The problem, though, is easy to spot: Sure, a Twitter dunk, even a nasty one, can be fantastic to watch. Unless, of course, you play for the other team. Publicly trolling or mocking another Twitter user isn’t conducive to good, clean, productive conversation, which is a problem for a company that’s made facilitating conversations a top priority.

So a year ago, Twitter CEO Jack Dorsey announced plans for a somewhat idealistic solution, not just for minimizing Twitter dunks but also for minimizing all kinds of angry, vile, and abusive user behaviors that don’t necessarily violate its rules: He wants to invent a new metric that measures Twitter’s health, and then optimize for it. Twitter even partnered with outside researchers to come up with new metrics for what “healthy” actually looks like.

If Twitter can identify which user interactions are healthy, the thinking goes, then maybe it can change the product to encourage more of those behaviors while discouraging more antisocial conduct.

“If you dunk on somebody and you get a lot of engagement, a lot of ‘Likes,’ a lot of retweets, that is encouraging you to be mean, basically,” said David Gasca, the Twitter product executive in charge of the company’s health efforts, in a recent interview with Recode. “We could imagine ways of changing the product in order to [discourage] that.”

“At the same time,” he continued, “you could imagine changing [the product] such that you provide positive incentives for encouraging more constructive conversation.”

It’s a nice idea, though measuring Twitter’s health is taking much longer than expected, according to exclusive interviews with both Twitter employees and company partners.

The research teams that Twitter announced last July to help it work on this project haven’t even started. One of the two teams has abandoned the project altogether. Internal metrics Twitter is building on its own are still in the “experimentation” phase and aren’t being tested in the wild.

Which means that while Twitter might not think the dunk is a particularly healthy user behavior, it’s probably not going away anytime soon.

Twitter is sick

The idea to measure Twitter’s health was planted in Jack Dorsey’s ear by Deb Roy, an MIT researcher and one-time Twitter employee.

Roy sold his TV analytics startup, Bluefin Labs, to Twitter in 2013 and quickly became Twitter’s chief media scientist, a part-time role that allowed him to teach and conduct research at MIT. It was there that Roy started the Laboratory for Social Machines, a research effort to study public conversations, and Cortico, a nonprofit that partners with that lab to promote its work outside the university. Twitter committed $10 million to help fund the lab in 2014.

Roy stayed in touch with Dorsey and routinely shared with Twitter executives the lab’s research on topics like viral rumors; he even presented other research at Twitter’s week-long, all-company retreat last summer in San Francisco.

A few months before that retreat, Roy asked Dorsey a thought-provoking question which, according to sources, spurred Dorsey’s tweetstorm last March outlining the health measurement project.


Javier Zarracina/Vox

“Recently we were asked a simple question: could we measure the ‘health’ of conversation on Twitter?” Dorsey tweeted at the time. “If you want to improve something, you have to be able to measure it.”

“Health” has been Twitter’s top buzzword for the better part of a year.

Everything the company seems to do, from cracking down on bots to building new conversation features, has been done in the name of a healthier Twitter. When the company’s user base started shrinking noticeably last year, Twitter said that its focus on health was at least partly to blame.

Measuring the health of interactions is just one part of that broader effort, but it’s one of the more challenging and confusing parts. Removing bots and spam is a technical problem. Truly understanding the health of a conversation requires things like understanding who is talking, what they’re talking about, or when someone is using sarcasm. Not all arguments, of course, are bad.

“There is great diversity in what people consider ‘healthy,’” Roy explained in an email to Recode. “Since the platforms did not at their onset define norms, it is much more difficult to retrofit norms long after the networks have grown and taken root.”

On the same day as Dorsey’s tweetstorm, Roy’s nonprofit Cortico published a blog post titled “Measuring the Health of Our Public Conversations.” The post introduced four new metrics that might be used to quantify what a healthy conversation looks like, metrics like “receptivity” and “shared reality.”

If you haven’t heard of those metrics, that’s because they don’t yet exist — at least not in their final form. And that, in a nutshell, is why the idea of measuring the health of conversations promises to be one of the most challenging aspects of Twitter’s recovery plan. There’s no widely adopted way to quantify the health of human interaction, especially at the internet’s scale.

Carolyn Penstein Rose, a professor in the computer science department at Carnegie Mellon, has spent the past decade studying conversations and the technological systems that can be used to improve them. Rose’s area of focus has mostly been studying how human interactions impact learning, but she believes the problems Twitter faces are related.

“Let’s not make this a machine learning problem — it’s a language problem,” Rose said in an interview with Recode. “My advice to Twitter would be, if you want this to be done right, get people who know language, not necessarily [just] machine learning people.”

Inventing a new metric

Dorsey seems to be taking that advice. He was so serious about measuring the health of conversations on Twitter that, a year ago, the company asked researchers to submit proposals for how it could actually do that.

More than 230 proposals were submitted, and last July, Twitter announced that two research groups had been selected as official company partners — one from Leiden University in the Netherlands and one from Oxford University in England. They would receive access to user data and a monetary grant from Twitter; in exchange, these researchers would create new metrics intended to measure the health of interactions on the service.


But 12 months after Dorsey first tweeted out the plan — and roughly eight months since Twitter introduced those academic partners — Twitter hasn’t unveiled or implemented any new metrics. Neither have its research partners.

In fact, the outside research hasn’t even started.

Lawyers for Twitter and Leiden haven’t been able to solidify the data-sharing and privacy details for the partnership, which means the researchers are simply waiting, according to interviews with Twitter and the company’s partners. The other group at Oxford ran into similar legal obstacles; it abandoned the project altogether.

“It’s proven a lot more difficult than we anticipated,” Gasca admitted. To study user engagements, Gasca said, researchers need sensitive information, like data about who people block or who they report to the company. “In the wake of Cambridge Analytica and all these other issues, we have to be very careful about what we share and how we share it.

“It’s not as simple as handing over some data set.”

Despite the longer-than-expected delay in finalizing the contract, the research group from Leiden is still committed to Twitter’s proposed mission, according to Rebekah Tromble, a political science professor from Leiden University and the lead researcher for the team with which Twitter is working.

The plan, whenever the lawyers get on the same page, is that Leiden’s research team will create four new metrics over a two-year research project. These are different metrics from the four Cortico proposed in its March 2018 blog post.

Here are the metrics Tromble wants to create:

Mutual recognition: Are people engaging with others who have different beliefs? Or are they simply talking with those who agree with them?

Diversity of perspectives: Are some people not being heard in a conversation or being excluded altogether?

Incivility: This is “counter-normative” conversation, which could include things like insults or curse words, but isn’t necessarily bad or unhealthy.

Intolerance: Intolerance is unhealthy. Are users attacking or critiquing other groups that might be protected?
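None of these metrics has been defined yet, so any formula is guesswork. Purely as an illustration of what the first question could look like once turned into a number, here is a minimal sketch that assumes each user has already been assigned to a belief group. The function name `cross_group_share` and the `group_of` mapping are hypothetical and do not come from Tromble's team.

```python
def cross_group_share(replies, group_of):
    """Fraction of replies whose author and target belong to different groups.

    `replies` is a list of (author, target) pairs. `group_of` maps a user to
    an assumed belief-group label. Both inputs are hypothetical stand-ins.
    """
    # Keep only replies where both users have a known group label.
    labeled = [
        (group_of[author], group_of[target])
        for author, target in replies
        if author in group_of and target in group_of
    ]
    if not labeled:
        return 0.0
    # Count replies that cross group lines and divide by all labeled replies.
    return sum(a != b for a, b in labeled) / len(labeled)


example_groups = {"ann": "x", "bob": "y", "cat": "x"}
example_replies = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
share = cross_group_share(example_replies, example_groups)
```

A share near zero would suggest people are mostly talking to those who agree with them. The hard part, which this sketch skips entirely, is deciding who belongs to which group in the first place.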

To create these metrics, the Leiden research group plans to study conversations in the US and UK rooted in two hot-button issues. The first, immigration, is a wellspring of fiery discussion — full of hot takes, deeply felt beliefs, and troublesome language.

“The second topic that we’re looking at is daylight savings time,” Tromble said. “I always get a bit of a laugh when I tell people.”

The thinking is that daylight savings time still creates a lot of good discussion but isn’t as politically charged as other issues of the day. “The views that people hold about this don’t necessarily map along traditional left-right political lines,” she added.

None of these questions or topics would be easy to understand in a vacuum, and Twitter’s product and policies provide extra challenges. Twitter conversations are predominantly public, but users don’t need to use their real identities. Some interactions involve people with offline personal relationships of which Twitter wouldn’t likely be aware.

There’s also Twitter’s brevity: Tweets are limited to 280 characters, which can limit space for a more nuanced discussion. And Twitter’s algorithms also promote tweets with lots of likes or retweets. As we’ve learned from Twitter dunks, those aren’t always a signal that something is healthy.

“Social dynamics, things like conversations, are messy,” Tromble explained. “Simply putting a number on them is never going to get at all of the nuance and all of the complexity.”

Actually changing behavior

None of that research can happen, though, unless Twitter and Leiden can figure out the privacy elements of the data sharing. In the meantime, Twitter isn’t waiting on the outside researchers. Twitter employees are building their own internal metrics to measure user interactions with the help of some user volunteers.

Right now, Twitter is testing two metrics. The first is used to measure the health of single tweets — what Gasca calls a “toxicity” metric — and is based on machine learning algorithms created by Google that the search giant has made public for other companies to use.

The second metric doesn’t yet have a name, though Gasca called it “healthy.” The metric is meant to measure conversational health and takes into account three factors: civility, receptivity, and constructivity.

Twitter says it’s in the “experimentation” phase for this metric, which means the company is still gathering data. That process looks like this: Twitter takes real conversation threads, and asks real Twitter users to review those conversations and rate them for each of the three factors: Was it civil? Was it constructive? Were participants receptive to the ideas and inputs of others?

A Twitter spokesperson says the humans reviewing and rating these conversations are paid, and make up a “diverse group of people who use Twitter at least once a month.” The company didn’t elaborate, but that diversity of perspectives is key, says Rose. “It’s important that the process they engage in to build whatever [algorithms] they are going to deploy is done in an iterative way that engages users who are diverse,” she said. Twitter is a global platform, and your race, gender, and sexual orientation could all factor into how you determine a particular conversation’s “health.”

The ratings are then plugged into a computer and used to train a software algorithm that will eventually understand what healthy conversations look like.
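Twitter hasn't said how reviewer ratings become training data. As a rough sketch of one common approach, the snippet below averages each factor across reviewers and then labels a conversation against a cutoff. The 1-to-5 rating scale, the 3.5 threshold, and the function names are all assumptions made for illustration, not details Twitter has disclosed.

```python
from statistics import mean

# The three factors named by Twitter; everything else here is assumed.
FACTORS = ("civility", "receptivity", "constructivity")


def aggregate_ratings(ratings):
    """Average each factor across reviewers for every conversation.

    `ratings` is a list of (conversation_id, {factor: score}) pairs, where
    each score is a hypothetical 1-5 rating from a single reviewer.
    """
    by_conversation = {}
    for conversation_id, scores in ratings:
        by_conversation.setdefault(conversation_id, []).append(scores)
    return {
        conversation_id: {f: mean(s[f] for s in reviewer_scores) for f in FACTORS}
        for conversation_id, reviewer_scores in by_conversation.items()
    }


def health_label(factor_means, threshold=3.5):
    """Return True if the mean of the three factors clears an assumed cutoff."""
    return mean(factor_means[f] for f in FACTORS) >= threshold


example_ratings = [
    ("thread-a", {"civility": 5, "receptivity": 4, "constructivity": 4}),
    ("thread-a", {"civility": 3, "receptivity": 4, "constructivity": 4}),
    ("thread-b", {"civility": 1, "receptivity": 2, "constructivity": 2}),
]
averaged = aggregate_ratings(example_ratings)
labels = {cid: health_label(means) for cid, means in averaged.items()}
```

Labels like these could then serve as training targets for a text classifier. Averaging hides disagreement between reviewers, which matters if, as Rose argues, different groups judge the same conversation differently.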

If Twitter does indeed make it that far, then the product changes will start. Gasca believes some of Twitter’s work around measuring health could make it into the product as early as next quarter. Twitter executives like to use the word “incentives” — as in, how can Twitter incentivize people to behave a certain way?

“When you’re about to engage in a conversation, how do we encourage you to choose the healthier response versus the more toxic response?” explained Kathie Pham, a staff researcher at Twitter.

It’s clear that Twitter doesn’t yet know what those product changes will look like. But there are ideas. Dorsey has spoken publicly about removing certain incentives that already exist. For example, he said in November that public follower counts can be problematic. (Twitter co-founder Ev Williams said the same thing.)

There are also algorithms the company uses to show people tweets. If people send “healthier responses” to other users, Twitter’s algorithms could prioritize those responses over unhealthy ones.
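To make that idea concrete, here is a minimal sketch of a ranking that blends a health score with engagement. It is not Twitter's algorithm: the `health` field (assumed to be a 0-to-1 score from some model), the `engagement` count, and the 0.7 weighting are all hypothetical.

```python
def rank_replies(replies, health_weight=0.7):
    """Sort replies by a blend of an assumed health score and engagement.

    Each reply is a dict with a `health` score between 0 and 1 and a raw
    `engagement` count (likes plus retweets, say). Both fields are
    hypothetical, as is the default weighting.
    """
    # Scale engagement to 0-1 so it is comparable with the health score.
    max_engagement = max((r["engagement"] for r in replies), default=0) or 1

    def score(reply):
        engagement = reply["engagement"] / max_engagement
        return health_weight * reply["health"] + (1 - health_weight) * engagement

    return sorted(replies, key=score, reverse=True)


example = [
    {"id": "dunk", "health": 0.1, "engagement": 1000},
    {"id": "thoughtful", "health": 0.9, "engagement": 100},
]
health_first = [r["id"] for r in rank_replies(example)]
engagement_only = [r["id"] for r in rank_replies(example, health_weight=0.0)]
```

With the weight set to zero, the ranking falls back to pure engagement and the dunk wins. Raising the weight pushes the less popular but healthier reply to the top, which is the trade-off a change like this would force Twitter to tune.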

And then there are social rewards such as likes and retweets (in other words, incentives for dunking) that Twitter is also thinking about.

“Every product by just being a product incentivizes certain behaviors. A lot of times they’re not deliberately chosen,” Gasca said. “If you modify the product, you can modify behaviors. The challenge then is what do you actually want to modify it to? That’s where it’s tricky.”

In her years studying conversation, Rose has examined Reddit and GitHub and message boards to see how conversations happen online. Twitter, though, hasn’t been on her list.

“Twitter contributions tend to be very short, they tend to be a little bit cryptic. They are not very natural interactions conversationally,” she explained. “It’s really more monologue-y. People make statements. And as people consume those statements, they may reblog them, but there is not a lot of extended conversational interaction.”

Twitter’s 280-character limit, a hallmark of the service, poses another problem.

Brevity is the soul of snark

Shiri Melumad studies mobile consumer behavior as a marketing professor at the Wharton School at the University of Pennsylvania. Melumad published a study in January exploring the effect space constraints have on what people share. “People tend to be more emotional if they’re pressed to write less,” she explained, which of course leads to more opinionated and controversial tweets.


Couple that tendency with this: In a separate study still under review about how news stories are passed around the internet like a giant game of telephone, Melumad and her colleagues found that as stories get further from their initial source, people know fewer and fewer details about what actually happened. So they offer up their opinions instead.

“In the face of fewer details, people seem to be writing summaries that are increasingly opinionated, and they’re increasingly negatively opinionated,” she said of her findings. “They have this sort of desire to fill in this void with something, and they’re filling it in with something that they do know, which is their opinions about the information that’s presented to them.”

In other words, Twitter’s brevity lends itself to emotional tweets, while its virality breeds legions of opinionated, less-informed tweets.

That’s a tough environment for creating a healthy dialogue.

Twitter likely knows this already. It’s been 18 months since the company expanded the length of tweets to 280 characters from 140 characters. Like its healthy conversations measurement plan, Twitter says it extended tweets to change user incentives, though the incentive in this case was business-related: Twitter wanted people to tweet more often.

The good news is that 280 characters didn’t break Twitter. It’s an example of a change that, by both company measures and public perception, seems to have worked. It’s proof that changing the Twitter product can indeed change user incentives.

The bad news? Even 280-character tweets took Twitter years to roll out. Who knows how long it will take Twitter to create a new health metric for something as nuanced and complicated as human interaction. “It’s still very early days and exploratory,” Pham cautioned.

Until then, enjoy the dunks.