
Conversations Gone Awry Corpus v1.00 (released June 2018)

Distributed together with:

"Conversations gone awry: Detecting early signs of conversational failure"
Justine Zhang, Jonathan P. Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon, Yiqing Hua, Nithum Thain, Dario Taraborelli
ACL 2018

NOTE: If you have results to report on this corpus, please send an email to cristian@cs.cornell.edu so we can add you to our list of people using this data.  Thanks!


Contents of this README:

	A) Brief description
	B) Files description
	C) Contact

A) Brief description:

This corpus contains a collection of paired conversations from Wikipedia editors' talk Pages (http://en.wikipedia.org/wiki/Wikipedia:Talk_page_guidelines) with metadata and human annotation info:

- 635 pairs (1,270 conversations) with a total of 6,963 comments
- involving a total of 2,146 editors
- taking place on 576 talk pages (and their archives)

metadata includes:
	- structure of the conversations
    - talk page titles and unique page IDs

human annotation info includes:
    - whether a comment contains a personal attack
    - whether the conversation a comment belongs to contains a personal attack


This data was collected from late 2017 to early 2018 and was annotated in April 2018.

The dataset can be downloaded using the convokit.download() helper function: convokit.download("conversations-gone-awry-corpus"). Alternatively you can access it directly here: http://zissou.infosci.cornell.edu/socialkit/datasets/conversations-gone-awry-corpus/paired_conversations.json.

B) Files description

<===>  paired_conversations.json

This file contains 6,963 comments in convokit JSON format. Each comment is a JSON object, and the whole file is a list of comment objects. This is a well-formatted JSON array, not a JSONLIST, so it can be directly parsed with an off-the-shelf JSON parser.

Each comment object contains the following standard convokit fields:
- id: the unique ID of the comment
- root: the unique ID of the conversation this comment belongs to (see Notes for how comment ID and conversation ID are related)
- text: the text content of the comment
- timestamp: the unix timestamp (in seconds) of when the comment was posted
- user: the name of the user who posted the comment

In addition, each object also includes a custom field "awry_info" which points to a nested JSON object containing comment metadata and annotation info. The awry_info object contains the following fields:
- page_title: the title of the talk page the comment came from
- page_id: the unique numerical ID of the talk page the comment came from
- pair_id: the conversation ID (root) of the conversation that this comment's conversation is paired with
- comment_has_personal_attack: whether this comment was judged by 3 crowdsourced workers to contain a personal attack
- conversation_has_personal_attack: whether any comment in this comment's conversation contains a personal attack

Notes:

Regarding conversation layout: convokit expects the root of a comment to be a pointer to another comment - this is how it reconstructs conversations. However, this property is not strictly true in Wikipedia talk page conversations. 
Specifically, the algorithm we used to reconstruct Wikipedia conversations assigns IDs to whole conversations, where conversation ID is determined by the ID of a special comment known as the section header. The section header corresponds to the conversations "title" or "subject" as seen on the original talk page. 
In order to preserve conversation structure while remaining consistent with convokit, we include the section headers in the data, so that each root points to a valid object. 
However, section headers are not really utterances and should thus be ignored when doing any NLP tasks. To help with this, we include an addition field "is_section_header" which is set to true for section header "comments".

C) Contact:

Please email any questions to: cristian@cs.cornell.edu (Cristian Danescu-Niculescu-Mizil)



