Statistical Models for the Analysis of Short User-Generated Documents - Author Identification for Conversational Documents

Dean's Office - Faculty of Informatics

Start date: 29 January 2014

End date: 30 January 2014

You are cordially invited to attend the PhD Dissertation Defense of Giacomo INCHES on Wednesday, January 29th, 2014, at 09h30 in room A32 (Red building).

In recent years, short user-generated documents have been gaining popularity on the Internet and attention in the research communities. Documents of this kind are generated by users of various online services: platforms for instant messaging, for real-time status posting, for discussion, and for writing reviews. Each of these services allows users to generate written texts with particular properties, which might require specific algorithms to analyse.

This dissertation aims at analysing documents of this kind. We conducted qualitative and quantitative studies to identify the properties that might allow for characterising them. We compared the properties of these documents with those of standard documents employed in the literature, such as newspaper articles, and defined a set of characteristics that are distinctive of documents generated online. We also observed two classes within online user-generated documents: conversational documents and those involving group discussions.

We then focused on the class of conversational documents, which are short and spontaneous. We created a novel collection of real conversational documents retrieved online (e.g. from Internet Relay Chat) and distributed it as part of an international competition. The competition was about author characterisation, which is one of the possible studies of authorship attribution documented in the literature. Another field of study is authorship identification, which became our main topic of research. We approached the authorship identification problem for conversational documents in both of its variants: the closed-class and the open-class problem. For each problem we employed documents from the collection we released and from a collection of Twitter messages, as representative of conversational or short user-generated documents. We demonstrated the unsuitability of standard authorship identification techniques for conversational documents and proposed a novel method capable of reaching high accuracy rates. As opposed to standard methods, which worked well only for a few authors, the proposed technique reached accuracy beyond 90% for hundreds of users.

Dissertation Committee:

  • Prof. Fabio Crestani, Università della Svizzera italiana, Switzerland (Research Advisor)
  • Prof. Michael Bronstein, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Mehdi Jazayeri, Università della Svizzera italiana, Switzerland (Internal Member)
  • Prof. Fazli Can, Bilkent University, Turkey (External Member)
  • Prof. Douglas W. Oard, University of Maryland, USA (External Member)