Abstract:
A collection of 300 support tickets manually labeled for semantic similarity, obtained from an IT support company in the Florianópolis (Brazil) region. Each ticket is represented by an unstructured text field typed by the user who opened it. The labeling was performed in 2022 by three IT support professionals. The corpus contains tickets in multiple languages, mainly English, German, Portuguese, and Spanish.
All Personally Identifiable Information (PII) and sensitive information was removed and replaced by a tag indicating the original content; for instance, the sentence "this text was written by Leonardo" is converted to "this text was written by [NAME]". The removal was performed in three steps: first, the automated, machine-learning-based tool AWS Comprehend PII Removal was applied; then, a sequence of custom regular expressions was applied; finally, the entire corpus was manually verified.
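The regular-expression step described above can be illustrated with a minimal sketch. The patterns, the tag names other than [NAME], and the function `mask_pii` are assumptions made for illustration only; the actual expressions used to build this corpus are not given in the abstract.

```python
import re

# Hypothetical patterns for illustration; these are NOT the expressions
# used by the dataset authors. Order matters: the IP pattern runs before
# the phone pattern, which would otherwise also match dotted digit runs.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def mask_pii(text: str) -> str:
    """Replace each pattern match with a tag naming the removed content."""
    for tag, pattern in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text


if __name__ == "__main__":
    print(mask_pii("contact leo@example.com from 10.0.0.12"))
```

Free-form content such as personal names cannot be caught reliably by patterns like these, which is presumably why the described pipeline also relies on a machine-learning-based tool and a final manual check.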