Challenges in Social NLP Dataset Annotation



The development of natural language processing and AI applications requires gold standard datasets. Data is the pillar of these intelligent applications, and annotation is a way to acquire it. In this talk, I will first discuss the main components of WebAnno, a generic, distributed, web-based annotation tool and one of the most popular annotation tools to date. I will then discuss CodeAnno, an extension of WebAnno that supports hierarchical document-level annotation, particularly for codebook annotation in social science. In the second part of my talk, I will discuss the approaches to and challenges of annotating social NLP datasets, such as hate speech, sentiment, and fake news datasets, using crowdsourcing frameworks. I will conclude the talk by presenting ASAB, our Telegram bot-based social media annotation tool, which is built to alleviate the limitations of crowdsourcing in the annotation of low-resource languages.