Tutorial 2

Review of the most popular Topic Modelling techniques

Date: TBD
Time: TBD

Topic modeling is used in information retrieval to infer the hidden themes in a collection of documents, and thus provides an automatic means to organize, understand and summarize large collections of textual information. Topic models also offer an interpretable representation of documents that is used in several downstream Natural Language Processing (NLP) tasks. Topic modeling is applied in fields ranging from bioinformatics to economics and the social sciences: much like clustering algorithms that partition data, it detects latent patterns in the documents.

Several algorithms and models are available to extract topics from texts (large corpora); they belong to different families:

  1. Probabilistic generative models, such as Latent Dirichlet Allocation (LDA), the Correlated Topic Model and Probabilistic Latent Semantic Analysis (PLSA)
  2. Linear-algebra models based on matrix decompositions, such as Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF)
  3. Clustering algorithms on word-embedding spaces, such as Top2Vec and BERTopic.
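
As a taste of the first two families, here is a minimal sketch that fits LDA and NMF on a toy corpus with scikit-learn; the corpus, the number of topics and the other parameter values are purely illustrative and are not the tutorial's own material. The embedding-based family (Top2Vec, BERTopic) is previewed in a separate sketch after the outline below.

```python
# Minimal sketch: fitting LDA (probabilistic) and NMF (matrix factorization)
# on a toy corpus with scikit-learn. Corpus and parameter values are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "the economy and inflation worry central banks",
    "gene expression analysis in bioinformatics",
    "social media posts discuss election campaigns",
    "protein sequences and genome annotation",
]

# LDA is fitted on raw term counts
counts = CountVectorizer(stop_words="english")
X_counts = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_counts)

# NMF is usually fitted on TF-IDF weights
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0).fit(X_tfidf)

def top_words(model, feature_names, n=5):
    # each row of components_ is a topic; show its highest-weighted terms
    return [[feature_names[i] for i in topic.argsort()[-n:][::-1]]
            for topic in model.components_]

print("LDA topics:", top_words(lda, counts.get_feature_names_out()))
print("NMF topics:", top_words(nmf, tfidf.get_feature_names_out()))
```

Note the different input representations: LDA is a generative model over term counts, whereas NMF factorizes a (typically TF-IDF weighted) document-term matrix.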

Some of these families are better suited to certain tasks and kinds of data, while others are preferable in other settings. The models differ in simplicity, computational efficiency and modeling assumptions, and accordingly in how they perform on different corpora and in different applications. There is little consensus on which aspects of a topic model should be evaluated, and several different methods have been proposed to evaluate each aspect.

The tutorial will illustrate the pros and cons of the above-mentioned methods (for instance, sensitivity to outliers and whether text cleaning is necessary). In addition, a few evaluation criteria will be discussed, such as quality (coherence and perplexity measures), interpretability, stability and topic diversity.
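
As an illustration of two of these criteria, the following sketch computes a c_v coherence score with gensim's CoherenceModel and a simple topic-diversity ratio (the share of unique words among the topics' top words); the tokenized corpus and the topic word lists are illustrative placeholders standing in for the output of a fitted model.

```python
# Minimal sketch of two evaluation measures, assuming gensim is installed and
# `topics` holds each topic's top words (e.g. from a fitted topic model).
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

tokenized_docs = [
    ["economy", "inflation", "central", "banks"],
    ["gene", "expression", "bioinformatics"],
    ["social", "media", "election", "campaigns"],
    ["protein", "genome", "annotation"],
]
topics = [
    ["economy", "inflation", "banks"],
    ["gene", "expression", "bioinformatics"],
]

dictionary = Dictionary(tokenized_docs)

# Coherence (c_v): do a topic's top words tend to co-occur in the corpus?
coherence = CoherenceModel(topics=topics, texts=tokenized_docs,
                           dictionary=dictionary, coherence="c_v").get_coherence()

# Topic diversity: share of unique words among all topics' top words
all_words = [w for t in topics for w in t]
diversity = len(set(all_words)) / len(all_words)

print(f"coherence (c_v) = {coherence:.3f}, topic diversity = {diversity:.2f}")
```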

Outline

  1. Topic modelling (context and basics)
  2. Latent Dirichlet Allocation
    • Theory (15 minutes)
    • Hands on (Python lab)
  3. Non-negative matrix factorization
    • Theory
    • Hands on (Python lab)
  4. Clustering algorithms on Embedding Spaces (Top2Vec, BERTopic)
    • Theory [Word Embeddings]
    • Theory [Top2Vec, BERTopic]
    • Hands on (Python lab; see the sketch after this outline)
  5. Use cases
    • Topic modelling in Official Statistics
    • Model evaluation metrics
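
As a preview of the embedding-based hands-on session, the sketch below runs BERTopic with its default pipeline (sentence-embedding model, UMAP dimensionality reduction, HDBSCAN clustering); it assumes the bertopic package is installed, and it uses the 20 Newsgroups corpus only because clustering-based models need a reasonably large document collection, not because it is the dataset of the lab.

```python
# Minimal sketch of the embedding-based approach with BERTopic.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all",
                          remove=("headers", "footers", "quotes")).data

topic_model = BERTopic(verbose=True)        # default embeddings + UMAP + HDBSCAN
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())  # one row per topic; -1 is the outlier topic
print(topic_model.get_topic(0))             # top words and weights of topic 0
```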

Target Audience

Undergraduate and post-graduate researchers and practitioners working with textual data. No prior knowledge is required. Basic knowledge of Python will help in following the hands-on sessions.

The trainers will provide a GitHub repository containing notebooks and datasets. A laptop able to run Python code locally, or a Google Colab account, will be useful.

Presenters

Mauro Bruno and Elena Catanese are researchers at Istat, working in the Methodological Directorate. Their main research activity focuses on Big Data for Official Statistics and NLP techniques applied to social media data.