Using Language Models to Enhance K-Means Clustering

A new method for dealing with tabular data

Nov 29, 2023

Today, I discovered a GitHub repository in which the author uses a large language model to enhance K-Means clustering. I have not tested this method myself, but the results shared by the author look promising. The concept is quite interesting, so I wrote it down.

The case study is a customer segmentation problem, where each customer is described by the following features:

1. Age (numeric)

2. Job: Type of job (categorical)

3. Marital: Marital status (categorical)

4. Education: Education level (categorical)

5. Default: Presence of credit in default (binary)

6. Balance: Average yearly balance in euros (numeric)

7. Housing: Possession of a housing loan (binary)

8. Loan: Possession of a personal loan (binary)

Traditionally, one might apply K-Means clustering directly to this data (after encoding the categorical features numerically), which gives moderately good results. The novel approach instead puts a language model in front of the clustering step. As I understand the code, a descriptive string is generated for each customer, such as "age: 25, job: teacher, ..., loan: yes." A language model then transforms this string into a vector (a sentence embedding), and K-Means clustering is performed on these embeddings, which, according to the author, gives better clusters.
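To make the pipeline concrete, here is a minimal sketch of my reading of it, using only the Python standard library. All names (`row_to_text`, `toy_embed`, `kmeans`, `cluster_customers`) and the sample records are hypothetical and not taken from the repository. In particular, `toy_embed` is only a placeholder that hashes tokens into a small vector so the example runs anywhere; it is not a language model. In practice, one would pass the encode function of a real sentence-embedding model as `embed` and most likely use a library K-Means implementation.

```python
import hashlib
import math
import random
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Vector = List[float]


def row_to_text(row: Row) -> str:
    """Serialize one customer record as 'age: 25, job: teacher, ...'."""
    return ", ".join(f"{key}: {value}" for key, value in row.items())


def toy_embed(text: str, dim: int = 16) -> Vector:
    """Placeholder embedder (NOT a language model).

    Hashes each token into one of `dim` buckets and L2-normalizes the counts.
    Swap this for a real sentence-embedding model to reproduce the idea.
    """
    vec = [0.0] * dim
    for token in text.lower().replace(",", " ").split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def kmeans(points: Sequence[Vector], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_customers(rows: Sequence[Row], embed: Callable[[str], Vector], k: int) -> List[int]:
    """Text serialization -> embedding -> K-Means on the embeddings."""
    texts = [row_to_text(row) for row in rows]
    vectors = [embed(text) for text in texts]
    return kmeans(vectors, k)


# Made-up records, for illustration only.
customers = [
    {"age": 25, "job": "teacher", "marital": "single", "education": "tertiary",
     "default": "no", "balance": 1200, "housing": "no", "loan": "yes"},
    {"age": 27, "job": "teacher", "marital": "single", "education": "tertiary",
     "default": "no", "balance": 900, "housing": "no", "loan": "yes"},
    {"age": 58, "job": "retired", "marital": "married", "education": "primary",
     "default": "no", "balance": 5000, "housing": "yes", "loan": "no"},
    {"age": 61, "job": "retired", "marital": "married", "education": "primary",
     "default": "no", "balance": 4300, "housing": "yes", "loan": "no"},
]

labels = cluster_customers(customers, toy_embed, k=2)
print(labels)
```

The only part specific to this method is `cluster_customers`: the embedder is passed in as a function, so the same code works with any model that maps a string to a fixed-length vector.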

Image from https://github.com/damiangilgonzalez1995/Clustering-with-LLM

This method is fascinating because it appears to leverage the human-centric knowledge encoded in the language model, which may help produce more nuanced clusters. However, the technique may not apply to all types of tabular data, especially in domains that are poorly covered by the language model's training data, where the model has little contextual understanding.

For further reading:

There is a blog post describing something similar for predicting house prices.
There is also a paper related to this idea. In that case, the authors fine-tune the language model on sentences that describe the tabular data together with the target labels.

Kevin’s digital whiteboard
