Let's brainstorm a list of miscellaneous things:

In [3]:
items_text = """
- pasta
- thomas dolby
- alpha
- apples
- cats
- pears
- meters
- brick
- dogs
- beta
- howard jones
- concrete
- asphalt
- milk
- rebar
- gillian gilbert
- hamsters
- bread
- butter
- wendy carlos
- gamma
- birds
- bananas
- rick wakeman
- inches
- glass
- feet
- gary numan
- miles
- lumber
- kilometers
- geoff downes
"""

# Split the text into non-empty lines...
items = [x for x in items_text.split("\n") if x]

Let's install some needful modules...

In [4]:
%pip install scikit-learn torch sentence_transformers accelerate

Note: you may need to restart the kernel to use updated packages.


Next, let's pick an embedding model and generate semantic vector representations for all our list items:

In [78]:
from sentence_transformers import SentenceTransformer

# 384 dimensions - https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# 384 dimensions - https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
# embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

# 768 dimensions - https://huggingface.co/sentence-transformers/all-mpnet-base-v2
# embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# 768 dimensions - https://huggingface.co/thenlper/gte-base
# embedding_model = SentenceTransformer('thenlper/gte-base')

# 1024 dimensions - https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer('thenlper/gte-large')

embeddings = embedding_model.encode(items)
embeddings

array([[-0.02736097,  0.00340217, -0.01076854, ..., -0.02489066,
         0.01647391, -0.02072625],
       [-0.03888908,  0.02349574,  0.0143796 , ..., -0.01913134,
        -0.03127009, -0.0304915 ],
       [ 0.01443943,  0.02336113,  0.00634783, ...,  0.00120864,
        -0.01801292, -0.02678853],
       ...,
       [ 0.00835066,  0.00595316, -0.01179765, ..., -0.01259451,
        -0.0165759 , -0.0056388 ],
       [ 0.0005869 ,  0.01886297, -0.0079366 , ..., -0.04092352,
        -0.01162215, -0.01117233],
       [-0.00889315, -0.00544541, -0.02917784, ..., -0.01641204,
        -0.01544971, -0.01567657]], dtype=float32)
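
As a side note on what "close together" means for these vectors: one common way to compare two embeddings is cosine similarity. The sketch below uses made-up 3-dimensional vectors (hypothetical values, not output from the model above) purely to illustrate the idea. Note that scikit-learn's KMeans, used later in this notebook, measures Euclidean distance rather than cosine similarity; for unit-length vectors the two rank pairs the same way.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for illustration only
toy_vectors = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "concrete": [0.0, 0.2, 0.9],
}

sim_cats_dogs = cosine_similarity(toy_vectors["cats"], toy_vectors["dogs"])
sim_cats_concrete = cosine_similarity(toy_vectors["cats"], toy_vectors["concrete"])

# The related pair should score noticeably higher than the unrelated pair
print(f"cats vs dogs:     {sim_cats_dogs:.3f}")
print(f"cats vs concrete: {sim_cats_concrete:.3f}")
```

Real embeddings have hundreds of dimensions, but the comparison works the same way.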

Now that we have vectors, let's try clustering them within the semantic space of the model. This should be roughly analogous to grouping them by meaning:

In [214]:
from sklearn.cluster import KMeans
from itertools import groupby

# Let's say we want to organize the list into this many clusters
n_clusters = 9

# Use the k-means algorithm to come up with a cluster ID for each embedding
cluster_ids = KMeans(n_clusters=n_clusters, n_init='auto').fit_predict(embeddings)

# Associate each cluster ID with the corresponding item
cluster_ids_with_items = zip(cluster_ids, items)

# Group the pairs of (cluster_id, item) into lists based on cluster ID
grouped_cluster_ids_with_items = groupby(
    sorted(cluster_ids_with_items, key=lambda x: x[0]),
    key=lambda x: x[0]
)

# Simplify that whole mess so we just have a list of clustered items
clustered_items = [
    [item for cluster_id, item in item_group]
    for cluster_id, item_group
    in grouped_cluster_ids_with_items
]

clustered_items

[['- meters', '- inches', '- feet', '- miles', '- kilometers'],
 ['- alpha', '- beta', '- gamma'],
 ['- brick', '- concrete', '- asphalt', '- rebar', '- glass', '- lumber'],
 ['- howard jones', '- gillian gilbert', '- wendy carlos'],
 ['- apples', '- pears', '- bananas'],
 ['- cats', '- dogs', '- hamsters', '- birds'],
 ['- pasta', '- bread'],
 ['- thomas dolby', '- rick wakeman', '- gary numan', '- geoff downes'],
 ['- milk', '- butter']]

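It's not perfect (for example, the musicians are split across two clusters), but we've got our list roughly organized. K-means starts from randomly chosen centroids, so the grouping can vary from run to run.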
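
The sort-then-groupby step in the clustering cell could also be written with a dictionary of lists, which some readers may find easier to follow. This is only an alternative sketch: `group_by_cluster` is a hypothetical helper name, and the toy cluster IDs below stand in for the output of `fit_predict`.

```python
from collections import defaultdict

def group_by_cluster(cluster_ids, items):
    """Collect items into lists keyed by cluster ID, ordered by cluster ID."""
    groups = defaultdict(list)
    for cluster_id, item in zip(cluster_ids, items):
        groups[cluster_id].append(item)
    return [groups[cluster_id] for cluster_id in sorted(groups)]

# Hypothetical cluster IDs standing in for the output of fit_predict
toy_ids = [1, 0, 1, 2, 0]
toy_items = ["- cats", "- apples", "- dogs", "- brick", "- pears"]

toy_clusters = group_by_cluster(toy_ids, toy_items)
print(toy_clusters)
# → [['- apples', '- pears'], ['- cats', '- dogs'], ['- brick']]
```

Either approach keeps the items within each cluster in their original list order.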
Next, let's load up an LLM that we'll use shortly:

In [202]:
"""
https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

This model is about 2.2GB.
My 2021 MacBook Pro with an Apple M1 Pro and 32GB of RAM seems to have no problem with this model.
"""
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Loading the LLM is separate from using it, so we can iterate faster on the prompt in this next cell:

In [243]:
system_prompt = """You are a helpful but terse assistant."""

user_prompt = """
Given the following list of items, I need a succinct label that effectively encapsulates the overall theme or purpose.

This is the list of items:

%s

Can you generate a concise, descriptive label for this list? Thanks in advance!
"""

def generate_topic(items):
    text = "\n".join(items)
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt % text},
    ]
    prompt = pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    results = pipe(
        prompt,
        max_new_tokens=32,
        do_sample=True,
        # temperature controls how much of a rando the LLM is while selecting tokens; lower values favor the most likely token
        temperature=0.1,
        # top_k limits how many of the most likely tokens are considered at each step of generation
        top_k=3,
        # top_p keeps only the most likely tokens whose combined probability reaches this threshold
        top_p=0.8,
    )
    # HACK: trim the prompt off the start of the generated text
    generated_text = results[0]['generated_text'][len(prompt):].strip()    
    return generated_text
    
for cluster in clustered_items:
    topic = generate_topic(cluster)

    print(f"# {topic}")
    print()
    for item in cluster:
        print(f"{item}")
    print()

# "Essential Measurement Tools for Everyday Life"

- meters
- inches
- feet
- miles
- kilometers

# "Key Components for Successful Project Management"

- alpha
- beta
- gamma

# "Materials for Construction and Repair"

- brick
- concrete
- asphalt
- rebar
- glass
- lumber

# "Essential Artists: Howard Jones, Gillian Gillespie, and Wendy Carlos"

- howard jones
- gillian gilbert
- wendy carlos

# "Fresh Fruits"

- apples
- pears
- bananas

# "Animals"

- cats
- dogs
- hamsters
- birds

# "Essential Ingredients for a Comforting Meal"

- pasta
- bread

# "Top 5 Legendary Musicians of the 1980s"

- thomas dolby
- rick wakeman
- gary numan
- geoff downes

# "Food Essentials"

- milk
- butter
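
To get a feel for the `temperature`, `top_k`, and `top_p` settings used in `generate_topic`, here is a simplified sketch of how such filters can narrow down the candidates for the next token. It is not the transformers implementation: `filter_distribution` is a hypothetical helper, and the token scores are made up for illustration.

```python
import math

def _normalize(pairs):
    """Rescale (token, probability) pairs so the probabilities sum to 1."""
    total = sum(p for _, p in pairs)
    return [(tok, p / total) for tok, p in pairs]

def filter_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide the raw scores before softmax; values below 1
    # sharpen the distribution toward the most likely token
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())
    pairs = [(tok, math.exp(s - max_score)) for tok, s in scaled.items()]
    ranked = sorted(_normalize(pairs), key=lambda pair: pair[1], reverse=True)

    # Top-k: keep only the k most likely tokens
    if top_k is not None:
        ranked = _normalize(ranked[:top_k])

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = _normalize(kept)

    return dict(ranked)

# Hypothetical raw scores for four candidate next tokens
toy_logits = {"Animals": 2.0, "Pets": 1.5, "Creatures": 1.0, "Zoo": 0.2}

low_temp = filter_distribution(toy_logits, temperature=0.1, top_k=3, top_p=0.8)
high_temp = filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8)
print("temperature=0.1:", low_temp)
print("temperature=1.0:", high_temp)
```

With these toy scores, a temperature of 0.1 puts nearly all of the probability on the top token, so only one candidate survives and sampling behaves almost greedily. At a temperature of 1.0, two candidates remain.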

