SEO clustering helps you organize keywords into groups that share the same search intent. Doing this manually takes hours. AI makes the process faster and more accurate. This article walks through a practical workflow for automated SEO clustering using AI tools and embeddings.
Prepare the Keyword Set and Pull SERP and URL Data

The first step is building a clean keyword list. Start by exporting keywords from tools like Google Search Console, Ahrefs, or Semrush. Include search volume, keyword difficulty, and current rankings if available. Aim for a list of at least 200 to 500 keywords to make clustering worthwhile.
Once you have the list, pull SERP data for each keyword. You need the top 10 URLs that rank for each term. Use a SERP API such as DataForSEO, SerpAPI, or ValueSERP to collect this data at scale. Store the results in a spreadsheet or database with the following columns: keyword, search volume, ranking URL, SERP title, and SERP position.
URL overlap is one of the most reliable signals for grouping keywords. If two keywords share three or more of the same ranking URLs in the top 10, Google treats them as similar in intent. This overlap method works well before you even apply AI. It gives you a baseline clustering structure that AI can then refine.
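The overlap baseline can be sketched in a few lines of Python. The keywords and URLs below are made up for illustration; the `serp_urls` mapping stands in for the SERP data collected in the previous step.

```python
from itertools import combinations

# Hypothetical data: keyword -> set of top-10 ranking URLs from the SERP pull.
serp_urls = {
    "technical seo audit": {"a.com/audit", "b.com/guide", "c.com/checklist", "d.com/tools"},
    "seo audit checklist": {"a.com/audit", "b.com/guide", "c.com/checklist", "e.com/blog"},
    "link building tips":  {"f.com/links", "g.com/outreach", "h.com/anchor"},
}

def overlapping_pairs(serp_urls, min_shared=3):
    """Return keyword pairs that share at least `min_shared` ranking URLs."""
    pairs = []
    for kw1, kw2 in combinations(serp_urls, 2):
        shared = serp_urls[kw1] & serp_urls[kw2]
        if len(shared) >= min_shared:
            pairs.append((kw1, kw2, len(shared)))
    return pairs
```

Here the two audit keywords share three URLs and would be grouped, while the link-building keyword stays separate. Raising `min_shared` tightens the grouping.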
Clean the data before moving forward. Remove duplicate keywords, strip out branded terms if they are not relevant, and flag any keywords with zero search volume. You want a tight, focused dataset. Garbage input produces garbage clusters.
Some teams also pull the meta descriptions and H1 tags from the top-ranking URLs. This extra data improves the quality of embeddings later. If you have the resources to scrape this information, include it in your dataset.
Create AI Embeddings for Keywords and SERP Titles
Embeddings are numerical representations of text. They capture meaning and context in a way that traditional keyword matching cannot. When you convert keywords and SERP titles into embeddings, similar terms end up close together in a high-dimensional space. This closeness becomes the foundation for clustering.
Use an embedding model to process your keyword list. OpenAI’s text-embedding-ada-002 model is a common choice, and its newer text-embedding-3-small model offers stronger results at a lower price. Google’s Vertex AI also offers strong embedding models. You can run these through a simple Python script using the relevant API.
For each keyword, create an embedding that combines the keyword itself with its top SERP titles. For example, if the keyword is “technical SEO audit,” you would feed the model a string like: “technical SEO audit | How to Run a Technical SEO Audit | Technical SEO Audit Checklist 2024 | Free Technical SEO Audit Tool.” This combined input produces a richer embedding that reflects both the keyword and what Google considers relevant for it.
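A minimal sketch of building that combined input string, using the example from the text. The `build_embedding_input` helper is a hypothetical name; the commented-out API call shows where the string would be sent (it requires an API key and network access, so it is not executed here).

```python
def build_embedding_input(keyword, serp_titles, max_titles=5):
    """Combine a keyword with its top SERP titles into one embedding input."""
    parts = [keyword] + list(serp_titles[:max_titles])
    return " | ".join(parts)

text = build_embedding_input(
    "technical SEO audit",
    ["How to Run a Technical SEO Audit", "Technical SEO Audit Checklist 2024"],
)

# The combined string is then sent to the embedding endpoint, e.g.:
# from openai import OpenAI
# client = OpenAI()
# resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
# vector = resp.data[0].embedding
```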
This is also a good point in the workflow to consider whether your keyword set covers local intent. If you are working on a campaign that targets a specific city or region, keywords with local modifiers behave differently in SERP results. Teams that focus on local SEO often find that local keywords cluster separately from broader informational terms, even when the core topic is the same. Keeping local and national keywords in distinct groups from the start saves time during the labeling stage.

Store your embeddings in a vector format. A Pandas dataframe with an embeddings column works well for smaller datasets with fewer than 10,000 keywords. For larger datasets, use a vector database like Pinecone, Weaviate, or Chroma. These tools allow faster similarity searches and make the clustering step more efficient.
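For the smaller-dataset case, a Pandas dataframe sketch looks like this. The three-dimensional vectors are toy stand-ins for real model output, which typically has hundreds or thousands of dimensions.

```python
import numpy as np
import pandas as pd

# Hypothetical embeddings: tiny 3-d vectors stand in for real model output.
df = pd.DataFrame({
    "keyword": ["technical seo audit", "seo audit checklist", "link building tips"],
    "embedding": [
        np.array([0.90, 0.10, 0.00]),
        np.array([0.85, 0.15, 0.05]),
        np.array([0.00, 0.20, 0.95]),
    ],
})

# Stack the column into one matrix for fast, vectorized similarity math later.
matrix = np.vstack(df["embedding"].to_numpy())
```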
The quality of your embeddings directly affects the quality of your clusters. Spend time carefully crafting the input strings. Include the SERP titles that carry the strongest topical signals. Avoid feeding the model irrelevant or low-quality titles from spammy ranking pages.
Cluster with AI Similarity and Pick Thresholds
With embeddings ready, the next step is to calculate similarities between keywords and group them into clusters. Cosine similarity is the standard metric for this task. It measures the angle between two embedding vectors. A score of 1.0 means the vectors point in exactly the same direction, i.e. near-identical meaning. A score close to 0 means no meaningful relationship.
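The metric itself is the dot product of the two vectors divided by the product of their norms. A from-scratch version with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of vector norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical vectors score 1.0; orthogonal vectors score 0. In production you would compute this over the whole embedding matrix at once rather than pair by pair.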
There are two main approaches to clustering: hierarchical clustering and density-based clustering.
Hierarchical clustering builds a tree structure that links keywords from most similar to least similar. You can cut the tree at any level to produce clusters of different sizes. The agglomerative method, available in Python’s scikit-learn library, works well for SEO datasets. It is straightforward to implement and produces clean results.
DBSCAN is a density-based method that identifies clusters based on the density of keywords in the embedding space. It handles noise well and does not require you to specify the number of clusters in advance. This makes it useful when you do not know how many topic groups your keyword set contains.
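DBSCAN follows the same shape. Here `eps` is a cosine-distance radius, and `min_samples=2` means isolated keywords are labeled `-1` (noise) rather than forced into a cluster; the toy vectors are the same illustrative stand-ins as above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.90, 0.10, 0.00],
    [0.85, 0.15, 0.05],
    [0.00, 0.20, 0.95],
])

# eps is a cosine *distance* radius; min_samples=2 leaves singletons as noise.
db = DBSCAN(eps=0.15, min_samples=2, metric="cosine")
labels = db.fit_predict(X)
```

The noise label is useful in practice: `-1` keywords are candidates for manual review rather than forced cluster members.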
Threshold selection is critical. If your similarity threshold is too high, you end up with hundreds of tiny clusters that are hard to manage. If it is too low, unrelated keywords merge into the same group. A threshold between 0.82 and 0.88 cosine similarity works well for most SEO datasets. Keep in mind that some libraries work in cosine distance rather than similarity; convert with distance = 1 − similarity. Test several values and review the output manually before committing to one.
After running the algorithm, count the number of clusters and check the average cluster size. Clusters with only one keyword are outliers. Clusters with more than 50 keywords are likely too broad. Adjust the threshold and re-run until the distribution looks balanced.
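That sanity check is easy to automate. The helper below is a hypothetical example; the size limits mirror the rules of thumb in the text.

```python
from collections import Counter

def cluster_report(labels, min_size=2, max_size=50):
    """Summarize cluster sizes and flag singletons and overly broad groups."""
    sizes = Counter(labels)
    return {
        "n_clusters": len(sizes),
        "avg_size": sum(sizes.values()) / len(sizes),
        "singletons": [c for c, n in sizes.items() if n < min_size],
        "too_broad": [c for c, n in sizes.items() if n > max_size],
    }

report = cluster_report([0, 0, 1, 2, 2, 2])
```

Re-run this after each threshold change and stop tuning once both flag lists are close to empty.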
Pros of AI-based clustering include speed, scalability, and consistency. You can process thousands of keywords in minutes. The AI does not get tired or make subjective judgment calls.
Cons include the need for technical setup, occasional nonsensical groupings in edge cases, and the cost of embedding API calls at scale. For a dataset of 5,000 keywords, you might spend between five and fifteen dollars on API costs, depending on the model you use.
Label Clusters by Intent and Choose the Primary Keyword
Clustering produces groups of related keywords. Labeling gives each group a clear name and assigns a search intent category. This step makes the clusters actionable for content planning.
Search intent falls into four categories. Informational queries seek knowledge. For example, “How does PageRank work?” Navigational queries look for a specific website or page, such as “Semrush login.” Commercial queries compare options before a purchase, such as “best SEO tools 2024.” Transactional queries signal purchase or sign-up intent, such as “buy Semrush subscription.”
To label clusters, feed the top keywords in each group to an AI language model with a prompt like: “Here are ten related keywords. Identify the common topic and search intent. Return a short cluster label and an intent category.” GPT-4 and Claude both handle this task well. You can batch-process clusters to speed up the labeling step.
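Prompt construction is the part worth standardizing. The helper below is a hypothetical sketch that builds the labeling prompt from the text; the actual chat-API call (to GPT-4, Claude, or similar) would then take this string, one request per cluster.

```python
def build_label_prompt(keywords, n=10):
    """Build the labeling prompt for one cluster (hypothetical helper)."""
    kw_list = "\n".join(f"- {k}" for k in keywords[:n])
    return (
        "Here are related keywords from one cluster:\n"
        f"{kw_list}\n"
        "Identify the common topic and search intent. "
        "Return a short cluster label and an intent category "
        "(informational, navigational, commercial, or transactional)."
    )

prompt = build_label_prompt(["seo audit", "site audit", "technical seo audit"])
```

Constraining the model to the four intent categories in the prompt keeps the labels consistent across batches.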
After labeling, choose the primary keyword for each cluster. The primary keyword is the term you will target as the main focus of a page or article. Select it based on three factors: search volume, keyword difficulty, and how well it represents the cluster’s topic. In most cases, the highest-volume keyword with a manageable difficulty score is the right choice.
Avoid selecting a primary keyword that does not match the dominant intent of the cluster. If eight out of ten keywords in a cluster have informational intent but the highest-volume term looks transactional, the cluster may need to be split into two separate groups.
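The selection rules above can be encoded as a simple scoring function. This is an illustrative sketch: the `max_difficulty` cutoff of 40 is an assumed threshold, not a rule from the text, and the cluster data is made up.

```python
def pick_primary(cluster, max_difficulty=40):
    """Pick the highest-volume keyword with manageable difficulty whose
    intent matches the cluster's dominant intent (illustrative rule)."""
    intents = [kw["intent"] for kw in cluster]
    dominant = max(set(intents), key=intents.count)
    candidates = [kw for kw in cluster
                  if kw["intent"] == dominant and kw["difficulty"] <= max_difficulty]
    if not candidates:  # fall back if every on-intent keyword is too hard
        candidates = [kw for kw in cluster if kw["intent"] == dominant]
    return max(candidates, key=lambda kw: kw["volume"])["keyword"]

cluster = [
    {"keyword": "technical seo audit", "volume": 2400, "difficulty": 35, "intent": "informational"},
    {"keyword": "seo audit checklist", "volume": 1900, "difficulty": 28, "intent": "informational"},
    {"keyword": "buy seo audit", "volume": 3100, "difficulty": 55, "intent": "transactional"},
]
```

Here the transactional keyword has the highest volume but loses on intent match, so the informational audit keyword wins, exactly the trap the paragraph above warns about.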
Once you have labeled clusters and primary keywords, export the final table to a content calendar or project management tool. Each cluster represents one page or content asset. The primary keyword becomes the focus of the page title. Supporting keywords in the cluster become subheadings, FAQ sections, or related content within the same page.
This workflow can cut keyword research time dramatically compared to manual grouping. It produces clusters that align with how Google groups content, improving the chances of ranking multiple keywords on a single well-structured page. Start with a clean dataset, use strong embeddings, test your thresholds carefully, and label with AI assistance to build a content strategy that is grounded in data.