Tf-idf for seo content audits

How to Use Python Scripts to Get TF IDF Scores for SEO Content Audits

If you want to improve your SEO performance, you need to understand how your content compares to other pages that talk about the same topic. One simple way to do this is to use TF IDF. TF IDF stands for term frequency inverse document frequency. It is a statistical method that shows how important a word is inside one page and across a group of pages.

In this guide, you will learn what TF IDF means, why it matters for SEO, and how to use a Python script to calculate TF IDF scores for any list of URLs.

What TF IDF Means

TF IDF is a combination of two parts:

1. Term Frequency

This measures how often a word appears on one page. A word that appears many times will have a higher term frequency.

2. Inverse Document Frequency

This measures how rare or common the word is across all the pages you are comparing. If a word appears in every page, it is not special. If a word appears in only one or two pages, it is more important.

When you multiply these two parts, you get the TF IDF score. A high TF IDF score means the word is important in that page and is not too common across the other pages.

Why TF IDF Matters for SEO

Before search engines began using advanced language models, TF IDF was one of the main ways they measured relevance. Even today TF IDF can help you understand how your content focuses on important keywords.

Here is what TF IDF can help you do:

  • Discover the words your page truly emphasizes
  • Check if your page aligns with the keywords you want to target
  • Compare your page with competitor pages
  • Identify content gaps
  • Improve on-page SEO

TF IDF gives you a more objective picture than simple keyword counts.

What You Need Before Running the Script

To calculate TF IDF scores with Python, you need:

  • A list of URLs
  • A Python environment such as Google Colab
  • A few libraries like TextBlob, BeautifulSoup, Pandas, and Cloudscraper

You can paste as many URLs as you want. Some users work with twenty pages. Others go up to hundreds.

The Python Script That Calculates TF IDF

Below is the script used in the video demonstration. It does three main things:

!pip install cloudscraper

import cloudscraper
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

list_pages = [
    "https://emmanueldanawoh.com/how-to-use-google-bert-scores-in-seo-content-writing/",
    "https://emmanueldanawoh.com/how-to-avoid-being-a-victim-of-domain-squatting-homograph-attacks/",
    "https://emmanueldanawoh.com/seo-content-writing-how-to-optimize-for-entity-salience/",
    # Add more URLs as needed
]

scraper = cloudscraper.create_scraper()

list_content = []

for x in list_pages:
    content = ""
    html = scraper.get(x)
    soup = BeautifulSoup(html.text)

    for y in soup.find_all('p'):
            content = content + " " + y.text.lower()

    list_content.append(tb(content))

import math
from textblob import TextBlob as tb

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

import nltk
nltk.download('punkt')

list_words_scores = [["URL","Word","TF-IDF score"]]
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        list_words_scores.append([list_pages[i],word,score])

import pandas as pd

df = pd.DataFrame(list_words_scores)
df.to_excel('filename.xlsx', header=False, index=False)
  1. Scrapes the content of each URL
  2. Extracts all paragraph text
  3. Calculates TF IDF scores and stores the top words for each page

What the Output Means

When the script finishes running, you will get an Excel file with three columns:

  • URL
    The page the script analyzed
  • Word
    The most important words on that page
  • TF IDF score
    A score that shows how strongly each word stands out

This is helpful because it shows you which terms your page is truly known for. If the top TF IDF terms on your page do not match the target keywords you want to rank for, you may need to adjust your content.


How to Use TF IDF in Your SEO Process

Here are practical ways to use these scores:

1. Improve keyword targeting

Check if your page highlights the right phrases.

2. Compare against competitors

Run the script for competitor pages. Compare their top terms with yours.

3. Guide content rewrites

If your high-value keywords are missing, you will know exactly where to focus.

4. Spot content strengths

Some pages may already have a strong topical focus. TF IDF helps you identify them.

Final Thoughts

TF IDF is simple but powerful. It gives you a clear understanding of how your content communicates its main ideas. When combined with Python, you can run large content audits quickly and with very little manual work.

If you want to take your SEO work to the next level, learning how to calculate TF IDF with Python is a great step forward.

Leave a Reply

Your email address will not be published. Required fields are marked *