Blog Grid - Emmanuel Dan-Awoh

February 15, 2026

How To Use the Copyscape API for SEO With this Python Script

If you are serious about SEO, making sure your content is original is key. Duplicate content can hurt your search engine rankings and reduce your website traffic. One tool that helps solve this problem is the Copyscape API. It lets you check your content for duplication across the web programmatically.

What is the Copyscape API

Copyscape is a popular plagiarism detection service. It helps you find content that has been copied or duplicated from your website or any other source. The API version of Copyscape lets developers integrate its plagiarism detection capabilities into software or scripts. This is especially useful for websites with many pages or for agencies managing multiple clients.

Some of the main features include:

Plagiarism detection: Check if content has been copied anywhere on the internet.
Batch processing: Check multiple URLs or content pieces at once.
Flexible integration: Use multiple programming languages to work with the API.

These features make it a valuable tool for anyone involved in SEO, content publishing, or digital marketing.

Why SEO Professionals Use the Copyscape API

The Copyscape API is useful for different groups:

Content publishers: Verify content originality before publishing.
SEO agencies: Monitor client websites to protect content from plagiarism.
Educational institutions: Check student submissions for academic integrity.
Content aggregators: Filter out duplicated content from multiple sources.

In addition, detecting duplicated content can help prevent negative SEO tactics. Some people copy content from high-ranking websites and publish it elsewhere to reduce the original site’s authority. By regularly checking your content, you can identify and address this type of issue.

How to Get Started with the Copyscape API

To start using the Copyscape API, follow these steps:

Create a Copyscape account and purchase credits. Each search costs a small fee, usually around $0.03 per search.
Obtain your API key from your account. This key allows you to access the Copyscape servers programmatically.
Prepare a list of URLs you want to check. This is usually done in an Excel file with a column called URL.
Use a Python script to send requests to the API and gather duplication data.
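Before a page URL goes into an API request, it helps to encode it. Otherwise characters such as ? and & inside the page address can be read as part of the request itself. Here is a minimal sketch using only the Python standard library. The parameter names u, k, o, c, and q are the ones the script in this post uses; the helper name build_copyscape_url and the example page address are illustrative assumptions.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.copyscape.com/api/"

def build_copyscape_url(username, api_key, page_url, comparisons=10):
    """Return a request URL with every parameter URL-encoded."""
    params = {
        "u": username,      # account username
        "k": api_key,       # API key from your account
        "o": "csearch",     # operation used in this post's script
        "c": comparisons,   # value passed as c in this post's script
        "q": page_url,      # the page you want to check
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"

# Hypothetical page address containing characters that must be encoded
request_url = build_copyscape_url(
    "your_username", "your_api_key", "https://example.com/page?id=1&ref=home"
)
print(request_url)
```

The page address ends up as a single encoded value after q=, so its own query string cannot collide with the API parameters.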

Here is a simple example using Python:

from urllib.parse import urlencode
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup  # the 'xml' parser also needs lxml installed

# Copyscape credentials
username = "your_username"
myapikey = "your_api_key"

# Load URLs from Excel (expects a column called URL)
df = pd.read_excel('urls.xlsx')
list_urls = df['URL'].dropna().tolist()

# Output column name -> tag name inside each <result> element
fields = {
    'URL': 'url',
    'Title': 'title',
    'Text Snippet': 'textsnippet',
    'Min Words Matched': 'minwordsmatched',
    'View URL': 'viewurl',
    'Percent Matched': 'percentmatched',
}

# Store results
all_data = []

for url in list_urls:
    try:
        # URL-encode the parameters so characters like & or ? in the page URL survive
        params = urlencode({'u': username, 'k': myapikey, 'o': 'csearch', 'c': 10, 'q': url})
        page = urlopen(f"https://www.copyscape.com/api/?{params}")
        soup = BeautifulSoup(page, 'xml')
        for result in soup.find_all("result"):
            # Keep the page that was checked so each match can be traced back to it
            data = {'Source URL': url}
            for column, tag in fields.items():
                node = result.find(tag)
                # A field can be missing from a result, so fall back to an empty string
                data[column] = node.text if node is not None else ""
            all_data.append(data)
    except Exception as e:
        print(f"Error processing {url}: {e}")

df_combined = pd.DataFrame(all_data)
df_combined.to_excel('results.xlsx', index=False)

print("Data extraction complete. Excel file saved as 'results.xlsx'.")

This script reads a list of URLs, sends them to Copyscape, and collects duplication information in an Excel file. You can then review which content has been copied and take action if necessary.

Interpreting Results

Once the script runs, the output Excel file will show:

The original URL you checked
The URL and title of each page with matching content
A snippet of the matched text
How many words matched
A Copyscape link for viewing the match
The percentage of duplication

With this information, you can determine which content needs to be rewritten or protected.
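One way to decide where to start is to sort the matches by how much of the text overlaps. The sketch below works on rows shaped like the script's output, using the Percent Matched column. The 50 percent threshold, the function name flag_heavy_matches, and the sample rows are assumptions for illustration, not Copyscape recommendations.

```python
def flag_heavy_matches(rows, threshold=50.0):
    """Return rows whose 'Percent Matched' value is at or above the threshold."""
    flagged = []
    for row in rows:
        raw = str(row.get("Percent Matched", "")).strip()
        try:
            percent = float(raw)
        except ValueError:
            continue  # skip rows with a missing or non-numeric percentage
        if percent >= threshold:
            flagged.append(row)
    # Highest overlap first, so the most urgent pages are on top
    return sorted(flagged, key=lambda r: float(r["Percent Matched"]), reverse=True)

# Hypothetical rows shaped like the script's output (values are made up)
sample_rows = [
    {"URL": "https://example.org/copy-a", "Percent Matched": "82"},
    {"URL": "https://example.net/copy-b", "Percent Matched": "12"},
    {"URL": "https://example.com/copy-c", "Percent Matched": ""},
    {"URL": "https://example.edu/copy-d", "Percent Matched": "55"},
]

for row in flag_heavy_matches(sample_rows):
    print(row["URL"], row["Percent Matched"])
```

Rows without a percentage are skipped rather than treated as zero, so you can review them separately.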

Conclusion

Using the Copyscape API for SEO is a smart way to maintain content originality and protect your website from plagiarism. Whether you are managing a single blog or a large site, the API makes it easier to detect duplication, monitor client content, and take action when necessary. By integrating it into your workflow, you can improve your SEO strategy and keep your content unique.

November 16, 2025

How to Use Python Scripts to Get TF IDF Scores for SEO Content Audits

If you want to improve your SEO performance, you need to understand how your content compares to other pages that talk about the same topic. One simple way to do this is to use TF IDF. TF IDF (often written TF-IDF) stands for term frequency-inverse document frequency. It is a statistical method that shows how important a word is on one page relative to a group of pages.

In this guide, you will learn what TF IDF means, why it matters for SEO, and how to use a Python script to calculate TF IDF scores for any list of URLs.

What TF IDF Means

TF IDF is a combination of two parts:

1. Term Frequency

This measures how often a word appears on one page. A word that appears many times will have a higher term frequency.

2. Inverse Document Frequency

This measures how rare or common the word is across all the pages you are comparing. If a word appears on every page, it is not special. If a word appears on only one or two pages, it is more distinctive.

When you multiply these two parts, you get the TF IDF score. A high TF IDF score means the word is used often on that page and is not too common across the other pages.
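To make the multiplication concrete, here is a small worked sketch on four made-up one-line "pages". It uses the same formula as the script in this post, including the extra 1 in the IDF denominator. The sample sentences are assumptions chosen only to keep the numbers easy to follow.

```python
import math

def tf(word, doc):
    # Share of the page's words that are this word
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Rarer words get a higher value; the extra 1 avoids division by zero
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / (1 + containing))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# Four tiny made-up pages, already split into words
docs = [
    "seo audit checklist for seo teams".split(),
    "content marketing tips for teams".split(),
    "email marketing tips for beginners".split(),
    "social media tips for beginners".split(),
]

page = docs[0]
scores = {word: tfidf(word, page, docs) for word in set(page)}
for word, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{word:10s} {score:.3f}")
```

On the first page, "seo" scores highest because it appears twice there and nowhere else. The word "for" appears on every page, so with this formula its score drops slightly below zero.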

Why TF IDF Matters for SEO

Long before search engines began using advanced language models, TF IDF was a standard way in information retrieval to measure how relevant a document is to a term. Even today, TF IDF can help you understand how your content focuses on important keywords.

Here is what TF IDF can help you do:

Discover the words your page truly emphasizes
Check if your page aligns with the keywords you want to target
Compare your page with competitor pages
Identify content gaps
Improve on-page SEO

TF IDF gives you a more objective picture than simple keyword counts.

What You Need Before Running the Script

To calculate TF IDF scores with Python, you need:

A list of URLs
A Python environment such as Google Colab
A few libraries like TextBlob, BeautifulSoup, Pandas, and Cloudscraper

You can paste as many URLs as you want. Some users work with twenty pages. Others go up to hundreds.

The Python Script That Calculates TF IDF

Below is the script used in the video demonstration. It does three main things:

Scrapes the content of each URL
Extracts all paragraph text
Calculates TF IDF scores and stores the top words for each page

!pip install cloudscraper

import math

import cloudscraper
import nltk
import pandas as pd
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

# TextBlob relies on NLTK tokenizer data to split text into words
nltk.download('punkt')

list_pages = [
    "https://emmanueldanawoh.com/how-to-use-google-bert-scores-in-seo-content-writing/",
    "https://emmanueldanawoh.com/how-to-avoid-being-a-victim-of-domain-squatting-homograph-attacks/",
    "https://emmanueldanawoh.com/seo-content-writing-how-to-optimize-for-entity-salience/",
    # Add more URLs as needed
]

scraper = cloudscraper.create_scraper()

list_content = []

# Fetch each page and keep the lowercased text of its <p> tags
for x in list_pages:
    content = ""
    html = scraper.get(x)
    soup = BeautifulSoup(html.text, 'html.parser')

    for y in soup.find_all('p'):
        content = content + " " + y.text.lower()

    list_content.append(tb(content))

def tf(word, blob):
    # Share of the page's words that are this word
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Number of pages that contain the word
    return sum(1 for blob in bloblist if word in blob.words)

def idf(word, bloblist):
    # Rarer words get a higher value; the extra 1 avoids division by zero
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

list_words_scores = [["URL","Word","TF-IDF score"]]
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    # Keep the five highest-scoring words for each page
    for word, score in sorted_words[:5]:
        list_words_scores.append([list_pages[i],word,score])

df = pd.DataFrame(list_words_scores)
df.to_excel('filename.xlsx', header=False, index=False)

What the Output Means

When the script finishes running, you will get an Excel file with three columns:

URL: The page the script analyzed
Word: One of the five highest-scoring words on that page
TF IDF score: A score that shows how strongly each word stands out

This is helpful because it shows you which terms your page is truly known for. If the top TF IDF terms on your page do not match the target keywords you want to rank for, you may need to adjust your content.
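A quick way to run that check is to compare the top words for a page against your target keyword list. The sketch below is one possible approach. Because the script scores single words, a multi-word keyword is treated as covered when any of its words is among the top terms. The function name missing_targets and both sample lists are hypothetical.

```python
def missing_targets(top_terms, target_keywords):
    """Return the target keywords that none of the top terms cover."""
    top = {term.lower() for term in top_terms}
    missing = []
    for keyword in target_keywords:
        # A multi-word keyword counts as covered if any of its words is a top term
        words = keyword.lower().split()
        if not any(word in top for word in words):
            missing.append(keyword)
    return missing

# Hypothetical top words for one page and hypothetical target keywords
top_terms = ["bert", "scores", "content", "google", "writing"]
targets = ["BERT", "entity salience", "content writing"]

print(missing_targets(top_terms, targets))
```

Any keyword that comes back from this function is a candidate for the content adjustments described above.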

How to Use TF IDF in Your SEO Process

Here are practical ways to use these scores:

1. Improve keyword targeting

Check if your page highlights the right phrases.

2. Compare against competitors

Run the script for competitor pages. Compare their top terms with yours.

3. Guide content rewrites

If your high-value keywords are missing, you will know exactly where to focus.

4. Spot content strengths

Some pages may already have a strong topical focus. TF IDF helps you identify them.
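The competitor comparison in step 2 can be sketched with simple set operations on the rows the script writes (URL, word, score). This is a minimal illustration and not part of the original script. The function names, the URLs, and the scores in the sample rows are all made up.

```python
from collections import defaultdict

def terms_by_url(rows):
    """Group the top words from the output rows by page URL."""
    grouped = defaultdict(set)
    for url, word, _score in rows:
        grouped[url].add(word)
    return grouped

def compare_terms(rows, your_url, competitor_url):
    """Split top words into shared terms and terms unique to each page."""
    grouped = terms_by_url(rows)
    yours, theirs = grouped[your_url], grouped[competitor_url]
    return {
        "shared": sorted(yours & theirs),
        "only_yours": sorted(yours - theirs),
        "only_competitor": sorted(theirs - yours),  # possible content gaps
    }

# Hypothetical rows in the same shape as the script's output
rows = [
    ["https://example.com/your-page", "audit", 0.12],
    ["https://example.com/your-page", "checklist", 0.10],
    ["https://example.com/your-page", "crawl", 0.08],
    ["https://example.org/competitor-page", "audit", 0.11],
    ["https://example.org/competitor-page", "backlinks", 0.09],
    ["https://example.org/competitor-page", "crawl", 0.07],
]

report = compare_terms(
    rows, "https://example.com/your-page", "https://example.org/competitor-page"
)
print(report)
```

Words that show up only on the competitor page are a starting point for step 3, since they hint at topics your page does not emphasize yet.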

Final Thoughts

TF IDF is simple but powerful. It gives you a clear understanding of how your content communicates its main ideas. When combined with Python, you can run large content audits quickly and with very little manual work.

If you want to take your SEO work to the next level, learning how to calculate TF IDF with Python is a great step forward.

November 16, 2025

How to Use the Wayback Machine API for SEO

If you work in SEO, you already know how important it is to understand what changed on a website over time. Sometimes a site drops in ranking, and you need to know why. Other times you want to check how a competitor changed their content or design. The Wayback Machine is one of the best tools for this job. It stores snapshots of millions of websites so you can travel back in time and see older versions of any page.

In this guide, you will learn what the Wayback Machine does, why it matters for SEO, and how you can use its API along with a simple Python script to pull historical snapshots at scale.

What the Wayback Machine Does

The Wayback Machine is a digital archive of the internet. It crawls websites and saves snapshots of pages at different points in time. You can visit the website, enter any URL, and browse how that page looked on specific dates.

Here are the main things it offers:

A large archive of snapshots going back many years
A date selector that lets you choose a specific day
A search feature that works across URLs and domains

With this tool, you can study any website and see its past content, layout, and structure.

Why the Wayback Machine Matters for SEO

SEO changes all the time. When a site drops in traffic, the problem may be something that changed months ago. The Wayback Machine helps you find clues.

Here are ways SEOs use it:

1. Analyze historical content

You can check what your content looked like before rankings changed. Maybe a section was removed. Maybe keywords disappeared. Maybe the structure changed.

2. Recover lost content and backlinks

If a page was deleted or rewritten, older versions may still exist in the archive. This helps you restore useful content or rebuild lost link value.

3. Study competitor strategy

Competitors are always updating their pages. By checking their old snapshots, you can study their design choices, their content growth, and the changes they made over time.

4. Audit site performance

Large SEO audits often need long term data. The Wayback Machine can reveal patterns that help explain traffic drops or improvements.

The Practical Use of the Wayback Machine API

Checking one or two URLs is easy. Checking hundreds is not. This is where the API helps. The API lets you interact with the Wayback Machine using code so you can pull snapshots for many URLs at once.

The Wayback Machine offers three main APIs:

JSON API
Memento API
CDX API

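In this guide, we will use the wayback Python package, which handles these API calls for you. Its search function queries the CDX index for the snapshots (also called mementos) recorded for each URL and returns a link to every one of them.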
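To make the list of APIs a little more concrete, here is a hedged sketch of what a raw CDX query looks like without any helper library. It only builds the request URL and parses a response, so it runs without a network connection. The sample response is made up, and the helper names build_cdx_url and parse_cdx_json are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def build_cdx_url(page_url, date_from, date_to):
    """Build a CDX query for captures between two YYYYMMDD dates."""
    params = {"url": page_url, "from": date_from, "to": date_to, "output": "json"}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

def parse_cdx_json(text):
    """Turn the JSON rows into dictionaries; the first row holds the field names."""
    rows = json.loads(text)
    if not rows:
        return []
    header, *captures = rows
    records = []
    for capture in captures:
        record = dict(zip(header, capture))
        # Link to the archived copy of this capture
        record["snapshot_url"] = (
            f"https://web.archive.org/web/{record['timestamp']}/{record['original']}"
        )
        records.append(record)
    return records

# Made-up response in the shape the CDX API returns with output=json
sample_response = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20230715120000", "https://example.com/", "text/html", "200",
     "HYPOTHETICALDIGEST", "1234"],
])

print(build_cdx_url("example.com", "20230601", "20240601"))
for record in parse_cdx_json(sample_response):
    print(record["timestamp"], record["snapshot_url"])
```

A helper library saves you from handling details like paging and retries yourself, which is why the script in this guide relies on one.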

What You Need Before Running the Script

To use the Python script, prepare two things:

An Excel file that contains all the URLs you want to study
A date range that defines how far back you want to look

For example, you can select a one-year period, such as June 2023 to June 2024.

Your Excel sheet should have:

No empty rows
No empty columns
A header in the first row
URLs starting from the second row

The Python Script That Pulls Wayback Machine Data

Here is the script used to collect snapshots:

# Install the necessary libraries
!pip install --upgrade wayback
!pip install pandas openpyxl

import wayback
import pandas as pd
from datetime import date
from openpyxl import load_workbook  # Import for reading Excel files

# Define paths and date range
excel_file = "time_travel_pages.xlsx"  # Replace with your Excel file path
sheet_name = "Sheet1"  # Replace with the sheet name containing URLs
date_from = date(2023, 6, 1) # date( Year, Month, Day)
date_to = date(2024, 6, 1) # date( Year, Month, Day)

# Initialize a list to store records
records_list = []

# Create Wayback Machine client
client = wayback.WaybackClient()

# Read URLs from Excel
wb = load_workbook(filename=excel_file, read_only=True)
sheet = wb[sheet_name]  # Access the specified sheet

# Loop through each row in the sheet (assuming URLs are in the first column)
for row in sheet.iter_rows(min_row=2):  # Skip the header row (row 1)
  url = row[0].value  # Assuming URLs are in the first column (index 0)
  if url:  # Check if there's a value in the cell
    # Search the Wayback Machine
    for record in client.search(url, from_date=date_from, to_date=date_to):
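      # Construct memento URL (optional, if needed). record.timestamp is a datetime,
      # so format it as the 14-digit YYYYMMDDhhmmss value used in archive links
      # memento_url = f"http://web.archive.org/web/{record.timestamp:%Y%m%d%H%M%S}/{record.url}"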

      # Collect data
      record_data = {
          'original_url': record.url,
          'timestamp': record.timestamp,
          # Use memento_url if needed, otherwise use view_url
          'memento_url': record.view_url  # Or memento_url if constructed
      }
      records_list.append(record_data)

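# Create DataFrame and export to Excel
df = pd.DataFrame(records_list, columns=['original_url', 'timestamp', 'memento_url'])
if not df.empty:
    # Excel cannot store timezone-aware datetimes, so drop the timezone marker
    df['timestamp'] = df['timestamp'].dt.tz_localize(None)
df.to_excel('wayback_records.xlsx', index=False)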

print("Data exported to wayback_records.xlsx")

When you run the script:

It reads your Excel file
It checks the Wayback Machine for each URL
It collects snapshots that fall within your date range
It exports all results into a spreadsheet

Your output file will contain:

The original URL
The exact snapshot timestamps
A memento link you can click to see how the page looked on that date

This gives you a clean archive of snapshot data for your entire URL list.
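With hundreds of rows, a per-URL summary is often easier to scan than the raw list. The sketch below counts snapshots per page and finds the first and last capture in the range. It works on dictionaries shaped like the records the script collects (original_url and timestamp). The function name summarize_snapshots and the sample records are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime

def summarize_snapshots(records):
    """Return one summary row per URL: snapshot count, first and last capture."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["original_url"]].append(record["timestamp"])
    summary = []
    for url, stamps in grouped.items():
        summary.append({
            "original_url": url,
            "snapshots": len(stamps),
            "first": min(stamps),
            "last": max(stamps),
        })
    return summary

# Made-up records in the same shape the script collects
sample_records = [
    {"original_url": "https://example.com/", "timestamp": datetime(2023, 7, 1, 9, 0)},
    {"original_url": "https://example.com/", "timestamp": datetime(2024, 2, 10, 14, 30)},
    {"original_url": "https://example.com/pricing", "timestamp": datetime(2023, 11, 5, 8, 15)},
]

for row in summarize_snapshots(sample_records):
    print(row["original_url"], row["snapshots"], row["first"].date(), row["last"].date())
```

Pages with very few snapshots in your date range are worth noting, because there will be less history to compare for them.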

How This Helps You in SEO

With your output spreadsheet, you can now:

Compare content across dates
Detect structural changes
Restore old high-performing copy
Track competitor updates
Run timeline-based audits

This process speeds up SEO analysis and makes it easier to explain historical issues to clients or teammates.
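Comparing content across dates can be as simple as diffing the text of two snapshots. Here is a minimal sketch using Python's built-in difflib module. It assumes you have already extracted the visible text of an older and a newer snapshot. The two sample texts and the function name changed_lines are made up for illustration.

```python
import difflib

def changed_lines(old_text, new_text):
    """Return removed lines (prefixed with -) and added lines (prefixed with +)."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="older snapshot",
        tofile="newer snapshot",
        lineterm="",
        n=0,  # no surrounding context, only the changes themselves
    ))
    # The first two lines are the file headers; hunk markers start with @@
    return [line for line in diff[2:] if line.startswith(("+", "-"))]

# Made-up text from two snapshots of the same page
older = "Best SEO Tools for 2023\nOur pricing starts at $10\nContact us today"
newer = "Best SEO Tools for 2024\nOur pricing starts at $10\nBook a free demo\nContact us today"

for line in changed_lines(older, newer):
    print(line)
```

Lines starting with a minus sign were removed from the older version, and lines starting with a plus sign were added in the newer one.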

Final Thoughts

The Wayback Machine is one of the most powerful but underrated tools in SEO. When paired with the API and a simple Python script, it becomes even more useful. You can collect large amounts of historical page data in minutes and use it to improve rankings, recover content, and study competitors.

If you want to level up your SEO practice, start using the Wayback Machine API. It gives you the power to see the past and improve the future.