Unveiling SEO Insights: n-Gram Analysis of Google's QRG 2022 vs 2023
- Nadav Harari
- Nov 29, 2023
Updated: Dec 17, 2023
Understanding Google's algorithms is crucial for anyone serious about SEO. But what if you could go beyond the algorithm and tap into the very guidelines that shape it?
Google's Quality Rater Guidelines can help in understanding SEO trends. In this article, I use n-gram analysis to compare the 2022 and 2023 versions of these guidelines. This method allows us to pinpoint specific changes in language and focus, providing insights into Google's evolving priorities.
I hope that my analysis will help SEO professionals and digital marketers understand these shifts and how they might impact SEO strategies. Get ready for a detailed, data-driven exploration of the subtle yet significant changes in Google's approach to quality rating.
Objective
The aim of this case study is to analyze word frequencies and n-grams (up to 10-grams) in both the 2022 and 2023 versions of Google's Quality Rater Guidelines (QRG).
By identifying the most commonly used terms and phrases in each document, along with their main frequency differences, we aim to understand the key focus areas of these documents. This analysis should reveal new insights about essential SEO themes.
What is Google's Search Quality Evaluator Guidelines?
Google's Search Quality Evaluator Guidelines, commonly called the Quality Rater Guidelines (QRG), is a document that provides detailed guidance to the human raters who evaluate the quality of Google's search results. These guidelines help raters assess the quality of a webpage and how well it meets the needs of users who are looking for information. The document covers aspects such as experience, expertise, authoritativeness, and trustworthiness (E-E-A-T), as well as the relevance and usefulness of the content.
What are n-grams?
In simple terms, an "n-gram" is a sequence of "n" words that appear next to each other in a text. The "n" stands for the number of words you're looking at in a row.
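The definition above can be sketched in a few lines of Python. This is a minimal illustration in plain Python, not the script used for this analysis (which relied on NLTK); the function name `ngrams` and the naive whitespace tokenization are assumptions made for the example.

```python
def ngrams(text, n):
    """Return all n-grams (as tuples of words) found in `text`."""
    words = text.split()  # naive whitespace tokenization, for illustration only
    # Slide a window of n words across the text, one word at a time.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Very positive website reputation for the topic of the page"
print(ngrams("The green apple", 2))  # [('The', 'green'), ('green', 'apple')]
print(len(ngrams(sentence, 10)))     # 1: the whole sentence is a single 10-gram
```

A text with fewer than n words simply yields no n-grams, so the same function can be called for every n from 1 to 10 without special cases.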
For example:
1-gram (or "unigram") is just a single word, like "apple."
2-gram (or "bigram") is a sequence of 2 words, like "green apple."
3-gram (or "trigram") is a sequence of 3 words, like "The green apple."
"Very positive website reputation for the topic of the page." – I am a 10-gram
Why Analyze Word Frequencies and n-grams?
Analyzing word frequencies and n-grams can be a valuable approach for gaining insights into Google's search algorithms. For SEO professionals, this method goes beyond theoretical analysis; it serves as a practical tool that can provide actionable insights for refining your strategy. Here is why:
Securing Your Buy-In: If words and phrases like ‘E-E-A-T’, ‘YMYL’, and ‘user intent’ are among the top 50 most frequent meaningful words in the document, it indicates that Google seriously considers these aspects. This can help you secure more buy-in for your SEO budget.
Correlation Between Google’s Announcements and Actual Guidelines: If Google announces a greater emphasis on content demonstrating firsthand experience, or if you notice a surge in results from forum sites like Reddit and Quora, this should also be reflected in the QRG. Indeed, words and phrases such as ‘perspectives’ (+125%), ‘unhelpful’ (+64%), ‘experience’ (+12%), ‘helpful results’ (+117%), ‘social media’ (+11%), and ‘discussion’ (+38%) show a notable increase in the latest QRG version (November 2023).
Pinpointing Your Strategy: Understanding the most frequently occurring words and phrases in Google's QRG can help you identify core concepts that Google values. For example, frequent appearances of words like ‘user’, ‘reputation’ and ‘quality’ suggest areas to focus on when optimizing your website.
In essence, analyzing word frequencies and n-grams from Google's Quality Rater Guidelines is like having a cheat sheet for what Google considers important, allowing you to align your strategies for maximum impact and secure buy-in for your SEO recommendations.
Key Findings
The 2023 version contains 4% fewer words than the previous version and is 8 pages shorter (56,882 words vs. 59,367 words | 168 pages vs. 176 pages).
The top three most frequent meaningful words in the 2022 version are ‘page’, ‘user’, and ‘query’, while in the 2023 version, they are ‘page’, ‘content’, and ‘website’.
The term ‘E-E-A-T’ appears 111 times, which is 8% less than in the 2022 version (121 times).
'User intent' (Bigram) is the second most frequent expression in both versions.
The top two most frequent 10-grams in the document include the phrases 'needs met' and 'user intent'.
Variations of the word ‘Authority’ (e.g., ‘Authoritative’, ‘Authoritativeness’, etc.) appear 14% less often in the 2023 version (76 vs. 88 occurrences).
Conversely, variations of the word ‘Experience’ (e.g., ‘Experience’, ‘Experiences’) appear 12% more in the 2023 version.
Interestingly, some of the top words and phrases that appear more frequently in the new version seem to align with the latest updates in Google’s algorithms and systems. Words and phrases like ‘perspectives’ (+125%), ‘unhelpful’ (+64%), ‘Experience’ (+12%), ‘helpful results’ (+117%), ‘social media’ (+11%), ‘discussion’ (+38%) and more show a notable increase.
The 3-gram ‘Needs Met ratings’ appears 56% more often than in the previous version (14 vs. 9 occurrences), aligning with the added guidance for specific types of 'Needs Met ratings' mentioned on Search Engine Land.
The phrase ‘user intent’ (-21%) and the word ‘information’ (-12%) appear less in the new version, which may relate to the removal of outdated and redundant examples, as stated by Search Engine Land.
Tools and Technologies Used
For this analysis, I used Python libraries like PyPDF2 for PDF reading, pandas for data manipulation, and NLTK for natural language processing.
I chose Google Colab as it provides a cloud-based Python environment, making it easier to manage libraries and share code.
Methodology
I utilized both versions of the QRG PDFs and processed them with two similar Python scripts.
Data Preprocessing
1. I excluded the table of contents (up to page 5) in order not to 'pollute' the data.
2. I manually excluded most of the "stop words" (e.g., "the", "of", "to", "for", "in", "on", "is", "are") from the Unigrams tab, leaving only words with high lexical meaning.
3. I converted all words to lowercase before processing so that differently capitalized forms of the same word (e.g., "Page" and "page") are counted together.
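The three preprocessing steps above could look roughly like the following sketch. It is not the original Colab script: the function names, the short `STOP_WORDS` set, the tokenizing regular expression, and the toy `pages` list (standing in for the text extracted from each PDF page) are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; the article's list was curated by hand.
STOP_WORDS = {"the", "of", "to", "for", "in", "on", "is", "are", "a", "and"}


def preprocess(pages, skip_pages=5):
    """Drop the first `skip_pages` pages, lowercase the rest, and split into words."""
    body = " ".join(pages[skip_pages:])  # step 1: skip the table of contents
    # Step 3: lowercase, then keep runs of letters/digits, allowing inner hyphens
    # so that a term like "e-e-a-t" stays a single token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", body.lower())


def meaningful_unigrams(words):
    """Count single words, leaving out stop words (step 2)."""
    return Counter(w for w in words if w not in STOP_WORDS)


# Toy input: five placeholder pages followed by one page of body text.
pages = ["Table of Contents"] * 5 + ["The Page Quality rating is about the page."]
counts = meaningful_unigrams(preprocess(pages))
print(counts["page"])  # 2: "Page" and "page" are counted together
```

Because punctuation is dropped during tokenization, the resulting tokens are processed forms of the original text rather than exact quotations.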
Data Highlights
Top 20 meaningful words (Unigrams) - Exact match
1-Gram | Frequency 2022 | Frequency 2023 | % Change |
page | 1043 | 983 | -6% |
user | 541 | 453 | -16% |
query | 538 | 487 | -9% |
website | 529 | 499 | -6% |
information | 519 | 459 | -12% |
content | 488 | 505 | 3% |
users | 444 | 332 | -25% |
rating | 400 | 387 | -3% |
quality | 366 | 330 | -10% |
result | 356 | 337 | -5% |
pages | 353 | 347 | -2% |
intent | 335 | 283 | -16% |
mc (Main Content) | 322 | 311 | -3% |
purpose | 302 | 298 | -1% |
high | 248 | 228 | -8% |
helpful | 197 | 187 | -5% |
reputation | 184 | 178 | -3% |
people | 178 | 190 | 7% |
topic | 171 | 171 | 0% |
results | 158 | 178 | 13% |
Top variations* related to E-E-A-T
1-Gram | Frequency 2022 | Frequency 2023 | % Change |
E-E-A-T | 121 | 111 | -8% |
YMYL | 118 | 118 | 0% |
Experience | 107 | 120 | 12% |
Authority | 88 | 76 | -14% |
Trust | 166 | 162 | -2% |
Expertise | 139 | 132 | -5% |
* I included all variations of each root word, accounting for different forms like singulars, plurals, and tenses. For instance, the root 'Trust' includes variations like 'Trustworthy', 'Trustworthiness', or 'untrustworthy'.
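One simple way to group variations under a root word is substring matching, sketched below. This is an assumed approach for illustration, not necessarily how the figures above were produced; the helper `count_root` and the toy counts are hypothetical.

```python
from collections import Counter


def count_root(word_counts, root):
    """Sum the counts of every word that contains `root` (case-insensitive)."""
    root = root.lower()
    return sum(freq for word, freq in word_counts.items() if root in word.lower())


# Toy counts for illustration only; these are not figures from the QRG.
toy_counts = Counter({"trust": 3, "trustworthy": 2, "untrustworthy": 1, "truth": 4})
print(count_root(toy_counts, "trust"))  # 6 = 3 + 2 + 1 ("truth" does not match)
```

Substring matching can over-match (a short root may occur inside unrelated words), so the matched words are worth reviewing by hand before reporting a total.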
2023 Version: Key Words/Phrases with Increased Frequency
Word / Phrase | Frequency 2022 | Frequency 2023 | % Change |
reasonable | 11 | 40 | 264% |
perspectives | 4 | 9 | 125% |
intents | 6 | 9 | 50% |
unhelpful | 11 | 18 | 64% |
queries | 150 | 189 | 26% |
experience | 107 | 120 | 12% |
meets results | 12 | 29 | 142% |
helpful results | 6 | 13 | 117% |
social media | 37 | 41 | 11% |
needs met ratings | 9 | 14 | 56% |
interpretation | 54 | 88 | 63% |
discussion | 29 | 40 | 38% |
forum | 43 | 50 | 16% |
opinion | 31 | 34 | 10% |
very helpful | 41 | 57 | 39% |
Note: The n-grams listed in the output sheet may not exactly match their appearance in the original QRG document. This discrepancy is due to the way the Python library processes n-grams, which may involve removing or altering punctuation marks and other special characters. Therefore, the n-grams should be considered as processed forms of the original text for analytical purposes.
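The "% Change" column in the tables above is the relative difference between the 2023 and 2022 frequencies, rounded to a whole percent. A minimal sketch of that arithmetic follows; the function name `pct_change` is an assumption for this example, and the sample frequencies are taken from the tables.

```python
def pct_change(freq_old, freq_new):
    """Percentage change from the old frequency to the new one, rounded to a whole percent."""
    if freq_old == 0:
        raise ValueError("percentage change is undefined when the old frequency is 0")
    return round((freq_new - freq_old) / freq_old * 100)


# Frequencies taken from the tables above: (2022, 2023).
examples = {"reasonable": (11, 40), "perspectives": (4, 9), "users": (444, 332)}
for term, (f2022, f2023) in examples.items():
    print(f"{term}: {pct_change(f2022, f2023):+d}%")
# reasonable: +264%
# perspectives: +125%
# users: -25%
```

Terms with very low counts, such as ‘perspectives’ (4 vs. 9), produce large percentage swings from only a handful of occurrences, so it helps to read the percentages alongside the raw frequencies.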
Conclusion
The comparative analysis of Google's Quality Rater Guidelines (QRG) from 2022 to 2023 reveals significant insights into the evolving priorities of Google's search algorithms. My findings indicate a shift towards content quality, user experience, and the relevance of information presented on websites. The increased frequency of words such as ‘experience’, ‘discussion’, and ‘forum’ suggests that Google is prioritizing websites offering genuine value and user-centric content over merely authoritative sources, a trend that SEO professionals have likely already noticed.
For SEO professionals, these insights imply a need to adapt strategies to align with these evolving priorities. Focusing on creating content that addresses user intent comprehensively, ensuring that information is not only authoritative but also experiential and engaging, will likely be more beneficial. The reduced word count and streamlined content in the 2023 QRG version further suggest that Google is emphasizing clarity and precision in information presentation.
In essence, staying abreast of these subtle yet impactful shifts in Google's focus areas is crucial. By continuously analyzing and adapting to these changes, SEO experts can refine their strategies to better meet the demands of both Google's algorithms and user expectations, ultimately leading to improved search rankings and online visibility.
Explore Deeper: Access the Full n-Gram Analysis Data and Python Code
For a more detailed exploration and to uncover additional clues, insights, and patterns, I encourage readers to download the complete n-gram analysis dataset in Excel format using this link. (Google Sheets was not an option because of its 10-million-cell limit.)
Links to the Python code on Google Colab: [2022 Version] and [2023 Version].
FAQ
Why did you create two versions of the Python script?
Why did you pick these specific terms listed in the tables above?
How can I further utilize the data in the attached Excel file?
About the Author
I am Nadav Harari, an SEO specialist with a passion for data analysis and digital marketing. Feel free to contact me at Nadav@hararidigital.com or follow me on LinkedIn.