Google’s ranking algorithm relies on a set of intrinsic values manually defined by its engineers.
In other words, these are fixed parameters that do not change dynamically and apply as absolute rules.
In this article, let’s investigate one of these values — a key element that sheds light on how the world’s leading search engine manages a fundamental aspect of its ranking system: indexing.
To be indexed, or not to be indexed, that is the question
For any website, being indexed by Google—and staying indexed—is a critical challenge.
If a page is not indexed, all other SEO efforts—content creation, link acquisition, conversion optimization, and more—become useless.
Yet, Google’s index is not limitless. Or rather, Google does not want it to be limitless.
In 2020, Google’s index contained 400 billion documents (pages). This figure was revealed during the cross-examination of Pandu Nayak, Google’s Vice President of Search, in the U.S. antitrust case against Google.
From Google’s perspective as a functioning, profitability-conscious business, a larger number of indexed pages means more storage space and more computing power to analyze, classify, and monitor them.
This leads to increased operational costs—something every company, including Google, is looking to reduce today.
To control the growth of its index, Google’s search engine employs a wide array of techniques, including canonicalization (deduplication), predictive crawling, penalties, and more.
But what about pages that have been in the index for a long time? Most likely, not all of them deserve to stay.
Google has a precise and well-defined mechanism to clean up its index.
Let’s investigate together!
Setting Up Our Playground
For our research, we will use Screaming Frog SEO Spider, whose paid version allows us to enrich our crawl data with information from the Google Search Console API.
- In the menu, select: Configuration > API Access > Google Search Console.
- Log in to your account.
- Go to the “URL Inspection” tab.
- Check both boxes as shown in the image below.
“URL Inspection” is an API endpoint within the Google Search Console API that lets you retrieve the technical status of a page as the search engine sees it.
This tool is very useful because it avoids inspecting URLs one by one in the Search Console interface. The only limitation is a quota of 2,000 URLs per day per property, which can be worked around by creating multiple properties.
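If you prefer to query Google directly rather than go through Screaming Frog, here is a minimal sketch using the google-api-python-client library. It assumes you already hold OAuth credentials with Search Console access; the site URL and the list of pages are placeholders to adapt to your own property:

```python
# Minimal sketch: querying the Search Console URL Inspection API directly.
# Assumes OAuth credentials with Search Console (webmasters.readonly) access.
from googleapiclient.discovery import build


def inspect_urls(credentials, site_url, urls):
    """Return indexing status and last crawl date for each URL (quota: 2,000/day per property)."""
    service = build("searchconsole", "v1", credentials=credentials)
    results = {}
    for url in urls:
        response = service.urlInspection().index().inspect(
            body={"inspectionUrl": url, "siteUrl": site_url}
        ).execute()
        status = response["inspectionResult"]["indexStatusResult"]
        results[url] = {
            "verdict": status.get("verdict"),           # e.g. PASS / NEUTRAL
            "coverage": status.get("coverageState"),    # e.g. "Crawled - currently not indexed"
            "last_crawl": status.get("lastCrawlTime"),  # ISO 8601 timestamp
        }
    return results
```

The fields read here (verdict, coverage state, last crawl time) are the same ones Screaming Frog surfaces as the “Summary”, “Coverage” and “Last Crawl” columns used below.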
It’s time to start your crawl.
Once the crawl has started, open the “Google Search Console” tab, where you’ll find plenty of useful information pulled directly from Google’s index:
We find ourselves facing 20 columns of scattered technical indicators, and for now, nothing seems clear.
But as Henri Bergson said 90 years ago: “Disorder is simply order that we are not looking for.”
Let’s narrow our focus to four key columns:
- Summary (whether the page is present in Google’s index or not)
- Coverage (the reason why the page is not indexed, if applicable)
- Last crawl (the date when Googlebot last visited the page)
- Days Since Last Crawled (the number of days elapsed since that last visit)
Here, we can see for each URL whether it is indexed by Google and how much time has passed since the last crawl.
Let’s sort the data by the “Days Since Last Crawled” column in ascending order.
And suddenly, our data organizes itself into a clear system of causes and effects.
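If you would rather work on the exported data outside of Screaming Frog, the same narrowing and sorting can be reproduced in a few lines of pandas. The file name is a placeholder and the column names follow the English Screaming Frog interface, so adjust them to your own export:

```python
import pandas as pd

# Placeholder file name: CSV export of the crawl with Search Console API data attached.
df = pd.read_csv("internal_all.csv")

# The four columns that matter for this analysis, plus the URL itself.
cols = ["Address", "Summary", "Coverage", "Last Crawl", "Days Since Last Crawled"]

# Sort by crawl recency: the least recently crawled pages end up at the bottom.
print(df[cols].sort_values("Days Since Last Crawled", ascending=True).tail(20))
```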
Let’s investigate this with five real cases covering different site types and markets.
Case 1: Official Website of a Tire Manufacturer (Portugal)
This is one of the most well-known tire manufacturers in the Portuguese market.
After applying the analysis described above, we observe two possible states in the “Summary” column:
- “URL is on Google”
- “URL is not on Google”
But the most interesting insight comes from the “Days Since Last Crawled” column.
It appears there is a causal relationship between crawl frequency and a URL’s indexation status.
More specifically, URLs seem to be deindexed if Googlebot hasn’t crawled them for 130 days.
An Important Clarification
When configuring Screaming Frog, we made sure to check the option to send only indexable URLs to the URL Inspector.
In other words, the data we’re analyzing includes only technically valid pages—no noindex tags, no rel=canonical pointing elsewhere, and no pages blocked by robots.txt or other restrictions.
To avoid survivorship bias, here are four more real examples.
Case 2: Sport News Website (France)
This is a completely different type of site, yet we observe the same pattern:
Pages that haven’t been crawled for 130 days are automatically removed from Google’s index.
They transition from the status “Submitted and indexed” to “Crawled – currently not indexed.”
Case 3: Fashion Magazine (Italy)
We observe exactly the same trend on this Italian fashion magazine:
Pages that haven’t been crawled for 130 days gradually shift from “Submitted and indexed” to “Crawled – currently not indexed.”
Case 4: Corporate Site with a Forum (Worldwide)
Yet another type of website—a business site with an integrated forum for Q&A.
Once again, the same observation: the 130-day threshold applies.
Pages that haven’t been crawled for this duration tend to shift from “Submitted and indexed” to “Crawled – currently not indexed.”
Case 5: Governmental Website (France)
For the fifth and final example, a French institutional website shows the same pattern:
Pages that haven’t been crawled for 130 days transition from “Submitted and indexed” to “Crawled – currently not indexed.”
The 130-Day Rule
In all the observed examples, we consistently see the same trend:
- The indexation status of our pages depends on the crawl frequency by Google.
- It seems that Google applies a static crawl threshold of 130 days. Each page on the site has its own crawl frequency, which evolves over time. If this frequency drops to the point where Googlebot hasn’t crawled the page for 130 days, the page is removed from the index.
- Therefore, it’s important to identify pages that are approaching or past the 130-day window and work on improving their value (a sketch of how to flag them follows this list).
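Continuing the pandas sketch from earlier, one way to put this into practice is to flag indexed pages that are approaching or past the threshold. The 30-day warning margin below is an arbitrary choice of mine, not something Google documents:

```python
THRESHOLD_DAYS = 130   # deindexation threshold observed in the cases above
WARNING_MARGIN = 30    # arbitrary margin to catch pages before they cross the line

# Pages currently in the index, per the "Summary" column.
indexed = df[df["Summary"] == "URL is on Google"]

at_risk = indexed[indexed["Days Since Last Crawled"] >= THRESHOLD_DAYS - WARNING_MARGIN]
dropped = df[df["Days Since Last Crawled"] >= THRESHOLD_DAYS]

print(f"{len(at_risk)} indexed pages not crawled for {THRESHOLD_DAYS - WARNING_MARGIN}+ days")
print(f"{len(dropped)} pages past the {THRESHOLD_DAYS}-day threshold")
```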
What to Do with Pages Not Crawled for 130 Days?
Now, the legitimate question: What do we do with this knowledge?
To answer this, it’s important to understand how the search engine allocates and distributes its crawling resources.
The crawl frequency is a dynamic value that the search engine constantly seeks to optimize in order to crawl the pages that are most worthy of it.
“If you want to increase how much we crawl, then you somehow have to convince search that your stuff is worth fetching, which is basically what the scheduler is listening to.”
Gary Illyes, Google Analyst.
Crawl Frequency Calculation
From a website standpoint, the crawl frequency is primarily determined by two groups of factors:
- Content Quality of the Page
- PageRank of the Page
Now, gather the pages that haven’t been crawled for 130 days and try to answer the following questions:
From a Qualitative Perspective:
- What do these pages have in common?
- Do they belong to a specific type?
For example:
- On our tire manufacturer site (Case 1): Among the pages in question, we find category pages by brand that lack products or any differentiating content.
- On the media site (Case 2): These are pages with very similar tags that can be optimized and enriched further.
- On the fashion magazine site (Case 3): These are very short pieces of content originally designed for social media distribution.
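A quick, rough way to see what these stale pages have in common is to group them by their first path segment, which often maps to a template or page type. This is only a sketch built on the DataFrame from the earlier examples:

```python
from urllib.parse import urlparse

# Pages Googlebot has not visited for 130 days or more.
stale = df[df["Days Since Last Crawled"] >= 130].copy()

# First path segment as a rough proxy for template / page type.
stale["section"] = stale["Address"].map(
    lambda u: urlparse(u).path.strip("/").split("/")[0] or "(root)"
)

# Which sections concentrate the pages Google has stopped crawling?
print(stale["section"].value_counts().head(15))
```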
By improving the quality of these pages, you can enhance their crawlability and, in turn, their indexation.
“Scheduling is very dynamic. As soon as we get the signals back from search indexing that the quality of the content has increased across this many URLs, we would just start turning up demand.”
Gary Illyes, Google Analyst.
From the Perspective of PageRank
In addition to content, the frequency with which a page is crawled by Google is closely tied to its authority, which is formalized in the concept of PageRank.
The deeper a page is within the site structure, the less important it is considered.
And when this importance drops to a minimal threshold, to the point where Googlebot doesn’t deem it necessary to crawl the page more than once every 130 days, it eventually gets removed from the index.
This is essentially a clean-up process performed by Google, where pages deemed unimportant are removed from the index. This also explains why some pages that were indexed for a long time can suddenly be excluded.
Questions to Consider Regarding PageRank:
- Where are the deindexed pages located in the site structure?
- What is their depth level?
- Do they receive enough internal and external links?
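To answer these questions at scale rather than page by page, you can compare the structural metrics Screaming Frog already collects between stale and recently crawled pages. A sketch, assuming the “Crawl Depth” and “Inlinks” columns are present in your export:

```python
# Compare structural signals between stale pages and recently crawled ones.
# Assumes the "Crawl Depth" and "Inlinks" columns are present in the export.
stale = df[df["Days Since Last Crawled"] >= 130]
fresh = df[df["Days Since Last Crawled"] < 130]

for label, subset in [("stale (130+ days)", stale), ("recently crawled", fresh)]:
    print(
        f"{label}: median depth = {subset['Crawl Depth'].median()}, "
        f"median inlinks = {subset['Inlinks'].median()}"
    )
```

If the stale group sits markedly deeper and receives fewer internal links than the fresh group, the internal linking structure is a good place to start.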
Two Final Tips:
- If you want to know which pages are considered the most valuable by Google, crawl frequency is one of the most reliable indicators.
- To conduct this study across your entire site, you can analyze your server logs. Ask your host or developer to export the logs for at least the past 130 days, then cross-reference them with your crawl data: pages found in your crawl that show no Googlebot visit in the logs over the last 130 days are almost certainly not indexed (see the sketch below).
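As a rough illustration of that cross-referencing step, here is a sketch that reads a combined-format access log and a plain list of crawled URL paths. Both file names are placeholders, real log formats vary, and proper Googlebot verification would require a reverse DNS check, which is skipped here:

```python
import re
from datetime import datetime, timedelta

# Hypothetical file names; adapt to your own exports.
LOG_FILE = "access.log"          # combined-format server log covering 130+ days
CRAWL_FILE = "crawled_urls.txt"  # one URL path per line, from your crawler export

cutoff = datetime.now() - timedelta(days=130)

# Naive combined-log parsing: [day/month/year:...] "GET /path HTTP/1.1" ... "user-agent"
line_re = re.compile(r'\[(\d{2}/\w{3}/\d{4})[^\]]*\] "(?:GET|HEAD) (\S+)[^"]*" .* "([^"]*)"$')

seen_by_googlebot = set()
with open(LOG_FILE) as f:
    for line in f:
        match = line_re.search(line)
        if not match:
            continue
        date_str, path, user_agent = match.groups()
        if "Googlebot" not in user_agent:
            continue
        if datetime.strptime(date_str, "%d/%b/%Y") >= cutoff:
            seen_by_googlebot.add(path)

with open(CRAWL_FILE) as f:
    crawled = {line.strip() for line in f if line.strip()}

never_visited = crawled - seen_by_googlebot
print(f"{len(never_visited)} crawled URLs with no Googlebot hit in the last 130 days")
```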
Source: This is the English version of the article from my newsletter “SEO, Data & Growth”: https://newsletter.alekseo.com/p/12-google-and-la-regle-des-130-jours.