You are at a conference and the speaker compares two terms using the number of estimated hits returned by Google Search. This is a very common thing to do, yet almost nobody seems to care that those numbers bear little relation to the number of actual results.
When you search for something with Google Search and Google tells you that there are 11,346,000 results, this number is completely made up. There is an algorithm which determines those numbers without looking at the complete set of data, and this algorithm is based on some (secret) ingredients. Google has enormous capabilities, but determining the number of actual hits in their search data cannot be done in real time.
Let's take a closer look.
Suppose that Google Search could really determine the real number of results. Following set theory and Boolean algebra, I defined the following assumptions for searching two arbitrary search terms «foo» and «bar», where A is the number of hits for «foo» alone and B is the number of hits for «bar» alone:
|Query||Number of Hits Returned|
|foo AND bar||C; less than A; less than B|
|foo OR bar||D; more than A; more than B|
|foo -bar||E; less than A|
|bar -foo||F; less than B|
|foo AND bar||C|
|pages of 10 for "foo AND bar"||C/10|
For the AND/OR/NOT queries, we can only give rough estimates since it is hard to determine better numbers without the whole data-set.
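The assumptions in the table can be written down as a small consistency check. The following Python sketch is only an illustration: the function name `check_assumptions` and the sample numbers are made up for this example and are not measurements. It uses non-strict comparisons, because a subset may be exactly as large as its superset.

```python
def check_assumptions(a, b, c, d, e=None, f=None):
    """Return a list of violated assumptions for two search terms.

    a, b: hits for «foo» and «bar» alone
    c:    hits for «foo AND bar»
    d:    hits for «foo OR bar»
    e, f: hits for «foo -bar» and «bar -foo» (optional)
    """
    violations = []
    if c > min(a, b):
        violations.append("AND returned more hits than a single term")
    if d < max(a, b):
        violations.append("OR returned fewer hits than a single term")
    if e is not None and e > a:
        violations.append("foo -bar returned more hits than foo")
    if f is not None and f > b:
        violations.append("bar -foo returned more hits than bar")
    return violations


# Made-up numbers, for illustration only.
consistent = check_assumptions(a=1000, b=200, c=50, d=1150, e=950, f=150)
inconsistent = check_assumptions(a=1000, b=200, c=50, d=900, e=1200, f=150)
print(consistent)    # []
print(inconsistent)  # two violations: the OR check and the foo -bar check
```

An empty list means the numbers are at least plausible as set sizes; it does not mean they are correct.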
In order to get numbers myself, I defined pairs of search terms and used Google Search (via Google.com) to query for the terms.
My software environment was Debian GNU/Linux Jessie with the most current Tor Browser. I assume that Tor Browser gives me less personalized search results, because I am not logged in to any Google account and Tor Browser does not expose a highly unique browser fingerprint:
Within our dataset of several hundred thousand visitors, only one in 2667.37 browsers have the same fingerprint as yours.
Currently, we estimate that your browser has a fingerprint that conveys 11.38 bits of identifying information.
Using Tor does not prevent localization at all: the Tor exit node I was using is associated with a geographical location. However, I was using the same Tor connection for all queries. Therefore, the numbers should be comparable to each other, even though the reproducibility of the exact numbers is questionable at best.
My Tor Browser identifies itself as Firefox 45.6.0 (Build identifier: Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0) and the time of queries was roughly 2017-01-15 7pm CET.
The first set of terms was «emacs» and «orgmode». Since Google Search is case-insensitive, I am only using lower case terms here.
|emacs AND orgmode||251000||<346000|
|emacs OR orgmode||6370000||>6210000|
|emacs||6200000||6210000||10000 hits difference|
|orgmode||347000||346000||1000 additional hits|
|emacs AND orgmode||252000||251000||1000 additional hits|
The last row is not really about estimated numbers: I determined the number of actual result pages («actual pages») using the following algorithm: search for «foo AND bar», then follow the result pages using the «Next» button until the end is reached. For the terms above, it was this URL. Then, Google shows the following statement:
In order to show you the most relevant results we have omitted some entries very similar to the 150 already displayed. If you like you can repeat the search with the omitted results included.
I clicked on the link behind «search with the omitted results included» and followed the result pages until no more results were returned. For the terms above, the corresponding final URL I got was this. This result page shows a statement like «Page 56 of about 252,000 results (1.19 seconds)». There is clearly a discrepancy: 56 pages with ten results each are at most 560 results, while Google states that there are 252000 results. That's only 0.22 percent of the estimated results.
Apart from this, there were only minor differences between the expected result numbers and the numbers returned by Google.
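The arithmetic behind the 0.22 percent can be reproduced in a few lines of Python. This is a minimal sketch; the function name `reachable_share` is made up for this example, and it assumes ten results per page, so the value is an upper bound on what can actually be paged through.

```python
RESULTS_PER_PAGE = 10  # Google showed ten results per page in this experiment


def reachable_share(last_page, estimated_hits, per_page=RESULTS_PER_PAGE):
    """Upper bound on the share of estimated hits reachable via paging."""
    return last_page * per_page / estimated_hits


# «emacs AND orgmode»: page 56 of about 252,000 results
share = reachable_share(last_page=56, estimated_hits=252_000)
print(f"{share:.2%}")  # 0.22%
```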
The second set of terms was «linux» and «torvalds»:
|linux AND torvalds||472000||<3070000|
|linux OR torvalds||34600000||>396000000||contradiction: <10% of expected value|
|linux||397000000||396000000||one million hits difference|
|torvalds||3090000||3070000||20000 hits difference|
|linux AND torvalds||491000||472000||19000 hits difference|
The query for «linux OR torvalds» returned far fewer hits than the number estimated for «linux» alone, which contradicts the assumptions.
When querying the terms and their AND combination a second time, Google returned more results than for the first query.
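To make the size of this contradiction concrete, here is a small Python sketch using the numbers from the table. The helper `or_bounds` is made up for this example. Its upper bound (the sum of both single-term counts) is not part of my assumptions table; it is simply the standard set-theory limit for a union.

```python
def or_bounds(hits_a, hits_b):
    """Lower and upper bound for the size of the union of two sets."""
    return max(hits_a, hits_b), hits_a + hits_b


linux, torvalds = 396_000_000, 3_070_000   # single-term estimates
reported_or = 34_600_000                   # estimate for «linux OR torvalds»

lower, upper = or_bounds(linux, torvalds)
plausible = lower <= reported_or <= upper
print(plausible)                      # False
print(f"{reported_or / lower:.1%}")   # 8.7%
```

The OR query reports less than a tenth of the hits that «linux» alone is supposed to have.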
Once again, the number of actual result pages differs greatly from the number of hits shown by Google: only 0.13 percent of results could be navigated to.
The next set of terms was «vienna» and «mozart»:
|vienna AND mozart||1190000||<84500000|
|vienna OR mozart||271000000||>187000000|
|mozart||85000000||84500000||500000 additional hits|
|vienna AND mozart||1190000||1190000|
Here it was interesting to see that vienna without mozart returned even more hits than vienna alone. This is a clear contradiction. The same holds for mozart without vienna, which returned more hits than mozart alone.
The second query for «mozart» returned 500000 more hits than the first one.
The number of actual result pages is only 55, which is dramatically fewer than the 119000 it should have been.
Now for the next set of terms: «trump» and «fake»:
|trump AND fake||81400000||<675000000|
|trump OR fake||900000000||>885000000|
|trump AND fake||81400000||81400000|
It is interesting to see that with these result sets, the numbers do fulfill the assumptions with only one exception: the number of actual results is again only a tiny fraction of the number of hits stated by Google. The fewer than 570 results found are far fewer than the 81400000 estimated, which would fill 8140000 pages of ten results each.
The next set of terms was chosen somewhat differently. The previous terms were rather general, resulting in millions of estimated search results. To compare those sets of terms with a set that is not likely to return millions of results, I chose «linuxtage» and «privatsphäre» (German for privacy):
|linuxtage AND privatsphäre||12000||<47000|
|linuxtage OR privatsphäre||40200000||>40100000|
|linuxtage AND privatsphäre||12000||12000|
We still see a huge difference in the number of actual result pages. Overall, smaller numbers might indicate better result estimates.
The next terms are «treibsand» (German for quicksand) and «burggraben» (German for moat):
|treibsand AND burggraben||1130||<319000|
|treibsand OR burggraben||477000||>470000|
|treibsand AND burggraben||1130||1130|
This result supports the guess that less general terms tend to return better estimates. The number of actual result pages is still way off the estimated value. However, the difference is much smaller than in the other examples.
Many more things can be tested. For example, «subtracting» a less general term from a general term should lead to almost the same number as was estimated for the general term alone. And so on.
The web page SEO Chat has a nice article about this topic. They summarize the reason for these arbitrary numbers in an excellent way:
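Such a subtraction test could be sketched as follows. This is only an illustration under stated assumptions: the helper names, the 5 percent tolerance, and the «reported» value are made up, since I did not record a number for «emacs -orgmode». Only the two input estimates come from the first table.

```python
def expected_without(hits_general, hits_both):
    """Size of 'A without B' for exact sets: |A| minus |A AND B|."""
    return hits_general - hits_both


def close_enough(reported, expected, tolerance=0.05):
    """True if the reported count is within the tolerance of the expected one."""
    return abs(reported - expected) <= tolerance * expected


# Estimates for «emacs» and «emacs AND orgmode» from the first query round.
expected = expected_without(hits_general=6_210_000, hits_both=251_000)
print(expected)  # 5959000

# Hypothetical reported value for «emacs -orgmode», for illustration only.
print(close_enough(reported=5_900_000, expected=expected))  # True
```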
One of the reasons that engines like Google or Bing can find so many results is that they don’t bother to collect them for your use. The chances that you will need them are slim to none, and even if you did by chance require a deeper web page, you wouldn’t be able to find it. There are just too many to sift through. Instead, you would have to go back and narrow down your search terms to get a better list of choices, something we are all pretty used to doing by now.
I can agree with this from my point of view. However, they also state that «those results still do exist», which my own experiment contradicts.
Yes, Google might have those hundreds of thousands of results in their back end. But then I cannot think of any reason why they would not show them to me while I navigate through the result set.
Wikipedia itself has a very interesting article about Google search and the notability of a subject. Unfortunately, they don't mention anything about the number of hits returned.
Here is an article that explains some discrepancies: when Google Search does a query for «foo», it does not look at as much data as for a query for «foo -bar». Therefore, the second query is able to find more results than the first one. Well, this might be a perfectly fine explanation, but it also underlines my basic premise: the number of estimated search results is a completely made-up, arbitrary number.
Maybe the estimated number of search results will vanish in the future.
In the meantime: don't use those numbers to compare terms.