04-14-2004, 09:06 AM
I know this question is so abstract that it is almost philosophical... but XL-Dennis has a post (here, on Ozgrid (http://ozgrid.com/forum/viewthread.php?tid=10318)) where he makes an interesting point:
"Nowadays it is quite common that we have 40 / 80 / 120 GB hard drives, and the lack of a professional, sophisticated file-management tool has become painfully obvious. For instance, searching for a source/file with search engines such as Google et al. on the Internet is much easier and much faster than searching for a file on the local hard drive."
So this really got me wondering... how the heck is that possible??
Using Google (of course!) to help search for the solution, I tracked down one article that seems to be on point.
Why is Google so Fast? (http://www.searchengineposition.com/info/Articles/GoogleFast.asp)
But this answer basically boils down to "it's all in RAM". Well, that's fine, but to go a step further, it must have a pre-defined set of searches already done, right? I suppose it could have pre-filtered searches on every word in the dictionary, plus keep track of all words that have been requested in the past? Or I guess it breaks down every single document as it is entered into the database, recording a new keyword entry for any words not previously present in the database?
Even in RAM, it's unimaginable that a raw search over billions of web pages amounting to trillions of words could run in real time... Yet bizarre words like "yodeleheehoo" do generate results!
Any thoughts on this??
04-14-2004, 09:14 AM
Your file system is not a database. Yet.
04-14-2004, 09:50 AM
Your file system is not a database. Yet.
Foreshadowing Longhorn? :)
I read an article about the two guys who created Google, and it seems the answer to the question is a closely guarded secret by them. It's like the secret formula for Wonka's Everlasting Gob-Stopper. :D :D :D
04-14-2004, 10:05 AM
Well, I downloaded a few on-line books & grabbed all unique words from them & poked 'em into a database. After all was done and cleaned up, I ended up with about 23,000 unique words. The books were
The Lord of the Rings (all 3 books)
Stranger in a Strange Land (unabridged)
Each of these books is on the order of 400 pages long, and each page holds approximately 500 words. So we're talking about roughly 1 million words compressed down to 23,000. If I processed more books, I would expect to see fewer and fewer new words added with each one. After processing "The Hobbit", for instance, the database had about 15K words.
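For what it's worth, here is a minimal Python sketch of that kind of unique-word count; the file name and the tokenization rule are stand-ins, not the actual setup used above:
```python
# Count the unique words in a plain-text book. The file name is
# hypothetical; the regex is one simple choice of tokenizer.
import re

def unique_words(path):
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    # Split on anything that isn't a letter or an apostrophe.
    return set(re.findall(r"[a-z']+", text))

words = unique_words("lord_of_the_rings.txt")  # hypothetical file
print(f"{len(words)} unique words")
```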
I've been told that in the English language there are approximately 80,000 unique words, including technical terms. Throw in made-up words and you've probably doubled that number. So even though there are trillions of words out there, the number of unique words across all languages should be in the millions. So, if you set up an indexing schema, a word would point at (almost always) multiple page links. When you perform a Google search on a single word, you get a very quick return (the count of all records, plus the first N matches). When you search with multiple words, the link lists for each word are compared, and only the pages common to all of them are returned.
Databases are designed to perform quick searches, if properly indexed.
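That indexing schema is easy to sketch. Below is a toy inverted index in Python: each word maps to the set of pages containing it, and a multi-word query intersects those sets, exactly as described above. The documents and tokenization are invented for illustration; a real engine would store postings far more compactly:
```python
import re
from collections import defaultdict

# word -> set of page URLs containing that word
index = defaultdict(set)

def add_document(url, text):
    for word in set(re.findall(r"[a-z']+", text.lower())):
        index[word].add(url)

def search(query):
    postings = [index.get(w, set()) for w in query.lower().split()]
    if not postings:
        return set()
    # Only pages containing ALL query words survive the intersection.
    return set.intersection(*postings)

add_document("page1", "the hobbit lived in a hole")
add_document("page2", "a strange land with a strange hobbit")
print(search("strange hobbit"))  # {'page2'}
```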
So, really, the answer ISN'T that "it's all in RAM." The answer is that Google crawls and correlates massive amounts of data every month. I log the visitors to my site, and: "Hey! There's the Google bot, right on time!" They grab the data, extract the search words and terms, and add them to their index, so you can quickly find it.
Google is licensing their search engine to corporations so they can search and correlate their own proprietary data and find it quickly.
04-14-2004, 10:13 AM
Ok, good point... so it's not really "trillions of words out on the web" cross-referenced against a seemingly "infinite" number of possible word searches, but actually just billions of web pages cross-referenced against thousands of keywords.
I can start to see how this is more manageable now... but it still is an interesting challenge, eh? OnErr0r would have a field-day working on optimizing this one! :)
04-14-2004, 10:19 AM
Yes, but even saying that, you've still got to include all of the foreign words that get searched; that hikes up the number of words to be indexed and cross-referenced. Each search also has to have its "noise" words removed, and then there's the number of people all using Google at the same time...
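Noise-word (stop-word) removal is the simplest part to sketch; the word list below is a tiny stand-in for whatever list a real engine uses:
```python
# Strip common "noise" words from a query before hitting the index.
# This stop-word list is illustrative, not Google's actual list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is"}

def strip_noise(query):
    return [w for w in query.lower().split() if w not in STOP_WORDS]

print(strip_noise("the history of the hobbit"))  # ['history', 'hobbit']
```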
I log the visitors to my site, and: "Hey! There's the Google bot, right on time!" They grab the data, extract the search words and terms, and add them to their index, so you can quickly find it.
Just wondering, how often does Google send its bots crawling around your site? Do they come on a specific date, or is it more like every week or so?
I read that Google deals with the thousands of requests they get every minute by just having lots and lots of servers all doing a bit of the work.
04-14-2004, 11:44 AM
Google runs on a cluster, which is part of the reason why searching at different times produces different results. If your request is routed to a different section of the cluster, you will get the results unique to those nodes. This allows it to hold most, if not all, of its data in RAM. Using proprietary data connections (faster than Gigabit), a cluster of computers can share RAM nearly as fast as if it were one giant computer.
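Here's a hypothetical sketch of that routing effect: if queries rotate across replica groups whose index snapshots are slightly out of sync, the same search can return different results. The replica contents and the round-robin routing here are invented for illustration:
```python
import itertools

# Each "replica" holds its own (slightly different) copy of the index.
replicas = [
    {"hobbit": {"page1", "page2"}},            # crawled yesterday
    {"hobbit": {"page1", "page2", "page3"}},   # crawled this morning
]

round_robin = itertools.cycle(replicas)

def search(word):
    # Route the query to whichever replica group is next in rotation.
    return next(round_robin).get(word, set())

print(search("hobbit"))  # served by replica 0
print(search("hobbit"))  # served by replica 1: one extra result
```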
04-14-2004, 03:20 PM
The Google Cluster Architecture: http://www.computer.org/micro/mi2003/m2022.pdf
From a usenet post:
Google is fast, because:
"The interface is clear and simple. Pages load instantly."
"By fanatically obsessing on shaving every excess bit and byte from
our pages and increasing the efficiency of our serving environment,
Google has broken its own speed records time and again. Others assumed
large servers were the fastest way to handle massive amounts of data.
Google found thousands of networked PCs to be faster. Where others
accepted apparent speed limits imposed by search algorithms, Google
wrote new algorithms that proved there were no limits. And Google
continues to work on making it all go even faster."
"Linux cluster (more than 10,000 servers)"
Google doesn't search the whole web when it gives an answer to a query; instead, each query goes through three steps:
"1. The web server sends the query to the index servers."
"2. The query travels to the doc servers"
"3. The search results are returned to the user"
04-14-2004, 03:34 PM
As to the earlier comment about Longhorn's file system (WinFS): it is only going to be a partial DB (better than nothing, though). It's really just a metadata DB model laid over a conventional FS, so it won't be the same as a true DB. In fact, you can turn it off and the FS will still function fine.
04-14-2004, 09:09 PM
Wrong, wrong, and wrong.
The technology behind Google's search techniques is explained here:
04-14-2004, 09:20 PM
Ahahahaha... :D Love it!
04-15-2004, 02:40 AM
Ermmm - remind me not to look for a job as an engineer at Google - you'd spend all day with a large metal scraper!
I wonder if the competition have tried breeding insane pigeons to infiltrate the coops?
04-15-2004, 09:34 AM
I wonder what the Google parking lot looks like ....