
The Internet must be one of the most chaotic environments on the planet. There are hundreds of thousands of people on the net, and all of them are either putting information on the Internet or getting it (most do both). What is the best way to search the Internet? This is not an easy question to answer. There are as many ways to search the Internet as there are places to be searched. The web is quickly becoming the interface of choice for accessing the content on the Internet, so that is what I will concentrate on. Some search services are built on WAIS, Archie, gopher, or proprietary technology (many vendors will gladly sell you their database and search engines), but I am not going to linger on the technologies underlying these databases since the same methods can be used on most of them. There are some special functions that can be used on one that won't work on another, but mostly you will get the hang of which ones support advanced features like boolean searching (quite common) and regular expressions (more powerful, but much less common). I will address boolean searching to some extent, but regular expressions, unfortunately, are not always identical between two different search engines, so take my descriptions with a grain of salt. Fortunately, regular expression searches in their simplest form look the same as basic phrase searches, so you will probably have little trouble with simple searches.
Here is a subset of the different kinds of searches that you are bound to find out there on the Internet. For some of them I have created a sample data set to search and provided sample queries. For others I could not find a viable sample query for so small a data set, so I simply gave (hopefully obvious) examples. Also, I have listed sites on the Internet that use the type of query shown. This list is obviously incomplete since there are hundreds (if not thousands) of different search engines on the Internet. If you need clarification of any of these, feel free to send me e-mail, but please exhaust the online help for the service you are using before you resort to asking me a question. And don't expect me to do your research for you: I will happily answer questions on how to search a specific service, but I cannot answer questions on where to search. For some ideas on where to search, look at my bookmarks (on my web page, the address is at the end of this document). Also, look at search.com, which has compiled an enormous list of places to search on the Internet.
In the descriptions below I have been pretty cavalier in my use of capitalized words to (hopefully) enhance the clarity of the paragraphs. You need to check how the search engine you are using treats capital letters. Some ignore them completely while others enforce them (meaning that if you capitalize a word, only words with identical capitalization will be found). In my search engine summaries below I have noted how the engine handles capitalized words (if I know). If you get back fewer hits than you expected, check your capitalization (you are almost always safe lower casing every query).
A keyword search engine simply takes a list of words. Often you can pick (using a radio button or drop list) whether all of the words must be found (AND) or whether any of the words can be found (OR) before a match is made. Some let you designate whether the keywords are to be considered full words (separated by spaces and punctuation) or simply letters (sometimes called sub-strings). Usually keyword searches do not care about the case of the keywords. Often you can enclose multiple words in quotes and they will be treated as one word (they may have to exist unbroken by line endings in the text for the search to work).
Punctuation within a list of keywords is usually ignored, therefore joe sara is the same as joe, sara or joe-sara or joe&sara. Most keyword engines provide a limiting operator (often +) and an exclusion operator (often -) that you can use to narrow your query. For example, to get pages that must contain cookies, preferably made with chocolate, nuts, or caramel, but never turtles, you might use "+cookies chocolate nuts caramel -turtle". Many also provide a wildcard operator (usually *) that indicates that a keyword begins with some characters (for example "cal*" would match calendar, calculate, caltech, etc.). Some allow you to specify the beginning and end of the word (for example "st*p" would match stop, step, stewardship, etc.). Most require a certain number of characters before the wildcard operator to avoid wasting resources in the search engine (often 3 characters, so that "a*" and "ab*" are both illegal but "abc*" is not).
Also, you should try hard to use uncommon words that relate to what you want to find. For example, if you are looking for information on Jeffrey Dahmer, searching for jeff is unlikely to be useful, whereas searching for dahmer is more likely to find something useful, and searching for dahmer criminal insane (or even "Jeffrey Dahmer" criminal insane) would be even more specific. Be careful to check whether the site's default relationship for multiple keywords is AND; if it is OR, adding more keywords actually makes your search less specific. If the site sorts your results for you, you can help it along by duplicating the keyword that is most important (AltaVista lets you specify which words are important with a ranking field). This will double (triple, etc.) the number of hits associated with that word. For example, dahmer dahmer criminal insane would make sure that pages that talk about Dahmer specifically, rather than simply mentioning his name, rise to the top (for example, check out http://www.rru.com/~meo/WiredDiff/ch03/val.html).
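To make the keyword rules above concrete, here is a minimal Python sketch of how such a matcher might behave. It is not any real engine's implementation: the function name keyword_match, its default parameter, and the whole-word, case-insensitive matching are my own assumptions for illustration.

```python
import re

def keyword_match(text, query, default="AND"):
    """Return True if text satisfies a simple keyword query.

    Hypothetical rules: +word is required, -word is excluded, * is a
    wildcard, punctuation inside a keyword is ignored, and case is ignored.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())

    def found(term):
        # Turn the * wildcard into "any run of characters" within one word.
        pattern = re.compile(".*".join(re.escape(p) for p in term.split("*")))
        return any(pattern.fullmatch(w) for w in words)

    buckets = {"+": [], "-": [], "": []}
    for token in query.lower().split():
        op = token[0] if token[0] in "+-" else ""
        # joe-sara or joe&sara is treated the same as joe sara.
        for term in re.findall(r"[a-z0-9*]+", token[len(op):]):
            buckets[op].append(term)

    if any(found(t) for t in buckets["-"]):
        return False
    if not all(found(t) for t in buckets["+"]):
        return False
    if not buckets[""]:
        return True
    combine = all if default == "AND" else any
    return combine(found(t) for t in buckets[""])
```

With default="OR", the cookie query from above accepts a page reading "Chocolate chip cookies" and rejects one reading "Caramel turtle cookies". Switching default between "AND" and "OR" shows why extra keywords narrow a search on one kind of site and widen it on another.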
Services that use keyword searches: Yahoo, Excite, Alta Vista (simple query section)
| Service | Default relationship |
| Yahoo | AND |
| Four11 | AND |
| HotBot | AND |
| InfoSeek | OR |
| Lycos | AND |
| WebCrawler | BOTH |
| Excite | OR |
| Alta Vista | OR |
Example data for keyword and phrase search samples:
Sample queries:
| Query | Default relationship OR | Default relationship AND |
| Joe sara | Matches lines 1, 2 and 4 | Matches line 1 |
| Joe money | Matches lines 1, 3, and 5 | Matches none |
| "band saw" money | Matches lines 1, 3 and 5 | Matches lines 3 and 5 |
A phrase search is a very simple search and you will probably not find too many of these on the Internet. A phrase search looks for an exact match for the search text. There is no concept of and/or because there is only one "thing" that is being searched for.
Services that use phrase searches: Some gopher search tools
Sample queries:
| band saw | Matches lines 1, 3 and 5 |
| saw band | Matches none |
| joe c | Matches lines 1 and 4 |
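A phrase search is simple enough to sketch in a few lines of Python. The data lines below are hypothetical stand-ins, not the sample data set the queries above refer to, and ignoring case is an assumption on my part.

```python
def phrase_search(lines, phrase):
    """Return the 1-based numbers of the lines that contain the exact phrase."""
    needle = phrase.lower()  # assumption: the engine ignores case
    return [n for n, line in enumerate(lines, start=1) if needle in line.lower()]

# Hypothetical data, made up for this illustration only.
lines = [
    "Joe Cool bought a band saw",
    "Sara sells sea shells",
    "A band saw costs money",
]

hits_band_saw = phrase_search(lines, "band saw")
hits_saw_band = phrase_search(lines, "saw band")
```

Because the whole phrase is one search item, reversing the words ("saw band") finds nothing even though both words appear on two of the lines.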
A concept search allows you to type in a topic or short description of what you are looking for. These engines use a thesaurus to find what you are looking for. Typing extraneous words like "and," "or," "it," etc. and punctuation may confuse these search engines. There is a big difference between the keywords cut and cutting in a search like this, since they will result in completely different sets of synonyms.
Services that use Concept searches: MetaCrawler, Excite
A natural search is written in a non-structured language (like English) which allows the user to describe what they are looking for; the engine is then responsible for distilling the query down to a structured format. A sample natural search might be "I am looking for programs related to winsock that allow FTP file transfers." or "FTP programs for winsock". In theory these two would be equivalent and should return very similar, if not identical, results, but in practice the first will probably not work at all while the second one is very likely to find something. This is because, while the engines claim to process "real" sentences, there are none that are powerful enough to actually parse full English, so you still need to put thought into how you phrase your query. Try to find the most succinct phrase that describes what you are searching for and, in general, stay away from fluff (like "I want," "I am looking," "Why can't you find," etc.).
Services that use natural searches: Infoseek
Regular expressions are very precise ways of denoting search strings. They were designed by computer scientists to describe computer languages (lexical analysis, for those who care). Most programmers' editors use regular expressions for searching, and anyone with more than a passing familiarity with Unix likely uses them on a weekly if not daily basis. Regular expressions are much too complicated to cover completely here, so I will only show you some simple constructs. Remember that different engines will probably have all of these operations, but the actual operators may be different from engine to engine (I've left out the ones that are most likely to be different between engines). One feature of regular expressions is that there are often many ways to express the same query.
Services that use regular expression searches: LookUP!, Archie
Most of the operations are shown below:
| Operator | Meaning | Example |
| | (vertical bar) | OR | a|b |
| ^ | Beginning of line | ^The game |
| $ | End of line | that was that.$ |
| () | Grouping | (this)|(that) |
| * | 0 or more occurrences | a*bcd |
| + | 1 or more occurrences | a+bcd |
| [] | Matches any in set | [abcd] |
| [^] | Matches none in set | [^abcd] |
| \ | Escape character | 5\*3=15 |
| . (period) | Matches any character | time.*ss |
Sample queries:
| [iI]nternet | Matches "Internet" or "internet" |
| news[123456789] | Matches "news1" or "news2" or "news3", ..., "news9" |
| (i|I)nternet | Matches "Internet" or "internet" |
| (t|b)oy | Matches "toy" or "boy" |
| (toy)|(boy) | Matches "toy" or "boy" |
| ((to)|(bo))y | Matches "toy" or "boy" |
| x*yz | Matches "yz" or "xyz" or "xxyz" or "xxxyz" or ... or "xxxxxxxxxxxxyz", etc. |
| x+yz | Matches "xyz" or "xxyz", etc., but does NOT match "yz" |
| kiss.. | Matches "kisser" or "kisses" or "kiss43", etc. |
| time\.001 | Matches "time.001" only |
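If you have Python handy, you can try queries like these yourself with its standard re module. This is only a sketch: Python's dialect is one of many, a search engine may differ, and the helper name matches is my own. It uses re.fullmatch, so the pattern has to describe the whole string.

```python
import re

def matches(pattern, text):
    """True if the whole of text is described by the regular expression."""
    return re.fullmatch(pattern, text) is not None

samples = [
    (r"[iI]nternet", "internet"),
    (r"news[123456789]", "news7"),
    (r"news[123456789]", "news0"),
    (r"(t|b)oy", "toy"),
    (r"t|boy", "toy"),   # without grouping this means "t" or "boy"
    (r"x*yz", "yz"),
    (r"x+yz", "yz"),
    (r"kiss..", "kiss43"),
    (r"time\.001", "time.001"),
    (r"time\.001", "timex001"),
]

for pattern, text in samples:
    print(f"{pattern!r:22} {text!r:12} {matches(pattern, text)}")
```

The t|boy line shows why grouping matters: in Python the OR operator applies to everything on either side of it unless parentheses say otherwise.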
Boolean searches use boolean logic. This is one of the most common methods of specifying advanced searches on the Internet. Boolean uses the concepts of TRUE and FALSE. The simple way to think about it is to imagine the search engine comparing all possible pages on the web with your query (in the case of a Web search) and only those that evaluate to TRUE are returned to you as being found. A boolean search engine recognizes at least three (sometimes more) operators: AND, OR, and NOT. Many times these engines also use NEAR which is not (strictly speaking) a boolean operator.
Some services have separate simple and advanced searches (AltaVista for example), while others rely on parsing the search string (requiring you to type the boolean operators in all UPPER CASE, HotBot for example).
Services that use boolean searches: AltaVista, Excite, HotBot, Open Text, WebCrawler, and most others
The following is a description of each of these operators:
AND
The AND operator matches only if the engine finds the items on both sides of AND. For example, Joe AND Sara, Money AND Politics, etc. The truth table for AND is as follows:
| AND | TRUE | FALSE |
| TRUE | TRUE | FALSE |
| FALSE | FALSE | FALSE |
OR
The OR operator matches if the engine finds either (or both) of the items around the OR. For example, Joe OR Sara, Money OR Love, etc. The truth table for OR is as follows:
| OR | TRUE | FALSE |
| TRUE | TRUE | TRUE |
| FALSE | TRUE | FALSE |
NOT
The NOT operator reverses the meaning of the result. For example, Joe AND NOT Sara would match any items which contained Joe but did not contain Sara. Some search engines use the abbreviated form Joe NOT Sara and don't allow the (longer but, in my opinion, more readable) AND NOT construct.
NEAR
The NEAR operator dictates that the keywords must be found within a certain number of words of each other. For example, Joe near Sara means that a phrase like "Joe loves Sara" would match, but a phrase like "Joe was watching the stars shine while he was washing his car and then there was an eclipse at which point Sara came out." probably would not match. The distance between words which are "NEAR" is different for each search engine but 7 to 20 words apart is common (WebCrawler is unique in that it defines near as within 1 word by default, see WebCrawler below). This allows you to eliminate pages that just happen to have the two words on them (like Web and Browser) rather than pages that are specifically about Web Browsers without forcing the two words to be exactly next to each other like "Web Browser". Also the order of the words is not important, so "Browsers of the Web" would match as would "Web related Browsers".
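Here is a rough Python sketch of how a NEAR test could work. The function name near and the default distance of 10 words are assumptions chosen for illustration; each engine picks its own distance.

```python
import re

def near(text, first, second, max_distance=10):
    """True if the two words occur within max_distance words of each other.

    The order of the two words does not matter, and case is ignored.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_distance
               for i in first_positions for j in second_positions)

long_sentence = ("Joe was watching the stars shine while he was washing his car "
                 "and then there was an eclipse at which point Sara came out.")
```

Calling near("Joe loves Sara", "joe", "sara") succeeds, while the same test on long_sentence fails because the two names are 21 words apart.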
FOLLOWED BY
The FOLLOWED BY operator is similar to the NEAR. It specifies that the second keyword must follow the first in the document, but does not specify how many words are in between the two keywords. So parking FOLLOWED BY lot would find "Parking in the lot was terrible," but not "there was lots of parking." The only search engine that currently supports this operator is Open Text.
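The same idea can be sketched for FOLLOWED BY. This is my own illustration in Python, not Open Text's code, and the function name followed_by is hypothetical.

```python
import re

def followed_by(text, first, second):
    """True if second appears somewhere after first, at any distance."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if first.lower() not in words:
        return False
    # Look only at the words after the earliest occurrence of first.
    start = words.index(first.lower())
    return second.lower() in words[start + 1:]
```

Unlike NEAR, the order of the two keywords matters here and the distance does not.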
Parentheses ( )
You can use parentheses to group items within a boolean search. For example, let's say you are looking for either red cars or blue wagons; you would use the search query: (red near car) or (blue near wagon). Another example would be if you were looking for cars that were either red or blue: car near (red or blue). If you were looking for a couple named Joe and Sara, and you knew their last name was Katsenblatz, you could search for: (Joe or Sara) and Katsenblatz. Also, you can nest parentheses like: ((Joe or Sara) near (Bob or Jane)) and Katsenblatz. This would match "Joe Bob Katsenblatz", "Joe Jane Katsenblatz", "Sara Jane Katsenblatz", "Katsenblatz Bob Sara", etc.
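Python's own and, or, and not operators follow the same truth tables, so they make a handy way to see what the grouping does. The helper names below are made up for this sketch, and the precedence shown for the ungrouped query is Python's, which a given search engine may or may not share.

```python
import re

def words_of(text):
    """Return the set of lower-cased words on a page."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grouped(text):
    """(Joe or Sara) and Katsenblatz"""
    w = words_of(text)
    return ("joe" in w or "sara" in w) and "katsenblatz" in w

def ungrouped(text):
    """Joe or Sara and Katsenblatz, read the way Python reads it."""
    w = words_of(text)
    # "and" binds tighter than "or", so this is joe or (sara and katsenblatz).
    return "joe" in w or "sara" in w and "katsenblatz" in w

def joe_and_not_sara(text):
    """Joe AND NOT Sara"""
    w = words_of(text)
    return "joe" in w and "sara" not in w
```

A page that mentions only "Joe Smith" fails the grouped query but passes the ungrouped one, which is exactly the kind of surprise the parentheses are there to prevent.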
(in alphabetic order)
Type of data indexed: Web, Usenet
Alta Vista is a spider. This means that to get information into the database, people submit their pages to the service and it goes out to each page and reads it to find out how to index it. As a result, there is a lot of garbage in the index. Most people on the net don't know how to properly set up their pages to be indexed by the search engines, so you have to take that into account when creating your queries.
Alta Vista has two different interfaces to their search engine: Simple and Advanced. Their simple interface is a keyword interface with a few additional features which I will discuss. Their advanced search is a boolean search.
The simple search is a keyword search. You can put multiple words within quotes or string words together with punctuation (such as periods, commas, or semi-colons) so that "hello world" and hello;world are equivalent keywords. The default relationship between the keywords is OR (therefore adding extra keywords makes your query less specific unless you use the prefix commands). You can prefix your keywords with + to indicate that the word is not optional (the equivalent of AND), and you can use the prefix - to indicate that the word must not occur on the page (the equivalent of NOT). You can also use the suffix * to indicate "begins with", for example top* would match top, tops, topside, topic, etc. An important feature of Alta Vista is that keywords with any upper case letters in them must be matched exactly, and words with all lower case will match any case.
Valid queries for Alta Vista Simple search:
+country +music -western writer*
+"country music" -western
+rock;and;roll +"compact disk"
+"Indigo Girls" +lyrics
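The following Python sketch shows one way queries like these could be taken apart, along with the capitalization rule described above. It is an illustration under my own assumptions, not Alta Vista's parser: the function names are hypothetical, the * wildcard is kept in the term but not expanded, and term_matches does a plain substring test.

```python
import re

def parse_simple_query(query):
    """Split a query into (prefix, phrase) pairs; prefix is '+', '-' or ''."""
    terms = []
    for prefix, quoted, bare in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        raw = quoted or bare
        # Words strung together with punctuation (rock;and;roll) form one phrase.
        phrase = " ".join(re.findall(r"[A-Za-z0-9*]+", raw))
        terms.append((prefix, phrase))
    return terms

def term_matches(term, text):
    """All-lower-case terms match any case; terms with capitals must match exactly."""
    if term == term.lower():
        return term in text.lower()
    return term in text
```

For example, parse_simple_query('+rock;and;roll +"compact disk"') yields two required phrases, "rock and roll" and "compact disk".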
Important points:
More help on simple queries for Alta Vista Simple search can be found at:
http://altavista.digital.com/av/content/help.htm
http://altavista.digital.com/cgi-bin/query?pg=aq
Important points:
The following constraint keywords are allowed:
| keyword | limits the search to |
| title:text | in the title of a page |
| anchor:text | in an anchor |
| text:text | in the body of a page |
| applet:name | of an applet |
| object:name | of an ActiveX object |
| link:URL | in an anchor |
| image:name | of an image file |
| url:text | in the page's URL |
| host:text | in the host portion of the page's URL |
| domain:URL | within the specified domain (such as net, com, edu, mil, etc.) |
| from:text | in the from field of a Usenet message |
| subject:text | in the subject field |
| newsgroups:text | only messages in the news groups containing this text |
| summary:text | in the summary of a Usenet message |
| keywords:text | in the keyword list of a Usenet message |
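All of these keywords share the same keyword:value shape, so separating them from ordinary search words is straightforward. The Python sketch below is only an illustration: the function name split_constraints and the short list of known keywords are my own choices, and it does not handle quoted values.

```python
def split_constraints(query, known=("title", "anchor", "text", "url",
                                    "host", "domain", "link")):
    """Separate keyword:value constraints from plain search words."""
    constraints, keywords = {}, []
    for token in query.split():
        # Split on the first colon only, so link:http://... stays intact.
        field, sep, value = token.partition(":")
        if sep and field in known and value:
            constraints.setdefault(field, []).append(value)
        else:
            keywords.append(token)
    return constraints, keywords
```

A query such as title:chocolate host:digital.com cookies then splits into two constraints and the single plain keyword cookies.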
Alta Vista Advanced Search Help can be found at:
http://altavista.digital.com/av/content/help_advanced.htm
Type of data indexed: Web, Usenet, Organized by Topic
Important points:
Type of data indexed: e-mail addresses
Important points:
Type of data indexed: Web, Usenet
Important points:
The following constraint keywords are allowed (HotBot calls them Meta Words):
| keyword | limits the search to |
| domain:domain | name to restrict search to |
| depth:depth | of pages retrieved (number) |
| linkdomain:domain | name page must point to |
| linkext:extension | extension of files to match |
| scriptlanguage:name | name of language to search for |
| newsgroup:name | name of newsgroup to search |
| feature:tag | searches for HTML feature tags (such as embed, script, activex, frame, table, etc.) |
| after:day/month/year | day/month/year |
| before:day/month/year | day/month/year |
| within:limit | a time frame (for example, 4/months, 3/days, 12/years) |
Type of data indexed: Web, Usenet, newswires, e-mail addresses, company profiles, FAQs
Important points:
The following constraint keywords are allowed:
| keyword | limits the search to |
| link:URL | in an anchor |
| site:domain | name to search |
| url:text | in the page's URL |
| title:text | in the page's title |
Type of data indexed: Web
Important points:
Type of data indexed: Web
Important points:
Type of data indexed: Web
Important points:
Type of data indexed: Organized by Topic
Important points:
For a good comparison of various search engines read the article, "Best search engines for finding scientific information on the Net" at http://www.medfarm.unito.it/pharmaco/itcrs/new/comparis.html.
Last updated Monday, February 21, 2005
Send mail to me at sgartner@pingbot.com
Search animation Copyright © 1997 Eclipse Digital Imaging
Copyright © 1995-2005, M. Scott Gartner