Searching on the Internet




The Internet must be one of the most chaotic environments on the planet. There are hundreds of thousands of people on the net, and all of them are either putting information on the Internet or getting information from it (most do both). What is the best way to search the Internet? This is not an easy question to answer. There are as many ways to search the Internet as there are places to be searched. The web is quickly becoming the interface of choice for accessing the content on the Internet, so that is what I will concentrate on. Some search services are built on WAIS, Archie, gopher, or proprietary technology (many vendors will gladly sell you their database and search engines), but I am not going to linger on the technologies underlying these databases, since the same methods can be used on most of them. There are some special functions that work on one service but not on another, but you will mostly get the hang of which ones support advanced features like boolean searching (quite common) and regular expressions (more powerful, but much less common). I will address boolean searching to some extent, but regular expressions, unfortunately, are not always identical between two different search engines, so take my descriptions with a grain of salt. Fortunately, regular expression searches in their simplest form look the same as basic phrase searches, so you will probably have little trouble with simple searches.

Here is a subset of the different kinds of searches that you are bound to find out there on the Internet. For some of them I have created a sample data set to search and provided sample queries. For others I could not find a viable sample query for so small a data set, so I simply gave (hopefully obvious) examples. I have also listed sites on the Internet that use the type of query shown. This list is obviously incomplete, since there are hundreds (if not thousands) of different search engines on the Internet. If you need clarification of any of these, feel free to send me e-mail, but please exhaust the online help for the service you are using before you resort to asking me a question. And don’t expect me to do your research for you: I will happily answer questions on how to search a specific service, but I cannot answer questions on where to search. For some ideas on where to search, look at my bookmarks (on my web page; the address is at the end of this document). Also, look at search.com, which has compiled an enormous list of places to search on the Internet.

In the descriptions below I have been pretty cavalier in my use of capitalized words to (hopefully) enhance the clarity of the paragraphs. You need to check how the search engine you are using treats capital letters. Some ignore them completely, while others enforce them (meaning that if you capitalize a word, only words with identical capitalization will be found). In my search engine summaries below I have noted how the engine handles capitalized words (if I know). If you get back fewer hits than you expected, check your capitalization (you are almost always safe lower-casing every query).

The kinds of searches to look for:

Keyword searches

A keyword search engine simply takes a list of words. Often you can pick (using a radio button or drop list) whether all of the words must be found (AND) or whether any of the words can be found (OR) before a match is made. Some let you designate whether the keywords are to be considered full words (separated by spaces and punctuation) or simply letters (sometimes called sub-strings). Usually keyword searches do not care about the case of the keywords. Often you can enclose multiple words in quotes and they will be treated as one word (they may have to exist unbroken by line endings in the text for the search to work).

Punctuation within a list of keywords is usually ignored; therefore joe sara is the same as joe, sara or joe-sara or joe&sara. Most keyword engines provide a limiting operator (often ‘+’) and an exclusion operator (often ‘-’) that you can use to narrow your query. For example, to get pages that must contain cookies made with chocolate, nuts, or caramel, but not turtles, you might use "+cookies chocolate nuts caramel -turtle". Many also provide a wildcard operator (usually ‘*’) that indicates that a keyword begins with some characters (for example, "cal*" would match calendar, calculate, caltech, etc.). Some allow you to specify the beginning and end of the word (for example, "st*p" would match stop, step, stewardship, etc.). Most require a certain number of characters before the wildcard operator to avoid wasting resources in the search engine (often 3 characters, so that "a*" and "ab*" are both illegal but "abc*" is not).
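The limiting, exclusion, and wildcard operators can be modeled in a few lines of Python. This is only a sketch under stated assumptions, not the behavior of any real service: the function names (parse_query, term_matches, page_matches) are hypothetical, matching is case-insensitive on whole words, and a page must contain every ‘+’ word, none of the ‘-’ words, and at least one of the plain words.

```python
import re

def parse_query(query):
    """Split a query into required (+), excluded (-), and optional terms."""
    required, excluded, optional = [], [], []
    for term in query.lower().split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    return required, excluded, optional

def term_matches(term, words):
    """True if any word matches the term; '*' stands for any run of characters."""
    pattern = re.compile(".*".join(re.escape(part) for part in term.split("*")))
    return any(pattern.fullmatch(word) for word in words)

def page_matches(query, text):
    """Apply the assumed +/-/plain-word rules to one page of text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    required, excluded, optional = parse_query(query)
    if any(term_matches(term, words) for term in excluded):
        return False
    if not all(term_matches(term, words) for term in required):
        return False
    # If there are no plain words, the required ones are enough.
    return not optional or any(term_matches(term, words) for term in optional)

query = "+cookies chocolate nuts caramel -turtle"
print(page_matches(query, "Chocolate chip cookies"))   # has cookies and chocolate
print(page_matches(query, "Caramel turtle cookies"))   # contains the excluded word
print(page_matches("st*p", "A lesson in stewardship"))
```

Note that in this sketch "-turtle" excludes only the exact word turtle; writing "-turtle*" would also exclude turtles.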

Also, you should try hard to use uncommon words that relate to what you want to find. For example, if you are looking for information on Jeffrey Dahmer, searching for jeff is unlikely to be useful, whereas searching for dahmer is more likely to find something useful, and searching for dahmer criminal insane (or even "Jeffrey Dahmer" criminal insane) would be even more specific. Be careful to check that the site’s default relationship for multiple keywords is AND; if it is OR, adding more keywords actually makes your search less specific. If the site sorts your results for you, you can help it along by duplicating the keyword that is most important (AltaVista lets you specify which words are important with a ranking field). This will double (triple, etc.) the number of hits associated with that word. For example, dahmer dahmer criminal insane would make sure that pages that talk about Dahmer specifically, rather than simply mentioning his name, rise to the top (for example, check out http://www.rru.com/~meo/WiredDiff/ch03/val.html).

Services that use keyword searches: Yahoo, Excite, Alta Vista (simple query section)

Service       Default relationship
Yahoo         AND
Four11        AND
HotBot        AND
InfoSeek      OR
Lycos         AND
WebCrawler    BOTH
Excite        OR
Alta Vista    OR

Example data for keyword and phrase search samples:

  1. Joe came through the door and saw Sara standing there with a band saw in her hands.
  2. Sara was preparing to cut through the wall into the bank where she was hoping that
  3. there would be lots of money. This was because the band saw had cost her more money
  4. than she currently could afford. It really was silly of her since Joe could have loaned her
  5. any money she wanted, and she didn’t really need a band saw anyway.
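To see how the AND and OR relationships play out on this data, here is a minimal Python sketch. It is an illustration only, not how any particular service works: the function name keyword_search, the whole-word matching, and the case-insensitive comparison are all assumptions made for the example. Quoted words are treated as a single keyword.

```python
import re
import shlex

LINES = [
    "Joe came through the door and saw Sara standing there with a band saw in her hands.",
    "Sara was preparing to cut through the wall into the bank where she was hoping that",
    "there would be lots of money. This was because the band saw had cost her more money",
    "than she currently could afford. It really was silly of her since Joe could have loaned her",
    "any money she wanted, and she didn’t really need a band saw anyway.",
]

def keyword_search(query, lines, mode="OR"):
    """Return the 1-based numbers of the lines that match the query."""
    # shlex.split keeps a quoted phrase such as "band saw" together.
    keywords = [keyword.lower() for keyword in shlex.split(query)]
    combine = all if mode == "AND" else any
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        # \b makes each keyword match only as a full word (or full phrase).
        found = [
            re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None
            for keyword in keywords
        ]
        if combine(found):
            hits.append(number)
    return hits

for query in ["Joe sara", "Joe money", '"band saw" money']:
    print(query, keyword_search(query, LINES, "OR"), keyword_search(query, LINES, "AND"))
```

With this sketch, keyword_search("Joe money", LINES) returns [1, 3, 4, 5], while the same query with mode="AND" returns an empty list, because no single line contains both words.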

Sample queries:

Keywords            Results using OR               Results using AND
Joe sara            Matches lines 1, 2 and 4       Matches line 1
Joe money           Matches lines 1, 3, 4 and 5    Matches none
"band saw" money    Matches lines 1, 3 and 5       Matches lines 3 and 5

Phrase searches

A phrase search is a very simple search and you will probably not find too many of these on the Internet. A phrase search looks for an exact match for the search text. There is no concept of and/or because there is only one "thing" that is being searched for.

Services that use phrase searches: Some gopher search tools

Sample queries:

Phrase      Results
band saw    Matches lines 1, 3 and 5
saw band    Matches none
joe c       Matches lines 1 and 4
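A phrase search is simple enough to sketch in a couple of lines of Python. The function name phrase_search is hypothetical, and the sketch assumes a case-insensitive match on the exact run of characters, which is why a partial word such as "joe c" can still match.

```python
LINES = [
    "Joe came through the door and saw Sara standing there with a band saw in her hands.",
    "Sara was preparing to cut through the wall into the bank where she was hoping that",
    "there would be lots of money. This was because the band saw had cost her more money",
    "than she currently could afford. It really was silly of her since Joe could have loaned her",
    "any money she wanted, and she didn’t really need a band saw anyway.",
]

def phrase_search(phrase, lines):
    """Return the 1-based numbers of the lines containing the exact phrase."""
    needle = phrase.lower()
    return [
        number
        for number, line in enumerate(lines, start=1)
        if needle in line.lower()
    ]

for phrase in ["band saw", "saw band", "joe c"]:
    print(repr(phrase), phrase_search(phrase, LINES))
```

Because the whole phrase is one search string, word order matters: "band saw" is found, but "saw band" is not.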

Concept searches

A concept search allows you to type in a topic or short description of what you are looking for. These engines use a thesaurus to find what you are looking for. Typing extraneous words like "and," "or," "it," etc., and punctuation may confuse them. There is a big difference between the keywords cut and cutting in a search like this, since they will result in completely different sets of synonyms.

Services that use Concept searches: MetaCrawler, Excite

Natural searches

A natural search is written in a non-structured language (like English), which allows the user to describe what they are looking for; the engine is then responsible for distilling the query down to a structured format. A sample natural search might be "I am looking for programs related to winsock that allow FTP file transfers." or "FTP programs for winsock". In theory these two would be equivalent and should return very similar, if not identical, results, but in practice the first will probably not work at all, while the second one is very likely to find something. This is because, while the engines claim to process "real" sentences, none are powerful enough to actually parse full English, so you still need to put thought into how you phrase your query. Try to find the most succinct phrase that describes what you are searching for and, in general, stay away from fluff (like "I want," "I am looking," "Why can’t you find," etc.).

Services that use natural searches: Infoseek

Regular Expression searches

Regular expressions are very precise ways of denoting search strings. They were designed by computer scientists to describe computer languages (lexical analysis, for those who care). Most programmers’ editors use regular expressions for searching, and anyone with more than a passing familiarity with Unix likely uses them on a weekly if not daily basis. Regular expressions are much too complicated to cover completely here, so I will only show you some simple constructs. Remember that different engines will probably have all of these operations, but the actual operators may differ from engine to engine (I’ve left out the ones that are most likely to differ between engines). One feature of regular expressions is that there are often many ways to express the same query.

Services that use regular expression searches: LookUP!, Archie

Most of the operations are shown below:

Operator      Description              Usage
|             OR                       a|b
^             Beginning of line        ^The game
$             End of line              that was that.$
()            Grouping                 (this)|(that)
*             0 or more occurrences    a*bcd
+             1 or more occurrences    a+bcd
[]            Matches any in set       [abcd]
[^]           Matches none in set      [^abcd]
\             Escape character         5\*3=15
. (period)    Matches any character    time.*ss

Sample queries:

[iI]nternet Matches "Internet" or "internet"
news[123456789] Matches "news1" or "news2" or "news3", ..., "news9"
(I|i)nternet Matches "Internet" or "internet"
(t|b)oy Matches "toy" or "boy"
(toy)|(boy) Matches "toy" or "boy"
((to)|(bo))y Matches "toy" or "boy"
x*yz Matches "yz" or "xyz" or "xxyz" or "xxxyz" or ... or "xxxxxxxxxxxxyz", etc.
x+yz Matches "xyz" or "xxyz", etc., but does NOT match "yz"
kiss.. Matches "kisser" or "kisses" or "kiss43", etc.
time\.001 Matches "time.001" only
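If you have Python handy, you can try patterns like these yourself with its standard re module. Keep in mind that this is just one regular expression dialect, used here as an assumption for illustration; a given search engine may spell some operators differently. The helper name matches is hypothetical, and it tests whether a whole word matches the pattern.

```python
import re

def matches(pattern, text):
    """True if the whole text matches the pattern (Python 're' syntax)."""
    return re.fullmatch(pattern, text) is not None

checks = [
    (r"[iI]nternet", "internet"),
    (r"news[123456789]", "news7"),
    (r"(toy)|(boy)", "toy"),
    (r"(t|b)oy", "boy"),
    (r"x*yz", "yz"),
    (r"x+yz", "yz"),             # + needs at least one x
    (r"kiss..", "kisser"),
    (r"time\.001", "time.001"),
    (r"time\.001", "timex001"),  # an escaped period matches only a real period
    (r"t|boy", "toy"),           # | splits the whole pattern into "t" or "boy"
]

for pattern, text in checks:
    print(f"{pattern!r:22} {text!r:12} {matches(pattern, text)}")
```

The last check shows why grouping matters: without parentheses, the OR operator applies to everything on either side of it, so t|boy means "t" or "boy" and does not match "toy".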

Boolean searches

Boolean searches use boolean logic. This is one of the most common methods of specifying advanced searches on the Internet. Boolean logic uses the concepts of TRUE and FALSE. The simple way to think about it is to imagine the search engine comparing every page on the web with your query (in the case of a Web search); only those pages for which the query evaluates to TRUE are returned to you as being found. A boolean search engine recognizes at least three (sometimes more) operators: AND, OR, and NOT. Many of these engines also support NEAR, which is not (strictly speaking) a boolean operator.

Some services have separate simple and advanced searches (AltaVista for example), while others rely on parsing the search string (requiring you to type the boolean operators in all UPPER CASE, HotBot for example).

Services that use boolean searches: AltaVista, Excite, HotBot, Open Text, WebCrawler, and most others

The following is a description of each of these operators:

AND

OR

NOT

NEAR

FOLLOWED BY

Parentheses ( )
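The TRUE/FALSE idea can be sketched as a tiny Python evaluator that checks one page of text against a boolean query. This is a sketch under stated assumptions, not any engine's actual parser: the function names (tokenize, evaluate) are hypothetical, operators must be typed in UPPER CASE, keywords match whole words regardless of case, and NOT binds tighter than AND, which binds tighter than OR. NEAR and FOLLOWED BY are left out because their exact meaning varies from service to service.

```python
import re

def tokenize(query):
    """Split a query into parentheses and words."""
    return re.findall(r"\(|\)|[^\s()]+", query)

def evaluate(query, text):
    """Return True if the text satisfies a query using AND, OR, NOT and ( )."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            right = parse_and()  # always parse the right side to consume its tokens
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() is None:
            raise ValueError("query ended unexpectedly")
        if peek() == "NOT":
            advance()
            return not parse_not()
        if peek() == "(":
            advance()
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        # Anything else is a keyword: TRUE if the page contains it.
        return advance().lower() in words

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result

page = "Joe came through the door and saw Sara standing there with a band saw in her hands."
print(evaluate("joe AND sara", page))
print(evaluate("joe AND NOT money", page))
print(evaluate("(money OR sara) AND joe", page))
```

A search engine would, in effect, run a check like this against every page in its index and return the pages for which the result is TRUE.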

Search Engine Specific Help

(in alphabetic order)

Alta Vista

Type of data indexed: Web, Usenet

Alta Vista is a spider. This means that to get information into the database, people submit their pages to the service, and it goes out to each page and reads it to find out how to index it. As a result, there is a lot of garbage in the index. Most people on the net don’t know how to properly set up their pages to be indexed by the search engines, so you have to take that into account when creating your queries.

Alta Vista has two different interfaces to their search engine: Simple and Advanced. Their simple interface is a keyword interface with a few additional features which I will discuss. Their advanced search is a boolean search.

Alta Vista Simple search

http://altavista.digital.com/

The simple search is a keyword search. You can put multiple words within quotes or string words together with punctuation (such as periods, commas, or semi-colons), so that "hello world" and hello;world are equivalent keywords. The default relationship between the keywords is OR (therefore adding extra keywords makes your query less specific unless you use the prefix commands). You can prefix a keyword with ‘+’ to indicate that the word is not optional (the equivalent of AND), and you can use the prefix ‘-’ to indicate that the word must not occur on the page (the equivalent of NOT). You can also use the suffix ‘*’ to indicate "begins with"; for example, top* would match top, tops, topside, topic, etc. An important feature of Alta Vista is that keywords with any upper case letters in them must be matched exactly, while keywords in all lower case will match any case.
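The capitalization rule is easy to get backwards, so here it is as a short Python sketch. The function name case_rule_match is hypothetical, and the code only models the rule as described in the paragraph above, not Alta Vista's actual implementation.

```python
def case_rule_match(keyword, word):
    """Model the rule: lower-case keywords match any case, others match exactly."""
    if keyword == keyword.lower():
        # All lower case: ignore the capitalization of the word on the page.
        return keyword == word.lower()
    # The keyword contains upper-case letters: require identical capitalization.
    return keyword == word

for keyword, word in [("internet", "Internet"), ("Internet", "internet"), ("Internet", "Internet")]:
    print(keyword, word, case_rule_match(keyword, word))
```

This is why lower-casing every query is the safe choice: a lower-case keyword never misses a page because of capitalization.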

Valid queries for Alta Vista Simple search:

Important points:

More help on simple queries for Alta Vista Simple search can be found at:

http://altavista.digital.com/av/content/help.htm

Alta Vista Advanced Search

http://altavista.digital.com/cgi-bin/query?pg=aq

Important points:

The following constraint keywords are allowed:

Alta Vista Advanced Search Help can be found at:

http://altavista.digital.com/av/content/help_advanced.htm

Excite

http://www.excite.com/

Type of data indexed: Web, Usenet, Organized by Topic

Important points:

Four11

http://www.four11.com/

Type of data indexed: e-mail addresses

Important points:

HotBot

http://www.hotbot.com/

Type of data indexed: Web, Usenet

Important points:

The following constraint keywords are allowed (HotBot calls them Meta Words):

Infoseek

http://www.infoseek.com/

Type of data indexed: Web, Usenet, newswires, e-mail addresses, company profiles, FAQs

Important points:

The following constraint keywords are allowed:

Lycos

http://www.lycos.com/

Type of data indexed: Web

Important points:

Open Text

http://index.opentext.net/

Type of data indexed: Web

Important points:

WebCrawler

http://webcrawler.com/

Type of data indexed: Web

Important points:

Yahoo

http://www.yahoo.com/

Type of data indexed: Organized by Topic

Important points:

For a good comparison of various search engines read the article, "Best search engines for finding scientific information on the Net" at http://www.medfarm.unito.it/pharmaco/itcrs/new/comparis.html.


Last updated Monday, February 21, 2005
Send mail to me at sgartner@pingbot.com
Search animation Copyright © 1997 Eclipse Digital Imaging
Copyright © 1995-2005, M. Scott Gartner