Finding needles in haystacks

The Internet and the World Wide Web provide ways to access tens of millions of documents, from the full-text versions of Shakespeare's plays to a 10-year-old's holiday tips.

The greatest challenge still remains: finding what you want or need in cyberspace.

Just as there is no central authority running the Internet, there is no central directory for every Internet site, Web page and newsgroup posting.

There are directories and search engines to help you, but as you enter into what has become the great global scavenger hunt, keep several things in mind:

*Not everything is listed. Many of the search tools available rely upon individuals to enter their own descriptive information into the search tool's database. Thus, someone may choose not to list a page or not know the process for listing their site, and you may miss what you're looking for.

Many government and military sites are not indexed by search engines. The information you want may be on the Internet, but if the search engine you are using doesn't index it, you won't find it.

*The listing may be limited to key words. Some search engines do send virtual robots out to find a match for the word you submit, but others rely upon the person listing the site to provide key words to match the user's request. This could mean that you search for a key word which doesn't happen to be on that list, or, equally frustrating, the key word may have multiple meanings. For example, we searched for Web sites about the environment. Most of the first matches concerned the Unix computer system, which technically is called the Unix Operating Environment.

*The site may not be in English. As the World Wide Web spreads throughout the actual world, more sites appear in various languages and in non-Latin alphabets and scripts. Your computer or the search tools may not support a search in Hebrew or Japanese without special software. (And you may not know what words to search for in another language.)

AltaVista allows searching sites by language and provides translations of some foreign language sites.

*Not everything is on the Net. Some days it seems like most of human history has been put online; however, vast repositories of information still remain in libraries, businesses and private homes.

Information still protected by copyright laws does not appear online -- or if it does, it is only in an illegal pirated version.

*Searches are only text-based. Although the Web can be used to access audio, video, images and other file formats, search engines only match words. A picture or sound may be discovered, but only if the words describing it match the user's query. To search for certain kinds of media, try AltaVista, which allows searching for certain kinds of files and formats.

When a Web site's content is supplied by a database, search engines won't access the database because computer engineers have not yet figured out how to run a query that way.

Searching is still possible, if not perfect. Since the Web became popular, two distinct ways of finding information have been developed: directories (sometimes called catalogs) and search engines.

Directories are the yellow pages of the Internet. The first major directory was Yahoo, developed by two Stanford engineering students. Yahoo, like most directories, is arranged by categories that become increasingly specialized. For example, if you were looking for the Web page for a baseball team in Daytona Beach, you would click:

  • Regional
  • U.S. States
  • Florida
  • Cities
  • Daytona Beach
  • Sports
  • Daytona Cubs
  • Daytona Cubs (official)
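One way to picture that drill-down is as a walk through a nested category tree, one level per click. The sketch below is only an illustration: the `DIRECTORY` structure and the `browse` function are hypothetical names invented here, not how Yahoo or any other directory actually stores its listings.

```python
# Hypothetical miniature of a category directory. The nesting mirrors the
# click path in the text; the data structure itself is an assumption.
DIRECTORY = {
    "Regional": {
        "U.S. States": {
            "Florida": {
                "Cities": {
                    "Daytona Beach": {
                        "Sports": {
                            "Daytona Cubs": ["Daytona Cubs (official)"],
                        }
                    }
                }
            }
        }
    }
}


def browse(directory, path):
    """Follow one category per click and return what is listed at the end."""
    node = directory
    for category in path:
        if not isinstance(node, dict) or category not in node:
            raise KeyError(f"No category named {category!r} at this level")
        node = node[category]
    return node


clicks = ["Regional", "U.S. States", "Florida", "Cities",
          "Daytona Beach", "Sports", "Daytona Cubs"]
print(browse(DIRECTORY, clicks))
```

Every listing sits at the end of exactly one path, so a wrong turn at any level means backing up and trying another category.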

Other directories quickly followed. As they grew, so did the frustration of going down level by level to find a site, so many directories added search engines of their own to help you find listings more quickly. Some directories also allow you to search beyond their own listings.

Search engines themselves fall into two general categories. The earliest version would search databases of information which had been submitted by the developer of a given site. These search engines usually used the titles of Web documents to find a match.

Of course, information which had not been submitted or which was contained deeper within the document would not be retrieved. For instance, the Los Angeles Music Center Opera Web site contains several screens with information about artistic director Plácido Domingo. We submitted this site to some search engines with his name as one of the key words, which means that those screens would be returned to someone who asked a search engine about his name. However, engines which only searched the title of the page or the first page of the site would return the message that no matches were found.
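The difference between title-only and deeper matching can be sketched in a few lines. This is a simplified illustration, not any real engine's code: the page titles and body text in `PAGES` are invented, and `search_titles` and `search_full_text` are hypothetical names.

```python
# Hypothetical pages from one site. Titles and body text are invented
# for illustration and are not the real site's content.
PAGES = [
    {"title": "Los Angeles Music Center Opera",
     "body": "Welcome to the opera. Season schedule and tickets."},
    {"title": "Our Artistic Director",
     "body": "Plácido Domingo leads the company's artistic planning."},
]


def search_titles(pages, word):
    """Title-only matching: the page body is never examined."""
    word = word.lower()
    return [p["title"] for p in pages if word in p["title"].lower()]


def search_full_text(pages, word):
    """Full-text matching: look in both the title and the body."""
    word = word.lower()
    return [p["title"] for p in pages
            if word in p["title"].lower() or word in p["body"].lower()]


print(search_titles(PAGES, "Domingo"))     # nothing: no title has the name
print(search_full_text(PAGES, "Domingo"))  # finds the inner page
```

The title-only search comes back empty even though the name is on the site, which is the "no matches were found" result described above.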

To solve this problem, some of the newest search engines, such as AltaVista, promise to search every page of every document on the Web, including threads from newsgroups.

While such a search is not entirely possible (some pages are protected by passwords, for example), searches may now yield an overwhelming number of references, sometimes including multiple references to the same site.

The user needs to learn more advanced searching techniques to avoid messages like, "128,000 references to the word association have been returned." Some search engines do return references ranked by how relevant they think the information is, but we've often found a low degree of accuracy in such rankings.
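To see why such rankings can disappoint, consider one very simple scoring scheme: count how often the search word appears, and list each site only once. This is a sketch under our own assumptions, not the method any particular search engine uses; the sample results and the `rank_by_frequency` name are invented for illustration.

```python
from collections import Counter

# Hypothetical search results as (site, page text) pairs. All invented.
RESULTS = [
    ("example.org", "The association meets monthly."),
    ("example.org", "Join the association today. The association welcomes you."),
    ("example.net", "Word association games for children."),
]


def rank_by_frequency(results, word):
    """Score each site by how often the word appears, one entry per site."""
    word = word.lower()
    scores = Counter()
    for site, text in results:
        # Split on whitespace and trim common punctuation from each token.
        tokens = [t.strip(".,;:!?\"'") for t in text.lower().split()]
        scores[site] += tokens.count(word)
    # Highest count first; sites that never mention the word are dropped.
    return [(site, n) for site, n in scores.most_common() if n > 0]


print(rank_by_frequency(RESULTS, "association"))
```

Counting words says nothing about meaning: a page on word association games scores the same way as a page about a trade association, so a high rank does not guarantee a useful match.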

Tips on beginning your search are in the next lesson of the Internet Guide.