


This page describes the process of searching with Swish-e. Please see the SWISH-CONFIG page for information on the Swish-e configuration file directives, and SWISH-RUN for a complete list of command line arguments.
Searching a Swish-e index involves passing command line arguments to Swish-e that specify the index file to use and the query (or search words) to locate in the index. Swish-e returns a list of file names (or URLs) that contain the matched search words. Perl is often used as a front end to Swish-e, for example in CGI applications, and Perl modules exist for interfacing with Swish-e.
The -w command line argument (switch) is used to specify the search query to Swish-e.
When running Swish-e from a shell prompt, be careful to protect your query
from shell metacharacters and shell expansions. This often means placing
single or double quotes around your query. See Searching with Perl if you plan to use Perl as a front end to Swish-e.
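As a minimal illustration of why the quoting matters, the sketch below uses a hypothetical helper, count_args (a stand-in for swish-e, not part of it), to show how many arguments a POSIX shell hands to a command when the query variable is quoted and when it is not.

```shell
# count_args is a hypothetical stand-in for swish-e: it only reports
# how many arguments the shell passed to it.
count_args() { echo "$#"; }

query='smilla and snow'

# Quoted: the arguments are -w, the whole query, -f, and myIndex.
quoted=$(count_args -w "$query" -f myIndex)

# Unquoted: the shell splits the query into three separate words.
unquoted=$(count_args -w $query -f myIndex)

# Prints: quoted=4 unquoted=6
echo "quoted=$quoted unquoted=$unquoted"
```

In the unquoted case only the first word would reach -w as the query, and characters such as * or parentheses would additionally be open to shell expansion.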
The following section describes various aspects of searching with Swish-e.
You can use the Boolean operators and, or, or not in searching. Without these Boolean operators, Swish-e will assume you're anding the words together. The operators are not case sensitive.
[Note: you can change the default to oring by changing the variable DEFAULT_RULE in the config.h file and recompiling Swish-e.]
Evaluation takes place from left to right only, although you can use parentheses to force the order of evaluation.
Examples:
swish-e -w "smilla or snow" -f myIndex
retrieves files containing either of the words ``smilla'' or ``snow''.
swish-e -w "smilla and snow not sense" -f myIndex
swish-e -w "(smilla and snow) and not sense" -f myIndex (same thing)
first retrieves the files that contain both of the words ``smilla'' and ``snow'', and then, among those, keeps the ones that do not contain the word ``sense''.
The wildcard (*) is available, but it can only be used at the end of a word. Anywhere else it is considered a normal character (i.e. it can be searched for if it is included in the WordCharacters directive).
swish-e -w "librarian" -f myIndex
This query only retrieves files that contain the given word.
On the other hand:
swish-e -w "librarian*" -f myIndex
retrieves ``librarians'', ``librarianship'', etc. along with ``librarian''.
Note that wildcard searches combined with word stemming can lead to
unexpected results. If stemming is enabled, a search term with a wildcard
will be stemmed internally before searching. So searching for
running* will actually be a search for run*, so running* would find runway. Also, searching for runn* will not find running
as you might expect, since running stems to run in the index, and thus runn* will not find run.
Expressions are always evaluated left to right:
swish-e -w "juliet not ophelia and pac" -f myIndex
retrieves files which contain ``juliet'' and ``pac'' but not ``ophelia''.
However, it is always possible to force the order of evaluation by using parentheses. For example:
swish-e -w "juliet not (ophelia and pac)" -f myIndex
retrieves files that contain ``juliet'' but do not contain both ``ophelia'' and ``pac''.
MetaNames are used to represent fields (called columns in a database) and provide a way to search in only parts of a document. See SWISH-CONFIG for a description of MetaNames, and how they are specified in the source document.
To limit a search to words found in a meta tag you prefix the keywords with the name of the meta tag, followed by the equal sign:
metaname = word
metaname = (this or that)
metaname = ( (this or that) or "this phrase" )
It is not necessary to have spaces on either side of the ``='', so the following are equivalent:
swish-e -w "metaName=word"
swish-e -w "metaName = word"
swish-e -w "metaName= word"
To search on a word that contains a ``='', precede the ``='' with a ``\'' (backslash).
swish-e -w "test\=3 = x\=4 or y\=5" -f <index.file>
This query returns the files in which the word ``x=4'' is associated with the metaName ``test=3'', or that contain the word ``y=5'' not associated with any metaName.
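When the words come from variables, the backslashes can be added mechanically. The following is a minimal sketch in POSIX shell; escape_equals is a hypothetical helper written for this illustration and is not part of Swish-e.

```shell
# escape_equals is a hypothetical helper, not part of Swish-e.
# It puts a backslash in front of every "=" in its argument.
escape_equals() {
    printf '%s\n' "$1" | sed 's/=/\\=/g'
}

meta=$(escape_equals 'test=3')
word=$(escape_equals 'x=4')

# Prints: test\=3 = x\=4
printf '%s\n' "$meta = $word"
```

The resulting string can then be passed as a single quoted argument, for example -w "$meta = $word".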
Queries can also be constructed using any of the usual search features, and metaName and plain searches can be mixed in a single query.
swish-e -w "metaName1 = (a1 or a4) not (a3 and a7)" -f index.swish-e
This query will retrieve all the files in which ``a1'' or ``a4'' is found in the META tag ``metaName1'' and that do not contain both of the words ``a3'' and ``a7'', where ``a3'' and ``a7'' are not associated with any meta name.
To search for a phrase in a document use double-quotes to delimit your search terms. (The phrase delimiter is set in src/swish.h.)
You must protect the quotes from the shell.
For example, under Unix:
swish-e -w '"this is a phrase" or (this and that)'
swish-e -w 'meta1=("this is a phrase") or (this and that)'
Or under Windows:
swish-e -w \"this is a phrase\" or (this and that)
You can use the -P switch to set the phrase delimiter character. See SWISH-RUN for examples.
Context
At times you might not want to search for a word in every part of your
files since you know that the word(s) are present in a
particular tag. The ability to search according to context greatly increases
the chances that your hits will be relevant, and Swish-e provides a
mechanism to do just that.
The -t option in the search command line allows you to search for words that exist only in specific HTML tags. Each character in the string you specify as the argument to this option represents a different tag in which the word is searched; that is, you can use any combination of the following characters:
H means all <HEAD> tags
B stands for <BODY> tags
t is all <TITLE> tags
h is <H1> to <H6> (header) tags
e is emphasized tags (this may be <B>, <I>, <EM>, or <STRONG>)
c is HTML comment tags (<!-- ... -->)
# This search will look for files with these two words in their titles only.
swish-e -w "apples oranges" -t t -f myIndex
# This search will look for files with these words in comments only.
swish-e -w "keywords draft release" -t c -f myIndex
# This search will look for words in titles, headers, and emphasized tags.
swish-e -w "world wide web" -t the -f myIndex
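If the -t string comes from a script or a form, it can be checked against the letters listed above before it is used. The following is a minimal sketch in POSIX shell; valid_context is a hypothetical helper, and how Swish-e itself treats letters outside this list is not covered here.

```shell
# valid_context is a hypothetical helper, not part of Swish-e.
# It succeeds only if its argument is a non-empty string made up
# solely of the context letters listed above: H B t h e c.
valid_context() {
    case "$1" in
        '' | *[!HBthec]*) return 1 ;;
        *) return 0 ;;
    esac
}

for ctx in t the Hx; do
    if valid_context "$ctx"; then
        echo "$ctx: ok"
    else
        echo "$ctx: rejected"
    fi
done
```

Here ``t'' and ``the'' are accepted, while ``Hx'' is rejected because ``x'' is not one of the documented letters.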
Perl ( http://www.perl.com/ ) is probably the most common programming language used with Swish-e, especially in CGI interfaces. Perl makes searching and parsing results with Swish-e easy, but if not done properly can leave your server vulnerable to attacks.
When designing your CGI scripts you should carefully screen user input, and include features such as paged results and a timer to limit time required for a search to complete. These are to protect your web site against a denial of service (DoS) attack.
Included with every distribution of Perl is a document called perlsec -- Perl Security. Please take time to read and understand that document before writing CGI scripts in Perl.
Type at your shell/command prompt:
perldoc perlsec
If nothing else, start every CGI program in perl as such:
#!/usr/local/bin/perl -wT
use strict;
That alone won't make your script secure, but may help you find insecure code.
There are many examples of CGI scripts on the Internet. Many are poorly
written and insecure. A commonly seen way to execute Swish-e from a perl
CGI script is with a piped open. For example, it is common to see this type of open():
open(SWISH, "$swish -w $query -f $index|");
This open() gives shell access to the entire Internet! Often an attempt is made to
strip $query of bad characters. But this often fails, since it's hard to guess what every bad character is. Would you have thought about a null? A better approach is to
allow in only characters that are known to be safe.
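The allow-list idea does not depend on the language. The following is a minimal sketch in POSIX shell rather than Perl; is_safe_query is a hypothetical helper, and the allowed set (ASCII letters, digits, and spaces) is only an assumption for illustration. It would have to be widened deliberately to permit query syntax such as *, =, quotes, or parentheses.

```shell
# Use the C locale so that ranges such as A-Z cover ASCII letters only.
LC_ALL=C
export LC_ALL

# is_safe_query is a hypothetical helper. It rejects the empty string
# and any string containing a character outside the allow-list
# (ASCII letters, digits, and spaces).
is_safe_query() {
    case "$1" in
        '' | *[!A-Za-z0-9\ ]*) return 1 ;;
        *) return 0 ;;
    esac
}

for q in 'smilla and snow' 'snow; rm -rf /' 'snow | mail'; do
    if is_safe_query "$q"; then
        echo "accepted: $q"
    else
        echo "rejected: $q"
    fi
done
```

Rejecting input outright, instead of trying to strip the offending characters, keeps the check simple and avoids guessing which characters are dangerous.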
Even if you can be sure that any user supplied data is safe, this piped open still passes the command parameters through the shell. If nothing else, it's just an extra unnecessary step to running Swish-e.
Therefore, the recommended approach is to fork and exec swish-e directly without passing through the shell. This process is described in
the perl man page perlipc under the appropriate heading Safe Pipe Opens.
Type:
perldoc perlipc
If all this sounds complicated you may wish to use a Perl module that does all the hard work for you.
The Swish-e distribution includes a Perl module called SWISH::API. SWISH::API provides access to the Swish-e C Library.
The SWISH::API module is not installed by default.
The SWISH::API module will embed Swish-e into your Perl program so that searching does not require running an external program. Embedding Swish-e in your Perl program results in faster searches, especially when running under a persistent environment like mod_perl, since it avoids the cost of opening the index file for every request (mod_perl is also much faster than CGI because it avoids the need to compile Perl code for every request).
See the README file in the perl directory of the Swish-e distribution for installation instructions. Documentation for the SWISH::API module is available at http://swish-e.org and is installed along with other HTML documentation on your computer.
$Id: SWISH-SEARCH.pod,v 1.5 2003/05/15 05:38:30 whmoseley Exp $