Search Websites
ht://Dig on www.uga.edu


Background - Setup - Compiling New Versions - Contingency Plan

Background

The main search engine used on www.uga.edu is ht://Dig, a complete indexing and search system. Indexing is performed periodically (and automatically) under the vigilance of the University WWW Services Coordinator (UWC). User searches are performed via a fill-out form, the most significant feature of which is a search box which provides an input area for search words. The search box is a prominent feature of the UGA homepage. Details on the setup are included below.

A local copy of the source code for the current production version of ht://Dig is located here:

/usr/local/src/htdig
Documentation and related information on ht://Dig are available from the ht://Dig website:

http://www.htdig.org

When a new version of ht://Dig is downloaded for installation, the source should be retrieved from the ht://Dig website and placed in /usr/local/src/htdig.

Setup

Version: ht://Dig 3.1.5
Hardware Specifics: 384M RAM -- Disk Consideration
Production Directory: /usr/www/ss/
Test Directory: /usr/www/ssd/

/usr/www/ss/ is actually the home directory for a WWW account set up for the special purpose of providing a UGA-wide site search (hence the name ss for the account). This home directory contains the following sub-directories:

public_html

public_html is primarily a repository for images used on initial search pages and search results pages (more on the results pages in htdig/common).

home.html in this directory is nothing more than a pointer to the initial search page http://www.uga.edu/search.

The local directory contains UGA images used on the results pages.

bin

Contains the search program htsearch, a CGI program invoked from an HTML form. The Apache server configuration file /usr/local/apache//conf/httpd.conf includes the following ScriptAlias directive to enable this directory as a CGI directory:
ScriptAlias /ss-bin/ /usr/www/ss/bin/
A simple example of the HTML which invokes htsearch:
<form method="post" 
         action="/ss-bin/htsearch">
<font size=-1 
         face="helvetica,arial,universal">
Search UGA Websites:<br>
</font>
<font class="FORM" 
         size=-1 
         face="helvetica,arial,universal">
<input 
         type="text" 
         size=15 
         name="words" 
         value="">
</font>
<input 
         type="submit" 
         value="Go">
<input 
         type=hidden 
         name=method 
         value=and>
<input 
         type=hidden 
         name=config 
         value=htdig>
<input 
         type=hidden 
         name=restrict 
         value="">
<input 
         type=hidden 
         name=exclude value="">
</form>

htdig/bin

ht://Dig binaries and related programs for creating the UGA primary database are located here:

rundig is the single command which actually performs the indexing (rundigt is a test program). It is slightly modified from the distribution:
  • TMPDIR is set.
  • move of .work files commented out, requiring they be moved by hand.
  • a section has been added which checks for the existence of .work files and sends mail to the UWC in the event that these files do exist (a likely indication that files which should have been moved were not moved).
  • the hop count can be set here (and formerly was), but it is now set in htdig.conf using:
    max_hop_count: 5
    
    The hop count sets the number of "hops" followed from the UGA homepage for the pages to be included in the UGA primary database.
rundig is run once a week (Fridays at 9:00 a.m.) from root's crontab:
# Run ht://Dig
0 9 * * 5 /usr/www/ss/htdig/bin/rundig -a > /dev/null 2>&1
#
The -a option creates the "alternate" .work files. The default configuration file, htdig.conf, is used (more on the configuration files in htdig/conf).
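The check for leftover .work files that was added to rundig might look roughly like the following. This is a minimal sketch, not the actual code in rundig: the function names are hypothetical, and it prints a warning where the real script mails the contents of htdig/bin/msg to the UWC.

```shell
#!/bin/sh
# Sketch only: detect leftover .work files in a database directory.
# Function names are illustrative and do not come from rundig itself.

# Succeeds (exit status 0) if at least one *.work file exists in directory $1.
work_files_present() {
    for f in "$1"/*.work; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Print a warning when leftover .work files are found.
# (The production rundig sends mail to the UWC instead of printing.)
warn_if_work_files() {
    if work_files_present "$1"; then
        echo "WARNING: .work files found in $1; were they moved after the last run?"
    fi
}
```

For example, `warn_if_work_files /usr/www/ss/htdig/db` prints nothing when the previous run's files were moved as expected.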

htdig/conf

ht://Dig configuration files for indexing and searching are located here:
 
htdig.conf is the default configuration file, used to create the UGA primary database.

htdig-u.conf is based on the default configuration file. It sets a different common_dir for general use (more on common_dir in htdig/common) in conjunction with the restrict variable in the HTML form which invokes htsearch. The UGA primary database is searched, but results will match the value of restrict. Example (the changed lines are the config and restrict inputs):

<form method="post" 
         action="/ss-bin/htsearch">
<font 
         size=-1 
         face="helvetica,arial,universal">
Search UGA Websites:<br>
</font>
<font 
         class="FORM" 
         size=-1 
         face="helvetica,arial,universal">
<input 
         type="text" 
         size=15 
         name="words" 
         value="">
</font>
<input 
         type="submit" 
         value="Go">
<input 
         type=hidden 
         name=method 
         value=and>

<input type=hidden 
          name=config value=htdig-u>
<input type=hidden 
          name=restrict 
          value="http://www.coe.uga.edu/coenews/">

<input 
          type=hidden 
          name=exclude 
          value="">
</form>
Notice the use of an absolute URL for the value of restrict. This is required: it restricts results to a particular site and also allows the results pages to link to that site by using the value of restrict (more on the results pages in htdig/common).

Additional documentation provides more complete instructions on the use of restrict.

The remaining configuration files not owned by root are maintained by departmental webmasters (with the assistance of the UWC). These webmasters are encouraged to use restrict if at all possible. The *t.conf configuration files are provided to the departmental webmasters for testing purposes.

htdig/common
htdig/common-u

Each directory contains a group of similar files. The .html files are used to customize the output on the search results pages. The files used in each directory are:
footer.html
header.html
nomatch.html
syntax.html
The other .html files are output templates which could be used in addition to, or in place of, the ones above. The ht://Dig website has more information on these other templates.

common is used to customize output for the search results pages used on the main UGA pages (e.g., a search performed from the UGA homepage). common-u is used in conjunction with the restrict variable, as described in htdig/conf (as in this example: WWW Pages for Departments, Organizations, and Units). The graphic used for the results pages is different and the .html files in common-u include a URL to the restrict value (absolute URL required):

<a href="$(RESTRICT)">$(RESTRICT)</a>
The non-HTML files (most of which end in .db) are the endings and synonym databases. These files are semi-static and are not rebuilt each time a new index is created. Note, however, that when new .db files are created, they are created only in common and are linked into common-u.

htdig/db

The primary database files for the University of Georgia are located here:

Subsequent to each rundig, a set of alternate work files with .work appended to the name of the file is created. The UWC is responsible for moving the .work files to the production equivalent (removing the .work extension).

It can take as long as 36 to 48 hours for rundigt to complete. The UWC should use this time estimate, the relative size of the .work files, and the system top command (to ensure that neither htdig nor htmerge is running) to determine when indexing has finished and the .work files can safely be moved.
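Moving the .work files to their production names is done by hand. A small helper along the following lines could reduce typing errors. It is a sketch under the assumption that every file ending in .work in the given directory belongs to the new index; the function name is hypothetical. Back up the current production files first, and confirm that neither htdig nor htmerge is still running.

```shell
#!/bin/sh
# Sketch only: rename every NAME.work in directory $1 to NAME,
# replacing any existing production file of the same name.

promote_work_files() {
    dir=$1
    for f in "$dir"/*.work; do
        [ -e "$f" ] || continue      # no .work files: nothing to do
        mv "$f" "${f%.work}"         # strip the trailing .work
    done
}
```

It would be invoked as, for example, `promote_work_files /usr/www/ss/htdig/db`.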

Total disk space required for the indexes is approximately four times the size of the set of production database files: the set itself, the .work files, the temporary files used for sorting and merging (as defined in TMPDIR in rundig), and a backup set of the production database files.
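The four-times rule of thumb can be turned into a quick estimate. The sketch below is illustrative only; the function names are hypothetical, and the factor of four is the approximation given above, not a measured value.

```shell
#!/bin/sh
# Sketch only: estimate index disk space as four times the production set.

# $1 = size of the production database files in kilobytes.
index_space_needed_kb() {
    echo $(( $1 * 4 ))
}

# Size in kilobytes of everything under directory $1, as reported by du.
dir_size_kb() {
    du -sk "$1" | awk '{ print $1 }'
}
```

For example, `index_space_needed_kb 250000` prints 1000000. When measuring with `dir_size_kb`, point it at a directory holding only the production set, since .work files and backups stored alongside would inflate the figure.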

Compiling, Testing, and Installing New Versions of ht://Dig

Documentation included with the distribution of ht://Dig and documentation available at http://www.htdig.org provide a valuable resource of information regarding current production and new versions of ht://Dig. This documentation, coupled with attention to the procedures presented below, should effect a successful compile, test, and install of new versions of ht://Dig.

A mirror account and CGI directory are available for testing new versions of ht://Dig. The account is ssd with a home directory of:

/usr/www/ssd
The CGI directory is bin within this directory and the Apache server configuration file /usr/local/apache//conf/httpd.conf includes the following ScriptAlias directive to enable this directory as a CGI directory:
ScriptAlias /ssd-bin/ /usr/www/ssd/bin/
Be sure to remove the public_html, htdig, and bin directories before installing a new version of ht://Dig in /usr/www/ssd. This will ensure a new initial set of binaries and related files.

Test Compile and Install (SSD)

Acquire the new version of ht://Dig from http://www.htdig.org and place the compressed version in:
/usr/local/src/htdig
Ungzip and untar the file here; cd to the newly created directory; read README for general installation instructions. (Also see these local instructions for any special considerations, should they exist.) Follow the pointer in README to the installation document.
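The ungzip and untar step can be done in one pipeline. The following is a minimal sketch; the function name is hypothetical, and the name of the downloaded archive will depend on the version retrieved.

```shell
#!/bin/sh
# Sketch only: unpack a gzip-compressed tar archive into a directory.
# $1 = absolute path to the archive, $2 = directory to unpack into.

unpack_source() {
    tarball=$1
    dest=$2
    # The subshell keeps the cd from affecting the caller; the archive
    # path must be absolute because it is read after the cd.
    ( cd "$dest" && gzip -dc "$tarball" | tar -xf - )
}
```

With a downloaded archive in place, this would be run as `unpack_source /usr/local/src/htdig/ARCHIVE.tar.gz /usr/local/src/htdig`, where ARCHIVE stands for the actual file name.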

Run:

configure
as described in the installation document.

When configure has completed, edit CONFIG. There are several values that need to be changed in CONFIG and should be changed only after reviewing the current production version's CONFIG file. Change the new version's CONFIG file with respect to the mirror account -- replacing ss with ssd throughout the file.
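A blind search-and-replace of ss would also alter unrelated words, so it is safer to replace only the account name where it appears in a path. The sketch below assumes that the relevant values in CONFIG are paths beginning with /usr/www/ss; the function name is hypothetical, and the result should still be reviewed by hand against the production CONFIG.

```shell
#!/bin/sh
# Sketch only: rewrite /usr/www/FROM to /usr/www/TO on standard input,
# leaving longer names that merely start with FROM untouched.
# $1 = FROM account (e.g. ss), $2 = TO account (e.g. ssd).

retarget_account() {
    from=$1
    to=$2
    sed -e "s|/usr/www/$from\([^[:alnum:]_]\)|/usr/www/$to\1|g" \
        -e "s|/usr/www/$from\$|/usr/www/$to|"
}
```

For the test install, `retarget_account ss ssd < CONFIG > CONFIG.new` writes an edited copy that can be compared with the original using diff before it replaces CONFIG. The same function with the arguments reversed covers the later production install.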

After editing CONFIG, run

make
It may take a little while, but all the binaries should build without incident.

After make, run

make install
to install the ht://Dig components as specified in CONFIG.

Testing (SSD)

After the new version of ht://Dig has been installed, the UWC should work within the ssd directory to ensure that a new set of searchable databases can be built.

HTML forms located in:

/usr/local/apache//htdocs/search/test
and accessed via the URL:
http://www.uga.edu/search/test
can be used to test:
  • the complete UGA primary database (configuration file htdig)
  • a restricted portion of the UGA primary database (configuration file htdig-ut and an arbitrary value for restrict)
  • a departmental database (configuration file ucnst, but others are equally good candidates)
Any important files and directories should be copied from ss to ssd. These include:
  • public_html/local (directory contains UGA images used on the results pages)
  • htdig/common and htdig/common-u (after backing up the new version's htdig/common)
  • htdig/conf (being mindful of permissions and backing up the new version's equivalents)
  • htdig/bin/rundig and htdig/bin/rundigt (after backing up the new version's rundig)
  • htdig/bin/msg (file containing mail message sent to UWC when .work files are found)
New version testing consists of building a new database and testing it with the test forms. First, build a small database and test it with the test forms. Use rundigt and test.conf in /usr/www/ssd/htdig/conf. Make sure that each of these files references:
ssd
and NOT:
ss
This is particularly important with respect to where the database files are to be written, since it would be possible to overwrite the production database files. The test database files should be written to:
/usr/www/ssd/htdig/db
and rundigt should be run from:
/usr/www/ssd/htdig/bin
initially as:
./rundigt -c ../conf/test.conf &
and subsequently as:
./rundigt -a -c ../conf/test.conf &
Keep in mind that the -a creates an "alternate" set of work files named *.work. These files must be moved to names with .work removed.
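Before running rundigt, the check that its files reference ssd and not ss can be partly automated. The following is a sketch that assumes production paths always appear as /usr/www/ss; the function name is hypothetical, and a clean result does not replace reading the files.

```shell
#!/bin/sh
# Sketch only: detect references to the production account path in a file.
# Succeeds (exit status 0) if file $1 mentions /usr/www/ss as a complete
# path component; /usr/www/ssd does not count as a match.

references_production() {
    grep -E '/usr/www/ss([^[:alnum:]_]|$)' "$1" >/dev/null
}

# Example use, from /usr/www/ssd/htdig/bin:
#   for f in rundigt ../conf/test.conf; do
#       references_production "$f" && echo "$f still points at ss"
#   done
```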

If testing indicates that the HTML which invokes htsearch requires modification, the UWC should be aware that this HTML may be incorporated into other main UGA pages for which the UWC is responsible (in addition to the homepage). These pages can be located by using the file system search tool glimpse and searching for the word htsearch.

After a successful test, and if it is determined that significant changes will result with an upgrade to the new version, inform the departmental webmasters with their own configuration files of new version testing. Offer assistance if anyone wishes to participate. Also mail the UGA Webmasters discussion list (ugawww@listserv.uga.edu) to announce testing and solicit comments and participation from this group. Subsequent mailings may also be required during installation if significant changes result from the upgrade. The level of required communication is at the discretion of the UWC and should be proportional to the assessed impact of upgrade to the new version.

Build a complete UGA primary database and test it with the test forms. If successful, be sure to keep the database files. The database files will be used as the production UGA primary database when the new version of ht://Dig is installed as the production version.

Production Compile and Install

After a UGA primary database built with the new version of ht://Dig has been successfully tested, it can be used to effect a smooth transition from the current production version to the new version.

Begin the transition by informing the UGA Webmasters discussion list (ugawww@listserv.uga.edu) of when the transition will take place. Done properly there should be no interruption of service to searches performed from the UGA homepage and all subsidiary searches, including those using restrict.

Service disruptions are possible for departmental webmasters with their own configuration files, since existing databases are often not compatible with new versions of ht://Dig. However, since these are typically relatively small databases, they can be rebuilt in a matter of minutes. The impact of the interruption can be mitigated by short-term use of restrict.

Proceed as follows:

  1. Copy rundigt in ssd/bin to rundig. Copy test.conf in ssd/conf to htdig.conf. Synchronizing the test file names with their production equivalents will facilitate the transition to the new version.

  2. Edit /usr/local/apache//conf/httpd.conf and change:

    ScriptAlias /ss-bin/ /usr/www/ss/bin/
    ScriptAlias /ssd-bin/ /usr/www/ssd/bin/

    to

    ScriptAlias /ss-bin/ /usr/www/ssd/bin/
    ScriptAlias /ssd-bin/ /usr/www/ss/bin/
    
    and restart httpd. Swapping the two actual locations (the second argument of each directive) makes ssd the actively used version and creates a temporary ScriptAlias for the new version's permanent location (ss). Perform a few searches from the UGA homepage; they should be successful. A few images may appear broken (due to hardcoded paths to images in ss/images). When the new version is installed in ss, the images will reappear.

  3. Back up /usr/www/ss using tar. Something like this from /usr/www:
    tar -cvf /tmp/ss.tar ss
    
    Move ss.tar to a permanent location at the completion of the new ht://Dig install.

  4. Remove ss:
    userdel -r ss
    
    and recreate the account as any new account would be created (with the UWC as the website administrator). This ensures a clean install location.

  5. Edit the new version's CONFIG file with respect to the production account -- replacing ssd with ss throughout the file.

    After editing CONFIG, run

    make
    
    It may take a little while, but all the binaries should build without incident.

    After make, run:

    make install
    
    to install the ht://Dig components as specified in CONFIG.

  6. Copy these files and directories from ssd to ss:

    • public_html/local (directory contains UGA images used on the results pages)
    • htdig/common and htdig/common-u
    • htdig/conf (being mindful of permissions)
    • htdig/bin/rundig and htdig/bin/rundigt
    • htdig/bin/msg (file containing mail message sent to UWC when .work files are found)
    • all the database files in htdig/db

    Check all files to ensure that changes necessitated by the new version are applied. Pay particular attention to rundig and htdig.conf. Make sure that rundig and htdig.conf reference:

    ss
    
    and NOT:
    ssd
    

    There is no need to create backup files in ss because ssd serves as a backup area to ss after the above files are copied. System backups are also available for all files located in ssd and ss.

  7. Use the Test forms to test the new version in its ss location.

    Do not proceed with the next step unless tests are successful.

  8. Restore the ScriptAlias directives to reflect the normal production/test status:
    ScriptAlias /ss-bin/ /usr/www/ss/bin/
    ScriptAlias /ssd-bin/ /usr/www/ssd/bin/
    
    and restart httpd.

  9. Inform ugawww@listserv.uga.edu and the departmental webmasters with their own configuration files that installation of the new version is complete. If requested, be available to provide assistance to the departmental webmasters.
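Steps 2 and 8 above both edit the ScriptAlias directives, and it is easy to lose track of which directory /ss-bin/ currently points to. A quick way to confirm the state of the configuration file before restarting httpd is sketched below. The function name is hypothetical, and the sketch assumes each directive sits on its own line in the usual three-field form.

```shell
#!/bin/sh
# Sketch only: print the directory that a ScriptAlias maps to.
# $1 = path to httpd.conf, $2 = URL prefix (e.g. /ss-bin/).
# Commented-out directives are ignored because their first field
# is not exactly "ScriptAlias".

script_alias_target() {
    awk -v alias="$2" '$1 == "ScriptAlias" && $2 == alias { print $3 }' "$1"
}
```

After step 2, `script_alias_target` called with the server configuration file and `/ss-bin/` should print /usr/www/ssd/bin/; after step 8 it should print /usr/www/ss/bin/ again.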

Contingency Plan

After a functioning UGA primary database has been created, there are very few circumstances that can cause searches to fail. When searches do fail, it can most likely be attributed to corrupted database files. Though rare if the UWC exercises reasonable caution, a corrupted database can result in hours of downtime for the search service.

As mentioned earlier, rundig creates temporary work files. Before these files are moved to their production equivalents, back up the current production files and remove old backups. Always keep one generation backed up.
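The backup step can be sketched as follows. This is an illustration only: the function name and the choice of a separate backup directory are assumptions, and the function deletes the files already in that directory so that exactly one generation is kept.

```shell
#!/bin/sh
# Sketch only: keep a single backup generation of the production database.
# $1 = database directory, $2 = backup directory (its existing files are
# removed first, so do not point it at anything else).

backup_production_db() {
    db_dir=$1
    backup_dir=$2
    mkdir -p "$backup_dir" || return 1
    rm -f "$backup_dir"/*                # remove the previous generation
    for f in "$db_dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories
        case $f in
            *.work) ;;                   # skip files not yet promoted
            *) cp -p "$f" "$backup_dir"/ || return 1 ;;
        esac
    done
}
```

Run it before the .work files are moved, so that the backup holds the last known-good production set.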

As a last resort in the event that the newest iteration of database files as well as the backups fail, modify the hop count in rundig from:

-h 6
to
-h 3

and then run the program "by hand":

rundig 
This will create a small, but functional database in a relatively short period of time. When the small database is created, reset the hop count to 6 and run rundig -a. As documented earlier, the -a option creates an alternate set of .work files which must be moved to the production equivalent (removing the .work extension).