Diffstat (limited to 'debian/htdig/htdig-3.2.0b6/htdoc/htdig.html')
-rw-r--r-- | debian/htdig/htdig-3.2.0b6/htdoc/htdig.html | 256 |
1 files changed, 256 insertions, 0 deletions
diff --git a/debian/htdig/htdig-3.2.0b6/htdoc/htdig.html b/debian/htdig/htdig-3.2.0b6/htdoc/htdig.html new file mode 100644 index 00000000..0416c90b --- /dev/null +++ b/debian/htdig/htdig-3.2.0b6/htdoc/htdig.html @@ -0,0 +1,256 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> +<html> + <head> + <title> + ht://Dig: htdig + </title> + </head> + <body bgcolor="#eef7ff"> + <h1> + htdig + </h1> + <p> + ht://Dig Copyright © 1995-2004 <a href="THANKS.html">The ht://Dig Group</a><br> + Please see the file <a href="COPYING">COPYING</a> for + license information. + </p> + <hr size="4" noshade> + <dl> + <dd> + <h2> + Synopsis + </h2> + </dd> + <dd> + htdig [<em>options</em>] [<em>start_url_file</em>] + </dd> + </dl> + <dl> + <dd> + <h2> + Description + </h2> + </dd> + <dd> + Htdig retrieves HTML documents using the HTTP protocol and + gathers information from these documents which can later be + used to search these documents. This program can be + referred to as the search robot. + </dd> + </dl> + <dl> + <dd> + <h2> + Options + </h2> + </dd> + <dd> + <dl compact> + <dt> + -a + </dt> + <dd> + Use alternate work files. Tells htdig to append <em> + .work</em> to database files, causing a second copy of + the database to be built. This allows the original + files to be used by htsearch during the indexing run. When + used without the "-i" flag for an update dig, htdig will + use any existing .work files for the databases to update. + </dd> + <dt> + -c <em>configfile</em> + </dt> + <dd> + Use the specified <em>configfile</em> file instead of the + default. + </dd> + <dt> + -h <em>maxhops</em> + </dt> + <dd> + Restrict the dig to documents that are at most <em> + maxhops</em> links away from the starting document. + </dd> + <dt> + -i + </dt> + <dd> + Initial. Do not use any old databases. This is + accomplished by first erasing the databases. + </dd> + <dt> + -m <em>url_file</em> + </dt> + <dd> + Minimal. 
Index only the URLs listed in + <em>url_file</em> and no others. + A file name of "-" reads from STDIN. + See also the <em>start_url_file</em> argument. + </dd> + <dt> + -s + </dt> + <dd> + Print statistics about the dig after completion. + </dd> + <dt> + -t + </dt> + <dd> + Create an ASCII version of the document database. This + database is easy to parse with other programs so that + information can be extracted from it for purposes other + than searching. One could gather some interesting + statistics from this database. + <p>Each line in the file starts with the document id + followed by a list of + <strong>\t<em>fieldname</em>:<em>value</em></strong>. + The fields always appear in the order listed below: + </p> + <table border=0> + <tr> <th>fieldname</th><th>value</th></tr> + <tr> <td>u</td><td>URL</td></tr> + <tr> <td>t</td><td>Title</td></tr> + <tr> <td>a</td><td>State (0 = normal, 1 = not found, 2 + = not indexed, 3 = obsolete)</td></tr> + <tr> <td>m</td><td>Last modification time as reported + by the server</td></tr> + <tr> <td>s</td><td>Size in bytes</td></tr> + <tr> <td>H</td><td>Excerpt</td></tr> + <tr> <td>h</td><td>Meta description</td></tr> + <tr> <td>l</td><td>Time of last retrieval</td></tr> + <tr> <td>L</td><td>Count of the links in the document + (outgoing links)</td></tr> + <tr> <td>b</td><td>Count of the links to the document + (incoming links or backlinks)</td></tr> + <tr> <td>c</td><td>HopCount of this document</td></tr> + <tr> <td>g</td><td>Signature of the document used for + duplicate-detection</td></tr> + <tr> <td>e</td><td>E-mail address to use for a + notification message from htnotify</td></tr> + <tr> <td>n</td><td>Date to send out a notification + e-mail message</td></tr> + <tr> <td>S</td><td>Subject for a notification e-mail + message</td></tr> + <tr> <td>d</td><td>The text of links pointing to this + document. (e.g. &lt;a + href="docURL"&gt;description&lt;/a&gt;)</td></tr> + <tr> <td>A</td><td>Anchors in the document (i.e. 
&lt;A + NAME=...&gt;)</td></tr> + </table> + </dd> + <dt> + -u <em>username:password</em> + </dt> + <dd> + Tells htdig to send the supplied username and password + with each HTTP request. The credentials will be encoded + using the 'Basic' authentication scheme. There <strong> + HAS</strong> to be a colon (:) between the username and + password. + </dd> + <dt> + -v + </dt> + <dd> + Verbose mode. This increases the verbosity of the + program. Using more than 2 is probably only useful for + debugging purposes. The default verbose mode (using + only one -v) gives a nice progress report while + digging. This progress report can be a bit + cryptic, so here is a brief explanation. A line + is shown for each URL, with 3 numbers before the + URL and some symbols after the URL. The first + number is the number of documents parsed so + far, the second is the DocID for this document, + and the third is the hop count of the document + (number of hops from one of the start_url + documents). After the URL, it shows a "*" for + a link in the document that it already visited, + a "+" for a new link it just queued, and a "-" + for a link it rejected for any of a number of + reasons. To find out what those reasons are, + you need to run htdig with at least 3 -v options, + i.e. -vvv. If there are no "*", "+" or "-" symbols + after the URL, it doesn't mean the document was + not parsed or was empty, but only that no links + to other documents were found within it. With + more verbose output, these symbols will get + interspersed in several lines of debugging output. + </dd> + <dt> + <em>start_url_file</em> + </dt> + <dd> + A file containing a list of URLs to start indexing + from, or "-" for STDIN. This will augment the default + <a href="attrs.html#start_url">start_url</a> + and override the file supplied to + [-m <em>url_file</em>]. 
+ </dd> + </dl> + </dd> + </dl> + <dl> + <dd> + <h2> + Files + </h2> + </dd> + <dd> + <dl> + <dt> + <a href="attrs.html#config_dir">CONFIG_DIR</a>/htdig.conf + </dt> + <dd> + The default configuration file. + </dd> + </dl> + <dl> + <dt> + <a href="attrs.html#database_dir">DATABASE_DIR</a>/db.docdb + </dt> + <dd> + Stores data about each document (title, URL, etc.). + </dd> + </dl> + <dl> + <dt> + <a href="attrs.html#database_dir">DATABASE_DIR</a>/db.words.db, + <a href="attrs.html#database_dir">DATABASE_DIR</a>/db.words.db_weakcmpr + </dt> + <dd> + Record which documents each word occurs in. + </dd> + </dl> + <dl> + <dt> + <a href="attrs.html#database_dir">DATABASE_DIR</a>/db.excerpts + </dt> + <dd> + Stores the start of each document to show the context of + matches. + </dd> + </dl> + </dd> + </dl> + <dl> + <dd> + <h2> + See Also + </h2> + </dd> + <dd> + <a href="htmerge.html">htmerge</a>, + <a href="htsearch.html" target="_top">htsearch</a>, + <a href="attrs.html">Configuration file format</a>, and + <a href="http://www.robotstxt.org/wc/norobots.html"> + A Standard for Robot Exclusion</a>. + </dd> + </dl> + <hr size="4" noshade> + + Last modified: $Date: 2004/06/12 13:39:13 $ + + </body> +</html>
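Note on the -t dump documented in the added file: each line is a document id followed by tab-separated fieldname:value pairs, so it can be split with a few lines of Python. The following is a minimal sketch, not part of ht://Dig. The function name parse_dump_line and the sample record are hypothetical, and the sketch assumes that values contain no tab characters. Values for each field are collected into a list in case a field name occurs more than once on a line, which the documentation does not rule out.

```python
from collections import defaultdict


def parse_dump_line(line):
    """Split one line of the ASCII dump into (doc_id, fields).

    Assumes the layout described in the documentation: a document id,
    then tab-separated fieldname:value pairs, with no tabs inside values.
    """
    doc_id, *pairs = line.rstrip("\n").split("\t")
    fields = defaultdict(list)
    for pair in pairs:
        # Split on the first colon only: values such as URLs contain colons.
        name, sep, value = pair.partition(":")
        if not sep:
            continue  # skip anything that is not fieldname:value
        fields[name].append(value)
    return doc_id, dict(fields)


# Hypothetical sample record, made up for illustration.
sample = "7\tu:http://www.example.com/\tt:Example page\ta:0\ts:1024\tc:1"
doc_id, fields = parse_dump_line(sample)
print(doc_id, fields["u"][0], fields["s"][0])
```

Splitting on the first colon only is the one design choice that matters here, since the u field holds a URL that itself contains colons.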