                      Copyright 1996, Hyperion Softword

                    *************************************
                    *      Orpheus 2, version 2.30      *
                    *************************************

                Comments and queries to:  Hyperion Softword,
              535 Irene-Couture, Sherbrooke, QC J1L 1Y8, Canada
                    tel/fax - 819-566-6296 (Rod Willmot)
                       email - willmot@interlinx.qc.ca


Contents of this file:  Purpose
                        Usage
                        Registration Requirements
                        Project Requirements
                        Memory and Disk Requirements
                        Index Order
                        Output File 1 - ALLWORDS.LOG
                        Making an Exclusion List
                        Making an Inclusion List
                        Output File 2 - projname.IDX
                        Reader Interface - Full-text Search
                        Foreign Language Support

Last revised:  11/17/96


PURPOSE:
========

    OHINDEX.EXE (The Orpheus Indexer) is a utility for indexing finished
    documents created with Orpheus.  The resulting IDX file enables full-text
    search in the Reader.  Note that the Indexer belongs to the feature set
    of Orpheus Professional; see "Registration Requirements" below.
    To make sure you have a valid IDX, you should re-index whenever you build
    a new finished copy of your project.


USAGE:
======

        OHINDEX [/options] projname[.prj]

    The command-line must include the name of an existing project.  For
    example, if the project is TEST (with a project file named TEST.PRJ),
    a minimal command-line would be "OHINDEX test".  The project must be
    in at least semi-finished state, with a valid NODELIST file in the
    project directory; see "Project Requirements" below.

    Command-line switches:

      /a       - allow high-ascii characters that may be accented letters
                 in a particular language; default is to treat them as spaces.
                 See "Foreign Language Support" elsewhere in this document,
                 and the following switch...
      /e[filespec]
               - use an exclusion list to specify words that you wish to
                 leave out of the index.  If no filespec is given the program
                 looks for EXCWORDS.TXT, first in the current directory, then
                 in the project directory, and finally (if still not found)
                 in the directory containing OHINDEX.EXE.  If you do give a
                 filespec the file may be named whatever you please; the
                 filespec can include a drive/path if different from the
                 current directory.  See "Making an Exclusion List" below.
      /h       - include hyphenated words; default is to treat a hyphen like
                 a space, e.g. indexing "helter-skelter" as two words,
                 "helter" and "skelter".  If you use the /h switch the index
                 treats "helter-skelter" as one word (unless the hyphen is
                 on a line-break).
      /i[filespec]
               - use an inclusion list to specify word *combinations* that
                 you wish to be indexed.  If no filespec is given the program
                 looks for INCWORDS.TXT, first in the current directory, then
                 in the project directory, and finally (if still not found)
                 in the directory containing OHINDEX.EXE.  If you do give a
                 filespec the file may be named whatever you please; the
                 filespec can include a drive/path if different from the
                 current directory.  See "Making an Inclusion List" below.
      /m[#]    - specify the minimum length of words to include; default
                 is 3 (leaving out "he", "it", "to"...).  For example,
                 /m4 sets a minimum word length of 4 characters.
      /n       - include numbers (and words beginning with numbers); default
                 is to leave them out.  For example, "2001" is normally
                 left out, as is "21st".  When number-words are included, the
                 characters ",-.:" (comma, hyphen, period, and colon) are
                 permitted inside them; for example, "1,1", "2-2", "3.3",
                 and "4:4" would all be indexed as words.
      /q       - quickly create ALLWORDS.LOG (see below) without storing
                 location data or building the final index (IDX file).
                 Uses half the memory of a normal run, or less; see
                 "Memory and Disk Requirements" below.
      /s[filespec]
               - strip accents from high-ascii characters according to a
                 conversion list; default is to leave them alone.  Use of this
                 switch automatically turns on the /a switch, but the reverse
                 is not true.  Foreign-language users MUST make use of this
                 feature to ensure correct sorting.  If no filespec is given
                 the Indexer uses a default conversion list.  See "Foreign
                 Language Support" elsewhere in this document.
      /t[#]    - specify the maximum number of location records to store for
                 a single word; default is 6000.  The top limit for this
                 number is 12000 due to memory demands in the Reader when
                 Boolean search is provided.  See "Memory and Disk
                 Requirements" below.

    Switches must be given before the project name, separated by a space,
    and you can use "-" instead of "/".
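
    The switch syntax described above can be sketched in a few lines of
    Python.  This is only an illustration of the rules as stated here
    (prefix "/" or "-", value attached without a space, switches before
    the project name, optional .prj extension); the function name and the
    dictionary it returns are assumptions of the sketch, not part of
    OHINDEX.

```python
def parse_ohindex_args(argv):
    """Split OHINDEX-style arguments into (switches, project name)."""
    switches = {}
    project = None
    for arg in argv:
        is_switch = len(arg) > 1 and arg[0] in "/-"
        if is_switch and project is None:
            # The letter follows the prefix; any value is attached directly,
            # e.g. "4" in /m4 or "mylist.doc" in /emylist.doc.
            switches[arg[1].lower()] = arg[2:]
        elif project is None:
            project = arg
        else:
            raise ValueError("nothing may follow the project name: " + arg)
    if project is None:
        raise ValueError("a project name is required")
    if project.lower().endswith(".prj"):
        project = project[:-4]  # the .prj extension is optional
    if "s" in switches:
        switches.setdefault("a", "")  # /s turns on /a, not the reverse
    return switches, project


print(parse_ohindex_args(["/m4", "-n", "/emylist.doc", "test.prj"]))
```

    An empty string stands for "switch given without a value", which is
    where the documented defaults (EXCWORDS.TXT, INCWORDS.TXT, the default
    conversion list) would come into play.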

    Examples:

        OHINDEX /q test
            Performs a "quick" index of the TEST project, but only so far as
            to make the ALLWORDS.LOG; it does not store location data or build
            the final index.  Uses half the memory of a normal run, or less.
            See "Memory and Disk Requirements" below.

        OHINDEX /m4 /n /h /s /a /e test
            Sets a minimum word length of 4 characters; includes numbers
            and hyphenated words (unless broken by a line-break); allows for
            words containing accented (high-ascii) letters and strips the
            accents according to the default conversion list; and uses the
            exclusion list EXCWORDS.TXT, which may be in the current directory,
            in the project directory, or in the directory containing the
            Indexer.  Indexes the TEST project.

        OHINDEX /emylist.doc test.prj
            Uses the exclusion list MYLIST.DOC in the current directory, and
            indexes the TEST project.  Uses the default settings of 3 for
            the minimum word length, leaves out numbers (and words beginning
            with numbers), and breaks hyphenated words into multiple words.
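
    To make the effect of /m, /h and /n concrete, here is a rough Python
    model of how text might be split into index words.  It is a sketch
    under assumptions, not the Indexer's actual algorithm:  it treats every
    character other than a-z and 0-9 as a separator (so high-ascii letters
    behave as spaces, as they do without /a), it applies the minimum length
    to number-words as well, and the name index_words is invented for the
    example.

```python
import re

# Number-words start with a digit and may contain , - . : between parts.
NUMBER_WORD = r"[0-9][0-9a-z]*(?:[,\-.:][0-9a-z]+)*"
PLAIN_WORD = r"[a-z]+"
# A hyphen only joins two parts when a letter follows it immediately,
# so a hyphen at a line-break still splits the word.
HYPHEN_WORD = r"[a-z]+(?:-[a-z]+)*"


def index_words(text, min_len=3, hyphens=False, numbers=False):
    """Return the lowercase words that would be kept, in text order."""
    word = HYPHEN_WORD if hyphens else PLAIN_WORD
    pattern = re.compile("(?P<num>" + NUMBER_WORD + ")|(?P<word>" + word + ")")
    found = []
    for match in pattern.finditer(text.lower()):
        if match.group("num") is not None and not numbers:
            continue  # numbers are left out unless asked for
        token = match.group(0)
        if len(token) >= min_len:
            found.append(token)
    return found


sample = "He saw 2001 in the 21st helter-skelter year"
print(index_words(sample))
print(index_words(sample, min_len=4, hyphens=True, numbers=True))
```

    The first call models the defaults; the second models /m4 /h /n.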


REGISTRATION REQUIREMENTS:
==========================

    OHINDEX.EXE is part of the Orpheus Professional feature set.  If you
    are registered at the standard level, you are welcome to evaluate the
    Indexer, but you may not distribute the IDX files with your finished works.
    (If the Reader opens an IDX file with a work that was not assembled by a
    user with an Orpheus Professional licence, it considers that work to have
    been created with a *shareware* copy of the software, and displays the
    "unregistered shareware" warning.)

    (If you do have an Orpheus Professional licence and the Indexer says
    you don't, make sure it can find your OHREG.KEY file.  It should be in
    the same directory as OHINDEX.EXE and your other Orpheus system files.)

    Please contact Hyperion Softword if you wish to upgrade to Orpheus
    Professional.


PROJECT REQUIREMENTS:
=====================

    Before you can index a project it must pass through the first two stages
    of project building:  compilation and link-verification.  (These are
    performed through OH.EXE's Build Project dialog, on the Project Menu.)
    To ensure that index data corresponds precisely to the contents of your
    finished work, the Indexer works with the compiled versions of your cards,
    which are stored in CMP files.  Just like the assembler (the final stage
    of project building), it uses the NODELIST file to select only those
    cards that belong in the finished work.  Only Text cards are indexed.

    Because the index gives the exact location of every instance of every
    word that is included in the index, you have to do your part to keep
    the index and HTX synchronized.  *Whenever you update the HTX you should
    regenerate the index immediately afterwards.*


MEMORY AND DISK REQUIREMENTS:
=============================

    The amount of memory required to index a project depends on how many
    unique words are included.  The amount of disk space required depends
    on the total word count being included.  During processing a large
    amount of temporary data is swapped to disk; since this can easily
    amount to the total uncompiled size of your project, you will need
    plenty of free disk space.

    An example:  at approximately half a megabyte of uncompiled text,
    the online Help for Orpheus contains over 55000 words of 3 or more
    characters.  Of that total the Indexer identifies some 3241 unique
    words.  On a normal run (no exclusions or other switches) it uses
    a maximum of about 128000 bytes of RAM.  On a "quick" run (with
    the /q switch), memory use falls by more than half, to under 61000
    bytes of RAM.

    The /q switch tells the Indexer not to store any data about the
    locations of words.  This frees up a substantial amount of memory
    and reduces disk use to a minimum.  If your project contains
    tens of thousands of unique words, the Indexer may not be able to
    get through to the end (on a normal run) without running out of memory.
    If this happens, try again with the /q switch.  This will at least
    generate an ALLWORDS.LOG which you can use to make an exclusion list;
    both of these are discussed below.  Providing the Indexer with a
    substantial exclusion list will free up a proportional amount of
    memory, and of course will slim down your final IDX file.

    You can reduce the size of the final output file by using the /t switch
    to set the maximum number of location records to store for a single word.
    The default for this setting is 6000.  Arguably, any word that occurs
    more often than some threshold is too common to be worth searching for
    under any circumstances.  What that threshold should be depends on the
    project and your expectations; it could be as low as 500 or as high as
    12000, the highest value currently permitted.

        The top limit is 12000 due to memory demands in the Reader.  With
        searches on a single word the Reader uses the same very modest amount
        of memory no matter how many records there are.  However, with
        multi-word or Boolean searches (planned for future development),
        memory use is exactly proportional to the number of records.


INDEX ORDER:
============

    The index is alphabetically sorted.  However, if numbers are included,
    such as "21st" or "2001", they are given first (in ascending order).  If
    high-ascii characters are included and the /s switch isn't used, any words
    that begin with an accented letter are given last (after the last letter
    of the regular alphabet).  Since sorting is by ascii-value, words
    beginning with accented characters may not be in the order expected for a
    particular language.  Using the /s switch corrects this (see
    "Foreign Language Support").

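    The ordering described in this section is what a plain comparison of
    character values produces.  The Python sketch below assumes the text is
    in code-page 437 and simply sorts on the encoded bytes; the function
    name index_order is invented for the example, and the real Indexer may
    differ in detail.

```python
def index_order(words, codepage="cp437"):
    """Sort words by the byte values of their characters."""
    return sorted(words, key=lambda word: word.encode(codepage))


# Digits sort before letters, and accented (high-ascii) letters after "z".
print(index_order(["zebra", "2001", "élan", "apple", "21st"]))
# prints ['2001', '21st', 'apple', 'zebra', 'élan']
```

    With accent-stripping, "élan" would already have become "elan" before
    sorting, and so would fall between "apple" and "zebra".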

OUTPUT FILE 1 - ALLWORDS.LOG:
=============================

    The first product of the Indexer is a file named ALLWORDS.LOG, which
    is placed in the project directory of the project being indexed.  This
    is a plain text file containing a complete list, in alphabetical order, of
    the words included in the index.  Each word is given on a separate line,
    followed by a space and a number; the number is the word's "hit count" --
    how many times it was encountered.  You can view this file with any
    text editor or file viewer, or even load it into OH.EXE.
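
    Because the format is so simple, the LOG is easy to process with a
    script.  The following Python sketch reads the "word, space, hit count"
    lines into a dictionary; the function name parse_allwords is an
    assumption, and lines that do not end in a number are skipped rather
    than reported.

```python
def parse_allwords(log_text):
    """Map each word in an ALLWORDS.LOG-style text to its hit count."""
    counts = {}
    for line in log_text.splitlines():
        # The hit count is whatever follows the last space on the line.
        word, _, hits = line.strip().rpartition(" ")
        word = word.strip()
        if word and hits.isdigit():
            counts[word] = int(hits)
    return counts


sample = "abandoned 1\nability 16\nable 13\nabort 6\n"
counts = parse_allwords(sample)
# List the most frequent words first.
for word in sorted(counts, key=counts.get, reverse=True):
    print(word, counts[word])
```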

    The ALLWORDS.LOG is generated for your use, not the Indexer's.  You may
    delete it if you wish, but it does have an important purpose -- to
    enable you to generate an exclusion list, discussed below.  When you use
    the /q switch on the command line, the Indexer *only* makes ALLWORDS.LOG.


MAKING AN EXCLUSION LIST:
=========================

    While developing the Indexer I tested it on the online Help for Orpheus.
    Here is a sample of the ALLWORDS.LOG from the preliminary output:

                            abandoned 1
                            ability 16
                            able 13
                            abort 6
                            aborting 3
                            aborts 1
                            absence 3
                            absolutely 1

    Obviously, none of those words has anything to do with a significant
    topic in Help.  Storing them in the index, with their location records,
    would require some 300 bytes of dataspace; storing the word "would"
    (with some 122 location records) would require over 600 bytes of
    dataspace.  Multiply by thousands for popular words like "then" and
    "there", and watch your disk fill up with useless data.

    The solution is to make an exclusion list:  a text file in the same
    format as the ALLWORDS.LOG, listing all of the words that you do NOT
    want in your index.  Please note the following specifications:

        * The exclusion list must give one word per line; anything after
          the first word on a line will be ignored.  Therefore, you can
          copy lines directly from the ALLWORDS.LOG into your list without
          having to remove the numbers.

        * Lines beginning with ";" or "/" are ignored.

        * The exclusion list can use both uppercase and lowercase IF you are
          sorting it by hand OR are using a sort utility that can ignore case
          (see next).  Though the Indexer itself ignores case (and renders all
          indexed words in lowercase), you are safer to use all lowercase.

        * The exclusion list must be in strict alphabetical order, as in
          ALLWORDS.LOG.  Please note that using a sort utility on a mixed-case
          file may not have the results you intend, unless the utility can
          accept a switch to treat upper and lowercase letters as identical.
          If you are including numbers but wish to exclude specific numbers
          (or words beginning with numbers), give them first and in ascending
          order.  If you are including high-ascii characters but wish to
          exclude specific words beginning with high-ascii characters, give
          them last.

        * If you are using accented characters (e.g. in French or Dutch)
          and are using accent-stripping with the /s switch, the exclusion
          list MUST NOT contain any accents.  This is because accent-stripping
          is applied to the text being indexed *before* the test for exclusion.
          If you intend to build your exclusion list using ALLWORDS.LOG, do so
          with accent-stripping enabled right from the start, since the LOG
          will then contain no accents.  If you obtain a word list from some
          other source, you may need to use a word processor to convert the
          accented characters if any.

        * The Indexer will only use your exclusion list if you tell it to,
          with the /e switch on the command line.

        * The exclusion list may be as short as a single word, but cannot be
          longer than 65535 bytes.  I may increase capacity if there is
          demand for it.
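
    Several of these specifications can be checked mechanically before
    running the Indexer.  The Python sketch below is one way to do so,
    under stated assumptions:  it compares words by their code-page 437
    byte values after lowercasing, it measures the size of the text as
    passed in (a DOS file with CR/LF line ends will be slightly larger),
    and the function names are invented for the example.

```python
MAX_EXCLUSION_BYTES = 65535


def exclusion_words(list_text):
    """Return the first word of each line, skipping comments and blanks."""
    words = []
    for line in list_text.splitlines():
        if not line.strip() or line[0] in ";/":
            continue
        words.append(line.split()[0].lower())
    return words


def check_exclusion_list(list_text, codepage="cp437"):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if len(list_text.encode(codepage)) > MAX_EXCLUSION_BYTES:
        problems.append("list is longer than 65535 bytes")
    words = exclusion_words(list_text)
    keys = [word.encode(codepage) for word in words]
    for previous, current, word in zip(keys, keys[1:], words[1:]):
        if current < previous:
            problems.append("out of order: " + word)
    return problems


print(check_exclusion_list("ability 16\nable 13\n; a comment\nabort 6\n"))
print(check_exclusion_list("able 13\nability 16\nabort 6\n"))
```

    The first call reports nothing; the second reports "ability", which
    should come before "able".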

    A simple way to create an exclusion list is to run the Indexer once on
    your project (use the /q switch to do this quickly), then load the
    resulting ALLWORDS.LOG in OH.EXE or any text editor, and do one of the
    following:

          - either copy selected lines into a separate EXCWORDS.TXT file...
          - or *delete* from the LOG whatever words you do want in the
            index, leaving behind all those that you don't.  Be sure to
            rename the resultant file EXCWORDS.TXT to prevent the Indexer
            from overwriting it later.
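
    The second approach (delete what you want to keep, and what remains is
    the exclusion list) can also be scripted.  A minimal Python sketch,
    assuming you can supply the words you want indexed as a set; the
    function name make_exclusion_list is invented for the example.  Since
    the LOG is already in sorted order, the result is too.

```python
def make_exclusion_list(log_text, keep):
    """Drop the lines for wanted words; return the rest as list text."""
    wanted = {word.lower() for word in keep}
    excluded = []
    for line in log_text.splitlines():
        fields = line.split()
        if fields and fields[0].lower() not in wanted:
            excluded.append(line.strip())
    return "".join(line + "\n" for line in excluded)


log = "abandoned 1\nability 16\nable 13\nabort 6\naborting 3\naborts 1\n"
# Keep the "abort" family in the index; exclude everything else.
print(make_exclusion_list(log, {"abort", "aborting", "aborts"}), end="")
```

    The hit counts may stay in the output, since the Indexer ignores
    anything after the first word on a line.  Save the result under a name
    other than ALLWORDS.LOG, for instance EXCWORDS.TXT.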

    The default filename for the exclusion list is EXCWORDS.TXT.  If you
    use the /e switch without a filespec, the Indexer looks for EXCWORDS.TXT
    in the current directory; if the file is not there, it looks in the
    project directory for the project being indexed; and if not there, it
    looks in the directory containing OHINDEX.EXE (in case that is different).
    If you do give a filespec the Indexer looks for exactly the file you
    specify.
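
    The search order just described can be expressed as a short Python
    sketch.  The function name, its parameters, and the "exists" hook (used
    so the example can run without touching the disk) are all assumptions
    made for illustration.

```python
import os


def find_list_file(default_name, filespec, current_dir, project_dir,
                   indexer_dir, exists=os.path.isfile):
    """Return the path of the list file to use, or None if none is found."""
    if filespec:
        # An explicit filespec is taken exactly as given.
        return filespec if exists(filespec) else None
    # Otherwise try the three directories in the documented order.
    for directory in (current_dir, project_dir, indexer_dir):
        candidate = os.path.join(directory, default_name)
        if exists(candidate):
            return candidate
    return None


# Pretend the file exists in the project and Indexer directories only.
on_disk = {os.path.join("proj", "EXCWORDS.TXT"),
           os.path.join("orpheus", "EXCWORDS.TXT")}
print(find_list_file("EXCWORDS.TXT", "", "work", "proj", "orpheus",
                     exists=on_disk.__contains__))
```

    Here the copy in the project directory wins, because the current
    directory has none and the Indexer's own directory is tried last.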

    Once you have created an exclusion list you can continue to extend it
    for use with the same project or with other projects.  It doesn't matter
    if the list contains words that are not even used by a project.  What
    does matter is that the list be in strict sorted order as discussed above.
    If the list falls out of order or if you are uncertain how to sort
    accented characters, you can easily sort the file by using the DOS
    SORT command.  (See DOS help on SORT.EXE for details.)


MAKING AN INCLUSION LIST:
=========================

    For special purposes you may wish to include word combinations in the
    index.  This can be done with the aid of an inclusion list, using the
    /i switch on the command-line together with a file named INCWORDS.TXT
    (created by you with any text editor).  For example, in a legal work
    references to articles of the law may take the form "Article 2" or
    "Art. 2", and it would be useful to have the index list all occurrences
    of "Art." or "Article" *together with* those numbers.  Somewhat
    differently, a work on beverages might include references to drinks
    whose names consist of more than one word, such as "Harvey Wallbanger".

    An inclusion list to handle these examples would go like this:

        "art. "+
        "article "+
        "harvey wallbanger"

    The rules are simple:  the main part of the combination must be
    enclosed in quotation marks.  In the case of "harvey wallbanger", that's
    all there is.  In the other two cases, the "+" (plus sign) after the
    closing quotation mark tells the Indexer to add to the combination any
    additional characters UP TO THE END OF THE WORD.  Since "art. " and
    "article " each end with a space, this means that anything AFTER the
    space will be treated as belonging to the same word.  Thus "Art. 2" will
    be included, as will "Article Fifty-nine" and so on.

        NOTE:  In some cases you will need to add other switches to make
        sure that everything is included as desired.  Adding the /h switch
        ensures that hyphenated words are allowed.  Adding the /n switch
        ensures that numbers are allowed, and that a number-word may include
        such characters as "-,.:".  Thus, with the /n switch and the "art. "+
        entry in the inclusion list, "Art. 2.20.17" would be indexed as
        if it were a single word.

    Please note the following additional specifications:

        * The inclusion list must give one entry per line.  No leading
          spaces are allowed, and the entry must be enclosed in quotation
          marks, optionally followed by a "+" sign as discussed above.

        * Lines beginning with ";" or "/" are ignored.

        * The inclusion list can use both uppercase and lowercase, IF you are
          sorting it by hand OR are using a sort utility that can ignore
          case.  Though the Indexer itself ignores case (and renders all
          indexed words in lowercase), you are safer to use all lowercase.

        * The inclusion list must be in strict alphabetical order, as in
          ALLWORDS.LOG and the exclusion list, *except* in the case of
          multiple-word entries sharing the same initial word or words;
          this exception is discussed below.  Remember that a sort utility
          may not consider an uppercase letter to be the same as its lowercase
          equivalent, producing strange results.

        * If you are using accented characters (e.g. in French or Dutch)
          and are using accent-stripping with the /s switch, the inclusion
          list MUST NOT contain any accents.  This is because accent-stripping
          is applied to the text being indexed *before* the test for inclusion.

        * The Indexer will only use your inclusion list if you tell it to,
          with the /i switch on the command line.

        * The inclusion list may be as short as a single line, but may
          not be longer than 12288 bytes.
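
    The entry format can be checked with a short script before indexing.
    The Python sketch below parses each line into the quoted text and a
    flag for the trailing "+"; it is an illustration of the format as
    described here, and the function name, the returned tuples, and the way
    the size is measured (on the text as passed in, in code-page 437) are
    assumptions.

```python
import re

MAX_INCLUSION_BYTES = 12288
# A quoted entry starting in column one, optionally followed by "+".
ENTRY = re.compile(r'^"([^"]+)"(\+?)\s*$')


def parse_inclusion_list(list_text, codepage="cp437"):
    """Return (entry, extend) pairs; extend is True when "+" is present."""
    if len(list_text.encode(codepage)) > MAX_INCLUSION_BYTES:
        raise ValueError("inclusion list is longer than 12288 bytes")
    entries = []
    for line in list_text.splitlines():
        if not line.strip() or line[0] in ";/":
            continue  # blank line or comment
        match = ENTRY.match(line)
        if match is None:
            raise ValueError("bad inclusion entry: " + repr(line))
        entries.append((match.group(1).lower(), match.group(2) == "+"))
    return entries


sample = '; legal terms\n"art. "+\n"article "+\n"harvey wallbanger"\n'
print(parse_inclusion_list(sample))
```

    A line with leading spaces, or without quotation marks, raises an
    error instead of being silently accepted.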

    The default filename for the inclusion list is INCWORDS.TXT.  If you
    use the /i switch without a filespec, the Indexer looks for INCWORDS.TXT
    in the current directory; if the file is not there, it looks in the
    project directory for the project being indexed; and if not there, it
    looks in the directory containing OHINDEX.EXE (in case that is different).
    If you do give a filespec the Indexer looks for exactly the file you
    specify.

    If you use multiple-word entries sharing the same initial word or words,
    please note the following exception to the rule of alphabetical order.
    The following example shows the standard alphabetical ordering of three
    entries; in this order, all but the first entry will be ignored:

        "notarized"
        "notarized writs"
        "notarized writs and deeds"

    In processing a card with these words and phrases, the Indexer would see
    "notarized" first in the inclusion list, and would not bother checking
    for a longer inclusion.  To ensure that all entries are noticed, you must
    reverse their order in the list:

        "notarized writs and deeds"
        "notarized writs"
        "notarized"

    The Indexer will not recognize multiple-word entries that, in the text,
    are interrupted by a line-break.
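
    If you maintain a long inclusion list, the "longest first" exception is
    easy to get wrong by hand.  The Python sketch below produces an order
    that is alphabetical except that an entry always comes before any
    shorter entry it begins with.  The function names are invented, and
    first_match is only a simplified model of the behavior described
    above, not the Indexer's actual code.

```python
def inclusion_order(entries):
    """Sort alphabetically, but put longer entries before their prefixes."""
    # A very high sentinel character makes "notarized" sort after
    # "notarized writs", while leaving all other comparisons unchanged.
    return sorted(entries, key=lambda entry: entry.lower() + "\uffff")


def first_match(text, entries):
    """Return the first entry that the text begins with, if any."""
    text = text.lower()
    for entry in entries:
        if text.startswith(entry):
            return entry
    return None


entries = ["notarized writs", "notarized", "notarized writs and deeds"]
text = "Notarized writs and deeds are kept on file"
print(first_match(text, sorted(entries)))           # plain alphabetical
print(first_match(text, inclusion_order(entries)))  # longest first
```

    With plain alphabetical order only "notarized" is ever found; with the
    adjusted order the full phrase is matched.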


OUTPUT FILE 2 - projname.IDX:
=============================

    On conclusion of a normal run (without the /q switch), the Indexer
    generates the final index to your project.  This file has the same
    name as your project but with an extension of ".IDX".  The IDX file
    can then be used by the Reader, as discussed below.  (Please note
    that you may only distribute the IDX with the finished version of
    your work if you are registered at the level of Orpheus Professional.
    See "Registration Requirements" above.)


READER INTERFACE - FULL-TEXT SEARCH:
====================================

    The Reader interface is illustrated by the Search dialog in online
    Help.


FOREIGN LANGUAGE SUPPORT:
=========================

    The Indexer works by default in terms of the English language, which does
    not use accented (high-ascii) characters.  You can modify this behavior
    by using the /a or /s switch (or both) on the command line, as shown
    under "Command-line switches" in the "Usage" section at the top of this
    document.

    The /a switch turns on inclusion of high-ascii characters within words,
    on the assumption that they may be accented letters.  (Only characters
    that MIGHT be letters are so included; those that are linedraw characters
    or other symbols on one or more of the code-pages that I have examined
    are left out.)  Note that words containing high-ascii characters will not
    be correctly sorted if only the /a switch is used.  Specifically, all
    words beginning with a high-ascii character are grouped after the last
    regular letter of the alphabet, "z".

    The /s switch corrects this problem by stripping the accents prior to
    sorting.  (The /s switch automatically turns on the /a switch.)  Note
    that to the computer there is no connection whatever between a given
    low-ascii letter and its high-ascii accented version; with a different
    code-page enabled, the high-ascii character may not even be a letter.
    Therefore, conversion is performed according to a "strip list" consisting
    of pairs of related characters:  a high-ascii accented character followed
    by the regular letter of the alphabet to which it should be converted.
    Here is the default striplist:

        ÇcüuéeâaäaàaåaçcêeëeèeïiîiìiÄaÅaÉeæeÆeôoöoòoûuùuÿyÖoÜuáaíióoúuñnÑn

    In other words, "Ç" will convert to "c", "ü" will convert to "u", and
    so on.  You can provide your own striplist if you wish, as outlined
    below.  There are two reasons for doing this:  one would be because
    the default list does not include characters used in your language.
    Another would be because your language uses only a few such characters,
    and you will get much better performance with a shorter striplist.

    To make your own striplist, follow the example above, placing the entire
    list on the first line of a text file.  The Indexer considers the list
    to end at the first space or line-break if either of these occurs before
    the end of the file.

        Please note that while the first character in each pair can be
        whatever you like, the second must be a LOWERCASE letter of the
        alphabet in order for the conversion to have the desired effect.  Do
        NOT make a list like "ÇCüUéEâAäAàA", because it simply won't work.

    To tell the Indexer to use your striplist instead of the default, add
    the name of the file to the /s switch on the command line, without any
    intervening spaces, e.g.  "ohindex /sstripper.txt test" to index the
    TEST project while using STRIPPER.TXT for character conversion.

    When accent-stripping is enabled, the final IDX file contains a copy of
    the striplist used.  The same list is then applied to any input typed in
    by the user, so that "cañon" will come out as "canon" and be correctly
    located in the index.
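
    The striplist mechanism is simple enough to model in a few lines of
    Python.  The sketch below uses a short, made-up striplist rather than
    the default one, and the function names are invented for the example;
    it shows the pair format, the rule that the list ends at the first
    space or line-break, and the requirement that the second character of
    each pair be a lowercase letter.

```python
def read_striplist(file_text):
    """Return the striplist: everything up to the first space or line-break."""
    for position, char in enumerate(file_text):
        if char in " \r\n":
            return file_text[:position]
    return file_text


def make_strip_table(striplist):
    """Build a translation table from pairs of (accented, plain) characters."""
    if len(striplist) % 2:
        raise ValueError("striplist must consist of character pairs")
    table = {}
    for i in range(0, len(striplist), 2):
        accented, plain = striplist[i], striplist[i + 1]
        if not "a" <= plain <= "z":
            raise ValueError("second character must be a lowercase letter")
        table[ord(accented)] = plain
    return table


def strip_accents(word, table):
    """Replace every accented character that appears in the table."""
    return word.translate(table)


table = make_strip_table(read_striplist("ñnéeèeçc\r\n"))
print(strip_accents("cañon", table))   # prints canon
```

    Applying the same table to the text being indexed and to the words the
    user types is what lets the two meet in the index.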

    NOTE: certain languages may use letters that the Indexer excludes even
    with the /a switch.  For users registered at the Orpheus Professional
    level, I will be happy to extend the Indexer's intelligence; all I need
    is a list of the desired ascii values or a photocopy of the code-page
    listing from your DOS manual.  (This offer applies to languages based
    on the Roman alphabet; I can't promise anything about Russian for
    example.)  Code-pages currently supported include:

                        437    English
                        850    Multilingual (Latin I)
                        852    Slavic (Latin II)
                        860    Portugal
                        863    Canadian-French
                        865    Nordic

