
Simlis v1.3 -- Similarity sort of input lines.

Usage:
  Simlis [-dpvar] [-sxx] [-cx] [-o...;...] [-lx..] [-wddd[,DDD]]
                                                   < infile [{>|>>} outfile]


Options: 

 -d   :displays similarity score (0..1000) as output line header.

 -sxx :skips a fixed-length header of xx bytes at the start of every line.

 -p   :makes the comparison sensitive to item position:
      Simlis takes into account the relative position of items within the lines.
      Consider these 2 lines:
      "dog bites man"   and
      "man bites dog"
      Only with the -p option are they rated as different (lower similarity).

 -cx  :everything before the first occurrence of character x (if present)
      within a line is ignored.

 -o....;...;...  :the strings given here (separated by ';') are ignored
      during evaluation; max. 10 strings of max. 14 characters each.

 -lx.. :uses the given character(s) x as delimiter(s) to split the input
      into 'phrases'; the physical separators CR and LF are ignored on
      input and on output.

 -v   :operates case-sensitively. Default is case-insensitive.
      For case-insensitive comparison the Windows ANSI character set
      is used, even in the DOS version.

 -wddd[,DDD] :range of byte values to process, given in decimal;
      the widest possible range is -w2,255. Characters outside the range
      are ignored. Default range is -w65,255 (65 is the character 'A').
      Exception: the linefeed control character (byte value 10) is always
      omitted because of its special meaning on input.
      For full binary operation set -v too.
      Hint: setting the lower bound below 33 may reduce selectivity.

 -a   :uses an experimental algorithm based on 9-gram units empirically
      observed in text phrases (to be used together with the -l option).

 -r   :shows the input line/phrase sequence number (position in the input
       stream, numbered 1,2,3...) on every sorted output line, enclosed in
       '#' chars. This makes it easier to locate lines in the original corpus.
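
The input-filtering options above (-s, -c, -o, -w, -v) can be pictured as a
preprocessing step applied to each line before comparison. The Python sketch
below is only an illustration, not Simlis code: the function name preprocess,
its parameters, the order in which the options are applied, the treatment of
-o strings as whole items, and the treatment of out-of-range characters as
item separators are all assumptions.

```python
def preprocess(line, skip=0, cut_char=None, omit=(), low=65, high=255,
               case_sensitive=False):
    """Return the items a line would contribute to the comparison (sketch)."""
    text = line[skip:]                        # like -sxx: drop fixed-length header
    if cut_char is not None and cut_char in text:
        text = text[text.index(cut_char):]    # like -cx: ignore text before first x
    if not case_sensitive:                    # default; -v would skip this step
        text = text.lower()                   # Python's rules, not Windows ANSI
    # like -w: keep only characters inside [low, high];
    # ASSUMPTION: anything outside the range acts as an item separator
    cleaned = "".join(ch if low <= ord(ch) <= high else " " for ch in text)
    # like -o: drop the listed strings (ASSUMPTION: matched as whole items)
    ignored = {s if case_sensitive else s.lower() for s in omit}
    return [item for item in cleaned.split() if item not in ignored]


items = preprocess("HEADERxx junk: Free Windows editor",
                   skip=8, cut_char=":", omit=("free", "windows"))
print(items)  # ['editor']
```

With the default range (65..255) digits are dropped, so "dog 42 Bites"
yields the items dog and bites; with low=48 the item 42 is kept as well.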


Executable:
   simlis.exe:   Windows 9x/NT version
 
 
Notes:
   Simlis v1.3 is a command line filter that sorts input lines according
   to the similarity of the items they contain (words, character/byte
   sequences, etc.). By default the comparison is independent of item
   position, item length and item order. Simlis accepts physical lines
   (CRLF separated) or 'phrases' (e.g. sentences, records) generated
   automatically from the given delimiters. It reads every line from
   standard input, evaluates its similarity to all other lines and inserts
   it at the appropriate position in standard output.  Simlis is well
   suited for sorting address lists and directories (identifying identical
   or similar entries that a "normal" sort would not bring together), and
   even for cluster, literature and linguistic analyses.  Output always
   goes to the stderr device (usually the display) as well, no matter
   where standard output is redirected.
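
   To make the idea of a similarity sort concrete, here is a toy Python
   sketch. It is NOT the algorithm Simlis uses (which is not documented
   here): it assumes whitespace-separated words as items, a Jaccard overlap
   scaled to 0..1000 as the score, and a simple greedy chain as the ordering.
   The names score, positional_score and similarity_order are hypothetical.

```python
def score(a, b):
    """Position-independent score: Jaccard overlap of word sets, 0..1000."""
    items_a, items_b = set(a.lower().split()), set(b.lower().split())
    if not items_a and not items_b:
        return 1000
    return round(1000 * len(items_a & items_b) / len(items_a | items_b))


def positional_score(a, b):
    """Position-sensitive variant (in the spirit of -p): share of positions
    holding the same item, 0..1000."""
    items_a, items_b = a.lower().split(), b.lower().split()
    longest = max(len(items_a), len(items_b))
    if longest == 0:
        return 1000
    matches = sum(x == y for x, y in zip(items_a, items_b))
    return round(1000 * matches / longest)


def similarity_order(lines, scorer=score):
    """Greedy chain: start with the first line, then repeatedly append the
    remaining line that is most similar to the line placed last."""
    remaining = list(lines)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        best = max(remaining, key=lambda cand: scorer(ordered[-1], cand))
        remaining.remove(best)
        ordered.append(best)
    return ordered


print(score("dog bites man", "man bites dog"))             # 1000
print(positional_score("dog bites man", "man bites dog"))  # 333
print(similarity_order(["red apple pie", "blue car",
                        "green apple pie", "blue fast car"]))
# ['red apple pie', 'green apple pie', 'blue car', 'blue fast car']
```

   The two score functions show why "dog bites man" and "man bites dog"
   look identical without position sensitivity and different with it.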

   Simlis sacrifices speed for precision and uses no temporary work file:
   don't expect the fastest execution imaginable.


Restrictions: 
   Note the maximum number of input lines and the maximum line length
   displayed on the startup screen.
   
Examples:

 Simlis -d <readme.txt
     readme.txt is sorted, and each output line is prefixed with the
     internally calculated 'similarity score' (range 0...1000).
     The score expresses the degree of similarity to the respective
     previous and following line.

 Simlis  -pr -w48 <readme.txt >readme.srt
     readme.txt is sorted, including the numerics it contains (the range
     starts at 48 = digit '0'); output goes to the file readme.srt.
     -p produces a more logical result (if the input is structured accordingly).
     -r outputs the original line position (numbered 1,2,3..) from the
        input stream before every sorted line, enclosed in # characters.
       
 Simlis -s33 -c: -ofree;freeware;win;windows <00_index.txt       
     If 00_index.txt is a SIMTEL formatted file list, the lines will be
     sorted more or less by keyword, ignoring the (here irrelevant) leading
     33-byte headers. Evaluation starts at the first colon ':' (if present)
     within each line; everything before it is ignored.
     The words "free", "freeware", "win", "windows" are ignored too,
     assuming these words should have no special relevance for distinction.
     
 Simlis -l.;:!? <ebook.txt
     The physical lines of ebook.txt are converted to phrases according to
     the 5 stop characters given with the -l option.
     All subsequent operations, including output, work on these phrases.
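
 The phrase generation of the -l example and the numbering of the -r option
 can be sketched in Python as follows. This is an illustration under
 assumptions, not Simlis code: the names split_phrases and number_phrases are
 hypothetical, the delimiter is assumed to be dropped from the phrase, CR/LF
 are assumed to behave like a blank, and the exact '#n#' layout is a guess
 based on the description "enclosed in # characters".

```python
import re


def split_phrases(text, delimiters=".;:!?"):
    """Split text into phrases at the delimiter characters, ignoring CR/LF."""
    flat = re.sub(r"[\r\n]+", " ", text)       # physical line breaks become blanks
    parts = re.split("[" + re.escape(delimiters) + "]", flat)
    # normalise inner whitespace and drop empty phrases
    return [" ".join(p.split()) for p in parts if p.strip()]


def number_phrases(phrases):
    """Prefix each phrase with its 1-based input position, as in '#3# ...'."""
    return [f"#{n}# {p}" for n, p in enumerate(phrases, start=1)]


text = "Call me\r\nIshmael. Some years ago: never mind!\nHow long?"
for line in number_phrases(split_phrases(text)):
    print(line)
# #1# Call me Ishmael
# #2# Some years ago
# #3# never mind
# #4# How long
```

 Numbering the phrases before sorting is what lets a sorted phrase be traced
 back to its place in the original text.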


For a demonstration of the special language analysis capabilities:
     see the example simex.txt, included here.

     
Status of the program and distribution:
   Simlis v1.3 is Freeware.
   It can be freely distributed in its unmodified form and
   be included in any software collection such as CD-ROMs,
   but may NOT be sold.


Installation:
   If you can read this, you are ready to run it right here;
   no further installation procedure is needed.


History:
   vers. 1.3:
   New option -r
   
   vers. 1.2: 
   New options -l, -w, -a, -v
   16-bit DOS, 32-bit Windows version.
   
   vers. 1.1: for MS-DOS
   New options -c and -o,
   faster operation and other minor improvements.
   
   vers. 1.0: for MS-DOS
   First published version.


Compatibility:
   The same Simlis version should be available for SCO-UNIX, LINUX, HP-UX.

Comments, suggestions, requests for information to:
   joda@sdf.lonestar.org
File access:
   ftp://sdf.lonestar.org/pub/users/joda/simlis

Legal Stuff:
   Copyright 2000, Joachim Dathe ("The author")

THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EXPRESSED,
IMPLIED OR OTHERWISE, INCLUDING AND WITHOUT LIMITATION, ANY WARRANTY OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.  IN NO EVENT SHALL THE
AUTHOR BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES
WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGE FOR LOSS OF PROFITS,
BUSINESS INTERRUPTION, LOSS OF INFORMATION, OR ANY OTHER LOSS) , WHETHER OR NOT
ADVISED OF THE POSSIBILITY OF DAMAGES, AND ON ANY THEORY OF LIABILITY, ARISING
OUT OF OR IN CONNECTION WITH THE USE OR INABILITY TO USE THIS SOFTWARE.



