FindinSite-MS: Search engine for an ASP.NET website

 

findinsite-ms indexing advanced options


findinsite-ms has an indexing facility, controlled in the Control Panel.

The Create new indexing run wizard makes it easy to index a web site to build a search database. However, you can also specify many advanced options to control the indexing process.

Enter advanced options at stage 4 of the indexing wizard, when prompted to "Enter any advanced options". In the box provided, type one setting per line, each in the form name=value. For example, to enable indexing of text files with file extensions .txt and .bat, enter this:

ParseTXT=true
TXT_Files=*.txt,*.bat
If you want to remove an option, then simply delete the relevant line from the Advanced options box.
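
As a rough illustration of the name=value format, the following Python sketch parses an Advanced options box into a dictionary. It is not findinsite-ms code: the function name `parse_advanced_options` is hypothetical, and trimming whitespace and skipping blank lines are assumptions about how the box is read.

```python
def parse_advanced_options(text):
    """Parse 'name=value' lines into a dict, skipping blank lines.

    Assumption: whitespace around names and values is not significant.
    """
    options = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue
        # Split on the first '=' only, so values such as
        # 'C:type==PDF;A:url=referer;' keep their own '=' characters.
        name, separator, value = line.partition("=")
        if not separator or not name.strip():
            raise ValueError(
                f"line {line_number}: expected name=value, got {raw_line!r}"
            )
        options[name.strip()] = value.strip()
    return options


example = """ParseTXT=true
TXT_Files=*.txt,*.bat
"""
print(parse_advanced_options(example))
```

Removing an option corresponds to deleting its line, so the key is simply absent from the resulting dictionary and the default applies.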


Advanced option list

| Name | Description | Default |
|---|---|---|
| Description | The search database description | Taken from the first page title found |
| ScanType | Indicates how findinsite-ms finds files to index: dir = scan all files in ScanDirectory to a depth of ScanDirLevels; file = scan by following links from ScanPathname; url = scan by following links from ScanURL | url |
| ScanDirectory | The directory used to find files if ScanType is dir | |
| ScanDirLevels | The number of directory levels to scan if ScanType is dir. Use a number in the range 0 to 255, or all. | all |
| ScanPathname | The initial file scanned if ScanType is file | |
| ScanURL | The initial URL scanned if ScanType is url | Set in wizard |
| ParseHTML | Specify true if you want to scan HTML web pages, or false if not. | true |
| HTML_Files | The file specification for HTML files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.htm,*.html,*.asp,*.aspx |
| ParseTXT | Specify true if you want to scan TXT text files, or false if not. | false |
| TXT_Files | The file specification for TXT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.txt |
| ParsePDF | Specify true if you want to scan PDF files, or false if not. | false |
| PDF_Files | The file specification for PDF files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.pdf |
| PDF_Passwords | Specify a comma-separated list of passwords to open PDF files. | |
| PDF_ReportCharacterDecodeProblems | Specify true if you want to have any PDF character decode problems listed, or false if not. | false |
| ParseDOC | Specify true if you want to scan DOC Word document files, or false if not. | false |
| DOC_Files | The file specification for DOC files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.doc, *.docx, *.docm |
| ParseXLS | Specify true if you want to scan XLS Excel spreadsheet files, or false if not. | false |
| XLS_Files | The file specification for XLS files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.xls, *.xlsx, *.xlsm |
| ParsePPT | Specify true if you want to scan PPT PowerPoint presentation files, or false if not. | false |
| PPT_Files | The file specification for PPT files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.ppt, *.pptx, *.pptm |
| ParsePUB | Specify true if you want to scan PUB Publisher files, or false if not. | false |
| PUB_Files | The file specification for PUB files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.pub |
| ParseImage | Specify true if you want to scan JPEG images for meta-data, or false if not. | false |
| Image_Files | The file specification for JPEG files, using * and ? wildcards as needed. Separate individual specifiers with a comma. | *.jpg,*.jpeg,*.tif,*.tiff |
| CaseSignificant | If finding files by following links, then the case of filenames is ignored if false. If true then findinsite-ms views test.htm and Test.htm as separate files. Windows generally ignores filename case; on Unix, filename case must be correct. | Windows: false |
| StoreStopWords | If false, findinsite-ms does not include words specified in StopWordFile. | true |
| StopWordFile | The pathname of the file containing stop words, with one word per line in UTF-8 format. | |
| NoTitleIgnorePageLinks | If finding files by following links and this property is set to true, then links are not followed if a page has no title. | false |
| ParseUpHierarchy | If finding files by following links and this property is set to true, then links are followed to directories above the initial file. | false |
| StorePositions | If true then findinsite-ms stores word positions so that "adjacent word" searches will work. | true |
| StoreLoneWords | If true then findinsite-ms stores a word's position even if the two surrounding words are stop words. | true |
| UseNoBaseURLs | Determines whether to include a Base URL prefix for each page in the search database | false |
| UseMetaDescriptionAsAbstract | If true then the page abstract will be taken from the page META description tag. | true |
| UseMetaAbstractAsAbstract | If true then the page abstract will be taken from the (new) page META abstract tag. | true |
| AbstractWords | If building the abstract from the words in a file, this property indicates the number of words to use. | 0 |
| Include | A list of file specifications to read and include in the search database. See below | All files will be included |
| Exclude | A list of file specifications to read but exclude from the search database. See below | No files will be excluded |
| HardExclude | A list of file specifications to exclude from the search database. See below | No files will be hard excluded |
| UserAgent | The name of the UserAgent to use when indexing | FindInSiteBot/version |
| ObeyRobots | Whether to read indexing instructions from ROBOTS.TXT | true |
| Credentials | A list of username/password credentials. See below | No usernames/passwords |
| MaxURLLength | The maximum URL length. Set to 0 for no limit. | 1024 |
| FieldsToExclude | A comma-separated list of fields to ignore (case-insensitive) | No fields ignored |
| rule1, rule2, etc | Optional indexing rules, see below | No indexing rules |

Include, Exclude and HardExclude files

The Include, Exclude and HardExclude properties each take an optional list of file-specs that determine which files are included in or excluded from the search database. Files that match HardExclude are not read at all; this is equivalent to listing them in the site ROBOTS.TXT file. All other files are still read, but only those that match Include and do not match Exclude are included in the search database.

The initial list of acceptable files is determined by the HTML_Files, TXT_Files, etc. properties. Then:

  • If a HardExclude file-spec set is given, then any files meeting one of the given file-specs will not be read or indexed.
  • If an Include file-spec set is given, then only files meeting one of the given file-specs will be indexed.
  • If an Exclude file-spec set is given, then any files meeting one of the given file-specs will not be indexed.
  • Note that the Includes are processed first and the Excludes afterwards, so an Exclude file-spec takes precedence.

    An individual file-spec can include zero or more * or ? wildcard characters, where ? matches exactly one character, and * matches zero or more characters. For example file???.ht* would match:
        file001.htm, file101.html and file111.ht
    but not
        file1001.htm
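
The wildcard behaviour described above can be reproduced with a short Python sketch that translates a file-spec into a regular expression. The helper names are hypothetical, and two points are assumptions rather than documented findinsite-ms behaviour: that matching is case-sensitive, and that the whole name must match the file-spec.

```python
import re


def filespec_to_regex(filespec):
    """Translate a file-spec with * and ? wildcards into a compiled regex."""
    parts = []
    for character in filespec:
        if character == "*":
            parts.append(".*")   # * matches zero or more characters
        elif character == "?":
            parts.append(".")    # ? matches exactly one character
        else:
            parts.append(re.escape(character))  # everything else is literal
    return re.compile("".join(parts), re.DOTALL)


def matches_filespec(filespec, name):
    """Return True if the whole of name matches filespec (case-sensitive)."""
    return filespec_to_regex(filespec).fullmatch(name) is not None


for name in ["file001.htm", "file101.html", "file111.ht", "file1001.htm"]:
    print(name, matches_filespec("file???.ht*", name))
# The first three names print True; file1001.htm prints False,
# because ??? can only consume three characters before the ".ht".
```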

    A list of file-specs can be given directly in the property, or indirectly in a file.

    Direct file-specs

    Direct file-specs are semi-colon separated, eg:
    Include=iso*;*12*
    Exclude=file???.ht*

    This specifies two Include file-specs and one Exclude file-spec.

    Indirect file-specs in a file

    An indirect value consists of @ followed by a file name, where file-specs are specified one per line in plain text. The above direct example may be expressed indirectly as follows:
    Include=@includes.txt
    Exclude=@excludes.txt

    where includes.txt contains:
    iso*
    *12*

    and excludes.txt contains:
    file???.ht*

    If an indirect file cannot be opened, an error message is reported.
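
To make the direct and indirect forms and the Include/Exclude/HardExclude precedence concrete, here is a minimal Python sketch. It is an illustration only: `load_filespecs`, `classify` and the `base_dir` parameter are hypothetical names, the location that findinsite-ms resolves an indirect file against is not stated here, and Python's `fnmatchcase` is used as a stand-in matcher (it also gives `[` and `]` a special meaning, which the file-spec description above does not mention).

```python
from fnmatch import fnmatchcase
from pathlib import Path


def load_filespecs(value, base_dir="."):
    """Return the list of file-specs for an Include/Exclude/HardExclude value.

    '@name' reads one file-spec per line from a plain text file;
    anything else is treated as a semi-colon separated list.
    """
    value = value.strip()
    if not value:
        return []
    if value.startswith("@"):
        # Raises OSError if the file cannot be opened, mirroring the
        # error message reported for an unreadable indirect file.
        text = (Path(base_dir) / value[1:]).read_text(encoding="utf-8")
        return [line.strip() for line in text.splitlines() if line.strip()]
    return [spec.strip() for spec in value.split(";") if spec.strip()]


def classify(name, include, exclude, hard_exclude):
    """Decide what happens to a file that already passed HTML_Files, etc."""
    if any(fnmatchcase(name, spec) for spec in hard_exclude):
        return "not read"
    # An empty Include list means all files are included.
    if include and not any(fnmatchcase(name, spec) for spec in include):
        return "read, not indexed"
    # Excludes are checked after Includes, so an Exclude takes precedence.
    if any(fnmatchcase(name, spec) for spec in exclude):
        return "read, not indexed"
    return "indexed"


include = load_filespecs("iso*;*12*")
exclude = load_filespecs("file???.ht*")
for name in ["iso8859.htm", "file123.htm", "other.htm"]:
    print(name, "->", classify(name, include, exclude, []))
```

Here file123.htm matches the Include file-spec *12* but is still left out of the search database, because it also matches the Exclude file-spec.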

    Username/password credentials

    If the web site being indexed requires one or more usernames/passwords, then pass this information in the Credentials property. findinsite-ms indexing supports "basic", "digest" and "NTLM" (Integrated Windows Authentication) authentication.

    The Credentials property must consist of a semi-colon separated list of credentials. Each credential contains comma-separated fields: a username, a password and an optional path. Spaces are trimmed at the ends of all fields. To use a blank password, specify a period (.) in that field.

    For example, for a single username (uname) and password (pwd), use this:

    Credentials=uname,pwd

    Only one credential can be supplied for each path on the web site. Therefore, if you are using more than one credential, then the paths must be different. Suppose you are indexing www.example.org. If username/password uname1/pwd1 is required for directory www.example.org/manager/ and uname2/pwd2 is required for all other directories, then use this:

    Credentials=uname1,pwd1,manager/ ; uname2,pwd2
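
The Credentials format can be illustrated with a small Python sketch that splits a value into its fields. The `Credential` type and `parse_credentials` function are hypothetical names used for illustration, not part of findinsite-ms. How a password containing a comma or semi-colon would be written is not covered above, so the sketch does not attempt to handle that case.

```python
from typing import NamedTuple, Optional


class Credential(NamedTuple):
    username: str
    password: str
    path: Optional[str]  # None means the credential has no path restriction


def parse_credentials(value):
    """Split a Credentials value into (username, password, path) entries."""
    credentials = []
    for item in value.split(";"):          # credentials are semi-colon separated
        if not item.strip():
            continue
        # Fields are comma-separated; spaces are trimmed at both ends.
        fields = [field.strip() for field in item.split(",")]
        if len(fields) not in (2, 3):
            raise ValueError(f"expected username,password[,path], got {item!r}")
        username, password = fields[0], fields[1]
        if password == ".":
            password = ""                  # a period stands for a blank password
        path = fields[2] if len(fields) == 3 and fields[2] else None
        credentials.append(Credential(username, password, path))
    return credentials


for credential in parse_credentials("uname1,pwd1,manager/ ; uname2,pwd2"):
    print(credential)
```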

    Indexing rules

    Indexing rules provide a limited means of altering aspects of the indexing process. You can specify several rules, each named rule1, rule2, etc.

    Each rule must have one or more conditions, and must have one action. For example, this rule has two conditions (that the file being indexed is a PDF, and its URL starts with Default) and one action (store its referer as the file URL):

    rule1=C:type==PDF;C:url==Default;A:url=referer;
    • Each rule consists of several elements, each separated by a semi-colon (;)
    • Condition elements start with C:
    • The Action element starts with A:
    • If all the conditions are true, then the action is performed

    Conditions that start with C:type== check that the file is a certain type, from this list: html pdf txt doc xls ppt image pub.

    Conditions that start with C:url== check that the file's URL starts with the subsequent characters. Note that the "base URL" should not be included here, i.e. if the indexing run started at http://www.example.com/subdir/ and you want to check for files whose URLs start with http://www.example.com/subdir/another/ then use condition C:url==another/

    The Action A:url=referer sets the URL for this page to its referer page, if it exists. In practice this restricts this rule to standard URL indexing runs. Note that as a consequence, the referer URL will appear twice in the search database.
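
The rule syntax can be summarised with a short Python sketch that splits a rule into its conditions and action and then tests the conditions against a file. This is an illustration, not the findinsite-ms implementation: `parse_rule` and `rule_applies` are hypothetical names, comparing the type case-insensitively is an assumption (the example rule writes PDF while the type list writes pdf), and the URL prefix comparison is assumed to be case-sensitive.

```python
def parse_rule(value):
    """Split a rule into a list of (field, expected) conditions and one action."""
    conditions = []
    action = None
    for element in value.split(";"):       # elements are semi-colon separated
        element = element.strip()
        if not element:
            continue
        if element.startswith("C:"):
            field, separator, expected = element[2:].partition("==")
            if not separator:
                raise ValueError(f"condition needs '==': {element!r}")
            conditions.append((field, expected))
        elif element.startswith("A:"):
            if action is not None:
                raise ValueError("a rule must have exactly one action")
            target, separator, source = element[2:].partition("=")
            if not separator:
                raise ValueError(f"action needs '=': {element!r}")
            action = (target, source)
        else:
            raise ValueError(f"unknown rule element: {element!r}")
    if not conditions or action is None:
        raise ValueError("a rule needs at least one condition and one action")
    return conditions, action


def rule_applies(conditions, file_type, relative_url):
    """Return True only if every condition holds for this file.

    relative_url is the URL without the base URL of the indexing run.
    """
    for field, expected in conditions:
        if field == "type":
            if file_type.lower() != expected.lower():
                return False
        elif field == "url":
            if not relative_url.startswith(expected):
                return False
        else:
            raise ValueError(f"unsupported condition field: {field!r}")
    return True


conditions, action = parse_rule("C:type==PDF;C:url==Default;A:url=referer;")
print(conditions, action)
print(rule_applies(conditions, "pdf", "Default/manual.pdf"))
```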


    Example

    	Description=My web site
    	ScanType=url
    	ScanURL=http://www.mycompany.com/
    	ParseHTML=true
    	HTML_Files=*.htm,*.html,*.asp
    	ParseTXT=false
    	ParsePDF=true
    	PDF_Files=*.pdf
    	CaseSignificant=false
    	StoreStopWords=true
    	StopWordFile=
    	NoTitleIgnorePageLinks=true
    	ParseUpHierarchy=false
    	StorePositions=true
    	StoreLoneWords=true
    	UseMetaDescriptionAsAbstract=true
    	UseMetaAbstractAsAbstract=true
    	AbstractWords=0
    	Include=
    	Exclude=

    All site Copyright © 1996-2011 PHD Computer Consultants Ltd, PHDCC

    Last modified: 27 October 2008.