The FullText - a general description
General info
-
The fulltext module was developed as a natural extension to language processing
modules. Its purpose is to:
-
index various documents and similar data (dictionaries... ),
-
serve other linguistic modules for indexing of text files and producing
word lists.
-
The module is written in plain 'C'; the technology is proprietary and was
developed specifically for this module. It consists of two parts:
-
indexing module - it reads input data and manages index files. It does
not work with language data.
-
search module - it uses language data (morphology, translation and hierarchy
modules). Currently it works with the Czech, English and Slovak languages;
extensions for German, French, Swedish, Russian and others are expected soon.
General features
-
Currently the system works in two modes:
-
Document mode - the input is either files or texts in ASCII, HTML or RTF
format - the options are described in FT_ff.htm
-
Dictionary mode - the input is XML files with specified dictionary structure
- described in FT_Dic.htm
Planned mode:
-
Relational database mode - similar to Document mode, new command for multidocument
indexing
-
Maximal word size is 255 bytes; a file containing a longer word is refused
(otherwise the FullText would index arbitrary, even binary, files).
-
Internal character coding is a non-standard, UNICODE compatible DB code
(it is similar to UTF-8 coding). Its advantages are improved readability
and suitability for natural language data processing.
-
The system can work in safe mode, which means that each request is performed
as a transaction; if the transaction fails, it is rolled back and attempted again.
-
During search the system uses morphology and also searches for translations
to other languages - these options are selectable in the command; the range
of available translations is given by the Database content.
-
Planned option: enhanced search
-
The system is not able to search for exact expressions containing separators
(i.e. containing more than one word). This type of search can be approximated
with the proximity operator.
-
Maximal file size is limited by available memory.
-
Maximal number of collections: unlimited.
-
Maximal number of files per collection: 100000.
-
Maximal file size: 2 GB.
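The 255-byte word limit listed above can be pictured as a check made before a file is accepted for indexing. The following is only a minimal sketch, not the module's actual code: the names ft_has_overlong_word and FT_MAX_WORD_BYTES are hypothetical, and the separator set (ASCII whitespace and punctuation) is an assumption, since the real separator rules are not given here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define FT_MAX_WORD_BYTES 255  /* limit stated in the text; the name is hypothetical */

/* Assumption: separators are ASCII whitespace and punctuation. Bytes of
   multi-byte characters are treated as word bytes. */
static int is_separator(unsigned char c)
{
    return isspace(c) || ispunct(c);
}

/* Returns 1 if the buffer contains a word longer than FT_MAX_WORD_BYTES
   (such a file would be refused), 0 otherwise. */
int ft_has_overlong_word(const unsigned char *buf, size_t len)
{
    size_t run = 0;  /* length in bytes of the word being scanned */
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (is_separator(buf[i])) {
            run = 0;                       /* a separator ends the word */
        } else if (++run > FT_MAX_WORD_BYTES) {
            return 1;                      /* word too long: refuse */
        }
    }
    return 0;
}
```

A check of this kind is what keeps binary files out of the index: long runs of non-separator bytes are typical for binary data and rare in text.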
Query format
Generally the indexing is word oriented, i.e.
the index does not store expressions containing separators. Such an expression,
like "British Commonwealth", is indexed as two words and can be found as
"British" && "Commonwealth", optionally with the proximity option set.
There are two formats available:
-
"Simple" - it is older and may be used by older applications:
-
supports four sets of operators simultaneously:
-
~, |, & - simplified operators
-
~, ||, && - 'C' like operators
-
NOT, OR, AND - English language operators
-
NE, NEBO, A - Czech language operators
-
double quotes (") can be used around words in order to distinguish words
from operators
-
all operators (except for the leading "~ ") must be surrounded by spaces
-
parentheses are not accepted (in this version)
-
evaluation is sequential, in the order written (without operator priorities)
-
examples:
-
"word_1 & word_2 & word_3" means:
word_1 AND word_2 AND word_3
-
"~ word_1 & word_2"
means: NOT word_1 AND word_2
-
"word_1 & ~ word_2"
means: word_1 AND NOT word_2
-
"word_1 & word_2 | word_3" means:
word_1 AND word_2 OR word_3
-
"~ word_1 | word_2"
invalid: NOT must be combined with AND
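The sequential evaluation of the "Simple" format can be sketched as follows. This is an illustration under stated assumptions, not the module's parser: the function name simple_query_eval is hypothetical, the "document" is just an array of words compared exactly (no morphology), operator keywords are matched in upper case only, and the rule "NOT must be combined with AND" is interpreted as "a negated word may not stand on either side of OR".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum tok { T_WORD, T_NOT, T_AND, T_OR };

/* Classify a space-delimited token against the four operator sets.
   Assumption: operator keywords are matched exactly (upper case). */
static enum tok classify(const char *t)
{
    static const char *const nots[] = { "~", "NOT", "NE" };
    static const char *const ands[] = { "&", "&&", "AND", "A" };
    static const char *const ors[]  = { "|", "||", "OR", "NEBO" };
    size_t i;

    for (i = 0; i < 3; i++) if (strcmp(t, nots[i]) == 0) return T_NOT;
    for (i = 0; i < 4; i++) if (strcmp(t, ands[i]) == 0) return T_AND;
    for (i = 0; i < 4; i++) if (strcmp(t, ors[i]) == 0) return T_OR;
    return T_WORD;   /* includes quoted tokens such as "AND" */
}

/* Strip surrounding double quotes in place, if present. */
static const char *unquote(char *t)
{
    size_t n = strlen(t);
    if (n >= 2 && t[0] == '"' && t[n - 1] == '"') {
        t[n - 1] = '\0';
        return t + 1;
    }
    return t;
}

/* Exact-match lookup; the real search also applies morphology. */
static int doc_has(const char *const *words, size_t n, const char *w)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (strcmp(words[i], w) == 0) return 1;
    return 0;
}

/* Returns 1 = match, 0 = no match, -1 = invalid query. */
int simple_query_eval(const char *query, const char *const *words, size_t nwords)
{
    char buf[256];
    char *t;
    int result = 0, have_result = 0, have_op = 0;
    int negate = 0, prev_negated = 0;
    enum tok op = T_AND;

    assert(query != NULL);
    if (strlen(query) >= sizeof buf) return -1;
    strcpy(buf, query);

    for (t = strtok(buf, " "); t != NULL; t = strtok(NULL, " ")) {
        enum tok k = classify(t);
        if (k == T_NOT) {
            if (negate || (have_result && !have_op)) return -1;
            if (have_op && op == T_OR) return -1;     /* NOT only with AND */
            negate = 1;
        } else if (k == T_WORD) {
            int v;
            if (have_result && !have_op) return -1;   /* operator missing */
            v = doc_has(words, nwords, unquote(t));
            if (negate) v = !v;
            if (!have_result) result = v;
            else result = (op == T_AND) ? (result && v) : (result || v);
            have_result = 1; have_op = 0;
            prev_negated = negate; negate = 0;
        } else {                                      /* T_AND or T_OR */
            if (!have_result || have_op || negate) return -1;
            if (k == T_OR && prev_negated) return -1; /* NOT only with AND */
            op = k; have_op = 1;
        }
    }
    if (!have_result || have_op || negate) return -1;
    return result;
}
```

Because no priorities are applied, a query such as "word_3 | word_2 & word_2" is evaluated as (word_3 OR word_2) AND word_2, which differs from the result that 'C'-style priorities would give.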
-
"Full" - it is newer and recommended for use:
-
Basic operators:
-
! - negation (~ is not used because it is more likely than ! to appear as a word character)
-
&& - and
-
|| - or
-
Except for the ! operator, all operators must be separated from other text by a space
-
Operator priorities are conventional: highest for !, then &&, and lowest for ||
-
Currently the @@ operator is available in dictionary mode only.
Its meaning is the same as that of && with the additional requirement that
the two words be near each other. This operator connects two words - it cannot
connect a word to an expression, or two expressions. The maximal distance for
@@ is either fixed at 64 bytes or determined by query parameters - in the
latter case the distance value can be specified by the user and can also be
interpreted as a number of words - see dictionary mode.
-
The query can use ( and ) parentheses
-
Wildcard characters ? and * are allowed with the following limitations:
-
they must be allowed explicitly by a query parameter
-
the query must be a single word
-
* is allowed as the last character only
-
All words should be surrounded by plain double quotes,
even if the query is just one word. On the command line the query should be
passed with the quotes doubled, i.e. as
...,T="( ""kobyla"" || ""seno"" ) && ""kolna""",...
It is up to the application whether the quotes are required or the
missing ones are added.
-
The words in a query can carry a weight parameter. The parameter is appended
to the word and enclosed between two ^ separators. The default value is 100.
An example is "hora"^60^.
-
The words in a query can carry a case sensitivity parameter
(there is also a setting valid for the whole expression).
The parameter is appended after the weight and closed with another ^
separator. An example is "kopec" && "Pardubice"^66^C^
- this encodes "kopec" with weight 100, not case sensitive, and "Pardubice"
with weight 66, case sensitive.
-
Examples (the quotes are optional):
-
"diagram"
-
"podzemními" && "chodbami"
-
("hora" || "kopec") && ("žhavá" || "doutnající")
-
"hora" @@ "žhavá" || "hora" @@ "doutnající" || "kopec" @@ "žhavý" || "kopec" @@ "doutnající"
-
("lichokopytník" || "kopytník") && !"kůň"
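The priorities of the "Full" format (highest for !, then &&, lowest for ||, with parentheses) can be sketched as a small recursive-descent evaluator. This is a hedged illustration only: the name full_query_eval and the parser structure are hypothetical, the "document" is a plain array of words compared exactly, and the sketch ignores the @@ operator, wildcards and the ^weight^ and case parameters. It is also more lenient about spaces around operators than the rules above require.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct parser {
    const char *p;              /* current position in the query */
    const char *const *words;   /* words present in the "document" */
    size_t nwords;
    int error;                  /* set to 1 on a syntax error */
};

static int parse_or(struct parser *ps);

static void skip_ws(struct parser *ps) { while (*ps->p == ' ') ps->p++; }

/* Exact-match lookup of a word given by pointer and length. */
static int lookup(const struct parser *ps, const char *w, size_t len)
{
    size_t i;
    for (i = 0; i < ps->nwords; i++)
        if (strlen(ps->words[i]) == len && memcmp(ps->words[i], w, len) == 0)
            return 1;
    return 0;
}

/* factor := '!' factor | '(' expr ')' | word */
static int parse_factor(struct parser *ps)
{
    const char *start;
    int v;

    skip_ws(ps);
    if (*ps->p == '!') { ps->p++; return !parse_factor(ps); }
    if (*ps->p == '(') {
        ps->p++;
        v = parse_or(ps);
        skip_ws(ps);
        if (*ps->p == ')') ps->p++; else ps->error = 1;
        return v;
    }
    if (*ps->p == '"') {                    /* quoted word */
        start = ++ps->p;
        while (*ps->p && *ps->p != '"') ps->p++;
        if (*ps->p != '"') { ps->error = 1; return 0; }
        v = lookup(ps, start, (size_t)(ps->p - start));
        ps->p++;
        return v;
    }
    start = ps->p;                          /* bare word, quotes omitted */
    while (*ps->p && !strchr(" ()!&|\"", *ps->p)) ps->p++;
    if (ps->p == start) { ps->error = 1; return 0; }
    return lookup(ps, start, (size_t)(ps->p - start));
}

/* term := factor ('&&' factor)* */
static int parse_and(struct parser *ps)
{
    int v = parse_factor(ps);
    for (;;) {
        int r;
        skip_ws(ps);
        if (strncmp(ps->p, "&&", 2) != 0) return v;
        ps->p += 2;
        r = parse_factor(ps);   /* always consume the right operand */
        v = v && r;
    }
}

/* expr := term ('||' term)* */
static int parse_or(struct parser *ps)
{
    int v = parse_and(ps);
    for (;;) {
        int r;
        skip_ws(ps);
        if (strncmp(ps->p, "||", 2) != 0) return v;
        ps->p += 2;
        r = parse_and(ps);      /* always consume the right operand */
        v = v || r;
    }
}

/* Returns 1 = match, 0 = no match, -1 = syntax error. */
int full_query_eval(const char *query, const char *const *words, size_t nwords)
{
    struct parser ps;
    int v;

    assert(query != NULL);
    ps.p = query; ps.words = words; ps.nwords = nwords; ps.error = 0;
    v = parse_or(&ps);
    skip_ws(&ps);
    if (ps.error || *ps.p != '\0') return -1;
    return v;
}
```

One grammar level per priority is the design choice here: parse_or calls parse_and, which calls parse_factor, so && binds tighter than || and ! binds tightest without any explicit priority table.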
Use of Database in FullText
-
The Database is not used during the indexing process
-
It is used during evaluation of a user request - then the morphology and
translation facilities prepare the list of words to search for in the index.
Assumed modes of FullText use
-
Network version: The FullText Server will
run as a standalone application on a server (or peer) machine. The application
can be spawned from a shell able to restart the FullText Server. The shell
could be another program or an NT service. Other applications will communicate
with the server via files in shared directories. This solution is independent
of the actual networking system.
-
Standalone version: The FullText Server will
be spawned from an application as a separate application. Other applications
will communicate with the server via files in shared directories. (Not
prepared yet: if the server maintains a counter of client applications,
it will be able to exit safely after the last client application stops using
it.)
-
Program function: The module is used either
as a .DLL or linked to a program. The program can then use not only low level
FullText functions for indexing and search, but also special purpose functions
for:
-
Search of catalogs and dictionaries,
-
Synchronization (matching, alignment) of text files,
-
Matching of text file with a word or terminology list,
-
Linguistic analysis of text.
Registry keys
The description is stored in
E_FileCm.htm.
File naming rules in the shared directories
The description is stored in
E_FileCm.htm.
Portability
-
FullText index file formats are independent of the platform
-
Program code is written in ANSI C; currently the
program is developed under:
-
MSVC 6.0 for Windows
-
GNU C for Linux
-
Metrowerks CodeWarrior for Mac OS
Platforms supported
-
Windows - DLL, OBJ modules, LIB, EXE, service (on Windows NT)
-
Linux - OBJ modules, EXE
-
Mac OS - OBJ modules, EXE