WordNet is a lexical database connecting English words/expressions to categories representing their meanings. It can also be seen as an ontology for natural language since the categories are connected by various kinds of semantic links, e.g.
generalization, similar, exclusion, member, part and
To ease and guide knowledge re-use/sharing/retrieval/entering, I initialized the knowledge base (KB) of WebKB-2 with the content of WordNet 1.7 related to nouns: 108,000 nouns and 74,500 categories referred to by nouns (in accordance with my lexical recommendations, I ignored information regarding verbs, adverbs and adjectives).
A first problem was that, although WordNet categories have intuitive names
(English nouns or nominal expressions), they do not have intuitive identifiers
(the WordNet API mainly uses numbers). Intuitive identifiers are mandatory for
permitting people to read, write and update knowledge statements in text files,
i.e. outside the graphical interface of a particular tool. This is a minimal
requirement for knowledge sharing/re-use and also greatly simplifies the
development of knowledge-based tools.
Hence, I designed an algorithm to create intuitive identifiers for WordNet categories based on their names. This algorithm combines various heuristics I learnt from many trials. Although the final version worked quite well, I still had to update a few generated category identifiers manually. This algorithm is detailed below.
A second problem was that WordNet has a poorly structured top-level and does
not always classify categories according to distinctions that are important for
the use of these categories in knowledge representations. To permit this use and
some semantic checking of it, I have inserted WordNet top-level categories and
some medium-level categories into a top-level ontology synthesizing and
complementing various other top-level ontologies. This has led me to break a few
WordNet generalization links. Click
here for rationales behind my top-level ontology. Click here
to explore the first specializations of my top-level concept type
pm#thing, and here
to explore the specializations of my top-level relation type
A third problem was that WordNet conflates
instanceOf links with
generalization links, or in
other words, does not distinguish types from individuals (categories that cannot
have subtypes or instances). This distinction is important for knowledge
checking. However, the instanceOf
link should not be over-used: to avoid forcing arbitrary choices or to
compensate for wrong choices, WebKB-2 permits the use of certain types without
quantifiers (i.e. as if they were individuals) within statements. I have
isolated 6211 true individuals within WordNet 1.7. Click
here for a list.
A fourth problem was that WordNet 1.7 contains inconsistencies and
redundancies. Conversely, some categories for common English words are missing. Click
here for a list of my semantic corrections (more than 300) and additions (more
than 150). It should be noted that most of the inconsistencies I corrected
were automatically detected thanks to the exclusive links in my top-level
ontology (and as mentioned above, the generalization of WordNet categories by
categories in my top-level ontology). Two kinds of links,
location ('l') had to be introduced to correct certain
erroneous uses of the generalization link.
A fifth problem was that some categories did not have explicit enough names, or their ordering was not correct (category names in WordNet are ordered by decreasing frequency of use, but this ordering is generated from a few concordance files and therefore can be misleading). Click here for a list of the lexical modifications that I made to the WordNet ontology.
A sixth problem for knowledge representation is the lack of structure in
WordNet and the fact that many categories have a lexical rather than semantic
nature. Some structure was added via semantic links (the above cited 161
additions). I also added sub-annotations at the beginning of some category
$(value)$ to represent the fact that the category
represents a value, and
$(artificial)$ to represent that it has a
lexical nature and/or should not be used for knowledge representation. Click
here for a list of value/artificial sub-annotations.
Finally, it should be noted that the semantics of the links
member, substance and
object in WordNet is not always clear
or consistent. For instance, does a
part link from the category
airplane to the category
wing mean that "any airplane
has for part at least 1 wing" or "all airplanes have for part the same wing",
"any wing is part of a plane", "a wing is part of any plane", etc.? For graph
matching (and hence inferencing) in WebKB-2, I have assumed the first
interpretation is correct; however, this is just a heuristic.
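To make the difference between these readings explicit, here is a sketch of the first three in first-order logic. The predicate names and the convention that part(a, w) means "a has w as a part" are notational assumptions made for this illustration; they are not WebKB-2 syntax. The fourth reading ("a wing is part of any plane") is itself ambiguous in English and is left out.

```latex
\begin{align*}
% (1) "any airplane has for part at least 1 wing" (the reading assumed for graph matching)
&\forall a\,\bigl(\mathit{airplane}(a) \rightarrow \exists w\,(\mathit{wing}(w) \land \mathit{part}(a,w))\bigr) \\
% (2) "all airplanes have for part the same wing"
&\exists w\,\bigl(\mathit{wing}(w) \land \forall a\,(\mathit{airplane}(a) \rightarrow \mathit{part}(a,w))\bigr) \\
% (3) "any wing is part of a plane"
&\forall w\,\bigl(\mathit{wing}(w) \rightarrow \exists a\,(\mathit{airplane}(a) \land \mathit{part}(a,w))\bigr)
\end{align*}
```

Readings (1) and (2) differ only in the order of the quantifiers, which is exactly the information a bare part link does not carry.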
I integrated WordNet 1.7 in January 2002. When representing knowledge between January and June, I sometimes updated the key names of the WordNet categories and occasionally corrected some links, but less and less often. The WordNet part of the KB (and my top-level ontology) can now be considered quite stable. Hence, the identifiers can be used by people in their own files, and support knowledge sharing.
It is best to explore and filter parts of the ontology of WebKB-2 via my Category Search tool. However, if you do need the whole ontology, it is downloadable in four formats:
More top-level ontologies, e.g. from the SUO Library and the DAML Library, will be incorporated into the WebKB-2 knowledge base.
This work has now been published.
It has been done to help principled and manual knowledge representation. It is
insufficient for the inter-operation of fully automatic software agents, e.g.
for e-commerce or database integration purposes; this article by R.
Colomb gives some of the reasons why general automatic inter-operation
(not pre-programmed business-to-business inter-operation) is not going to
happen anytime soon.
My work is also far from sufficient to support knowledge-based automatic natural language processing. One step in this direction is provided by the ThoughtTreasure project and its downloadable resources. The Cyc and OpenCyc projects should of course also be cited. See also the pages about the Open Mind projects and the Natural Language Processing group at USC/ISI.
WordNet connects words to categories representing the meanings of these words. Each category has at least one name (word) and each name may be shared by several categories (since a word may have several meanings). Category keys (or "key names") need to be chosen for uniquely representing categories. (I use the expression "key name" instead of "category identifier" because in WebKB an identifier for a category is generally composed of a user identifier, a key name, and optionally other names separated by "__").
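As an illustration of this identifier structure, here is a minimal Python sketch that splits an identifier into its parts. It assumes that "#" separates the user identifier from the names (as in pm#thing) and that the key name is the first of the names separated by "__"; the function and field names are hypothetical and this is not the WebKB-2 parser.

```python
from typing import List, NamedTuple


class CategoryId(NamedTuple):
    user: str               # user identifier, e.g. "pm"; empty in "#Singapore"
    key_name: str           # the key name of the category
    other_names: List[str]  # optional other names that followed, separated by "__"


def parse_category_id(identifier: str) -> CategoryId:
    """Split 'user#key__name2__name3' into its parts (assumed layout)."""
    user, sep, names = identifier.partition("#")
    if not sep or not names:
        raise ValueError(f"not a category identifier: {identifier!r}")
    key_name, *others = names.split("__")
    return CategoryId(user, key_name, others)


print(parse_category_id("pm#thing"))
print(parse_category_id("#Friday__Fri"))
```

The second call uses a made-up identifier built from the names "Friday" and "Fri" to show how optional extra names would be separated.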
In the WordNet API and database files, a category is referred to either by the
offset of its description in one of the database files (e.g. the offset
"12558316" for the category with names "Friday" and "Fri"), or by some sense
indices, which are the names of the category with some suffixes to make
unique key names (e.g. "friday%1:28:00::" and "fri%1:28:00::"; the "1" after the
"%" indicates that the name is a noun; the "28" is a number for the
lexicographer file containing this name; "00" is the order of the category in
the list of the categories sharing this name in this lexicographer file).
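The layout of such a sense index can be made concrete with a small parsing sketch. The field names below follow the WordNet sense index documentation rather than the text above; the two trailing fields are empty in the examples given here and are kept as plain strings. The function name is hypothetical.

```python
from typing import NamedTuple


class SenseIndex(NamedTuple):
    lemma: str        # the name in lower case, e.g. "friday"
    ss_type: int      # 1 indicates that the name is a noun
    lex_filenum: int  # number of the lexicographer file, e.g. 28
    lex_id: int       # order among the categories sharing this name in that file
    head_word: str    # empty in the noun examples above
    head_id: str      # empty in the noun examples above


def parse_sense_index(text: str) -> SenseIndex:
    """Parse a string such as 'friday%1:28:00::' into its fields."""
    lemma, _, rest = text.partition("%")
    fields = rest.split(":")
    if not lemma or len(fields) != 5:
        raise ValueError(f"not a sense index: {text!r}")
    ss_type, lex_filenum, lex_id, head_word, head_id = fields
    return SenseIndex(lemma, int(ss_type), int(lex_filenum), int(lex_id),
                      head_word, head_id)


print(parse_sense_index("friday%1:28:00::"))
```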
Given WebKB only stores categories representing the meaning of nouns (i.e. categories having nouns as names), I could have adapted sense indices to make relatively readable key names, e.g.
#Fri-28. However, I found that knowledge is not easy to read
or write when all the category identifiers have such suffixes.
Ideally, the key name of a category should look like one of the English words or expressions most commonly used for referring to what the category represents, and be unambiguous enough for a human reader to distinguish its meaning from the meanings of other categories. In WordNet, the most common name for a category is the first in the list of its names, but less ambiguous names may appear after. When one of the other names is a compound name beginning or ending with the first name (as "Steve_Martin" begins with "Steve" and ends with "Martin"), it constitutes a better choice for a key name than the first name.
Hence, here are the first rules (ordered by decreasing order of priority)
that I chose to generate key names:
1) when the 1st name of a category begins or ends one of the other names, select this other name as key name (unless it is shared by another category without generated key name yet);
2) select the 1st name of a category as key (unless it is shared by another category without generated key name yet);
3) try the first two rules on the 2nd name instead of the 1st;
4) try the first two rules on the 3rd name instead of the 1st;
To respect the decreasing order of priority of these rules, I have scanned
the KB many times (each time, testing all remaining categories without key
name), allowing the test of a lower priority rule only when the application of
rules of greater priority did not lead to any more change. (The order of the
rules was also respected when testing each category). This may not be an
efficient approach but it was efficient enough given WebKB-2 could scan the
whole KB quite quickly (0.45 seconds on average).
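The rules and the repeated scans described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the WebKB-2 code: categories are given as a dictionary from a hypothetical category id to its ordered list of names, compound names are assumed to join their words with "_", and the "try the first two rules on the n-th name" step is generalised to every name position.

```python
from typing import Dict, Iterator, List


def is_compound_of(first: str, other: str) -> bool:
    """True when `other` is a compound name beginning or ending with `first`."""
    return other.startswith(first + "_") or other.endswith("_" + first)


def assign_key_names(categories: Dict[str, List[str]]) -> Dict[str, str]:
    keys: Dict[str, str] = {}  # category id -> chosen key name

    def is_free(name: str, cat_id: str) -> bool:
        # A name is rejected if it is already a key, or if it is shared by
        # another category that has no key name yet.
        if name in keys.values():
            return False
        return not any(name in names
                       for other, names in categories.items()
                       if other != cat_id and other not in keys)

    def candidates(names: List[str], level: int) -> Iterator[str]:
        # Levels 0 and 1 are rules 1 and 2 on the 1st name, levels 2 and 3
        # are the same rules on the 2nd name, and so on.
        index, rule = divmod(level, 2)
        if index >= len(names):
            return
        base = names[index]
        if rule == 0:
            yield from (n for n in names if is_compound_of(base, n))
        else:
            yield base

    last_level = 2 * max((len(n) for n in categories.values()), default=0) - 1
    max_level = 0
    while max_level <= last_level:
        changed = False
        for cat_id, names in categories.items():  # one scan of the KB
            if cat_id in keys:
                continue
            for level in range(max_level + 1):     # rules in priority order
                choice = next((c for c in candidates(names, level)
                               if is_free(c, cat_id)), None)
                if choice is not None:
                    keys[cat_id] = choice
                    changed = True
                    break
        if not changed:
            max_level += 1  # unlock the next rule only when nothing changes
    return keys


example = {
    "c1": ["Martin", "Steve_Martin"],
    "c2": ["Martin"],
    "c3": ["Friday", "Fri"],
    "c4": ["bank"],
    "c5": ["bank"],
}
print(assign_key_names(example))
# {'c1': 'Steve_Martin', 'c2': 'Martin', 'c3': 'Friday'}
```

In this toy input, "c4" and "c5" share all their names, so neither receives a key name from these rules. The sketch rescans everything on each pass and is quadratic in the number of categories, which is acceptable for an illustration only.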
The application of the first two rules (i.e. trying to use only the 1st name of each category) assigned key names to 75% of the categories (56074 out of 74488). The gradual use of the other category names raised this to 84% (62873 out of 74488). This means that each category in the remaining 16% shared all its names with another category (itself in this 16%).
To go further, I had to generate suffixes. I used numbers when I integrated
WordNet 1.6 but, when using categories in knowledge representations, I realized
that this option was not user-friendly enough and that a much clearer option was
to use the key name of the first supertype. Such suffixes often help people to
guess the meaning of a category without having to access all its supertypes.
However, I did not want to give a key name with a suffix to all remaining
unassigned categories. Hence, I added the following rules (by decreasing order
of priority and with a lower priority than the previous rules) to select the
categories to which key names with a suffix would be assigned:
- select the category with a frequency-of-use number far lower than the other categories sharing all the same names (this number is given by WordNet and represents the frequency of appearance of the category in a few concordance documents; it is an indication but not of paramount importance; "far lower" was first set at 30 and then to decreasing values);
- select the category with a far lower number of subtypes than the other categories sharing all the same names (actually, in these last two rules, I used combinations of gradually decreasing values of frequency-of-use and number of subtypes; I also tried to reduce the assignment of suffixes to subtypes of
#action, as these types are more
frequently used than the others in knowledge representations).
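The selection of the category that receives a suffix can be sketched as follows. This is an illustration under assumptions: "far lower" is read as an absolute difference (it could equally be a ratio), the suffix is attached with "." as in the key names shown in this document, and the class, function names and numbers are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Category:
    names: List[str]
    frequency: int      # WordNet frequency-of-use number
    subtypes: int       # number of subtypes
    supertype_key: str  # key name of the first supertype


def far_lowest(group: List[Category], measure: Callable[[Category], int],
               gap: int) -> Optional[Category]:
    """Return the category whose measure is at least `gap` below all others."""
    if len(group) < 2:
        return None
    ranked = sorted(group, key=measure)
    if measure(ranked[1]) - measure(ranked[0]) >= gap:
        return ranked[0]
    return None


def pick_for_suffix(group: List[Category], freq_gap: int,
                    subtype_gap: int) -> Optional[Category]:
    """Apply the frequency rule first, then the number-of-subtypes rule."""
    chosen = far_lowest(group, lambda c: c.frequency, freq_gap)
    if chosen is None:
        chosen = far_lowest(group, lambda c: c.subtypes, subtype_gap)
    return chosen


def suffixed_key(category: Category) -> str:
    """Key name made of the 1st name and the key name of the first supertype."""
    return f"{category.names[0]}.{category.supertype_key}"


# Two categories sharing all their names; the figures are invented.
a = Category(["bank"], frequency=40, subtypes=12, supertype_key="financial_institution")
b = Category(["bank"], frequency=5, subtypes=10, supertype_key="slope")
chosen = pick_for_suffix([a, b], freq_gap=30, subtype_gap=5)
print(suffixed_key(chosen))  # bank.slope
```

Once the less frequent category has a suffixed key name, the plain name is no longer contested and can go to the other category on a later scan. Calling `pick_for_suffix` in a loop with gradually smaller gaps would mimic the decreasing thresholds described above.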
After several additional scans of the KB with all the rules, there were still
a few dozen categories without a key name. To fix this, I added more
precise names to these categories and/or re-ordered their names (some of my
lexical additions to WordNet come from this phase). I also had to correct
some assignments of suffixes and some choices of key names. For example,
#Republic_of_Singapore had
been selected as key name (in application of the 1st rule) but
#Singapore is a more convenient identifier, while the island of
Singapore and the capital of Singapore are better referred to via
#Singapore. To fix that, before re-running the key name affectation
procedure from scratch, I semi-automatically pre-affected suffixes to many
categories, especially the specializations of
example, I added the suffixes ".capital", ".city", ".island", ".country", and
".colony" to disambiguate many category names. However, instead of using the
generalizing category for creating the suffix, I sometimes followed the
partOf link. For example, in WordNet 1.7,
three instances with unique name "Bangor" but part of different regions. Hence,
I named them
#Bangor.Northern_Ireland, #Bangor.Wales and
#Bangor.Maine. I have not listed these manual and automatic
additions of suffixes in my
lexical additions to WordNet. However, you can click
here for the current list of 5944 WordNet categories that have been assigned a key
name with a suffix.