I am intrigued by the idea of giving computers a modicum of common sense, in other words a practical knowledge of everyday things. This would have huge benefits, for instance much smarter ways of searching for information, and more flexible user interfaces to applications. While it might sound easy, this is in fact very difficult, and has defeated traditional approaches based upon mathematical logic and AI (artificial intelligence). More recently, work on speech recognition and natural language processing using statistical methods has shown great promise. Statistical approaches offer a way out of the combinatorial explosion faced by AI, and I am excited by Dan Sperber's work on relevance theory and the potential for applying statistical learning techniques to semantics. Unfortunately, there is a lot to do before it will be possible to realise this in practice.
My long-term aim is to understand this better and to work with others to put it into practice in the form of a multi-user conversational agent accessible over the Web, so that we can harness the power of the Web to allow volunteers to teach the system common sense knowledge by conversing with it in written English (and eventually other languages). This would be under an open source license, and free for all to share. For some existing work on common sense, see Henry Lieberman's MIT Media Lab course, Common Sense Reasoning for Interactive Applications, with links to the Open Mind Initiative and Doug Lenat's work on Cyc, amongst others. See also the Common Sense Computing Initiative at the MIT Media Lab.
After a gap of several years, I have restarted work on this project, beginning with extensive reading of research papers in statistical natural language processing, machine learning and data mining, and related work in cognitive science. I am particularly interested in the potential for combining natural language, cognitive science and the Semantic Web.
Cognitive science focuses on the study of how information is represented and transformed in the brain with a strong emphasis on experimental results. Cognitive architectures such as ACT-R and CHREST provide valuable insights into human cognition, and point the way to new kinds of information systems. There are indications that these architectures could be strengthened by incorporation of ideas from work on machine learning and data mining.
Following the literature search, I have started coding as a way to test my understanding of the various techniques. The British National Corpus enabled me to explore statistical models for part-of-speech tagging, and I am now working on a bottom-up chart parser for broad coverage of written English. The parser won't attempt to resolve all of the ambiguities, e.g. those caused by prepositional attachment, and instead relies on a semantic processor to find the most natural interpretations. I plan to explore the use of WordNet for determining conceptual matches, as well as models of human cognition from cognitive science.
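To give a feel for what bottom-up chart parsing involves, here is a minimal CYK-style sketch in Python. The toy grammar (in Chomsky normal form), lexicon and sentence are invented for illustration, and are not taken from the actual parser:

```python
# A minimal bottom-up chart parser (CYK) over a toy grammar in
# Chomsky normal form. Grammar, lexicon and sentence are invented
# for illustration only.

grammar = {                      # (child1, child2) -> parent symbols
    ("Det", "N"): {"NP"},
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
}
lexicon = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}

def parse(words):
    n = len(words)
    # chart[i][j] holds the symbols that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):            # build longer spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):       # try every split point
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        chart[i][j] |= grammar.get((b, c), set())
    return "S" in chart[0][n]

print(parse("the dog saw the cat".split()))   # True
```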
The natural language processor will translate English into a semantic representation expressed as labelled arcs, each with a subject, predicate and object. Much of the current work on the Semantic Web is heavily influenced by formal logic. This is generally a poor match for natural language semantics, but I am heartened by the collection of papers in Natural Language Processing and Knowledge Representation, published in 2000 by the AAAI Press together with MIT Press. I plan to explore techniques for inference and learning using triple-based representations, together with models of human cognition as a way to deal with scaling issues.
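As a sketch of how such subject-predicate-object triples might be stored and queried, here is a minimal example in Python; the facts and function names are my own, invented for illustration:

```python
# A minimal sketch of a store for (subject, predicate, object) triples
# with a pattern-matching query. Facts and names are illustrative.

triples = set()

def assert_fact(s, p, o):
    triples.add((s, p, o))

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

assert_fact("John", "owes", "money")
assert_fact("John", "friendOf", "Mary")
print(query(s="John"))          # every fact whose subject is John
print(query(p="friendOf"))      # every friendship arc
```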
The following is outdated, but still serves to give a general feeling for where I am headed.
The system will be designed to support multiple simultaneous conversations, either one on one, or as part of chat rooms where the system is one participant amongst many people. The use of text rather than speech avoids the costs and problems inherent in speech processing, although in principle a speech interface could be added.
For initial work on proving the ideas, popular AI scripting languages seem like a reasonable choice, e.g. Python, Scheme or Prolog. Later, as experience is gained, it will become easier to understand what architecture will be needed for a scalable solution. One issue is the relationship between short and long term memory, and the indexing mechanisms needed to support the very large amount of information needed for an adequate treatment of common sense reasoning. A further issue is safeguarding information held about individuals contributing to the system. Information learned from one person may need to be kept private and not shared with other people.
The first step is morphological analysis and part-of-speech identification. At its simplest, this is just a matter of looking up words in a dictionary. In practice, the preceding words and the conversational focus will be used to determine the most likely interpretations of each word, based upon prior training against a tagged corpus of written texts. This also covers the recognition of compound nouns and named entities. Further study is needed on detailed requirements for representing word senses efficiently.
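To make the statistical tagging step concrete, here is a minimal bigram (hidden Markov model) tagger using the Viterbi algorithm. The tag set and probability tables are invented for illustration; in practice they would be estimated from a tagged corpus such as the British National Corpus:

```python
# A minimal Viterbi tagger over a bigram (HMM) model. The probability
# tables are invented for illustration; in practice they would be
# estimated from a tagged corpus.
import math

tags = ["DET", "NOUN", "VERB"]
trans = {  # P(tag | previous tag), with "<s>" for sentence start
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.3, ("<s>", "VERB"): 0.1,
    ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.05, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.1,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
emit = {  # P(word | tag)
    ("DET", "the"): 0.7, ("NOUN", "dog"): 0.1,
    ("NOUN", "barks"): 0.01, ("VERB", "barks"): 0.2,
}

def viterbi(words):
    # best[t] = (log probability, tag sequence) of the best path ending in t
    best = {t: (math.log(trans.get(("<s>", t), 1e-9))
                + math.log(emit.get((t, words[0]), 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(
            (lp + math.log(trans.get((prev, t), 1e-9))
                + math.log(emit.get((t, w), 1e-9)), seq + [t])
            for prev, (lp, seq) in best.items())
            for t in tags}
    return max(best.values())[1]

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']
```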
One idea is for there to be a unique name for each word sense. Some grammars annotate lexical entries with attributes that indicate gender, cardinality and many other properties. Words are often used in ways that are highly dependent on the context. This suggests that a collection of concepts may be needed to capture the fluid nature of word meanings, and that relationships between words should be expressed as relationships between such collections. Further work is needed to understand this better.
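One possible shape for such lexical entries, purely as an assumption for discussion, is a record per sense with a unique name and a bag of attributes; the naming scheme (word#pos#number) and the attributes shown are illustrative, not a settled design:

```python
# One possible record structure for uniquely named word senses.
# The naming scheme and attributes are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class Sense:
    name: str                        # unique sense name, e.g. "bank#n#1"
    gloss: str                       # human-readable definition
    attributes: dict = field(default_factory=dict)

lexicon = {
    "bank": [
        Sense("bank#n#1", "financial institution"),
        Sense("bank#n#2", "sloping land beside a river"),
    ],
    "geese": [
        Sense("goose#n#1", "large waterbird", {"number": "plural"}),
    ],
}
print(lexicon["bank"][0].name)       # bank#n#1
```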
The next step is parsing. Natural language is typically highly ambiguous, and a long sentence may have many hundreds of alternative parses. To cope with this, the system will be trained to rank grammar rules according to the context. This enables the use of the "A*" algorithm to find the most likely parses. The initial training will be done against a tagged corpus (generally known as a "tree-bank"). There is a wide range of grammatical formalisms, and the choice will be influenced by the design of the lexicon as well as the availability of training materials.
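The following sketch shows the flavour of best-first search over parse hypotheses with ranked (cost-bearing) grammar rules. Strictly it is uniform-cost search, i.e. A* with a zero heuristic, and the grammar, costs and sentence are invented for illustration:

```python
# Best-first search over parse hypotheses, where rules carry learned
# costs (e.g. negative log probabilities). All values are invented.
import heapq

rules = {  # (child1, child2) -> list of (parent, cost)
    ("Det", "N"): [("NP", 0.1)],
    ("NP", "VP"): [("S", 0.1)],
    ("V", "NP"): [("VP", 0.2)],
    ("N", "VP"): [("S", 2.0)],      # a low-ranked (costly) alternative
}

def best_first_parse(leaves):
    # a hypothesis is (cost, tuple of symbols spanning the sentence)
    queue = [(0.0, tuple(leaves))]
    seen = set()
    while queue:
        cost, syms = heapq.heappop(queue)
        if syms == ("S",):
            return cost              # cheapest complete parse found first
        if syms in seen:
            continue
        seen.add(syms)
        for i in range(len(syms) - 1):
            for parent, rule_cost in rules.get(syms[i:i + 2], []):
                new = syms[:i] + (parent,) + syms[i + 2:]
                heapq.heappush(queue, (cost + rule_cost, new))
    return None

print(best_first_parse(["Det", "N", "V", "Det", "N"]))   # 0.5
```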
Resolution of deixis, anaphora and prepositional attachment will be addressed at a higher level, using the semantic context provided by the current conversation. This is a departure from traditional statistical natural language parsing techniques, but is essential for an adequate treatment of common sense and relevance theory. The most likely meaning of a word depends on the semantic context, and not just the preceding few words. Parsing and semantic processing are thus intertwined, each one feeding off the other.
The purpose of parsing is to enable the system to draw appropriate inferences. This involves the construction of statements that represent meaning within the current context. In the simplest case, the meaning of the utterance follows directly from the composition of the literal meanings of the words. In other cases, the meaning is highly dependent on understanding the context and potential goals of the person making the utterance. Idiomatic phrases should be recognized as such, bypassing the normal parsing and semantic interpretation mechanisms.
Semantics will be represented in terms of a labelled graph or semantic network. The nodes in a semantic network could be explicit concepts or compound concepts that are themselves queries against the system's short and long term knowledge base. This virtualizes concepts. In general, the concepts used in assertions or rules are virtual, so that knowledge retrieval is founded on a semantic indexing mechanism.
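Here is a minimal sketch of what a virtualized concept could look like: a concept defined as a query that is re-evaluated against the knowledge base on demand, rather than a fixed node. The store and names are illustrative assumptions:

```python
# A "virtual" concept: its membership is computed on demand from the
# knowledge base, so it tracks changes to the facts. Illustrative only.

triples = {("John", "owes", "money"), ("Mary", "owns", "car")}

def virtual_concept(predicate, obj):
    """A concept whose members are recomputed from the store on demand."""
    return lambda: {s for (s, p, o) in triples if p == predicate and o == obj}

debtors = virtual_concept("owes", "money")
print(debtors())   # {'John'}

triples.add(("Peter", "owes", "money"))
print(debtors())   # {'John', 'Peter'} -- membership follows the facts
```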
A major issue is how the system learns semantics. There could be a core based upon training against a tagged corpus, but I believe that some kind of bootstrapping process will be needed. This could start with very simple concepts, and is likely to need manual work at a level below the conversational interface.
How does the system react to the semantics of natural language input and carry out the reasoning needed to generate an appropriate response? What kinds of mental states are involved? Relevance theory describes how inference can be minimized to match the current situation. With the huge amount of information available, inference is likely to drown in a combinatorial explosion unless some means is found to contain it and channel it in a useful direction. The solution seems to involve the use of relevant contexts. I believe that much thinking is in terms of constructing and acting out stories, but how to expand on this idea?
Deirdre Wilson and Dan Sperber's relevance theory introduces relevance as the key to minimizing the effort needed to understand an utterance. The greater the relevance, the less the effort that is needed. In a cooperative dialog, the speaker will be expected to make his or her utterances as easy as possible for the listener to understand. That means that the intended meaning of each utterance should be maximally relevant to the listener.
They go on to describe sub-tasks in the comprehension process:
- Constructing an appropriate hypothesis about explicit content (in relevance-theoretic terms, explicatures) via decoding, disambiguation, reference resolution, and other pragmatic enrichment processes.
- Constructing an appropriate hypothesis about the intended contextual assumptions (in relevance-theoretic terms, implicated premises).
- Constructing an appropriate hypothesis about the intended contextual implications (in relevance-theoretic terms, implicated conclusions).
These are developed in parallel against a background of expectations which may be revised or elaborated as the utterance unfolds. In particular, the hearer may bring to the comprehension process not only a general presumption of relevance, but more specific expectations about how the utterance will be relevant to him (what cognitive effects it is likely to achieve), and these may contribute, via backwards inference, to the identification of explicatures and implicated premises.
The paper includes worked examples that illustrate the kinds of inference involved, e.g.
Peter: Did John pay back the money he owed you?
Mary: No. He forgot to go to the bank.
This involves the realization that Mary is probably a friend of John. Mary prefers to be repaid in cash and not with a personal cheque. John didn't have enough cash on him to pay it back to Mary. John intended to get some more cash from the bank, but forgot to visit the bank. As a result he wasn't able to pay Mary. He is likely to pay her soon after a trip to the bank. If he forgets again, Mary will be upset and he will be embarrassed, neither of which he wants to happen.
Borrowing money from friends is a common occurrence and this makes the above example easy to understand. The process of understanding can be considered as constructing a story that explains the utterances. This new story can be based upon remembered stories involving yourself or others. It should be as simple as possible to explain the utterances.
From relevance theory we have the idea that a speaker will be expected to make his or her utterances as easy as possible for the listener to understand. Sometimes the speaker will fail to do so, and a competent understander will usually be able to cope with such failings. Social contexts often influence how things are put. Speakers may use indirect statements when a direct statement might be seen as impolite (either by the immediate listener or someone else nearby).
Sometimes people have ulterior motives and may seek to deceive or to plant an idea that benefits themselves but not the recipient. One example of this is advertising messages that aim to make you think that smoking will make you appear sexy. A sophisticated listener will seek to understand what benefits the speaker would gain if you were to accept the idea implied by their statement. Is the speaker sincere, or deceitful? If sincere, is he to be trusted in this matter?
Sperber's paper Understanding Verbal Understanding describes these and related issues at some length. I particularly liked his summary:
Full-fledged communicative competence involves, for the speaker, being capable of having at least third-order meta-representational communicative intentions, and, for the hearer, being capable of making at least fourth-order meta-representational attributions of such communicative intentions. In fact, when irony, reported speech, and other meta-representational contents are taken into consideration, it becomes apparent that communicators juggle quite easily with still more complex meta-representations.
An adequate treatment of common sense needs to cover such sophisticated use of language, and this has obvious implications for the way meta-representational intentions are expressed as semantic networks. First order representations are clearly inadequate!
According to Merriam-Webster, an episode is a usually brief unit of action in a dramatic or literary work. It is a developed situation that is integral to but separable from a continuous narrative. It is an event that is distinctive and separate although part of a larger series. Episodic reasoning is thus about how things change in time. We can reason about the causes of changes associated with events and about the properties that hold throughout an interval. The frame problem of AI relates to the challenges in keeping track of what changes and what doesn't in any given situation when trying to reason about plans of actions. Humans appear to address this problem by exploiting learned regularities that influence what is most relevant. We very rarely work things out from first principles.
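A minimal sketch of this style of episodic bookkeeping, under the simple assumption that a property persists until an event changes it; the events and properties are illustrative:

```python
# A minimal sketch of episodic state: a property persists over time
# until an event changes it (one simple answer to the frame problem's
# bookkeeping). Events and properties are invented for illustration.

history = []   # list of (time, property, value), kept sorted by time

def record(time, prop, value):
    history.append((time, prop, value))
    history.sort()

def holds_at(prop, time):
    """The latest value recorded at or before `time` persists."""
    value = None
    for t, p, v in history:
        if p == prop and t <= time:
            value = v
    return value

record(0, "door", "closed")
record(5, "door", "open")      # an "open the door" event at t=5
print(holds_at("door", 3))     # closed -- unchanged between events
print(holds_at("door", 7))     # open
```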
When you look across a room and decide to take a closer look out of the window, you use non-verbal reasoning to plan a route across the room, for instance walking around a dining table in the center of the room. Common sense reasoning isn't restricted to verbal reasoning and plays an important role in how you make sense of what you are looking at in everyday scenes. Understanding images and movement is made easier because of the statistical and causal regularities in how things fit together.
Statistical models of syntax and semantics can be applied to vision in an analogous way to how they apply to spoken and written language. The syntax describes the visual textures and optical flow that delineate the image. The semantics describe the objects making up the scene and how they fit together. Some examples include the patterns of movement of someone walking, the effects of perspective on the changing appearance of buildings as we walk past them, and the possible forms a pair of spectacles can take when resting on a table. Our effortless understanding of images relies on access to a vast amount of everyday knowledge structured at many different levels. Once again relevance theory is needed to limit processing to what is most relevant in any situation.
A text-based conversational agent will need some awareness of non-verbal reasoning, but my hope is that others will be inspired to work on this once the basic idea of statistical treatment of semantics has been demonstrated for verbal reasoning.
Optical illusions reveal the existence of the shortcuts we use to efficiently process images. The shortcuts work well under normal circumstances, but break down when the assumptions are changed. Analogous illusions can be devised for verbal reasoning. I would like to collect examples of these.
I am now studying a small number of examples in greater detail:
The system first needs to construct a semantic representation of the utterance it intends to make. This involves modelling the mental states of the listener and the current state of the conversation. The utterance then has to be broken down into a sequence of pieces that can be mapped into natural language.
The next step is to identify natural language templates that match the utterance and to instantiate them. At this point a statistical language model can be used to propose words.
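A minimal sketch of this generation step, with an invented template and bigram probabilities standing in for a trained statistical language model:

```python
# Template-based generation: a triple is matched to a natural language
# template, and a bigram model scores candidate words for an open slot.
# Templates and probabilities are invented for illustration.

templates = {
    "owes": "{subject} owes {object} some {noun}",
}
bigram = {  # P(word | previous word)
    ("some", "money"): 0.4, ("some", "cash"): 0.2, ("some", "funds"): 0.05,
}

def generate(s, p, o, candidates):
    template = templates[p]
    # let the language model choose the most likely filler for {noun}
    noun = max(candidates, key=lambda w: bigram.get(("some", w), 0.0))
    return template.format(subject=s, object=o, noun=noun)

print(generate("John", "owes", "Mary", ["money", "cash", "funds"]))
# John owes Mary some money
```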
Lots and lots of background reading!
Dave Raggett, email: dsr@w3.org, phone: +44 1225 866 240, last updated 5 March 2010