REPORT.doc

NATURAL LANGUAGE PROCESSING
INTRODUCTION:
Natural Language Processing (NLP) is an area of research and application that
explores how computers can be used to understand and manipulate natural language text
or speech to do useful things. NLP researchers aim to gather knowledge on how human
beings understand and use language so that appropriate tools and techniques can be
developed to make computer systems understand and manipulate natural languages to
perform the desired tasks. The foundations of NLP lie in a number of disciplines, viz.
computer and information sciences, linguistics, mathematics, electrical and electronic
engineering, artificial intelligence and robotics, psychology, etc.
NLP has become quite prominent due to the proliferation of the World Wide Web and
digital libraries. Several researchers have pointed out the need for appropriate research in
facilitating multi- or cross-lingual information retrieval, including multilingual text processing
and multilingual user interface systems, in order to exploit the full benefit of the www and digital
libraries.
APPROACHES TO NATURAL LANGUAGE PROCESSING:
Natural language processing approaches fall roughly into four categories:
symbolic, statistical, connectionist, and hybrid.
Symbolic approaches perform deep analysis of linguistic phenomena and are
based on explicit representation of facts about language through well-understood
knowledge representation schemes and associated algorithms. A good example of
symbolic approaches is seen in logic or rule-based systems. Another example of symbolic
approaches is semantic networks. Symbolic approaches have been used for a few decades
in a variety of research areas and applications such as information extraction, text
categorization, ambiguity resolution, and lexical acquisition. Typical techniques include:
explanation-based learning, rule-based learning, inductive logic programming, decision
trees, conceptual clustering, and K nearest neighbor algorithms.
Statistical approaches employ various mathematical techniques and often use
large text corpora to develop approximate generalized models of linguistic phenomena
based on actual examples of these phenomena provided by the text corpora without
adding significant linguistic or world knowledge. A frequently used statistical model is

the Hidden Markov Model (HMM) inherited from the speech community. Statistical
approaches have typically been used in tasks such as speech recognition, lexical
acquisition, parsing, part-of-speech tagging, collocations, statistical machine translation,
and statistical grammar learning, and so on.
Connectionist approaches also develop generalized models from examples of
linguistic phenomena. What separates connectionism from other statistical methods is
that connectionist models combine statistical learning with various theories of
representation - thus the connectionist representations allow transformation, inference,
and manipulation of logic formulae. They perform well at tasks such as word-sense
disambiguation, language generation, and limited inference. These models are well suited
for natural language processing tasks such as syntactic parsing, limited domain
translation tasks, and associative retrieval.
NATURAL LANGUAGE UNDERSTANDING:
Liddy (1998) and Feldman (1999) suggest that in order to understand natural languages, it is
important to be able to distinguish among the following seven interdependent levels, that people
use to extract meaning from text or spoken languages:
• Phonetic or phonological level that deals with pronunciation
• Morphological level that deals with the smallest parts of words, that carry a meaning, and
suffixes and prefixes
• Lexical level that deals with lexical meaning of words and parts of speech analyses
• Syntactic level that deals with grammar and structure of sentences
• Semantic level that deals with the meaning of words and sentences
• Discourse level that deals with the structure of different kinds of text using document
structures and
• Pragmatic level that deals with the knowledge that comes from the outside world, i.e., from
outside the contents of the document.
A natural language processing system may involve all or some of these levels of analysis.
MAJOR TASKS IN NLP:
The following is a list of some of the most commonly researched tasks in NLP.
Note that some of these tasks have direct real-world applications, while others more
commonly serve as subtasks that are used to aid in solving larger tasks. What
distinguishes these tasks from other potential and actual NLP tasks is not only the volume

of research devoted to them but the fact that for each one there is typically a well-defined
problem setting, a standard metric for evaluating the task, standard corpora on which the
task can be evaluated, and competitions devoted to the specific task.
 Automatic summarization: Produce a readable summary of a chunk of text.
Often used to provide summaries of text of a known type, such as articles in the
financial section of a newspaper.
 Coreference resolution: Given a sentence or larger chunk of text, determine
which words ("mentions") refer to the same objects ("entities"). Anaphora
resolution is a specific example of this task, and is specifically concerned with
matching up pronouns with the nouns or names that they refer to. The more
general task of coreference resolution also includes identify so-called "bridging
relationships" involving referring expressions. For example, in a sentence such as
"He entered John's house through the front door", "the front door" is a referring
expression and the bridging relationship to be identified is the fact that the door
being referred to is the front door of John's house (rather than of some other
structure that might also be referred to).
 Machine translation: Automatically translate text from one human language to
another. This is one of the most difficult problems, and is a member of a class of
problems colloquially termed "AI-complete", i.e. requiring all of the different
types of knowledge that humans possess (grammar, semantics, facts about the real
world, etc.) in order to solve properly.
 Morphological segmentation: Separate words into individual morphemes and
identify the class of the morphemes. The difficulty of this task depends greatly on
the complexity of the morphology (i.e. the structure of words) of the language
being considered. English has fairly simple morphology, especially inflectional
morphology, and thus it is often possible to ignore this task entirely and simply
model all possible forms of a word (e.g. "open, opens, opened, opening") as
separate words. In languages such as Turkish, however, such an approach is not
possible, as each dictionary entry has thousands of possible word forms.

 Named Entity Recognition : It is also known as entity identification and entity
extraction is a subtask of information extraction that seeks to locate and classify
atomic elements in text into predefined categories such as the names of persons,
organizations, locations, expressions of times, quantities, monetary values,
percentages, etc.
 Natural language generation: Convert information from computer databases
into readable human language.
 Optical Character Recognition (OCR): It is the mechanic or electronic
translation of scanned images of handwritten, type written or printed text into
machine-encoded text. It is widely used to convert books and documents into
electronic files, to computerize a record-keeping system in an office, or to publish
the text on a website. OCR makes it possible to edit the text, search for a word or
phrase, store it more compactly, display or print a copy free of scanning artifacts,
and apply techniques such as machine translation, text-to-speech and text
mining to it. OCR is a field of research in pattern recognition, artificial
intelligence and computer vision.
 Part-of-speech tagging: Given a sentence, determine the part of speech for each
word. Many words, especially common ones, can serve as multiple parts of
speech. For example, "book" can be a noun ("the book on the table") or verb ("to
book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at
least five different parts of speech. Note that some languages have more such
ambiguity than others.
 Parsing: It is an (ordered, rooted) tree that represents the syntactic structure of
a string according to some formal grammar. In a parse tree, the interior nodes are
labeled by non-terminals of the grammar, while the leaf nodes are labeled
by terminals of the grammar. A parse tree is made up of nodes and branches. The
image below represents a linguistic parse tree, here representing
the English sentence "John hit the ball". (The parse tree here is a greatly
simplified one; for more information, see X-bar theory.) The parse tree is the

entire structure, starting from S and ending in each of the leaf nodes (John, hit,
the, ball). We use the following abbreviations in the example:
 S for sentence, the top-level structure in this example
 NP for noun phrase. The first (leftmost) NP, a single noun "John", serves as
the subject of the sentence. The second one is the object of the sentence.
 VP for verb phrase, which serves as the predicate
 V for verb. In this case, it's a transitive verb "hit".
 Det for determiner, in this instance the definite article "the"
 N for noun.
 Question Answering: Given a human-language question, determine its answer.
Typical questions have a specific right answer (such as "What is the capital of
Canada?"), but sometimes open-ended questions are also considered (such as
"What is the meaning of life?").
 Relationship extraction: Given a chunk of text, identify the relationships among
named entities (e.g. who is the wife of whom).
 Sentence breaking (also known as sentence boundary disambiguation): Given a
chunk of text, find the sentence boundaries. Sentence boundaries are often marked
by periods or other punctuation marks, but these same characters can serve other
purposes (e.g. marking abbreviations).
 Sentiment analysis: Extract subjective information usually from a set of
documents, often using online reviews to determine "polarity" about specific
objects. It is especially useful for identifying trends of public opinion in the social
media, for the purpose of marketing.

 Speech recognition: Given a sound clip of a person or people speaking,
determine the textual representation of the speech. In natural speech there are
hardly any pauses between successive words, and thus speech segmentation is a
necessary subtask of speech recognition. Note also that in most spoken languages,
the sounds representing successive letters blend into each other in a process
termed co-articulation, so the conversion of the analog signal to discrete
characters can be a very difficult process.
 Speech segmentation: Given a sound clip of a person or people speaking,
separate it into words. A subtask of speech recognition and typically grouped with
it.
 Topic segmentation and recognition: Given a chunk of text, separate it into
segments each of which is devoted to a topic, and identify the topic of the
segment.
 Word segmentation: Separate a chunk of continuous text into separate words.
For a language like English, this is fairly trivial, since words are usually separated
by spaces. However, some written languages like Chinese, Japanese and Thai do
not mark word boundaries in such a fashion, and in those languages text
segmentation is a significant task requiring knowledge of
the vocabulary and morphology of words in the language.
 Word sense disambiguation: Many words have more than one meaning; we
have to select the meaning which makes the most sense in context. For this
problem, we are typically given a list of words and associated word senses, e.g.
from a dictionary or from an online resource such as Word Net.
 Stemming is the process for reducing inflected (or sometimes derived) words to
their stem, base or root form—generally a written word form. Many search
engines treat words with the same stem as synonyms as a kind of query
broadening, a process called conflation. Stemming programs are commonly
referred to as stemming algorithms or stemmers. A stemming algorithm reduces
the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

 Text simplification is an operation used in natural language processing to
modify, enhance, classify or otherwise process an existing corpus of human-
readable text in such a way that the grammar and structure of the prose is greatly
simplified, while the underlying meaning and information remains the same. Text
simplification is an important area of research, because natural human languages
ordinarily contain complex compound constructions that are not easily processed
through automation.
 Text to speech: Speech synthesis is the artificial production of human speech. A
computer system used for this purpose is called a speech synthesizer, and can be
implemented in software or hardware. A text-to-speech (TTS) system converts
normal language text into speech; other systems render symbolic linguistic
representations like phonetic transcriptions into speech.
 Text proofing : Proofreading is the reading of a galley proof or computer
monitor to detect and correct production-errors of text or art. Proofreaders are
expected to be consistently accurate by default because they occupy the last stage
of typographic production before publication.
 Natural Language User Interfaces (LUI) are a type of computer human
interface where linguistic phenomena such as verbs, phrases and clauses act as UI
controls for creating, selecting and modifying data in software applications.
 Query expansion (QE) is the process of reformulating a seed query to improve
retrieval performance in information retrieval operations. In the context of
web search engines, query expansion involves evaluating a user's input and
expanding the search query to match additional documents.
 True casing is the problem in natural language processing (NLP) of determining
the proper capitalization of words where such information is unavailable.
In some cases, sets of related tasks are grouped into subfields of NLP that are often
considered separately from NLP as a whole. Examples include:
 Information retrieval (IR): This is concerned with storing, searching and
retrieving information. It is a separate field within computer science (closer to

databases), but IR relies on some NLP methods (for example, stemming). Some
current research and applications seek to bridge the gap between IR and NLP.
 Information extraction (IE): This is concerned in general with the extraction of
semantic information from text. This covers tasks such as named entity
recognition, coreference resolution, relationship extraction, etc.
 Dialogue Systems: perhaps the omnipresent application of the future, in the
systems envisioned by large providers of end-user applications. Dialogue systems,
which usually focus on a narrowly defined application (e.g. your refrigerator or
home sound system), currently utilize the phonetic and lexical levels of language.
It is believe that utilization of all the levels of language processing explained
above offer the potential for truly habitable dialogue systems.
APPLICATIONS:
Natural language processing can be used in the automatic scoring of the
students’ papers. Automatic scoring is a help to teachers of giving students exact marks
in examination. Chinese subjective questions scoring algorithms are based on similarity
calculation and natural language processing. Forward maximum matching algorithm and
semantic similarity computation and contrary degree calculation can overcome problems
of Chinese subjective questions scoring. Forward maximum matching algorithm is a
convenient segmentation method.
Mohd Ibrahim, Rodina Ahmad Proposes a method and a tool to facilitate
requirements analysis process and class diagram extraction from textual requirements
supporting natural language processing and Domain Ontology techniques. Requirement
Analysis and Class Diagram Extraction (RACE) tool system is decomposed in to internal
and external components subsystems. They are a) Open NLP parser b) Race stemming
algorithm c) Word Net d) Concept extraction engine e) Domain ontology f) Class
extraction engine g) Race concept management (UI).
The natural language processing software which transforms a natural language
sentence into a series of commands for a mobile service robot. A prototype application
is presented which allows severely disabled persons bound to a wheel chair to control the
chair and their home environment by natural language. The spoken sentence is parsed by

Finite State Transducer Network (FSTN) , which checks the grammatical structure and
vocabulary. The output of FSTN then is processed by transformation module. First the
sentence is split into the basic components “instruction”, “intervention”, and “question”.
The splitting algorithm is based on the recognition of grammatical word pattern. These
patterns are stored and checked on semantic completeness. This process is based on the
environment and semantic database contained within the environment module. In case of
missing information, the system checks the context memory to find missing data. Incase
of no entry can be found, the question/answer module is initiated and a question is
generated. Finally, a control sequence is generated and transmitted to the controller.
Mladen Stanojevic and Sanja Vranes propose a model that could be used to
describe roughly the process of understanding as it happens in our brain. They developed
a Hierarchical Semantic Form (HSF) for the representation of semantics and the
corresponding SOUL (Space of Universal Links) learning algorithm. The proposed HSF
which resembles hierarchically organized neural network and represents a hierarchical
equivalent of plain text forms, where all semantic categories are explicitly represented
and hierarchically organized. To validate the ideas of HSF and SOUL we have developed
a prototype of Semantic Web Service for Flight Information Service (FIS).
Flight schedule query system based on NLP is developed in order to increase the
capability of natural language processing from time to time. Here there are two types of
methodology they are requirement prototype and evolution prototype. The prototype
consists of 4 procedures that are initial investigation, system analysis, system design and
system implementation. The system design process is emphasizes on 3 main aspects
which are design of knowledge base, inference engine and user interface.
The Email-Based Intrusion Detection System to find Social Engineering using
Natural language processing(EBIDS-SENLP) does this by reading plain text emails and
sending emails natural language body to OntoSem(Ontology). However, OntoSem needs
pretraining to look for certain segments of language as concepts of deception to
meaningfully analyze the raw text and tell EBIDS what types of deception the e-mail
could contain.
The E-democracy European network project applied NLP to improve
communication between public administrations and their citizens- are a major application

field for NLP and language engineering. Our goal was twofold: to test whether we could
meet e-democracy requirements using advanced linguistic technologies and to test
whether Augmented Phrase Structure Grammars (APSGs) were robust and well-assessed
enough to use in a real world (and highly sensitive) environment. To improve
communication the set of modules as a toolset comprising 5 toolsets address guesser,
answer tree, style enhancer, multi language helper and natural language map.
In clinical application, the NLP and Vision Processing (VP) might be harnessed
together in to a single system. The key to our integration is use of a single target
representation. Their combination is more powerful than either in isolation because they
mutually disambiguate and confirm one another.
Speech based information service systems are much friendlier and more
accessible than keyboard type-in based one. Here a new way of error detection and
correction using Comprehensive Information based natural language processing. A
module of text post-processing is added after a Speech Recognizer. In this module,
syntactic, semantic and pragmatic information are analyzed to check the result sentence
text of the Speech Recognizer. If there are some errors, the module also tries to correct
them.

ALGORITHM FOR CI BASED SENTENCE PROCESSING
To enhance the readability of Japanese texts by machine-assisted language control
with automatic paraphrasing. Technically plain Japanese is a controlled language and its
specification includes controlled spelling, controlled vocabulary, controlled grammar and
limitation of quantitative parameters such as sentence length.
Natural Languages Interfaces to Data Bases” (NLIDB) is a system that allows a
user to interact with a database by expressing himself in a natural language such as
English or any other. This system prepares an “expert system” implemented in prolog
which it can identify synonymous words in any language. It first parses the input
sentences, and then the natural language expressions are transformed to SQL language.
Understanding natural language by machine includes Syntactic knowledge (shallow
parsing, syntactic parsing, and lexical cohesion), semantic knowledge, expert system,
prolog, semantic analysis, implementing proposal determinant system.
Conclusions:
While NLP is a relatively recent area of research and application, as compared to
other information technology approaches, there have been sufficient successes to date
that suggest that NLP-based information access technologies will continue to be a major
area of research and development in information systems now and far into the future.

REPORT ON NATURAL
LANGUAGE PROCESSING

REPORT.doc

Recommended

Recommended

More Related Content

Similar to REPORT.doc

Similar to REPORT.doc (20)

Recently uploaded

Recently uploaded (20)

REPORT.doc