Nepali Chunker

Final Project Summary
Gandaki College of Engineering and Sciences, Pokhara, Nepal

A chunker is a system that finds the phrases present in a tagged sentence according to a set of POS
(Part of Speech) rules. It is a part of Natural Language Processing that is useful for finding the
syntactic structure of a sentence. This system was a part of a Nepali Parser that was outsourced to
Gandaki College of Engineering and Sciences by Madan Puraskar Pustakalaya
(www.madanpuraskar.org), a non-profit organization working on Nepalinux (www.nepalinux.org) and
other Nepali-language software.

The Chunker in this context takes a tagged input sentence, performs chunking, that is, finding the
relations between words or between phrases, and represents them in parenthesized notation. The
system operates on the basis of certain rules. Each word of the input sentence should be given as its
Unicode value, followed by a slash and then the tag of the word in capital English letters. Unicode was
chosen because it brings uniformity in representing the words, although it is not compulsory since the
tags matter more than the words themselves. Two consecutive words must be separated by a single
space. The system then finds the phrases from the input words according to those rules; in short, the
generated phrases are called chunks.
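For illustration, assuming a hypothetical tagset (NOUN, ADJ and VERB are placeholder tag names, not necessarily the tags used in the actual project), an input line would look like:

राम/NOUN सानो/ADJ किताब/NOUN पढ्छ/VERB

Each token is split at the slash, giving the word (राम) and its tag (NOUN); only the tags drive the chunking.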



The Chunker being developed is domain specific and chunks simple sentences like those found in
primary-level school books. Complex sentences are outside the scope of this project. For broader
domain coverage, a well-analyzed and complete set of rules needs to be provided. Such a system is
useful for many other systems such as machine translation, grammar checking, information retrieval,
etc.

Chunking is a natural language processing technique that attempts to provide some machine
understanding of the structure of a sentence without parsing it fully into a parse tree. Chunking is also
called partial parsing or shallow parsing. The output of a chunker is a division of the sentence into
series of words that together constitute a grammatical unit, mostly a noun, verb, or preposition phrase,
with less frequent occurrences of adverb, adjective, clause, and other phrases. The output differs from a
full parse tree because it consists of series of words that do not overlap and do not contain each other.
This makes chunking an easier Natural Language Processing task than full parsing.
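As a rough illustration, using the same hypothetical tags as above, a shallow chunking of the earlier sentence could be the flat, non-overlapping sequence

(NP राम/NOUN) (NP सानो/ADJ किताब/NOUN) (VP पढ्छ/VERB)

whereas a full parse would additionally nest these phrases under a single sentence node.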

One way of providing artificial intelligence to the system is to supply each and every rule to it. The
system is not allowed to look anywhere besides the provided rules. This method is limited to the rules
provided, and new rules have to be added manually. However, this method of parsing is easier, and the
system's performance can be observed quickly.
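As a minimal sketch of this rule-driven approach, the rules could be kept in a plain text file and loaded into memory before chunking. The Java code below is only illustrative: the file syntax "NP -> det + adj + noun", the class names and the assumption of one rule per line are not taken from the project itself.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// One PS-rule: a phrase name and the tag sequence that reduces to it.
class Rule {
    final String phrase;   // e.g. "NP"
    final String[] tags;   // e.g. {"det", "adj", "noun"}
    Rule(String phrase, String[] tags) { this.phrase = phrase; this.tags = tags; }
}

class RuleLoader {
    // Reads lines written as "NP -> det + adj + noun" (assumed syntax).
    static List<Rule> load(String path) throws IOException {
        List<Rule> rules = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            if (line.trim().isEmpty()) continue;
            String[] sides = line.split("->");
            String[] tags = sides[1].trim().split("\\s*\\+\\s*");
            rules.add(new Rule(sides[0].trim(), tags));
        }
        return rules;
    }
}

Keeping the rules in a plain file like this is what later makes it possible to add new rules without rebuilding any parsing table.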

The list of PS-rules, as provided by Madan Puraskar Pustakalaya, is as follows (a short worked trace is given after the list):
1. NP -> noun
2. NP -> det + noun
3. NP -> adj + noun
4. NP -> det + adj + noun
5. NP -> det + int + adj + noun
6. AP -> adjective
7. AP -> int + adjective
8. AdvP -> adverb
9. AdvP -> int + adverb
10. PoP -> postposition
11. PoP -> NP + postposition
12. VP -> verb
13. VP -> adverb + verb
14. VP -> NP + verb
15. VP -> NP + AdvP + verb
16. VP -> AdvP + NP + verb
17. VP -> NP + NP + verb
18. VP -> NP + PoP + verb
19. S -> NP + VP
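Read bottom up, these rules let a simple tagged sentence collapse step by step. As a hypothetical trace for a noun + adj + noun + verb sequence: the adj + noun pair reduces to an NP by rule 3, the lone noun reduces to an NP by rule 1, the object NP followed by the verb reduces to a VP by rule 14, and the subject NP followed by that VP reduces to S by rule 19.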

LR(0) ALGORITHM

The LR parser algorithm is the standard bottom-up parsing algorithm with a time complexity of O(n³). The action-goto table is the most important part of the algorithm: it is consulted to decide whether a shift or a reduce action is done. However, the content of the action-goto table differs if there is any change in the number of rules or in the rules themselves. Since there were only a limited number of rules (around 17), there was a high chance that more rules would be added later. So an action-goto table constructed from the rules provided above was not fruitful for implementation. Hence, it was decided to go with a Finite State Automaton. The rules are kept in one file; if the pattern of a rule on some line matches while processing, the matched words are reduced to the phrase name given on that same line. The advantage of doing this is that any number of rules can be added in the future without bothering about reconstructing the action-goto table. However, the basic concept remains the same: instead of looking into the action-goto table, we look at the Finite State Automaton. The general outline is specified below:

Stage 1
a) Look for patterns in the order:
int+adj, adj, int+adv, adv
/* these tags are not present at the end of the comprehensive rules */
/* After dealing with these patterns */
b) if they are present then we replace them with their
corresponding phrases.
// These phrases do not include other phrases but are
// themselves included in other phrases
c) if not present then go to stage 2.

The standard LR algorithm does not talk about Stage 1. While tracing out the algorithm along with the rules, it was found that the algorithm was unable to find the smaller chunks or phrases present in the sentence for our domain. This was due to the structure of the rules and the variation in the language domain. It was analyzed that some smaller phrase rules did not fall directly within the bigger phrase rules; rather, an already reduced phrase was used in their place. Hence, this stage has been introduced for finding adjective and adverb phrases from the combinations int+adj, adj, int+adv, and adv.
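For example, with hypothetical tags, a standalone sequence such as धेरै/INT राम्रो/ADJ would be reduced to an AP by rule 7 in this stage, so that Stage 2 only has to deal with the already reduced phrase.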
Stage 2
Read the input sentence from the file;
do
{
Tokenize the word from the sentence;
Split each word to get the word value and the provided tag;
} while(there are words);
//Get the Local Tags
For each word, find the local tag that is used to compare with the
rules;
Initialize the Stage1 objects from the word information;
//now perform stage1 reduction
Look if small phrases like AP, AdvP can be obtained from the
Stage1 objects;
if(found)
reduce those objects and store the new phrase information in them
else
leave them as they are
//initialize Stage2 objects
Find the Stage1 objects that are not reduced and create that many
Stage2 objects;
Initialize each Stage2 object from the corresponding non-reduced
Stage1 object;
//Stage2 reduction
loop start:
For each Stage2 object, take its name and see if some rule
matches;
if(found){
push the object number onto the stack;
//this is called the shift action
if(tags_in_this_rule==stack_size)
{
store the information of phrase just formed;
empty the stack;
set last_index, input_index and rule_index each to
zero;
//this is called the reduce action
}
else
{
//rule can have more tags, needs more comparison
advance the input_index and rule_index each by 1;
keep the context of this rule and the objects in the stack;
repeat the loop from the next Stage2 object
with this information;
}
}
else
{
//no rule matched with this word
advance the input_index and last_index each by one;
Empty the stack;
}
end loop;
Print the information about each word from the Stage2 objects that are not reduced.
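To make the overall idea concrete, the following Java fragment is a minimal, self-contained sketch of rule-driven reduction: it repeatedly scans the current sequence of tags and phrase names and replaces the first match of a rule's tag sequence with that rule's phrase name. It deliberately simplifies away the stack, input_index and rule_index bookkeeping of the pseudocode above, and every name in it is an illustrative assumption rather than the project's actual code.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal chunk reducer: repeatedly replaces a matching tag sequence
// with its phrase name until no rule applies (illustrative sketch only).
class SimpleChunker {
    // A small subset of the PS-rules, hard-coded for the sketch.
    static final String[][] RULES = {
        {"NP", "det", "adj", "noun"},  // rule 4
        {"NP", "adj", "noun"},         // rule 3
        {"NP", "noun"},                // rule 1
        {"VP", "NP", "verb"},          // rule 14
        {"S",  "NP", "VP"}             // rule 19
    };

    static List<String> chunk(List<String> tags) {
        List<String> current = new ArrayList<>(tags);
        boolean reduced = true;
        while (reduced) {
            reduced = false;
            for (String[] rule : RULES) {
                List<String> rhs = Arrays.asList(rule).subList(1, rule.length);
                for (int i = 0; i + rhs.size() <= current.size(); i++) {
                    if (current.subList(i, i + rhs.size()).equals(rhs)) {
                        // reduce: replace the matched span with the phrase name
                        for (int k = 0; k < rhs.size(); k++) current.remove(i);
                        current.add(i, rule[0]);
                        reduced = true;
                        break;
                    }
                }
                if (reduced) break;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // hypothetical local tags for a sentence like "noun adj noun verb"
        System.out.println(chunk(Arrays.asList("noun", "adj", "noun", "verb")));
        // prints [S]
    }
}

A greedy scan like this can group ambiguous sequences wrongly, which is exactly why the Stage 2 pseudocode keeps a stack together with the input, rule and last indices.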

