Nepali Chunker
Final Project Summary
Gandaki College of Engineering And Sciences, Pokhara, Nepal
A chunker is a system that finds the phrases present in a tagged sentence based on a set of POS (Parts of Speech) rules. It is a part of Natural Language Processing that is useful for finding the syntactic structure of a sentence. This system was part of a Nepali Parser that was outsourced to Gandaki College of Engineering and Sciences by Madan Puraskar Pustakalaya (www.madanpuraskar.org), a non-profit organization working on Nepalinux (www.nepalinux.org) and other Nepali language related software.
The Chunker in this context takes a tagged input sentence, performs chunking, i.e. finds the relations between words or between phrases, and represents them in parenthesized notation. The system operates on the basis of certain rules. Each word in the input sentence should be given as the Unicode form of the word, followed by a slash and then the tag of the word in capital English letters. Unicode was chosen because it brings uniformity in representing the words, although it is not compulsory, since the tags matter more than the words themselves. Two consecutive words must be separated by a single space. The system then finds the phrases from the input words according to the rules; in simple terms, the generated phrases are called chunks.
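As an illustration, a minimal sketch (in Python) of reading one sentence in this word/TAG format is given below. The tag names in the example are placeholders, since the actual tagset is not reproduced in this summary.

def split_tagged_sentence(line):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.strip().split(" "):      # a single space separates words
        word, _, tag = token.rpartition("/")   # the tag follows the last slash
        pairs.append((word, tag))
    return pairs

# Example with hypothetical tags:
# split_tagged_sentence("राम/NOUN राम्रो/ADJ किताब/NOUN पढ्छ/VERB")
# -> [('राम', 'NOUN'), ('राम्रो', 'ADJ'), ('किताब', 'NOUN'), ('पढ्छ', 'VERB')]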
The Chunker being developed is domain specific and chunks simple sentences like those found in school books for the primary level. Complex sentences are outside the scope of this project. For broad domain coverage, a well analyzed and complete set of rules needs to be provided. Such a system is useful for many other systems, such as machine translation, grammar checking, information retrieval, etc.
Chunking is a natural language processing technique that attempts to provide some machine understanding of the structure of a sentence without parsing it fully into a parse tree. Chunking is also called partial parsing or shallow parsing. The output of a chunker is a division of the text's sentences into series of words that together constitute a grammatical unit, mostly a noun, verb, or preposition phrase, with less frequent occurrences of adverb, adjective, clause, and other phrases. The output differs from that of a full parse tree because it consists of series of words that do not overlap and do not contain each other. This makes chunking an easier Natural Language Processing task than full parsing.
One way of providing artificial intelligence to the system is by supplying each and every rule to it. The system is not allowed to look anywhere besides the provided rules. This method is limited to the rules provided, and new rules have to be added manually. However, this method of parsing is easier to implement, and the system's performance can be evaluated quickly.
The list of PS rules (phrase structure rules), provided by Madan Puraskar Pustakalaya, is as follows:
1. NP -> noun
2. NP -> det + noun
3. NP -> adj + noun
4. NP -> det + adj + noun
5. NP -> det + int + adj + noun
6. AP -> adjective
7. AP -> int + adjective
8. AdvP -> adverb
9. AdvP -> int + adverb
10. PoP -> postposition
11. PoP -> NP + postposition
12. VP -> verb
13. VP -> adverb + verb
14. VP -> NP + verb
15. VP -> NP + AdvP + verb
16. VP -> AdvP + NP + verb
17. VP -> NP + NP + verb
18. VP -> NP + PoP + verb
19. S -> NP + VP
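As a convenience for the sketches that follow, the rule list above can be written down as plain data, for example as (phrase name, tag sequence) pairs in Python; this merely mirrors the list and is not the project's internal representation.

PS_RULES = [
    ("NP",   ["noun"]),
    ("NP",   ["det", "noun"]),
    ("NP",   ["adj", "noun"]),
    ("NP",   ["det", "adj", "noun"]),
    ("NP",   ["det", "int", "adj", "noun"]),
    ("AP",   ["adjective"]),
    ("AP",   ["int", "adjective"]),
    ("AdvP", ["adverb"]),
    ("AdvP", ["int", "adverb"]),
    ("PoP",  ["postposition"]),
    ("PoP",  ["NP", "postposition"]),
    ("VP",   ["verb"]),
    ("VP",   ["adverb", "verb"]),
    ("VP",   ["NP", "verb"]),
    ("VP",   ["NP", "AdvP", "verb"]),
    ("VP",   ["AdvP", "NP", "verb"]),
    ("VP",   ["NP", "NP", "verb"]),
    ("VP",   ["NP", "PoP", "verb"]),
    ("S",    ["NP", "VP"]),
]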
LR(0) ALGORITHM
The LR parser algorithm is the standard bottom-up parsing algorithm with a time complexity of O(n³). The action-goto table is the most important part of the algorithm: it is consulted to decide whether a shift or a reduce action is performed. However, the content of the action-goto table changes whenever the number of rules changes or the rules themselves change. Since there were only a limited number of rules (around 17), there is a good chance that more rules will be added later. So an action-goto table constructed from the rules provided was not fruitful
for implementation. Hence, it was decided to go with a Finite State Automaton. The rules are kept in one file, and if the tag sequence on the right-hand side of a rule on any line matches during processing, the matched words are reduced to the phrase name on the left-hand side of that same line. The advantage of doing this is that any number of rules can be added in the future without bothering about reconstructing the action-goto table. The basic concept remains the same; only, instead of looking into the action-goto table, we look at the Finite State Automaton.
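A minimal sketch of this idea follows, assuming a hypothetical one-rule-per-line text file written in the same "LHS -> tag + tag" notation as the list above; the project's actual file layout is not described in this summary.

def load_rules(path="rules.txt"):
    """Read lines like 'NP -> det + noun' into (lhs, [tags]) pairs."""
    rules = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "->" not in line:
                continue                      # skip blank or malformed lines
            lhs, rhs = line.split("->", 1)
            rules.append((lhs.strip(), [t.strip() for t in rhs.split("+")]))
    return rules

# The result has the same shape as PS_RULES above; whenever a right-hand side
# matches the symbols seen so far, that span is reduced to the left-hand-side
# phrase name, and new rules are added simply by appending lines to the file.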
The general outline is specified below:
Stage 1
a) Look for patterns in the order:
   int+adj, adj, int+adv, adv
   /* these tags are not present at the end of the comprehensive rules */
b) If they are present, replace them with their corresponding phrases.
   // These phrases do not include other phrases but are themselves
   // included in other phrases.
c) If not present, go to Stage 2.

The standard LR algorithm does not include Stage 1. While tracing the algorithm along with the rules, it was found that the algorithm was unable to find the smaller chunks or phrases present in the sentences of our domain. This was due to the structure of the rules and the variation in the language domain. It was observed that some smaller phrase rules did not appear directly inside the bigger phrase rules; instead, an already reduced phrase was used. Hence, this stage has been introduced to find adjective and adverb phrases from the combinations int+adj, adj, int+adv, adv.
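A minimal sketch of this Stage 1 pass is given below, using the tag names from the PS rules; the real pass presumably also considers context (for example leaving adj + noun for the NP rules), which is omitted here.

STAGE1_PATTERNS = [
    (["int", "adj"], "AP"),
    (["adj"],        "AP"),
    (["int", "adv"], "AdvP"),
    (["adv"],        "AdvP"),
]

def stage1_reduce(tags):
    """Replace adjective/adverb patterns with AP/AdvP phrase names."""
    out, i = [], 0
    while i < len(tags):
        for pattern, phrase in STAGE1_PATTERNS:
            if tags[i:i + len(pattern)] == pattern:   # patterns tried in the listed order
                out.append(phrase)
                i += len(pattern)
                break
        else:
            out.append(tags[i])                       # no pattern here: keep the tag
            i += 1
    return out

# stage1_reduce(["noun", "int", "adv", "verb"]) -> ["noun", "AdvP", "verb"],
# which rule 15 (VP -> NP + AdvP + verb) can then use in Stage 2.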
Stage 2
Read the input sentence from the file;
do
{
    Tokenize the next word from the sentence;
    Split each word to get the word value and the provided tag;
} while (there are words);

// Get the local tags
For each word, find the local tag that is used to compare with the rules;
Initialize the Stage1 objects from the word information;

// Now perform Stage 1 reduction
Look whether small phrases like AP and AdvP can be obtained from the Stage1 objects;
if (found)
    reduce those objects and store the new phrase information in them;
else
    leave them as they are;

// Initialize Stage2 objects
Find the Stage1 objects that are not reduced and create that many Stage2 objects;
Initialize each Stage2 object from the corresponding non-reduced Stage1 object;

// Stage 2 reduction
loop start:
    For each Stage2 object, take its name and see if some rule matches;
    if (found)
    {
        push the object number onto the stack;
        // this is called the shift action
        if (tags_in_this_rule == stack_size)
        {
            store the information of the phrase just formed;
            empty the stack;
            set last_index, input_index and rule_index each to zero;
            // this is called the reduce action
        }
        else
        {
            // the rule can have more tags, so more comparison is needed
            advance the input_index and rule_index each by 1;
            keep the context of this rule and the objects in the stack;
            repeat the loop from the next Stage2 object with this information;
        }
    }
    else
    {
        // no rule matched with this word
        advance the input_index and last_index each by one;
        empty the stack;
    }
end loop;

Print the information about each word from the Stage2 objects that were not reduced.
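As a rough picture of what the Stage 2 reduction achieves, the sketch below repeatedly replaces the longest matching right-hand side with its phrase name until no rule applies. It drops the explicit stack and index bookkeeping of the pseudocode above, so it is only an approximation of the implemented behaviour.

def stage2_reduce(symbols, rules):
    """symbols: list of tags/phrase names; rules: (lhs, rhs) pairs."""
    changed = True
    while changed:
        changed = False
        # prefer longer right-hand sides, e.g. det+adj+noun over adj+noun
        for lhs, rhs in sorted(rules, key=lambda r: -len(r[1])):
            n = len(rhs)
            for i in range(len(symbols) - n + 1):
                if symbols[i:i + n] == rhs:
                    symbols = symbols[:i] + [lhs] + symbols[i + n:]
                    changed = True
                    break
            if changed:
                break
    return symbols

# With the PS_RULES table sketched earlier,
# stage2_reduce(["noun", "adverb", "verb"], PS_RULES)
# reduces via rules 13, 1 and 19 to ["S"].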