Parts of Speech Tagging in NLTK


In corpus linguistics, POS tagging (part-of-speech tagging), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context. A tagger reads text in a language and assigns a part-of-speech label (a tag) to each word. POS-tagging algorithms fall into two distinctive groups: rule-based and stochastic. E. Brill’s tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms. Part-of-speech tagging can be important for syntactic and semantic analysis.

Rule-based POS Tagging

One of the oldest techniques of tagging is rule-based POS tagging. Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a word has more than one possible tag, hand-written rules are used to identify the correct one. Disambiguation is performed by analysing the linguistic features of the word together with its preceding and following words. For example, if the preceding word is an article, then the word in question is likely to be a noun.
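As an illustration only (this is a toy sketch, not NLTK’s implementation), a minimal rule-based tagger might combine a small hand-built lexicon of candidate tags with a couple of hand-written context rules:

# Toy rule-based tagger: a lexicon lists candidate tags for each word,
# and hand-written rules resolve the ambiguous cases.
LEXICON = {
    "the": ["DT"],
    "sailor": ["NN"],
    "dogs": ["NNS", "VBZ"],   # ambiguous: plural noun or 3rd-person verb
    "hatch": ["NN", "VB"],
}

def rule_based_tag(words):
    tagged = []
    prev_tag = None
    for word in words:
        candidates = LEXICON.get(word.lower(), ["NN"])  # default unknown words to noun
        tag = candidates[0]
        if len(candidates) > 1:
            if prev_tag == "DT" and "NN" in candidates:
                tag = "NN"            # rule: a word right after an article is a noun
            elif prev_tag in ("NN", "NNS"):
                verb_tags = [t for t in candidates if t.startswith("VB")]
                if verb_tags:
                    tag = verb_tags[0]  # rule: after a noun, prefer a verb reading
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(rule_based_tag("The sailor dogs the hatch".split()))
# [('The', 'DT'), ('sailor', 'NN'), ('dogs', 'VBZ'), ('the', 'DT'), ('hatch', 'NN')]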

Stochastic POS Tagging

Another technique of tagging is stochastic POS tagging. The question that arises here is what makes a model stochastic: a model that incorporates frequency or probability (statistics) is called stochastic, and a number of different approaches to part-of-speech tagging fall under this label. The simplest stochastic taggers rely on the following approaches to POS tagging:

Word Frequency Approach

In this approach, the stochastic tagger disambiguates a word based on the probability that it occurs with a particular tag. In other words, the tag encountered most frequently with the word in the training set is the one assigned to an ambiguous instance of that word.
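NLTK’s UnigramTagger is one readily available implementation of this idea: it learns, from a tagged corpus, the most frequent tag for each word. A minimal sketch, assuming the treebank corpus has been downloaded:

import nltk
# nltk.download('treebank')  # tagged training sentences; run once

# UnigramTagger assigns each word the tag it occurred with most often in training.
train_sents = nltk.corpus.treebank.tagged_sents()
unigram_tagger = nltk.UnigramTagger(train_sents)

print(unigram_tagger.tag("the sailor dogs the hatch".split()))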

Tag Sequence Probabilities

It is another approach to stochastic tagging, in which the tagger calculates the probability of a given sequence of tags occurring. It is also called the n-gram approach, because the best tag for a given word is determined by the probability of that tag given the tags of the preceding words (the previous n-1 tags in an n-gram model).
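In NLTK this can be sketched with an n-gram tagger such as BigramTagger, which conditions on the previous tag and can back off to a unigram tagger for contexts never seen in training (again assuming the treebank corpus is available):

import nltk
# nltk.download('treebank')  # run once

train_sents = nltk.corpus.treebank.tagged_sents()

# A bigram tagger picks the most probable tag given the previous tag;
# backing off to a unigram tagger covers unseen contexts.
unigram = nltk.UnigramTagger(train_sents)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print(bigram.tag("the sailor dogs the hatch".split()))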

Part-of-speech tagging is harder than just having a list of words and their parts of speech, because some words can represent more than one part of speech at different times and because some parts of speech are complex or unspoken; as a result, a large percentage of word-forms are ambiguous. For example, even “dogs”, which is usually thought of as just a plural noun, can also be a verb:

The sailor dogs the hatch

Correct grammatical tagging will reflect that “dogs” is here used as a verb, not as the more common plural noun. Grammatical context is one way to determine this; semantic analysis can also be used to infer that “sailor” and “hatch” place “dogs” 1) in a nautical context and 2) as an action applied to the object “hatch” (in this context, “dogs” is a nautical term meaning “fastens (a watertight door) securely”).


Let’s learn with an NLTK part-of-speech example:

POS tag list:

CC      coordinating conjunction
CD      cardinal digit
DT      determiner
EX      existential “there” (as in “there is”)
FW      foreign word
IN      preposition / subordinating conjunction
JJ      adjective (“big”)
JJR     adjective, comparative (“bigger”)
JJS     adjective, superlative (“biggest”)
LS      list item marker (“1)”)
MD      modal (“could”, “will”)
NN      noun, singular (“desk”)
NNS     noun, plural (“desks”)
NNP     proper noun, singular (“Harrison”)
NNPS    proper noun, plural (“Americans”)
PDT     predeterminer (“all the kids”)
POS     possessive ending (“parent’s”)
PRP     personal pronoun (“I”, “he”, “she”)
PRP$    possessive pronoun (“my”, “his”, “hers”)
RB      adverb (“very”, “silently”)
RBR     adverb, comparative (“better”)
RBS     adverb, superlative (“best”)
RP      particle (“give up”)
TO      “to” (“go to the store”)
UH      interjection (“errrrrrrrm”)
VB      verb, base form (“take”)
VBD     verb, past tense (“took”)
VBG     verb, gerund / present participle (“taking”)
VBN     verb, past participle (“taken”)
VBP     verb, non-3rd person singular present (“take”)
VBZ     verb, 3rd person singular present (“takes”)
WDT     wh-determiner (“which”)
WP      wh-pronoun (“who”, “what”)
WP$     possessive wh-pronoun (“whose”)
WRB     wh-adverb (“where”, “when”)

Why Part-of-Speech tagging?

Part-of-speech tagging by itself may not be the solution to any particular NLP problem. It is, however, typically done as a prerequisite step that simplifies many other problems, such as parsing and information extraction. Let us look at how the tagging is actually performed in NLTK.
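A minimal sketch of tagging a sentence with NLTK’s built-in tagger (the tokenizer and tagger models can be fetched with nltk.download if they are not already installed):

import nltk
from nltk.tokenize import word_tokenize
# nltk.download('punkt')                        # tokenizer model; run once
# nltk.download('averaged_perceptron_tagger')   # default POS tagger model; run once

sentence = "Can you please buy me an Arizona Ice Tea? It's $0.99."
tokens = word_tokenize(sentence)   # split the raw string into word tokens
tagged = nltk.pos_tag(tokens)      # tag the tokenized sentence
print(tagged)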

As the sketch above shows, the pos_tag() function needs to be passed a tokenized sentence for tagging. The tagging is done by a pre-trained model shipped with the NLTK library. The included POS tagger is not perfect, but it yields fairly accurate results. For the sentence above, the output is:

[('Can', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('buy', 'VB'), ('me', 'PRP'), ('an', 'DT'), ('Arizona', 'NNP'), ('Ice', 'NNP'), ('Tea', 'NNP'), ('?', '.'), ('It', 'PRP'), ("'s", 'VBZ'), ('$', '$'), ('0.99', 'CD'), ('.', '.')]

Part-of-speech tagging can be important for syntactic and semantic analysis. For something like the sentence above, the word “can” has several meanings: one is a modal used in question formation, another is a container for holding food or liquid, and yet another is a verb denoting the ability to do something. Giving such a word a specific tag allows the program to handle it correctly in both semantic and syntactic analyses.

chakir.mahjoubi https://lexsense.net

A knowledge engineer with expertise in natural language processing, Chakir has work experience spanning language corpus creation, software localisation, data lineage, patent translation, glossary creation and statistical analysis of experimentally obtained results.
