Text Analysis: Sentence Unchaining and Rechaining


Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction that focuses on how to program computers to process and analyse large amounts of natural language data. This article looks at the current state of the art in computational linguistics. It begins by briefly surveying relevant trends in morphology, syntax, lexicology, semantics, stylistics, and pragmatics. It then describes changes, or special accents, within formal Arabic and English syntax. After some evaluative remarks about the chosen approach, it continues with a linguistic description of literary Arabic for analysis purposes as well as an introduction to a formal description, pointing to some early results. The article also hints at further perspectives for ongoing research and possible spin-offs, such as a description of Arabic syntax in formalized dependency rules and a subset thereof for information retrieval purposes.

Sentences built from similar words can have completely different meanings or nuances depending on how the words are placed and structured. This step is fundamental in text analytics: we cannot afford to misinterpret the deeper meaning of a sentence if we want to gather truthful insights. A parser is able to determine, for example, the subject, the action, and the object in a sentence; in the sentence “The company filed a lawsuit,” it should recognize that “the company” is the subject, “filed” is the verb, and “a lawsuit” is the object.

What is Text Analysis?

Widely used by knowledge-driven organizations, text analysis is the process of converting large volumes of unstructured text into meaningful content in order to extract useful information from it. The process can be thought of as slicing heaps of unstructured documents into pieces and then interpreting those pieces to identify facts and relationships. The purpose of text analysis is to measure customer opinions, product reviews, and feedback, to provide search facilities, and to support fact-based decision making through sentiment analysis. Text analysis involves the use of linguistic, statistical, and machine learning techniques to extract information, evaluate and interpret the output, and then structure it into databases or data warehouses for the purpose of deriving patterns and topics of interest. It also involves syntactic analysis, lexical analysis, categorisation and clustering, and tagging/annotation, and it determines keywords, topics, categories, and entities from millions of documents.
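
The parsing step can be made concrete with a dependency parser. The short sketch below uses the spaCy library, which is one possible tool choice rather than one named in this article, and assumes spaCy and its small English model are installed.

```python
# Minimal sketch: recovering subject, verb, and object with spaCy.
# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The company filed a lawsuit.")

for token in doc:
    # token.dep_ holds the dependency label, e.g. nsubj, ROOT, dobj
    print(f"{token.text:10} {token.dep_:8} head={token.head.text}")

# Expected (roughly): "company" -> nsubj, "filed" -> ROOT, "lawsuit" -> dobj
```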

Why is Text Analytics Important?

There is a range of ways in which text analytics can help businesses, organizations, and even social movements.

Companies use text analysis to set the stage for a data-driven approach to managing content and to understanding customer trends, product performance, and service quality. This results in quicker decision making, increased productivity, and cost savings. In the fields of cultural studies and media studies, textual analysis is a key component of research; it helps researchers explore a great deal of literature in a short time and extract what is relevant to their study.

Text analysis assists in understanding general trends and opinions in society, supporting governments and political bodies in decision making. Text analytic techniques also help search engines and information retrieval systems improve their performance, thereby providing a faster user experience.

The moment textual sources are sliced into easy-to-automate data pieces, a whole new set of opportunities opens up for processes like decision making, product development, marketing optimization, business intelligence, and more. Among the major gains that businesses of all kinds can reap through text analytics are:

1. Understanding the tone of textual content.
2. Translating multilingual customer feedback.

Steps Involved in Text Analytics

Text analysis is similar in nature to data mining, but with a focus on text rather than structured data. One of the first steps in the text analysis process is to organize and structure text documents so they can be subjected to both qualitative and quantitative analysis. There are several steps involved in preparing text documents for analysis; they are discussed in detail below.

Sentence Breaking

Sentence boundary disambiguation (SBD), also known as sentence breaking, attempts to identify sentence boundaries within textual content and present the information for further processing. Sentence breaking is very important and is the basis of many other NLP functions and tasks (e.g. machine translation, parallel corpora, named entity extraction, part-of-speech tagging, etc.). As segmentation is often the first step needed to perform these NLP tasks, poor accuracy in segmentation can lead to poor end results. Sentence breaking typically uses a set of regular expression rules to decide where to break a text into sentences. However, deciding where a sentence begins and where it ends remains an open issue in natural language processing, because sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks [iii]. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, or an email address, among other possibilities. Question marks and exclamation marks can be similarly ambiguous due to their use in emoticons, computer code, and slang.

Syntactic Parsing

Parts of speech are linguistic categories (or word classes) assigned to words that signify their syntactic role. Basic categories include verbs, nouns, and adjectives, but these can be expanded to include additional morphosyntactic information. The assignment of such categories to words in a text adds a level of linguistic abstraction. Part-of-speech tagging assigns part-of-speech labels to tokens, such as whether they are verbs or nouns; every token in a sentence is assigned a tag. For instance, in the sentence “Marie was born in Paris.”, the word “Marie” is assigned the tag NNP (proper noun). Part-of-speech is one of the most common annotations because of its use in many downstream NLP tasks. For instance, the British Component of the International Corpus of English (ICE-GB), a corpus of one million words, is POS tagged and syntactically parsed.
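
As a rough illustration of both steps, the sketch below uses NLTK (one possible tool choice; its tokenizer and tagger models must be downloaded once) to split a short text into sentences and then tag the example sentence.

```python
# Minimal sketch with NLTK: sentence breaking, then POS tagging.
# Assumes: pip install nltk, plus the one-off downloads below.
# (Very recent NLTK releases name these resources "punkt_tab" and
# "averaged_perceptron_tagger_eng" instead.)
import nltk

nltk.download("punkt")                        # sentence/word tokenizer models
nltk.download("averaged_perceptron_tagger")   # POS tagger model

text = "Dr. Smith paid $3.50 for coffee. Marie was born in Paris."

# Sentence breaking: the abbreviation and the decimal point are not split on.
sentences = nltk.sent_tokenize(text)
print(sentences)

# Part-of-speech tagging of the second sentence.
tokens = nltk.word_tokenize(sentences[1])
print(nltk.pos_tag(tokens))   # 'Marie' should come out as NNP
```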

Chunking

In cognitive psychology, chunking is a process by which individual pieces of an information set are broken down and then grouped together into a meaningful whole. In NLP, chunking is the process of extracting phrases from unstructured text, which means analysing a sentence to identify its constituents (noun groups, verbs, verb groups, etc.). However, it does not specify their internal structure, nor their role in the main sentence. Chunking works on top of POS tagging: it uses POS tags as input and provides chunks as output. There is a standard set of chunk tags, such as Noun Phrase (NP), Verb Phrase (VP), etc. Chunking segments and labels multi-token sequences, as illustrated by the example sentence “we saw the yellow dog” (or, in Arabic, “رأينا الكلب الأصفر”). In a chunk diagram, the smaller boxes show the word-level tokenization and part-of-speech tagging, while the larger boxes show the higher-level chunking; each of these larger boxes is called a chunk. Here we consider noun phrase chunking and search for chunks corresponding to individual noun phrases. To create an NP chunk, we define a chunk grammar using POS tags. The rule states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), a Noun Phrase (NP) chunk should be formed.
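
A minimal sketch of that rule with NLTK's RegexpParser follows; the grammar is the one the rule above describes, and the tokenizer/tagger downloads from the previous example are assumed.

```python
# Minimal sketch: NP chunking with a regular-expression chunk grammar (NLTK).
import nltk

sentence = "we saw the yellow dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Optional determiner, any number of adjectives, then a noun -> NP chunk.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(tagged)
print(tree)          # (S we/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))
# tree.draw()        # uncomment to view the chunk boxes graphically
```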

Stemming & Lemmatization

In natural language processing, there may come a time when you want your programme to recognize that the words “ask” and “asked” are just different tenses of the same verb. This is where stemming or lemmatization comes in. But what's the difference between the two? And what do they actually do?

Stemming is the process of eliminating affixes (suffixes, prefixes, and infixes) from a word in order to obtain a word stem. In other words, it is the act of reducing inflected words to their word stem. For instance, run, runs, ran, and running are forms of the same set of words related through inflection, with run as the lemma. A word stem need not be the same as a dictionary-based morphological root; it is simply an equal or smaller form of the word. Stemming algorithms are typically rule-based: you can view them as a heuristic process that, roughly speaking, lops off the ends of words. A word is run through a series of conditionals that determine how to cut it down.
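
A quick sketch with NLTK's Porter stemmer, one common rule-based stemmer (NLTK is assumed to be installed):

```python
# Minimal sketch: rule-based stemming with the Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["ask", "asked", "asking", "runs", "running", "ran"]:
    print(word, "->", stemmer.stem(word))

# Note: "ran" stays "ran" -- a stemmer only clips endings; it does not
# know that "ran" and "run" share a lemma.
```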

How is lemmatization different? Well, if we think of stemming as a rule of thumb about where to snip a word based on how it looks, lemmatization is a more calculated process: it involves resolving words to their dictionary form. In fact, lemmatization is much more advanced than stemming because, rather than just following rules, it also takes into account context and part of speech to determine the lemma, the root form of the word. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence. In lemmatization, we use different normalization rules depending on a word's lexical category (part of speech). Lemmatizers often use a rich lexical database like WordNet to look up word meanings for a given part of speech (Miller, George A. 1995. “WordNet: A Lexical Database for English.” Commun. ACM 38 (11): 39–41). Let's take a simple coding example (see the sketch below). No doubt, lemmatization is better than stemming, but it requires a solid understanding of linguistics and is computationally intensive. If speed is what you require, you should consider stemming. If you are trying to build a sentiment analyser or an email classifier, the base word is sufficient to build your model; in that case as well, go for stemming. If, however, your model would actively interact with humans – say you are building a chatbot, a language translation algorithm, etc. – lemmatization would be a better option.
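
Here is a minimal sketch of WordNet-based lemmatization with NLTK, to contrast with the stemmer above; it assumes the WordNet data has been downloaded.

```python
# Minimal sketch: lemmatization with NLTK's WordNet lemmatizer.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # one-off download of the WordNet database

lemmatizer = WordNetLemmatizer()

# The part of speech matters: "ran" as a verb resolves to "run".
print(lemmatizer.lemmatize("ran", pos="v"))      # run
print(lemmatizer.lemmatize("asked", pos="v"))    # ask
print(lemmatizer.lemmatize("better", pos="a"))   # good
```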

Lexical Chaining

A lexical chain is a sequence of related words that captures a portion of the cohesive structure of a text. A chain can provide a context for the resolution of an ambiguous term and enable identification of the concept that the term represents. M.A.K. Halliday & Ruqaiya Hasan note that lexical cohesion is phoric cohesion established through the structure of the lexis, or vocabulary, and hence (like substitution) operates at the lexicogrammatical level. The definition used for lexical cohesion states that coherence is a result of cohesion, not the other way around.[2][3] Cohesion is related to a set of words that belong together because of an abstract or concrete relation; coherence, on the other hand, is concerned with the actual meaning of the whole text.[1] Two example chains:

Rome → capital → city → inhabitant
Wikipedia → resource → web

Morris and Hirst [1] introduced the term lexical chain as an expansion of lexical cohesion.[2] A text in which many of the sentences are semantically connected often produces a certain degree of continuity in its ideas. Cohesion glues text together and makes the difference between an unrelated set of sentences and a set of sentences forming a unified whole (HALLIDAY & HASAN 1994:3). Sentences are not born fully formed. They are the product of a complex process that requires first forming a conceptual representation that can be given linguistic form, then retrieving the right words related to that pre-linguistic message and putting them in the right configuration, and finally converting that bundle into a series of muscle movements that result in the outward expression of the initial communicative intention (Levelt, W. J. M. 1989. Speaking: From Intention to Articulation. Cambridge, MA: MIT Press). Concepts are associated in the mind of the language user with particular groups of words, so texts belonging to a particular area of meaning draw on a range of words specifically related to that area of meaning.

The use of lexical chains in natural language processing tasks has been widely studied in the literature. Morris and Hirst [1] were the first to bring the concept of lexical cohesion to computer systems via lexical chains. Barzilay et al. [5] use lexical chains to produce summaries from texts. They propose a technique based on four steps: segmentation of the original text, construction of lexical chains, identification of reliable chains, and extraction of significant sentences. Some authors use WordNet [7][8] to improve the search and evaluation of lexical chains. Budanitsky and Hirst [9][10] compare several measures of semantic distance and relatedness using lexical chains in conjunction with WordNet. Their study concludes that the similarity measure of Jiang and Conrath [11] presents the best overall result. Moldovan and Adrian [12] study the use of lexical chains for finding topically related words for question answering systems, considering the glosses for each synset in WordNet. According to their findings, topical relations via lexical chains improve the performance of question answering systems when combined with WordNet. McCarthy et al. [13] present a methodology to categorize and find the most predominant synsets in unlabeled texts using WordNet. Unlike traditional approaches (e.g., bag-of-words), they consider relationships between terms that do not occur explicitly. Ercan and Cicekli [14] explore the effects of lexical chains on the keyword extraction task from a supervised machine learning perspective. Wei et al. [15] combine lexical chains and WordNet to extract a set of semantically related words from texts and use them for clustering. Their approach uses an ontological hierarchical structure to provide a more accurate assessment of similarity between terms during word sense disambiguation.
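
The sketch below is a toy illustration, not the algorithm of any of the papers above: it greedily groups nouns into chains whenever WordNet path similarity between some pair of their senses exceeds an arbitrary threshold, using NLTK's WordNet interface.

```python
# Toy sketch: greedy lexical chaining over nouns using WordNet path similarity.
# The greedy strategy and the 0.2 threshold are illustrative assumptions only.
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def related(word_a, word_b, threshold=0.2):
    """True if any pair of noun senses of the two words is close in WordNet."""
    for syn_a in wn.synsets(word_a, pos=wn.NOUN):
        for syn_b in wn.synsets(word_b, pos=wn.NOUN):
            score = syn_a.path_similarity(syn_b)
            if score is not None and score >= threshold:
                return True
    return False

def build_chains(nouns):
    """Attach each noun to the first chain holding a related word, else start a new chain."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

print(build_chains(["Rome", "capital", "city", "inhabitant", "resource", "web"]))
# Words missing from WordNet simply start their own singleton chain.
```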

Lexical cohesion is generally understood as “the cohesive effect [that is] achieved by the selection of vocabulary” (HALLIDAY & HASAN 1994:274). In general terms, cohesion can always be found between words that tend to occur in the same lexical environment and are in some way associated with each other: “any two lexical items having similar patterns of collocation – that is, tending to appear in similar contexts – will generate a cohesive force if they occur in adjacent sentences” (HALLIDAY & HASAN 1994).

Conclusion

Text analysis uses NLP and various advanced technologies to turn unstructured text into structured data. Text mining is now widely used by companies that want to grow and understand their audience better, and there are many real-world examples where text mining is used to retrieve data. Various social media platforms and search engines, including Google, use text mining techniques to help users find what they are searching for. Hopefully this article helps you understand the meaning of text mining as well as its main algorithms and techniques.

[i] https://chattermill.com/blog/text-analytics/

[ii] https://help.relativity.com/9.2/Content/Relativity/Analytics/Language_identification.htm

[iii] https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation

https://www.nltk.org/book/ch07.html
https://en.wikipedia.org/wiki/List_of_emoticons
https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
https://w3c.github.io/alreq/#h_fonts

Halliday, M.A.K. & Hasan, R.: Cohesion in English. Longman (1976)
