Automata: Enhancing Natural Language Processing

Ojas Ketkar
8 min read · Apr 25, 2023


We all know the influence that Artificial Intelligence has had on technology. We can communicate with machines in our own language, no matter which country we are from or which dialect we speak, and the machines still tend to give results that are pretty accurate. Do you ever wonder how innovations like Amazon’s Alexa, Google Assistant, Microsoft’s Cortana, Apple’s Siri, or even the latest innovation, ChatGPT, which reached a million users within days of its launch, understand our human language?

Natural Language Processing, as it turns out, is the fundamental application that drives all these fascinating and innovative developments. The computer takes in the data, whether it be in text, video, or audio format, converts it using some rules into its own language, and then processes the data to produce the intended result.

So, what exactly is Natural Language Processing?

Natural language processing (NLP) is a subfield of computer science and artificial intelligence whose goal is to give computers the ability to understand spoken and written language in much the same way humans do.

To automatically produce appropriate language, NLP integrates deep learning models, computational statistics, lexical analysis, and the grammar rules of a particular language. These systems can comprehend everyday human language in written or audio form, as well as the intention or sentiment of the speaker or writer.

What is Automata Theory?

Automata theory is the area of computer science that focuses on understanding how basic machines compute a given function or solve a problem. Its primary goal is to provide tools that computer scientists can use to describe and analyze the dynamic behavior of discrete systems.

The characteristics of such machines broadly include:
1) Input
2) Output
3) States

There are 4 major families of automata:
1) Finite-state automata
2) Pushdown automata
3) Linear-bounded automata
4) Turing machines
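
To make the first family concrete, here is a minimal sketch of a deterministic finite automaton in Python. The states, alphabet, and target language (binary strings with an even number of 1s) are my own toy choices, purely for illustration.

# A toy deterministic finite automaton (DFA): a fixed set of states,
# a transition table, a start state, and a set of accepting states.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START_STATE = "even"
ACCEPT_STATES = {"even"}

def accepts(string):
    state = START_STATE
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]  # one transition per input symbol
    return state in ACCEPT_STATES

print(accepts("1010"))  # True: two 1s
print(accepts("111"))   # False: three 1s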

How exactly does NLP work?

Natural language processing lets computers interpret text, as well as audio picked up from the environment via microphones. The computer then interprets the input in a manner that is loosely similar to how the human brain does.

There are two major phases to NLP: data preprocessing and algorithm development.

1) Data Preprocessing: Data preprocessing is the methodical process of ‘cleaning’ up the input data, or transforming it so that the computer can process it. Input arrives in either text or audio format, and because of the enormous number of languages and dialects people speak, the more communication styles individuals use, the harder it is for a machine to extract data in a consistent format. Preprocessing therefore converts the input into a usable form and identifies the features the algorithm can build on. There are various methods for doing this, illustrated in the sketch after this list:

a. Tokenization — Breaking down the data into smaller units

b. Stop word removal — The most common words are removed from the text so that the unique words that offer the most information stand out

c. Lemmatization or stemming — Words are reduced to their root forms for further processing

d. Part-of-speech tagging — Words are marked according to the part of speech they belong to
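
Here is a minimal sketch of these four steps using NLTK, assuming the standard NLTK resources have been downloaded; the sample sentence is my own.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the resources this sketch relies on
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

sentence = "The cats were chasing the mice in the garden"

# a. Tokenization: break the sentence into word-level units
tokens = nltk.word_tokenize(sentence)

# b. Stop word removal: drop very common, low-information words
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t.lower() not in stop_words]

# c. Lemmatization: reduce each remaining word to its root form
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in content_tokens]

# d. Part-of-speech tagging: label every token with a grammatical category
pos_tags = nltk.pos_tag(tokens)

print(content_tokens)  # ['cats', 'chasing', 'mice', 'garden']
print(lemmas)          # ['cat', 'chasing', 'mouse', 'garden']
print(pos_tags)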

2) Algorithm Development:

Once the data has been preprocessed, we need to develop the best possible algorithm for the computer to run on the preprocessed data. There are two major ways to do so:

a. Rule Based System — This system depends on carefully designed linguistic rules

b. Machine-Learning-Based System — Various machine-learning models are employed to process the data. The final algorithm typically combines machine learning, deep learning, and neural networks, each with its own set of rules.
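
To make the contrast concrete, here is a toy rule-based sketch: hand-written patterns, invented purely for illustration, decide whether a sentence is a question or a statement, with no learning involved.

import re

# A minimal rule-based classifier: each rule is a hand-written pattern
RULES = [
    (re.compile(r"\?\s*$"), "question"),                        # ends with '?'
    (re.compile(r"^(who|what|when|where|why|how)\b", re.I), "question"),
]

def classify(sentence):
    for pattern, label in RULES:
        if pattern.search(sentence):
            return label
    return "statement"  # default when no rule fires

print(classify("How does NLP work?"))        # question
print(classify("NLP is a subfield of AI."))  # statement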

So how does NLP relate to Automata Theory?

There are many ways in which NLP is used along with automata theory:

1. Lexicon

A lexicon is the vocabulary of a person, a group, or a language, including all of its essential grammatical building blocks. We may say that it serves as a kind of representation of the speaker’s vocabulary.

The size of the whole lexicon can be reduced by employing Automata Theory and NLP to represent lexicons, which improves processing speed and overall system performance, as the example below shows.

Example of a lexicon represented as an automaton

Representing lexicons this way has been shown to reduce the memory the lexicon uses, because words that share prefixes share states and transitions instead of being stored separately.
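
As a rough illustration of why this saves space, here is a toy sketch of a lexicon stored as a trie, a simple acyclic automaton in which words sharing a prefix share states; the code and word list are my own, not from any library.

# A toy lexicon stored as a trie: each node is a state, each character
# a transition, and shared prefixes are stored only once.
END = "$"  # marker for an accepting state (end of a word)

def build_lexicon(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # reuse the state if it already exists
        node[END] = True
    return root

def contains(lexicon, word):
    node = lexicon
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

lexicon = build_lexicon(["work", "works", "worked", "working"])
print(contains(lexicon, "worked"))   # True
print(contains(lexicon, "workers"))  # False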

2. Part-of-Speech Tagging

The process of part-of-speech tagging involves linking words to grammatical categories. The most popular categories are nouns, verbs, adjectives, and adverbs, among others.

Although many systems rely on its results alone to proceed, tagging happens relatively early in the processing pipeline, so its outcomes can still be rather ambiguous.

Ex. Let’s take the string ‘I saw her cat’. The readings that can come out of this system include:
‘I have seen her animal, which is a cat’ (‘cat’ as a noun)
‘I have seen her while she was catting’ (‘cat’ as a verb)
The tags alone cannot tell us which reading is intended.

3. Context-Free Grammars

Context-free grammars are simply regular grammars with fewer limitations. These grammars correspond to pushdown automata rather than finite-state automata. Every time a transition reaches a rule, we can evaluate it in the automaton corresponding to that rule, rather than using just one automaton. Once we reach an accepting state in the sub-automaton, we return to the prior state and continue processing there.

Just to give you a better idea of what all of this means, have a look at the examples given below:

Context Free Grammar and its equivalent automaton
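
As a concrete sketch, the toy grammar below, which I wrote for illustration, parses the earlier sentence ‘I saw her cat’ with NLTK’s chart parser, which simulates exactly this kind of pushdown behavior.

import nltk

# A toy context-free grammar covering the sentence used earlier
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N
VP -> V NP
Det -> 'her'
N -> 'cat'
V -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(["I", "saw", "her", "cat"]):
    print(tree)  # (S (NP I) (VP (V saw) (NP (Det her) (N cat))))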

Automata Theory and Natural Language Processing are frequently seen as complementary fields of study. Algorithms for several NLP applications were created using ideas from automata theory.

1. Rule-Based POS Tagging

Part-of-speech (POS) tagging is one of the earliest methods of tagging. Rule-based taggers consult dictionaries or lexicons to find potential tags for each word, and they disambiguate by examining a word’s linguistic characteristics along with those of its neighboring words. For instance, a word must be a noun if the word before it is an article. As the name implies, all of this kind of information is encoded in the form of rules in rule-based POS tagging. These rules can be either:

1. Rules based on context patterns

2. Or regular expressions compiled into finite-state automata, intersected with representations of lexically ambiguous sentences.

import nltk

# Resources needed for tokenization and tagging
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "We wrote this script to use in a blog"

# Split the sentence into tokens
tokens = nltk.word_tokenize(sentence)

# Tag each token with its part of speech (note: NLTK's default tagger is
# statistical, but it produces the same (word, tag) pairs a rule-based
# tagger would)
pos_tags = nltk.pos_tag(tokens)

print(pos_tags)
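
For a tagger that is genuinely rule-based, NLTK also provides a RegexpTagger driven by hand-written patterns; the patterns below are a small illustrative subset, not a complete rule set.

from nltk.tag import RegexpTagger

# Hand-written rules: the first pattern that matches a word assigns its tag
patterns = [
    (r".*ing$", "VBG"),       # gerunds, e.g. 'running'
    (r".*ed$", "VBD"),        # past-tense verbs, e.g. 'walked'
    (r".*s$", "NNS"),         # plural nouns, e.g. 'cats'
    (r"^(a|an|the)$", "DT"),  # articles
    (r".*", "NN"),            # default: tag everything else as a noun
]

rule_tagger = RegexpTagger(patterns)
print(rule_tagger.tag(["the", "cats", "walked"]))
# [('the', 'DT'), ('cats', 'NNS'), ('walked', 'VBD')]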

2. FSA (Finite-State Automata) and NLP

When the input data is sequential and can be processed one symbol at a time, like text, FSA models are very useful. One common use of FSA in NLP is tokenization, which is the process of breaking down a text into separate tokens or words.

The 5-tuple (Q, Σ, δ, q0, F) that defines a finite automaton: states, alphabet, transition function, start state, and accepting states

import nltk
from nltk.tokenize import word_tokenize

# Resource needed by NLTK's tokenizer
nltk.download("punkt")

sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"

# Break the sentence into individual tokens
tokens = word_tokenize(sentence)

print(tokens)
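
Since regular expressions are equivalent in expressive power to finite-state automata, a tokenizer can also be sketched directly with a regex; the pattern below is a deliberately simple illustration of mine, not NLTK's actual implementation.

import re

sentence = "Lorem ipsum dolor sit amet, consectetur adipiscing elit"

# Each alternative in the pattern is a path through an FSA:
# a run of word characters, or a single punctuation symbol
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)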

There are multiple applications of NLP in various domains, like
Translation:

Example of translation

The translation process involves dissecting each word in the source text to determine its meaning. The next stage is to create a sentence in the target language that accurately captures the meaning of the source-language sentence, using a statistical or rule-based model.

Rule-based, statistical, and neural machine translation are just a few of the translation strategies utilized in NLP. Rule-based models produce translations by using dictionary searches and predetermined grammatical rules. Large datasets of previously translated texts are used by statistical algorithms to create translations. On the other hand, neural machine translation makes use of artificial neural networks to train itself to translate texts.

By utilizing methods like named entity recognition and sentiment analysis, NLP systems can additionally take the context and meaning of words into consideration to increase the accuracy of translations.
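
As a toy sketch of the rule-based strategy, a word-for-word dictionary lookup might look like the following; the dictionary and sentence are invented for illustration, and real systems add grammatical reordering rules on top.

# A toy rule-based "translation": word-for-word dictionary lookup
EN_TO_ES = {
    "the": "el",
    "cat": "gato",
    "sleeps": "duerme",
}

def translate(sentence):
    # Look each word up, keeping it unchanged when no rule exists
    return " ".join(EN_TO_ES.get(word, word) for word in sentence.lower().split())

print(translate("The cat sleeps"))  # el gato duerme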

Another use case of the combination of NLP and automata would be malicious URL detection —

Malicious URL Detection

Steps involved in identifying a malicious URL are as follows:

  1. Data collection: Collect a dataset of URLs, with labels indicating whether each URL is malicious or benign.
  2. Feature extraction: Extract relevant features from the URLs, such as the length of the URL, the presence of certain keywords, and the domain name.
  3. Data preprocessing: Preprocess the extracted features to prepare them for NLP tasks, such as tokenization, normalization, and vectorization.
  4. Model training: Train a machine learning model on the preprocessed features and labels using NLP techniques, such as text classification or sequence labeling (a minimal sketch of steps 1-5 follows this list).
  5. Model evaluation: Evaluate the performance of the trained model on a separate test dataset, using metrics such as accuracy, precision, recall, and F1 score.
  6. Model deployment: Deploy the trained model in a production environment, where it can be used to detect malicious URLs in real time.
  7. Model monitoring: Continuously monitor the performance of the deployed model and update it as needed to maintain its accuracy and effectiveness.
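
Here is a minimal sketch of steps 1-5 using scikit-learn; the handful of URLs and labels below are invented stand-ins for a real labeled dataset, and a production system would train on many thousands of examples.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 1. Data collection (toy stand-in for a real labeled dataset)
urls = [
    "http://example.com/login", "http://paypal-secure-update.xyz/verify",
    "https://github.com/user/repo", "http://free-prize-winner.top/claim",
    "https://en.wikipedia.org/wiki/Automata_theory", "http://bank0famerica.cc/signin",
]
labels = ["benign", "malicious", "benign", "malicious", "benign", "malicious"]

# 2-3. Feature extraction and vectorization: character n-grams act like
# small sliding windows over the URL string
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
features = vectorizer.fit_transform(urls)

# 4. Model training
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.33, random_state=42, stratify=labels
)
model = LogisticRegression().fit(X_train, y_train)

# 5. Model evaluation
print(classification_report(y_test, model.predict(X_test), zero_division=0))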

Automata Theory and NLP clearly complement each other, and there are many more instances in our daily lives where the two can be applied together. Even though other approaches exist, automata theory remains a dependable foundation for building Natural Language Processing systems, because it provides a clearer grasp of how the system is put together.

Team Members:
Ojas Ketkar
Pranav Joshi
Akash
Atharva Jayappa

Guide:
Prof. Anant Kaulage
