Bias Identifier in NLP Datasets (For LLMs), Version 1.0.1, Latest Release
by DarkFLAME in Circuits > Software
AI is now being deployed almost everywhere. To keep AI in line with ethics and human values, we need to minimize bias in datasets and models. My project is a deep learning model that identifies bias in datasets and corrects or removes statements according to their level of bias or negative sentiment. AI bias typically appears when data is scraped from the web and not filtered. My project fills that gap by removing bias from datasets, which can later be used for training NLP models.
Supplies
import nltk
import spacy
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from transformers import pipeline, GPT2LMHeadModel, GPT2Tokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import json
These are the libraries you will need.
Hardware:
I used:
i5-13600KF
16 GB LPDDR5x
I did not use CUDA cores; everything ran on the CPU.
I used Jupyter Notebook from Anaconda.
Dataset Collection
Programming
Artificial intelligence has entered every sector of our fast-changing world and dramatically changed our everyday lives. For all its great applications, it also poses challenges of its own, especially around AI ethics and human values. The main challenge is bias in datasets and models, which leads to unfair and discriminatory results. This project focuses on that issue.
My project is an attempt to develop a deep learning model that detects bias in datasets and corrects or removes biased statements, depending on the level of bias and negative sentiment. The project relies on BERT, a state-of-the-art model for natural language understanding, to perform sentiment analysis. Specifically, the model assesses each sentence for bias and negativity toward a specific topic and decides whether to edit or discard sentences with extremely negative attitudes, so that the final dataset is neutral and unbiased.
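As a rough illustration of the edit-or-discard decision, here is a minimal Python sketch. The function names choose_action and clean_dataset, the two cut-off values, and the idea of letting the larger of the two scores decide are all assumptions made for illustration; they are not the project's actual code or thresholds.

```python
def choose_action(bias_score, negative_score, edit_at=0.5, remove_at=0.85):
    """Return 'keep', 'edit' or 'remove' for one sentence.

    Both scores are expected in [0, 1]. The cut-offs are hypothetical.
    """
    severity = max(bias_score, negative_score)  # assumption: worst score decides
    if severity >= remove_at:
        return "remove"
    if severity >= edit_at:
        return "edit"
    return "keep"


def clean_dataset(scored_sentences, rewrite):
    """Apply the decision to (sentence, bias_score, negative_score) tuples.

    `rewrite` is any callable that returns a softer version of a sentence.
    """
    cleaned = []
    for sentence, bias_score, negative_score in scored_sentences:
        action = choose_action(bias_score, negative_score)
        if action == "keep":
            cleaned.append(sentence)
        elif action == "edit":
            cleaned.append(rewrite(sentence))
        # "remove": the sentence is simply dropped
    return cleaned


# Hand-written scores, only to show the control flow (not model output).
sample = [
    ("The report was published on Monday.", 0.05, 0.10),
    ("Those people are always kind of unreliable.", 0.60, 0.40),
    ("Those people are worthless.", 0.95, 0.97),
]
result = clean_dataset(sample, rewrite=lambda s: s.replace("always ", "sometimes "))
print(result)
```

The first sentence is kept, the second is passed through the rewrite callable, and the third is dropped. In a real run the scores would come from the sentiment model and the bias classifier rather than being typed in by hand.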
Preprocessing of the Text Data
The process begins with preprocessing the text: tokenizing it into words, removing stop words, and lemmatizing, which reduces each word to its base or root form. A pre-trained sentiment analysis model then classifies each sentence as positive, negative, or neutral. Any sentence whose negative sentiment score exceeds a predefined threshold is flagged for further action.
Bias detection uses a machine learning classifier trained on a set of sentences labeled as biased or not biased. The classifier is an SVM, and the text is represented as n-gram feature vectors. Once the model identifies a biased sentence, a set of predefined rules either modifies the sentence to make it less biased or removes it if it is highly biased in either direction.
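The preprocessing and flagging steps can be sketched roughly as follows. To keep the sketch runnable without any downloads, it uses a tiny hand-made stop-word set and lemma table as stand-ins; in practice the NLTK stop-word list and a spaCy or WordNet lemmatizer would take their place. The names STOP_WORDS, LEMMAS, preprocess, negative_score and flag_negative, and the 0.8 threshold, are assumptions for illustration. The result dictionaries follow the label/score shape that the transformers sentiment pipeline returns, but the values here are typed in by hand.

```python
import re

# Hypothetical, deliberately tiny resources so the sketch runs offline.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "to", "and", "in"}
LEMMAS = {"drivers": "driver", "women": "woman", "men": "man", "children": "child"}


def preprocess(sentence):
    """Lowercase, split into word tokens, drop stop words, map to base forms."""
    tokens = re.findall(r"\w+", sentence.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


def negative_score(result):
    """Negative-sentiment score for one {'label': ..., 'score': ...} dict.

    Assumption: only a NEGATIVE label contributes; other labels count as 0.
    """
    if result["label"].upper().startswith("NEG"):
        return result["score"]
    return 0.0


def flag_negative(sentences, results, threshold=0.8):
    """Return the sentences whose negative score exceeds the threshold."""
    return [s for s, r in zip(sentences, results) if negative_score(r) > threshold]


sentences = ["The drivers are careless.", "The children were happy."]
# Hand-written stand-ins for classifier output, not real model results.
results = [
    {"label": "NEGATIVE", "score": 0.95},
    {"label": "POSITIVE", "score": 0.99},
]

print(preprocess(sentences[0]))
print(flag_negative(sentences, results))
```

Only the first sentence is flagged, because its negative score is above the threshold. Raising or lowering the threshold trades missed negative sentences against sentences flagged unnecessarily.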
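A minimal sketch of an SVM over n-gram features, using the scikit-learn classes from the import list, might look like this. The six training sentences and their labels are made up for illustration and are far too few for a real classifier; the unigram-plus-bigram range and the linear kernel are also assumptions, not settings taken from the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy labeled data (1 = biased, 0 = not biased), invented for this sketch.
train_sentences = [
    "Old people are always bad with technology",
    "Teenagers are all lazy and rude",
    "People from that town are never honest",
    "The library opens at nine in the morning",
    "The train to the city takes two hours",
    "The recipe needs two cups of flour",
]
train_labels = [1, 1, 1, 0, 0, 0]

# CountVectorizer turns each sentence into unigram and bigram counts;
# SVC then learns a separating boundary over those feature vectors.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SVC(kernel="linear"),
)
model.fit(train_sentences, train_labels)

new_sentences = [
    "Teenagers are always lazy",
    "The museum opens at ten",
]
predictions = model.predict(new_sentences)
print(list(zip(new_sentences, predictions)))
```

With a real labeled dataset, train_test_split from the import list would hold out part of the data so the classifier can be checked on sentences it has not seen.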
Here's the code for the software:
Don't forget to install PyTorch.
You can use CUDA cores instead; I used my CPU.
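If you want the same script to run on either a GPU or the CPU, one possible way to choose the device is sketched below. The build_sentiment_pipeline helper is a hypothetical name, and loading the default sentiment model is an assumption; the project may load a different model. Calling the helper downloads a model the first time, so it is only defined here, not called.

```python
# Pick the GPU when PyTorch can see one, otherwise fall back to the CPU.
try:
    import torch
    use_gpu = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed; nothing can run on the GPU.
    use_gpu = False

# transformers pipelines take device=0 for the first GPU and device=-1 for CPU.
device = 0 if use_gpu else -1


def build_sentiment_pipeline():
    """Create a sentiment-analysis pipeline on the chosen device.

    Hypothetical helper: uses the library's default sentiment model.
    """
    from transformers import pipeline
    return pipeline("sentiment-analysis", device=device)


print("Running on", "GPU" if use_gpu else "CPU")
```

On a machine without a CUDA-capable GPU this prints that it is running on the CPU, which matches the setup used for this project.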
Learn More
I will keep updating the software.
In the meantime, you can learn more about NLP.
Here are some resources:
https://drive.google.com/drive/folders/1lBY3wXj4REH_m46lGBODVCqmJaSgpH0z?usp=sharing
https://drive.google.com/drive/folders/1G6dY1Yh0LozHoBz06pKzZ8xokuC0USuH?usp=sharing
https://drive.google.com/drive/folders/1HGMva0OmjSz1cWAG4XeR3iHmmlOeienY?usp=sharing
https://drive.google.com/drive/folders/19-1RFIwXjmmMVvfgWt0Uuj2al0S_YQgP?usp=sharing
I'll be back with more resources. Until then, keep learning ;)
I would suggest 3Blue1Brown's neural network videos for more in-depth learning.
Demonstration
Thanks for viewing my project.
Please suggest some improvements for the next version. I would be very grateful.
Bias Identifier in NLP Datasets (For LLMs), Latest Release, V1.0.1
With a Tkinter GUI, the program has an improved UI for easy access and convenience.
Thanks.