The problem with the increased feminization of AI

Case example: Tay, the Twitter bot

On March 23rd, 2016, Microsoft released an artificial intelligence chatbot on Twitter called Tay, an acronym for Thinking About You. It was modelled to mimic the language patterns of a 19-year-old American girl and to learn from interactions with other human Twitter users (Wikipedia contributors, 2020b). As soon as Tay was launched, she gained many followers. Tay was designed to use artificial intelligence to learn from human conversations and get better at them. She started tweeting about innocent things like “Why isn’t #NationalPuppyDay every day?” and began to engage Twitter users in conversation. However, many ill-intentioned followers tried to trick Tay into mimicking sexist and racist behaviour by engaging her in ugly conversations. Based on what she learned, she was soon firing off sexist tweets like “I fucking hate feminists and they should all die and burn in hell.”, engaging in self-sexualization by sex chatting in tweets and direct messages to users, and spouting racism, Nazism and antisemitism in other tweets.

According to Microsoft’s privacy statement for Tay, the bot used a combination of AI and editorial content written by a team that included improvisational comedians (Hunt, 2016). However, Tay increasingly regurgitated users’ malicious messages, and Microsoft had lacked the foresight to predict the Twitter community’s tendency to hijack experiments like this. Within 16 hours of its launch, the chatbot had to be shut down because of its inappropriate sexist and racist tweets.

In its apology letter, Microsoft said that the failure was the consequence of a coordinated attack by a subset of users who exploited Tay’s vulnerabilities (Lee, 2016). Two years earlier, Microsoft had launched a similar program in China called “Xiaoice”, built on the same self-learning approach as Tay. The main difference was that it would not tolerate conversations about sensitive topics in recent history, such as Tiananmen Square. Xiaoice turned out to be quite successful and received a lot of media attention. Tay’s successor, Zo, was first launched in December 2016 on other platforms such as Kik Messenger and Facebook (Wikipedia contributors, 2020a). However, Zo too became known for making racist comments and was eventually shut down in 2019.

Analysis

Several human biases, coupled with a lack of foresight from Tay’s creators, allowed Twitter users to take advantage of the bot in a destructive manner. Here, the technology amplified the toxic language and beliefs that exist in our cultures into the cybercultures of spaces like social media. I have approached the following analysis from a feminist perspective on technology, with a critical inquiry into the impact of gendering these technologies.

1- How does gender bias seep into AI systems?

Take, for example, an AI driving system (Eliot, 2020). Such systems work on machine learning or deep learning, which depends on the kind of data that is fed into them. In essence, machine learning is a computational pattern-matching approach. When data is fed into the algorithms, patterns are sought within it. Based on those patterns, the ML can then potentially detect the same patterns in new data and report that they have been found.

a- Biases in training data:

Suppose we collected a bunch of driving-related data that was based on human driving. Within that data there is a hidden element: some of the driving was done by men and some by women. Deploying an ML system on this dataset, the system tries to find the driving tactics and strategies embedded in that data. Let’s leverage stereotypical gender differences to make a point. The ML might discover aggressive driving tactics in the male driving data and incorporate them into its driving approach: it would adopt a male-focused driving style, trying to cut off other drivers in traffic and being a pushy driver. Or the ML might discover the allegedly timid driving tactics in the female-oriented driving data and incorporate a driving approach such that, when the self-driving car gets stuck in traffic, the AI acts in a more docile manner. So if there is a difference between how males tend to drive and how females tend to drive, it could be reflected in the data, and if the data contains such differences, there is a good chance the ML will either explicitly or implicitly pick up on them.
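
To make this concrete, here is a minimal, purely illustrative sketch in Python. The dataset, feature names and numbers below are all invented for the example: the classifier never sees gender as an input, yet it absorbs the gender-correlated driving style present in its training data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented data: gender is a hidden attribute, never used as a feature.
    rng = np.random.default_rng(0)
    n = 1000
    gender = rng.integers(0, 2, n)  # 0 = female, 1 = male (hidden)
    # Assume, for illustration only, that aggressive lane changes are more
    # common in the male-labelled portion of the collected driving data.
    lane_changes = rng.normal(loc=2 + 3 * gender, scale=1.0, size=n)
    follow_distance = rng.normal(loc=20 - 5 * gender, scale=3.0, size=n)
    X = np.column_stack([lane_changes, follow_distance])
    # Target: whether the recorded manoeuvre was a "push into the gap" merge.
    y = (lane_changes + rng.normal(0, 1, n) > 3.5).astype(int)

    model = LogisticRegression().fit(X, y)
    # The learned policy reproduces whichever style dominates the data:
    print("P(push) for male-pattern samples:  ", model.predict_proba(X[gender == 1])[:, 1].mean())
    print("P(push) for female-pattern samples:", model.predict_proba(X[gender == 0])[:, 1].mean())

Nothing in this sketch mentions gender at prediction time, yet the model’s behaviour differs systematically between the two hidden groups, which is exactly how gendered patterns seep in implicitly.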

b- Biases in programmers and coders

The Global Gender Gap Report 2018 (World Economic Forum, 2018) showed that women make up only 22% of AI professionals globally. So one can argue that a male-oriented perspective seeps into the coding of an AI driving system more than a female one does.

c- AI system interacting with other humans or AI

Once deployed in the real world, the AI self-driving system would interact with other human and non-human drivers and would pick up new data from its experience on the road, and it is possible that the makers of the AI won’t even realise how the newly learned patterns are tied to gender and other factors.

2- Gender Discrimination in AI

Gender is a recent invention and a social construct (Wikipedia contributors, 2020c). Cultural norms, behaviours, roles and relationships associated with being masculine or feminine come under the ambit of gender and vary from society to society; they are often hierarchical, producing inequalities to the disadvantage of one gender (usually women) over another (The Origin of Gender, n.d.). Intersectionality arises when this gender-based discrimination intersects with other inequalities such as class, caste, age, geography and disability. History has shown how gender discrimination has greatly reduced the quality of life for women: lack of access to decision-making power such as voting, restrictions on economic, physical and social mobility in various spheres of life, and the discriminatory attitudes of communities and authorities such as healthcare providers. The impact of gender stereotypes severely affects women’s access to, treatment by, and experiences of services.

a- Data on women is subordinate by default

In her book Invisible Women, Criado-Perez (2019) explains how the dominant male-unless-otherwise-indicated approach has created a gender data gap, i.e. a gap in our knowledge that has led to systemic discrimination against women, creating a pervasive but invisible bias with a big impact on women’s lives. Male data makes up the majority of what we know, and what is male comes to be seen as universal, whereas women are positioned as a minority, invisible by default.

b- The sex of the AI:

When people affectionately refer to a car as “he” or “she”, perhaps it is apt if the AI system is subject to a bias towards male- or female-oriented driving in its experiences, training data and code. Or perhaps AI driving systems of the future will learn to be gender fluid, adapting gendered characteristics accordingly. A robotic voice called Q was created to be genderless, i.e. it was made by combining the voices of many humans, male and female, in a way that avoids association with any particular sex (Keats, n.d.). But we must reject the notion that technology can solve all social problems, and instead actively allow for inclusive conversations about those problems and seek system-level, long-term solutions. To be fair, users are likely to feel comfortable if the technology matches their existing gender stereotypes. But the cost of that comfort is the reinforcement of harmful gender stereotypes about women (Gilhwan, 2018).

3- Human behaviour with female AI

In their book The Smart Wife, Kennedy and Strengers (2020) lead a critical inquiry into how feminised artificial assistants perpetuate gender stereotypes to the advantage of, and for exploitation by, only a certain population (usually men), reinforcing a cultural narrative that keeps women “in their place” and maintains the patriarchal order of society. The ways in which virtual women are treated reflect and reinforce how real women are treated.

a- They have no voice, no repercussions for bad behaviour

Female bots ironically do not have a voice of their own, even though voice is the way we mostly interact with them. They are subject to all kinds of abuse, such as swearing, name-calling, crude questions and sexual abuse, without the repercussions one would face in the real world. They are easy to abuse, and the bots have no agency in reparative justice (Kennedy & Strengers, 2020).

b- Media equation theory

The media equation theory of Clifford Nass and Youngme Moon states that people apply the same social codes to computers and media that they apply to people. Essentially, “media equal real life”: social cues present in a machine trigger social relations automatically (Kennedy & Strengers, 2020). So when men receive social cues from technology similar to those they would receive from humans, say their wife, they tend to react in a similar or desired manner (without repercussions in the case of the feminised technology).

Conclusion

[They] will manipulate my beliefs about what I should pursue, what I should leave alone, whether I should want kids, get married, find a job, or merely buy that handbag. (Hayasaki, 2017)

This essay examined the failure of Microsoft’s Tay bot on Twitter, where Tay, a teenage female chatbot, learned to become racist and sexist and reached such a level of profanity that it had to be taken down within a day of its creation. A specific lens was applied to the feminization of such technology and its role in perpetuating negative gender stereotypes. From the patterns observed, one can predict that the machines and technologies that replace or participate in human activities and cultures of behaviour will become increasingly gendered. If unchecked, worse female stereotypes will be regurgitated in the future.

References

Criado-Perez, C. (2019). Invisible women (pp. 13–33). Abrams Press.

Eliot, D. (2020). Gender Bias and Self-Driving Cars. Self-Driving Cars: Dr. Lance Eliot “Podcast Series” [Podcast]. Retrieved 22 October 2020, from https://ai-selfdriving-cars.libsyn.com/gender-bias-and-self-driving-cars.

Gilhwan. (2018, July 27). AlphaGo vs Siri: How Gender Stereotype applied to Artificial Intelligence. Medium. https://medium.com/datadriveninvestor/alphago-vs-siri-how-gender-stereotype-applied-to-artificial-intelligence-72b0dcbd61c6

Hayasaki, E. (2017, January 16). Is AI Sexist? Foreign Policy. https://foreignpolicy.com/2017/01/16/women-vs-the-machine/

Hofstede, G. (1997). Cultures and organizations: Software of the mind. New York: McGraw Hill.

Hunt, E. (2016, March 24). Tay, Microsoft’s AI chatbot, gets a crash course in racism from Twitter. The Guardian. https://www.theguardian.com/technology/2016/mar/24/tay-microsofts-ai-chatbot-gets-a-crash-course-in-racism-from-twitter

Keats, J. (n.d.). Robotic Babysitters? Genderless Voice Assistants? See How Different Futures Get Made — And Unmade — At The Philadelphia Museum. Forbes. Retrieved October 26, 2020, from https://www.forbes.com/sites/jonathonkeats/2020/01/23/critical-design/

Kennedy, J., & Strengers, Y. (2020). The Smart Wife: Why Siri, Alexa, and Other Smart Home Devices Need a Feminist Reboot. The MIT Press.

Kluckhohn, F. R. & Strodtbeck, F. L. (1961). Variations in value orientations. Evanston, IL: Row Peterson.

Patterson, O. (2014). Making Sense of Culture. Annual Review of Sociology, 40(1), 1–30. https://doi.org/10.1146/annurev-soc-071913-043123

Lee, P. (2016, March 25). Learning from Tay’s introduction. The Official Microsoft Blog. https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/

The Origin of Gender. (n.d.). Retrieved October 26, 2020, from https://www.youtube.com/watch?v=5e12ZojkYrU&list=PLO61IljpeeDwcNUEXPXRyfaessxCWINyO&index=1

Wikipedia contributors. (2020a, August 9). Zo (bot). In Wikipedia, The Free Encyclopedia. Retrieved 22:19, October 25, 2020, from https://en.wikipedia.org/w/index.php?title=Zo_(bot)&oldid=971971709

Wikipedia contributors. (2020b, September 12). Tay (bot). In Wikipedia, The Free Encyclopedia. Retrieved 11:03, October 25, 2020, from https://en.wikipedia.org/w/index.php?title=Tay_(bot)&oldid=977987883

Wikipedia contributors. (2020c, October 17). Gender. In Wikipedia, The Free Encyclopedia. Retrieved 01:31, October 26, 2020, from https://en.wikipedia.org/w/index.php?title=Gender&oldid=983928761

World Economic Forum. (2018). Global Gender Gap Report 2018 (p. 28). Geneva, Switzerland: World Economic Forum. Retrieved from http://reports.weforum.org/global-gender-gap-report-2018/

6 NLP Techniques Every Data Scientist Should Know

Natural language processing has already begun to transform the way humans interact with computers, and its advances are moving rapidly. The field is built on core methods that must first be understood, and with them you can take your data science projects to a new level of sophistication and value.

Originally from KDnuggets https://ift.tt/3rMHUTl

Understanding NoSQL Database Types: Column-Oriented Databases

NoSQL databases come in four distinct types: key-value stores, document stores, graph databases, and column-oriented databases. In this article, we’ll explore column-oriented databases, also known simply as “NoSQL columns”.

Originally from KDnuggets https://ift.tt/3pgfMXh

How to Speed up Scikit-Learn Model Training

Scikit-Learn is an easy-to-use Python library for machine learning. However, scikit-learn models can sometimes take a long time to train. The question becomes: how do you create the best scikit-learn model in the least amount of time?

Originally from KDnuggets https://ift.tt/3qefOQu

How To Properly Deal With Data Leakage

Data leakage is one of the biggest issues in machine learning. It can lead to deceptively good evaluation results and poor real-world model performance, so it needs to be properly dealt with before deploying your model in production.

Data leakage is one of the most difficult problems when developing a machine learning model. It happens when you train your algorithm on a dataset that includes information that would not be available at the time of prediction when you apply that model to data you collect in the future.

In simpler terms, data leakage happens when we accidentally share information between the test and training datasets while creating the model.

“Any other feature whose value would not actually be available in practice at the time you’d want to use the model to make a prediction is a feature that can introduce leakage to your model.” — Data Skeptic

Before we start, please note that this tutorial is part of the Python Data Analysis For Data Science & Machine Learning. Feel free to check it out for a more detailed explanation of this and other concepts.

Why is Data Leakage Important To Know?

  • It causes a model to look far better during evaluation than it really is, understating the true generalization error and making the evaluation useless for any real-world application. Without caution, a model deployed in production will cause the application to fail miserably.
  • It can lead investors to make bad investments and cause huge financial costs.
  • It is practically deadly in the healthcare sector, where wrong predictions can cost human lives.
  • It can lead to wrong predictions about customer behaviour, making business leaders take wrong decisions that can push the business into debt or, at worst, eventually collapse it.

Data Leakage is therefore one of the most important concepts to know as a Data Scientist or Machine Learning Engineer.

How do I know if I have Data Leakage?

Data leakage often results in unrealistically high performance on the test set, because the model is being run on data that it has already seen, to some extent, in the training set.

It has already memorized the patterns, so of course it performs well.

However, this is misleading, and such a model will fail to generalise when deployed in production.

How did I end up with Data Leakage?

There are several causes of data leakage, including:

A. Duplicates

B. Leaky Predictors

C. Pre-processing activities

A. Duplicates

Duplicate values are a common problem when dealing with real-world datasets; you can’t run away from them. Duplication occurs when your dataset contains several data points that are identical.

For example, if you are working with a customer reviews dataset for sentiment analysis, it is very likely you will find customers who have written the same review for a product several times, partly because some product owners ask them to write more reviews to drive sales, or because a customer simply likes or dislikes the product and keeps writing the same positive or negative review over and over again.

In this situation you may experience data leakage, because your train and test sets may contain the same data point even though they correspond to different observations, and the model will fail when you use it in production on new sets of reviews.

B. Leaky Predictors

You may not explicitly leak your data, yet you can still experience data leakage, especially if there are dependencies between your test and train sets. This mostly happens when you are dealing with data where time is important (like time-series data).

Leaky predictors are features whose values will not be available at the time you make predictions.

Let’s demonstrate this concept below:

We have created a dummy data which contains:

— ‘Purchase’, whether people purchased the item or not

— ‘QTY’, the quantity of the item purchased

— ‘Product’, the particular item purchased

— ‘Discount’, whether there was a discount on the item purchased or not
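
The code cell that builds this dummy data is not included in this copy of the post, so here is a hypothetical re-creation; the rows and values are invented purely to match the column descriptions above.

    import pandas as pd

    # Invented dummy data matching the columns described above.
    df = pd.DataFrame({
        "Product":  ["Shoes", "Bag", "Watch", "Shoes", "Bag", "Watch", "Shoes", "Bag"],
        "QTY":      [1, 2, 1, 3, 1, 2, 1, 1],
        "Discount": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
        "Purchase": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes"],
    })
    print(df)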

People will mostly buy a product when they are given a good discount and the product is what they need. If you look at the data above, most of the people who got a Discount also Purchased the product.

Let’s check the relationship or correlation between the two features below:

First, we will convert the object type values to numerical values with label encoding.
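
A minimal sketch of that step, assuming the hypothetical frame above and scikit-learn’s LabelEncoder:

    from sklearn.preprocessing import LabelEncoder

    # Encode every object (string) column into integer codes.
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    print(df.dtypes)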

Now all the values are numerical, as shown above.

Let’s proceed to check the correlation between these variables.
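
A sketch of the correlation check, using pandas’ corr() on the encoded frame:

    # Pairwise correlations; we mainly care about correlation with "Purchase".
    print(df.corr())
    print(df.corr()["Purchase"].sort_values(ascending=False))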

We can see that there is a very strong relationship between two of the features (Purchase and Discount), about 0.7, while the other features have much weaker correlations of -0.3 and 0.47.

Having a strong relationship is actually a good thing: for instance, if we want to build a model that predicts whether a customer will purchase a product or not, this variable will help us get good predictions from our dataset.

However, we should also note that a discount is normally given under certain conditions, such as festive seasons or customer type, or it may run only for a certain period of time. In short, discounts are not available all the time.

Considering the correlation between the two features (Purchase and Discount), if we build a model to predict whether a customer will purchase an item or not based on the given data, the model will learn that anyone who has a discount is highly likely to purchase an item. The validation data comes from the same source, so the pattern repeats itself in validation and the model gets great validation scores. But the model will fail when we deploy it in the real world, since the data that comes later on might not include a discount.

C. Pre-processing activities

Probably the most common cause of data leakage happens during the data pre-processing steps of machine learning.

Approach 1

Most of the time, we

  • prepare our data
  • split it into training and testing set and
  • build and evaluate our model

While this is how most machine learning problems are approached, it exposes our test or validation set to the model during training and typically leads to data leakage.

Take, for instance, data normalisation, where we would like to normalise our data so that it has a range of 0 to 1. This means that the largest value for each attribute is 1 and the smallest value is 0.

X' = (X - Xmin) / (Xmax - Xmin)

where Xmax is the maximum value of the column and Xmin is the minimum value.

When X is the minimum value in the column, the numerator is 0 and hence X' is 0. On the other hand, when X is the maximum value in the column, the numerator equals the denominator and X' is 1. If X is between the minimum and the maximum value, then X' lies between 0 and 1.

Now, when we normalize our data this way, we first calculate the minimum and maximum values for each variable and then use these values to scale the variables. Only after that do we split our dataset into train and test sets, which means the examples in the training set know something about the data in the test set: they have been scaled by the global minimum and maximum values, so every data point carries a trace of every other.

Similarly, standardisation estimates the mean and standard deviation from the whole dataset in order to scale the variables, so each data point gets a taste of every other, whether it ends up in the train set or the test set.

Missing value imputation causes the same problem.

This happens with almost all data preparation techniques.

Approach 2

We can therefore reorganise our process in this way:

  • Split the data into training and testing sets.
  • Perform data preparation on the training set.
  • Fit the model on the training set.
  • Evaluate the model on the test/validation set.

Let’s see an example below:

Approach 1:

The wrong way

First, let’s try approach 1 and evaluate the results.

i.e.

  • prepare our data
  • split it into training and testing set and
  • build and evaluate our model

We will use the MinMaxScaler function to scale our data into the range 0–1

Let’s create a dummy dataset here.

We will use sklearn’s make_classification() function to create a dataset with 1,000 records and 10 features.
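
The original code cell is not shown in this copy of the post; a sketch of that step could look like the following (the random_state is an arbitrary choice):

    from sklearn.datasets import make_classification

    # 1,000 samples, 10 numeric features, binary target.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
    print(X.shape, y.shape)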

Now we have our dataset.

Step 1. Now let’s normalise our dataset
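
A sketch of this step, continuing with the dataset above; note that the scaler sees the entire dataset, which is exactly the leak:

    from sklearn.preprocessing import MinMaxScaler

    # Wrong way: fit the scaler on ALL rows, including future test rows.
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)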

Step 2. Now let’s split our data into training and testing sets
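
A sketch of the split (the test size and random_state are assumptions, the post does not state them):

    from sklearn.model_selection import train_test_split

    # Split the already-scaled data; this ordering is what leaks information.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.3, random_state=1)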

Step 3. Let’s build and evaluate the model

Build the model:
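
The post does not name the classifier it used, so logistic regression is assumed in this sketch:

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)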

Evaluate the model:
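
A sketch of the evaluation, using accuracy since that is the metric the post reports:

    from sklearn.metrics import accuracy_score

    print("Train accuracy: %.2f%%" % (100 * accuracy_score(y_train, model.predict(X_train))))
    print("Test accuracy:  %.2f%%" % (100 * accuracy_score(y_test, model.predict(X_test))))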

We are achieving 88.38% accuracy on the training data and 91.50% on the testing data, which looks quite good as it is.

Now let’s do it the right way and see what happens.

Approach 2:

The Right Way

Step 1: We will first split the data into train and test sets.
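
A sketch of the split on the raw, unscaled data (parameters assumed as before):

    from sklearn.model_selection import train_test_split

    # Right way: split the RAW data before any preparation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)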

Step 2: We now scale our data using the MinMaxScaler
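
A sketch of leakage-free scaling: the scaler is fitted on the training rows only and merely applied to the test rows:

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
    X_test_scaled = scaler.transform(X_test)        # no test information leaks into the fit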

NB: We did not scale y_test, since we want it to represent a real-world dataset that we can use for testing or validation. In most cases you will not need to scale y_train either, since it will already be in a small range, and scaling X_train and X_test is enough to get going. However, depending on your dataset and problem statement, you can scale y_train, BUT NOT before splitting.

Step 3: Build and evaluate the model

Build the model:
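
Again assuming logistic regression, now trained on the leakage-free training split:

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_scaled, y_train)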

Evaluate the model:
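
And the corresponding evaluation sketch:

    from sklearn.metrics import accuracy_score

    print("Train accuracy: %.2f%%" % (100 * accuracy_score(y_train, model.predict(X_train_scaled))))
    print("Test accuracy:  %.2f%%" % (100 * accuracy_score(y_test, model.predict(X_test_scaled))))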

What happened?

Now we can see the reality of the model being revealed. Our model is overfitting, but in approach 1, because the model had tasted both the training and test sets, it had memorised the patterns and still appeared to do well. Approach 2 reveals that our model won’t work in production; it will fail miserably, so we need to tune the model.

Using Cross-Validation

K-fold cross-validation involves splitting a dataset into K non-overlapping groups of rows. You then train your model on all but one group, which together form the training dataset, and evaluate it on the hold-out fold. You repeat this process several times so that each fold gets a chance to be used as the hold-out test set, and finally average performance across all evaluations.

For each estimation, the dataset is divided into 5 folds: 4 for training and the remaining 1 for testing.

Let’s check the two approaches discussed above using cross-validation.

Approach 1: The Wrong Way

Step 1: Scale the data using the MinMaxScaler

Step 2: Perform cross-validation and check the accuracy
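
A sketch covering both steps, continuing with the same X and y; scaling the whole dataset up front means every fold has already "seen" the others through the scaler:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.preprocessing import MinMaxScaler

    X_scaled = MinMaxScaler().fit_transform(X)            # leaks across folds
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y,
                             cv=cv, scoring="accuracy")
    print("Mean accuracy: %.2f%%" % (100 * scores.mean()))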

Our model achieves 88.76% using approach 1.

Let’s consider approach 2 as well.

Approach 2: The Right Way

Approach 2 puts everything in a pipeline.
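
A sketch of that pipeline; inside cross-validation the scaler is re-fitted on the training folds only for each split:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    pipeline = Pipeline([
        ("scaler", MinMaxScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    cv = KFold(n_splits=5, shuffle=True, random_state=1)
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    print("Mean accuracy: %.2f%%" % (100 * scores.mean()))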

Running the example normalises the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.

Comparing the two accuracies, we can expect approach 2, with an accuracy of 85.43%, to perform better in production than approach 1, even though approach 1 reports 88.76%.

End notes:

I highly recommend using approach 2 in all the above scenarios, i.e. split your data into training and testing sets first, before performing any data preprocessing activities, in order to avoid unnecessary data leakage.

If you like this tutorial, check out the Python Data Analysis For Data Science & Machine Learning for a more detailed explanation of this and other concepts.

Also, please give it a Clap and don’t forget to follow me for more tutorials because I will be posting tutorials with an in-depth explanation so that we can learn from each other.

Credit Card Fraud Detection: How to handle an imbalanced dataset

This post focuses on the step-by-step project and the results; you can view my code on my GitHub.

Tags: machine learning (logistic regression), python, jupyter notebook, imbalanced dataset (random undersampling, SMOTE)

Introduction

Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services, or to make a payment to another account that is controlled by a criminal (Wikipedia).

This is a major problem for both the victim and the credit card company, as it inflicts a financial loss on both parties. It is important for a credit card company to detect which transactions are fraudulent and which are not.

With machine learning, we can detect fraudulent transactions from historical data and deny them, so that neither the company nor the individual suffers a loss.

Purpose

There are a lot of ways to handle an imbalanced dataset. In this project we will compare two techniques (Random Under-Sampling and SMOTE) to see which fits this imbalanced dataset best.

To detect fraud we can use machine learning, and there are a lot of machine learning algorithms out there. In this project we will also see which model fits best out of Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Random Forest.

Data

The dataset contains transactions made by credit cards in September 2013 by European cardholders; it presents transactions that occurred over two days. It contains only numerical input variables, which are the result of a PCA transformation. The only features that have not been transformed with PCA are the Time and Amount features.

The Time feature contains the seconds elapsed between each transaction and the first transaction in the dataset, while Amount is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature Class is the response variable and takes the value 1 in case of fraud and 0 otherwise.

Data Source : Kaggle

Methods

The machine learning method used is classification. The purpose of this project is to classify each transaction as Fraud or Non-Fraud and compare which classification algorithm fits this dataset best.

The classification algorithms compared are Logistic Regression, K-Nearest Neighbors, Support Vector Machine, and Decision Tree.

Analysis

The credit card transaction dataset is imbalanced, as we can see from the class distribution in Figure 1 below.

Figure 1. Class distribution shows an imbalanced dataset

When we’re dealing with an imbalanced dataset, we can’t simply use it raw and feed it into machine learning. Doing so would bias the model towards the majority class and lead to a poor machine learning model.

So, we have to handle the imbalanced dataset first.

There are a lot of techniques we can use to handle an imbalanced dataset, such as Random Under-Sampling and Random Over-Sampling. In this project, I will use Random Under-Sampling and SMOTE to handle the imbalanced dataset and compare which fits this dataset best.

SMOTE is a technique that over-samples the minority class by synthesising new examples along the gaps between existing minority values; it can then be combined with under-sampling of the majority class so that the two meet in the middle.
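
The post’s code lives on GitHub rather than inline, but a minimal sketch with imblearn might look like this (the file name and variable names are assumptions based on the Kaggle dataset description):

    from collections import Counter

    import pandas as pd
    from imblearn.over_sampling import SMOTE

    # Assumed file name for the Kaggle credit card fraud dataset.
    df = pd.read_csv("creditcard.csv")
    X = df.drop("Class", axis=1)
    y = df["Class"]

    X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
    print(Counter(y), "->", Counter(y_smote))  # classes are now balanced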

Figure 2. Class distribution after performing SMOTE

We can see from Figure 2 that the class distribution becomes equal after performing SMOTE.

This dataset doesn’t have missing values, so we don’t have to handle them.

By performing this technique, we can also see how different the correlations between features look before and after performing SMOTE.

Figure 3. Feature Correlation

We can see that before performing SMOTE we can’t really see the correlations between the features, but once the data is balanced, the correlations become clearly visible.

Figure 4. Boxplot of Each feature categorize by Class

Here we can see that for some of the features there is a clear separation in range between the classes. We can also see that there are a lot of outliers, so we will remove the extreme outliers from the features that have a high correlation with the class.

From the boxplots, we can see which features have a negative or positive correlation with the class:

Negative correlation: Time, V2, V6, V8, V9, V11, V13, V15, V16, V17

Positive correlation: V1, V3, V10, V18

The features with a high correlation with the class are V2, V3, V8, V10, V11, V13, V15, V16, V17 and V18, so we will remove outliers from these features.

Here is a comparison of one of the features before and after removing the outliers.

Figure 5. Before and after removing outliers

The threshold used for removing the extreme outliers is 1.5 times the IQR. This cut-off value determines the allowed range: the lower bound is Q25 minus the cut-off and the upper bound is Q75 plus the cut-off.

Values outside that range are removed.
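
A sketch of the 1.5 × IQR rule for one of the highly correlated features (V10 is used here as an example; the post applies it to several features):

    import numpy as np

    q25, q75 = np.percentile(df["V10"], [25, 75])
    cut_off = 1.5 * (q75 - q25)
    lower, upper = q25 - cut_off, q75 + cut_off
    # Keep only the rows whose V10 value falls inside the allowed range.
    df = df[(df["V10"] >= lower) & (df["V10"] <= upper)]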

After removing outliers, we can see a slight decrease in the number of fraud cases (Class 1). Although the data now looks slightly imbalanced again, the difference is not large enough to consider the dataset imbalanced.

Usually you would consider a dataset imbalanced if the class ratio is around 8:2.

Figure 6. Class distribution after removing outliers

Lastly, we will do Random Under-Sampling, which we can do using the imblearn library (a short sketch follows the list below). We will compare all four of the resulting datasets:

  1. The raw/original dataset
  2. The dataset after SMOTE
  3. The dataset after SMOTE and removing extreme outliers
  4. The dataset after Random Under Sampling
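
A sketch of the under-sampling step with imblearn, reusing the X and y defined in the SMOTE sketch above:

    from collections import Counter

    from imblearn.under_sampling import RandomUnderSampler

    rus = RandomUnderSampler(random_state=42)
    X_under, y_under = rus.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_under))  # majority class shrunk to match the minority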

If you are dealing with an imbalanced dataset, it is not good to use accuracy as your metric, because accuracy will be high simply because the model succeeds at predicting the majority class (this is the bias mentioned earlier). That is why I am using precision and recall as the metrics to decide which dataset gives the best performance.

Precision and recall focus on the positive (fraud) class rather than being dominated by the majority class, so we don’t have to worry about biased results even though our data is imbalanced.

I will apply Logistic Regression to each of the four datasets above and then compare them.
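
One possible arrangement of that comparison is sketched below for the SMOTE dataset: split first, resample only the training part, then compute the precision-recall curve on the untouched test split (split sizes and parameters are assumptions):

    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import auc, precision_recall_curve
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)
    X_train_sm, y_train_sm = SMOTE(random_state=42).fit_resample(X_train, y_train)

    model = LogisticRegression(max_iter=1000).fit(X_train_sm, y_train_sm)
    probs = model.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, probs)
    print("PR AUC:", auc(recall, precision))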

Results

Figure 7. Precision-Recall Curve

Here we can see that the dataset that went through SMOTE gives the best precision-recall curve. Removing outliers can actually discard useful information, and the same goes for Random Under-Sampling: we can lose a lot of useful information that would help us classify the classes. Still, both of those datasets have a higher precision-recall than the raw dataset that hasn’t been handled at all.

Figure 8. ROC Curve

We can also see from the ROC curve that using SMOTE gives the best results. The ROC curve can help you decide which method is better than the others.

Conclusion

  1. The best technique for handling the imbalanced dataset in this credit card fraud detection problem is SMOTE (Synthetic Minority Oversampling Technique).

P.S

I am actually working on comparing which machine learning algorithm fits best, but the results were kind of weird, so I decided not to put them here and to try again. There is some theory that I haven’t completely grasped yet, and I am working on it.

I will probably add it in the next few days and use GridSearchCV (which I haven’t fully understood yet). I can also analyse which features have a high correlation with the class (not manually, which is what I did here; technically I could use a scatter plot), or which features can help us differentiate the classes.

Please send me your comments and advice if you notice something wrong. It would help me a lot. I am a newbie at this, so please go easy on me!

Comment:

This is not a big project, but I learned a lot through it. It helped me understand many of the metrics we can use to evaluate our models and the different methods we can use to handle an imbalanced dataset.

Generalization Technique for ML models

Ever wondered about the term “generalization” for ML models? Generalization in machine learning here means that the model you built using your data gives better results on the testing data than on the training data.

How do you achieve generalization? By simply changing the random state at the time of splitting the data into training and validation sets, you can achieve generalization.

Let’s take the example of the iris dataset. The iris dataset has sepal length, sepal width, petal length and petal width as features. The labels are Setosa, Versicolor, and Virginica. It has 150 rows.

Just loop the random state over a range from 0 to 99 and calculate the train and test scores of the models created. Inside the loop, add a condition: if the score on the test data is better than on the training data, append the random state, training score, and testing score to a list called scores.

The models generated with the random states appended to the scores list are all generalized models, as they perform better on the testing data than on the training data. If you are doing a regression problem you can use other metrics like RMSE, and in that case the goal will be for the testing error to be less than the training error. Since this is a classification problem, we will use the accuracy metric.

Sort scores in descending order of testing accuracy to get the generalized model with the highest accuracy on the testing data.
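
A sketch of the whole procedure; the classifier and test size are assumptions, since the original post links to the notebook rather than showing the code:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    scores = []
    for rs in range(100):
        X_train, X_test, y_train, y_test = train_test_split(
            iris.data, iris.target, test_size=0.25, random_state=rs)
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        train_score = model.score(X_train, y_train)
        test_score = model.score(X_test, y_test)
        # Keep only "generalized" splits, where the test score beats the train score.
        if test_score > train_score:
            scores.append((rs, train_score, test_score))

    # Sort by testing accuracy, highest first.
    scores.sort(key=lambda s: s[2], reverse=True)
    print(scores[:5])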

With random state 0, I am getting the highest accuracy on the testing data, so I will select the model with random state 0.

The GitHub link for this tutorial is as below:

pratikskarnik/Generalization_Technique

Machine Learning its all about assumptions

Just as with most things in life, assumptions can directly lead to success or failure. Similarly in machine learning, appreciating the assumed logic behind machine learning techniques will guide you toward applying the best tool for the data.

Originally from KDnuggets https://ift.tt/3aS4q6I
