Avoid overfitting using cross-validation

Folding Validation sets using Cross-Validation!

This article is divided into 3 main parts:

1 — Overfitting in Transfer learning

2 — Avoiding overfitting using k-fold cross-validation

3 — Coding part

Transfer learning is a term that has swept through the field of deep learning lately and is now used widely.

A quick recap of transfer learning: it means using a pre-trained model as the starting point for your own model when you don’t have enough data for the new task.

For a detailed explanation about transfer learning, read the following article about Transfer Learning.

A Kaggle competition caught my attention a few weeks ago, and I felt intrigued enough to give it a try. It was centered on image classification, and each candidate could choose one of two topics: heart disease or grocery items.

So, I chose the grocery items. There were 19 classes with very little data per class!

For this reason, using transfer learning is a must. But a problem anyone will encounter here is overfitting, which can be put simply as follows:

The model memorizes the training dataset instead of learning its underlying features, so it fails to predict new, unseen data.

At first, the loss the model produced was very high and the accuracy didn’t go above 0.1! Analyzing the training graphs made it clear that overfitting was the major problem. Hence, k-fold cross-validation was the best choice.

K-Fold Cross-Validation

Simply speaking, it is a procedure that divides the training dataset into k parts (folds). In each round, (k-1) folds are used as training data and the remaining fold is held out for validation predictions. That remaining part is called the “holdout fold”, and it changes in every round, so over k rounds each fold serves as the holdout set exactly once.
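To make the rotation of the holdout fold concrete, here is a minimal sketch using scikit-learn’s KFold (the ten toy samples and k = 5 are illustrative assumptions, not the competition data):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)       # ten toy samples, just to show how the folds rotate
kf = KFold(n_splits=5)  # k = 5 folds

for fold, (train_index, holdout_index) in enumerate(kf.split(X), start=1):
    # in each round, 4 folds are used for training and the remaining fold is held out
    print(f"Round {fold}: train on {X[train_index]}, hold out {X[holdout_index]}")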

How will this affect the model and alleviate the overfitting?

Actually, the problem with overfitting is that the model gets ‘over-familiar’ with the training data. To avoid such a scenario, we will use cross-validation.

Coding part || Fun part

The built-in KFold class is found in the sklearn library (sklearn.model_selection).

# KFold lives in sklearn.model_selection
from sklearn.model_selection import KFold

'''X is the training data and Y the labels.
train_index is the index array for the training folds;
test_index is the index array for the holdout fold.
'''
n_split = 5  # number of folds (k)
for train_index, test_index in KFold(n_splits=n_split).split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]
    model = create_model()

The code above does the following:

  • iterates over all train/test index splits produced by the k folds.
  • builds a new training set and a new holdout set for each fold.
  • calls the create_model function to create the model and produce the output (a combined training-loop sketch is shown after the layer descriptions below).

BUT, where is the create_model() function? WE WILL CREATE IT NOW 🙂

The create_model function (a description of each part follows the code):

# assumed imports (tensorflow.keras shown; standalone keras works the same way)
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Flatten, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers

def create_model():
    IMAGE_SIZE = [100, 100]  # fixed image size
    # load the ImageNet weights used by VGG16;
    # include_top=False keeps all VGG16 layers except its task-specific classifier head
    vgg = VGG16(input_shape=IMAGE_SIZE + [3], weights='imagenet', include_top=False)
    for layer in vgg.layers:
        layer.trainable = False  # don't train the VGG16 layers, we need their weights fixed
    y1 = Flatten()(vgg.output)
    bn2 = BatchNormalization()(y1)
    y4 = Dense(37, activation='relu')(bn2)
    bn3 = BatchNormalization()(y4)  # normalize the hidden Dense layer's output
    prediction = Dense(3, activation='softmax')(bn3)
    model = Model(inputs=vgg.input, outputs=prediction)
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.Adam(),
                  metrics=['accuracy'])
    model.summary()
    return model
  • Since we are dealing with a transfer-learning image classification model, we chose VGG16 for its ability to capture and analyze small features in images.
  • The image size is set to (100, 100) as a convenient compromise across most of the images.
  • ‘include_top=False’ is a key part of transfer learning: it lets us reuse all of the pre-trained model’s convolutional weights in the new (current) model while dropping its original classifier head.
  • After that, we don’t want VGG16 to train again from scratch, because its weights must stay frozen. So we loop over all of its layers and set trainable to False.
  • Now that the base model is ready, let’s add the top layers that replace the excluded classifier head:

1 — Flatten(): flattens the convolutional output into a single vector.

2 — BatchNormalization(): normalizes the layer’s outputs onto a consistent scale, which helps the model learn (values on a small, comparable scale are much easier to work with than values ranging from 0 to 1000!).

3 — Dense(37, activation=’relu’): why 37 units? Finding the best number of units for a hidden layer is a tough process that demands a lot of trial and error. At first I started with 22 and got the worst results ever. Then I tried 97 and got similarly poor results, just from the opposite direction. So I kept stepping down by 10 units, then 5, then 1, until I landed on this!

(Note that each model needs its own number of hidden units, so make sure you go through the same trial and error.)

4 — One more BatchNormalization, applied to the hidden layer’s output.

5 — The final Dense layer uses the ‘softmax’ activation to output a probability for each class; it has 3 units because I was training the model on only 3 classes to observe the results.
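Putting the pieces together, here is a hedged sketch of how the k-fold loop and create_model() can be combined into a full training run (the epoch count, batch size, and the averaging of fold scores are my own assumptions, not values reported in this article):

import numpy as np
from sklearn.model_selection import KFold

k = 5
fold_scores = []

for train_index, test_index in KFold(n_splits=k).split(X):
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

    model = create_model()
    # epochs and batch_size below are illustrative, not the article's settings
    model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    fold_scores.append(acc)

print("Mean holdout accuracy over", k, "folds:", np.mean(fold_scores))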

Finally, let’s check out the results:

Training without K-fold cross-validation:

Test Loss: 2.23648738861084

Test accuracy: 0.36900368332862854

Training with K-fold cross-validation (k = 5):

Test Loss: 0.9668004512786865

Test accuracy: 0.6000000238418579

In conclusion, the cross-validation approach proved to be the best fit for avoiding overfitting!

Avoid overfitting using cross-validation was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/avoid-overfitting-using-cross-validation-51241aa9bf8c?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/avoid-overfitting-using-cross-validation

The AI behind getting the first-ever picture of a black hole

AI AND UNIVERSE

A black hole is a massively condensed object, typically found at the center of a galaxy, whose gravitational pull does not let even light escape from it.

Source: NASA

This year’s Nobel prize in physics has been awarded to Sir Roger Penrose (1/2), Reinhard Genzel (1/4), and Andrea Ghez (1/4) for their research on black holes. Last year’s prize was also in astronomy and cosmology. These are exciting times for astronomy, since the last such prize before that was in 2006.

There is a common trait between astronomy and AI: much of the foundational work started in the 20th century and could not be proved then due to the limitations of technology. Now that the technology has matured, we are able to provide the evidence.

The Nobel prize was awarded last week, which is why I thought it worth dedicating some time to the winners’ work. They built on the general theory of relativity, which describes how gravity behaves in our universe and how space and time are bent and warped.

The bending of time in space
Source: The Royal Swedish Academy of Sciences

Sir Roger Penrose used it to explain and prove the existence of black holes. The gravity at the center of a black hole is so strong that everything, including light, condenses and forms a massive object, a concept called the singularity. Penrose proved its existence mathematically back in 1969. But the Nobel prize committee prefers a theory to be observationally or experimentally confirmed before awarding the prize. This is similar to how Einstein predicted gravitational waves back in 1916, yet the award came only in 2017, for their detection. The incredible first-ever image of a supermassive black hole at the center of the M87 galaxy is, I think, what triggered the Nobel committee to consider the prize at the end of 2019.

Combines pictures collected over time from multiple light collectors around the world. Source: Nature.com

This image was created through the combined effort of multiple research centers and a huge array of radio telescopes to see something so far away. But why was the prize not awarded then, and why to only three people? By tradition, a Nobel prize can be shared by at most three people, which may not sound right in today’s world, where discoveries are so collaborative. Moreover, it was not just the image that proved the existence of black holes; the work of the other two Nobel laureates also helped prove it.

Where does AI come into the picture here?

The image we got after so much collaboration can only tell us so much, due to its blurriness and size. It was created from a combination of 8 telescopes across the globe. As the Earth rotated, they helped fill in parts of the image and build it up. You can imagine it as something like the center panel of the illustration below.

Multiple telescopes are used to fill in the picture over time. Source: Event horizon telescope and Veritasium

The computer vision algorithm was fed lots of galactic and other images to make it learn how things look in the universe. This data, combined with the data collected by the 8 telescopes, helped us build the first-ever image of a black hole.

Now you might ask: how can we train the algorithm if we have never seen a black hole image? Won’t the algorithm just give us images of what we expect to see? That’s right, and that is why the team at MIT, along with Katie Bouman, took a different approach. They trained the algorithm on three different sets of data: first, simulated images of black holes of the expected kinds and sizes; second, other galactic images; and third, general images of cats, dogs, trees, selfies, houses, buildings, and people. To their surprise, once the data from the telescopes was fed to all three trained algorithms, they produced almost identical images in all three cases, as you can see in the picture below.

Different types of images create the same image of the black hole. Source: TED talk by Katie Bouman

This reaffirmed their assumptions and gave us the first-ever image of a black hole. The day is not far when a direct image of a black hole will be taken from somewhere in space where there is less galactic dust and less chance of light diffraction as the light travels millions of light-years to reach the collectors (telescopes). I am pretty optimistic that we will soon be able to take such a picture and determine a black hole’s size, composition, and the physics around it.

The AI behind getting the first-ever picture of a ‘black hole’ was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/the-ai-behind-getting-the-first-ever-picture-of-a-black-hole-c483e8eb6a21?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/the-ai-behind-getting-the-first-ever-picture-of-a-black-hole

How to Explain Key Machine Learning Algorithms at an Interview

While preparing for interviews in Data Science, it is essential to clearly understand a range of machine learning models — with a concise explanation for each at the ready. Here, we summarize various machine learning models by highlighting the main points to help you communicate complex models.

Originally from KDnuggets https://ift.tt/357Z6Jo

source https://365datascience.weebly.com/the-best-data-science-blog-2020/how-to-explain-key-machine-learning-algorithms-at-an-interview

365 Data Use Cases: Data Science and Medical Imaging with Giles McMullen-Klein

Hi! My name is Giles. I’m an Oxford-trained medical physicist and data scientist turned Python instructor and the author of the 365 Python Programmer Bootcamp course.

I’m happy to join the 365 Data Use Cases series and, in this post, I’ll tell you a bit more about my favorite data use case: advanced medical imaging.

We’ve also made a video on the topic that you can watch below or just scroll down if you prefer reading.

Data Science and Advanced Medical Imaging: DMP & MRI

Until recently, I worked as a research scientist, and a medical physicist. That’s how I was introduced to DMP (Dextran‐magnetite particles) – a specialist type of contrast agent used in Magnetic Resonance Imaging (MRI). Without going into detail about how it works, it would be enough to tell you that DMP provides such a sensitive type of advanced medical imaging, that it enables you to image metabolism in a living body.

The fact that you can create a medical imaging system that can do that is quite amazing in itself. But what’s really useful about it is that most diseases affect the body’s metabolism in some way, especially cancer and heart disease. And those were precisely the areas I was focused on. So, in my line of work, we would collect huge amounts of data by doing advanced medical imaging. And that enabled us to:

  • follow and watch metabolic pathways in areas of interest within a body
  • see how they behaved in a healthy body
  • and then compare that to how they behaved in a diseased body

The Role of Diagnostic Medical Imaging in Analysis

Overall, the aim was to see whether we could find any way of using the data that we gathered in the medical imaging analysis as a diagnostic for different types of cancers, for example. In the case of heart disease, we were looking to see whether we could somehow determine the amount of damage that had been done, for instance, following a heart attack. Of course, there were many challenges that we faced doing that MRI analysis but that is a topic for a whole new article.

But the data analysis toolset that you’re learning does not restrict you to a specific industry. In fact, it can be applied to any field that you’re interested in. Whether that’s a business application or a scientific application, you use many of the same methods. That said, data science is a great field to be getting into and there’s almost no limit to what you can achieve with it. So, if you’re eager to learn more about data science and Python in particular, check out my YouTube channel. And, if you’re looking to add some indispensable Python programming skills under your belt, you can sign up for my Python course for free.

The post 365 Data Use Cases: Data Science and Medical Imaging with Giles McMullen-Klein appeared first on 365 Data Science.

from 365 Data Science https://ift.tt/3o3k3O9

Best and Top 10 Data Science Online Resource Available For Every Aspiring Data Scientist

source https://365datascience.weebly.com/the-best-data-science-blog-2020/best-and-top-10-data-science-online-resource-available-for-every-aspiring-data-scientist

The chatbot system for knowledge management and information management

For a modern workforce, knowledge and information management are becoming increasingly important. Locating, recording, and knowing things must be intuitive, quick, and seamless in an environment where data is distributed, workers are constantly on the go, and roles change rapidly.

The basic principle of knowledge and information management is how effectively a company’s product and service information is managed, which since the early 2000s has mostly meant pairing these activities with an optimized search tool. Yet that is no longer enough. At this point in the age of AI and chatbots, if you’re not putting a strong and precise bot to work, you will be left behind.

We have the processing power of computers, but we want the system to be smart enough to understand a layperson’s queries from their keyword phrases, rather than forcing users through training steps just to be precise, or vice versa. To get there, we need to separate these responsibilities between knowledge management and information management.

Let me share the difference between knowledge management vs information management in a firm.

Knowledge Management: Knowledge management is the process of generating, exchanging, using, and maintaining an organization’s resources and insights. It refers to an integrative strategy that makes the best use of expertise to achieve organizational goals effectively.

Information Management: Information management refers to the organizational operations around data: the collection, preservation, and transmission of work data from one or more sources to those who need it, and its eventual disposal through archiving or destruction.

Now is the time to bring AI-powered chatbots into your strategic plan. There are various methods to train a bot, or the individuals handling it, on the information acquired through these processes and on the available knowledge.

1. Bots need to organize their information better.

Information needs to be structured, because structure affects how it is found and used by individuals and by bots, and bots provide another channel to reach your customers. Bots can be leveraged to increase customer engagement with timely tips and offers. Real-time communication through chatbots helps customers find what they are looking for and evaluate different suggestions, weighing their requirements against current trends.

Your files can be structured well using a solid folder or metadata structure in a standardized site and repository hierarchy. The performance of the structure, of course, depends on the points below:

a) The technique used from the outset and from the beginning to organize the information/data material.

b) How well the structure and content have been maintained by the owner of the hierarchy over time (including eliminating ROT, that is, redundant, obsolete, or trivial content, if needed).

For knowledge management, a well-organized hierarchy that is intuitive, standardized, and kept up to date will work well. Content is not organized by search alone; the ideas and suggestions that accompany queries carry extra information about the customer’s knowledge of the product or service. Organically, the search engine can return several results based on keyword matches, metadata refiners, and, of course, previous file popularity. When the user has no idea where the expected details live (or doesn’t care to waste time working through a file structure), this may work well.

With bots, the available information is absolutely determined by the owner(s) of the bot, certain individuals who organize the bot’s data, and how it guides users to the source data they are searching for. For each department or division in an organization, a good bot includes answers to the most popular questions, answers the question being asked (rather than only providing a source for the answer), and links directly to the origin as a guide for more information.

The response is important because, well, it’s what the user was searching for. The reference is also powerful since, should they need it, it automatically leads the information seeker to the source. Bots optimize what’s available in terms of the existing knowledge and give a high return on the effort invested in curating responses.

2. Bots make high-value data predictable to find.

Site structures, search, and bots rate quite differently when it comes to finding what you’re anticipating. For example, a user is expected to navigate the human resources site layout to find data about their compensation. Even though this can be predicted, it is not assured: the site layout may be more difficult than expected, or, frankly, the user may get a bit lazy and give up.

But even if they understand how to get to the information, clicking through folders, views, and plugins is still a chore and can deter anyone from pursuing a document they need. They eventually either accept not getting the data (possibly affecting the quality of their work) or ask someone else for help (which adds little overall value, since it spends another person’s precious time on the task). In general, this way of seeking information is okay, but not fantastic.

With search, which simply combs through everything you have access to, you are stuck with whatever it returns. The user must also contend with results that are just not relevant unless search is combined with a specialized setup (e.g. promoted results, custom refiners). From ambiguous keywords (e.g. “office” for facility details versus “Office” for the software) to outdated information, a significant amount of what search returns has to be judged against the context of its original use.

The response is important because, well, it’s what the consumer was searching for. The reference is also powerful, as it leads the knowledge seeker to the source immediately if they need it. Bots optimize what is available when it comes to the existing data and provide better returns on the effort invested in responses.

3. Bots make it possible to predict high-value knowledge to find

Site hierarchies, search, and bots rate very differently when it comes to finding what you are anticipating. With folder structures, the material is usually well organized, and finding what you want is predictable: if a file was there last week, it is probably still there this week, in the same site, library, or folder. Hierarchies are trustworthy companions that can be used again and again to find data once we know our way around. They may not be simple for the owner to set up and manage, but users find a well-organized information system easy to use and pleasant. With a bot, you strike a good balance: you prescribe the answers to the most wanted queries and provide a solid way back (via links) to the actual sources.

Deciding what to include can be daunting at first. A simple way to start is to combine the top, say, 50 most popular search queries from your intranet’s search analytics with a known list of FAQs per department or category in your organization. You will see most of the bot’s value once you have covered around three-quarters of those queries. To decide what else people want to know, collect any questions the bot could not answer. A bot strikes a balance between knowledge management and information management.

4. Bots force you to always personalize the high-value information

The curation of information is crucial. Your intranet home page may include dynamic content, but inevitably someone with a plan has decided how that content will be presented and what to show or not show. The same goes for knowledge management overall.

Bots offer you a happy middle ground where only the content that is important can be curated. Yeah, keeping records of things that happened seven years ago is valuable, but it’s doubtful you’ll have to see it often. On your site, that kind of file is curated. Search offers organic answers and can provide insights into what’s common through its analytics. But searching will only provide you with the source of the data.

If you want to know about the holiday policy, search will probably return the employee handbook; but to locate the section on time off, you’ll have to sift through that document. By comparison, a curated bot will answer the time-off question directly and link to the employee handbook. The curated reply, rather than the source, is the answer the user was searching for. A curated bot skips the irritating step of having to read, digest, or search further once you have found the source you needed.

5. Bots require minimal user training

Unfortunately, excellent information management has traditionally required a well-trained user base, people who use the knowledge they have acquired to customize information and structure it better. But everybody has their daily job, so no one likes having to take a class to learn something as fundamental as directory structures and searching, required or not. (Although, indeed, they are expected to.)

On the user’s side, it takes a while to discover and understand even a good site and library layout. If you doubt that, ask the newest member of your team how long it took them to grasp how your team’s knowledge is organized. Even the best structures take time to learn, and that time is taken away from other work that could be accomplished.

Searching, one might say, has no learning curve: given an empty box with a magnifying glass beside it, everybody knows what to do. But it is not that simple. Of course, you can use out-of-the-box search that way if you like, but getting the most out of search requires a smart setup with configured refiners, pinned results, and more.

Conclusion

The knowledge management and information management fields are being shaken up by actual, functional AI solutions that can get us past the limitations of site structures and search and link users directly to the data they want: point A to point Z without any stops along the road. A smart implementation process can buy big, immediate wins. Keep these steps in mind when you begin playing with bots in your organization:

· A bot is required. You can spend thousands of dollars and months of development time building one yourself, or you can get up and running with an off-the-shelf bot in a matter of hours. There is also a 30-day free trial to see if it suits your needs.

· Record the x most popular search queries from your search analytics (I settled on 50) and revisit this list every month or so.

· Collect and record the list of common questions from each team, along with any knowledge base they maintain or cheat sheets they use to find relevant details.

· To answer these questions, customize your bots using a platform like Question and Answer Creator. Link to the sources in your replies.

· Collect messages from users that did not get a successful response. Apply a review framework to assess user needs.

· Review your replies on a regular basis to ensure they are specific.

· Delete redundant answers, update records, and observe best practices.

The chatbot system for knowledge management and information management was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/the-chatbot-system-for-knowledge-management-and-information-management-ebedae0cecfd?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/the-chatbot-system-for-knowledge-management-and-information-management

DOE SMART Visualization Platform 1.5M Prize Challenge

The U.S. Department of Energy’s (DOE) Office of Fossil Energy (FE) will award up to $1.5 million to winning innovators in a prize challenge to support FE’s SMART initiative. Registration deadline to participate in the challenge is 11:59 p.m. EDT Friday, Jan 22, 2021.

Originally from KDnuggets https://ift.tt/3iZ34cg

source https://365datascience.weebly.com/the-best-data-science-blog-2020/doe-smart-visualization-platform-15m-prize-challenge

Optimizing the Levenshtein Distance for Measuring Text Similarity

To speed up the calculation of the Levenshtein distance, this tutorial computes it using a vector rather than a full matrix, which saves a lot of time. We’ll be coding in Java for this implementation.
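The linked tutorial implements this in Java; as a rough illustration of the same idea, here is a minimal Python sketch of the two-row (vector) version of the Levenshtein distance:

def levenshtein(a: str, b: str) -> int:
    # keep only the previous row and the current row instead of the full matrix
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3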

Originally from KDnuggets https://ift.tt/37bw5iu

source https://365datascience.weebly.com/the-best-data-science-blog-2020/optimizing-the-levenshtein-distance-for-measuring-text-similarity
