The first thing you need to know about taking a few shots is that it is all about practice. If you are a beginner, hitting even the easiest shots will likely prove challenging. But, as you practice, you will see your game improving. You will start to be able to sink more and more shots, and your overall confidence will increase.
Few Shot Help
Few shot learning is a part of machine learning. Machine learning is a method of data analysis that automates the building of analytical models. It is a branch of artificial intelligence built around systems that can identify patterns, learn from data, and make decisions with minimal human intervention. Few shot learning, as a subfield of machine learning, classifies new data when only a handful of labeled examples are available.
Few shot learning is a relatively new concept, and research is ongoing to get more out of it. One place to use it is computer vision. A computer vision model usually works best with a large number of training samples. Suppose you work in the healthcare sector and need to categorize bone illnesses from x-rays: you might not have enough images to train a conventional model. Few shot learning is designed for exactly this kind of problem.
Variations of Few Shot
There are four (4) different variations found in the case of Few Shot Learning. These are:
- N-Shot Learning or NSL
- Few Shot Learning or FSL
- One Shot Learning or OSL
- Less than One or Zero Shot Learning or ZSL
When few shot learning is discussed, it usually means N way K shot classification, where N is the number of classes and K is the number of samples from each class. N shot learning is best seen as the broad umbrella concept, so few shot, one shot, and zero shot learning are all subfields of N shot learning.
Zero shot learning, or ZSL, classifies unseen classes without any training examples of them. That may sound crazy, but ask yourself: can you classify an object without ever having seen it? Probably not, unless you already know its properties, function, and appearance. Only then can you form a general idea of it. Zero shot learning works the same way, relying on this kind of descriptive information about a class. It is likely to become more effective and popular in the coming years.
One Shot and Few Shot
As the name suggests, one shot learning has a single sample in each class. In few shot learning, there are typically two to five samples per class, which makes it more flexible than one shot learning.
Few Shot Learning Approaches
Let’s look at the few shot approaches in more detail.
First, let’s define an N way K shot classification problem.
- A training support set consisting of:
  - N class labels
  - K labeled images for each class
- Q query images
We need to classify the Q query images among the N classes. The support set gives us N*K labeled examples in total. The major issue is that this is not enough training data to work with.
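The N way K shot setup can be sketched in a few lines of Python. This is only an illustration: the function name sample_episode, its parameters, and the dict-of-lists dataset layout are assumptions made for the example and do not come from any particular library. Here n_query is the number of query items drawn per class, so Q = N * n_query.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, seed=None):
    """Sample one N way K shot episode.

    data_by_class: dict mapping a class label to a list of examples
    (a hypothetical layout, assumed for illustration).
    Returns (support, query), each a list of (example, label) pairs.
    """
    rng = random.Random(seed)
    # Pick the N classes that take part in this episode.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw K support and n_query query examples without overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query
```

For a 3 way 2 shot episode, the support set returned here holds 3*2 = 6 labeled examples, matching the N*K count in the text.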
The main step in few shot learning is to gain experience from similar problems. This is why it falls under the category of meta learning. In a traditional classification problem, the model learns to classify from the training data and is then evaluated on test data. In meta learning, the emphasis is instead on learning how to learn to classify, given a set of training tasks. There are generally two approaches to solving few shot learning problems:
- Data level approach or DLA
- Parameter level approach or PLA
Data Level Approach
This is the simplest approach. The idea is clear: if you don’t have enough data to build a model without overfitting or underfitting, add more data. This is why many few shot learning solutions draw additional information from a large base dataset. The key feature of this base dataset is that it does not contain the classes you are trying to learn from a few examples. For example, if you want to classify a particular bird species, the base dataset can contain images of many other types of birds. You can also produce more data yourself, using data augmentation techniques or Generative Adversarial Networks (GANs).
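As a small illustration of the data augmentation idea, the sketch below produces extra copies of an image by random horizontal flips and small shifts. It assumes an image is a plain list of pixel rows. The function names and the choice of these two transforms are illustrative assumptions, and real projects would normally use an image library instead.

```python
import random

def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left with `fill`."""
    shifted = []
    for row in image:
        p = min(pixels, len(row))
        shifted.append([fill] * p + row[:len(row) - p])
    return shifted

def augment(image, n_copies, max_shift=1, seed=None):
    """Return n_copies randomly flipped and shifted variants of image."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_copies):
        # Flip with probability 0.5, otherwise start from a plain copy.
        base = hflip(image) if rng.random() < 0.5 else [row[:] for row in image]
        variants.append(shift_right(base, rng.randint(0, max_shift)))
    return variants
```

The original image is never modified, so it can be kept in the support set alongside its augmented variants.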
Parameter Level Approach
From the parameter level point of view, it is easy to overfit on few shot learning samples, because the parameter space is extensive and high dimensional. To address this, the parameter space is limited, and regularization and proper loss functions are used. Model performance can also be improved by directing the model through the extensive parameter space. A standard optimization algorithm might not give reliable results with so little training data, which is why the model is trained at the parameter level to find the best route through the parameter space. This technique is called meta learning.
Meta Learning algorithm
In the classical paradigm, an algorithm is said to learn if its performance on a task improves with experience. In the meta-learning paradigm used for few-shot learning, there is a set of tasks instead. An algorithm is said to be learning to learn if its performance at each task improves with experience and with the number of tasks it has performed. Such an algorithm is called a meta learning algorithm.
When the goal is to learn a distance function over objects, it is called metric learning. Metric learning algorithms generally learn how to compare data samples. In few shot classification, they classify query samples based on their similarity to the labeled support samples.
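To make similarity-based classification concrete, here is a minimal sketch that assigns a query to the class whose mean support embedding is nearest. It assumes the embeddings are already computed and uses plain Euclidean distance in place of a learned metric. The function names are hypothetical, and the class-mean idea is the one behind prototypical networks.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_by_prototype(support, query_vec):
    """support: list of (embedding, label) pairs.

    Returns the label whose class mean (prototype) is closest to query_vec.
    """
    by_label = {}
    for vec, label in support:
        by_label.setdefault(label, []).append(vec)
    prototypes = {label: mean_vector(vecs) for label, vecs in by_label.items()}
    return min(prototypes, key=lambda label: euclidean(prototypes[label], query_vec))
```

In a real metric learning system the embedding function itself is trained so that these distances become meaningful. Here only the comparison step is shown.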
Gradient based meta learning
In a gradient-based meta learning approach, a meta learner needs to be built along with a base learner. The meta learner model learns across all the episodes, while a base learner model is initialized and trained within each episode. The meta learner performs this initialization.
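The interplay between meta learner and base learner can be sketched on a toy problem. The code below is a first-order approximation in the spirit of MAML, not the full algorithm: the model is a single weight w fitting y = w * x, and each task is a line whose slope is drawn near 3. The task family, learning rates, and function names are all assumptions chosen for illustration.

```python
import random

def grad(w, data):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def loss(w, data):
    """Mean squared error for the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def make_task(rng, k):
    """Hypothetical task family: y = a * x with slope a between 2 and 4."""
    a = rng.uniform(2.0, 4.0)
    points = [(x, a * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(2 * k))]
    return points[:k], points[k:]  # support set, query set

def meta_train(episodes=2000, k=5, inner_lr=0.1, outer_lr=0.01, seed=0):
    rng = random.Random(seed)
    w0 = 0.0  # the meta learner's shared initialization
    for _ in range(episodes):
        support, query = make_task(rng, k)
        # Base learner: start from w0 and take one step on the support set.
        w = w0 - inner_lr * grad(w0, support)
        # Meta learner: first-order update using the query gradient
        # evaluated at the adapted weight.
        w0 -= outer_lr * grad(w, query)
    return w0
```

After training, the initialization w0 settles near the middle of the slope range, so a single gradient step on a new task’s support set already moves the base learner close to that task’s slope.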
Algorithms for Few Shot Image Classification
Certain meta learning algorithms are instrumental in solving few shot learning problems. They include:
- MAML or Model Agnostic Meta Learning
- Prototypical Network
- Relation Network
- Matching Network
Few Shot Object Detection
YOLOMAML is one algorithm proposed for few shot object detection. It blends two pieces: the MAML algorithm and the YOLOv3 object detector. MAML can be applied to a wide variety of deep neural networks, which is why it is well accepted by developers.
GitHub is another good place to go into the details of few-shot learning. You can find or create repositories with clean, readable code and data to test as part of few shot learning exercises. These projects are normally written in Python with PyTorch, alongside other related tooling. Few-shot learning applications are tested continually, both to check whether the algorithms work and to improve them where constant modification is needed. An algorithm is refined with each iteration, and testing carries on until the final result is achieved.
Python is also closely tied to few-shot learning development. Python is an object-oriented, high-level programming language with dynamic semantics. Its simple syntax makes it easy to learn and reduces the cost of maintaining programs. Few-shot learning in NLP is another emerging trend, with several experiments under way. NLP, or natural language processing, is the part of artificial intelligence concerned with the interaction between human language and computers. It makes it possible for computers to read, interpret, and pick out the important parts of text needed for a task.
Few-shot learning tutorial and supervised learning
Whenever few shot learning is discussed, a comparison with supervised learning naturally comes up. Traditional supervised learning involves a huge training set, and test samples come from classes seen during training. In few shot learning, the query sample has not been seen before and comes from an unknown class. That is the main difference between few shot learning and traditional supervised learning.
Applications of few shot learning
Few shot learning already has a large number of applications in data science, including robotics and, of course, computer vision. It is used in image and character recognition and in the classification of various items. It is applied in NLP tasks such as sentiment analysis, text classification, and translation. In robotics, it is a promising way to train robots on different activities. We may see some very good applications based on these in the near future.
In short, few shot learning is a fast-developing, systematic approach with a promising future. A lot of research and development still needs to be done before it can easily reach and benefit millions of people.
Few Shot Learning
We tried to tackle the challenge of few-shot learning for our project, first using LSTMs and then pivoting to transformer architectures. The motivation for our project is simple. Over the past few years, deep neural networks have gotten exceedingly good at classifying images. For example, the convolutional network shown here could recognize that the image on the right is a giraffe with very high accuracy.
However, to do so, it needed to be trained on hundreds, if not thousands, of labeled training examples. On the other hand, a baby can see just one or two pictures of giraffes in a picture book and recognize a giraffe going forward. It does this by recognizing the important features of the image class, such as a long neck, brown spots, and a tail. Our goal is to instill similar intuition into a deep learning model and have it extract the important features from an image class that only has a couple of examples. More formally, this is called N-way K-shot learning, where you’re given N classes with K examples for each class, and you have to learn enough from these to recognize what class each image in the query set belongs to.
While the bright minds at Google might be able to handle the N-way K-shot learning problem, we decided to simplify things a bit and handle just one-way K-shot learning, which means we’re only shown one new class of K images and have to recognize images in the query set as either belonging or not belonging to that class. Our approach to this problem is to use a convolutional neural network to extract feature embeddings for each of our K example images and our test image.
We then want some way of combining the K example feature embeddings into a single representative embedding, which captures only the most important features. Then we will use some distance metric to compare this merged feature embedding to the test image’s embedding to determine whether the two belong to the same class or not.
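The merge-and-compare idea can be sketched as follows. This is not the project’s code: it assumes the embeddings are already available as lists of floats, uses a plain element-wise average as a stand-in for the merge step, and uses cosine similarity for the comparison. The function names and the toy three-dimensional vectors are made up for illustration.

```python
import math

def merge_by_mean(embeddings):
    """Combine K embeddings into one by element-wise averaging
    (an assumed, simplest-possible merge rule)."""
    k = len(embeddings)
    return [sum(column) / k for column in zip(*embeddings)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" standing in for CNN features.
examples = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0], [0.9, 0.0, 0.1]]
merged = merge_by_mean(examples)
similar = cosine_similarity(merged, [1.0, 0.0, 0.1])    # points the same way
different = cosine_similarity(merged, [0.0, 1.0, 0.0])  # nearly orthogonal
```

A test image whose embedding points in roughly the same direction as the merged embedding scores close to 1, while an unrelated one scores close to 0.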
For this task, we chose to use the CIFAR-10 and CIFAR-100 datasets. We chose these because each class is pretty distinct, and the images are relatively small, just 32 by 32 pixels, which cuts down on training time quite a bit. We decided to use the CIFAR-100 dataset for training and the CIFAR-10 dataset for testing. This is possible because none of the classes in the CIFAR-10 dataset are present in the CIFAR-100 dataset. So all of our test-time data will be 100 percent new to the model, with no classes that it has seen before.
LSTM Model Architecture
Next, we decided to use an LSTM as our technique for merging the feature embeddings. At each step, the feature embedding from one of the examples is passed into the LSTM, and our idea is that only the most important parts of that embedding will be stored in the LSTM’s memory cell.
Once the LSTM has processed all of the example images, we take the contents of its memory cell and use that as our merged feature embedding to compare to the test image. We also decided to use cosine distance as our similarity metric, since it extends well to high-dimensional vectors, and the embedding calculated by our CNN was 512 dimensional.
This is the result of our LSTM model architecture. There are a few things we noted in particular from this graph. The first was that the variance was very high for both test accuracy and train accuracy. We attributed this to the fact that with only K shots, in this case five examples, the actual examples you get have a big impact on your ability to classify the test class. If your five examples don’t generalize well to the class, that can lead to this kind of variance across our testing.
Additionally, we noted that we were beginning to overfit on our training data, because the test accuracy flattened while the training accuracy continued to increase. Our accuracy was also not as high as we would have liked. This led us to consider different ways to improve the architecture.
The first improvement we made was that, instead of using a hand-designed CNN, we chose to follow the architecture of a CNN that we knew was effective, and we chose VGG-16.
So we first created a CNN following the VGG-16 architecture, using three by three convolutions with max-pooling and progressively greater filter depths. We trained this CNN on CIFAR-100 as just a typical image classification problem and saved the weights from that training. We then loaded the weights back into the model and removed the fully connected layers and the softmax, that is, the classification head, which we needed for originally training the model but don’t need for outputting the features that our K-shot model requires. Once the classification head was removed, so that the model outputs features instead of labels, we froze the weights of the model so that only the LSTM would be training and not the feature extractor itself. We saved that and used it instead of our hand-designed CNN.
LSTM + VGG-16
This shows our improved model with the pre-trained VGG CNN instead of our hand-designed one. It led to an increase in performance, although again not as much as we would have liked. That led us to consider other ways of combining the feature embeddings, and we switched to using a transformer instead of an LSTM.
Transformer Model Architecture
Given our limited success with the LSTM architecture and the recent successes of Transformers in multiple domains, we decided to attempt a transformer model architecture.
This differs from our LSTM architecture only in the last step of combining the embeddings. We get the same embeddings from our VGG CNN, and instead of using the LSTM approach, we create a self-attention vector for each embedding based on all the other embeddings, and then average those self-attention vectors.
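A rough sketch of this attend-then-average step is shown below. It is a simplification, not the project’s model: it uses scaled dot-product attention with no learned query, key, or value projections, and it lets each embedding attend to every embedding including itself. Both choices, and the function names, are assumptions made to keep the example small.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_merge(embeddings):
    """Attend each embedding over all embeddings, then average the results."""
    d = len(embeddings[0])
    attended = []
    for query in embeddings:
        # Scaled dot-product score between this embedding and every embedding.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        # Weighted sum of all embeddings gives this example's attention vector.
        attended.append([sum(w * v[i] for w, v in zip(weights, embeddings))
                         for i in range(d)])
    k = len(attended)
    return [sum(column) / k for column in zip(*attended)]
```

Unlike an LSTM, this merge does not depend on the order in which the example embeddings are supplied, since every embedding is treated symmetrically.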
Our Model in Action
Here we have our model in action; these are results from our Transformer plus VGG model. On the left, we can see six bird images whose embeddings will be combined into a merged embedding, which will be compared to the embedding of each of the input images on the right. On the top, we have three correctly classified bird images. On the bottom, we have three images that are not birds, two of which are classified correctly as not birds, and one of which, a frog, is misclassified. Our model classifies an image as being of the same class as the examples if it computes a similarity above 0.6. We chose this threshold because we wanted to make it slightly harder to group classes, as we thought that in high-dimensional space vectors are more often orthogonal.
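The decision rule and the intuition about orthogonality can be illustrated with a short experiment. The sketch assumes cosine similarity as the score and uses random 512-dimensional Gaussian vectors, which are only a stand-in: real CNN embeddings are not random, so this shows the geometric tendency rather than the behavior of the actual model. The function names are hypothetical.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def is_same_class(merged_embedding, image_embedding, threshold=0.6):
    """Decision rule from the text: same class if similarity exceeds 0.6."""
    return cosine_similarity(merged_embedding, image_embedding) > threshold

# Random 512-dimensional Gaussian vectors (an assumption, not real features).
rng = random.Random(0)
largest = 0.0
for _ in range(200):
    a = [rng.gauss(0.0, 1.0) for _ in range(512)]
    b = [rng.gauss(0.0, 1.0) for _ in range(512)]
    largest = max(largest, abs(cosine_similarity(a, b)))
```

Across these random pairs, the largest absolute similarity stays far below the 0.6 threshold, which is consistent with the idea that unrelated high-dimensional vectors tend to be close to orthogonal.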
Transformer + VGG Accuracy
Here we have the accuracy from our Transformer plus VGG model. Notice the huge difference between test and train accuracy almost right off the bat: our model is overfitting. But also notice how high the test accuracy starts, above 70 percent. This is due largely to the strength of the VGG feature extraction.
LSTM + VGG Accuracy
Here we have the LSTM plus VGG accuracy. It’s a very similar story to the transformer, but notice the train accuracy above 95 percent. The discrepancy is very likely due to the fact that VGG was trained on CIFAR-100, which is our training set, and not CIFAR-10. So we think that if those VGG weights were not frozen and the model could adjust how it learned those feature embeddings, that might increase test performance.
Here we have all of our results, including baselines. At the bottom, in purple, we have our simple CNN baseline. This model is our simple CNN trained on K positive examples, with the rest of the data set as negative examples. It grossly overfits and gets poor accuracy, as we expected. But it demonstrates how traditional machine learning techniques struggle with a lack of data, which is why we chose to go with few-shot learning. Right above that, in red, we have our toy CNN trained alongside an LSTM. This is a very small CNN, and for that reason it can’t extract very meaningful features and therefore performs poorly. Right above that, we have our VGG trained with an LSTM, which does not perform as well as we were hoping. Our justification for why this lags behind the other two is that LSTMs are naturally order based, and while they can learn to treat each example equally, that does not come naturally to them. Above that, we have the VGG baseline, which was created by merging the K VGG embeddings of the K examples and comparing that to the embedding of the input image, and then our transformer architecture.
The ethical implications that we felt had the greatest impact start with a negative one. With few-shot learning, someone could take just a few pictures of you, a couple on the street or a few from your social media presence. With traditional deep learning techniques, that would not be enough training data to classify you. With few-shot techniques, especially N-way K-shot learning, just a couple of pictures of you from the outside world could be enough to determine who you are, for surveillance or other nefarious tasks. However, there is one positive ethical implication of our model: because of our heavy reliance on transfer learning and pre-trained CNNs, we can reuse pre-developed models and decrease the amount of training we’re doing and the resources we require, which is better for the environment and lowers carbon emissions.