Continual machine learning aims to design and develop computational sytems and algorithms that learn as humans do. For example, as humans we are usually good at retaining the information learned in the past, abstracting shareable knowledge from them, and using the knowledge to help future learning and problem solving. The rationale is that when faced with a new situation, we humans use our previous experiences and knowledge to help deal with and learn from the new situation. Our biological neural network is very good at learning and accumulating knowledge continuously in a lifelong manner. I argue that it is essential to incorporate this capability into an artificial intelligence (AI) system to make it versatile, holistic and truly “intelligent”.
In traditional supervised machine learning (ML) setups, we train deep neural network (DNN) models independently on datasets consisting of the inputs and labels, , which I enumerate as tasks , where is the maximum number tasks the model is expected to learn from. This is the most commonly used ML paradigm in practice. However, this leads to huge computational loads and scalability issues because we need aggregate the old and new data each time a new task needs to be learned. This is because knowledge is not retained between tasks (ie. knowledge is not cumulative, and the model cannot learn by leveraging past knowledge). Training such models in this setup requires a extremely large number of training examples to allow the model to generalize to all possible situations in real-world dynamic environments. I’m going to refer to this as Machine Learning (ML) 1.0.
Example: Perception for Autonomous Vehicles
Data collected through driving can often be huge given that the output of the various sensors like cameras, lidar and radar as well as internal signals from CAN bus and other measurement units can easily be of the order of hundreds of megabytes per second. The observations are also not equally informative.
It is difficult to come up with either a one-time dataset or modeling that can cover all possible driving situations that an autonomous vehicle may encounter. Driving rules, traffic behavior and weather conditions change from one place to another and over time. So, it will be necessary to continuously collect such new instances as they are encountered in order for the existing systems to adapt accordingly. Thus, there is a need for the perception modules to learn continuously and reliably from new data in a scalable manner but without compromising on the knowledge obtained from previous training.
In continual learning, a DNN learns the tasks in a sequential manner, and when faced with the task it uses the relevant knowledge gained in the past tasks to help learning for the task. This suggests that a DNN should generate some prior knowledge from the past observed tasks to help future task learning without even observing any information from future task . Ideally, in the continual learning setup we aim for . As we can see, this is different from the traditional ML training setup because the prior knowledge may be generated from the tasks with or without considering the task’s data. That way, the learning becomes truly lifelong and autonomous. It is also important to note that we are not jointly optimizing all new and past tasks, but only the task we are currently training on. Now, I’ll refer to this as ML 2.0.
Catastrophic Forgetting at a Glance
Continual learning sounds awesome but if we ask ourselves why hasn’t there been much success with this learning paradigm, it is due to a well-known problem in DNNs called catastrophic forgetting or catastrophic interference [2, 3].
Catastrophic forgetting poses a grand challenge for DNNs, something that was discovered a long time ago in the 90’s. In artificial neural networks, everytime new information is learned, it forgets a little of what it had already been trained on before. Neural networks store information by setting the values of weights, which are the strengths of the synaptic connections between neurons (see diagram below). These weights are also analgous to the axons in our brain, the long tendrils that connect from one neuron to dendrites of another neuron, and meet at microscopic gaps called synapses. Therefore, the value of the weight between two neurons in an artificial neural network is roughly like the number of axons between neurons in the biological neural network. When we train neural networks, we try to minimize the loss function using some gradient descent optimizer and backpropagation . As a result, the weights are slowly adjusted when we train the neural network iteratively on mini-batches of data until we achieve the desired output or best test performance.
Now, what happens when we take this same model and train on new data that we may have collected a few months later? The instant we start optimizing the model again with gradient descent and backpropagation, we start overwriting the pre-trained weights with new values that no longer represent the values we had for the previous dataset. The network starts “forgetting”.
My Take on Continual Learning
Although, current research in continual lifelong learning is still in its infancy, there have been several promising research directions within the machine learning (ML) community . I strongly believe that within the next decade, DNNs will be capable of learning continually to perform multiple tasks without any human intervention. This will further enhance existing technologies that leverage AI such as perception for self-driving cars, fraud detection, recommender systems, climate monitoring, sentiment analysis, face recognition, and etc. It will also be crucial step towards achieving artificial general intelligence (AGI), a long standing challenge for many AI researchers.
 Chauvin, Y. and Rumelhart, D. E. (eds.). Backpropagation: Theory, Architectures, and Applications, ISBN 0-8058- 1259-8, 1995.
 French, R. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 3(4):128–135, 1999.
 McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. The Psychology of Learning and Motivation, 24: 104–169, 1989.
 Parisi, G. I., Kemker R., Part J. L., Kanan C., and Wermter S. Continual Lifelong Learning with Neural Networks: A Review. Corr, abs/1802.07569, 2018.