## Section: Research Program

### Self-Paced Learning with Missing Information

Many tasks in artificial intelligence are solved by building a model whose parameters encode the prior domain knowledge and the likelihood of the observed data. To use such models in practice, we need to estimate their parameters automatically from training data. The most prevalent paradigm of parameter estimation is supervised learning, which requires the collection of the inputs ${x}_{i}$ and the desired outputs ${y}_{i}$. However, this approach has two main disadvantages. First, obtaining the ground-truth annotation for high-level applications, such as a tight bounding box around every object present in an image, is often expensive. This prohibits the use of a large training dataset, which is essential for learning existing complex models. Second, in many applications, particularly in the field of medical image analysis, obtaining the ground-truth annotation may not be feasible. For example, even experts may disagree on the correct segmentation of a microscopical image because the foreground and the background are similar in appearance.

In order to address the deficiencies of supervised learning, researchers have started to focus on the problem of parameter estimation with data that contains hidden variables. The hidden variables model the missing information in the annotations. Obtaining such data is practically more feasible: image-level labels ("contains car", "does not contain person") instead of tight bounding boxes; partial segmentation of medical images. Formally, the parameters $\mathbf{w}$ of the model are learned by minimizing the following objective:

$$\min_{\mathbf{w} \in \mathcal{W}} \; R(\mathbf{w}) + \sum_{i=1}^{n} \Delta\left(y_i, y_i(\mathbf{w}), h_i(\mathbf{w})\right). \tag{5}$$

Here, $\mathcal{W}$ represents the space of all parameters, $n$ is the number of training samples, $R(\cdot)$ is a regularization function, and $\Delta(\cdot)$ is a measure of the difference between the ground-truth output $y_i$ and the predicted output and hidden variable pair $(y_i(\mathbf{w}), h_i(\mathbf{w}))$.
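To make the terms of objective (5) concrete, the following minimal Python sketch evaluates it on toy data. Everything beyond the objective itself is an assumption made for illustration: the linear score over a hypothetical joint feature array, prediction by taking the highest-scoring (output, hidden) pair, the squared-norm regularizer, and the 0/1 loss on the output.

```python
import numpy as np


def predict(w, features):
    """Return the (output, hidden) pair with the highest linear score.

    `features` is a hypothetical array of shape (num_outputs, num_hidden, dim)
    holding one joint feature vector per candidate pair (y, h).
    """
    scores = features @ w  # shape (num_outputs, num_hidden)
    y_hat, h_hat = np.unravel_index(np.argmax(scores), scores.shape)
    return int(y_hat), int(h_hat)


def objective(w, samples, reg=0.5):
    """Evaluate R(w) + sum_i Delta(y_i, y_i(w), h_i(w)).

    Assumptions: R(w) = reg * ||w||^2 and Delta is the 0/1 loss on the output,
    so the hidden variable influences the loss only through the prediction.
    """
    total = reg * float(w @ w)
    for features, y_true in samples:
        y_hat, _h_hat = predict(w, features)
        total += float(y_hat != y_true)
    return total
```

In this sketch the hidden variable is never annotated: it is inferred jointly with the output and affects the objective only through the predicted output. A loss that also inspects $h_i(\mathbf{w})$ would use the second value returned by `predict`.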

Previous attempts at minimizing the above objective function treat all the training samples equally. This is in stark contrast to how a child learns: first focus on easy samples ("learn to add two natural numbers") before moving on to more complex samples ("learn to add two complex numbers"). In our work, we capture this intuition using a novel, iterative algorithm called self-paced learning (SPL). At iteration $t$, SPL minimizes the following objective function:

$$\min_{\mathbf{w} \in \mathcal{W},\, \mathbf{v} \in \{0,1\}^{n}} \; R(\mathbf{w}) + \sum_{i=1}^{n} v_i \, \Delta\left(y_i, y_i(\mathbf{w}), h_i(\mathbf{w})\right) - \mu_t \sum_{i=1}^{n} v_i. \tag{6}$$

Here, samples with $v_i = 0$ are discarded during iteration $t$, since the corresponding loss is multiplied by 0. The term $\mu_t$ is a threshold that governs how many samples are discarded: for fixed $\mathbf{w}$, the objective is minimized by setting $v_i = 1$ only for samples whose loss is below $\mu_t$. It is annealed at each iteration, allowing the learner to estimate the parameters using more and more samples, until all samples are used.

Our results already demonstrate that SPL estimates accurate parameters for various applications such as image classification, discriminative motif finding, handwritten digit recognition and semantic segmentation. We will investigate the use of SPL to estimate the parameters of the models of medical imaging applications, such as segmentation and registration, that are being developed in the GALEN team. The ability to handle missing information is extremely important in this domain due to the similarities between foreground and background appearances (which result in ambiguities in annotations). We will also develop methods that are capable of minimizing more general loss functions that depend on the (unknown) value of the hidden variables, that is,

$$\min_{\mathbf{w} \in \mathcal{W},\, \theta \in \Theta} \; R(\mathbf{w}) + \sum_{i=1}^{n} \sum_{h_i \in \mathcal{H}} \Pr(h_i \mid x_i, y_i; \theta) \, \Delta\left(y_i, h_i, y_i(\mathbf{w}), h_i(\mathbf{w})\right). \tag{7}$$

Here, $\theta$ is the parameter vector of the distribution of the hidden variables $h_i$ given the input $x_i$ and output $y_i$, and needs to be estimated together with the model parameters $\mathbf{w}$. The use of a more general loss function will allow us to better exploit the freely available data with missing information. For example, consider the case where $y_i$ is a binary indicator for the presence of a type of cell in a microscopical image, and $h_i$ is a tight bounding box around the cell. While the loss function $\Delta(y_i, y_i(\mathbf{w}), h_i(\mathbf{w}))$ can be used to learn to classify an image as containing a particular cell or not, the more general loss function $\Delta(y_i, h_i, y_i(\mathbf{w}), h_i(\mathbf{w}))$ can be used to learn to detect the cell as well (since $h_i$ models its location).
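The self-paced scheme of objective (6) can be sketched as an alternating minimization: for fixed $\mathbf{w}$, the best indicator sets $v_i = 1$ when the loss of sample $i$ is below $\mu_t$; for fixed $\mathbf{v}$, $\mathbf{w}$ is re-estimated on the selected samples only. The Python sketch below is an illustration under stated assumptions, not the implementation used in our experiments: ridge regression with a squared loss stands in for the model, hidden variables are omitted, the initial $\mathbf{w}$ is fit on all samples, and $\mu_t$ grows geometrically. The names `ridge_fit`, `self_paced_ridge`, `mu0` and `growth` are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, reg):
    """Closed-form minimizer of reg * ||w||^2 + sum_i (x_i . w - y_i)^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)


def self_paced_ridge(X, y, mu0=0.5, growth=2.0, reg=1e-3, max_iters=20):
    """Alternate between selecting easy samples and refitting the parameters.

    Returns the final parameters and the boolean selection vector v.
    """
    w = ridge_fit(X, y, reg)              # assumed initialization: all samples
    mu = mu0                              # threshold mu_t of objective (6)
    v = np.zeros(len(y), dtype=bool)
    for _ in range(max_iters):
        losses = (X @ w - y) ** 2         # per-sample loss for the current w
        v = losses < mu                   # v-step: keep samples below threshold
        if v.any():
            w = ridge_fit(X[v], y[v], reg)  # w-step: refit on selected samples
        if v.all():
            break                         # every sample is in use
        mu *= growth                      # anneal: admit harder samples next
    return w, v
```

On data containing one gross outlier, stopping after a few iterations leaves the outlier unselected, whereas running until $\mu_t$ exceeds every loss falls back to the ordinary fit on all samples. The geometric schedule is only one possible choice; any schedule that eventually admits all samples matches the description above.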