Anatomy of a learning problem
Mark D. Reid, James Montgomery, Mindika Premachandra
Abstract — In order to relate machine learning problems we argue that we need to
be able to articulate what is meant by a single machine learning problem. By
attempting to name the various aspects of a learning problem we hope to clarify
ways in which learning problems might be related to each other. We tentatively put
forward a proposal for an anatomy of learning problems that will serve as scaffolding
for posing questions about relations. After surveying the way learning problems are
discussed in a range of repositories and services, we argue that the terms
used to describe problems can help us better understand a range of viewpoints within
machine learning, ranging from the theoretical to the practical.
As machine learning researchers and practitioners, we define and solve problems every day.
However, the sheer diversity of modern machine learning can make it difficult for, say, a learning theorist who focuses on bandit problems to appreciate what kind of problem a deep-belief network researcher is trying to solve, and vice versa. As the FAQ from the MLComp project puts it:
Anyone who has worked with machine learning knows that it’s a zoo. There are a dazzling array of methods published at conferences each year, leaving the practitioner, who just wants to choose the best method for his/her task, baffled.
We argue that an unavoidable part of the problem is that different research aims necessarily focus on different aspects of a learning problem and therefore use and emphasise different terms. One could also reverse cause and effect by invoking the Sapir-Whorf hypothesis. This linguistic theory states that language is not just a medium for articulating our thoughts but is a medium that affects the way in which we conceptualise the world. In his Notation as a tool of thought [2], Ken Iverson makes a similar point concerning programming, rather than human, languages, quoting George Boole: “That language is an instrument of human reason, and not merely a medium for the expression of thought, is a truth generally admitted.”
Whatever the direction of causality, there is a strong case for the correlation between the diversity of terminology and the machine learning “zoo” we find ourselves in. In an attempt to find some structure or common ground, this paper presents some tentative steps towards surveying and discussing terminology used to describe machine learning problems.
For this preliminary work, we will take a fairly narrow view of machine learning and begin by focusing on describing properties of the most common varieties of classification, regression, clustering, and some others. We start by trying to pin down some high-level terminology and, in particular, what we mean by “machine learning problem”.
There are relatively few books, papers, or projects that attempt to categorise learning problems. Carbonell, Michalski, and Mitchell’s textbook [3] from 1983 sketches a taxonomy of machine learning research at the time along three dimensions: underlying learning strategies, representation of knowledge, and application domain. We find some discussion of classification and regression problems under “learning strategies” that learn “from examples”.
A distinction is made under this heading between “one-trial” and “incremental” learning.
Problems such as clustering fall under those strategies that learn “from observation and discovery”. Along the “representation of knowledge” dimension is a list of what is required of a learning algorithm to solve a problem. These include “decision trees”, “grammars”, “parameters”, “logical expressions”, and more.
In his later book, Mitchell [4] introduces the notion of a Well-Posed Learning Problem by defining what it means for a computer to learn:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, measured by P, improves with experience E.
Mitchell uses the term “task” here to denote broad categories such as reinforcement learning and concept learning (“Inferring a boolean-valued function from training examples of its input and output”).
Both of these takes on describing machine learning arguably focus on what learning is rather than what learning acts upon. Our view, which we articulate further below, attempts to turn around Mitchell’s definition of learning by saying that a learning problem consists of data, models, and an assessment procedure, and by defining learning to be any process that returns models assessed to be of higher quality when presented with larger amounts of data.
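To make this problem-focused definition concrete, the following is a minimal sketch in Python. It is an illustration only: the names `LearningProblem`, `constant_learner`, and `quality_curve`, the toy data, and the choice of negative squared error as the assessment are all assumptions made for this example and are not part of the proposal itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]       # (instance, training signal)
Model = Callable[[float], float]    # maps an instance to a prediction


@dataclass
class LearningProblem:
    """A problem as a triple: data, a model class, and an assessment."""
    data: Sequence[Example]
    model_class: str                    # informal name, e.g. "constant"
    assess: Callable[[Model], float]    # higher value means higher quality


def constant_learner(examples: Sequence[Example]) -> Model:
    """Toy learner: always predict the mean training signal it was given."""
    mean = sum(y for _, y in examples) / len(examples)
    return lambda x: mean


def quality_curve(problem: LearningProblem,
                  learner: Callable[[Sequence[Example]], Model],
                  sizes: Sequence[int]) -> List[float]:
    """Assess the models a learner returns for growing prefixes of the data."""
    return [problem.assess(learner(problem.data[:n])) for n in sizes]


# Hypothetical toy data: the first signal is an outlier, the rest equal 1.0.
data = [(0.0, 5.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
# Assessment: negative squared error on a single held-out pair (4.0, 1.0).
problem = LearningProblem(data, "constant", lambda m: -(m(4.0) - 1.0) ** 2)
curve = quality_curve(problem, constant_learner, [1, 2, 4])
print(curve)  # [-16.0, -4.0, -1.0]
```

On this toy problem the assessed quality rises as the learner sees more data, which is the sense of "learning" used above. Nothing in the sketch guarantees such behaviour in general.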
To expand on this more problem-focused approach, we turn to repositories and services for machine learning on the web for examples of taxonomies “in the wild”. These projects necessarily have to grapple with the question of how to classify a variety of data sets, methods, performance measures, etc. in order to present them in a unified manner. We survey some of the terminology used by these projects in §3 below.
The space of machine learning problems is already rich and rapidly expanding, so any attempt at comprehensiveness is doomed. We do not claim we are particularly systematic in our survey of projects either. Projects were chosen by the authors’ familiarity with the projects or their perceived visibility within the machine learning literature. This means there is very likely a bias in the terminology towards describing predictive machine learning problems such as classification and regression (i.e., those solved by minimising some loss on previously unseen instances).
We should also note what we are not trying to do with this line of work. We are not proposing a single vocabulary that we hope everyone will adopt. Our intention is merely to survey commonly used terms in part of machine learning to better understand why those terms are used.
With the data, model, assessment taxonomy of learning problems in place, we now propose a tentative refinement to build an anatomy of learning problems. Many of these terms were chosen to echo existing usage but are deliberately vague or under-specified to allow for broad applicability. The hope is that these place-holders will take on more well-defined meaning over time. The aim is not to prescribe a new vocabulary but rather to offer a starting point for discussing how we talk about machine learning problems.
The data available for a learning problem is of central importance for applied machine learning. We will speak of data as a collection of instances each with an associated training signal (typically in the form of labels) represented in a common format.
Formats describe how data is represented. Generally speaking, many learning problems use features to describe instances. These can be thought of as projections of the instances down to simple values such as numbers or categories. Another common format is the relational matrix. This can be thought of as describing features of pairs of instances of the same kind (e.g., kernel matrix) or between different types of instances (e.g., viewers and movies in collaborative filtering). Both feature vectors and matrices can be either sparse or dense depending on whether every entry is explicitly described or not.
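The distinction between dense and sparse formats can be sketched as follows. The helper names `to_sparse`, `to_dense`, and `rating_row`, and the viewer and movie names, are hypothetical and serve only to illustrate the two formats. Treating a missing relational entry as unobserved rather than zero is a further assumption of this example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def to_sparse(dense: Sequence[float]) -> Dict[int, float]:
    """Sparse feature vector: only non-zero entries, keyed by feature index."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}


def to_dense(sparse: Dict[int, float], length: int) -> List[float]:
    """Dense feature vector: every entry is described explicitly."""
    return [sparse.get(i, 0.0) for i in range(length)]


# A sparse relational matrix between two kinds of instances
# (hypothetical viewers and movies, as in collaborative filtering).
ratings: Dict[Tuple[str, str], float] = {
    ("ann", "alien"): 5.0,
    ("bob", "alien"): 2.0,
    ("bob", "brazil"): 4.0,
}


def rating_row(viewer: str, movies: Sequence[str]) -> List[Optional[float]]:
    """One dense row of the matrix; None marks an unobserved pair."""
    return [ratings.get((viewer, movie)) for movie in movies]


print(to_sparse([0.0, 3.0, 0.0, 1.5]))         # {1: 3.0, 3: 1.5}
print(rating_row("ann", ["alien", "brazil"]))  # [5.0, None]
```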
A model is used to describe the class of possible results obtained by applying a learning algorithm to data. It can be thought of as a set of constraints on the class of hypotheses a learner can make about the data (e.g., linear models).
An important feature of a model is the form of its predictions. Typically, these are the same form as the training signal in the data but may not be. For example, in class probability estimation, a data set contains a categorical training signal but predictions are probability distributions over those categories.
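As a small illustration of a prediction whose form differs from the training signal, the sketch below turns a categorical signal into a probability distribution over the categories. Using empirical frequencies is an assumption made for brevity, as are the function name `class_probability_model` and the labels. Any class probability estimator would show the same mismatch of forms.

```python
from collections import Counter
from typing import Dict, Sequence


def class_probability_model(signals: Sequence[str]) -> Dict[str, float]:
    """Map a categorical training signal to a distribution over categories."""
    counts = Counter(signals)
    total = len(signals)
    return {label: count / total for label, count in counts.items()}


# The training signal is categorical ...
labels = ["spam", "ham", "ham", "ham"]
# ... but the prediction is a probability distribution over the categories.
prediction = class_probability_model(labels)
print(prediction)  # {'spam': 0.25, 'ham': 0.75}
```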
Once a model is created it must be assessed in some way. This is commonly achieved by applying the model to some new data and evaluating its predictions against the novel training signal using a loss: a function that assigns a penalty to prediction-signal pairs.
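A loss in this sense, and its use for assessment on new data, might be sketched as below. The two losses shown are standard textbook choices. The names `mean_loss` and `sign_model` and the held-out data are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def zero_one_loss(prediction: str, signal: str) -> float:
    """Penalty of 1 for a wrong categorical prediction, 0 otherwise."""
    return 0.0 if prediction == signal else 1.0


def squared_loss(prediction: float, signal: float) -> float:
    """Penalty that grows with the squared distance to the signal."""
    return (prediction - signal) ** 2


def mean_loss(model: Callable, new_data: Sequence[Tuple], loss: Callable) -> float:
    """Average penalty of a model's predictions over (instance, signal) pairs."""
    return sum(loss(model(x), y) for x, y in new_data) / len(new_data)


def sign_model(x: float) -> str:
    """A fixed toy model standing in for the output of some learner."""
    return "pos" if x > 0 else "neg"


held_out = [(-1.0, "neg"), (2.0, "pos"), (0.5, "neg"), (3.0, "pos")]
print(mean_loss(sign_model, held_out, zero_one_loss))  # 0.25
```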
With the main components of a learning problem sketched, we can turn to describing common patterns or procedures for solving a learning problem.
We first introduce some general terminology to talk about the objects or resources of a learning problem and how they interact with each other in various phases of a solution.
The resource that solves a learning problem by constructing a model given some data is called a learner. The application of a learner to data to create or update a model is called the training phase. A prediction phase is when a model is applied to an instance to make a prediction. The assessment of a model is performed in the evaluation phase.
During the training phase, the access a learner has to the training signal can be described as supervised (a training signal for every instance), semi-supervised (signal for some instances but not all), or unsupervised (no training signal). In active learning, access to the training signal is “pull” rather than “push” in that the learner can request it for specific instances rather than being given them all.
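These kinds of access can be illustrated with a short sketch. Representing a missing training signal as `None`, and the names `access_type` and `LabelOracle`, are assumptions of this example rather than established conventions.

```python
from typing import Dict, List, Optional, Sequence


def access_type(signals: Sequence[Optional[str]]) -> str:
    """Name the access a learner has, given one signal slot per instance."""
    if not signals:
        raise ValueError("need at least one instance")
    labelled = sum(signal is not None for signal in signals)
    if labelled == len(signals):
        return "supervised"
    if labelled == 0:
        return "unsupervised"
    return "semi-supervised"


class LabelOracle:
    """Active access: the learner pulls signals for instances it chooses."""

    def __init__(self, hidden_signals: Dict[str, str]) -> None:
        self._hidden = dict(hidden_signals)
        self.requests: List[str] = []  # record of what the learner asked for

    def request(self, instance: str) -> str:
        self.requests.append(instance)
        return self._hidden[instance]


print(access_type(["pos", None, "neg"]))  # semi-supervised
oracle = LabelOracle({"x1": "pos", "x2": "neg"})
print(oracle.request("x2"))               # neg
```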
In addition to the training signal, the learner may have some access to the way its models will be assessed. This will be called feedback. In full feedback problems the learner can request an assessment of a model’s predictions during training. In some problems (e.g., bandit problems) the learner only has partial feedback. In these cases, the loss for certain predictions is not available.
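The difference between full and partial feedback can be sketched for a single round in which the learner picks one of several possible predictions. The function names and the loss values are hypothetical. The partial case mimics a bandit setting, where only the loss of the chosen prediction is revealed.

```python
from typing import Dict


def full_feedback(losses: Dict[str, float]) -> Dict[str, float]:
    """The learner may see the loss of every possible prediction."""
    return dict(losses)


def partial_feedback(losses: Dict[str, float], chosen: str) -> Dict[str, float]:
    """Bandit-style: only the loss of the chosen prediction is revealed."""
    return {chosen: losses[chosen]}


# Hypothetical losses for one round with three possible predictions.
round_losses = {"a": 0.9, "b": 0.1, "c": 0.5}
print(full_feedback(round_losses))          # {'a': 0.9, 'b': 0.1, 'c': 0.5}
print(partial_feedback(round_losses, "c"))  # {'c': 0.5}
```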
The various phases of interaction between resources can take place in several different ways. Modes are used to describe the relationship between phases. Modes can be used as modifiers when describing learning phases.
Batch training, prediction, and evaluation phases occur independently of each other. In online learning, training, prediction, and possibly evaluation are interleaved in blocks of one or more instances. The inductive and transductive modifiers apply to the training phase of learning and describe whether the learner has access to the data its models will be assessed upon (transductive) or not (inductive).
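The batch and online modes can be contrasted with a toy learner that predicts the mean of the signals it has been trained on. This is a sketch under simplifying assumptions: instances carry no features, blocks contain a single instance, and the names `batch_run` and `online_run` are hypothetical.

```python
from typing import List, Sequence


def batch_run(train_signals: Sequence[float], test_size: int) -> List[float]:
    """Batch mode: train once on all data, then predict in a separate phase."""
    mean = sum(train_signals) / len(train_signals)
    return [mean] * test_size


def online_run(stream: Sequence[float], initial: float = 0.0) -> List[float]:
    """Online mode: predict, see the signal, update, one instance at a time."""
    predictions: List[float] = []
    total, count = 0.0, 0
    for signal in stream:
        # Prediction phase: uses only the signals seen so far.
        predictions.append(total / count if count else initial)
        # Training phase: update the model with the revealed signal.
        total += signal
        count += 1
    return predictions


print(batch_run([2.0, 4.0], test_size=2))  # [3.0, 3.0]
print(online_run([2.0, 4.0, 6.0]))         # [0.0, 2.0, 3.0]
```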
We briefly survey two repositories and two services for their use of terminology for describing various aspects of machine learning problems and their solutions. Our focus will be on the language used by each to describe and distinguish between different classes of learning problem (e.g., classification, regression, etc.). Of particular importance are the terms used to describe the different dimensions learning problems can be placed upon.
The venerable UCI machine learning repository has been available in various forms since 1987 [5].
Currently, it holds 189 data sets for a variety of problem types, including classification, regression and clustering problems. In the default table view of the repository, entries are described by their Name, Data Type, Default Task, Attribute Type, Number of Instances, Number of Attributes, and Year. The search criteria additionally include Area and Format Type. The term Area denotes an application domain (e.g., “Life Sciences”, “Business”, “Game”, etc.) while Format Type is either “Matrix” or “Non-Matrix”. Of these, the most pertinent to classifying problems are Data Type, Default Task, and Attribute Type.
The listed categories in the Default Task field are “Classification” (126 data sets), “Regression” (16), “Clustering” (8), and “Other” (44). Under “Other” are tasks described as “Causal-Discovery”, “Recommender-Systems”, “Function-Learning”, and “N/A”.
Attribute Types describe the values in instances’ features as well as their output type. These include “Integer”, “Categorical”, “Real”, and combinations thereof. A total of 24 out of the 189 data sets do not have an entry for Attribute Types. This is occasionally due to a lack of documentation but also occurs for data sets with instances that are not easily described in attribute-value format, for example relational problems such as finding domain theories.
Data Types can be loosely described as denoting a class of problem. Entries include “Data-Generator”, “Domain-Theory”, “Image”, “Multivariate”, “Relational”, “Sequential”, “Time-Series”, “Spatial”, “Text”, “Univariate”, “Transactional” and combinations of some of those terms (e.g., “Text, Sequential”). The Format Type is either “Matrix” or “Non-Matrix”.
The MLData service at http://mldata.org describes itself as “a repository for your machine learning data”. It allows its users to create four entities that are stored in the repository: data (“Raw data as a collection of similarly structured objects”), methods (“Descriptions of the computational pipeline”), tasks (a data set plus some performance measure to optimise on the data), and challenges (“Collections of tasks which have a particular theme”).
At the time of writing there are 27 publicly available tasks at MLData which fall into three task types: “binary classification”, “multiclass”, and “regression”. A task specifies an input and output format (e.g., “real-valued matrix” to “+1/–1”), a performance measure (e.g., “accuracy”), a data set, and a split of the data set’s variables into input/output and instances into train/validation/test. Of the 21 binary classification tasks, 20 use “accuracy” as the performance measure and one uses “ROC curve”; both multiclass problems use “accuracy”; and three of the four regression tasks use “root mean square error” with the other using “mean absolute error”.
Under methods there are currently 9 entries, all of which describe (to varying degrees of completeness) the application of a particular software package to one or more tasks using a particular configuration of learning parameters.
The Google Prediction API project [6] offers a web service API to a number of (undisclosed) learning algorithms.
Users can upload data, use it to train a predictor, then make predictions on new data.
In the Google Prediction API several problem-related terms are used with a specific meaning.
Data refers to a tabular representation of instances in CSV format that is uploaded to Google’s data storage service. Each column of the data table is a feature that can take values that are exclusively either numerical or strings. The first column of the data is the target value and can be either numerical (defining a regression problem) or a string (for classification problems). Data for classification problems can specify an optional utility for each example.
The training service will aim to predict better on higher utility examples.
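The convention that the type of the first column determines the kind of problem can be illustrated locally. The sketch below does not call the Google Prediction API and does not claim to reproduce its parsing rules. The function names and the sample rows are hypothetical, and it uses only Python's standard `csv` module.

```python
import csv
import io


def is_number(text: str) -> bool:
    """True if the text parses as a float (this also accepts e.g. 'nan')."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def problem_type(csv_text: str) -> str:
    """Guess the problem type from the first (target) column of CSV data."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("no data rows")
    targets = [row[0] for row in rows]
    # Assumption: all-numeric targets mean regression, otherwise classification.
    return "regression" if all(is_number(t) for t in targets) else "classification"


print(problem_type('"spam","win money now"\n"ham","see you at noon"\n'))
print(problem_type("3.5,1,0\n4.0,2,1\n"))
```

The first call reports a classification problem and the second a regression problem, mirroring the distinction drawn above between string and numerical target values.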
After the data is uploaded, a model can be trained by sending a URL for the data to the service. Depending on the type of the target value in the data set, either a regression or categorical model is returned along with an evaluation of the model’s classification accuracy (for categorical models) or mean square error (for regression models). The evaluation is generated via cross-fold validation.
Once trained, categorical models (but not regression models) can be updated with additional training examples. This is referred to as streaming training.
The MLComp service [7] describes itself as “a free website for objectively comparing machine learning programs across various datasets for multiple problem domains.” At the time of writing the service hosts 382 data sets, 325 programs, 10 domains, 3 domain types, and has performed 9774 runs.
Unlike the Google Prediction service, MLComp admits a wider range of problem types than just classification and regression. In particular, they categorise problems into domains which can be solved via conforming programs during a run. Domains are classes of problems “equipped with the following”: “a particular dataset format”, “a particular program interface”, and “standard evaluation metrics”. A user of the MLComp service can define their own domain by specifying these properties via a structured markup language (YAML).
Domains are further categorised into three broad domain types. In supervised learning a run consists of a learn and test phase. In a performing domain (e.g., clustering, optimisation), a run just applies a program to a single data set. In an interactive learning domain (e.g., online learning and bandit problems), the training and evaluation phases are interleaved so that during a run the program repeatedly receives an unlabelled instance, makes a prediction, and then has the label revealed.
The MLComp framework also recently added support for reductions between classes of learning problems. As described by Beygelzimer et al. [8], these are techniques for transforming one type of problem (e.g., multiclass classification) into others (e.g., binary classification) in a way that provides some guarantees regarding generalisation performance.
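To show the general shape of such a reduction, the sketch below implements the familiar one-vs-all scheme, which turns a multiclass problem into one binary problem per class. It is not taken from MLComp or from Beygelzimer et al., and it says nothing about the guarantees they discuss. The centroid-based `binary_learner`, the other names, and the toy data are assumptions chosen to keep the example self-contained.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Scorer = Callable[[Point], float]


def centroid(points: Sequence[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def sq_dist(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def binary_learner(examples: Sequence[Tuple[Point, bool]]) -> Scorer:
    """Toy binary learner: score is higher nearer the positive centroid."""
    pos = centroid([x for x, y in examples if y])
    neg = centroid([x for x, y in examples if not y])
    return lambda x: sq_dist(x, neg) - sq_dist(x, pos)


def one_vs_all(examples: Sequence[Tuple[Point, str]],
               learn_binary: Callable[[Sequence[Tuple[Point, bool]]], Scorer]
               ) -> Callable[[Point], str]:
    """Reduce multiclass to binary: one 'this class or not' problem per label."""
    labels = sorted({y for _, y in examples})
    scorers = {
        label: learn_binary([(x, y == label) for x, y in examples])
        for label in labels
    }
    # Predict the label whose binary scorer is most confident.
    return lambda x: max(labels, key=lambda label: scorers[label](x))


train = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"),
         ((10.0, 0.0), "b"), ((9.0, 1.0), "b"),
         ((0.0, 10.0), "c"), ((1.0, 9.0), "c")]
classify = one_vs_all(train, binary_learner)
print(classify((0.5, 1.0)), classify((9.0, 0.0)), classify((0.0, 9.0)))  # a b c
```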
How useful are the introduced terms and questions in emulating the surveyed projects’ systems of categorising machine learning problems? Obviously, since the terminology we have introduced is generally at a higher level than that used in particular projects, any re-expression of those projects’ terms will not capture the same detail.
What the UCI repository refers to as a data type does not easily fit into our ontology. For example, we note that since the repository mainly holds data sets, much of our process-oriented terminology (that is, phases and modes) is not applicable.
What the MLComp project calls a domain is what we would describe by a combination of format, mode and feedback specification. For example, in BinaryClassification, data in a sparse feature vector format is presented with a binary label as a supervised signal, in batch mode, and assessed using a misclassification loss. WordSegmentation is an unsupervised, batch-trained problem with English sentences as instances, presented in text (UTF8) format and assessed using precision, recall, and F1 losses.
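One way to record such re-expressions is as a small structured description per domain. The sketch below is only an illustration: the class `ProblemDescription`, its field names, and the helper `shared_terms` are hypothetical and are neither part of MLComp nor a fixed part of our proposal.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProblemDescription:
    """A learning problem described with the terms introduced above."""
    name: str
    data_format: str
    access: str
    mode: str
    losses: Tuple[str, ...]


binary_classification = ProblemDescription(
    name="BinaryClassification",
    data_format="sparse feature vectors with a binary label",
    access="supervised",
    mode="batch",
    losses=("misclassification",),
)

word_segmentation = ProblemDescription(
    name="WordSegmentation",
    data_format="English sentences as text (UTF8)",
    access="unsupervised",
    mode="batch",
    losses=("precision", "recall", "F1"),
)


def shared_terms(a: ProblemDescription, b: ProblemDescription) -> Dict[str, object]:
    """Fields on which two problem descriptions agree."""
    a_fields, b_fields = asdict(a), asdict(b)
    return {k: v for k, v in a_fields.items() if b_fields[k] == v}


print(shared_terms(binary_classification, word_segmentation))  # {'mode': 'batch'}
```

Comparing descriptions field by field in this way is one simple route to the questions about relations between problems that motivate the anatomy.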
Just as medical specialists can be dermatologists, neurosurgeons, and heart specialists, so too can machine learning researchers and practitioners be placed into a number of categories, each with their own jargon. A big difference, however, between medicine and machine learning is that medical students are trained broadly so that specialists have many common points of reference.
Can the type of work various people do under the auspices of machine learning be described with reference to the parts of machine learning problems they most emphasise? The following caricatures are intended to highlight the ways in which researchers differ in their focus.
Applied research tends to ask “What can be learnt from this data?” and will use whatever models and assessment procedures are most appropriate. Research into new machine learning techniques is arguably model-focused, developing efficient algorithms that perform well on particular data sets for a narrow range of assessments. Learning theory is arguably assessment-focused, typically ignoring the finer details of data representation and aiming to show when classes of problems are learnable or not, independent of particular techniques.
This proposal barely scratches the surface of the rich and complex range of learning problems that are encountered in the literature. A more comprehensive study would include an analysis of the APIs of various machine learning toolkits in an attempt to understand the implicit ontologies used by the algorithms they implement. An extension of the current approach to a broader range of problems such as reinforcement learning, dimensionality reduction, and other classes is also planned.
1. MLComp, http://mlcomp.org
2. Iverson, K.E., Notation as a tool of thought, Communications of the ACM, vol. 23, 1980.
3. Michalski, R.S., Carbonell, J.G., and Mitchell, T.M., Machine Learning: An Artificial Intelligence Approach, Tioga Publishing Company, 1983.
4. Mitchell, T., Machine Learning, McGraw-Hill, 1997.
5. Asuncion, http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, School of Information and Computer Science, 2007.
6. The Google Prediction API, http://code.google.com/apis/predict/
7. MLComp, http://mlcomp.org
8. Beygelzimer, A., Langford, J., and Zadrozny, B., Machine learning techniques: Reductions between prediction quality metrics, in Performance Modeling and Engineering, Springer, 2008.