Dissertation Notes by Abraham Bagherjeiran.
See Past Results for old work. Currently working on: ICML07?. See the Publication Plan
- 1 Plan
- 2 Required Assumptions
- 3 Performance Stability
- 4 Value of Parameters and Tasks
- 5 Algorithm
- 6 Evaluation / Experiments
- 7 Theory
1 Plan
- Related Work Search
- Extend sample complexity theory to multi-objective problem
- Understand and type up all details of Maurer06Bounds? for linear multi-task learning.
- Extend this to handle a weighted combination of objectives or an arbitrary loss function, defined over X x Y -> R^1 (weighted for R^n). Basically, this reduces to re-defining the function phi_gamma to be a weighted combination of objectives.
- Show that the Linear preprocessor can be reduced to weight vectors.
- Artificial Dataset, Show how to compute the Empirical Rademacher Complexity? for artificial dataset.
- Provably simple learner and hypothesis space
- Working optimization algorithm for simple learner
- Develop theory of parameter values
- Apply to more complex learners, datasets, and parameters
2 Required Assumptions
What is an idea set of parameters, dataset, and learner for multi-objective bias learning? Create an artificial dataset and parameterized learner that has necessary qualities. Ideally, each objective, individually, is a linear or at worst convex function over the set of parameters. The objectives do not change over different tasks. Thus, optimization for several tasks is equivalent to doing it over one task.2.1 Artificial Dataset
Using the nearest-neighbor algorithm, can we design a sequence of tasks that meet these assumptions?3 Performance Stability
Maurer06Bounds? Shows how to construct bounds for multi-task learning using Rademacher Complexity?. This can be extended to the multi-objective framework given a weighted combination of objectives.4 Value of Parameters and Tasks
The value of parameters and tasks depends on having the tradeoff surface. There are two kinds of value of parameters: sampling value and evaluation value. Sampling value refers to which samples to include in the model. Evaluation values refers to which samples to use for new tasks. The sampling value is the change in the tradeoff surface you get by including them in the sample. If we can measure an expected value for the parameters, we can use that to do active sampling. Say we had a fixed number of samples to collect or some cost to applying the algorithm, then we would want to try only the most promising parameters. The evaluation value depends on the tradeoff surface. The value of a dataset is the degree to which it agrees with the tradeoff space. An outlier task is one whose tradeoff surface is far from that of the sequence of tasks.5 Algorithm
Try using Froelich's online Gaussian processes algorithm. Also try using SVMs to do regular optimization. How to do regular optimization in kernel space. Approximate pre-image problem. Whatever algorithm is chosen, it makes certain assumptions about the objective space and tasks. It must be possible to test or prove that new tasks, algorithms, and parameters, fit these assumptions. It must further be shown that when these assumptions are violated, bad things happen. One can then show which algorithms, parameters, datasets are amenable to multi-objective bias learning.5.1 Survey
Investigating existing regression algorithms compatible with idea of multi-objective bias learning. Algorithms are evaluated on the following criteria:- Availability of existing implementation
- Algorithm can run on large number of examples
- Returns a function that can be used with optimization
- Existing work has used the method for optimization
5.1.1 Convex Function Approximation
Tried convex function approximation, fitting a convex function to the data. Found on page 352 / 338 in Boyd06Convex?. The method is extremely complex, requiring O(nData * nAtt) optimization variables. It works only when the function is convex. If a concave function is negated and the approximated, that works but not sure how to get the original function back. Needs more work. Can be used for optimization, but has never been applied in machine learning (so far as I know).5.1.2 Online Gaussian Processes
Implementation of Online Gaussian Proccesses (OGP) available for matlab using the NETLAB package. Has been used for optimization Frohlich03Efficient?. Not sure if it results in a standard function, but can learn kernelized functions, which may work. Long time to optimize the hyperparameters. Updating the GP does not seem to be that hard. Tried the OGP algorithm. Seems to work, ROC curves TP and FP learned reasonably well with Laplace noise model. Does not use that much memory for 4200 points (sonar ROC). Takes some time, though, has linear time complexity. There is a way to do vector-valued learning but it does not seem to work. Crashes on simple ROC curve with more than 500 points, takes a long time. Tried the gpr algorithm. Works reasonably well. Easy to use, better documentation. Does not seem to handle more than 500 points. The sparse mode assumes that you can select a subset of methods. There are greedy approaches to selecting the subset. Perform additional experiments:- Figure out how model information is saved. How posterior is generated. Write Java interface, use matlab file to store structure in Java. Save to temp file, read to byte stream (compress if possible), return to Java.
- Perform ROC regression and perf data regression, compare error and Pareto Rank?.
5.1.3 Support Vector Regression
Implemented in LibSVM package. Tried this in the AAAI07 paper. Regression is possible for performance data, but does not work well for ROC curves. Objective function is linear in kernel space, but transforming the optimal point in objective space to the original space is a difficult problem. Compute average performance across multiple datasets. Perform SVM regression. Then 1-class SVM to find maximum and minimum distance to origin, for constraints. Plug in to MOLP?. If this works OK, try regression in kernel space and compute the Pre Image? of the optimal solution.5.1.4 Use ROC Curve Distance as Kernel Function
What about using the Hausdorff Distance or Matching Distance as a kernel matrix? Then we can associate each parameter vector with the ROC curve. and predict accuracy using the ROC curve as a feature. In optimization, what do you get, a ROC curve, what are the parameters? If you assume that the optimization is a convex combination of ROC curves, then can translate this into a convex combination of parameters. How could we check that this idea works?6 Evaluation / Experiments
Try on the artificial dataset, validate theoretical predictions. Then, try on more complex datasets.- SVM, Regression, Optimization, Linear Function
- Evgeniou06Regularized? linear version, combined with MOLP?
- Evgeniou06Regularized? original version with weights, compare with what you find with MOLP?
- See Multi Task Experiment Notes for more details.
7 Theory
For multi-objective bias learning to work, one must show that the upper bound on the error rate of learning the tradeoff surface decreases with additional tasks, more so than with tasks individually. The error rate also depends on the number of objectives. This upper-bound on the error rate is independent of the algorithm, parameters, objectives, and datasets. It should be possible to obtain tighter upper bounds by considering this information, possibly by using the Rademacher complexity for the regression model. To further reduce the bounds, we can require that the objective functions and parameter spaces be convex. Should be possible to show that some property of parameters, algorithms, datasets, etc. greatly influences error. This property can be computed and presented in a table showing which parameters, algorithms, datasets have lower capacity. Examples of possible measures:- Generalized VC dimension to account for multiple objectives.
- Generalized Rademacher complexity for multiple tasks and multiple objectives.
Waiting...
Comments