Documentation on search space and parallel/sequential tuning (#675)
* add search space doc
* remove redundancy
* add parallel and sequential guidelines
* add caveats
* update doc
* add code example
* wording
* revise example
* add a tutorial link in readme
* wording change
* update readme
* remove redundancy
* Update website/docs/Use-Cases/Task-Oriented-AutoML.md
* Update website/docs/Use-Cases/Task-Oriented-AutoML.md

Co-authored-by: Xueqing Liu <liususan091219@users.noreply.github.com>
@@ -12,9 +12,12 @@
:fire: **Update (2022/08): We will give a [hands-on tutorial on FLAML at KDD 2022](https://github.com/microsoft/FLAML/tree/tutorial/tutorial) on 08/16/2022.**

## What is FLAML

FLAML is a lightweight Python library that finds accurate machine learning models automatically, efficiently and economically. It frees users from selecting learners and hyperparameters for each learner. It can also be used to tune generic hyperparameters for MLOps workflows, pipelines, mathematical/statistical models, algorithms, computing experiments, software configurations and so on.

1. For common machine learning tasks like classification and regression, it quickly finds quality models for user-provided data with low computational resources. It supports both classical machine learning models and deep neural networks.
1. It is easy to customize or extend. Users can choose their desired level of customization from a smooth range: minimal customization (computational resource budget), medium customization (e.g., scikit-style learner, search space and metric), or full customization (arbitrary training and evaluation code).
@@ -24,6 +27,7 @@ and learner selection method invented by Microsoft Research.
FLAML has a .NET implementation in [ML.NET](http://dot.net/ml), an open-source, cross-platform machine learning framework for .NET. In ML.NET, you can use FLAML via low-code solutions like the [Model Builder](https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet/model-builder) Visual Studio extension and the cross-platform [ML.NET CLI](https://docs.microsoft.com/dotnet/machine-learning/automate-training-with-cli). Alternatively, you can use the [ML.NET AutoML API](https://www.nuget.org/packages/Microsoft.ML.AutoML/#versions-body-tab) for a code-first experience.
## Installation
### Python
@@ -383,6 +383,26 @@ automl.fit(X_train, y_train, n_jobs=4, n_concurrent_trials=4)
```

FLAML will perform 4 trials in parallel, each consuming 4 CPU cores. Parallel tuning uses the [BlendSearch](Tune-User-Defined-Function#blendsearch-economical-hyperparameter-optimization-with-blended-search-strategy) algorithm.
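Putting this together, below is a minimal end-to-end sketch of parallel AutoML tuning. The dataset and budget are illustrative assumptions, and ray is assumed to be installed (e.g., via `pip install flaml[ray]`) so that `n_concurrent_trials > 1` takes effect.

```python
from flaml import AutoML
from sklearn.datasets import load_iris

X_train, y_train = load_iris(return_X_y=True)

automl = AutoML()
# 4 trials run concurrently; each trial's learner may use 4 CPU cores.
automl.fit(X_train, y_train, task="classification", time_budget=60,
           n_jobs=4, n_concurrent_trials=4)
```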
#### **Guidelines on parallel vs sequential tuning**
**(1) Considerations on wall-clock time.**

One common motivation for parallel tuning is to save wall-clock time. When sequential tuning and parallel tuning achieve similar wall-clock time, sequential tuning should be preferred. This is a good rule of thumb when the HPO algorithm is sequential by nature (e.g., Bayesian optimization and FLAML's HPO algorithms CFO and BlendSearch), because sequential tuning lets the HPO algorithm take full advantage of the historical trial results. The question then becomes: **how do you estimate the wall-clock time needed by parallel tuning and sequential tuning?**

You can roughly estimate the wall-clock time of parallel and sequential tuning as follows: to finish $N$ trials of hyperparameter tuning, i.e., to evaluate $N$ hyperparameter configurations, the total wall-clock time needed is $N/k \cdot (SingleTrialTime + Overhead)$, where $SingleTrialTime$ is the time to evaluate one hyperparameter configuration, $k$ is the scale of parallelism, e.g., the number of parallel CPU/GPU cores, and $Overhead$ is the per-trial computation overhead.
In sequential tuning, $k = 1$; in parallel tuning, $k > 1$. This may suggest that parallel tuning always has a shorter wall-clock time, but that is not the case once you account for the other two factors, $SingleTrialTime$ and $Overhead$:

- The $Overhead$ in sequential tuning is typically negligible, while in parallel tuning it can be relatively large.

- You can also reduce $SingleTrialTime$ to shorten the wall-clock time of sequential tuning: for example, by increasing the resources consumed by a single trial (distributed or multi-threaded training). One concrete example is the `n_jobs` parameter that sets the number of threads the fitting process can use in many scikit-learn style algorithms. A rough comparison is sketched below.
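As a toy illustration of the estimate above (the numbers are made-up assumptions, not measurements):

```python
# Back-of-the-envelope use of N/k * (SingleTrialTime + Overhead), in seconds.
def estimated_wall_clock_time(n_trials, k, single_trial_time, overhead):
    return n_trials / k * (single_trial_time + overhead)

# Sequential tuning: k = 1, negligible overhead.
print(estimated_wall_clock_time(100, k=1, single_trial_time=60, overhead=0))   # 6000.0
# Parallel tuning: k = 4, but each trial pays some overhead.
print(estimated_wall_clock_time(100, k=4, single_trial_time=60, overhead=20))  # 2000.0
# Multi-threaded training shrinks SingleTrialTime; sequential tuning catches up.
print(estimated_wall_clock_time(100, k=1, single_trial_time=20, overhead=0))   # 2000.0
```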
**(2) Considerations on randomness.**

Potential sources of randomness:

1. Parallel tuning: in parallel tuning, the order in which trials finish is no longer deterministic. This non-deterministic order, combined with a sequential HPO algorithm, leads to a non-deterministic hyperparameter tuning trajectory.

2. Distributed or multi-threaded training: distributed/multi-threaded training may introduce randomness into model training, i.e., models trained with the same hyperparameter configuration may differ because of such randomness. This model-level randomness may be undesirable in some cases.
### Warm start
@@ -74,16 +74,65 @@ from flaml import tune
config_search_space = {
    "x": tune.lograndint(lower=1, upper=100000),
    "y": tune.randint(lower=1, upper=100000)
}

# provide the search space to tune.run
tune.run(..., config=config_search_space, ...)
```

#### **Details and guidelines on hyperparameter search space**

The value assigned to a particular hyperparameter in the search space dictionary is called a *domain*; for example, `tune.randint(lower=1, upper=100000)` is the domain for the hyperparameter `y`. The domain specifies a *type* and a *valid range* to sample parameters from. Supported types include float, integer, and categorical.
- **Categorical hyperparameter**

If the hyperparameter is categorical, use `tune.choice(possible_choices)`, where `possible_choices` is the list of possible categorical values of the hyperparameter. For example, if you are tuning the optimizer used in model training and the candidate optimizers are "sgd" and "adam", specify the search space as follows:
```python
{
    "optimizer": tune.choice(["sgd", "adam"]),
}
```

- **Numerical hyperparameter**

If the hyperparameter is numerical, you need to know whether it takes integer or float values. In addition, you need to answer:

- What is the range of valid values, i.e., what are the lower and upper limits of the hyperparameter?
- Do you want to sample in linear scale or log scale? It is common practice to sample in log scale when the valid value range is large and the evaluation function changes more regularly with respect to the log of the hyperparameter, as in the following example for learning rate tuning. Here we set the lower and upper limits of the learning rate to 1/1024 and 1.0, respectively, and sample in log space because model performance changes more regularly with the log of the learning rate over such a large search range.
```python
{
    "learning_rate": tune.loguniform(lower=1 / 1024, upper=1.0),
}
```

When the search range of the learning rate is small, it is more common to sample in linear scale, as shown in the following example:
```python
{
    "learning_rate": tune.uniform(lower=0.1, upper=0.2),
}
```

- Do you have a quantization granularity requirement?

When you want the hyperparameter to change with a desired quantization granularity, you can use the quantized variants, i.e., `tune.qrandint`, `tune.quniform`, `tune.qlograndint`, or `tune.qloguniform`. The following code example samples uniformly between 0.1 and 0.2 with an increment of 0.02, i.e., the sampled learning rate can only take values in {0.1, 0.12, 0.14, 0.16, 0.18, 0.2}:
```python
{
    "learning_rate": tune.quniform(lower=0.1, upper=0.2, q=0.02),
}
```

Once you have answers to the three questions above, you can find the corresponding search space choice in the table below.
| | Integer | Float |
| ----------- | ----------- | ----------- |
| linear scale | tune.randint(lower: int, upper: int) | tune.uniform(lower: float, upper: float) |
| log scale | tune.lograndint(lower: int, upper: int, base: float = 10) | tune.loguniform(lower: float, upper: float, base: float = 10) |
| linear scale with quantization | tune.qrandint(lower: int, upper: int, q: int = 1) | tune.quniform(lower: float, upper: float, q: float = 1) |
| log scale with quantization | tune.qlograndint(lower: int, upper: int, q: int = 1, base: float = 10) | tune.qloguniform(lower: float, upper: float, q: float = 1, base: float = 10) |
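To make the table concrete, here is a sketch of a search space (with hypothetical hyperparameter names and ranges) that draws one domain from each row:

```python
from flaml import tune

config_search_space = {
    "max_depth": tune.randint(lower=2, upper=10),                   # linear scale, integer
    "learning_rate": tune.loguniform(lower=1 / 1024, upper=1.0),    # log scale, float
    "batch_size": tune.qrandint(lower=16, upper=256, q=16),         # quantized linear scale, integer
    "l2_penalty": tune.qloguniform(lower=1e-3, upper=1.0, q=1e-3),  # quantized log scale, float
}
```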
See the example below for the commonly used types of domains.
```python
@@ -132,6 +181,7 @@ config = {
```
<!-- Please refer to [ray.tune](https://docs.ray.io/en/latest/tune/api_docs/search_space.html#overview) for a more comprehensive introduction about possible choices of the domain. -->
#### Cost-related hyperparameters
Cost-related hyperparameters are a subset of the hyperparameters which directly affect the computation cost incurred in the evaluation of any hyperparameter configuration. For example, the number of estimators (`n_estimators`) and the maximum number of leaves (`max_leaves`) are known to affect the training cost of tree-based learners. So they are cost-related hyperparameters for tree-based learners.
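When tuning with `flaml.tune`, you can hint the cheap end of the cost-related hyperparameters via the `low_cost_partial_config` argument of `tune.run`, so that the search starts from inexpensive configurations. A minimal sketch, with illustrative ranges, values, and a hypothetical evaluation function:

```python
from flaml import tune

# n_estimators and max_leaves are cost-related: larger values cost more to train.
config_search_space = {
    "n_estimators": tune.lograndint(lower=4, upper=32768),
    "max_leaves": tune.lograndint(lower=4, upper=32768),
}

def evaluate_config(config):
    # Hypothetical stand-in for real training/evaluation code.
    return {"score": 1 / (config["n_estimators"] * config["max_leaves"])}

tune.run(
    evaluate_config,
    config=config_search_space,
    metric="score",
    mode="min",
    num_samples=10,
    # Hint the cheap end of the cost-related hyperparameters (values are illustrative).
    low_cost_partial_config={"n_estimators": 4, "max_leaves": 4},
)
```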
@@ -223,7 +273,7 @@ flaml.tune.run(evaluation_function=evaluate_config, mode="min",
config_constraints=[(area, "<=", 1000)], ...)
```
You can also specify a list of metric constraints to be satisfied via the argument `metric_constraints`. Each element in the `metric_constraints` list is a tuple that consists of (1) a string specifying the name of the metric (the metric name must be defined and returned in the user-defined `evaluation_function`); (2) an operation chosen from "<=" or ">="; (3) a numerical threshold.
In the following code example, we constrain the metric `score` to be no larger than 0.4.
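For instance, in the sketch below (with a hypothetical evaluation function that returns the constrained metric), the tuner is asked to satisfy `score <= 0.4`:

```python
from flaml import tune

def evaluate_config(config):
    # Hypothetical evaluation function; it must return the "score" metric
    # for the metric constraint below to be checked.
    score = (config["x"] - 0.5) ** 2
    return {"score": score}

tune.run(
    evaluate_config,
    config={"x": tune.uniform(lower=0.0, upper=1.0)},
    metric="score",
    mode="min",
    num_samples=10,
    metric_constraints=[("score", "<=", 0.4)],
)
```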