This commit is contained in:
Chi Wang
2023-09-16 10:57:57 +00:00
parent 4f8e30786c
commit bc4473fe8a
318 changed files with 56 additions and 70662 deletions


@@ -1,69 +0,0 @@
# AutoML - Classification
### Prerequisites
Install the [automl] option.
```bash
pip install "flaml[automl]"
```
### A basic classification example
```python
from flaml import AutoML
from sklearn.datasets import load_iris
# Initialize an AutoML instance
automl = AutoML()
# Specify automl goal and constraint
automl_settings = {
"time_budget": 1, # in seconds
"metric": 'accuracy',
"task": 'classification',
"log_file_name": "iris.log",
}
X_train, y_train = load_iris(return_X_y=True)
# Train with labeled input data
automl.fit(X_train=X_train, y_train=y_train,
**automl_settings)
# Predict
print(automl.predict_proba(X_train))
# Print the best model
print(automl.model.estimator)
```
#### Sample output
```
[flaml.automl: 11-12 18:21:44] {1485} INFO - Data split method: stratified
[flaml.automl: 11-12 18:21:44] {1489} INFO - Evaluation method: cv
[flaml.automl: 11-12 18:21:44] {1540} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 11-12 18:21:44] {1577} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'lrl1']
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 0, current learner lgbm
[flaml.automl: 11-12 18:21:44] {1944} INFO - Estimated sufficient time budget=1285s. Estimated necessary time budget=23s.
[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.2s, estimator lgbm's best error=0.0733, best estimator lgbm's best error=0.0733
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 1, current learner lgbm
[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.3s, estimator lgbm's best error=0.0733, best estimator lgbm's best error=0.0733
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 2, current learner lgbm
[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.4s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 3, current learner lgbm
[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.6s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 4, current learner lgbm
[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.6s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 5, current learner xgboost
[flaml.automl: 11-12 18:21:45] {2029} INFO - at 0.9s, estimator xgboost's best error=0.0600, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:45] {1826} INFO - iteration 6, current learner lgbm
[flaml.automl: 11-12 18:21:45] {2029} INFO - at 1.0s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:45] {1826} INFO - iteration 7, current learner extra_tree
[flaml.automl: 11-12 18:21:45] {2029} INFO - at 1.1s, estimator extra_tree's best error=0.0667, best estimator lgbm's best error=0.0533
[flaml.automl: 11-12 18:21:45] {2242} INFO - retrain lgbm for 0.0s
[flaml.automl: 11-12 18:21:45] {2247} INFO - retrained model: LGBMClassifier(learning_rate=0.2677050123105203, max_bin=127,
min_child_samples=12, n_estimators=4, num_leaves=4,
reg_alpha=0.001348364934537134, reg_lambda=1.4442580148221913,
verbose=-1)
[flaml.automl: 11-12 18:21:45] {1608} INFO - fit succeeded
[flaml.automl: 11-12 18:21:45] {1610} INFO - Time taken to find the best model: 0.3756711483001709
```
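The progress lines in the log above follow a regular pattern, so the search trajectory can be recovered from them. The following is a minimal sketch, assuming the console format shown in this sample; the regular expression and the helper name `parse_progress` are illustrative and not part of FLAML. Because the error metric is `1-accuracy`, accuracy is recovered as `1 - error`.

```python
import re

# Matches progress lines such as:
#   "... INFO - at 0.4s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533"
# The pattern is an assumption based on the sample output, not a documented format.
PROGRESS = re.compile(
    r"at (?P<t>[\d.]+)s,\s+estimator (?P<learner>\w+)'s best error=(?P<err>[\d.]+),"
    r"\s+best estimator (?P<best>\w+)'s best error=(?P<best_err>[\d.]+)"
)


def parse_progress(lines):
    """Return (seconds, best learner so far, best accuracy so far) per progress line."""
    rows = []
    for line in lines:
        m = PROGRESS.search(line)
        if m:  # skip "iteration N, current learner ..." and other lines
            rows.append((float(m["t"]), m["best"], round(1 - float(m["best_err"]), 4)))
    return rows


sample = [
    "[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.2s, estimator lgbm's best error=0.0733, best estimator lgbm's best error=0.0733",
    "[flaml.automl: 11-12 18:21:44] {1826} INFO - iteration 2, current learner lgbm",
    "[flaml.automl: 11-12 18:21:44] {2029} INFO - at 0.4s, estimator lgbm's best error=0.0533, best estimator lgbm's best error=0.0533",
]
for row in parse_progress(sample):
    print(row)
```

If `log_file_name` is set, the log file may use a different layout, so this pattern should be treated as specific to the console output.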
### A more advanced example including custom learner and metric
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_classification.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_classification.ipynb)


@@ -1,376 +0,0 @@
# AutoML - NLP
### Requirements
This example is designed to run on a GPU; set `gpu_per_trial` to 0 if none is available. Install the [automl,hf] option:
```bash
pip install "flaml[automl,hf]"
```
### A simple sequence classification example
```python
from flaml import AutoML
from datasets import load_dataset
train_dataset = load_dataset("glue", "mrpc", split="train").to_pandas()
dev_dataset = load_dataset("glue", "mrpc", split="validation").to_pandas()
test_dataset = load_dataset("glue", "mrpc", split="test").to_pandas()
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"
X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]
automl = AutoML()
automl_settings = {
"time_budget": 100,
"task": "seq-classification",
"fit_kwargs_by_estimator": {
"transformer":
{
"output_dir": "data/output/" # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
}
}, # setting the huggingface arguments: output directory
"gpu_per_trial": 1, # set to 0 if no GPU is available
}
automl.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings)
automl.predict(X_test)
```
After you run `automl.fit`, the intermediate checkpoints are saved under the specified output_dir `data/output`. You can use the following code to remove them if they take up too much storage space:
```python
import os
import shutil
if os.path.exists("data/output/"):
    shutil.rmtree("data/output/")
```
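Before deleting, it can be useful to know how much space the checkpoints occupy. The following is a minimal sketch using only the standard library; the helper name `remove_checkpoints` and the demo file `weights.bin` are hypothetical, and the demonstration runs on a temporary directory rather than a real `data/output/`.

```python
import os
import shutil
import tempfile


def remove_checkpoints(output_dir):
    """Delete output_dir if it exists and return the number of bytes freed."""
    if not os.path.isdir(output_dir):
        return 0
    freed = 0
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            freed += os.path.getsize(os.path.join(root, name))
    shutil.rmtree(output_dir)
    return freed


# Demonstration on a throwaway directory that mimics a checkpoint folder.
demo_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(os.path.join(demo_dir, "checkpoint-53"))
with open(os.path.join(demo_dir, "checkpoint-53", "weights.bin"), "wb") as f:
    f.write(b"\0" * 1024)

freed_bytes = remove_checkpoints(demo_dir)
print(freed_bytes)
print(os.path.exists(demo_dir))
```

Calling the helper a second time is harmless: it returns 0 when the directory no longer exists.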
#### Sample output
```
[flaml.automl: 12-06 08:21:39] {1943} INFO - task = seq-classification
[flaml.automl: 12-06 08:21:39] {1945} INFO - Data split method: stratified
[flaml.automl: 12-06 08:21:39] {1949} INFO - Evaluation method: holdout
[flaml.automl: 12-06 08:21:39] {2019} INFO - Minimizing error metric: 1-accuracy
[flaml.automl: 12-06 08:21:39] {2071} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-06 08:21:39] {2311} INFO - iteration 0, current learner transformer
{'data/output/train_2021-12-06_08-21-53/train_8947b1b2_1_n=1e-06,s=9223372036854775807,e=1e-05,s=-1,s=0.45765,e=32,d=42,o=0.0,y=0.0_2021-12-06_08-21-53/checkpoint-53': 53}
[flaml.automl: 12-06 08:22:56] {2424} INFO - Estimated sufficient time budget=766860s. Estimated necessary time budget=767s.
[flaml.automl: 12-06 08:22:56] {2499} INFO - at 76.7s, estimator transformer's best error=0.1740, best estimator transformer's best error=0.1740
[flaml.automl: 12-06 08:22:56] {2606} INFO - selected model: <flaml.nlp.huggingface.trainer.TrainerForAuto object at 0x7f49ea8414f0>
[flaml.automl: 12-06 08:22:56] {2100} INFO - fit succeeded
[flaml.automl: 12-06 08:22:56] {2101} INFO - Time taken to find the best model: 76.69802761077881
[flaml.automl: 12-06 08:22:56] {2112} WARNING - Time taken to find the best model is 77% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
```
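The percentage in the final warning is consistent with the ratio of the time taken to find the best model to the `time_budget` of 100 seconds. The following is a small worked check of that arithmetic; the helper `budget_share` is illustrative, and the assumption that FLAML computes the figure this way is inferred from the numbers in the log rather than confirmed by the source.

```python
def budget_share(time_to_best, time_budget):
    """Percentage of the time budget elapsed when the best model was found."""
    return round(100 * time_to_best / time_budget)


# Values taken from the sample output above and from automl_settings.
share = budget_share(76.69802761077881, 100)
print(f"{share}% of the time budget")
```

A share this close to 100% suggests the search was still improving when the budget ran out, which is why the warning recommends a larger budget.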
### A simple sequence regression example
```python
from flaml import AutoML
from datasets import load_dataset
train_dataset = (
    load_dataset("glue", "stsb", split="train").to_pandas()
)
dev_dataset = (
    load_dataset("glue", "stsb", split="validation").to_pandas()  # use the validation split, not train, for X_val/y_val
)
custom_sent_keys = ["sentence1", "sentence2"]
label_key = "label"
X_train = train_dataset[custom_sent_keys]
y_train = train_dataset[label_key]
X_val = dev_dataset[custom_sent_keys]
y_val = dev_dataset[label_key]
automl = AutoML()
automl_settings = {
"gpu_per_trial": 0,
"time_budget": 20,
"task": "seq-regression",
"metric": "rmse",
}
automl_settings["fit_kwargs_by_estimator"] = {  # setting the huggingface arguments
    "transformer": {
        "model_path": "google/electra-small-discriminator",  # if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
        "output_dir": "data/output/",  # setting the output directory
        "fp16": False,  # setting whether to use FP16
    }
}
automl.fit(
X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)
```
#### Sample output
```
[flaml.automl: 12-20 11:47:28] {1965} INFO - task = seq-regression
[flaml.automl: 12-20 11:47:28] {1967} INFO - Data split method: uniform
[flaml.automl: 12-20 11:47:28] {1971} INFO - Evaluation method: holdout
[flaml.automl: 12-20 11:47:28] {2063} INFO - Minimizing error metric: rmse
[flaml.automl: 12-20 11:47:28] {2115} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-20 11:47:28] {2355} INFO - iteration 0, current learner transformer
```
### A simple summarization example
```python
from flaml import AutoML
from datasets import load_dataset
train_dataset = (
load_dataset("xsum", split="train").to_pandas()
)
dev_dataset = (
load_dataset("xsum", split="validation").to_pandas()
)
custom_sent_keys = ["document"]
label_key = "summary"
X_train = train_dataset[custom_sent_keys]
y_train = train_dataset[label_key]
X_val = dev_dataset[custom_sent_keys]
y_val = dev_dataset[label_key]
automl = AutoML()
automl_settings = {
"gpu_per_trial": 1,
"time_budget": 20,
"task": "summarization",
"metric": "rouge1",
}
automl_settings["fit_kwargs_by_estimator"] = {  # setting the huggingface arguments
    "transformer": {
        "model_path": "t5-small",  # if model_path is not set, the default model is t5-small: https://huggingface.co/t5-small
        "output_dir": "data/output/",  # setting the output directory
        "fp16": False,  # setting whether to use FP16
    }
}
automl.fit(
X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings
)
```
#### Sample output
```
[flaml.automl: 12-20 11:44:03] {1965} INFO - task = summarization
[flaml.automl: 12-20 11:44:03] {1967} INFO - Data split method: uniform
[flaml.automl: 12-20 11:44:03] {1971} INFO - Evaluation method: holdout
[flaml.automl: 12-20 11:44:03] {2063} INFO - Minimizing error metric: -rouge
[flaml.automl: 12-20 11:44:03] {2115} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 12-20 11:44:03] {2355} INFO - iteration 0, current learner transformer
loading configuration file https://huggingface.co/t5-small/resolve/main/config.json from cache at /home/xliu127/.cache/huggingface/transformers/fe501e8fd6425b8ec93df37767fcce78ce626e34cc5edc859c662350cf712e41.406701565c0afd9899544c1cb8b93185a76f00b31e5ce7f6e18bbaef02241985
Model config T5Config {
"_name_or_path": "t5-small",
"architectures": [
"T5WithLMHeadModel"
],
"d_ff": 2048,
"d_kv": 64,
"d_model": 512,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "relu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"n_positions": 512,
"num_decoder_layers": 6,
"num_heads": 8,
"num_layers": 6,
"output_past": true,
"pad_token_id": 0,
"relative_attention_num_buckets": 32,
"task_specific_params": {
"summarization": {
"early_stopping": true,
"length_penalty": 2.0,
"max_length": 200,
"min_length": 30,
"no_repeat_ngram_size": 3,
"num_beams": 4,
"prefix": "summarize: "
},
"translation_en_to_de": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to German: "
},
"translation_en_to_fr": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to French: "
},
"translation_en_to_ro": {
"early_stopping": true,
"max_length": 300,
"num_beams": 4,
"prefix": "translate English to Romanian: "
}
},
"transformers_version": "4.14.1",
"use_cache": true,
"vocab_size": 32128
}
```
### A simple token classification example
There are two ways to define the label for a token classification task. The first is to define the token labels:
```python
from flaml import AutoML
import pandas as pd
train_dataset = {
"id": ["0", "1"],
"ner_tags": [
["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"],
["B-PER", "I-PER"],
],
"tokens": [
[
"EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", ".",
],
["Peter", "Blackburn"],
],
}
dev_dataset = {
"id": ["0"],
"ner_tags": [
["O"],
],
"tokens": [
["1996-08-22"]
],
}
test_dataset = {
"id": ["0"],
"ner_tags": [
["O"],
],
"tokens": [
['.']
],
}
custom_sent_keys = ["tokens"]
label_key = "ner_tags"
train_dataset = pd.DataFrame(train_dataset)
dev_dataset = pd.DataFrame(dev_dataset)
test_dataset = pd.DataFrame(test_dataset)
X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]
automl = AutoML()
automl_settings = {
"time_budget": 10,
"task": "token-classification",
"fit_kwargs_by_estimator": {
"transformer":
{
"output_dir": "data/output/"
# if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
}
}, # setting the huggingface arguments: output directory
"gpu_per_trial": 1, # set to 0 if no GPU is available
"metric": "seqeval:overall_f1"
}
automl.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings)
automl.predict(X_test)
```
The second is to define the labels as integer ids, together with a token [label list](https://microsoft.github.io/FLAML/docs/reference/nlp/huggingface/training_args) that maps each id to its tag:
```python
from flaml import AutoML
import pandas as pd
train_dataset = {
"id": ["0", "1"],
"ner_tags": [
[3, 0, 7, 0, 0, 0, 7, 0, 0],
[1, 2],
],
"tokens": [
[
"EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", ".",
],
["Peter", "Blackburn"],
],
}
dev_dataset = {
"id": ["0"],
"ner_tags": [
[0],
],
"tokens": [
["1996-08-22"]
],
}
test_dataset = {
"id": ["0"],
"ner_tags": [
[0],
],
"tokens": [
['.']
],
}
custom_sent_keys = ["tokens"]
label_key = "ner_tags"
train_dataset = pd.DataFrame(train_dataset)
dev_dataset = pd.DataFrame(dev_dataset)
test_dataset = pd.DataFrame(test_dataset)
X_train, y_train = train_dataset[custom_sent_keys], train_dataset[label_key]
X_val, y_val = dev_dataset[custom_sent_keys], dev_dataset[label_key]
X_test = test_dataset[custom_sent_keys]
automl = AutoML()
automl_settings = {
"time_budget": 10,
"task": "token-classification",
"fit_kwargs_by_estimator": {
"transformer":
{
"output_dir": "data/output/",
# if model_path is not set, the default model is facebook/muppet-roberta-base: https://huggingface.co/facebook/muppet-roberta-base
"label_list": [ "O","B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC" ]
}
}, # setting the huggingface arguments: output directory
"gpu_per_trial": 1, # set to 0 if no GPU is available
"metric": "seqeval:overall_f1"
}
automl.fit(X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val, **automl_settings)
automl.predict(X_test)
```
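The two label formats carry the same information: each integer id is the position of the tag in `label_list`. The following is a minimal sketch of the conversion in both directions; the helper names `tags_to_ids` and `ids_to_tags` are illustrative and not part of FLAML.

```python
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
label_to_id = {label: i for i, label in enumerate(label_list)}


def tags_to_ids(tag_sequences):
    """Map each string tag to its index in label_list."""
    return [[label_to_id[tag] for tag in tags] for tags in tag_sequences]


def ids_to_tags(id_sequences):
    """Map each integer id back to its string tag."""
    return [[label_list[i] for i in ids] for ids in id_sequences]


# The training labels from the first example.
ner_tags = [
    ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"],
    ["B-PER", "I-PER"],
]
ner_ids = tags_to_ids(ner_tags)
print(ner_ids)
```

The result matches the `ner_tags` used in the second example, so either form describes the same training data.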
#### Sample output
```
[flaml.automl: 06-30 03:10:02] {2423} INFO - task = token-classification
[flaml.automl: 06-30 03:10:02] {2425} INFO - Data split method: stratified
[flaml.automl: 06-30 03:10:02] {2428} INFO - Evaluation method: holdout
[flaml.automl: 06-30 03:10:02] {2497} INFO - Minimizing error metric: seqeval:overall_f1
[flaml.automl: 06-30 03:10:02] {2637} INFO - List of ML learners in AutoML Run: ['transformer']
[flaml.automl: 06-30 03:10:02] {2929} INFO - iteration 0, current learner transformer
```
For tasks that are not currently supported, use `flaml.tune` for [customized tuning](Tune-HuggingFace).
### Link to Jupyter notebook
To run more examples, especially examples using Ray Tune, please go to:
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_nlp.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_nlp.ipynb)


@@ -1,103 +0,0 @@
# AutoML - Rank
### Prerequisites
Install the [automl] option.
```bash
pip install "flaml[automl]"
```
### A simple learning-to-rank example
```python
from sklearn.datasets import fetch_openml
from flaml import AutoML
# as_frame=True returns pandas objects, so the categorical target supports .cat.codes
X_train, y_train = fetch_openml(name="credit-g", return_X_y=True, as_frame=True)
y_train = y_train.cat.codes
# not a real learning-to-rank dataset
groups = [200] * 4 + [100] * 2 # group counts
automl = AutoML()
automl.fit(
X_train, y_train, groups=groups,
task='rank', time_budget=10, # in seconds
)
```
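The `groups` argument in this example holds group counts: each entry is the size of one consecutive block of rows, so the counts need to add up to the number of training rows (credit-g has 1,000). The following is a minimal sketch of that relationship, assuming the usual learning-to-rank convention that the rows of a group are contiguous; the helper `expand_group_counts` is illustrative only.

```python
def expand_group_counts(counts):
    """Turn per-group sizes into one group id per row (rows ordered by group)."""
    ids = []
    for group_id, size in enumerate(counts):
        ids.extend([group_id] * size)
    return ids


group_counts = [200] * 4 + [100] * 2  # same value as in the example above
group_ids = expand_group_counts(group_counts)

print(len(group_counts), sum(group_counts))          # number of groups, number of rows covered
print(group_ids[199], group_ids[200], group_ids[-1])  # ids at a group boundary and at the end
```

A quick `sum(group_counts) == len(X_train)` check before calling `fit` catches mismatched group sizes early.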
#### Sample output
```
[flaml.automl: 11-15 07:14:30] {1485} INFO - Data split method: group
[flaml.automl: 11-15 07:14:30] {1489} INFO - Evaluation method: holdout
[flaml.automl: 11-15 07:14:30] {1540} INFO - Minimizing error metric: 1-ndcg
[flaml.automl: 11-15 07:14:30] {1577} INFO - List of ML learners in AutoML Run: ['lgbm', 'xgboost']
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 0, current learner lgbm
[flaml.automl: 11-15 07:14:30] {1944} INFO - Estimated sufficient time budget=679s. Estimated necessary time budget=1s.
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.1s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 1, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.1s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 2, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.2s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 3, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.2s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 4, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.2s, estimator xgboost's best error=0.0315, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 5, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.2s, estimator xgboost's best error=0.0315, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 6, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.3s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 7, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.3s, estimator lgbm's best error=0.0248, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 8, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0315, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 9, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0315, best estimator lgbm's best error=0.0248
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 10, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0233, best estimator xgboost's best error=0.0233
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 11, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0233, best estimator xgboost's best error=0.0233
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 12, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0233, best estimator xgboost's best error=0.0233
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 13, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.4s, estimator xgboost's best error=0.0233, best estimator xgboost's best error=0.0233
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 14, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.5s, estimator lgbm's best error=0.0225, best estimator lgbm's best error=0.0225
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 15, current learner xgboost
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.5s, estimator xgboost's best error=0.0233, best estimator lgbm's best error=0.0225
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 16, current learner lgbm
[flaml.automl: 11-15 07:14:30] {2029} INFO - at 0.5s, estimator lgbm's best error=0.0225, best estimator lgbm's best error=0.0225
[flaml.automl: 11-15 07:14:30] {1826} INFO - iteration 17, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.5s, estimator lgbm's best error=0.0225, best estimator lgbm's best error=0.0225
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 18, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.6s, estimator lgbm's best error=0.0225, best estimator lgbm's best error=0.0225
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 19, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.6s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 20, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.6s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 21, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.7s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 22, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.7s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 23, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.8s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 24, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.8s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 25, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.8s, estimator lgbm's best error=0.0201, best estimator lgbm's best error=0.0201
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 26, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.9s, estimator lgbm's best error=0.0197, best estimator lgbm's best error=0.0197
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 27, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 0.9s, estimator lgbm's best error=0.0197, best estimator lgbm's best error=0.0197
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 28, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 1.0s, estimator lgbm's best error=0.0197, best estimator lgbm's best error=0.0197
[flaml.automl: 11-15 07:14:31] {1826} INFO - iteration 29, current learner lgbm
[flaml.automl: 11-15 07:14:31] {2029} INFO - at 1.0s, estimator lgbm's best error=0.0197, best estimator lgbm's best error=0.0197
[flaml.automl: 11-15 07:14:31] {2242} INFO - retrain lgbm for 0.0s
[flaml.automl: 11-15 07:14:31] {2247} INFO - retrained model: LGBMRanker(colsample_bytree=0.9852774042640857,
learning_rate=0.034918421933217675, max_bin=1023,
min_child_samples=22, n_estimators=6, num_leaves=23,
reg_alpha=0.0009765625, reg_lambda=21.505295697527654, verbose=-1)
[flaml.automl: 11-15 07:14:31] {1608} INFO - fit succeeded
[flaml.automl: 11-15 07:14:31] {1610} INFO - Time taken to find the best model: 0.8846545219421387
[flaml.automl: 11-15 07:14:31] {1624} WARNING - Time taken to find the best model is 88% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
```
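The log reports the error as `1-ndcg`, so a best error of 0.0197 corresponds to an NDCG of about 0.9803. The following is a minimal sketch of NDCG for a single group, assuming the common exponential-gain formulation; the exact variant and cutoff used in the run above are not stated in this document, so the numbers are illustrative only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances_in_predicted_order):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True))
    return dcg(relevances_in_predicted_order) / ideal if ideal > 0 else 0.0


perfect = [1, 1, 0, 0]  # relevant items ranked first
swapped = [0, 1, 1, 0]  # an irrelevant item ranked first

print(1 - ndcg(perfect))
print(1 - ndcg(swapped))
```

A perfect ordering gives an error of 0, and the error grows as relevant items are pushed down the ranking.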


@@ -1,108 +0,0 @@
# AutoML - Regression
### Prerequisites
Install the [automl] option.
```bash
pip install "flaml[automl]"
```
### A basic regression example
```python
from flaml import AutoML
from sklearn.datasets import fetch_california_housing
# Initialize an AutoML instance
automl = AutoML()
# Specify automl goal and constraint
automl_settings = {
"time_budget": 1, # in seconds
"metric": 'r2',
"task": 'regression',
"log_file_name": "california.log",
}
X_train, y_train = fetch_california_housing(return_X_y=True)
# Train with labeled input data
automl.fit(X_train=X_train, y_train=y_train,
**automl_settings)
# Predict
print(automl.predict(X_train))
# Print the best model
print(automl.model.estimator)
```
#### Sample output
```
[flaml.automl: 11-15 07:08:19] {1485} INFO - Data split method: uniform
[flaml.automl: 11-15 07:08:19] {1489} INFO - Evaluation method: holdout
[flaml.automl: 11-15 07:08:19] {1540} INFO - Minimizing error metric: 1-r2
[flaml.automl: 11-15 07:08:19] {1577} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree']
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 0, current learner lgbm
[flaml.automl: 11-15 07:08:19] {1944} INFO - Estimated sufficient time budget=846s. Estimated necessary time budget=2s.
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.2s, estimator lgbm's best error=0.7393, best estimator lgbm's best error=0.7393
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 1, current learner lgbm
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.3s, estimator lgbm's best error=0.7393, best estimator lgbm's best error=0.7393
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 2, current learner lgbm
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.3s, estimator lgbm's best error=0.5446, best estimator lgbm's best error=0.5446
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 3, current learner lgbm
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.4s, estimator lgbm's best error=0.2807, best estimator lgbm's best error=0.2807
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 4, current learner lgbm
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.5s, estimator lgbm's best error=0.2712, best estimator lgbm's best error=0.2712
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 5, current learner lgbm
[flaml.automl: 11-15 07:08:19] {2029} INFO - at 0.5s, estimator lgbm's best error=0.2712, best estimator lgbm's best error=0.2712
[flaml.automl: 11-15 07:08:19] {1826} INFO - iteration 6, current learner lgbm
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.6s, estimator lgbm's best error=0.2712, best estimator lgbm's best error=0.2712
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 7, current learner lgbm
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.7s, estimator lgbm's best error=0.2197, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 8, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.8s, estimator xgboost's best error=1.4958, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 9, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.8s, estimator xgboost's best error=1.4958, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 10, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.9s, estimator xgboost's best error=0.7052, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 11, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.9s, estimator xgboost's best error=0.3619, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 12, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 0.9s, estimator xgboost's best error=0.3619, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 13, current learner xgboost
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 1.0s, estimator xgboost's best error=0.3619, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {1826} INFO - iteration 14, current learner extra_tree
[flaml.automl: 11-15 07:08:20] {2029} INFO - at 1.1s, estimator extra_tree's best error=0.7197, best estimator lgbm's best error=0.2197
[flaml.automl: 11-15 07:08:20] {2242} INFO - retrain lgbm for 0.0s
[flaml.automl: 11-15 07:08:20] {2247} INFO - retrained model: LGBMRegressor(colsample_bytree=0.7610534336273627,
learning_rate=0.41929025492645006, max_bin=255,
min_child_samples=4, n_estimators=45, num_leaves=4,
reg_alpha=0.0009765625, reg_lambda=0.009280655005879943,
verbose=-1)
[flaml.automl: 11-15 07:08:20] {1608} INFO - fit succeeded
[flaml.automl: 11-15 07:08:20] {1610} INFO - Time taken to find the best model: 0.7289648056030273
[flaml.automl: 11-15 07:08:20] {1624} WARNING - Time taken to find the best model is 73% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
```
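Because the error metric is `1-r2`, values above 1 in the log (for example xgboost's 1.4958 at iteration 8) correspond to a negative R², meaning the model did worse than always predicting the mean. The following is a minimal sketch of that relationship in plain Python, with toy data chosen for illustration.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
mean_pred = [2.5] * 4            # always predict the mean: r2 = 0, error = 1
bad_pred = [4.0, 3.0, 2.0, 1.0]  # worse than the mean: r2 < 0, error > 1

for name, pred in [("perfect", y_true), ("mean", mean_pred), ("reversed", bad_pred)]:
    r2 = r2_score(y_true, pred)
    print(name, "r2 =", r2, "error =", 1 - r2)
```

By the same conversion, the final best error of 0.2197 in the log corresponds to an R² of about 0.7803.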
### Multi-output regression
We can combine `sklearn.multioutput.MultiOutputRegressor` and `flaml.AutoML` to do AutoML for multi-output regression.
```python
from flaml import AutoML
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
# create regression data
X, y = make_regression(n_targets=3)
# split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
# train the model
model = MultiOutputRegressor(AutoML(task="regression", time_budget=60))
model.fit(X_train, y_train)
# predict
print(model.predict(X_test))
```
This runs a separate AutoML search for each of the three targets, each with its own 60-second time budget.
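Conceptually, `MultiOutputRegressor` fits one independent copy of the wrapped estimator per target column. The following is a minimal sketch of that strategy in plain Python; `MeanRegressor` is a hypothetical stand-in for an estimator such as `AutoML`, and `fit_per_target` and `predict_per_target` are illustrative helpers rather than scikit-learn or FLAML APIs.

```python
class MeanRegressor:
    """Hypothetical stand-in estimator: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_per_target(make_estimator, X, Y):
    """Fit one independent estimator per target column of Y."""
    n_targets = len(Y[0])
    return [make_estimator().fit(X, [row[j] for row in Y]) for j in range(n_targets)]


def predict_per_target(estimators, X):
    """Stack the per-target predictions back into rows."""
    columns = [est.predict(X) for est in estimators]
    return [list(row) for row in zip(*columns)]


X = [[0.0], [1.0], [2.0]]
Y = [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0]]

models = fit_per_target(MeanRegressor, X, Y)
predictions = predict_per_target(models, [[5.0]])
total_budget = len(models) * 60  # seconds, if each fit had a 60-second budget

print(len(models), predictions, total_budget)
```

With three targets and a 60-second budget per fit, the overall search time is roughly 180 seconds when the fits run one after another.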

File diff suppressed because it is too large


@@ -1,207 +0,0 @@
# AutoML for LightGBM
### Prerequisites for this example
Install the [automl] option, along with matplotlib and openml.
```bash
pip install "flaml[automl]" matplotlib openml
```
### Use built-in LGBMEstimator
```python
from flaml import AutoML
from flaml.automl.data import load_openml_dataset
# Download the houses dataset (https://www.openml.org/d/537) from OpenML. The task is to predict the median house price in a region from its demographic composition and the state of its housing market.
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir='./')
automl = AutoML()
settings = {
"time_budget": 60, # total running time in seconds
"metric": 'r2', # primary metrics for regression can be chosen from: ['mae','mse','r2']
"estimator_list": ['lgbm'], # list of ML learners; we tune lightgbm in this example
"task": 'regression', # task type
"log_file_name": 'houses_experiment.log', # flaml log file
"seed": 7654321, # random seed
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
```
#### Sample output
```
[flaml.automl: 11-15 19:46:44] {1485} INFO - Data split method: uniform
[flaml.automl: 11-15 19:46:44] {1489} INFO - Evaluation method: cv
[flaml.automl: 11-15 19:46:44] {1540} INFO - Minimizing error metric: 1-r2
[flaml.automl: 11-15 19:46:44] {1577} INFO - List of ML learners in AutoML Run: ['lgbm']
[flaml.automl: 11-15 19:46:44] {1826} INFO - iteration 0, current learner lgbm
[flaml.automl: 11-15 19:46:44] {1944} INFO - Estimated sufficient time budget=3232s. Estimated necessary time budget=3s.
[flaml.automl: 11-15 19:46:44] {2029} INFO - at 0.5s, estimator lgbm's best error=0.7383, best estimator lgbm's best error=0.7383
[flaml.automl: 11-15 19:46:44] {1826} INFO - iteration 1, current learner lgbm
[flaml.automl: 11-15 19:46:44] {2029} INFO - at 0.6s, estimator lgbm's best error=0.4774, best estimator lgbm's best error=0.4774
[flaml.automl: 11-15 19:46:44] {1826} INFO - iteration 2, current learner lgbm
[flaml.automl: 11-15 19:46:44] {2029} INFO - at 0.7s, estimator lgbm's best error=0.4774, best estimator lgbm's best error=0.4774
[flaml.automl: 11-15 19:46:44] {1826} INFO - iteration 3, current learner lgbm
[flaml.automl: 11-15 19:46:44] {2029} INFO - at 0.9s, estimator lgbm's best error=0.2985, best estimator lgbm's best error=0.2985
[flaml.automl: 11-15 19:46:44] {1826} INFO - iteration 4, current learner lgbm
[flaml.automl: 11-15 19:46:45] {2029} INFO - at 1.3s, estimator lgbm's best error=0.2337, best estimator lgbm's best error=0.2337
[flaml.automl: 11-15 19:46:45] {1826} INFO - iteration 5, current learner lgbm
[flaml.automl: 11-15 19:46:45] {2029} INFO - at 1.4s, estimator lgbm's best error=0.2337, best estimator lgbm's best error=0.2337
[flaml.automl: 11-15 19:46:45] {1826} INFO - iteration 6, current learner lgbm
[flaml.automl: 11-15 19:46:46] {2029} INFO - at 2.5s, estimator lgbm's best error=0.2219, best estimator lgbm's best error=0.2219
[flaml.automl: 11-15 19:46:46] {1826} INFO - iteration 7, current learner lgbm
[flaml.automl: 11-15 19:46:46] {2029} INFO - at 2.9s, estimator lgbm's best error=0.2219, best estimator lgbm's best error=0.2219
[flaml.automl: 11-15 19:46:46] {1826} INFO - iteration 8, current learner lgbm
[flaml.automl: 11-15 19:46:48] {2029} INFO - at 4.5s, estimator lgbm's best error=0.1764, best estimator lgbm's best error=0.1764
[flaml.automl: 11-15 19:46:48] {1826} INFO - iteration 9, current learner lgbm
[flaml.automl: 11-15 19:46:54] {2029} INFO - at 10.5s, estimator lgbm's best error=0.1630, best estimator lgbm's best error=0.1630
[flaml.automl: 11-15 19:46:54] {1826} INFO - iteration 10, current learner lgbm
[flaml.automl: 11-15 19:46:56] {2029} INFO - at 12.4s, estimator lgbm's best error=0.1630, best estimator lgbm's best error=0.1630
[flaml.automl: 11-15 19:46:56] {1826} INFO - iteration 11, current learner lgbm
[flaml.automl: 11-15 19:47:13] {2029} INFO - at 29.0s, estimator lgbm's best error=0.1630, best estimator lgbm's best error=0.1630
[flaml.automl: 11-15 19:47:13] {1826} INFO - iteration 12, current learner lgbm
[flaml.automl: 11-15 19:47:15] {2029} INFO - at 31.1s, estimator lgbm's best error=0.1630, best estimator lgbm's best error=0.1630
[flaml.automl: 11-15 19:47:15] {1826} INFO - iteration 13, current learner lgbm
[flaml.automl: 11-15 19:47:29] {2029} INFO - at 45.8s, estimator lgbm's best error=0.1564, best estimator lgbm's best error=0.1564
[flaml.automl: 11-15 19:47:33] {2242} INFO - retrain lgbm for 3.2s
[flaml.automl: 11-15 19:47:33] {2247} INFO - retrained model: LGBMRegressor(colsample_bytree=0.8025848209352517,
learning_rate=0.09100963138990374, max_bin=255,
min_child_samples=42, n_estimators=363, num_leaves=216,
reg_alpha=0.001113000336715291, reg_lambda=76.50614276906414,
verbose=-1)
[flaml.automl: 11-15 19:47:33] {1608} INFO - fit succeeded
[flaml.automl: 11-15 19:47:33] {1610} INFO - Time taken to find the best model: 45.75616669654846
[flaml.automl: 11-15 19:47:33] {1624} WARNING - Time taken to find the best model is 76% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
```
#### Retrieve best config
```python
print('Best hyperparameter config:', automl.best_config)
print('Best r2 on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))
print(automl.model.estimator)
# Best hyperparameter config: {'n_estimators': 363, 'num_leaves': 216, 'min_child_samples': 42, 'learning_rate': 0.09100963138990374, 'log_max_bin': 8, 'colsample_bytree': 0.8025848209352517, 'reg_alpha': 0.001113000336715291, 'reg_lambda': 76.50614276906414}
# Best r2 on validation data: 0.8436
# Training duration of best run: 3.229 s
# LGBMRegressor(colsample_bytree=0.8025848209352517,
# learning_rate=0.09100963138990374, max_bin=255,
# min_child_samples=42, n_estimators=363, num_leaves=216,
# reg_alpha=0.001113000336715291, reg_lambda=76.50614276906414,
# verbose=-1)
```
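Note that `best_config` reports `'log_max_bin': 8` while the printed estimator shows `max_bin=255`. The sketch below shows a conversion that is consistent with that output, `max_bin = 2**log_max_bin - 1`. The helper name `config_to_lgbm_params` is hypothetical and only illustrates the mapping; it is not part of FLAML's public API.

```python
def config_to_lgbm_params(config):
    """Hypothetical helper: turn a search config into LightGBM constructor params."""
    params = dict(config)  # copy so the caller's dict is left untouched
    if "log_max_bin" in params:
        # The search space uses log2 of the bin count; convert it back here
        params["max_bin"] = (1 << params.pop("log_max_bin")) - 1
    return params


# A subset of the best config printed above
best_config = {"n_estimators": 363, "num_leaves": 216, "log_max_bin": 8}
params = config_to_lgbm_params(best_config)
print(params)
```

Searching over the exponent rather than the raw bin count keeps the candidate values evenly spread on a log scale, which is a plausible reason for this parameterization.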
#### Plot feature importance
```python
import matplotlib.pyplot as plt
plt.barh(automl.feature_names_in_, automl.feature_importances_)
```
![png](../Use-Cases/images/feature_importance.png)
#### Compute predictions of testing dataset
```python
y_pred = automl.predict(X_test)
print('Predicted labels', y_pred)
# Predicted labels [143391.65036562 245535.13731811 153171.44071629 ... 184354.52735963
# 235510.49470445 282617.22858956]
```
#### Compute different metric values on testing dataset
```python
from flaml.automl.ml import sklearn_metric_loss_score
print('r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))
print('mse', '=', sklearn_metric_loss_score('mse', y_pred, y_test))
print('mae', '=', sklearn_metric_loss_score('mae', y_pred, y_test))
# r2 = 0.8505434326526395
# mse = 1975592613.138005
# mae = 29471.536046068788
```
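`sklearn_metric_loss_score` returns a loss (lower is better), which is why the `r2` line above subtracts the result from 1. As a rough illustration of what the three numbers measure, the following pure-Python sketch computes them on made-up values. The function names and the toy data are assumptions for illustration and are not taken from FLAML.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, not the houses dataset
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

r2_loss = 1.0 - r2(y_true, y_pred)  # the "loss" form of r2
print("r2 =", 1.0 - r2_loss)
print("mse =", mse(y_true, y_pred))
print("mae =", mae(y_true, y_pred))
```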
#### Compare with untuned LightGBM
```python
from lightgbm import LGBMRegressor
lgbm = LGBMRegressor()
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
from flaml.automl.ml import sklearn_metric_loss_score
print('default lgbm r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))
# default lgbm r2 = 0.8296179648694404
```
#### Plot learning curve
How does the model accuracy improve as we search for different hyperparameter configurations?
```python
from flaml.automl.data import get_output_from_log
import numpy as np
time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \
    get_output_from_log(filename=settings['log_file_name'], time_budget=60)
plt.title('Learning Curve')
plt.xlabel('Wall Clock Time (s)')
plt.ylabel('Validation r2')
plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')
plt.show()
```
![png](images/lgbm_curve.png)
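The curve is a step function because the best validation loss only changes when a trial beats the best loss seen so far. A minimal sketch of that running minimum follows; the per-trial losses are made up for illustration and are not read from the log file.

```python
from itertools import accumulate

# Made-up per-trial validation losses (1 - r2), in the order the trials finished
trial_losses = [0.7383, 0.4774, 0.5120, 0.2985, 0.3301, 0.2337]

# Running minimum: the best loss known after each trial
best_so_far = list(accumulate(trial_losses, min))

# Convert back from loss to r2 for plotting
best_r2 = [round(1 - loss, 4) for loss in best_so_far]

print(best_so_far)
print(best_r2)
```

Trials that do not improve on the best loss leave the curve flat, which is why long plateaus appear late in the search.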
### Use a customized LightGBM learner
The native API of LightGBM allows one to specify a custom objective function in the model constructor. You can easily enable it by adding a customized LightGBM learner in FLAML. In the following example, we show how to add such a customized LightGBM learner with a custom objective function.
#### Create a customized LightGBM learner with a custom objective function
```python
import numpy as np

# define your customized objective function
def my_loss_obj(y_true, y_pred):
    c = 0.5
    residual = y_pred - y_true
    grad = c * residual / (np.abs(residual) + c)
    hess = c ** 2 / (np.abs(residual) + c) ** 2
    # rmse grad and hess
    grad_rmse = residual
    hess_rmse = 1.0
    # mae grad and hess
    grad_mae = np.array(residual)
    grad_mae[grad_mae > 0] = 1.
    grad_mae[grad_mae <= 0] = -1.
    hess_mae = 1.0
    coef = [0.4, 0.3, 0.3]
    # return the weighted gradient and hessian as a tuple
    return (
        coef[0] * grad + coef[1] * grad_rmse + coef[2] * grad_mae,
        coef[0] * hess + coef[1] * hess_rmse + coef[2] * hess_mae,
    )

from flaml.automl.model import LGBMEstimator

class MyLGBM(LGBMEstimator):
    """LGBMEstimator with my_loss_obj as the objective function"""

    def __init__(self, **config):
        super().__init__(objective=my_loss_obj, **config)
```
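A custom objective has to return the first and second derivatives of the loss with respect to the prediction. The first term in `my_loss_obj` has the same form as the derivatives of what is commonly called the Fair loss, `c**2 * (|r|/c - log(1 + |r|/c))`. That identification is an assumption drawn from the formulas and is not stated in the original. The sketch below checks the pair numerically with central differences, using only the standard library; all function names are hypothetical.

```python
import math

C = 0.5  # same constant as `c` in my_loss_obj


def fair_loss(r, c=C):
    """Assumed underlying loss for the first term of my_loss_obj."""
    return c * c * (abs(r) / c - math.log(1.0 + abs(r) / c))


def fair_grad(r, c=C):
    """First derivative, as written in my_loss_obj."""
    return c * r / (abs(r) + c)


def fair_hess(r, c=C):
    """Second derivative, as written in my_loss_obj."""
    return c * c / (abs(r) + c) ** 2


def central_diff(f, x, eps=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


max_grad_err = 0.0
max_hess_err = 0.0
for r in [-2.0, -0.3, 0.4, 1.5]:
    max_grad_err = max(max_grad_err, abs(central_diff(fair_loss, r) - fair_grad(r)))
    max_hess_err = max(max_hess_err, abs(central_diff(fair_grad, r) - fair_hess(r)))

print(max_grad_err, max_hess_err)
```

A check like this is cheap to run before handing a new objective to the tuner, since a sign error in the gradient or a mismatched hessian tends to show up only as poor validation scores.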
#### Add the customized learner and tune it
```python
automl = AutoML()
automl.add_learner(learner_name='my_lgbm', learner_class=MyLGBM)
settings["estimator_list"] = ['my_lgbm'] # change the estimator list
automl.fit(X_train=X_train, y_train=y_train, **settings)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_lightgbm.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_lightgbm.ipynb)

View File

@@ -1,232 +0,0 @@
# AutoML for XGBoost
### Prerequisites for this example
Install the [automl] option.
```bash
pip install "flaml[automl]" matplotlib openml
```
### Use built-in XGBoostSklearnEstimator
```python
from flaml import AutoML
from flaml.automl.data import load_openml_dataset
# Download [houses dataset](https://www.openml.org/d/537) from OpenML. The task is to predict median price of the house in the region based on demographic composition and a state of housing market in the region.
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir='./')
automl = AutoML()
settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": 'r2',  # primary metrics for regression can be chosen from: ['mae','mse','r2']
    "estimator_list": ['xgboost'],  # list of ML learners; we tune XGBoost in this example
    "task": 'regression',  # task type
    "log_file_name": 'houses_experiment.log',  # flaml log file
    "seed": 7654321,  # random seed
}
automl.fit(X_train=X_train, y_train=y_train, **settings)
```
#### Sample output
```
[flaml.automl: 09-29 23:06:46] {1446} INFO - Data split method: uniform
[flaml.automl: 09-29 23:06:46] {1450} INFO - Evaluation method: cv
[flaml.automl: 09-29 23:06:46] {1496} INFO - Minimizing error metric: 1-r2
[flaml.automl: 09-29 23:06:46] {1533} INFO - List of ML learners in AutoML Run: ['xgboost']
[flaml.automl: 09-29 23:06:46] {1763} INFO - iteration 0, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1880} INFO - Estimated sufficient time budget=2621s. Estimated necessary time budget=3s.
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 0.3s, estimator xgboost's best error=2.1267, best estimator xgboost's best error=2.1267
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 1, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 0.5s, estimator xgboost's best error=2.1267, best estimator xgboost's best error=2.1267
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 2, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 0.6s, estimator xgboost's best error=0.8485, best estimator xgboost's best error=0.8485
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 3, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 0.8s, estimator xgboost's best error=0.3799, best estimator xgboost's best error=0.3799
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 4, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 1.0s, estimator xgboost's best error=0.3799, best estimator xgboost's best error=0.3799
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 5, current learner xgboost
[flaml.automl: 09-29 23:06:47] {1952} INFO - at 1.2s, estimator xgboost's best error=0.3799, best estimator xgboost's best error=0.3799
[flaml.automl: 09-29 23:06:47] {1763} INFO - iteration 6, current learner xgboost
[flaml.automl: 09-29 23:06:48] {1952} INFO - at 1.5s, estimator xgboost's best error=0.2992, best estimator xgboost's best error=0.2992
[flaml.automl: 09-29 23:06:48] {1763} INFO - iteration 7, current learner xgboost
[flaml.automl: 09-29 23:06:48] {1952} INFO - at 1.9s, estimator xgboost's best error=0.2992, best estimator xgboost's best error=0.2992
[flaml.automl: 09-29 23:06:48] {1763} INFO - iteration 8, current learner xgboost
[flaml.automl: 09-29 23:06:49] {1952} INFO - at 2.2s, estimator xgboost's best error=0.2992, best estimator xgboost's best error=0.2992
[flaml.automl: 09-29 23:06:49] {1763} INFO - iteration 9, current learner xgboost
[flaml.automl: 09-29 23:06:49] {1952} INFO - at 2.5s, estimator xgboost's best error=0.2513, best estimator xgboost's best error=0.2513
[flaml.automl: 09-29 23:06:49] {1763} INFO - iteration 10, current learner xgboost
[flaml.automl: 09-29 23:06:49] {1952} INFO - at 2.8s, estimator xgboost's best error=0.2513, best estimator xgboost's best error=0.2513
[flaml.automl: 09-29 23:06:49] {1763} INFO - iteration 11, current learner xgboost
[flaml.automl: 09-29 23:06:49] {1952} INFO - at 3.0s, estimator xgboost's best error=0.2513, best estimator xgboost's best error=0.2513
[flaml.automl: 09-29 23:06:49] {1763} INFO - iteration 12, current learner xgboost
[flaml.automl: 09-29 23:06:50] {1952} INFO - at 3.3s, estimator xgboost's best error=0.2113, best estimator xgboost's best error=0.2113
[flaml.automl: 09-29 23:06:50] {1763} INFO - iteration 13, current learner xgboost
[flaml.automl: 09-29 23:06:50] {1952} INFO - at 3.5s, estimator xgboost's best error=0.2113, best estimator xgboost's best error=0.2113
[flaml.automl: 09-29 23:06:50] {1763} INFO - iteration 14, current learner xgboost
[flaml.automl: 09-29 23:06:50] {1952} INFO - at 4.0s, estimator xgboost's best error=0.2090, best estimator xgboost's best error=0.2090
[flaml.automl: 09-29 23:06:50] {1763} INFO - iteration 15, current learner xgboost
[flaml.automl: 09-29 23:06:51] {1952} INFO - at 4.5s, estimator xgboost's best error=0.2090, best estimator xgboost's best error=0.2090
[flaml.automl: 09-29 23:06:51] {1763} INFO - iteration 16, current learner xgboost
[flaml.automl: 09-29 23:06:51] {1952} INFO - at 5.2s, estimator xgboost's best error=0.1919, best estimator xgboost's best error=0.1919
[flaml.automl: 09-29 23:06:51] {1763} INFO - iteration 17, current learner xgboost
[flaml.automl: 09-29 23:06:52] {1952} INFO - at 5.5s, estimator xgboost's best error=0.1919, best estimator xgboost's best error=0.1919
[flaml.automl: 09-29 23:06:52] {1763} INFO - iteration 18, current learner xgboost
[flaml.automl: 09-29 23:06:54] {1952} INFO - at 8.0s, estimator xgboost's best error=0.1797, best estimator xgboost's best error=0.1797
[flaml.automl: 09-29 23:06:54] {1763} INFO - iteration 19, current learner xgboost
[flaml.automl: 09-29 23:06:55] {1952} INFO - at 9.0s, estimator xgboost's best error=0.1797, best estimator xgboost's best error=0.1797
[flaml.automl: 09-29 23:06:55] {1763} INFO - iteration 20, current learner xgboost
[flaml.automl: 09-29 23:07:08] {1952} INFO - at 21.8s, estimator xgboost's best error=0.1797, best estimator xgboost's best error=0.1797
[flaml.automl: 09-29 23:07:08] {1763} INFO - iteration 21, current learner xgboost
[flaml.automl: 09-29 23:07:11] {1952} INFO - at 24.4s, estimator xgboost's best error=0.1797, best estimator xgboost's best error=0.1797
[flaml.automl: 09-29 23:07:11] {1763} INFO - iteration 22, current learner xgboost
[flaml.automl: 09-29 23:07:16] {1952} INFO - at 30.0s, estimator xgboost's best error=0.1782, best estimator xgboost's best error=0.1782
[flaml.automl: 09-29 23:07:16] {1763} INFO - iteration 23, current learner xgboost
[flaml.automl: 09-29 23:07:20] {1952} INFO - at 33.5s, estimator xgboost's best error=0.1782, best estimator xgboost's best error=0.1782
[flaml.automl: 09-29 23:07:20] {1763} INFO - iteration 24, current learner xgboost
[flaml.automl: 09-29 23:07:29] {1952} INFO - at 42.3s, estimator xgboost's best error=0.1782, best estimator xgboost's best error=0.1782
[flaml.automl: 09-29 23:07:29] {1763} INFO - iteration 25, current learner xgboost
[flaml.automl: 09-29 23:07:30] {1952} INFO - at 43.2s, estimator xgboost's best error=0.1782, best estimator xgboost's best error=0.1782
[flaml.automl: 09-29 23:07:30] {1763} INFO - iteration 26, current learner xgboost
[flaml.automl: 09-29 23:07:50] {1952} INFO - at 63.4s, estimator xgboost's best error=0.1663, best estimator xgboost's best error=0.1663
[flaml.automl: 09-29 23:07:50] {2059} INFO - selected model: <xgboost.core.Booster object at 0x7f6399005910>
[flaml.automl: 09-29 23:07:55] {2122} INFO - retrain xgboost for 5.4s
[flaml.automl: 09-29 23:07:55] {2128} INFO - retrained model: <xgboost.core.Booster object at 0x7f6398fc0eb0>
[flaml.automl: 09-29 23:07:55] {1557} INFO - fit succeeded
[flaml.automl: 09-29 23:07:55] {1558} INFO - Time taken to find the best model: 63.427649974823
[flaml.automl: 09-29 23:07:55] {1569} WARNING - Time taken to find the best model is 106% of the provided time budget and not all estimators' hyperparameter search converged. Consider increasing the time budget.
```
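The final warning compares the time at which the best model was found with `time_budget`. The percentage can be reproduced with the arithmetic below. The helper name `budget_share` is hypothetical, and the rounding is an assumption that happens to match both this log and the LightGBM log earlier.

```python
def budget_share(time_to_best, time_budget):
    """Hypothetical helper: percentage of the time budget used to find the best model."""
    return round(100 * time_to_best / time_budget)


# "Time taken to find the best model" values copied from the two sample logs
lgbm_share = budget_share(45.75616669654846, 60)
xgb_share = budget_share(63.427649974823, 60)

print(f"lgbm: {lgbm_share}% of the budget")
print(f"xgboost: {xgb_share}% of the budget")
```

The XGBoost figure exceeds 100% because, according to the log, iteration 26 started at 43.2s, within the 60s budget, and only finished at 63.4s.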
#### Retrieve best config
```python
print('Best hyperparameter config:', automl.best_config)
print('Best r2 on validation data: {0:.4g}'.format(1-automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))
print(automl.model.estimator)
# Best hyperparameter config: {'n_estimators': 473, 'max_leaves': 35, 'max_depth': 0, 'min_child_weight': 0.001, 'learning_rate': 0.26865031351923346, 'subsample': 0.9718245679598786, 'colsample_bylevel': 0.7421362469066445, 'colsample_bytree': 1.0, 'reg_alpha': 0.06824336834995245, 'reg_lambda': 250.9654222583276}
# Best r2 on validation data: 0.8384
# Training duration of best run: 2.194 s
# XGBRegressor(base_score=0.5, booster='gbtree',
# colsample_bylevel=0.7421362469066445, colsample_bynode=1,
# colsample_bytree=1.0, gamma=0, gpu_id=-1, grow_policy='lossguide',
# importance_type='gain', interaction_constraints='',
# learning_rate=0.26865031351923346, max_delta_step=0, max_depth=0,
# max_leaves=35, min_child_weight=0.001, missing=nan,
# monotone_constraints='()', n_estimators=473, n_jobs=-1,
# num_parallel_tree=1, random_state=0, reg_alpha=0.06824336834995245,
# reg_lambda=250.9654222583276, scale_pos_weight=1,
# subsample=0.9718245679598786, tree_method='hist',
# use_label_encoder=False, validate_parameters=1, verbosity=0)
```
#### Plot feature importance
```python
import matplotlib.pyplot as plt
plt.barh(automl.feature_names_in_, automl.feature_importances_)
```
![png](images/xgb_feature_importance.png)
#### Compute predictions of testing dataset
```python
y_pred = automl.predict(X_test)
print('Predicted labels', y_pred)
# Predicted labels [139062.95 237622. 140522.03 ... 182125.5 252156.36 264884.5 ]
```
#### Compute different metric values on testing dataset
```python
from flaml.automl.ml import sklearn_metric_loss_score
print('r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))
print('mse', '=', sklearn_metric_loss_score('mse', y_pred, y_test))
print('mae', '=', sklearn_metric_loss_score('mae', y_pred, y_test))
# r2 = 0.8456494234135888
# mse = 2040284106.2781258
# mae = 30212.830996680445
```
#### Compare with untuned XGBoost
```python
from xgboost import XGBRegressor
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
from flaml.automl.ml import sklearn_metric_loss_score
print('default xgboost r2', '=', 1 - sklearn_metric_loss_score('r2', y_pred, y_test))
# default xgboost r2 = 0.8265451174596482
```
#### Plot learning curve
How does the model accuracy improve as we search for different hyperparameter configurations?
```python
from flaml.automl.data import get_output_from_log
import numpy as np
time_history, best_valid_loss_history, valid_loss_history, config_history, metric_history = \
    get_output_from_log(filename=settings['log_file_name'], time_budget=60)
plt.title('Learning Curve')
plt.xlabel('Wall Clock Time (s)')
plt.ylabel('Validation r2')
plt.step(time_history, 1 - np.array(best_valid_loss_history), where='post')
plt.show()
```
![png](images/xgb_curve.png)
### Use a customized XGBoost learner
You can easily enable a custom objective function by adding a customized XGBoost learner (inherit XGBoostEstimator or XGBoostSklearnEstimator) in FLAML. In the following example, we show how to add such a customized XGBoost learner with a custom objective function.
```python
import numpy as np

# define your customized objective function
def logregobj(preds, dtrain):
    labels = dtrain.get_label()
    preds = 1.0 / (1.0 + np.exp(-preds))  # transform raw leaf weight
    grad = preds - labels
    hess = preds * (1.0 - preds)
    return grad, hess

from flaml.automl.model import XGBoostEstimator

class MyXGB1(XGBoostEstimator):
    '''XGBoostEstimator with the logregobj function as the objective function
    '''

    def __init__(self, **config):
        super().__init__(objective=logregobj, **config)

class MyXGB2(XGBoostEstimator):
    '''XGBoostEstimator with 'reg:gamma' as the objective function
    '''

    def __init__(self, **config):
        super().__init__(objective='reg:gamma', **config)
```
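The `grad` and `hess` returned by `logregobj` are the first and second derivatives of the logistic loss with respect to the raw score, for a label `y` in {0, 1}. The following standard-library sketch verifies those formulas with central differences; the helper names are hypothetical and the scores and labels are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(z, y):
    """Negative log-likelihood of label y given raw score z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def central_diff(f, x, eps=1e-5):
    """Numerical derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


max_err = 0.0
for z, y in [(-1.2, 0.0), (0.3, 1.0), (2.0, 0.0)]:
    p = sigmoid(z)
    grad, hess = p - y, p * (1.0 - p)  # same formulas as logregobj
    num_grad = central_diff(lambda s: logistic_loss(s, y), z)
    num_hess = central_diff(lambda s: sigmoid(s) - y, z)
    max_err = max(max_err, abs(num_grad - grad), abs(num_hess - hess))

print(max_err)
```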
#### Add the customized learners and tune them
```python
automl = AutoML()
automl.add_learner(learner_name='my_xgb1', learner_class=MyXGB1)
automl.add_learner(learner_name='my_xgb2', learner_class=MyXGB2)
settings["estimator_list"] = ['my_xgb1', 'my_xgb2'] # change the estimator list
automl.fit(X_train=X_train, y_train=y_train, **settings)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_xgboost.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_xgboost.ipynb)

View File

@@ -1,109 +0,0 @@
# Default - Flamlized Estimator
Flamlized estimators automatically use data-dependent default hyperparameter configurations for each estimator, offering a unique zero-shot AutoML capability, or "no tuning" AutoML.
## Flamlized LGBMRegressor
### Prerequisites
This example requires the [autozero] option.
```bash
pip install "flaml[autozero]" lightgbm openml
```
### Zero-shot AutoML
```python
from flaml.automl.data import load_openml_dataset
from flaml.default import LGBMRegressor
from flaml.automl.ml import sklearn_metric_loss_score
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir="./")
lgbm = LGBMRegressor()
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict(X_test)
print("flamlized lgbm r2", "=", 1 - sklearn_metric_loss_score("r2", y_pred, y_test))
print(lgbm)
```
#### Sample output
```
load dataset from ./openml_ds537.pkl
Dataset name: houses
X_train.shape: (15480, 8), y_train.shape: (15480,);
X_test.shape: (5160, 8), y_test.shape: (5160,)
flamlized lgbm r2 = 0.8537444671194614
LGBMRegressor(colsample_bytree=0.7019911744574896,
learning_rate=0.022635758411078528, max_bin=511,
min_child_samples=2, n_estimators=4797, num_leaves=122,
reg_alpha=0.004252223402511765, reg_lambda=0.11288241427227624,
verbose=-1)
```
### Suggest hyperparameters without training
```python
from flaml.automl.data import load_openml_dataset
from flaml.default import LGBMRegressor
from flaml.automl.ml import sklearn_metric_loss_score
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=537, data_dir="./")
lgbm = LGBMRegressor()
hyperparams, estimator_name, X_transformed, y_transformed = lgbm.suggest_hyperparams(X_train, y_train)
print(hyperparams)
```
#### Sample output
```
load dataset from ./openml_ds537.pkl
Dataset name: houses
X_train.shape: (15480, 8), y_train.shape: (15480,);
X_test.shape: (5160, 8), y_test.shape: (5160,)
{'n_estimators': 4797, 'num_leaves': 122, 'min_child_samples': 2, 'learning_rate': 0.022635758411078528, 'colsample_bytree': 0.7019911744574896, 'reg_alpha': 0.004252223402511765, 'reg_lambda': 0.11288241427227624, 'max_bin': 511, 'verbose': -1}
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/zeroshot_lightgbm.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/zeroshot_lightgbm.ipynb)
## Flamlized XGBClassifier
### Prerequisites
This example requires xgboost, scikit-learn, and openml==0.10.2.
### Zero-shot AutoML
```python
from flaml.automl.data import load_openml_dataset
from flaml.default import XGBClassifier
from flaml.automl.ml import sklearn_metric_loss_score
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
print("flamlized xgb accuracy", "=", 1 - sklearn_metric_loss_score("accuracy", y_pred, y_test))
print(xgb)
```
#### Sample output
```
load dataset from ./openml_ds1169.pkl
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)
flamlized xgb accuracy = 0.6729009388487608
XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=0.4601573737792679, colsample_bynode=1,
colsample_bytree=1.0, gamma=0, gpu_id=-1, grow_policy='lossguide',
importance_type='gain', interaction_constraints='',
learning_rate=0.04039771837785377, max_delta_step=0, max_depth=0,
max_leaves=159, min_child_weight=0.3396294979905001, missing=nan,
monotone_constraints='()', n_estimators=540, n_jobs=4,
num_parallel_tree=1, random_state=0,
reg_alpha=0.0012362430984376035, reg_lambda=3.093428791531145,
scale_pos_weight=1, subsample=1.0, tree_method='hist',
use_label_encoder=False, validate_parameters=1, verbosity=0)
```

View File

@@ -1,168 +0,0 @@
FLAML can be used together with AzureML. It also works with mlflow for experiment tracking and with ray for distributed runs.
### Prerequisites
Install the [automl,azureml] option.
```bash
pip install "flaml[automl,azureml]"
```
Set up an AzureML workspace:
```python
from azureml.core import Workspace
ws = Workspace.create(name='myworkspace', subscription_id='<azure-subscription-id>', resource_group='myresourcegroup')
```
### Enable mlflow in AzureML workspace
```python
import mlflow
from azureml.core import Workspace
ws = Workspace.from_config()
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
```
### Start an AutoML run
```python
from flaml.automl.data import load_openml_dataset
from flaml import AutoML
# Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure.
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")
automl = AutoML()
settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": "accuracy",  # metric to optimize
    "task": "classification",  # task type
    "log_file_name": "airlines_experiment.log",  # flaml log file
}
experiment = mlflow.set_experiment("flaml") # the experiment name in AzureML workspace
with mlflow.start_run() as run: # create a mlflow run
    automl.fit(X_train=X_train, y_train=y_train, **settings)
    mlflow.sklearn.log_model(automl, "automl")
```
The metrics in the run will be automatically logged in an experiment named "flaml" in your AzureML workspace. They can be retrieved by `mlflow.search_runs`:
```python
mlflow.search_runs(experiment_ids=[experiment.experiment_id], filter_string="params.learner = 'xgboost'")
```
The logged model can be loaded and used to make predictions:
```python
automl = mlflow.sklearn.load_model(f"{run.info.artifact_uri}/automl")
print(automl.predict(X_test))
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_azureml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_azureml.ipynb)
### Use ray to distribute across a cluster
When you have a compute cluster in AzureML, you can distribute `flaml.AutoML` or `flaml.tune` with ray.
#### Build a ray environment in AzureML
Create a docker file such as [.Docker/Dockerfile-cpu](https://github.com/microsoft/FLAML/blob/main/test/.Docker/Dockerfile-cpu). Make sure `RUN pip install flaml[blendsearch,ray]` is included in the docker file.
Then build an AzureML environment in the workspace `ws`.
```python
import time
from azureml.core import Environment

ray_environment_name = "aml-ray-cpu"
ray_environment_dockerfile_path = "./Docker/Dockerfile-cpu"
# Build CPU image for Ray
ray_cpu_env = Environment.from_dockerfile(name=ray_environment_name, dockerfile=ray_environment_dockerfile_path)
ray_cpu_env.register(workspace=ws)
ray_cpu_build_details = ray_cpu_env.build(workspace=ws)
# Poll until the image build finishes
while ray_cpu_build_details.status not in ["Succeeded", "Failed"]:
    print(f"Awaiting completion of ray CPU environment build. Current status is: {ray_cpu_build_details.status}")
    time.sleep(10)
```
You only need to do this step once for one workspace.
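The polling loop above waits indefinitely if the build never reaches a final state. A minimal sketch of the same pattern with a timeout is shown below. The helper `wait_for_status` and its parameters are hypothetical and not part of the AzureML SDK, and the build is simulated with a stand-in status source so that the snippet runs on its own.

```python
import time


def wait_for_status(get_status, done=("Succeeded", "Failed"),
                    timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Hypothetical helper: poll get_status() until it returns a value in `done`."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still '{status}' after {timeout_s}s")
        print(f"Awaiting completion. Current status is: {status}")
        sleep(poll_s)


# Stand-in for a real build: reports "Running" twice, then "Succeeded"
fake_statuses = iter(["Running", "Running", "Succeeded"])
final_status = wait_for_status(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```

With the build object from the snippet above, the call would look like `wait_for_status(lambda: ray_cpu_build_details.status)`.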
#### Create a compute cluster with multiple nodes
```python
from azureml.core.compute import AmlCompute, ComputeTarget
compute_target_name = "cpucluster"
node_count = 2
# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
compute_target_size = "STANDARD_D2_V2"
if compute_target_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_target_name]
    if compute_target and type(compute_target) is AmlCompute:
        if compute_target.provisioning_state == "Succeeded":
            print("Found compute target; using it:", compute_target_name)
        else:
            raise Exception(
                "Found compute target but it is in state", compute_target.provisioning_state)
else:
    print("creating a new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(
        vm_size=compute_target_size,
        min_nodes=0,
        max_nodes=node_count)
    # Create the cluster
    compute_target = ComputeTarget.create(ws, compute_target_name, provisioning_config)
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
```
If the compute target "cpucluster" already exists, it will not be recreated.
#### Run distributed AutoML job
Assume you have an AutoML script like [ray/distribute_automl.py](https://github.com/microsoft/FLAML/blob/main/test/ray/distribute_automl.py). It uses `n_concurrent_trials=k` to tell `AutoML.fit()` to run k trials concurrently.
Submit an AzureML job as follows:
```python
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.core.runconfig import RunConfiguration, DockerConfiguration
command = ["python distribute_automl.py"]
ray_environment_name = "aml-ray-cpu"
env = Environment.get(workspace=ws, name=ray_environment_name)
aml_run_config = RunConfiguration(communicator="OpenMpi")
aml_run_config.target = compute_target
aml_run_config.docker = DockerConfiguration(use_docker=True)
aml_run_config.environment = env
aml_run_config.node_count = 2
config = ScriptRunConfig(
    source_directory="ray/",
    command=command,
    run_config=aml_run_config,
)
exp = Experiment(ws, "distribute-automl")
run = exp.submit(config)
print(run.get_portal_url()) # link to ml.azure.com
run.wait_for_completion(show_output=True)
```
#### Run distributed tune job
Prepare a script like [ray/distribute_tune.py](https://github.com/microsoft/FLAML/blob/main/test/ray/distribute_tune.py). Replace the command in the above example with:
```python
command = ["python distribute_tune.py"]
```
Everything else is the same.

View File

@@ -1,72 +0,0 @@
FLAML's AutoML module can be used as a step in a scikit-learn pipeline, so you get all the benefits of a pipeline.
### Prerequisites
Install the [automl] option.
```bash
pip install "flaml[automl]" openml
```
### Load data
```python
from flaml.automl.data import load_openml_dataset
# Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure.
X_train, X_test, y_train, y_test = load_openml_dataset(
    dataset_id=1169, data_dir='./', random_state=1234, dataset_format='array')
```
### Create a pipeline
```python
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from flaml import AutoML
set_config(display='diagram')
imputer = SimpleImputer()
standardizer = StandardScaler()
automl = AutoML()
automl_pipeline = Pipeline([
    ("imputer", imputer),
    ("standardizer", standardizer),
    ("automl", automl)
])
automl_pipeline
```
![png](images/pipeline.png)
### Run AutoML in the pipeline
```python
automl_settings = {
    "time_budget": 60,  # total running time in seconds
    "metric": "accuracy",  # primary metrics can be chosen from: ['accuracy', 'roc_auc', 'roc_auc_weighted', 'roc_auc_ovr', 'roc_auc_ovo', 'f1', 'log_loss', 'mae', 'mse', 'r2'] Check the documentation for more details (https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#optimization-metric)
    "task": "classification",  # task type
    "estimator_list": ["xgboost", "catboost", "lgbm"],
    "log_file_name": "airlines_experiment.log",  # flaml log file
}
# Prefix each setting with the step name so the pipeline forwards it to the automl step
pipeline_settings = {
    f"automl__{key}": value for key, value in automl_settings.items()
}
automl_pipeline.fit(X_train, y_train, **pipeline_settings)
```
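The `automl__` prefix follows the scikit-learn convention of addressing a pipeline step as `<step name>__<parameter>`, so every setting is forwarded to the `fit` call of the step named `automl`. The sketch below imitates that routing in plain Python. The helper `route_fit_params` is hypothetical and simplified; it is not scikit-learn's implementation.

```python
def route_fit_params(step_names, fit_params):
    """Hypothetical helper: group 'step__param' keyword arguments by pipeline step."""
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        # Split on the first double underscore only
        step, sep, param = key.partition("__")
        if not sep or step not in routed:
            raise ValueError(f"cannot route fit parameter {key!r}")
        routed[step][param] = value
    return routed


automl_settings = {"time_budget": 60, "task": "classification"}
pipeline_settings = {f"automl__{key}": value for key, value in automl_settings.items()}

routed = route_fit_params(["imputer", "standardizer", "automl"], pipeline_settings)
print(routed["automl"])
```

Only the step whose name matches the prefix receives the settings, so the imputer and the scaler are fitted with their defaults.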
### Get the automl object from the pipeline
```python
automl = automl_pipeline.steps[2][1]
# Get the best config and best learner
print('Best ML learner:', automl.best_estimator)
print('Best hyperparameter config:', automl.best_config)
print('Best accuracy on validation data: {0:.4g}'.format(1 - automl.best_loss))
print('Training duration of best run: {0:.4g} s'.format(automl.best_config_train_time))
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_sklearn.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_sklearn.ipynb)

View File

@@ -1,118 +0,0 @@
# Integrate - Spark
FLAML has integrated Spark for distributed training. There are two main aspects of integration with Spark:
- Use Spark ML estimators for AutoML.
- Use Spark to run training in parallel spark jobs.
## Spark ML Estimators
FLAML integrates estimators based on Spark ML models. These models are trained in parallel using Spark, so we call them Spark estimators. To use these models, you first need to organize your data in the required format.
### Data
For Spark estimators, AutoML only consumes Spark data. FLAML provides a convenient function `to_pandas_on_spark` in the `flaml.automl.spark.utils` module to convert your data into a pandas-on-spark (`pyspark.pandas`) dataframe/series, which Spark estimators require.
This utility function takes a `pandas.DataFrame` or `pyspark.sql.DataFrame` and converts it into a pandas-on-spark dataframe. It also takes a `pandas.Series` or `pyspark.sql.DataFrame` and converts it into a [pandas-on-spark](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html) series. If you pass in a `pyspark.pandas.DataFrame`, it is returned unchanged.
This function also accepts optional arguments `index_col` and `default_index_type`.
- `index_col` is the column name to use as the index, default is None.
- `default_index_type` is the default index type, default is "distributed-sequence". More information about the default index type can be found in the official Spark [documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/options.html#default-index-type).
Here is an example code snippet for Spark Data:
```python
import pandas as pd
from flaml.automl.spark.utils import to_pandas_on_spark
# Creating a dictionary
data = {"Square_Feet": [800, 1200, 1800, 1500, 850],
"Age_Years": [20, 15, 10, 7, 25],
"Price": [100000, 200000, 300000, 240000, 120000]}
# Creating a pandas DataFrame
dataframe = pd.DataFrame(data)
label = "Price"
# Convert to pandas-on-spark dataframe
psdf = to_pandas_on_spark(dataframe)
```
To use Spark ML models you need to format your data appropriately. Specifically, use [`VectorAssembler`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.VectorAssembler.html) to merge all feature columns into a single vector column.
Here is an example of how to use it:
```python
from pyspark.ml.feature import VectorAssembler
columns = psdf.columns
feature_cols = [col for col in columns if col != label]
featurizer = VectorAssembler(inputCols=feature_cols, outputCol="features")
psdf = featurizer.transform(psdf.to_spark(index_col="index"))["index", "features"]
```
Later, when running the experiment, use your pandas-on-spark data just like non-Spark data and pass it using `X_train, y_train` or `dataframe, label`.
### Estimators
#### Model List
- `lgbm_spark`: The class for fine-tuning Spark version LightGBM models, using [SynapseML](https://microsoft.github.io/SynapseML/docs/features/lightgbm/about/) API.
#### Usage
First, prepare your data in the required format as described in the previous section.
By including the models you intend to try in the `estimator_list` setting of `flaml.AutoML`, FLAML will start trying configurations for these models. If your input is Spark data, FLAML will also use estimators with the `_spark` suffix by default, even if you haven't specified them.
Here is an example code snippet using SparkML models in AutoML:
```python
import flaml
# prepare your data in pandas-on-spark format as we previously mentioned
automl = flaml.AutoML()
settings = {
"time_budget": 30,
"metric": "r2",
"estimator_list": ["lgbm_spark"], # this setting is optional
"task": "regression",
}
automl.fit(
dataframe=psdf,
label=label,
**settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/automl_bankrupt_synapseml.ipynb)
## Parallel Spark Jobs
You can activate Spark as the parallel backend during parallel tuning in both [AutoML](/docs/Use-Cases/Task-Oriented-AutoML#parallel-tuning) and [Hyperparameter Tuning](/docs/Use-Cases/Tune-User-Defined-Function#parallel-tuning) by setting `use_spark` to `True`. FLAML will dispatch your job to the distributed Spark backend using [`joblib-spark`](https://github.com/joblib/joblib-spark).
Please note that you should not set `use_spark` to `True` when applying AutoML and Tuning to Spark data. This is because only SparkML models will be used for Spark data in AutoML and Tuning. As SparkML models already run in parallel, there is no need to distribute them with `use_spark` again.
All the Spark-related arguments are stated below. These arguments are available in both Hyperparameter Tuning and AutoML:
- `use_spark`: boolean, default=False | Whether to use spark to run the training in parallel spark jobs. This can be used to accelerate training on large models and large datasets, but will incur more overhead in time and thus slow down training in some cases. GPU training is not supported yet when use_spark is True. For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`.
- `n_concurrent_trials`: int, default=1 | The number of concurrent trials. When `n_concurrent_trials` > 1, FLAML performs parallel tuning.
- `force_cancel`: boolean, default=False | Whether to forcibly cancel Spark jobs if the search time exceeds the time budget. Spark jobs include parallel tuning jobs and Spark-based model training jobs.
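The concurrency rule stated for `use_spark` can be sketched in a few lines. This is only an illustration of the rule as described in the list above, not FLAML's internal code. `effective_concurrency` and its `detected_executors` argument are hypothetical names, and the sketch assumes `FLAML_MAX_CONCURRENT` holds an integer.

```python
import os


def effective_concurrency(n_concurrent_trials, detected_executors, environ=os.environ):
    """Return min(n_concurrent_trials, num_executors).

    If FLAML_MAX_CONCURRENT is set, it replaces the detected executor count.
    """
    override = environ.get("FLAML_MAX_CONCURRENT")
    num_executors = int(override) if override else detected_executors
    return min(n_concurrent_trials, num_executors)


# Local mode with one detected executor: the executor count is the bottleneck.
print(effective_concurrency(4, 1, environ={}))  # 1
# Same setup, but the environment variable raises the executor count to 8.
print(effective_concurrency(4, 1, environ={"FLAML_MAX_CONCURRENT": "8"}))  # 4
# Large cluster: n_concurrent_trials is the bottleneck.
print(effective_concurrency(2, 16, environ={}))  # 2
```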
An example code snippet for using parallel Spark jobs:
```python
import flaml
automl_experiment = flaml.AutoML()
automl_settings = {
"time_budget": 30,
"metric": "r2",
"task": "regression",
"n_concurrent_trials": 2,
"use_spark": True,
"force_cancel": True, # Activating the force_cancel option can immediately halt Spark jobs once they exceed the allocated time_budget.
}
automl_experiment.fit(
dataframe=dataframe,
label=label,
**automl_settings,
)
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/integrate_spark.ipynb)


@@ -1,216 +0,0 @@
# Tune - AzureML pipeline
This example uses flaml to tune an Azure ML pipeline that fits a lightgbm classifier on the [sklearn breast cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).
If you already have an Azure ML pipeline, you can use this approach to tune your pipeline with flaml.
## Prepare for tuning
### Requirements
We recommend using conda or venv to create a virtual env to install the dependencies.
```bash
# set up new conda environment
conda create -n pipeline_tune python=3.8 pip=20.2 -y
conda activate pipeline_tune
# install azureml packages for running AzureML pipelines
pip install azureml-core==1.39.0
pip install "azure-ml-component[notebooks]==0.9.10.post1"
pip install azureml-dataset-runtime==1.39.0
# install hydra-core for passing AzureML pipeline parameters
pip install hydra-core==1.1.1
# install flaml
pip install "flaml[blendsearch,ray]==1.0.9"
```
### Azure ML training pipeline
Before we are ready for tuning, we must first have an Azure ML pipeline.
In this example, we use the following toy pipeline for illustration.
The pipeline consists of two steps: (1) data preparation and (2) model training.
![png](images/AzureML_train_pipeline.png)
The [code example](https://github.com/microsoft/FLAML/tree/main/test/pipeline_tuning_example) discussed on this page is included in
`test/pipeline_tuning_example/`.
We will use paths relative to that directory in the rest of the page.
### Data
The example data exists in `data/data.csv`.
It will be uploaded to AzureML workspace to be consumed by the training pipeline
using the following code.
```python
Dataset.File.upload_directory(
src_dir=to_absolute_path(LOCAL_DIR / "data"),
target=(datastore, "classification_data"),
overwrite=True,
)
dataset = Dataset.File.from_files(path=(datastore, 'classification_data'))
```
### Configurations for the pipeline
The pipeline configuration is defined in
`configs/train_config.yaml`.
```yaml
hydra:
  searchpath:
    - file://.
aml_config:
  workspace_name: your_workspace_name
  resource_group: your_resource_group
  subscription_id: your_subscription_id
  cpu_target: cpucluster
train_config:
  exp_name: sklearn_breast_cancer_classification
  test_train_ratio: 0.4
  learning_rate: 0.05
  n_estimators: 50
```
### Define and submit the pipeline
The pipeline is defined in
`submit_train_pipeline.py`.
To submit the pipeline, please specify your AzureML resources
in the `configs/train_config.yaml` and run
```bash
cd test/pipeline_tuning_example
python submit_train_pipeline.py
```
To get the pipeline ready for HPO, in the training step,
we need to log the metrics of interest to AzureML using
```python
run.log(f"{data_name}_{eval_name}", result)
```
## Hyperparameter Optimization
We are now ready to set up the HPO job for the AzureML pipeline, including:
- configure the HPO job,
- set up the interaction between the HPO job and the training job.
These two steps are done in `tuner/tuner_func.py`.
### Set up the tune job
`tuner_func.tune_pipeline` sets up the search space, metric to optimize, mode, etc.
```python
import time

import flaml


def tune_pipeline(concurrent_run=1):
    start_time = time.time()
    # configure the HPO job
    search_space = {
        "train_config.n_estimators": flaml.tune.randint(50, 200),
        "train_config.learning_rate": flaml.tune.uniform(0.01, 0.5),
    }
    hp_metric = "eval_binary_error"
    mode = "min"  # the metric is an error rate, so lower is better
    num_samples = 2
    if concurrent_run > 1:
        import ray  # For parallel tuning

        ray.init(num_cpus=concurrent_run)
        use_ray = True
    else:
        use_ray = False
    # launch the HPO job
    analysis = flaml.tune.run(
        run_with_config,
        config=search_space,
        metric=hp_metric,
        mode=mode,
        num_samples=num_samples,  # number of trials
        use_ray=use_ray,
    )
    # get the best config
    best_trial = analysis.get_best_trial(hp_metric, mode, "all")
    metric = best_trial.metric_analysis[hp_metric][mode]
    print(f"n_trials={len(analysis.trials)}")
    print(f"time={time.time()-start_time}")
    print(f"Best {hp_metric}: {metric:.4f}")
    print(f"Best configuration: {best_trial.config}")
```
### Interact with AzureML pipeline jobs
The interaction between FLAML and AzureML pipeline jobs is in `tuner_func.run_with_config`.
```python
# Assumed module-level imports of tuner/tuner_func.py; submit_train_pipeline
# is the script described in "Define and submit the pipeline".
import time

import submit_train_pipeline
from flaml import tune


def run_with_config(config: dict):
    """Run the pipeline with a given config dict"""
    # pass the hyperparameters to AzureML jobs by overwriting the config file.
    overrides = [f"{key}={value}" for key, value in config.items()]
    print(overrides)
    run = submit_train_pipeline.build_and_submit_aml_pipeline(overrides)
    print(run.get_portal_url())
    # retrieve the metrics to optimize before the job completes.
    stop = False
    while not stop:
        # get status
        status = run._core_run.get_status()
        print(f'status: {status}')
        # get metrics
        metrics = run._core_run.get_metrics(recursive=True)
        if metrics:
            run_metrics = list(metrics.values())
            new_metric = run_metrics[0]['eval_binary_error']
            # a metric logged several times comes back as a list; keep the latest value
            if isinstance(new_metric, list):
                new_metric = new_metric[-1]
            print(f'eval_binary_error: {new_metric}')
            tune.report(eval_binary_error=new_metric)
        time.sleep(5)
        if status == 'FAILED' or status == 'Completed':
            stop = True
    print("The run is terminated.")
    print(status)
    return
```
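To make the hand-off concrete, the sketch below shows what the `overrides` list looks like for one sampled configuration and how dotted keys map onto the nested `train_config` section. The sampled values are made up for illustration, and `apply_overrides` is a toy stand-in written for this page. It is not Hydra's override parser, which supports a much richer grammar.

```python
import ast
import copy


def apply_overrides(base, overrides):
    """Toy stand-in: set "a.b=value" style overrides in a nested dict."""
    result = copy.deepcopy(base)
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = ast.literal_eval(raw)  # this sketch handles numeric values only
    return result


# Defaults taken from configs/train_config.yaml.
base = {"train_config": {"test_train_ratio": 0.4, "learning_rate": 0.05, "n_estimators": 50}}
# Hypothetical configuration sampled by the tuner.
config = {"train_config.n_estimators": 120, "train_config.learning_rate": 0.1}

overrides = [f"{key}={value}" for key, value in config.items()]
print(overrides)  # ['train_config.n_estimators=120', 'train_config.learning_rate=0.1']

merged = apply_overrides(base, overrides)
print(merged["train_config"])
```

Keys that the tuner does not sample, such as `test_train_ratio`, keep their values from the YAML file.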
Overall, to tune the hyperparameters of the AzureML pipeline, run:
```bash
# the training job will run remotely as an AzureML job in both choices
# run the tuning job locally
python submit_tune.py --local
# run the tuning job remotely
python submit_tune.py --remote --subscription_id <your subscription_id> --resource_group <your resource_group> --workspace <your workspace>
```
The local option runs `tuner/tuner_func.py` on your local machine.
The remote option wraps up the `tuner/tuner_func.py` as an AzureML component and
starts another AzureML job to tune the AzureML pipeline.


@@ -1,191 +0,0 @@
# Tune - HuggingFace
This example uses flaml to fine-tune a transformer model from the Hugging Face transformers library.
*Note*: `flaml.AutoML` has built-in support for certain finetuning tasks with a
[higher-level API](AutoML-NLP).
It may be easier to use that API unless you have special requirements not handled by that API.
### Requirements
This example requires GPU. Install dependencies:
```bash
pip install torch transformers datasets "flaml[blendsearch,ray]"
```
### Prepare for tuning
#### Tokenizer
```python
from transformers import AutoTokenizer
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
COLUMN_NAME = "sentence"
def tokenize(examples):
    return tokenizer(examples[COLUMN_NAME], truncation=True)
```
#### Define training method
```python
import flaml
import datasets
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
TASK = "cola"
NUM_LABELS = 2
def train_distilbert(config: dict):
    # Load CoLA dataset and apply tokenizer
    cola_raw = datasets.load_dataset("glue", TASK)
    cola_encoded = cola_raw.map(tokenize, batched=True)
    train_dataset, eval_dataset = cola_encoded["train"], cola_encoded["validation"]
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )
    metric = datasets.load_metric("glue", TASK)
    def compute_metrics(eval_pred):
        # convert logits to predicted class ids before scoring
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
        return metric.compute(predictions=predictions, references=labels)
    training_args = TrainingArguments(
        output_dir='.',
        do_eval=False,
        disable_tqdm=True,
        logging_steps=20000,
        save_total_limit=0,
        **config,
    )
    trainer = Trainer(
        model,
        training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    # train model
    trainer.train()
    # evaluate model
    eval_output = trainer.evaluate()
    # report the metric to optimize & the metric to log
    flaml.tune.report(
        loss=eval_output["eval_loss"],
        matthews_correlation=eval_output["eval_matthews_correlation"],
    )
```
### Define the search
We are now ready to define our search. This includes:
- The `search_space` for our hyperparameters
- The `metric` and the `mode` ('max' or 'min') for optimization
- The resources and constraints (`num_cpus`, `num_gpus`, `num_samples`, and `time_budget_s`)
```python
max_num_epoch = 64
search_space = {
# You can mix constants with search space objects.
"num_train_epochs": flaml.tune.loguniform(1, max_num_epoch),
"learning_rate": flaml.tune.loguniform(1e-6, 1e-4),
"adam_epsilon": flaml.tune.loguniform(1e-9, 1e-7),
"adam_beta1": flaml.tune.uniform(0.8, 0.99),
"adam_beta2": flaml.tune.loguniform(98e-2, 9999e-4),
}
# optimization objective
HP_METRIC, MODE = "matthews_correlation", "max"
# resources
num_cpus = 4
num_gpus = 4 # change according to your GPU resources
# constraints
num_samples = -1 # number of trials, -1 means unlimited
time_budget_s = 3600 # time budget in seconds
```
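Several of the hyperparameters above are declared with `loguniform`, which spreads samples evenly across orders of magnitude rather than evenly across the raw range. The sketch below illustrates that distribution using only the standard library. `sample_loguniform` is a hypothetical helper, not FLAML's sampler, and the search algorithm used for tuning does not simply draw independent random samples.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
epochs = [sample_loguniform(1, 64, rng) for _ in range(1000)]

# 8 is the geometric midpoint of [1, 64], so roughly half the samples fall below it.
share_below_8 = sum(e < 8 for e in epochs) / len(epochs)
print(f"share of samples below 8 epochs: {share_below_8:.2f}")
```

Because the samples are continuous, `num_train_epochs` takes fractional values, which matches the non-integer epoch counts in the trial summary on this page.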
### Launch the tuning
We are now ready to launch the tuning using `flaml.tune.run`:
```python
import time
import ray
ray.init(num_cpus=num_cpus, num_gpus=num_gpus)
print("Tuning started...")
start_time = time.time()  # used later to report the total tuning time
analysis = flaml.tune.run(
train_distilbert,
search_alg=flaml.CFO(
space=search_space,
metric=HP_METRIC,
mode=MODE,
low_cost_partial_config={"num_train_epochs": 1}),
resources_per_trial={"gpu": num_gpus, "cpu": num_cpus},
local_dir='logs/',
num_samples=num_samples,
time_budget_s=time_budget_s,
use_ray=True,
)
```
This will run tuning for one hour. At the end we will see a summary.
```
== Status ==
Memory usage on this node: 32.0/251.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/4 GPUs, 0.0/150.39 GiB heap, 0.0/47.22 GiB objects (0/1.0 accelerator_type:V100)
Result logdir: /home/chiw/FLAML/notebook/logs/train_distilbert_2021-05-07_02-35-58
Number of trials: 22/infinite (22 TERMINATED)
Trial name status loc adam_beta1 adam_beta2 adam_epsilon learning_rate num_train_epochs iter total time (s) loss matthews_correlation
train_distilbert_a0c303d0 TERMINATED 0.939079 0.991865 7.96945e-08 5.61152e-06 1 1 55.6909 0.587986 0
train_distilbert_a0c303d1 TERMINATED 0.811036 0.997214 2.05111e-09 2.05134e-06 1.44427 1 71.7663 0.603018 0
train_distilbert_c39b2ef0 TERMINATED 0.909395 0.993715 1e-07 5.26543e-06 1 1 53.7619 0.586518 0
train_distilbert_f00776e2 TERMINATED 0.968763 0.990019 4.38943e-08 5.98035e-06 1.02723 1 56.8382 0.581313 0
train_distilbert_11ab3900 TERMINATED 0.962198 0.991838 7.09296e-08 5.06608e-06 1 1 54.0231 0.585576 0
train_distilbert_353025b6 TERMINATED 0.91596 0.991892 8.95426e-08 6.21568e-06 2.15443 1 98.3233 0.531632 0.388893
train_distilbert_5728a1de TERMINATED 0.926933 0.993146 1e-07 1.00902e-05 1 1 55.3726 0.538505 0.280558
train_distilbert_9394c2e2 TERMINATED 0.928106 0.990614 4.49975e-08 3.45674e-06 2.72935 1 121.388 0.539177 0.327295
train_distilbert_b6543fec TERMINATED 0.876896 0.992098 1e-07 7.01176e-06 1.59538 1 76.0244 0.527516 0.379177
train_distilbert_0071f998 TERMINATED 0.955024 0.991687 7.39776e-08 5.50998e-06 2.90939 1 126.871 0.516225 0.417157
train_distilbert_2f830be6 TERMINATED 0.886931 0.989628 7.6127e-08 4.37646e-06 1.53338 1 73.8934 0.551629 0.0655887
train_distilbert_7ce03f12 TERMINATED 0.984053 0.993956 8.70144e-08 7.82557e-06 4.08775 1 174.027 0.523732 0.453549
train_distilbert_aaab0508 TERMINATED 0.940707 0.993946 1e-07 8.91979e-06 3.40243 1 146.249 0.511288 0.45085
train_distilbert_14262454 TERMINATED 0.99 0.991696 4.60093e-08 4.83405e-06 3.4954 1 152.008 0.53506 0.400851
train_distilbert_6d211fe6 TERMINATED 0.959277 0.994556 5.40791e-08 1.17333e-05 6.64995 1 271.444 0.609851 0.526802
train_distilbert_c980bae4 TERMINATED 0.99 0.993355 1e-07 5.21929e-06 2.51275 1 111.799 0.542276 0.324968
train_distilbert_6d0d29d6 TERMINATED 0.965773 0.995182 9.9752e-08 1.15549e-05 13.694 1 527.944 0.923802 0.549474
train_distilbert_b16ea82a TERMINATED 0.952781 0.993931 2.93182e-08 1.19145e-05 3.2293 1 139.844 0.533466 0.451307
train_distilbert_eddf7cc0 TERMINATED 0.99 0.997109 8.13498e-08 1.28515e-05 15.5807 1 614.789 0.983285 0.56993
train_distilbert_43008974 TERMINATED 0.929089 0.993258 1e-07 1.03892e-05 12.0357 1 474.387 0.857461 0.520022
train_distilbert_b3408a4e TERMINATED 0.99 0.993809 4.67441e-08 1.10418e-05 11.9165 1 474.126 0.828205 0.526164
train_distilbert_cfbfb220 TERMINATED 0.979454 0.9999 1e-07 1.49578e-05 20.3715
```
### Retrieve the results
```python
best_trial = analysis.get_best_trial(HP_METRIC, MODE, "all")
metric = best_trial.metric_analysis[HP_METRIC][MODE]
print(f"n_trials={len(analysis.trials)}")
print(f"time={time.time()-start_time}")
print(f"Best model eval {HP_METRIC}: {metric:.4f}")
print(f"Best model parameters: {best_trial.config}")
# n_trials=22
# time=3999.769361972809
# Best model eval matthews_correlation: 0.5699
# Best model parameters: {'num_train_epochs': 15.580684188655825, 'learning_rate': 1.2851507818900338e-05, 'adam_epsilon': 8.134982521948352e-08, 'adam_beta1': 0.99, 'adam_beta2': 0.9971094424784387}
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/tune_huggingface.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/tune_huggingface.ipynb)


@@ -1,171 +0,0 @@
# Tune - Lexicographic Objectives
## Requirements
```bash
pip install "flaml>=1.1.0" thop torchvision torch
```
Tuning multiple objectives with lexicographic preference is a new feature added in version 1.1.0 and is subject to change in future versions.
## Tuning accurate and efficient neural networks with lexicographic preference
### Data
```python
import torch
import thop
import torch.nn as nn
from flaml import tune
import torch.nn.functional as F
import torchvision
import numpy as np
import os
DEVICE = torch.device("cpu")
BATCHSIZE = 128
N_TRAIN_EXAMPLES = BATCHSIZE * 30
N_VALID_EXAMPLES = BATCHSIZE * 10
data_dir = os.path.abspath("data")
train_dataset = torchvision.datasets.FashionMNIST(
data_dir,
train=True,
download=True,
transform=torchvision.transforms.ToTensor(),
)
train_loader = torch.utils.data.DataLoader(
torch.utils.data.Subset(train_dataset, list(range(N_TRAIN_EXAMPLES))),
batch_size=BATCHSIZE,
shuffle=True,
)
val_dataset = torchvision.datasets.FashionMNIST(
data_dir, train=False, transform=torchvision.transforms.ToTensor()
)
val_loader = torch.utils.data.DataLoader(
    torch.utils.data.Subset(val_dataset, list(range(N_VALID_EXAMPLES))),
    batch_size=BATCHSIZE,
    shuffle=True,
)
```
### Specify the model
```python
def define_model(configuration):
    n_layers = configuration["n_layers"]
    layers = []
    in_features = 28 * 28
    for i in range(n_layers):
        out_features = configuration["n_units_l{}".format(i)]
        layers.append(nn.Linear(in_features, out_features))
        layers.append(nn.ReLU())
        p = configuration["dropout_{}".format(i)]
        layers.append(nn.Dropout(p))
        in_features = out_features
    layers.append(nn.Linear(in_features, 10))
    layers.append(nn.LogSoftmax(dim=1))
    return nn.Sequential(*layers)
```
### Train
```python
def train_model(model, optimizer, train_loader):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.view(-1, 28 * 28).to(DEVICE), target.to(DEVICE)
        optimizer.zero_grad()
        F.nll_loss(model(data), target).backward()
        optimizer.step()
```
### Metrics
```python
def eval_model(model, valid_loader):
    model.eval()
    correct = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(valid_loader):
            data, target = data.view(-1, 28 * 28).to(DEVICE), target.to(DEVICE)
            pred = model(data).argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
    accuracy = correct / N_VALID_EXAMPLES
    flops, params = thop.profile(
        model, inputs=(torch.randn(1, 28 * 28).to(DEVICE),), verbose=False
    )
    return np.log2(flops), 1 - accuracy, params
```
### Evaluation function
```python
def evaluate_function(configuration):
    model = define_model(configuration).to(DEVICE)
    optimizer = torch.optim.Adam(model.parameters(), configuration["lr"])
    n_epoch = configuration["n_epoch"]
    for epoch in range(n_epoch):
        train_model(model, optimizer, train_loader)
    flops, error_rate, params = eval_model(model, val_loader)
    return {"error_rate": error_rate, "flops": flops, "params": params}
```
### Search space
```python
search_space = {
"n_layers": tune.randint(lower=1, upper=3),
"n_units_l0": tune.randint(lower=4, upper=128),
"n_units_l1": tune.randint(lower=4, upper=128),
"n_units_l2": tune.randint(lower=4, upper=128),
"dropout_0": tune.uniform(lower=0.2, upper=0.5),
"dropout_1": tune.uniform(lower=0.2, upper=0.5),
"dropout_2": tune.uniform(lower=0.2, upper=0.5),
"lr": tune.loguniform(lower=1e-5, upper=1e-1),
"n_epoch": tune.randint(lower=1, upper=20),
}
```
### Launch the tuning process
```python
# Low cost initial point
low_cost_partial_config = {
"n_layers": 1,
"n_units_l0": 4,
"n_units_l1": 4,
"n_units_l2": 4,
"n_epoch": 1,
}
# Specify lexicographic preference
lexico_objectives = {}
lexico_objectives["metrics"] = ["error_rate", "flops"]
lexico_objectives["tolerances"] = {"error_rate": 0.02, "flops": 0.0}
lexico_objectives["targets"] = {"error_rate": 0.0, "flops": 0.0}
lexico_objectives["modes"] = ["min", "min"]
# launch the tuning process
analysis = tune.run(
evaluate_function,
num_samples=-1,
time_budget_s=100,
config=search_space, # search space of NN
use_ray=False,
lexico_objectives=lexico_objectives,
low_cost_partial_config=low_cost_partial_config, # low cost initial point
)
```
We also support providing percentage tolerance as shown below.
```python
lexico_objectives["tolerances"] = {"error_rate": "5%", "flops": "0%"}
```
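One way to read the `metrics`, `tolerances`, and `targets` settings is as a sequence of filters: keep every candidate that is within the tolerance of the best value (or already meets the target) on the first objective, then repeat on the next objective among the survivors. The sketch below is a simplified, after-the-fact illustration of that reading for objectives that are all minimized with absolute tolerances. `lexico_best` and the candidate values are hypothetical, and FLAML applies its own procedure during the search, which may differ in detail.

```python
def lexico_best(results, metrics, tolerances, targets):
    """Pick one result under lexicographic preference (all objectives minimized)."""
    candidates = list(results)
    for name in metrics:
        best = min(r[name] for r in candidates)
        # A candidate survives if it is within the tolerance of the best value
        # or already meets the target for this objective.
        threshold = max(best + tolerances[name], targets[name])
        candidates = [r for r in candidates if r[name] <= threshold]
    # Break any remaining ties by priority order of the objectives.
    return min(candidates, key=lambda r: tuple(r[name] for name in metrics))


# Hypothetical evaluation results (flops on a log2 scale, as in eval_model).
results = [
    {"error_rate": 0.10, "flops": 20.0},
    {"error_rate": 0.11, "flops": 15.0},
    {"error_rate": 0.15, "flops": 10.0},
]
metrics = ["error_rate", "flops"]
targets = {"error_rate": 0.0, "flops": 0.0}

best = lexico_best(results, metrics, {"error_rate": 0.02, "flops": 0.0}, targets)
print(best)  # {'error_rate': 0.11, 'flops': 15.0}

strict = lexico_best(results, metrics, {"error_rate": 0.0, "flops": 0.0}, targets)
print(strict)  # {'error_rate': 0.1, 'flops': 20.0}
```

With a tolerance of 0.02 on `error_rate`, the second candidate wins because it is nearly as accurate and cheaper. With zero tolerance, only the most accurate candidate survives the first filter.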
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/tune_lexicographic.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/tune_lexicographic.ipynb)


@@ -1,287 +0,0 @@
# Tune - PyTorch
This example uses flaml to tune a PyTorch model on CIFAR10.
## Prepare for tuning
### Requirements
```bash
pip install torchvision "flaml[blendsearch,ray]"
```
Before we are ready for tuning, we first need to define the neural network that we would like to tune.
### Network Specification
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import random_split
import torchvision
import torchvision.transforms as transforms
class Net(nn.Module):
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```
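The `16 * 5 * 5` input size of `fc1` follows from the shapes of the layers before it. The short check below walks a 32x32 CIFAR10 image through the two convolution and pooling stages using the standard output-size formulas. `conv_out` and `pool_out` are helper names introduced here for illustration only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of max pooling along one dimension."""
    return (size - kernel) // stride + 1


size = 32  # CIFAR10 images are 32x32
size = pool_out(conv_out(size, 5), 2, 2)  # conv1 + pool: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2, 2)  # conv2 + pool: 14 -> 10 -> 5
flattened = 16 * size * size  # conv2 has 16 output channels
print(size, flattened)  # 5 400
```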
### Data
```python
def load_data(data_dir="data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform)
    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform)
    return trainset, testset
```
### Training
```python
import logging
import os
from ray import tune
logger = logging.getLogger(__name__)
def train_cifar(config, checkpoint_dir=None, data_dir=None):
    if "l1" not in config:
        logger.warning(config)
    net = Net(2**config["l1"], 2**config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)
    net.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)
    # The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
    # should be restored.
    if checkpoint_dir:
        checkpoint = os.path.join(checkpoint_dir, "checkpoint")
        model_state, optimizer_state = torch.load(checkpoint)
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)
    trainset, testset = load_data(data_dir)
    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])
    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(2**config["batch_size"]),
        shuffle=True,
        num_workers=4)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(2**config["batch_size"]),
        shuffle=True,
        num_workers=4)
    for epoch in range(int(round(config["num_epochs"]))):  # loop over the dataset multiple times
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            epoch_steps += 1
            if i % 2000 == 1999:  # print every 2000 mini-batches
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0
        # Validation loss
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1
        # Here we save a checkpoint. It is automatically registered with
        # Ray Tune and will potentially be passed as the `checkpoint_dir`
        # parameter in future iterations.
        with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save(
                (net.state_dict(), optimizer.state_dict()), path)
        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
    print("Finished Training")
```
### Test Accuracy
```python
def _test_accuracy(net, device="cpu"):
    trainset, testset = load_data()
    testloader = torch.utils.data.DataLoader(
        testset, batch_size=4, shuffle=False, num_workers=2)
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total
```
## Hyperparameter Optimization
```python
import numpy as np
import flaml
import os
data_dir = os.path.abspath("data")
load_data(data_dir) # Download data for all trials before starting the run
```
### Search space
```python
max_num_epoch = 100
config = {
"l1": tune.randint(2, 9), # log transformed with base 2
"l2": tune.randint(2, 9), # log transformed with base 2
"lr": tune.loguniform(1e-4, 1e-1),
"num_epochs": tune.loguniform(1, max_num_epoch),
"batch_size": tune.randint(1, 5) # log transformed with base 2
}
```
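Three of these entries are exponents rather than sizes: the training function computes `2**config["l1"]`, `2**config["l2"]`, and `2**config["batch_size"]`, and rounds `num_epochs` to an integer. The sketch below decodes a configuration the same way, using values rounded from the sample output on this page. `decode` is a hypothetical helper, not part of the example code.

```python
def decode(config):
    """Translate exponent-valued entries into the sizes used during training."""
    return {
        "l1_units": 2 ** config["l1"],
        "l2_units": 2 ** config["l2"],
        "batch_size": int(2 ** config["batch_size"]),
        "num_epochs": int(round(config["num_epochs"])),
    }


# Values rounded from the best trial in the sample output.
best = {"l1": 8, "l2": 8, "lr": 0.00088, "num_epochs": 55.95, "batch_size": 3}
print(decode(best))
# {'l1_units': 256, 'l2_units': 256, 'batch_size': 8, 'num_epochs': 56}
```

Searching over the exponent keeps the candidate sizes evenly spaced on a log scale, so small and large layers are explored equally often.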
### Budget and resource constraints
```python
time_budget_s = 600 # time budget in seconds
gpus_per_trial = 0.5 # number of gpus for each trial; 0.5 means two training jobs can share one gpu
num_samples = 500 # maximal number of trials
np.random.seed(7654321)
```
### Launch the tuning
```python
import time
start_time = time.time()
result = flaml.tune.run(
tune.with_parameters(train_cifar, data_dir=data_dir),
config=config,
metric="loss",
mode="min",
low_cost_partial_config={"num_epochs": 1},
max_resource=max_num_epoch,
min_resource=1,
scheduler="asha", # Use asha scheduler to perform early stopping based on intermediate results reported
resources_per_trial={"cpu": 1, "gpu": gpus_per_trial},
local_dir='logs/',
num_samples=num_samples,
time_budget_s=time_budget_s,
use_ray=True)
```
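The idea behind early stopping with successive halving can be sketched as follows: trials are compared at a few resource levels (rungs) between `min_resource` and `max_resource`, and only the best fraction continues past each rung. The sketch assumes a reduction factor of 4, which is an illustrative value and not necessarily the one used by the `asha` scheduler here. It also shows the synchronous version of the idea, whereas ASHA makes promotion decisions asynchronously. `rung_levels` and `survivors` are hypothetical helpers.

```python
def rung_levels(min_resource, max_resource, reduction_factor):
    """Resource levels (epochs here) at which trials are compared."""
    levels = []
    level = min_resource
    while level <= max_resource:
        levels.append(level)
        level *= reduction_factor
    return levels


def survivors(losses, reduction_factor):
    """Keep the best 1/reduction_factor of the trials at a rung (lower loss is better)."""
    keep = max(1, len(losses) // reduction_factor)
    return sorted(losses)[:keep]


print(rung_levels(1, 100, 4))  # [1, 4, 16, 64]
print(survivors([0.9, 0.5, 0.7, 0.6, 1.2, 0.8, 0.4, 1.0], 4))  # [0.4, 0.5]
```

This is why `train_cifar` reports the validation loss after every epoch: the scheduler needs intermediate results to decide which trials to stop.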
### Check the result
```python
print(f"#trials={len(result.trials)}")
print(f"time={time.time()-start_time}")
best_trial = result.get_best_trial("loss", "min", "all")
print("Best trial config: {}".format(best_trial.config))
print("Best trial final validation loss: {}".format(
best_trial.metric_analysis["loss"]["min"]))
print("Best trial final validation accuracy: {}".format(
best_trial.metric_analysis["accuracy"]["max"]))
best_trained_model = Net(2**best_trial.config["l1"],
2**best_trial.config["l2"])
device = "cpu"
if torch.cuda.is_available():
    device = "cuda:0"
    if gpus_per_trial > 1:
        best_trained_model = nn.DataParallel(best_trained_model)
best_trained_model.to(device)
checkpoint_value = getattr(best_trial.checkpoint, "dir_or_data", None) or best_trial.checkpoint.value
checkpoint_path = os.path.join(checkpoint_value, "checkpoint")
model_state, optimizer_state = torch.load(checkpoint_path)
best_trained_model.load_state_dict(model_state)
test_acc = _test_accuracy(best_trained_model, device)
print("Best trial test set accuracy: {}".format(test_acc))
```
### Sample of output
```
#trials=44
time=1193.913584947586
Best trial config: {'l1': 8, 'l2': 8, 'lr': 0.0008818671030627281, 'num_epochs': 55.9513429004283, 'batch_size': 3}
Best trial final validation loss: 1.0694482081472874
Best trial final validation accuracy: 0.6389
Files already downloaded and verified
Files already downloaded and verified
Best trial test set accuracy: 0.6294
```
[Link to notebook](https://github.com/microsoft/FLAML/blob/main/notebook/tune_pytorch.ipynb) | [Open in colab](https://colab.research.google.com/github/microsoft/FLAML/blob/main/notebook/tune_pytorch.ipynb)
