Enhance Integration with Spark (#1097)

* add doc for Spark

* `labelCol` equals `label` by default

* change title and reformat

* reference about default index type

* fix doc build

* Update website/docs/Examples/Integrate - Spark.md

* update doc

* Added more references

* remove exception case when `y_train.name` is None

* fix broken link

---------

Co-authored-by: Wendong Li <v-wendongli@microsoft.com>
Co-authored-by: Li Jiang <bnujli@gmail.com>
Author: levscaut
Date: 2023-07-10 12:44:01 +08:00
Committed by: GitHub
Parent: 9c09090bf9
Commit: 5eece5c748
5 changed files with 129 additions and 8 deletions


@@ -420,7 +420,7 @@ An example of using Spark for parallel tuning is:
```python
automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)
```
-For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`. Also, GPU training is not yet supported when `use_spark` is True.
+Details about parallel tuning with Spark can be found [here](../Examples/Integrate%20-%20Spark#parallel-spark-jobs). For Spark clusters, by default, we will launch one trial per executor. However, sometimes we want to launch more trials than the number of executors (e.g., local mode). In this case, we can set the environment variable `FLAML_MAX_CONCURRENT` to override the detected `num_executors`. The final number of concurrent trials will be the minimum of `n_concurrent_trials` and `num_executors`. Also, GPU training is not yet supported when `use_spark` is True.
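A minimal sketch of this override, assuming the `automl`, `X_train`, and `y_train` objects from the snippet above (the concurrency value of 4 is arbitrary):
```python
import os

# Override the executor count FLAML detects (useful e.g. in Spark local mode).
# The effective concurrency is min(n_concurrent_trials, num_executors).
os.environ["FLAML_MAX_CONCURRENT"] = "4"

automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)
```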
#### **Guidelines on parallel vs sequential tuning**


@@ -293,6 +293,8 @@ Related arguments:
- `use_spark`: A boolean of whether to use Spark as the backend.
- `resources_per_trial`: A dictionary of the hardware resources to allocate per trial, e.g., `{'cpu': 1}`. Only valid when using the ray backend.
Details about parallel tuning with Spark can be found [here](../Examples/Integrate%20-%20Spark#parallel-spark-jobs).
You can perform parallel tuning by specifying `use_ray=True` (requires the `flaml[ray]` option to be installed) or `use_spark=True`
(requires the `flaml[spark]` option to be installed). You can also limit the amount of resources allocated per trial by specifying `resources_per_trial`,
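For illustration, a hedged sketch of parallel tuning through `flaml.tune` with the Spark backend; the search space and objective below are placeholders, not part of this commit:
```python
from flaml import tune

def evaluate_config(config):
    # Placeholder objective: score a sampled configuration.
    return {"score": (config["x"] - 8) ** 2}

analysis = tune.run(
    evaluate_config,
    config={"x": tune.randint(lower=1, upper=100)},
    metric="score",
    mode="min",
    num_samples=20,
    use_spark=True,          # requires the flaml[spark] option installed
    n_concurrent_trials=2,   # trials run in parallel on Spark executors
)
print(analysis.best_config)
```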