Data Science Agent
Automated Feature Engineering & Model Tuning Evolution
The Data Science Agent automatically performs feature engineering and model tuning. It can be applied to a wide range of data science problems, such as image classification, time series forecasting, and text classification.
Introduction
In this scenario, our automated system proposes hypotheses, chooses actions, implements code, conducts validation, and incorporates feedback in a continuous, iterative process.
The goal is to automatically optimize performance metrics on the validation set or the Kaggle leaderboard, ultimately discovering the most effective features and models through autonomous research and development.
Here's an outline of the steps (a minimal code sketch of the full loop follows the list):
Step 1: Hypothesis Generation
Generate and propose initial hypotheses based on previous experiment analysis and domain expertise, with thorough reasoning and empirical justification.
Step 2: Experiment Creation
Transform the hypothesis into a task.
Choose a specific action within feature engineering or model tuning.
Develop, define, and implement a new feature or model, including its name, description, and formulation.
Step 3: Model/Feature Implementation
Implement the model code based on the detailed description.
Evolve the model iteratively as a developer would, ensuring accuracy and efficiency.
Step 4: Validation on Test Set or Kaggle
Validate the newly developed model using the test set or Kaggle dataset.
Assess the modelβs effectiveness and performance based on the validation results.
Step 5: Feedback Analysis
Analyze validation results to assess performance.
Use insights to refine hypotheses and enhance the model.
Step 6: Hypothesis Refinement
Adjust hypotheses based on validation feedback.
Iterate the process to continuously improve the model.
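The sketch below illustrates this propose-implement-validate-feedback loop in miniature. All names are hypothetical stand-ins, not RD-Agent APIs; in RD-Agent each step is driven by an LLM rather than by the toy stubs used here.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Accumulated (hypothesis, score) feedback across iterations."""
    feedback: list = field(default_factory=list)


def propose(trace: Trace) -> dict:
    # Step 1: form a hypothesis from prior feedback (LLM-driven in RD-Agent)
    return {"idea": f"feature/model variant #{len(trace.feedback)}"}


def implement(hypothesis: dict):
    # Steps 2-3: turn the hypothesis into runnable code (dummy "model" here)
    return lambda: 0.5 + 0.01 * len(hypothesis["idea"])


def run_loop(n_iters: int = 3):
    trace = Trace()
    for _ in range(n_iters):
        hypo = propose(trace)
        model = implement(hypo)
        score = model()  # Step 4: validate on held-out data
        trace.feedback.append((hypo, score))  # Steps 5-6: analyze and refine
    return max(trace.feedback, key=lambda t: t[1])  # best solution so far


print(run_loop())
```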
Data Science Background
In the evolving landscape of artificial intelligence, Data Science represents a powerful paradigm where machines engage in autonomous exploration, hypothesis testing, and model development across diverse domains, from healthcare and finance to logistics and research.
The Data Science Agent stands as a central engine in this transformation, enabling users to automate the entire machine learning workflow, from hypothesis generation to code implementation, validation, and refinement, all guided by performance feedback.
By leveraging the Data Science Agent, researchers and developers can accelerate experimentation cycles. Whether fine-tuning custom models or competing in high-stakes benchmarks like Kaggle, the Data Science Agent unlocks new frontiers in intelligent, self-directed discovery.
Example Guide - Customized Dataset
Set up the RD-Agent Environment
Before you start, please make sure you have installed RD-Agent and configured its environment correctly. For installation and configuration instructions, please refer to the documentation.
Setting the environment variables in the .env file
Determine the path where the data will be stored and add it to the `.env` file.

```
dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
```
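After running these commands, the `.env` file should contain entries like the following (the directory is illustrative):

```
DS_LOCAL_DATA_PATH=/home/user/ds_data
DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen
```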
Prepare Customized Datasets
A data science competition dataset usually consists of two parts: the `competition dataset` and the `evaluation dataset`. (We provide a sample customized dataset, named `arf-12-hours-prediction-task`, as a reference.)

- The `competition dataset` contains the training data, test data, description file, formatted submission file, and data sampling code.
- The `evaluation dataset` contains the standard answer file, the data checking code, and the code for calculating scores.

We use the `arf-12-hours-prediction-task` data as a sample to introduce the preparation workflow for the competition dataset.
Create a `ds_data/source_data/arf-12-hours-prediction-task` folder, which will be used to store your raw dataset.

The raw data for the competition `arf-12-hours-prediction-task` consists of two files: `ARF_12h.csv` and `X.npz`.

Create a `ds_data/source_data/arf-12-hours-prediction-task/prepare.py` file that splits your raw data into training data, test data, a formatted submission file, and a standard answer file. (You will need to write this script based on your own raw data.)

The following shows the preprocessing code for the raw data of `arf-12-hours-prediction-task`.

ds_data/source_data/arf-12-hours-prediction-task/prepare.py:

```python
import random
from pathlib import Path

import numpy as np
import pandas as pd
import sparse

CURRENT_DIR = Path(__file__).resolve().parent
ROOT_DIR = CURRENT_DIR.parent.parent

raw_feature_path = CURRENT_DIR / "X.npz"
raw_label_path = CURRENT_DIR / "ARF_12h.csv"

public = ROOT_DIR / "arf-12-hours-prediction-task"
private = ROOT_DIR / "eval" / "arf-12-hours-prediction-task"

if not (public / "test").exists():
    (public / "test").mkdir(parents=True, exist_ok=True)

if not (public / "train").exists():
    (public / "train").mkdir(parents=True, exist_ok=True)

if not private.exists():
    private.mkdir(parents=True, exist_ok=True)

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

X_sparse = sparse.load_npz(raw_feature_path)  # COO matrix, shape: [N, D, T]
df_label = pd.read_csv(raw_label_path)  # Contains column 'ARF_LABEL'
N = X_sparse.shape[0]

indices = np.arange(N)
np.random.shuffle(indices)
split = int(0.7 * N)
train_idx, test_idx = indices[:split], indices[split:]

X_train = X_sparse[train_idx]
X_test = X_sparse[test_idx]

df_train = df_label.iloc[train_idx].reset_index(drop=True)
df_test = df_label.iloc[test_idx].reset_index(drop=True)

submission_df = df_test.copy()
submission_df["ARF_LABEL"] = 0
submission_df.drop(submission_df.columns.difference(["ID", "ARF_LABEL"]), axis=1, inplace=True)
submission_df.to_csv(public / "sample_submission.csv", index=False)

df_test.to_csv(private / "submission_test.csv", index=False)

df_test.drop(["ARF_LABEL"], axis=1, inplace=True)
df_test.to_csv(public / "test" / "ARF_12h.csv", index=False)
sparse.save_npz(public / "test" / "X.npz", X_test)

sparse.save_npz(public / "train" / "X.npz", X_train)
df_train.to_csv(public / "train" / "ARF_12h.csv", index=False)

assert (
    X_train.shape[0] == df_train.shape[0]
), f"Mismatch: X_train rows ({X_train.shape[0]}) != df_train rows ({df_train.shape[0]})"
assert (
    X_test.shape[0] == df_test.shape[0]
), f"Mismatch: X_test rows ({X_test.shape[0]}) != df_test rows ({df_test.shape[0]})"
assert df_test.shape[1] == 2, "Public test set should have 2 columns"
assert df_train.shape[1] == 3, "Public train set should have 3 columns"
assert len(df_train) + len(df_test) == len(
    df_label
), "Length of df_train and df_test should equal length of df_label"
```
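Assuming the folder layout above, the split can then be produced by running the script once, for example:

```
cd ds_data/source_data/arf-12-hours-prediction-task
python prepare.py
```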
At the end of program execution, the `ds_data` folder structure will look like this:

```
ds_data
├── arf-12-hours-prediction-task
│   ├── train
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   ├── test
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   └── sample_submission.csv
├── eval
│   └── arf-12-hours-prediction-task
│       └── submission_test.csv
└── source_data
    └── arf-12-hours-prediction-task
        ├── ARF_12h.csv
        ├── prepare.py
        └── X.npz
```

Create a `ds_data/arf-12-hours-prediction-task/description.md` file to describe your competition: its objective, dataset, and other information.
The following shows the description file for `arf-12-hours-prediction-task`.

ds_data/arf-12-hours-prediction-task/description.md:

```markdown
# Competition name: ARF 12-Hour Prediction Task

## Overview

### Description

Acute Respiratory Failure (ARF) is a life-threatening condition that often develops rapidly in critically ill patients. Accurate early prediction of ARF is crucial in intensive care units (ICUs) to enable timely clinical interventions and resource allocation. In this task, you are asked to build a machine learning model that predicts whether a patient will develop ARF within the next **12 hours**, based on multivariate clinical time series data.

The dataset is extracted from electronic health records (EHRs) and preprocessed using the **FIDDLE** pipeline to generate structured temporal features for each patient.

### Objective

**Your Goal** is to develop a binary classification model that takes a 12-hour time series as input and predicts whether ARF will occur (1) or not (0) in the following 12 hours.

---

## Data Description

1. train/ARF_12h.csv: A CSV file containing the ICU stay ID, the hour of ARF onset, and the binary label indicating whether ARF will occur in the next 12 hours.

   * Columns: ID, ARF_ONSET_HOUR, ARF_LABEL

2. train/X.npz: N × T × D sparse tensor containing time-dependent features.

   * N: Number of samples (number of ICU stays)
   * T: Time steps (12 hours of records per sample)
   * D: Dynamic feature dimension (how many features per hour)

3. test/ARF_12h.csv: Ground truth labels (used for evaluation only).

4. test/X.npz: Test feature set in the same format as the training data.

---

## Data Usage Notes

To load the features, you need Python and the sparse package.

    import sparse

    X = sparse.load_npz("<url>/X.npz").todense()

To load the labels, use pandas or an alternative CSV reader.

    import pandas as pd

    df = pd.read_csv("<url>/ARF_12h.csv")

---

## Modeling

Each sample is a 12-hour multivariate time series of ICU patient observations, represented as a tensor of shape (12, D).
The goal is to predict whether the patient will develop ARF (1) or not (0) in the following 12 hours.

* **Input**: 12 × D matrix of clinical features
* **Output**: Binary prediction: 0 (no ARF) or 1 (ARF onset)
* **Loss Function**: BCEWithLogitsLoss, CrossEntropyLoss, or equivalent
* **Evaluation Metric**: **AUROC** (Area Under the Receiver Operating Characteristic Curve)

Note: Although the output is binary, AUROC evaluates the ranking quality of predicted scores. Therefore, your model should output a confidence score during training, which is then thresholded to produce 0 or 1 for the final submission.

---

## Evaluation

### Area Under the Receiver Operating Characteristic Curve (AUROC)

Submissions are scored according to the area under the receiver operating characteristic curve. AUROC is defined as:

$$
\text{AUROC} = \frac{1}{|P| \cdot |N|} \sum_{i \in P} \sum_{j \in N} \left[ \mathbb{1}(s_i > s_j) + \frac{1}{2} \cdot \mathbb{1}(s_i = s_j) \right]
$$

AUROC reflects the model's ability to rank positive samples higher than negative ones. A score of 1.0 means perfect discrimination, and 0.5 means random guessing.

### Submission Format

For each `ID` in the ARF_12h.csv file of the test dataset, you must predict whether ARF will occur (label = 1) or not (label = 0) in the following 12 hours (ARF_LABEL), based on X.npz (a sparse tensor of time-varying features). The file should contain the following format:

    ID,ARF_LABEL
    246505,0
    291335,0
    286713,0
    etc.

Note: Although the submission is binary, AUROC evaluates the ranking quality of your model. It is recommended to output probabilities during training and apply a threshold (e.g., 0.5) to convert them to binary labels for submission.

---
```

Create a `ds_data/arf-12-hours-prediction-task/sample.py` file to construct the debugging sample data.
The following shows the script for constructing the debugging sample data, based on the `arf-12-hours-prediction-task` dataset implementation.

ds_data/arf-12-hours-prediction-task/sample.py:

```python
import shutil
from pathlib import Path

import numpy as np
import pandas as pd
import sparse
from tqdm import tqdm


def sample_and_copy_subfolder(
    input_dir: Path,
    output_dir: Path,
    min_frac: float,
    min_num: int,
    seed: int = 42,
):
    np.random.seed(seed)

    feature_path = input_dir / "X.npz"
    label_path = input_dir / "ARF_12h.csv"

    # Load sparse features and label
    X_sparse = sparse.load_npz(feature_path)
    df_label = pd.read_csv(label_path)

    N = X_sparse.shape[0]
    n_keep = max(int(N * min_frac), min_num)
    idx = np.random.choice(N, n_keep, replace=False)

    X_sample = X_sparse[idx]
    df_sample = df_label.iloc[idx].reset_index(drop=True)

    output_dir.mkdir(parents=True, exist_ok=True)
    sparse.save_npz(output_dir / "X.npz", X_sample)
    df_sample.to_csv(output_dir / "ARF_12h.csv", index=False)

    print(f"[INFO] Sampled {n_keep} of {N} from {input_dir.name}")

    # Copy additional files
    for f in input_dir.glob("*"):
        if f.name not in {"X.npz", "ARF_12h.csv"} and f.is_file():
            shutil.copy(f, output_dir / f.name)
            print(f"[COPY] Extra file: {f.name}")


def copy_other_file(source: Path, target: Path):
    for item in source.iterdir():
        if item.name in {"train", "test"}:
            continue

        relative_path = item.relative_to(source)
        target_path = target / relative_path

        if item.is_dir():
            shutil.copytree(item, target_path, dirs_exist_ok=True)
            print(f"[COPY DIR] {item} -> {target_path}")
        elif item.is_file():
            target_path.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, target_path)
            print(f"[COPY FILE] {item} -> {target_path}")


def create_debug_data(
    dataset_path: str,
    output_path: str,
    min_frac: float = 0.02,
    min_num: int = 10,
):
    dataset_root = Path(dataset_path) / "arf-12-hours-prediction-task"
    output_root = Path(output_path)

    for sub in ["train", "test"]:
        input_dir = dataset_root / sub
        output_dir = output_root / sub
        print(f"\n[PROCESS] {sub} subset")
        sample_and_copy_subfolder(
            input_dir=input_dir,
            output_dir=output_dir,
            min_frac=min_frac,
            min_num=min_num,
            seed=42 if sub == "train" else 123,
        )
    print(dataset_root.resolve())
    print(output_root.resolve())
    copy_other_file(source=dataset_root, target=output_root)

    print(f"\n[INFO] Sampling complete. Output in: {output_root}")


if __name__ == "__main__" or globals().get("__name__") == "<run_path>":
    dataset_path = globals().get("dataset_path", "./")
    output_path = globals().get("output_path", "./sample")
    create_debug_data(
        dataset_path=dataset_path,
        output_path=output_path,
        min_frac=0.02,
        min_num=10,
    )
```
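Note that the entry-point guard also matches `"<run_path>"`, which is the default module name used by `runpy.run_path`; this lets `dataset_path` and `output_path` be injected as globals. A small illustrative driver (the paths below are placeholders, not part of the dataset specification):

```python
import runpy

# Execute sample.py through runpy so that the `globals().get(...)` lookups
# in its entry point pick up our custom paths. Run this from the directory
# that contains ds_data/.
runpy.run_path(
    "ds_data/arf-12-hours-prediction-task/sample.py",
    init_globals={
        "dataset_path": "ds_data",        # folder containing arf-12-hours-prediction-task/
        "output_path": "ds_data/sample",  # where the sampled subset is written
    },
)
```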
Create a `ds_data/eval/arf-12-hours-prediction-task/valid.py` file, which is used to check the validity of the submission file and ensure that its format is consistent with the reference file.

The following shows a script that checks the validity of a submission based on the `arf-12-hours-prediction-task` data.

ds_data/eval/arf-12-hours-prediction-task/valid.py:

```python
from pathlib import Path

# Check if our submission file exists
assert Path("submission.csv").exists(), "Error: submission.csv not found"

submission_lines = Path("submission.csv").read_text().splitlines()
test_lines = Path("submission_test.csv").read_text().splitlines()

is_valid = len(submission_lines) == len(test_lines)

if is_valid:
    message = "submission.csv and submission_test.csv have the same number of lines."
else:
    message = (
        f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
    )

print(message)

if not is_valid:
    raise AssertionError("Submission is invalid")
```
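The script above only compares line counts. If you want a stricter check, a header comparison can be added; this optional sketch assumes the `ID,ARF_LABEL` submission format described in `description.md`:

```python
import pandas as pd

# Optional stricter check: verify the submission header as well.
submission = pd.read_csv("submission.csv")
expected_cols = ["ID", "ARF_LABEL"]
assert list(submission.columns) == expected_cols, f"Expected columns {expected_cols}, got {list(submission.columns)}"
```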
Create a `ds_data/eval/arf-12-hours-prediction-task/grade.py` file, which is used to calculate the score from the submission file and the standard answer file, and to output the result in JSON format.

The following shows a grading script based on the `arf-12-hours-prediction-task` data implementation.

ds_data/eval/arf-12-hours-prediction-task/grade.py:

```python
import json

import pandas as pd
from sklearn.metrics import roc_auc_score


class InvalidSubmissionError(Exception):
    """A custom exception for when the agent submission cannot be graded."""

    pass


def prepare_for_auroc_metric(submission: pd.DataFrame, answers: pd.DataFrame, id_col: str, target_col: str) -> dict:

    # Answers checks
    assert id_col in answers.columns, f"answers dataframe should have an {id_col} column"
    assert target_col in answers.columns, f"answers dataframe should have a {target_col} column"

    # Submission checks
    if id_col not in submission.columns:
        raise InvalidSubmissionError(f"Submission should have an {id_col} column")
    if target_col not in submission.columns:
        raise InvalidSubmissionError(f"Submission should have a {target_col} column")
    if len(submission) != len(answers):
        raise InvalidSubmissionError("Submission and answers should have the same number of rows")
    try:
        pd.to_numeric(submission[target_col])
    except ValueError:
        raise InvalidSubmissionError(
            f"Expected {target_col} column to be numeric, got {submission[target_col].dtype} instead"
        )
    if submission[target_col].min() < 0 or submission[target_col].max() > 1:
        raise InvalidSubmissionError(
            f"Submission {target_col} column should contain probabilities,"
            " and therefore contain values between 0 and 1 inclusive"
        )
    # Sort
    submission = submission.sort_values(id_col)
    answers = answers.sort_values(id_col)

    if (submission[id_col].values != answers[id_col].values).any():
        raise InvalidSubmissionError(f"Submission and answers should have the same {id_col} values")

    roc_auc_inputs = {
        "y_true": answers[target_col].to_numpy(),
        "y_score": submission[target_col].to_numpy(),
    }

    return roc_auc_inputs


def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
    roc_auc_inputs = prepare_for_auroc_metric(
        submission=submission, answers=answers, id_col="ID", target_col="ARF_LABEL"
    )
    return roc_auc_score(y_true=roc_auc_inputs["y_true"], y_score=roc_auc_inputs["y_score"])


if __name__ == "__main__":
    submission_path = "submission.csv"
    gt_submission_path = "submission_test.csv"
    submission = pd.read_csv(submission_path)
    answers = pd.read_csv(gt_submission_path)
    score = grade(submission=submission, answers=answers)

    print(
        json.dumps(
            {
                "competition_id": "arf-12-hours-prediction-task",
                "score": score,
            }
        )
    )
```
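RD-Agent runs `grade.py` with `submission.csv` and `submission_test.csv` in the working directory, but you can also smoke-test it by hand. The commands and the score shown below are illustrative only:

```
cd ds_data/eval/arf-12-hours-prediction-task
cp <path to a generated submission.csv> .
python grade.py
# {"competition_id": "arf-12-hours-prediction-task", "score": 0.87}
```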
At this point, you have created a complete dataset. The correct structure of the dataset should look like this:

```
ds_data
├── arf-12-hours-prediction-task
│   ├── train
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   ├── test
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   ├── description.md
│   ├── sample_submission.csv
│   └── sample.py
├── eval
│   └── arf-12-hours-prediction-task
│       ├── grade.py
│       ├── submission_test.csv
│       └── valid.py
└── source_data
    └── arf-12-hours-prediction-task
        ├── ARF_12h.csv
        ├── prepare.py
        └── X.npz
```

The above shows the complete dataset creation workflow. Some of the files are not required; in practice, you can customize the dataset according to your own needs.
If you don't need the test set scores, you can skip generating the formatted submission file and the standard answer file in the prepare code, and you don't need to write the data checking code or the scoring code.
The data sampling code is likewise optional: if you do not provide it, RD-Agent will delegate data sampling to the LLM at runtime.
In the default sampling method (`create_debug_data`), the default sampling ratio (parameter `min_frac`) is 1%; if 1% of the data amounts to fewer than 5 samples, then 5 samples are drawn (parameter `min_num`). You can adjust the sampling behavior through these two parameters.

If you have customized data sampling code, you need to set `DS_SAMPLE_DATA_BY_LLM` to `False` (default is `True`) in the `.env` file before running, so that the program uses your customized sampling code. You can do this with a single command:

```
dotenv set DS_SAMPLE_DATA_BY_LLM False
```

In addition, we provide a data sampling method in `rdagent.scenarios.data_science.debug.data.create_debug_data`, with the same defaults (`min_frac` of 1%, `min_num` of 5). You can use this method in either of the following two ways:

- Set `DS_SAMPLE_DATA_BY_LLM` to `False` in the `.env` file so that, when the program runs, it uses the sampling code provided by RD-Agent.

  ```
  dotenv set DS_SAMPLE_DATA_BY_LLM False
  ```

- If the default parameters of the sampling method provided by RD-Agent do not suit your needs, customize them in the following command and run it, then set `DS_SAMPLE_DATA_BY_LLM` to `False` in the `.env` file so that the program uses the sample data you generated.

  ```
  python rdagent/app/data_science/debug.py --dataset_path <dataset path> --competition <competition_name> --min_frac <sampling ratio> --min_num <minimum number of samples>
  dotenv set DS_SAMPLE_DATA_BY_LLM False
  ```

If you don't need scores on the test set and leave data sampling to the LLM, or if you use the sampling method provided by RD-Agent, you only need to prepare a minimal dataset. The structure of the simplest dataset is shown below.
```
ds_data
├── arf-12-hours-prediction-task
│   ├── train
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   ├── test
│   │   ├── ARF_12h.csv
│   │   └── X.npz
│   └── description.md
└── source_data
    └── arf-12-hours-prediction-task
        ├── ARF_12h.csv
        ├── prepare.py
        └── X.npz
```

We have prepared a dataset based on the above description for your reference. You can download it with the following command:

```
wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
```
Set up the Environment for Customized Datasets

```
dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
dotenv set DS_CODER_ON_WHOLE_PIPELINE True
```
More Environment Variables (Optional)
If you want to see all the available environment variables, you can refer to the configuration file for Data Science scenarios:
```python
from pathlib import Path
from typing import Literal

from pydantic_settings import SettingsConfigDict

from rdagent.app.kaggle.conf import KaggleBasePropSetting


class DataScienceBasePropSetting(KaggleBasePropSetting):
    # TODO: Kaggle Setting should be the subclass of DataScience
    model_config = SettingsConfigDict(env_prefix="DS_", protected_namespaces=())

    # Main components
    ## Scen
    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
    """
    Scenario class for data science tasks.
    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
    """

    planner: str = "rdagent.scenarios.data_science.proposal.exp_gen.planner.DSExpPlannerHandCraft"
    hypothesis_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.router.ParallelMultiTraceExpGen"
    interactor: str = "rdagent.components.interactor.SkipInteractor"
    trace_scheduler: str = "rdagent.scenarios.data_science.proposal.exp_gen.trace_scheduler.RoundRobinScheduler"
    """Hypothesis generation class"""

    summarizer: str = "rdagent.scenarios.data_science.dev.feedback.DSExperiment2Feedback"
    summarizer_init_kwargs: dict = {
        "version": "exp_feedback",
    }
    ## Workflow Related
    consecutive_errors: int = 5

    ## Coding Related
    coding_fail_reanalyze_threshold: int = 3

    debug_recommend_timeout: int = 600
    """The recommended time limit for running on debugging data"""
    debug_timeout: int = 600
    """The timeout limit for running on debugging data"""
    full_recommend_timeout: int = 3600
    """The recommended time limit for running on full data"""
    full_timeout: int = 3600
    """The timeout limit for running on full data"""

    #### model dump
    enable_model_dump: bool = False
    enable_doc_dev: bool = False
    model_dump_check_level: Literal["medium", "high"] = "medium"

    #### MCP documentation search integration
    enable_mcp_documentation_search: bool = False
    """Enable MCP documentation search for error resolution.
    Requires MCP_ENABLED=true and MCP_CONTEXT7_ENABLED=true in environment."""

    ### specific feature

    ### notebook integration
    enable_notebook_conversion: bool = False

    #### enable specification
    spec_enabled: bool = True

    #### proposal related
    # proposal_version: str = "v2" deprecated

    coder_on_whole_pipeline: bool = True
    max_trace_hist: int = 3

    coder_max_loop: int = 10
    runner_max_loop: int = 3

    sample_data_by_LLM: bool = True
    use_raw_description: bool = False
    show_nan_columns: bool = False

    ### knowledge base
    enable_knowledge_base: bool = False
    knowledge_base_version: str = "v1"
    knowledge_base_path: str | None = None
    idea_pool_json_path: str | None = None

    ### archive log folder after each loop
    enable_log_archive: bool = True
    log_archive_path: str | None = None
    log_archive_temp_path: str | None = (
        None  # This stores the intermediate tar file, since writing the tar file to local storage first and then copying to the target storage is preferred
    )

    #### Evaluation on Test related
    eval_sub_dir: str = "eval"  # TODO: fixme, this is not a good name
    """We'll use f"{DS_RD_SETTING.local_data_path}/{DS_RD_SETTING.eval_sub_dir}/{competition}"
    to find the script to evaluate the submission on test"""

    """---below are the settings for multi-trace---"""

    ### multi-trace related
    max_trace_num: int = 1
    """The maximum number of traces to grow before merging"""

    scheduler_temperature: float = 1.0
    """The temperature for the trace scheduler for softmax calculation, used in ProbabilisticScheduler"""

    # PUCT exploration constant for MCTSScheduler (ignored by other schedulers)
    scheduler_c_puct: float = 1.0
    """Exploration constant used by MCTSScheduler (PUCT)."""

    enable_score_reward: bool = False
    """Enable using score-based reward for trace selection in multi-trace scheduling."""

    #### multi-trace: checkpoint selector
    selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.expand.LatestCKPSelector"
    """The name of the selector to use"""
    sota_count_window: int = 5
    """The number of trials to consider for SOTA count"""
    sota_count_threshold: int = 1
    """The threshold for SOTA count"""

    #### multi-trace: SOTA experiment selector
    sota_exp_selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.submit.GlobalSOTASelector"
    """The name of the SOTA experiment selector to use"""

    ### multi-trace: inject optimals for multi-trace
    # inject diversity when starting a new sub-trace
    enable_inject_diverse: bool = False

    # inject diversity from other traces when starting a new sub-trace
    enable_cross_trace_diversity: bool = True
    """Enable cross-trace diversity injection when starting a new sub-trace.
    This is different from `enable_inject_diverse`, which is for non-parallel cases."""

    diversity_injection_strategy: str = (
        "rdagent.scenarios.data_science.proposal.exp_gen.diversity_strategy.InjectUntilSOTAGainedStrategy"
    )
    """The strategy to use for injecting diversity context."""

    # enable different versions of DSExpGen for multi-trace
    enable_multi_version_exp_gen: bool = False
    exp_gen_version_list: str = "v3,v2"

    #### multi-trace: time for final multi-trace merge
    merge_hours: float = 0
    """The time for merge"""

    #### multi-trace: max SOTA-retrieved number, used in AutoSOTAexpSelector
    # constrains the number of SOTA experiments to retrieve; otherwise, retrieving too many SOTA experiments would exceed the context window of the LLM
    max_sota_retrieved_num: int = 10
    """The maximum number of SOTA experiments to retrieve in a LLM call"""

    #### enable draft before first sota experiment
    enable_draft_before_first_sota: bool = False
    enable_planner: bool = False

    model_architecture_suggestion_time_percent: float = 0.75
    allow_longer_timeout: bool = False
    coder_enable_llm_decide_longer_timeout: bool = False
    runner_enable_llm_decide_longer_timeout: bool = False
    coder_longer_timeout_multiplier_upper: int = 3
    runner_longer_timeout_multiplier_upper: int = 2
    coder_timeout_increase_stage: float = 0.3
    runner_timeout_increase_stage: float = 0.3
    runner_timeout_increase_stage_patience: int = 2
    """Number of failures tolerated before escalating to the next timeout level (stage width). Every 'patience' failures, the timeout increases by 'runner_timeout_increase_stage'"""
    show_hard_limit: bool = True

    #### enable runner code change summary
    runner_enable_code_change_summary: bool = True

    ### Proposal workflow related

    #### Hypothesis Generate related
    enable_simple_hypothesis: bool = False
    """If true, generate simple hypotheses, no more than 2 sentences each."""

    enable_generate_unique_hypothesis: bool = False
    """Enable generating a unique hypothesis for each component."""

    enable_research_rag: bool = False
    """Enable research RAG for hypothesis generation."""

    #### hypothesis critique and rewrite
    enable_hypo_critique_rewrite: bool = False
    """Enable hypothesis critique and rewrite stages for improving hypothesis quality"""
    enable_scale_check: bool = False

    ##### select related
    ratio_merge_or_ensemble: int = 70
    """The ratio of merge or ensemble to be considered as a valid solution"""
    llm_select_hypothesis: bool = False
    """Whether to use the LLM to select hypotheses. If True, use LLM selection; if False, use the existing ranking method."""

    #### Task Generate related
    fix_seed_and_data_split: bool = False

    ensemble_time_upper_bound: bool = False

    user_interaction_wait_seconds: int = 6000  # seconds to wait for user interaction
    user_interaction_mid_folder: Path = Path.cwd() / "git_ignore_folder" / "RD-Agent_user_interaction"


DS_RD_SETTING = DataScienceBasePropSetting()

# enable_cross_trace_diversity and llm_select_hypothesis should not be true at the same time
assert not (
    DS_RD_SETTING.enable_cross_trace_diversity and DS_RD_SETTING.llm_select_hypothesis
), "enable_cross_trace_diversity and llm_select_hypothesis cannot be true at the same time"
```
These variables allow you to have finer-grained control in Data Science scenarios.
Run the Application
You can directly run the application by using the following command:

```
rdagent data_science --competition <Competition ID>
```
The following shows the command to run based on the `arf-12-hours-prediction-task` data:

```
rdagent data_science --competition arf-12-hours-prediction-task
```

More CLI parameters for the `rdagent data_science` command:

`rdagent.app.data_science.loop.main(path: str | None = None, checkout: bool = True, checkout_path: str | None = None, step_n: int | None = None, loop_n: int | None = None, timeout: str | None = None, competition='bms-molecular-translation', replace_timer=True, exp_gen_cls: str | None = None)`
Parameters
- `path`: A path like `$LOG_PATH/__session__/1/0_propose`. This indicates that we restore the state after finishing step 0 in loop 1.
- `checkout`: Used to control the log session path. Boolean type, default is `True`.
  - If `True`, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.
  - If `False`, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.
- `checkout_path`: If a `checkout_path` (or a `str` like a `Path`) is provided, the new loop will be saved to that path, leaving the original path unchanged.
- `step_n`: Number of steps to run; if `None`, the process will run indefinitely until an error or `KeyboardInterrupt` occurs.
- `loop_n`: Number of loops to run; if `None`, the process will run indefinitely until an error or `KeyboardInterrupt` occurs.
  - If the current loop is incomplete, it will be counted as the first loop for completion.
  - If both `step_n` and `loop_n` are provided, the process will stop as soon as either condition is met.
- `timeout`: Maximum duration to run the loop. Accepts a string format recognized by the internal timer. If `None`, the loop will run until completion, error, or `KeyboardInterrupt`.
- `competition`: Competition name.
- `replace_timer`: If a session is loaded, determines whether to replace the timer with `session.timer`.
- `exp_gen_cls`: When there are different stages, the `exp_gen` can be replaced with the new proposal.
Auto R&D evolving loop for models in a Kaggle scenario. You can continue running a session by using the command:

```
dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose --step_n 1  # `step_n` is an optional parameter
rdagent kaggle --competition playground-series-s4e8  # This command is recommended.
```

Visualize the R&D Process
We provide a web UI to visualize the log. You just need to run:

```
rdagent ui --port <custom port> --log-dir <your log folder, e.g. "log/"> --data_science True
```

Then you can input the log path and visualize the R&D process.
Scoring the test results

Finally, shut down the program and get the test set scores with this command:

```
dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
```

Here, `<url_to_log>` refers to the parent directory of the log folder generated during the run.
Kaggle Agent
Background
In the landscape of data science competitions, Kaggle serves as the ultimate arena where data enthusiasts harness the power of algorithms to tackle real-world challenges. The Kaggle Agent stands as a pivotal tool, empowering participants to seamlessly integrate cutting-edge models and datasets, transforming raw data into actionable insights.
By utilizing the Kaggle Agent, data scientists can craft innovative solutions that not only uncover hidden patterns but also drive significant advancements in predictive accuracy and model robustness.
Example Guide - Kaggle Dataset
Preparing for the Competition
Configuring the Kaggle API
Register and log in on the Kaggle website.
Click on the avatar (usually in the top right corner of the page) -> `Settings` -> `Create New Token`. A file called `kaggle.json` will be downloaded.

Move `kaggle.json` to `~/.config/kaggle/`.

Modify the permissions of the `kaggle.json` file:

```
chmod 600 ~/.config/kaggle/kaggle.json
```
For more information about Kaggle API settings, refer to the Kaggle API documentation.
Setting the environment variables in the .env file

Determine the path where the data will be stored and add it to the `.env` file.

```
mkdir -p <your local directory>/ds_data
dotenv set KG_LOCAL_DATA_PATH <your local directory>/ds_data
```
More Environment Variables (Optional)
If you want to see all the available environment variables, refer to the `DataScienceBasePropSetting` configuration file shown in the Example Guide - Customized Dataset section above.
These variables allow you to have finer-grained control in Data Science scenarios.
Join the competition
If your Kaggle API account has not joined a competition, you will need to join the competition before running the program.
At the bottom of the competition details page, you can find the `Join the competition` button; click it and select `I Understand and Accept` to join the competition. From the list of available competitions below, you can jump to each competition's details page.
Preparing the Competition Dataset & Setting up the RD-Agent Environment
As a subset of data science, Kaggle datasets still follow the data science format. Based on this, Kaggle datasets can be divided into two categories, depending on whether or not they are supported by MLE-Bench.
What is MLE-Bench?
MLE-Bench is a comprehensive benchmark designed to evaluate the machine learning engineering capabilities of AI systems using real-world scenarios. The dataset includes multiple Kaggle competitions. Since Kaggle does not provide reserved test sets for these competitions, the benchmark includes preparation scripts for splitting publicly available training data into new training and test sets, and scoring scripts for each competition to accurately evaluate submission scores.
Is the competition I'm running supported by MLE-Bench?
You can see all the competitions supported by MLE-Bench here.
Prepare datasets for MLE-Bench supported competitions.

If you agree with the MLE-Bench standard, you don't need to prepare the dataset yourself; you just need to configure your `.env` file to automate the dataset download.

- Add `DS_IF_USING_MLE_DATA` to the environment variables and set it to `True`.

  ```
  dotenv set DS_IF_USING_MLE_DATA True
  ```

- Add `DS_SAMPLE_DATA_BY_LLM` to the environment variables and set it to `True`.

  ```
  dotenv set DS_SAMPLE_DATA_BY_LLM True
  ```

- Add `DS_SCEN` to the environment variables and set it to `rdagent.scenarios.data_science.scen.KaggleScen`.

  ```
  dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
  ```
At this point, you are ready to start running your competition. The run will automatically download the data, and the LLM will automatically extract a minimal dataset.
After running the program, the structure of the `ds_data` folder should look like this (using the `tabular-playground-series-dec-2021` contest as an example):

```
ds_data
├── tabular-playground-series-dec-2021
│   ├── description.md
│   ├── sample_submission.csv
│   ├── test.csv
│   └── train.csv
└── zip_files
    └── tabular-playground-series-dec-2021
        └── tabular-playground-series-dec-2021.zip
```

The `ds_data/zip_files` folder contains a zip file of the raw competition data downloaded from the Kaggle website.
At runtime, RD-Agent will automatically build the Docker image specified at rdagent/scenarios/kaggle/docker/mle_bench_docker/Dockerfile. This image is responsible for downloading the required datasets and grading files for MLE-Bench.
Note: The first run may take longer than subsequent runs as the Docker image and data are being downloaded and set up for the first time.
Prepare datasets for competitions that are not supported by MLE-Bench.
Since Kaggle datasets are a subset of data science datasets, we can follow the data science dataset format and steps to prepare them. Below we describe the workflow for preparing a Kaggle dataset, using the competition `playground-series-s4e9` as an example.

Create a `ds_data/source_data/playground-series-s4e9` folder, which will be used to store your raw dataset.

The raw data for the competition `playground-series-s4e9` consists of three files: `train.csv`, `test.csv`, and `sample_submission.csv`. There are two ways to get the raw data:

- You can find the raw data required for the competition on the official Kaggle website.
- Or you can download it from the command line:

  ```
  kaggle competitions download -c playground-series-s4e9
  ```
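The Kaggle CLI saves a zip archive in the current directory; it still needs to be extracted into the source-data folder. A typical follow-up command (the path is illustrative):

```
unzip playground-series-s4e9.zip -d ds_data/source_data/playground-series-s4e9
```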
Create a `ds_data/source_data/playground-series-s4e9/prepare.py` file that splits your raw data into training data, test data, a formatted submission file, and a standard answer file. (You will need to write this script based on your own raw data.)

The following shows the preprocessing code for the raw data of `playground-series-s4e9`.

ds_data/source_data/playground-series-s4e9/prepare.py:

```python
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def prepare(raw: Path, public: Path, private: Path):

    # Ensure the output folders exist
    public.mkdir(parents=True, exist_ok=True)
    private.mkdir(parents=True, exist_ok=True)

    # Create train and test splits from the original train set
    old_train = pd.read_csv(raw / "train.csv")
    new_train, new_test = train_test_split(old_train, test_size=0.1, random_state=0)

    # Create sample submission
    sample_submission = new_test.copy()
    sample_submission["price"] = 43878.016
    sample_submission.drop(sample_submission.columns.difference(["id", "price"]), axis=1, inplace=True)
    sample_submission.to_csv(public / "sample_submission.csv", index=False)

    # Create private files
    new_test.to_csv(private / "submission_test.csv", index=False)

    # Create public files visible to agents
    new_train.to_csv(public / "train.csv", index=False)
    new_test.drop(["price"], axis=1, inplace=True)
    new_test.to_csv(public / "test.csv", index=False)

    # Checks
    assert new_test.shape[1] == 12, "Public test set should have 12 columns"
    assert new_train.shape[1] == 13, "Public train set should have 13 columns"
    assert len(new_train) + len(new_test) == len(
        old_train
    ), "Length of new_train and new_test should equal length of old_train"


if __name__ == "__main__":
    competitions = "playground-series-s4e9"
    raw = Path(__file__).resolve().parent
    prepare(
        raw=raw,
        public=raw.parent.parent / competitions,
        private=raw.parent.parent / "eval" / competitions,
    )
```
At the end of program execution, the `ds_data` folder structure will look like this:

```
ds_data
├── playground-series-s4e9
│   ├── train.csv
│   ├── test.csv
│   └── sample_submission.csv
├── eval
│   └── playground-series-s4e9
│       └── submission_test.csv
└── source_data
    └── playground-series-s4e9
        ├── prepare.py
        ├── sample_submission.csv
        ├── test.csv
        └── train.csv
```
Create a `ds_data/playground-series-s4e9/description.md` file to describe your competition, its dataset, and other information. You can find the competition description and the dataset description on the Kaggle website.

The following shows the description file for `playground-series-s4e9`.
ds_data/playground-series-s4e9/description.md:

````markdown
# Competition name: playground-series-s4e9

## Overview

**Welcome to the 2024 Kaggle Playground Series!** We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

**Your Goal:** The goal of this competition is to predict the price of used cars based on various attributes.

## Evaluation

### Root Mean Squared Error (RMSE)

Submissions are scored on the root mean squared error. RMSE is defined as:

$$
\mathrm{RMSE} = \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \right)^{\frac{1}{2}}
$$

where $\hat{y}_i$ is the predicted value and $y_i$ is the original value for each instance $i$.

### Submission File

For each `id` in the test set, you must predict the `price` of the car. The file should contain a header and have the following format:

```
id,price
188533,43878.016
188534,43878.016
188535,43878.016
etc.
```

## Timeline

- **Start Date** - September 1, 2024
- **Entry Deadline** - Same as the Final Submission Deadline
- **Team Merger Deadline** - Same as the Final Submission Deadline
- **Final Submission Deadline** - September 30, 2024

All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

## About the Tabular Playground Series

The goal of the Tabular Playground Series is to provide the Kaggle community with a variety of fairly light-weight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The duration of each competition will generally only last a few weeks, and may have longer or shorter durations depending on the challenge. The challenges will generally use fairly light-weight datasets that are synthetically generated from real-world data, and will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

### Synthetically-Generated Datasets

Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state-of-the-art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

## Prizes

- 1st Place - Choice of Kaggle merchandise
- 2nd Place - Choice of Kaggle merchandise
- 3rd Place - Choice of Kaggle merchandise

**Please note**: In order to encourage more participation from beginners, Kaggle merchandise will only be awarded once per person in this series. If a person has previously won, we'll skip to the next team.

## Citation

Walter Reade and Ashley Chow. Regression of Used Car Prices. https://kaggle.com/competitions/playground-series-s4e9, 2024. Kaggle.

## Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the [Used Car Price Prediction Dataset](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset). Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

## Files

- **train.csv** - the training dataset; `price` is the continuous target
- **test.csv** - the test dataset; your objective is to predict the value of `price` for each row
- **sample_submission.csv** - a sample submission file in the correct format
````
Create a `ds_data/eval/playground-series-s4e9/valid.py` file, which is used to check the validity of the submission file and ensure that its format is consistent with the reference file.

The following shows a script that checks the validity of a submission based on the `playground-series-s4e9` data.

ds_data/eval/playground-series-s4e9/valid.py:

```python
from pathlib import Path

# Check if our submission file exists
assert Path("submission.csv").exists(), "Error: submission.csv not found"

submission_lines = Path("submission.csv").read_text().splitlines()  # automatically generated
test_lines = Path("submission_test.csv").read_text().splitlines()  # test.csv

is_valid = len(submission_lines) == len(test_lines)

if is_valid:
    message = "submission.csv and submission_test.csv have the same number of lines."
else:
    message = (
        f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
    )

print(message)

if not is_valid:
    raise AssertionError("Submission is invalid")
```
Create a `ds_data/eval/playground-series-s4e9/grade.py` file, which is used to calculate the score from the submission file and the standard answer file, and to output the result in JSON format.

The following shows a grading script based on the `playground-series-s4e9` data implementation.

ds_data/eval/playground-series-s4e9/grade.py:

```python
import datetime
import json

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error


class InvalidSubmissionError(Exception):
    """
    A custom exception for when the agent submission cannot be graded.
    """

    pass


def prepare_for_metric(submission: pd.DataFrame, answers: pd.DataFrame) -> dict:

    if "id" not in submission.columns or "id" not in answers.columns:
        raise InvalidSubmissionError("Both submission and answers DataFrames must contain an 'id' column.")

    if "price" not in submission.columns:
        raise InvalidSubmissionError("Submission DataFrame must contain a 'price' column.")

    assert "price" in answers.columns, "Answers DataFrame must contain a 'price' column."

    if len(submission) != len(answers):
        raise InvalidSubmissionError("Submission must be the same length as the answers.")

    answers_sorted = answers.sort_values("id")
    submission_sorted = submission.sort_values("id")

    if (submission_sorted["id"].values != answers_sorted["id"].values).any():
        raise InvalidSubmissionError("Submission and answers have mismatched 'id' columns")

    y_true = answers_sorted[["price"]].to_numpy()
    y_score = submission_sorted[["price"]].to_numpy()

    return {"y_true": y_true, "y_score": y_score}


def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
    metric_inputs = prepare_for_metric(submission, answers)
    return np.sqrt(mean_squared_error(metric_inputs["y_true"], metric_inputs["y_score"]))


if __name__ == "__main__":
    submission_path = "submission.csv"
    gt_submission_path = "submission_test.csv"
    submission = pd.read_csv(submission_path)
    answers = pd.read_csv(gt_submission_path)
    score = grade(submission=submission, answers=answers)

    # The `thresholds` can be customized according to the leaderboard page of the Kaggle website and your own needs.
    # Refs: https://www.kaggle.com/competitions/playground-series-s4e9/leaderboard
    thresholds = {
        "gold": 62917.05988,
        "silver": 62945.91714,
        "bronze": 62958.13747,
        "median": 63028.69429,
    }

    # The output must be in JSON format. If you configure the full output, you can run
    # `rdagent grade_summary --log-folder` to summarize the scores at the end of the program.
    # If you don't need it, you can just provide `competition_id` and `score`.
    # Note: RMSE is a lower-is-better metric, so medals are awarded for scores at or below a threshold.
    print(
        json.dumps(
            {
                "competition_id": "playground-series-s4e9",
                "score": score,
                "gold_threshold": thresholds["gold"],
                "silver_threshold": thresholds["silver"],
                "bronze_threshold": thresholds["bronze"],
                "median_threshold": thresholds["median"],
                "any_medal": bool(score <= thresholds["bronze"]),
                "gold_medal": bool(score <= thresholds["gold"]),
                "silver_medal": bool(score <= thresholds["silver"]),
                "bronze_medal": bool(score <= thresholds["bronze"]),
                "above_median": bool(score <= thresholds["median"]),
                "submission_exists": True,
                "valid_submission": True,
                "is_lower_better": True,
                "created_at": str(datetime.datetime.now().isoformat()),
                "submission_path": submission_path,
            }
        )
    )
```
In this example we don't create a `ds_data/eval/playground-series-s4e9/sample.py`; we use the sampling method provided by RD-Agent by default.

At this point, you have created a complete dataset. The correct structure of the dataset should look like this:

```
ds_data
├── playground-series-s4e9
│   ├── train.csv
│   ├── test.csv
│   ├── description.md
│   └── sample_submission.csv
├── eval
│   └── playground-series-s4e9
│       ├── grade.py
│       ├── submission_test.csv
│       └── valid.py
└── source_data
    └── playground-series-s4e9
        ├── prepare.py
        ├── sample_submission.csv
        ├── test.csv
        └── train.csv
```

We have prepared a dataset based on the above description for your reference. You can download it with the following command:
```
wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/playground-series-s4e9.zip
```

Next, we need to configure the environment for the `playground-series-s4e9` contest. You can do this by executing the following commands:

```
dotenv set DS_IF_USING_MLE_DATA False
dotenv set DS_SAMPLE_DATA_BY_LLM False
dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
```
Run the Application
You can directly run the application by using the following command:

```
rdagent data_science --competition <Competition ID>
```
The following shows the command to run based on the `playground-series-s4e9` data:

```
rdagent data_science --competition playground-series-s4e9
```

The CLI parameters for the `rdagent data_science` command are the same as those documented in the Example Guide - Customized Dataset section above.
Visualize the R&D Process
We provide a web UI to visualize the log. You just need to run:

```
rdagent ui --port <custom port> --log-dir <your log folder, e.g. "log/"> --data_science True
```

Then you can input the log path and visualize the R&D process.
Scoring the test results

Finally, shut down the program and get the test set scores with this command:

```
dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
```
If you have configured the full output in `ds_data/eval/playground-series-s4e9/grade.py`, or if you are running a competition supported by MLE-Bench, you can also summarize the scores by running the following command:

```
rdagent grade_summary --log-folder=<url_to_log>
```

Here, `<url_to_log>` refers to the parent directory of the log folder generated during the run.