Data Science Agent

🤖 Automated Feature Engineering & Model Tuning Evolution

The Data Science Agent automatically performs feature engineering and model tuning. It can be applied to a wide range of data science problems, such as image classification, time series forecasting, and text classification.

🌟 Introduction

In this scenario, the automated system proposes hypotheses, chooses actions, implements code, runs validation, and incorporates feedback in a continuous, iterative loop.

The goal is to automatically optimize performance metrics on the validation set or the Kaggle leaderboard, ultimately discovering effective features and models through autonomous research and development.

Here is an outline of the steps:

Step 1: Hypothesis Generation 🔍

  • Generate and propose initial hypotheses based on previous experiment analysis and domain expertise, with thorough reasoning and supporting evidence.

Step 2: Experiment Creation ✨

  • Transform the hypothesis into a task.

  • Choose a specific action within feature engineering or model tuning.

  • Develop, define, and implement a new feature or model, including its name, description, and formulation.

Step 3: Model/Feature Implementation 👨‍💻

  • Implement the model code based on the detailed description.

  • Evolve the model iteratively as a developer would, ensuring accuracy and efficiency.

Step 4: Validation on Test Set or Kaggle 📉

  • Validate the newly developed model using the test set or Kaggle dataset.

  • Assess the model's effectiveness and performance based on the validation results.

Step 5: Feedback Analysis 🔍

  • Analyze validation results to assess performance.

  • Use insights to refine hypotheses and enhance the model.

Step 6: Hypothesis Refinement ♻️

  • Adjust hypotheses based on validation feedback.

  • Iterate the process to continuously improve the model.

📖 Data Science Background

In the evolving landscape of artificial intelligence, Data Science represents a powerful paradigm where machines engage in autonomous exploration, hypothesis testing, and model development across diverse domains, from healthcare and finance to logistics and research.

The Data Science Agent stands as a central engine in this transformation, enabling users to automate the entire machine learning workflow: from hypothesis generation to code implementation, validation, and refinement, all guided by performance feedback.

By leveraging the Data Science Agent, researchers and developers can accelerate experimentation cycles. Whether fine-tuning custom models or competing in high-stakes benchmarks like Kaggle, the Data Science Agent unlocks new frontiers in intelligent, self-directed discovery.

🧭 Example Guide - Customized dataset

🔧 Set up RD-Agent Environment

  • Before you start, make sure you have installed RD-Agent and configured its environment correctly. If you want to know how to install and configure RD-Agent, please refer to the documentation.

  • 🔩 Set the environment variables in the .env file

    • Determine the path where the data will be stored and add it to the .env file.

    dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
    dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
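
    After running these commands, your .env file should contain entries equivalent to the following (the path is a placeholder for your own directory, and quoting may differ depending on your dotenv version):

      DS_LOCAL_DATA_PATH=<your local directory>/ds_data
      DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen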
    

📥 Prepare Customized datasets

  • A data science competition dataset usually consists of two parts: the competition dataset and the evaluation dataset. (We provide a customized sample dataset named arf-12-hours-prediction-task as a reference.)

    • The competition dataset contains the training data, test data, description file, formatted submission file, and data sampling code.

    • The evaluation dataset contains the standard answer file, the data checking code, and the score calculation code.

  • We use the arf-12-hours-prediction-task data as an example to walk through the preparation workflow for the competition dataset.

    • Create a ds_data/source_data/arf-12-hours-prediction-task folder, which will be used to store your raw dataset.

      • The raw data for the arf-12-hours-prediction-task competition consists of two files: ARF_12h.csv and X.npz.

    • Create a ds_data/source_data/arf-12-hours-prediction-task/prepare.py file that splits your raw data into training data, test data, formatted submission file, and standard answer file. (You will need to write a script based on your raw data.)

      • The following shows the preprocessing code for the raw data of arf-12-hours-prediction-task.

      ds_data/source_data/arf-12-hours-prediction-task/prepare.py
       1import random
       2from pathlib import Path
       3
       4import numpy as np
       5import pandas as pd
       6import sparse
       7
       8CURRENT_DIR = Path(__file__).resolve().parent
       9ROOT_DIR = CURRENT_DIR.parent.parent
      10
      11raw_feature_path = CURRENT_DIR / "X.npz"
      12raw_label_path = CURRENT_DIR / "ARF_12h.csv"
      13
      14public = ROOT_DIR / "arf-12-hours-prediction-task"
      15private = ROOT_DIR / "eval" / "arf-12-hours-prediction-task"
      16
      17if not (public / "test").exists():
      18    (public / "test").mkdir(parents=True, exist_ok=True)
      19
      20if not (public / "train").exists():
      21    (public / "train").mkdir(parents=True, exist_ok=True)
      22
      23if not private.exists():
      24    private.mkdir(parents=True, exist_ok=True)
      25
      26SEED = 42
      27random.seed(SEED)
      28np.random.seed(SEED)
      29
      30X_sparse = sparse.load_npz(raw_feature_path)  # COO matrix, shape: [N, D, T]
      31df_label = pd.read_csv(raw_label_path)  # Contains column 'ARF_LABEL'
      32N = X_sparse.shape[0]
      33
      34indices = np.arange(N)
      35np.random.shuffle(indices)
      36split = int(0.7 * N)
      37train_idx, test_idx = indices[:split], indices[split:]
      38
      39X_train = X_sparse[train_idx]
      40X_test = X_sparse[test_idx]
      41
      42df_train = df_label.iloc[train_idx].reset_index(drop=True)
      43df_test = df_label.iloc[test_idx].reset_index(drop=True)
      44
      45submission_df = df_test.copy()
      46submission_df["ARF_LABEL"] = 0
      47submission_df.drop(submission_df.columns.difference(["ID", "ARF_LABEL"]), axis=1, inplace=True)
      48submission_df.to_csv(public / "sample_submission.csv", index=False)
      49
      50df_test.to_csv(private / "submission_test.csv", index=False)
      51
      52df_test.drop(["ARF_LABEL"], axis=1, inplace=True)
      53df_test.to_csv(public / "test" / "ARF_12h.csv", index=False)
      54sparse.save_npz(public / "test" / "X.npz", X_test)
      55
      56sparse.save_npz(public / "train" / "X.npz", X_train)
      57df_train.to_csv(public / "train" / "ARF_12h.csv", index=False)
      58
      59assert (
      60    X_train.shape[0] == df_train.shape[0]
      61), f"Mismatch: X_train rows ({X_train.shape[0]}) != df_train rows ({df_train.shape[0]})"
      62assert (
      63    X_test.shape[0] == df_test.shape[0]
      64), f"Mismatch: X_test rows ({X_test.shape[0]}) != df_test rows ({df_test.shape[0]})"
      65assert df_test.shape[1] == 2, "Public test set should have 2 columns"
      66assert df_train.shape[1] == 3, "Public train set should have 3 columns"
      67assert len(df_train) + len(df_test) == len(
      68    df_label
      69), "Length of df_train and df_test should equal length of df_label"
      
      • At the end of program execution, the ds_data folder structure will look like this:

      ds_data
      ├── arf-12-hours-prediction-task
      │   ├── train
      │   │   ├── ARF_12h.csv
      │   │   └── X.npz
      │   ├── test
      │   │   ├── ARF_12h.csv
      │   │   └── X.npz
      │   └── sample_submission.csv
      ├── eval
      │   └── arf-12-hours-prediction-task
      │       └── submission_test.csv
      └── source_data
          └── arf-12-hours-prediction-task
              ├── ARF_12h.csv
              ├── prepare.py
              └── X.npz
      
    • Create a ds_data/arf-12-hours-prediction-task/description.md file to describe your competition, its objective, the dataset, and other information.

      • The following shows the description file for arf-12-hours-prediction-task

      ds_data/arf-12-hours-prediction-task/description.md
       1# Competition name: ARF 12-Hour Prediction Task
       2
       3## Overview
       4
       5### Description
       6
       7Acute Respiratory Failure (ARF) is a life-threatening condition that often develops rapidly in critically ill patients. Accurate early prediction of ARF is crucial in intensive care units (ICUs) to enable timely clinical interventions and resource allocation. In this task, you are asked to build a machine learning model that predicts whether a patient will develop ARF within the next **12 hours**, based on multivariate clinical time series data.
       8
       9The dataset is extracted from electronic health records (EHRs) and preprocessed using the **FIDDLE** pipeline to generate structured temporal features for each patient.
      10
      11### Objective
      12
      13**Your Goal** is to develop a binary classification model that takes a 12-hour time series as input and predicts whether ARF will occur (1) or not (0) in the following 12 hours.
      14
      15---
      16
      17## Data Description
      18
      191. train/ARF_12h.csv: A CSV file containing the ICU stay ID, the hour of ARF onset, and the binary label indicating whether ARF will occur in the next 12 hours.
      20
      21    * Columns: ID, ARF_ONSET_HOUR, ARF_LABEL
      22
      232. train/X.npz: N × T × D sparse tensor containing time-dependent features.
      24
      25    * N: Number of samples (number of ICU stays) 
      26    * T: Time step (12 hours of records per sample)
      27    * D: Dynamic feature dimension (how many features per hour) 
      28
      293. test/ARF_12h.csv: Ground truth labels (used for evaluation only).
      30
      314. test/X.npz: Test feature set in the same format as training data.
      32
      33---
      34
      35## Data usage Notes
      36
      37To load the features, you need python and the sparse package.
      38
      39import sparse
      40
      41X = sparse.load_npz("<url>/X.npz").todense()
      42
      43
      44To load the labels, use pandas or an alternative csv reader.
      45
      46import pandas as pd
      47
      48df = pd.read_csv("<url>/ARF_12h.csv")
      49
      50
      51---
      52
      53## Modeling
      54
      55Each sample is a 12-hour multivariate time series of ICU patient observations, represented as a tensor of shape (12, D).
      56The goal is to predict whether the patient will develop ARF (1) or not (0) in the following 12 hours.
      57
      58* **Input**: 12 × D matrix of clinical features
      59* **Output**: Binary prediction: 0 (no ARF) or 1 (ARF onset)
      60* **Loss Function**: BCEWithLogitsLoss, CrossEntropyLoss or equivalent
      61* **Evaluation Metric**: **AUROC** (Area Under the Receiver Operating Characteristic Curve)
      62
      63Note: Although the output is binary, AUROC evaluates the ranking quality of predicted scores. Therefore, your model should output a confidence score during training, which is then thresholded to produce 0 or 1 for final submission.
      64
      65---
      66
      67## Evaluation
      68
      69### Area Under the Receiver Operating Characteristic curve (AUROC)
      70
      71The submissions are scored according to the area under the receiver operating characteristic curve. AUROC is defined as:
      72
      73$$
      74\text{AUROC} = \frac{1}{|P| \cdot |N|} \sum_{i \in P} \sum_{j \in N} \left[ \mathbb{1}(s_i > s_j) + \frac{1}{2} \cdot \mathbb{1}(s_i = s_j) \right]
      75$$
      76
      77AUROC reflects the model's ability to rank positive samples higher than negative ones. A score of 1.0 means perfect discrimination, and 0.5 means random guessing.
      78
      79### Submission Format
      80
      81For each `ID` in the ARF_12h.csv file of the test dataset, you must predict whether ARF will occur (label = 1) or not (label = 0) in the following 12 hours (`ARF_LABEL`), based on X.npz (the sparse tensor of time-varying features). The file should have the following format:
      82
      83ID,ARF_LABEL
      84246505,0
      85291335,0
      86286713,0
      87etc.
      88
      89
      90Note: Although the submission is binary, AUROC evaluates the ranking quality of your model. It is recommended to output probabilities during training and apply a threshold (e.g., 0.5) to convert to binary labels for submission.
      91
      92---
      
    • Create a ds_data/arf-12-hours-prediction-task/sample.py file to construct the debugging sample data.

      • The following script constructs the debugging sample data for the arf-12-hours-prediction-task dataset.

      ds_data/arf-12-hours-prediction-task/sample.py
       1import shutil
       2from pathlib import Path
       3
       4import numpy as np
       5import pandas as pd
       6import sparse
       7from tqdm import tqdm
       8
       9
      10def sample_and_copy_subfolder(
      11    input_dir: Path,
      12    output_dir: Path,
      13    min_frac: float,
      14    min_num: int,
      15    seed: int = 42,
      16):
      17    np.random.seed(seed)
      18
      19    feature_path = input_dir / "X.npz"
      20    label_path = input_dir / "ARF_12h.csv"
      21
      22    # Load sparse features and label
      23    X_sparse = sparse.load_npz(feature_path)
      24    df_label = pd.read_csv(label_path)
      25
      26    N = X_sparse.shape[0]
      27    n_keep = max(int(N * min_frac), min_num)
      28    idx = np.random.choice(N, n_keep, replace=False)
      29
      30    X_sample = X_sparse[idx]
      31    df_sample = df_label.iloc[idx].reset_index(drop=True)
      32
      33    output_dir.mkdir(parents=True, exist_ok=True)
      34    sparse.save_npz(output_dir / "X.npz", X_sample)
      35    df_sample.to_csv(output_dir / "ARF_12h.csv", index=False)
      36
      37    print(f"[INFO] Sampled {n_keep} of {N} from {input_dir.name}")
      38
      39    # Copy additional files
      40    for f in input_dir.glob("*"):
      41        if f.name not in {"X.npz", "ARF_12h.csv"} and f.is_file():
      42            shutil.copy(f, output_dir / f.name)
      43            print(f"[COPY] Extra file: {f.name}")
      44
      45
      46def copy_other_file(source: Path, target: Path):
      47    for item in source.iterdir():
      48        if item.name in {"train", "test"}:
      49            continue
      50
      51        relative_path = item.relative_to(source)
      52        target_path = target / relative_path
      53
      54        if item.is_dir():
      55            shutil.copytree(item, target_path, dirs_exist_ok=True)
      56            print(f"[COPY DIR] {item} -> {target_path}")
      57        elif item.is_file():
      58            target_path.parent.mkdir(parents=True, exist_ok=True)
      59            shutil.copy2(item, target_path)
      60            print(f"[COPY FILE] {item} -> {target_path}")
      61
      62
      63def create_debug_data(
      64    dataset_path: str,
      65    output_path: str,
      66    min_frac: float = 0.02,
      67    min_num: int = 10,
      68):
      69    dataset_root = Path(dataset_path) / "arf-12-hours-prediction-task"
      70    output_root = Path(output_path)
      71
      72    for sub in ["train", "test"]:
      73        input_dir = dataset_root / sub
      74        output_dir = output_root / sub
      75        print(f"\n[PROCESS] {sub} subset")
      76        sample_and_copy_subfolder(
      77            input_dir=input_dir,
      78            output_dir=output_dir,
      79            min_frac=min_frac,
      80            min_num=min_num,
      81            seed=42 if sub == "train" else 123,
      82        )
      83    print(dataset_root.resolve())
      84    print(output_root.resolve())
      85    copy_other_file(source=dataset_root, target=output_root)
      86
      87    print(f"\n[INFO] Sampling complete → Output in: {output_root}")
      88
      89
      90if __name__ == "__main__" or globals().get("__name__") == "<run_path>":
      91    dataset_path = globals().get("dataset_path", "./")
      92    output_path = globals().get("output_path", "./sample")
      93    create_debug_data(
      94        dataset_path=dataset_path,
      95        output_path=output_path,
      96        min_frac=0.02,
      97        min_num=10,
      98    )
      
    • Create a ds_data/eval/arf-12-hours-prediction-task/valid.py file, which checks that the submission file's format is consistent with the reference file.

      • The following shows a script that checks the validity of a submission based on the arf-12-hours-prediction-task data.

      ds_data/eval/arf-12-hours-prediction-task/valid.py
       1from pathlib import Path
       2
       3# Check if our submission file exists
       4assert Path("submission.csv").exists(), "Error: submission.csv not found"
       5
       6submission_lines = Path("submission.csv").read_text().splitlines()
       7test_lines = Path("submission_test.csv").read_text().splitlines()
       8
       9is_valid = len(submission_lines) == len(test_lines)
      10
      11if is_valid:
      12    message = "submission.csv and submission_test.csv have the same number of lines."
      13else:
      14    message = (
      15        f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
      16    )
      17
      18print(message)
      19
      20if not is_valid:
      21    raise AssertionError("Submission is invalid")
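
      The script above only checks that the number of lines matches. If you want a stricter check, a sketch like the following (assuming the submission keeps the ID and ARF_LABEL columns) could also verify the header and the set of IDs:

        import pandas as pd

        sub = pd.read_csv("submission.csv")
        ref = pd.read_csv("submission_test.csv")

        # Same columns as the sample submission
        assert list(sub.columns) == ["ID", "ARF_LABEL"], "submission.csv must have columns ID, ARF_LABEL"

        # Same set of test IDs as the reference file
        assert set(sub["ID"]) == set(ref["ID"]), "submission.csv must contain exactly the test IDs"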
      
    • Create a ds_data/eval/arf-12-hours-prediction-task/grade.py file, which calculates the score from the submission file and the standard answer file and outputs the result in JSON format.

      • The following shows a grading script for the arf-12-hours-prediction-task data.

      ds_data/eval/arf-12-hours-prediction-task/grade.py
       1import json
       2
       3import pandas as pd
       4from sklearn.metrics import roc_auc_score
       5
       6
       7def prepare_for_auroc_metric(submission: pd.DataFrame, answers: pd.DataFrame, id_col: str, target_col: str) -> dict:
       8
       9    # Answers checks
      10    assert id_col in answers.columns, f"answers dataframe should have an {id_col} column"
      11    assert target_col in answers.columns, f"answers dataframe should have a {target_col} column"
      12
      13    # Submission checks
      14    if id_col not in submission.columns:
      15        raise InvalidSubmissionError(f"Submission should have an {id_col} column")
      16    if target_col not in submission.columns:
      17        raise InvalidSubmissionError(f"Submission should have a {target_col} column")
      18    if len(submission) != len(answers):
      19        raise InvalidSubmissionError(f"Submission and answers should have the same number of rows")
      20    try:
      21        pd.to_numeric(submission[target_col])
      22    except ValueError:
      23        raise InvalidSubmissionError(
      24            f"Expected {target_col} column to be numeric, got {submission[target_col].dtype} instead"
      25        )
      26    if submission[target_col].min() < 0 or submission[target_col].max() > 1:
      27        raise InvalidSubmissionError(
      28            f"Submission {target_col} column should contain probabilities,"
      29            " and therefore contain values between 0 and 1 inclusive"
      30        )
      31    # Sort
      32    submission = submission.sort_values(id_col)
      33    answers = answers.sort_values(id_col)
      34
      35    if (submission[id_col].values != answers[id_col].values).any():
      36        raise InvalidSubmissionError(f"Submission and answers should have the same {id_col} values")
      37
      38    roc_auc_inputs = {
      39        "y_true": answers[target_col].to_numpy(),
      40        "y_score": submission[target_col].to_numpy(),
      41    }
      42
      43    return roc_auc_inputs
      44
      45
      46def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
      47    roc_auc_inputs = prepare_for_auroc_metric(
      48        submission=submission, answers=answers, id_col="ID", target_col="ARF_LABEL"
      49    )
      50    return roc_auc_score(y_true=roc_auc_inputs["y_true"], y_score=roc_auc_inputs["y_score"])
      51
      52
      53if __name__ == "__main__":
      54    submission_path = "submission.csv"
      55    gt_submission_path = "submission_test.csv"
      56    submission = pd.read_csv(submission_path)
      57    answers = pd.read_csv(gt_submission_path)
      58    score = grade(submission=submission, answers=answers)
      59
      60    print(
      61        json.dumps(
      62            {
      63                "competition_id": "arf-12-hours-prediction-task",
      64                "score": score,
      65            }
      66        )
      67    )
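
      Note that grade.py raises InvalidSubmissionError, which is neither imported nor defined in the snippet above. If you run the script standalone, you need to provide a definition for it; a minimal one could look like this:

        class InvalidSubmissionError(Exception):
            """Raised when the submission file does not match the expected format."""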
      
  • At this point, you have created a complete dataset. Its structure should look like this:

    ds_data
    ├── arf-12-hours-prediction-task
    │   ├── train
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── test
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── description.md
    │   ├── sample_submission.csv
    │   └── sample.py
    ├── eval
    │   └── arf-12-hours-prediction-task
    │       ├── grade.py
    │       ├── submission_test.csv
    │       └── valid.py
    └── source_data
        └── arf-12-hours-prediction-task
            ├── ARF_12h.csv
            ├── prepare.py
            └── X.npz
    
  • The above shows the complete dataset creation workflow. Not all of the files are required; in practice, you can customize the dataset according to your own needs.

    • If we don't need the test set scores, we can skip generating the formatted submission file and the standard answer file in the prepare script, and we don't need to write the data checking code (valid.py) or the score calculation code (grade.py).

    • The data sampling code is also optional; if you do not provide it, RD-Agent hands sampling over to the LLM at runtime.

      • In the default sampling method (create_debug_data), the default sampling ratio (parameter: min_frac) is 1%; if 1% of the data is fewer than 5 samples, 5 samples are kept (parameter: min_num). You can adjust the amount of sampled data through these two parameters.

        • If you have customized data sampling code, set DS_SAMPLE_DATA_BY_LLM to False (the default is True) in the .env file before running, so that the program uses your customized sampling code. You can do this with the following command:

          dotenv set DS_SAMPLE_DATA_BY_LLM False
          
        • In addition, we provide a data sampling method in rdagent.scenarios.data_science.debug.data.create_debug_data. In this method, the default sampling ratio (parameter: min_frac) is 1%; if 1% of the data is fewer than 5 samples, 5 samples are kept (parameter: min_num). You can use this method in either of the following two ways.

          • You can set DS_SAMPLE_DATA_BY_LLM to False in the .env file so that when the program runs, it will use the sampling code provided by RD-Agent.

            dotenv set DS_SAMPLE_DATA_BY_LLM False
            
          • If the parameters of the default sampling method provided by RD-Agent are not suitable, you can customize them in the following command and run it, then set DS_SAMPLE_DATA_BY_LLM to False in the .env file so that the program uses the sample data you generated.

            python rdagent/app/data_science/debug.py --dataset_path <dataset path> --competition <competition_name> --min_frac <sampling ratio> --min_num <minimum number of sampling>
            dotenv set DS_SAMPLE_DATA_BY_LLM False
            
  • If you don't need the scores from the test set and leave the data sampling to the LLM (or use the sampling method provided by RD-Agent), you only need to prepare a minimal dataset. The structure of the simplest dataset is shown below.

    ds_data
    ├── arf-12-hours-prediction-task
    │   ├── train
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── test
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   └── description.md
    └── source_data
        └── arf-12-hours-prediction-task
            ├── ARF_12h.csv
            ├── prepare.py
            └── X.npz
    
  • We have prepared a dataset based on the above description for your reference. You can download it with the following command.

    wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
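
    The download is a zip archive. Assuming it unpacks into the folder layout shown above, you can extract it directly into your data directory (adjust the target directory if the archive's internal layout differs):

      unzip arf-12-hours-prediction-task.zip -d <your local directory>/ds_data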
    

βš™οΈ Set up Environment for Customized datasetsΒΆ

dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
dotenv set DS_CODER_ON_WHOLE_PIPELINE True

  • 📘 More Environment Variables (Optional)

    • If you want to see all the available environment variables, you can refer to the configuration file for Data Science scenarios:

      1from pathlib import Path
      2from typing import Literal
      3
      4from pydantic_settings import SettingsConfigDict
      5
      6from rdagent.app.kaggle.conf import KaggleBasePropSetting
      7
      8
      9class DataScienceBasePropSetting(KaggleBasePropSetting):
     10    # TODO: Kaggle Setting should be the subclass of DataScience
     11    model_config = SettingsConfigDict(env_prefix="DS_", protected_namespaces=())
     12
     13    # Main components
     14    ## Scen
     15    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
     16    """
     17    Scenario class for data science tasks.
     18    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
     19    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
     20    """
     21
     22    planner: str = "rdagent.scenarios.data_science.proposal.exp_gen.planner.DSExpPlannerHandCraft"
     23    hypothesis_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.router.ParallelMultiTraceExpGen"
     24    interactor: str = "rdagent.components.interactor.SkipInteractor"
     25    trace_scheduler: str = "rdagent.scenarios.data_science.proposal.exp_gen.trace_scheduler.RoundRobinScheduler"
     26    """Hypothesis generation class"""
     27
     28    summarizer: str = "rdagent.scenarios.data_science.dev.feedback.DSExperiment2Feedback"
     29    summarizer_init_kwargs: dict = {
     30        "version": "exp_feedback",
     31    }
     32    ## Workflow Related
     33    consecutive_errors: int = 5
     34
     35    ## Coding Related
     36    coding_fail_reanalyze_threshold: int = 3
     37
     38    debug_recommend_timeout: int = 600
     39    """The recommend time limit for running on debugging data"""
     40    debug_timeout: int = 600
     41    """The timeout limit for running on debugging data"""
     42    full_recommend_timeout: int = 3600
     43    """The recommend time limit for running on full data"""
     44    full_timeout: int = 3600
     45    """The timeout limit for running on full data"""
     46
     47    #### model dump
     48    enable_model_dump: bool = False
     49    enable_doc_dev: bool = False
     50    model_dump_check_level: Literal["medium", "high"] = "medium"
     51
     52    #### MCP documentation search integration
     53    enable_mcp_documentation_search: bool = False
     54    """Enable MCP documentation search for error resolution. Requires MCP_ENABLED=true and MCP_CONTEXT7_ENABLED=true in environment."""
     55
     56    ### specific feature
     57
     58    ### notebook integration
     59    enable_notebook_conversion: bool = False
     60
     61    #### enable specification
     62    spec_enabled: bool = True
     63
     64    #### proposal related
     65    # proposal_version: str = "v2" deprecated
     66
     67    coder_on_whole_pipeline: bool = True
     68    max_trace_hist: int = 3
     69
     70    coder_max_loop: int = 10
     71    runner_max_loop: int = 3
     72
     73    sample_data_by_LLM: bool = True
     74    use_raw_description: bool = False
     75    show_nan_columns: bool = False
     76
     77    ### knowledge base
     78    enable_knowledge_base: bool = False
     79    knowledge_base_version: str = "v1"
     80    knowledge_base_path: str | None = None
     81    idea_pool_json_path: str | None = None
     82
     83    ### archive log folder after each loop
     84    enable_log_archive: bool = True
     85    log_archive_path: str | None = None
     86    log_archive_temp_path: str | None = (
     87        None  # This is to store the mid tar file since writing the tar file is preferred in local storage then copy to target storage
     88    )
     89
     90    #### Evaluation on Test related
     91    eval_sub_dir: str = "eval"  # TODO: fixme, this is not a good name
     92    """We'll use f"{DS_RD_SETTING.local_data_path}/{DS_RD_SETTING.eval_sub_dir}/{competition}"
     93    to find the script to evaluate the submission on test"""
     94
     95    """---below are the settings for multi-trace---"""
     96
     97    ### multi-trace related
     98    max_trace_num: int = 1
     99    """The maximum number of traces to grow before merging"""
    100
    101    scheduler_temperature: float = 1.0
    102    """The temperature for the trace scheduler for softmax calculation, used in ProbabilisticScheduler"""
    103
    104    # PUCT exploration constant for MCTSScheduler (ignored by other schedulers)
    105    scheduler_c_puct: float = 1.0
    106    """Exploration constant used by MCTSScheduler (PUCT)."""
    107
    108    enable_score_reward: bool = False
    109    """Enable using score-based reward for trace selection in multi-trace scheduling."""
    110
    111    #### multi-trace:checkpoint selector
    112    selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.expand.LatestCKPSelector"
    113    """The name of the selector to use"""
    114    sota_count_window: int = 5
    115    """The number of trials to consider for SOTA count"""
    116    sota_count_threshold: int = 1
    117    """The threshold for SOTA count"""
    118
    119    #### multi-trace: SOTA experiment selector
    120    sota_exp_selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.submit.GlobalSOTASelector"
    121    """The name of the SOTA experiment selector to use"""
    122
    123    ### multi-trace:inject optimals for multi-trace
    124    # inject diverse when start a new sub-trace
    125    enable_inject_diverse: bool = False
    126
    127    # inject diverse from other traces when start a new sub-trace
    128    enable_cross_trace_diversity: bool = True
    129    """Enable cross-trace diversity injection when starting a new sub-trace.
    130    This is different from `enable_inject_diverse` which is for non-parallel cases."""
    131
    132    diversity_injection_strategy: str = (
    133        "rdagent.scenarios.data_science.proposal.exp_gen.diversity_strategy.InjectUntilSOTAGainedStrategy"
    134    )
    135    """The strategy to use for injecting diversity context."""
    136
    137    # enable different version of DSExpGen for multi-trace
    138    enable_multi_version_exp_gen: bool = False
    139    exp_gen_version_list: str = "v3,v2"
    140
    141    #### multi-trace: time for final multi-trace merge
    142    merge_hours: float = 0
    143    """The time for merge"""
    144
    145    #### multi-trace: max SOTA-retrieved number, used in AutoSOTAexpSelector
    146    # constrains the number of SOTA experiments to retrieve, otherwise too many SOTA experiments to retrieve will cause the exceed of the context window of LLM
    147    max_sota_retrieved_num: int = 10
    148    """The maximum number of SOTA experiments to retrieve in a LLM call"""
    149
    150    #### enable draft before first sota experiment
    151    enable_draft_before_first_sota: bool = False
    152    enable_planner: bool = False
    153
    154    model_architecture_suggestion_time_percent: float = 0.75
    155    allow_longer_timeout: bool = False
    156    coder_enable_llm_decide_longer_timeout: bool = False
    157    runner_enable_llm_decide_longer_timeout: bool = False
    158    coder_longer_timeout_multiplier_upper: int = 3
    159    runner_longer_timeout_multiplier_upper: int = 2
    160    coder_timeout_increase_stage: float = 0.3
    161    runner_timeout_increase_stage: float = 0.3
    162    runner_timeout_increase_stage_patience: int = 2
    163    """Number of failures tolerated before escalating to next timeout level (stage width). Every 'patience' failures, timeout increases by 'runner_timeout_increase_stage'"""
    164    show_hard_limit: bool = True
    165
    166    #### enable runner code change summary
    167    runner_enable_code_change_summary: bool = True
    168
    169    ### Proposal workflow related
    170
    171    #### Hypothesis Generate related
    172    enable_simple_hypothesis: bool = False
    173    """If true, generate simple hypothesis, no more than 2 sentences each."""
    174
    175    enable_generate_unique_hypothesis: bool = False
    176    """Enable generate unique hypothesis. If True, generate unique hypothesis for each component. If False, generate unique hypothesis for each component."""
    177
    178    enable_research_rag: bool = False
    179    """Enable research RAG for hypothesis generation."""
    180
    181    #### hypothesis critique and rewrite
    182    enable_hypo_critique_rewrite: bool = False
    183    """Enable hypothesis critique and rewrite stages for improving hypothesis quality"""
    184    enable_scale_check: bool = False
    185
    186    ##### select related
    187    ratio_merge_or_ensemble: int = 70
    188    """The ratio of merge or ensemble to be considered as a valid solution"""
    189    llm_select_hypothesis: bool = False
    190    """Whether to use LLM to select hypothesis. If True, use LLM selection; if False, use the existing ranking method."""
    191
    192    #### Task Generate related
    193    fix_seed_and_data_split: bool = False
    194
    195    ensemble_time_upper_bound: bool = False
    196
    197    user_interaction_wait_seconds: int = 6000  # seconds to wait for user interaction
    198    user_interaction_mid_folder: Path = Path.cwd() / "git_ignore_folder" / "RD-Agent_user_interaction"
    199
    200
    201DS_RD_SETTING = DataScienceBasePropSetting()
    202
    203# enable_cross_trace_diversity and llm_select_hypothesis should not be true at the same time
    204assert not (
    205    DS_RD_SETTING.enable_cross_trace_diversity and DS_RD_SETTING.llm_select_hypothesis
    206), "enable_cross_trace_diversity and llm_select_hypothesis cannot be true at the same time"
    
    • These variables allow you to have finer-grained control in Data Science scenarios.

🚀 Run the Application

  • 🌏 You can directly run the application by using the following command:

    rdagent data_science --competition <Competition ID>
    
    • The following shows the command to run for the arf-12-hours-prediction-task data:

      rdagent data_science --competition arf-12-hours-prediction-task
      
    • More CLI Parameters for rdagent data_science command:

    rdagent.app.data_science.loop.main(path: str | None = None, checkout: bool = True, checkout_path: str | None = None, step_n: int | None = None, loop_n: int | None = None, timeout: str | None = None, competition='bms-molecular-translation', replace_timer=True, exp_gen_cls: str | None = None)

    Parameters
    path :

    A path like $LOG_PATH/__session__/1/0_propose. This indicates that we restore the state after finishing step 0 in loop 1.

    checkout :

    Used to control the log session path. Boolean type, default is True.

    - If True, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.

    - If False, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.

    checkout_path:

    If a checkout_path (or a str like Path) is provided, the new loop will be saved to that path, leaving the original path unchanged.

    step_n :

    Number of steps to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    loop_n :

    Number of loops to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    - If the current loop is incomplete, it will be counted as the first loop for completion.

    - If both step_n and loop_n are provided, the process will stop as soon as either condition is met.

    timeout :

    Maximum duration to run the loop. Accepts a string format recognized by the internal timer.

    - If None, the loop will run until completion, error, or KeyboardInterrupt.

    competition :

    Competition name.

    replace_timer :

    If a session is loaded, determines whether to replace the timer with session.timer.

    exp_gen_cls :

    When there are different stages, exp_gen can be replaced with a new proposal-generation class.

    Auto R&D Evolving loop for models in a Kaggle scenario. You can continue running a session by using the command:

    dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose  --step_n 1   # `step_n` is an optional parameter
    rdagent kaggle --competition playground-series-s4e8  # This command is recommended.
    
  • 📈 Visualize the R&D Process

    • We provide a web UI to visualize the log. You just need to run:

      rdagent ui --port <custom port> --log-dir <your log folder like "log/"> --data_science True
      
    • Then you can input the log path and visualize the R&D process.

  • 🧪 Scoring the test results

    • Finally, shut down the program and get the test set scores with this command:

    dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
    

    Here, <url_to_log> refers to the parent directory of the log folder generated during the run.
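
    For example, if your run wrote its logs under ./log/, the command would look like this (the path is illustrative):

      dotenv run -- python rdagent/log/mle_summary.py grade ./log/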

🕹️ Kaggle Agent

📖 Background

In the landscape of data science competitions, Kaggle serves as the ultimate arena where data enthusiasts harness the power of algorithms to tackle real-world challenges. The Kaggle Agent stands as a pivotal tool, empowering participants to seamlessly integrate cutting-edge models and datasets, transforming raw data into actionable insights.

By utilizing the Kaggle Agent, data scientists can craft innovative solutions that not only uncover hidden patterns but also drive significant advancements in predictive accuracy and model robustness.

🧭 Example Guide - Kaggle Dataset

🛠️ Preparing For The Competition

  • 🔨 Configuring the Kaggle API

    • Register and login on the Kaggle website.

    • Click on your avatar (usually in the top right corner of the page) -> Settings -> Create New Token. A file called kaggle.json will be downloaded.

    • Move kaggle.json to ~/.config/kaggle/

    • Modify the permissions of the kaggle.json file.

      chmod 600 ~/.config/kaggle/kaggle.json
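
      The downloaded kaggle.json is a small JSON file containing your user name and API key, for example (values are placeholders):

        {"username": "<your-kaggle-username>", "key": "<your-api-key>"}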
      
    • For more information about Kaggle API settings, refer to the Kaggle API documentation.

  • 🔩 Set the environment variables in the .env file

    • Determine the path where the data will be stored and add it to the .env file.

    mkdir -p <your local directory>/ds_data
    dotenv set KG_LOCAL_DATA_PATH <your local directory>/ds_data
    
    • 📘 More Environment Variables (Optional)

      • The full list of available environment variables is the same configuration file for Data Science scenarios shown earlier in the "Set up Environment for Customized datasets" section; refer to that listing for finer-grained control in Data Science scenarios.

  • 🗳️ Join the competition

    • If your Kaggle API account has not yet joined the competition, you will need to join it before running the program.

      • At the bottom of the competition details page, you can find the Join the competition button; click it and select I Understand and Accept to join the competition.

      • In the Competition List Available below, you can jump to the competition details page.

📥 Preparing Competition Dataset & Setting up RD-Agent Environment

  • As a subset of data science, Kaggle datasets still follow the data science format. Based on this, a Kaggle dataset can be divided into two categories depending on whether or not it is supported by MLE-Bench.

    • What is MLE-Bench?

      • MLE-Bench is a comprehensive benchmark designed to evaluate the machine learning engineering capabilities of AI systems using real-world scenarios. The dataset includes multiple Kaggle competitions. Since Kaggle does not provide reserved test sets for these competitions, the benchmark includes preparation scripts for splitting publicly available training data into new training and test sets, and scoring scripts for each competition to accurately evaluate submission scores.

    • Is the competition I'm running supported by MLE-Bench?

      • You can see all the competitions supported by MLE-Bench here.

  • Prepare datasets for MLE-Bench supported competitions.

    • If you agree with the MLE-Bench standard, you don't need to prepare the dataset yourself; you just need to configure your .env file to automate the download of the dataset.

      • Add DS_IF_USING_MLE_DATA to the environment variables and set it to True.

        dotenv set DS_IF_USING_MLE_DATA True
        
      • Add DS_SAMPLE_DATA_BY_LLM to the environment variables and set it to True.

        dotenv set DS_SAMPLE_DATA_BY_LLM True
        
      • Add DS_SCEN to the environment variables and set it to rdagent.scenarios.data_science.scen.KaggleScen.

        dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
        
    • At this point, you are ready to run your competition: the data will be downloaded automatically, and the LLM will automatically extract a minimal sample dataset.

      • After running the program, the structure of the ds_data folder should look like this (using the tabular-playground-series-dec-2021 competition as an example).

         ds_data
         ├── tabular-playground-series-dec-2021
         │   ├── description.md
         │   ├── sample_submission.csv
         │   ├── test.csv
         │   └── train.csv
         └── zip_files
             └── tabular-playground-series-dec-2021
                 └── tabular-playground-series-dec-2021.zip
        
         • The ds_data/zip_files folder contains a zip file of the raw competition data downloaded from the Kaggle website.

    • At runtime, RD-Agent will automatically build the Docker image specified at rdagent/scenarios/kaggle/docker/mle_bench_docker/Dockerfile. This image is responsible for downloading the required datasets and grading files for MLE-Bench.

    Note: The first run may take longer than subsequent runs as the Docker image and data are being downloaded and set up for the first time.

  • Prepare datasets for competitions that are not supported by MLE-Bench.

    • Since Kaggle is a subset of data science, we can follow the data science dataset format and steps to prepare a Kaggle dataset. Below we describe the workflow for preparing a Kaggle dataset, using the playground-series-s4e9 competition as an example.

      • Create a ds_data/source_data/playground-series-s4e9 folder, which will be used to store your raw dataset.

        • The raw data for the playground-series-s4e9 competition consists of three files: train.csv, test.csv, and sample_submission.csv. There are two ways to get the raw data:

          • You can find the raw data required for the competition on the official Kaggle website.

          • Or you can download the raw data for the competition from the command line with the following command:

            kaggle competitions download -c playground-series-s4e9
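
            The Kaggle CLI saves a zip archive (playground-series-s4e9.zip) in the current directory; extract it into the source_data folder so that prepare.py can find the raw files, for example:

              unzip playground-series-s4e9.zip -d <your local directory>/ds_data/source_data/playground-series-s4e9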
            
      • Create a ds_data/source_data/playground-series-s4e9/prepare.py file that splits your raw data into training data, test data, formatted submission file, and standard answer file. (You will need to write a script based on your raw data.)

        • The following shows the preprocessing code for the raw data of playground-series-s4e9.

         ds_data/source_data/playground-series-s4e9/prepare.py
        from pathlib import Path

        import pandas as pd
        from sklearn.model_selection import train_test_split


        def prepare(raw: Path, public: Path, private: Path):

            # Make sure the output directories exist
            public.mkdir(parents=True, exist_ok=True)
            private.mkdir(parents=True, exist_ok=True)

            # Create train and test splits from train set
            old_train = pd.read_csv(raw / "train.csv")
            new_train, new_test = train_test_split(old_train, test_size=0.1, random_state=0)

            # Create sample submission
            sample_submission = new_test.copy()
            sample_submission["price"] = 43878.016
            sample_submission.drop(sample_submission.columns.difference(["id", "price"]), axis=1, inplace=True)
            sample_submission.to_csv(public / "sample_submission.csv", index=False)

            # Create private files (the standard answers, hidden from the agent)
            new_test.to_csv(private / "submission_test.csv", index=False)

            # Create public files visible to agents
            new_train.to_csv(public / "train.csv", index=False)
            new_test.drop(["price"], axis=1, inplace=True)
            new_test.to_csv(public / "test.csv", index=False)

            # Checks
            assert new_test.shape[1] == 12, "Public test set should have 12 columns"
            assert new_train.shape[1] == 13, "Public train set should have 13 columns"
            assert len(new_train) + len(new_test) == len(
                old_train
            ), "Length of new_train and new_test should equal length of old_train"


        if __name__ == "__main__":
            competitions = "playground-series-s4e9"
            raw = Path(__file__).resolve().parent
            prepare(
                raw=raw,
                public=raw.parent.parent / competitions,
                private=raw.parent.parent / "eval" / competitions,
            )
        
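        • To generate the splits, run the script (for example, python prepare.py from inside ds_data/source_data/playground-series-s4e9). Based on the paths in the __main__ block above, it writes the public files to ds_data/playground-series-s4e9 and the private answer file to ds_data/eval/playground-series-s4e9.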
        • At the end of program execution, the ds_data folder structure will look like this:

        ds_data
        β”œβ”€β”€ playground-series-s4e9
        β”‚   β”œβ”€β”€ train.csv
        β”‚   β”œβ”€β”€ test.csv
        β”‚   └── sample_submission.csv
        β”œβ”€β”€ eval
        β”‚   └── playground-series-s4e9
        β”‚       └── submission_test.csv
        └── source_data
            └── playground-series-s4e9
                β”œβ”€β”€ prepare.py
                β”œβ”€β”€ sample_submission.csv
                β”œβ”€β”€ test.csv
                └── train.csv
        
      • Create a ds_data/playground-series-s4e9/description.md file that describes your competition: the task, the dataset, the evaluation metric, and other relevant information. For Kaggle competitions, this information can be copied from the competition's page on the Kaggle website.

        • The following shows the description file for playground-series-s4e9.

          ds_data/playground-series-s4e9/description.mdΒΆ
          # Competition name: playground-series-s4e9

          ## Overview

          **Welcome to the 2024 Kaggle Playground Series!** We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

          **Your Goal:** The goal of this competition is to predict the price of used cars based on various attributes.

          ## Evaluation

          ### Root Mean Squared Error (RMSE)

          Submissions are scored on the root mean squared error. RMSE is defined as:

          $$
          \mathrm{RMSE} = \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \right)^{\frac{1}{2}}
          $$

          where $\hat{y}_i$ is the predicted value and $y_i$ is the original value for each instance $i$.

          ### Submission File

          For each `id` in the test set, you must predict the `price` of the car. The file should contain a header and have the following format:

          ```
          id,price
          188533,43878.016
          188534,43878.016
          188535,43878.016
          etc.
          ```

          ## Timeline
          - **Start Date** - September 1, 2024
          - **Entry Deadline** - Same as the Final Submission Deadline
          - **Team Merger Deadline** - Same as the Final Submission Deadline
          - **Final Submission Deadline** - September 30, 2024

          All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

          ## About the Tabular Playground Series

          The goal of the Tabular Playground Series is to provide the Kaggle community with a variety of fairly light-weight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The duration of each competition will generally only last a few weeks, and may have longer or shorter durations depending on the challenge. The challenges will generally use fairly light-weight datasets that are synthetically generated from real-world data, and will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

          ### Synthetically-Generated Datasets

          Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state-of-the-art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

          ## Prizes
          - 1st Place - Choice of Kaggle merchandise
          - 2nd Place - Choice of Kaggle merchandise
          - 3rd Place - Choice of Kaggle merchandise

          **Please note**: In order to encourage more participation from beginners, Kaggle merchandise will only be awarded once per person in this series. If a person has previously won, we'll skip to the next team.

          ## Citation

          Walter Reade and Ashley Chow. Regression of Used Car Prices. https://kaggle.com/competitions/playground-series-s4e9, 2024. Kaggle.

          ## Dataset Description

          The dataset for this competition (both train and test) was generated from a deep learning model trained on the [Used Car Price Prediction Dataset](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset). Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

          ## Files

          - **train.csv** - the training dataset; `price` is the continuous target
          - **test.csv** - the test dataset; your objective is to predict the value of `price` for each row
          - **sample_submission.csv** - a sample submission file in the correct format
          
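        • Note that the evaluation metric described in this file (RMSE) should match the metric computed by the grade.py script introduced below.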
      • Create a ds_data/eval/playground-series-s4e9/valid.py file, which is used to check the validity of the submission file by ensuring that its format is consistent with the reference file.

        • The following shows a script that checks the validity of a submission based on the playground-series-s4e9 data.

        ds_data/eval/playground-series-s4e9/valid.pyΒΆ
        from pathlib import Path

        # Check if our submission file exists
        assert Path("submission.csv").exists(), "Error: submission.csv not found"

        submission_lines = Path("submission.csv").read_text().splitlines()  # generated automatically by the agent
        test_lines = Path("submission_test.csv").read_text().splitlines()  # the standard answer file

        is_valid = len(submission_lines) == len(test_lines)

        if is_valid:
            message = "submission.csv and submission_test.csv have the same number of lines."
        else:
            message = (
                f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
            )

        print(message)

        if not is_valid:
            raise AssertionError("Submission is invalid")
        
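        • Beyond comparing line counts, you may also want to confirm that the submission header matches the expected columns. The following optional extension to the script above is only a sketch (the expected header id,price is taken from the submission format in description.md); it is not part of the default RD-Agent checks:

        # Optional: also check that the header row matches the expected columns
        expected_header = "id,price"
        actual_header = submission_lines[0].strip()
        assert actual_header == expected_header, f"Unexpected header: {actual_header!r} (expected {expected_header!r})"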
      • Create a ds_data/eval/playground-series-s4e9/grade.py file, which is used to calculate the score based on the submission file and the standard answer file, and output the result in JSON format.

        • The following shows a grading script implemented for the playground-series-s4e9 data.

        ds_data/eval/playground-series-s4e9/grade.pyΒΆ
        import datetime
        import json

        import numpy as np
        import pandas as pd
        from sklearn.metrics import mean_squared_error


        class InvalidSubmissionError(Exception):
            """
            A custom exception for when the agent submission cannot be graded.
            """

            pass


        def prepare_for_metric(submission: pd.DataFrame, answers: pd.DataFrame) -> dict:

            if "id" not in submission.columns or "id" not in answers.columns:
                raise InvalidSubmissionError("Both submission and answers DataFrames must contain an 'id' column.")

            if "price" not in submission.columns:
                raise InvalidSubmissionError("Submission DataFrame must contain a 'price' column.")

            assert "price" in answers.columns, "Answers DataFrame must contain a 'price' column."

            if len(submission) != len(answers):
                raise InvalidSubmissionError("Submission must be the same length as the answers.")

            answers_sorted = answers.sort_values("id")
            submission_sorted = submission.sort_values("id")

            if (submission_sorted["id"].values != answers_sorted["id"].values).any():
                raise InvalidSubmissionError("Submission and answers have mismatched 'id' columns")

            y_true = answers_sorted[["price"]].to_numpy()
            y_score = submission_sorted[["price"]].to_numpy()

            return {"y_true": y_true, "y_score": y_score}


        def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
            metric_inputs = prepare_for_metric(submission, answers)
            return np.sqrt(mean_squared_error(metric_inputs["y_true"], metric_inputs["y_score"]))


        if __name__ == "__main__":
            submission_path = "submission.csv"
            gt_submission_path = "submission_test.csv"
            submission = pd.read_csv(submission_path)
            answers = pd.read_csv(gt_submission_path)
            score = grade(submission=submission, answers=answers)

            # These `thresholds` can be customized according to the leaderboard page of the Kaggle website and your own needs.
            # Refs: https://www.kaggle.com/competitions/playground-series-s4e9/leaderboard
            thresholds = {
                "gold": 62917.05988,
                "silver": 62945.91714,
                "bronze": 62958.13747,
                "median": 63028.69429,
            }

            # The output must be in JSON format. To configure the full output,
            # you can run the command `rdagent grade_summary --log-folder` to summarize the scores at the end of the program.
            # If you don't need it, you can just provide the `competition_id` and `score`.
            # The metric is RMSE, so lower scores are better: medals are awarded when the score
            # is at or below the corresponding leaderboard threshold.
            print(
                json.dumps(
                    {
                        "competition_id": "playground-series-s4e9",
                        "score": score,
                        "gold_threshold": thresholds["gold"],
                        "silver_threshold": thresholds["silver"],
                        "bronze_threshold": thresholds["bronze"],
                        "median_threshold": thresholds["median"],
                        "any_medal": bool(score <= thresholds["bronze"]),
                        "gold_medal": bool(score <= thresholds["gold"]),
                        "silver_medal": bool(score <= thresholds["silver"]),
                        "bronze_medal": bool(score <= thresholds["bronze"]),
                        "above_median": bool(score <= thresholds["median"]),
                        "submission_exists": True,
                        "valid_submission": True,
                        "is_lower_better": True,
                        "created_at": str(datetime.datetime.now().isoformat()),
                        "submission_path": submission_path,
                    }
                )
            )
        
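        • To sanity-check the grading logic, you can call grade() directly on a tiny hand-made pair of DataFrames. A minimal sketch (the toy values are illustrative, and the import assumes the grade.py shown above is on the path, e.g. when run from the same folder):

        import pandas as pd

        from grade import grade  # assumes grade.py from above is importable

        # Each prediction is off by exactly 1000, so the RMSE should be 1000.0;
        # the shuffled ids also exercise the sort-by-id logic in prepare_for_metric.
        answers = pd.DataFrame({"id": [1, 2, 3], "price": [10000.0, 20000.0, 30000.0]})
        submission = pd.DataFrame({"id": [3, 1, 2], "price": [31000.0, 11000.0, 19000.0]})

        print(grade(submission=submission, answers=answers))  # -> 1000.0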
    • In this example we do not create a ds_data/eval/playground-series-s4e9/sample.py; instead, we use the default sample method provided by RD-Agent.

    • At this point, you have created a complete dataset. The correct structure of the dataset should look like this.

      ds_data
      β”œβ”€β”€ playground-series-s4e9
      β”‚   β”œβ”€β”€ train.csv
      β”‚   β”œβ”€β”€ test.csv
      β”‚   β”œβ”€β”€ description.md
      β”‚   └── sample_submission.csv
      β”œβ”€β”€ eval
      β”‚   └── playground-series-s4e9
      β”‚       β”œβ”€β”€ grade.py
      β”‚       β”œβ”€β”€ submission_test.csv
      β”‚       └── valid.py
      └── source_data
          └── playground-series-s4e9
              β”œβ”€β”€ prepare.py
              β”œβ”€β”€ sample_submission.csv
              β”œβ”€β”€ test.csv
              └── train.csv
      
    • We have prepared a dataset based on the above description for your reference. You can download it with the following command.

      wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/playground-series-s4e9.zip
      
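    • After downloading, extract the archive under your DS_LOCAL_DATA_PATH directory (ds_data) so that the resulting layout matches the tree shown above.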
    • Next, we need to configure the environment for the playground-series-s4e9 competition. You can do this by executing the following commands at the command line.

      dotenv set DS_IF_USING_MLE_DATA False
      dotenv set DS_SAMPLE_DATA_BY_LLM False
      dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
      
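      • These commands update the corresponding entries in your .env file: DS_IF_USING_MLE_DATA is set to False because the dataset was prepared locally rather than downloaded through MLE-Bench, DS_SAMPLE_DATA_BY_LLM is set to False so that data sampling is not delegated to the LLM (consistent with using the default sample method mentioned above), and DS_SCEN switches the scenario class to KaggleScen.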
πŸš€ Run the ApplicationΒΆ
  • 🌏 You can directly run the application by using the following command:

    rdagent data_science --competition <Competition ID>
    
    • The following shows the command to run for the playground-series-s4e9 competition.

      rdagent data_science --competition playground-series-s4e9
      
    • More CLI parameters for the rdagent data_science command:

    rdagent.app.data_science.loop.main(path: str | None = None, checkout: bool = True, checkout_path: str | None = None, step_n: int | None = None, loop_n: int | None = None, timeout: str | None = None, competition='bms-molecular-translation', replace_timer=True, exp_gen_cls: str | None = None)
    ParametersΒΆ
    path :

    A path like $LOG_PATH/__session__/1/0_propose. This indicates that we restore the state after finishing step 0 in loop 1.

    checkout :

    Used to control the log session path. Boolean type, default is True.

    - If True, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.
    - If False, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.

    checkout_path :

    If a checkout_path (or a path-like str) is provided, the new loop will be saved to that path, leaving the original path unchanged.

    step_n :

    Number of steps to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    loop_n :

    Number of loops to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    - If the current loop is incomplete, it will be counted as the first loop for completion.
    - If both step_n and loop_n are provided, the process will stop as soon as either condition is met.

    timeout :

    Maximum duration to run the loop. Accepts a string format recognized by the internal timer.

    - If None, the loop will run until completion, error, or KeyboardInterrupt.

    competition :

    Competition name.

    replace_timer :

    If a session is loaded, determines whether to replace the timer with session.timer.

    exp_gen_cls :

    When there are different stages, the experiment generation class (exp_gen) can be replaced with a new proposal class.

    Auto R&D Evolving loop for models in a Kaggle scenario. You can continue running a session by using the command:

    dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose  --step_n 1   # `step_n` is an optional parameter
    rdagent kaggle --competition playground-series-s4e8  # This command is recommended.
    
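    • The same entry point can also be invoked from Python. A minimal sketch based on the signature above (the parameter values below are illustrative assumptions):

      from rdagent.app.data_science.loop import main

      # Run a single R&D loop for the prepared competition
      main(competition="playground-series-s4e9", loop_n=1)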
  • πŸ“ˆ Visualize the R&D Process

    • We provide a web UI to visualize the log. You just need to run:

      rdagent ui --port <custom port> --log-dir <your log folder like "log/"> --data_science True
      
    • Then you can input the log path and visualize the R&D process.

  • πŸ§ͺ Scoring the test results

    • Finally, shut down the program and get the test set scores with this command.

    dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
    
    • If you have configured the full output in ds_data/eval/playground-series-s4e9/grade.py, or if you are running a competition that is supported by MLE-Bench, you can also summarize the scores by running the following command.

    rdagent grade_summary --log-folder=<url_to_log>
    

    Here, <url_to_log> refers to the parent directory of the log folder generated during the run.