Data Science Agent

🤖 Automated Feature Engineering & Model Tuning Evolution

The Data Science Agent automatically performs feature engineering and model tuning. It can be applied to a wide range of data science problems, such as image classification, time series forecasting, and text classification.

🌟 Introduction

In this scenario, the automated system proposes hypotheses, chooses actions, implements code, runs validation, and incorporates feedback in a continuous, iterative loop.

The goal is to automatically optimize performance metrics on the validation set or the Kaggle leaderboard, ultimately discovering effective features and models through autonomous research and development.

Here is an outline of the steps:

Step 1: Hypothesis Generation 🔍

  • Generate and propose initial hypotheses based on previous experiment analysis and domain expertise, with thorough reasoning and supporting evidence.

Step 2: Experiment Creation ✨

  • Transform the hypothesis into a task.

  • Choose a specific action within feature engineering or model tuning.

  • Develop, define, and implement a new feature or model, including its name, description, and formulation.

Step 3: Model/Feature Implementation 👨‍💻

  • Implement the model code based on the detailed description.

  • Evolve the model iteratively as a developer would, ensuring accuracy and efficiency.

Step 4: Validation on Test Set or Kaggle 📉

  • Validate the newly developed model using the test set or Kaggle dataset.

  • Assess the model's effectiveness and performance based on the validation results.

Step 5: Feedback Analysis 🔍

  • Analyze validation results to assess performance.

  • Use insights to refine hypotheses and enhance the model.

Step 6: Hypothesis Refinement ♻️

  • Adjust hypotheses based on validation feedback.

  • Iterate the process to continuously improve the model.

📖 Data Science Background

In the evolving landscape of artificial intelligence, Data Science represents a powerful paradigm where machines engage in autonomous exploration, hypothesis testing, and model development across diverse domains, from healthcare and finance to logistics and research.

The Data Science Agent stands as a central engine in this transformation, enabling users to automate the entire machine learning workflow: from hypothesis generation to code implementation, validation, and refinement, all guided by performance feedback.

By leveraging the Data Science Agent, researchers and developers can accelerate experimentation cycles. Whether fine-tuning custom models or competing in high-stakes benchmarks like Kaggle, the Data Science Agent unlocks new frontiers in intelligent, self-directed discovery.

🧭 Example Guide - Customized dataset

🔧 Set up RD-Agent Environment

  • Before you start, make sure you have installed RD-Agent and configured its environment correctly. If you want to know how to install and configure RD-Agent, please refer to the documentation.

  • 🔩 Set the environment variables in the .env file

    • Determine the path where the data will be stored and add it to the .env file.

    dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
    dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
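
    After running these commands, your .env file should contain entries equivalent to the following (the path is a placeholder for your own directory, and quoting may differ depending on your dotenv version):

      DS_LOCAL_DATA_PATH=<your local directory>/ds_data
      DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen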
    

📥 Prepare Customized datasets

  • A data science competition dataset usually consists of two parts: the competition dataset and the evaluation dataset. (We provide a customized sample dataset named arf-12-hours-prediction-task as a reference.)

    • The competition dataset contains the training data, test data, description file, formatted submission file, and data sampling code.

    • The evaluation dataset contains the standard answer file, the data checking code, and the score calculation code.

  • We use the arf-12-hours-prediction-task data as an example to walk through the preparation workflow for the competition dataset.

    • Create a ds_data/source_data/arf-12-hours-prediction-task folder, which will be used to store your raw dataset.

      • The raw data for the arf-12-hours-prediction-task competition consists of two files: ARF_12h.csv and X.npz.

    • Create a ds_data/source_data/arf-12-hours-prediction-task/prepare.py file that splits your raw data into training data, test data, formatted submission file, and standard answer file. (You will need to write a script based on your raw data.)

      • The following shows the preprocessing code for the raw data of arf-12-hours-prediction-task.

      ds_data/source_data/arf-12-hours-prediction-task/prepare.py
       1import random
       2from pathlib import Path
       3
       4import numpy as np
       5import pandas as pd
       6import sparse
       7
       8CURRENT_DIR = Path(__file__).resolve().parent
       9ROOT_DIR = CURRENT_DIR.parent.parent
      10
      11raw_feature_path = CURRENT_DIR / "X.npz"
      12raw_label_path = CURRENT_DIR / "ARF_12h.csv"
      13
      14public = ROOT_DIR / "arf-12-hours-prediction-task"
      15private = ROOT_DIR / "eval" / "arf-12-hours-prediction-task"
      16
      17if not (public / "test").exists():
      18    (public / "test").mkdir(parents=True, exist_ok=True)
      19
      20if not (public / "train").exists():
      21    (public / "train").mkdir(parents=True, exist_ok=True)
      22
      23if not private.exists():
      24    private.mkdir(parents=True, exist_ok=True)
      25
      26SEED = 42
      27random.seed(SEED)
      28np.random.seed(SEED)
      29
      30X_sparse = sparse.load_npz(raw_feature_path)  # COO matrix, shape: [N, D, T]
      31df_label = pd.read_csv(raw_label_path)  # Contains column 'ARF_LABEL'
      32N = X_sparse.shape[0]
      33
      34indices = np.arange(N)
      35np.random.shuffle(indices)
      36split = int(0.7 * N)
      37train_idx, test_idx = indices[:split], indices[split:]
      38
      39X_train = X_sparse[train_idx]
      40X_test = X_sparse[test_idx]
      41
      42df_train = df_label.iloc[train_idx].reset_index(drop=True)
      43df_test = df_label.iloc[test_idx].reset_index(drop=True)
      44
      45submission_df = df_test.copy()
      46submission_df["ARF_LABEL"] = 0
      47submission_df.drop(submission_df.columns.difference(["ID", "ARF_LABEL"]), axis=1, inplace=True)
      48submission_df.to_csv(public / "sample_submission.csv", index=False)
      49
      50df_test.to_csv(private / "submission_test.csv", index=False)
      51
      52df_test.drop(["ARF_LABEL"], axis=1, inplace=True)
      53df_test.to_csv(public / "test" / "ARF_12h.csv", index=False)
      54sparse.save_npz(public / "test" / "X.npz", X_test)
      55
      56sparse.save_npz(public / "train" / "X.npz", X_train)
      57df_train.to_csv(public / "train" / "ARF_12h.csv", index=False)
      58
      59assert (
      60    X_train.shape[0] == df_train.shape[0]
      61), f"Mismatch: X_train rows ({X_train.shape[0]}) != df_train rows ({df_train.shape[0]})"
      62assert (
      63    X_test.shape[0] == df_test.shape[0]
      64), f"Mismatch: X_test rows ({X_test.shape[0]}) != df_test rows ({df_test.shape[0]})"
      65assert df_test.shape[1] == 2, "Public test set should have 2 columns"
      66assert df_train.shape[1] == 3, "Public train set should have 3 columns"
      67assert len(df_train) + len(df_test) == len(
      68    df_label
      69), "Length of df_train and df_test should equal length of df_label"
      
      • At the end of program execution, the ds_data folder structure will look like this:

      ds_data
      ├── arf-12-hours-prediction-task
      │   ├── train
      │   │   ├── ARF_12h.csv
      │   │   └── X.npz
      │   ├── test
      │   │   ├── ARF_12h.csv
      │   │   └── X.npz
      │   └── sample_submission.csv
      ├── eval
      │   └── arf-12-hours-prediction-task
      │       └── submission_test.csv
      └── source_data
          └── arf-12-hours-prediction-task
              ├── ARF_12h.csv
              ├── prepare.py
              └── X.npz
      
    • Create a ds_data/arf-12-hours-prediction-task/description.md file to describe your competition, its objective, the dataset, and other information.

      • The following shows the description file for arf-12-hours-prediction-task

      ds_data/arf-12-hours-prediction-task/description.md
       1# Competition name: ARF 12-Hour Prediction Task
       2
       3## Overview
       4
       5### Description
       6
       7Acute Respiratory Failure (ARF) is a life-threatening condition that often develops rapidly in critically ill patients. Accurate early prediction of ARF is crucial in intensive care units (ICUs) to enable timely clinical interventions and resource allocation. In this task, you are asked to build a machine learning model that predicts whether a patient will develop ARF within the next **12 hours**, based on multivariate clinical time series data.
       8
       9The dataset is extracted from electronic health records (EHRs) and preprocessed using the **FIDDLE** pipeline to generate structured temporal features for each patient.
      10
      11### Objective
      12
      13**Your Goal** is to develop a binary classification model that takes a 12-hour time series as input and predicts whether ARF will occur (1) or not (0) in the following 12 hours.
      14
      15---
      16
      17## Data Description
      18
      191. train/ARF_12h.csv: A CSV file containing the ICU stay ID, the hour of ARF onset, and the binary label indicating whether ARF will occur in the next 12 hours.
      20
      21    * Columns: ID, ARF_ONSET_HOUR, ARF_LABEL
      22
      232. train/X.npz: N × T × D sparse tensor containing time-dependent features.
      24
      25    * N: Number of samples (number of ICU stays) 
      26    * T: Time step (12 hours of records per sample)
      27    * D: Dynamic feature dimension (how many features per hour) 
      28
      293. test/ARF_12h.csv: Ground truth labels (used for evaluation only).
      30
      314. test/X.npz: Test feature set in the same format as training data.
      32
      33---
      34
      35## Data usage Notes
      36
      37To load the features, you need python and the sparse package.
      38
      39import sparse
      40
      41X = sparse.load_npz("<url>/X.npz").todense()
      42
      43
      44To load the labels, use pandas or an alternative csv reader.
      45
      46import pandas as pd
      47
      48df = pd.read_csv("<url>/ARF_12h.csv")
      49
      50
      51---
      52
      53## Modeling
      54
      55Each sample is a 12-hour multivariate time series of ICU patient observations, represented as a tensor of shape (12, D).
      56The goal is to predict whether the patient will develop ARF (1) or not (0) in the following 12 hours.
      57
      58* **Input**: 12 × D matrix of clinical features
      59* **Output**: Binary prediction: 0 (no ARF) or 1 (ARF onset)
      60* **Loss Function**: BCEWithLogitsLoss, CrossEntropyLoss or equivalent
      61* **Evaluation Metric**: **AUROC** (Area Under the Receiver Operating Characteristic Curve)
      62
      63Note: Although the output is binary, AUROC evaluates the ranking quality of predicted scores. Therefore, your model should output a confidence score during training, which is then thresholded to produce 0 or 1 for final submission.
      64
      65---
      66
      67## Evaluation
      68
      69### Area Under the Receiver Operating Characteristic curve (AUROC)
      70
      71The submissions are scored according to the area under the receiver operating characteristic curve. AUROC is defined as:
      72
      73$$
      74\text{AUROC} = \frac{1}{|P| \cdot |N|} \sum_{i \in P} \sum_{j \in N} \left[ \mathbb{1}(s_i > s_j) + \frac{1}{2} \cdot \mathbb{1}(s_i = s_j) \right]
      75$$
      76
      77AUROC reflects the model's ability to rank positive samples higher than negative ones. A score of 1.0 means perfect discrimination, and 0.5 means random guessing.
      78
      79### Submission Format
      80
      81For each `ID` in the ARF_12h.csv file of the test dataset, you must predict whether ARF will occur (label = 1) or not (label = 0) in the following 12 hours (`ARF_LABEL`), based on X.npz (the sparse tensor of time-varying features). The file should have the following format:
      82
      83ID,ARF_LABEL
      84246505,0
      85291335,0
      86286713,0
      87etc.
      88
      89
      90Note: Although the submission is binary, AUROC evaluates the ranking quality of your model. It is recommended to output probabilities during training and apply a threshold (e.g., 0.5) to convert to binary labels for submission.
      91
      92---
      
    • Create a ds_data/arf-12-hours-prediction-task/sample.py file to construct the debugging sample data.

      • The following script constructs the debugging sample data for the arf-12-hours-prediction-task dataset.

      ds_data/arf-12-hours-prediction-task/sample.py
       1import shutil
       2from pathlib import Path
       3
       4import numpy as np
       5import pandas as pd
       6import sparse
       7from tqdm import tqdm
       8
       9
      10def sample_and_copy_subfolder(
      11    input_dir: Path,
      12    output_dir: Path,
      13    min_frac: float,
      14    min_num: int,
      15    seed: int = 42,
      16):
      17    np.random.seed(seed)
      18
      19    feature_path = input_dir / "X.npz"
      20    label_path = input_dir / "ARF_12h.csv"
      21
      22    # Load sparse features and label
      23    X_sparse = sparse.load_npz(feature_path)
      24    df_label = pd.read_csv(label_path)
      25
      26    N = X_sparse.shape[0]
      27    n_keep = max(int(N * min_frac), min_num)
      28    idx = np.random.choice(N, n_keep, replace=False)
      29
      30    X_sample = X_sparse[idx]
      31    df_sample = df_label.iloc[idx].reset_index(drop=True)
      32
      33    output_dir.mkdir(parents=True, exist_ok=True)
      34    sparse.save_npz(output_dir / "X.npz", X_sample)
      35    df_sample.to_csv(output_dir / "ARF_12h.csv", index=False)
      36
      37    print(f"[INFO] Sampled {n_keep} of {N} from {input_dir.name}")
      38
      39    # Copy additional files
      40    for f in input_dir.glob("*"):
      41        if f.name not in {"X.npz", "ARF_12h.csv"} and f.is_file():
      42            shutil.copy(f, output_dir / f.name)
      43            print(f"[COPY] Extra file: {f.name}")
      44
      45
      46def copy_other_file(source: Path, target: Path):
      47    for item in source.iterdir():
      48        if item.name in {"train", "test"}:
      49            continue
      50
      51        relative_path = item.relative_to(source)
      52        target_path = target / relative_path
      53
      54        if item.is_dir():
      55            shutil.copytree(item, target_path, dirs_exist_ok=True)
      56            print(f"[COPY DIR] {item} -> {target_path}")
      57        elif item.is_file():
      58            target_path.parent.mkdir(parents=True, exist_ok=True)
      59            shutil.copy2(item, target_path)
      60            print(f"[COPY FILE] {item} -> {target_path}")
      61
      62
      63def create_debug_data(
      64    dataset_path: str,
      65    output_path: str,
      66    min_frac: float = 0.02,
      67    min_num: int = 10,
      68):
      69    dataset_root = Path(dataset_path) / "arf-12-hours-prediction-task"
      70    output_root = Path(output_path)
      71
      72    for sub in ["train", "test"]:
      73        input_dir = dataset_root / sub
      74        output_dir = output_root / sub
      75        print(f"\n[PROCESS] {sub} subset")
      76        sample_and_copy_subfolder(
      77            input_dir=input_dir,
      78            output_dir=output_dir,
      79            min_frac=min_frac,
      80            min_num=min_num,
      81            seed=42 if sub == "train" else 123,
      82        )
      83    print(dataset_root.resolve())
      84    print(output_root.resolve())
      85    copy_other_file(source=dataset_root, target=output_root)
      86
      87    print(f"\n[INFO] Sampling complete → Output in: {output_root}")
      88
      89
      90if __name__ == "__main__" or globals().get("__name__") == "<run_path>":
      91    dataset_path = globals().get("dataset_path", "./")
      92    output_path = globals().get("output_path", "./sample")
      93    create_debug_data(
      94        dataset_path=dataset_path,
      95        output_path=output_path,
      96        min_frac=0.02,
      97        min_num=10,
      98    )
      
    • Create a ds_data/eval/arf-12-hours-prediction-task/valid.py file, which checks that the submission file's format is consistent with the reference file.

      • The following shows a script that checks the validity of a submission based on the arf-12-hours-prediction-task data.

      ds_data/eval/arf-12-hours-prediction-task/valid.py
       1from pathlib import Path
       2
       3# Check if our submission file exists
       4assert Path("submission.csv").exists(), "Error: submission.csv not found"
       5
       6submission_lines = Path("submission.csv").read_text().splitlines()
       7test_lines = Path("submission_test.csv").read_text().splitlines()
       8
       9is_valid = len(submission_lines) == len(test_lines)
      10
      11if is_valid:
      12    message = "submission.csv and submission_test.csv have the same number of lines."
      13else:
      14    message = (
      15        f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
      16    )
      17
      18print(message)
      19
      20if not is_valid:
      21    raise AssertionError("Submission is invalid")
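
      The script above only checks that the number of lines matches. If you want a stricter check, a sketch like the following (assuming the submission keeps the ID and ARF_LABEL columns) could also verify the header and the set of IDs:

        import pandas as pd

        sub = pd.read_csv("submission.csv")
        ref = pd.read_csv("submission_test.csv")

        # Same columns as the sample submission
        assert list(sub.columns) == ["ID", "ARF_LABEL"], "submission.csv must have columns ID, ARF_LABEL"

        # Same set of test IDs as the reference file
        assert set(sub["ID"]) == set(ref["ID"]), "submission.csv must contain exactly the test IDs"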
      
    • Create a ds_data/eval/arf-12-hours-prediction-task/grade.py file, which calculates the score from the submission file and the standard answer file and outputs the result in JSON format.

      • The following shows a grading script for the arf-12-hours-prediction-task data.

      ds_data/eval/arf-12-hours-prediction-task/grade.py
       1import json
       2
       3import pandas as pd
       4from sklearn.metrics import roc_auc_score
       5
       6
       7def prepare_for_auroc_metric(submission: pd.DataFrame, answers: pd.DataFrame, id_col: str, target_col: str) -> dict:
       8
       9    # Answers checks
      10    assert id_col in answers.columns, f"answers dataframe should have an {id_col} column"
      11    assert target_col in answers.columns, f"answers dataframe should have a {target_col} column"
      12
      13    # Submission checks
      14    if id_col not in submission.columns:
      15        raise InvalidSubmissionError(f"Submission should have an {id_col} column")
      16    if target_col not in submission.columns:
      17        raise InvalidSubmissionError(f"Submission should have a {target_col} column")
      18    if len(submission) != len(answers):
      19        raise InvalidSubmissionError(f"Submission and answers should have the same number of rows")
      20    try:
      21        pd.to_numeric(submission[target_col])
      22    except ValueError:
      23        raise InvalidSubmissionError(
      24            f"Expected {target_col} column to be numeric, got {submission[target_col].dtype} instead"
      25        )
      26    if submission[target_col].min() < 0 or submission[target_col].max() > 1:
      27        raise InvalidSubmissionError(
      28            f"Submission {target_col} column should contain probabilities,"
      29            " and therefore contain values between 0 and 1 inclusive"
      30        )
      31    # Sort
      32    submission = submission.sort_values(id_col)
      33    answers = answers.sort_values(id_col)
      34
      35    if (submission[id_col].values != answers[id_col].values).any():
      36        raise InvalidSubmissionError(f"Submission and answers should have the same {id_col} values")
      37
      38    roc_auc_inputs = {
      39        "y_true": answers[target_col].to_numpy(),
      40        "y_score": submission[target_col].to_numpy(),
      41    }
      42
      43    return roc_auc_inputs
      44
      45
      46def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
      47    roc_auc_inputs = prepare_for_auroc_metric(
      48        submission=submission, answers=answers, id_col="ID", target_col="ARF_LABEL"
      49    )
      50    return roc_auc_score(y_true=roc_auc_inputs["y_true"], y_score=roc_auc_inputs["y_score"])
      51
      52
      53if __name__ == "__main__":
      54    submission_path = "submission.csv"
      55    gt_submission_path = "submission_test.csv"
      56    submission = pd.read_csv(submission_path)
      57    answers = pd.read_csv(gt_submission_path)
      58    score = grade(submission=submission, answers=answers)
      59
      60    print(
      61        json.dumps(
      62            {
      63                "competition_id": "arf-12-hours-prediction-task",
      64                "score": score,
      65            }
      66        )
      67    )
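
      Note that grade.py raises InvalidSubmissionError, which is neither imported nor defined in the snippet above. If you run the script standalone, you need to provide a definition for it; a minimal one could look like this:

        class InvalidSubmissionError(Exception):
            """Raised when the submission file does not match the expected format."""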
      
  • At this point, you have created a complete dataset. Its structure should look like this:

    ds_data
    ├── arf-12-hours-prediction-task
    │   ├── train
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── test
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── description.md
    │   ├── sample_submission.csv
    │   └── sample.py
    ├── eval
    │   └── arf-12-hours-prediction-task
    │       ├── grade.py
    │       ├── submission_test.csv
    │       └── valid.py
    └── source_data
        └── arf-12-hours-prediction-task
            ├── ARF_12h.csv
            ├── prepare.py
            └── X.npz
    
  • The above shows the complete dataset creation workflow. Not all of the files are required; in practice, you can customize the dataset according to your own needs.

    • If we don't need the test set scores, we can skip generating the formatted submission file and the standard answer file in the prepare script, and we don't need to write the data checking code (valid.py) or the score calculation code (grade.py).

    • The data sampling code is also optional; if you do not provide it, RD-Agent hands sampling over to the LLM at runtime.

      • In the default sampling method (create_debug_data), the default sampling ratio (parameter: min_frac) is 1%; if 1% of the data is fewer than 5 samples, 5 samples are kept (parameter: min_num). You can adjust the amount of sampled data through these two parameters.

        • If you have customized data sampling code, set DS_SAMPLE_DATA_BY_LLM to False (the default is True) in the .env file before running, so that the program uses your customized sampling code. You can do this with the following command:

          dotenv set DS_SAMPLE_DATA_BY_LLM False
          
        • In addition, we provide a data sampling method in rdagent.scenarios.data_science.debug.data.create_debug_data. In this method, the default sampling ratio (parameter: min_frac) is 1%; if 1% of the data is fewer than 5 samples, 5 samples are kept (parameter: min_num). You can use this method in either of the following two ways.

          • You can set DS_SAMPLE_DATA_BY_LLM to False in the .env file so that when the program runs, it will use the sampling code provided by RD-Agent.

            dotenv set DS_SAMPLE_DATA_BY_LLM False
            
          • If the parameters of the default sampling method provided by RD-Agent are not suitable, you can customize them in the following command and run it, then set DS_SAMPLE_DATA_BY_LLM to False in the .env file so that the program uses the sample data you generated.

            python rdagent/app/data_science/debug.py --dataset_path <dataset path> --competition <competition_name> --min_frac <sampling ratio> --min_num <minimum number of sampling>
            dotenv set DS_SAMPLE_DATA_BY_LLM False
            
  • If you don't need the scores from the test set and leave the data sampling to the LLM (or use the sampling method provided by RD-Agent), you only need to prepare a minimal dataset. The structure of the simplest dataset is shown below.

    ds_data
    ├── arf-12-hours-prediction-task
    │   ├── train
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   ├── test
    │   │   ├── ARF_12h.csv
    │   │   └── X.npz
    │   └── description.md
    └── source_data
        └── arf-12-hours-prediction-task
            ├── ARF_12h.csv
            ├── prepare.py
            └── X.npz
    
  • We have prepared a dataset based on the above description for your reference. You can download it with the following command.

    wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
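
    The download is a zip archive. Assuming it unpacks into the folder layout shown above, you can extract it directly into your data directory (adjust the target directory if the archive's internal layout differs):

      unzip arf-12-hours-prediction-task.zip -d <your local directory>/ds_data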
    

βš™οΈ Set up Environment for Customized datasetsΒΆ

dotenv set DS_SCEN rdagent.scenarios.data_science.scen.DataScienceScen
dotenv set DS_LOCAL_DATA_PATH <your local directory>/ds_data
dotenv set DS_CODER_ON_WHOLE_PIPELINE True

  • 📘 More Environment Variables (Optional)

    • If you want to see all the available environment variables, you can refer to the configuration file for Data Science scenarios:

      1from pathlib import Path
      2from typing import Literal
      3
      4from pydantic_settings import SettingsConfigDict
      5
      6from rdagent.app.kaggle.conf import KaggleBasePropSetting
      7
      8
      9class DataScienceBasePropSetting(KaggleBasePropSetting):
     10    # TODO: Kaggle Setting should be the subclass of DataScience
     11    model_config = SettingsConfigDict(env_prefix="DS_", protected_namespaces=())
     12
     13    # Main components
     14    ## Scen
     15    scen: str = "rdagent.scenarios.data_science.scen.KaggleScen"
     16    """
     17    Scenario class for data science tasks.
     18    - For Kaggle competitions, use: "rdagent.scenarios.data_science.scen.KaggleScen"
     19    - For custom data science scenarios, use: "rdagent.scenarios.data_science.scen.DataScienceScen"
     20    """
     21
     22    planner: str = "rdagent.scenarios.data_science.proposal.exp_gen.planner.DSExpPlannerHandCraft"
     23    hypothesis_gen: str = "rdagent.scenarios.data_science.proposal.exp_gen.router.ParallelMultiTraceExpGen"
     24    interactor: str = "rdagent.components.interactor.SkipInteractor"
     25    trace_scheduler: str = "rdagent.scenarios.data_science.proposal.exp_gen.trace_scheduler.RoundRobinScheduler"
     26    """Hypothesis generation class"""
     27
     28    summarizer: str = "rdagent.scenarios.data_science.dev.feedback.DSExperiment2Feedback"
     29    summarizer_init_kwargs: dict = {
     30        "version": "exp_feedback",
     31    }
     32    ## Workflow Related
     33    consecutive_errors: int = 5
     34
     35    ## Coding Related
     36    coding_fail_reanalyze_threshold: int = 3
     37
     38    debug_recommend_timeout: int = 600
     39    """The recommend time limit for running on debugging data"""
     40    debug_timeout: int = 600
     41    """The timeout limit for running on debugging data"""
     42    full_recommend_timeout: int = 3600
     43    """The recommend time limit for running on full data"""
     44    full_timeout: int = 3600
     45    """The timeout limit for running on full data"""
     46
     47    #### model dump
     48    enable_model_dump: bool = False
     49    enable_doc_dev: bool = False
     50    model_dump_check_level: Literal["medium", "high"] = "medium"
     51
     52    #### MCP documentation search integration
     53    enable_mcp_documentation_search: bool = False
     54    """Enable MCP documentation search for error resolution. Requires MCP_ENABLED=true and MCP_CONTEXT7_ENABLED=true in environment."""
     55
     56    ### specific feature
     57
     58    ### notebook integration
     59    enable_notebook_conversion: bool = False
     60
     61    #### enable specification
     62    spec_enabled: bool = True
     63
     64    #### proposal related
     65    # proposal_version: str = "v2" deprecated
     66
     67    coder_on_whole_pipeline: bool = True
     68    max_trace_hist: int = 3
     69
     70    coder_max_loop: int = 10
     71    runner_max_loop: int = 3
     72
     73    sample_data_by_LLM: bool = True
     74    use_raw_description: bool = False
     75    show_nan_columns: bool = False
     76
     77    ### knowledge base
     78    enable_knowledge_base: bool = False
     79    knowledge_base_version: str = "v1"
     80    knowledge_base_path: str | None = None
     81    idea_pool_json_path: str | None = None
     82
     83    ### archive log folder after each loop
     84    enable_log_archive: bool = True
     85    log_archive_path: str | None = None
     86    log_archive_temp_path: str | None = (
     87        None  # This is to store the mid tar file since writing the tar file is preferred in local storage then copy to target storage
     88    )
     89
     90    #### Evaluation on Test related
     91    eval_sub_dir: str = "eval"  # TODO: fixme, this is not a good name
     92    """We'll use f"{DS_RD_SETTING.local_data_path}/{DS_RD_SETTING.eval_sub_dir}/{competition}"
     93    to find the script to evaluate the submission on test"""
     94
     95    """---below are the settings for multi-trace---"""
     96
     97    ### multi-trace related
     98    max_trace_num: int = 1
     99    """The maximum number of traces to grow before merging"""
    100
    101    scheduler_temperature: float = 1.0
    102    """The temperature for the trace scheduler for softmax calculation, used in ProbabilisticScheduler"""
    103
    104    # PUCT exploration constant for MCTSScheduler (ignored by other schedulers)
    105    scheduler_c_puct: float = 1.0
    106    """Exploration constant used by MCTSScheduler (PUCT)."""
    107
    108    enable_score_reward: bool = False
    109    """Enable using score-based reward for trace selection in multi-trace scheduling."""
    110
    111    #### multi-trace:checkpoint selector
    112    selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.expand.LatestCKPSelector"
    113    """The name of the selector to use"""
    114    sota_count_window: int = 5
    115    """The number of trials to consider for SOTA count"""
    116    sota_count_threshold: int = 1
    117    """The threshold for SOTA count"""
    118
    119    #### multi-trace: SOTA experiment selector
    120    sota_exp_selector_name: str = "rdagent.scenarios.data_science.proposal.exp_gen.select.submit.GlobalSOTASelector"
    121    """The name of the SOTA experiment selector to use"""
    122
    123    ### multi-trace:inject optimals for multi-trace
    124    # inject diverse when start a new sub-trace
    125    enable_inject_diverse: bool = False
    126
    127    # inject diverse from other traces when start a new sub-trace
    128    enable_cross_trace_diversity: bool = True
    129    """Enable cross-trace diversity injection when starting a new sub-trace.
    130    This is different from `enable_inject_diverse` which is for non-parallel cases."""
    131
    132    diversity_injection_strategy: str = (
    133        "rdagent.scenarios.data_science.proposal.exp_gen.diversity_strategy.InjectUntilSOTAGainedStrategy"
    134    )
    135    """The strategy to use for injecting diversity context."""
    136
    137    # enable different version of DSExpGen for multi-trace
    138    enable_multi_version_exp_gen: bool = False
    139    exp_gen_version_list: str = "v3,v2"
    140
    141    #### multi-trace: time for final multi-trace merge
    142    merge_hours: float = 0
    143    """The time for merge"""
    144
    145    #### multi-trace: max SOTA-retrieved number, used in AutoSOTAexpSelector
    146    # constrains the number of SOTA experiments to retrieve, otherwise too many SOTA experiments to retrieve will cause the exceed of the context window of LLM
    147    max_sota_retrieved_num: int = 10
    148    """The maximum number of SOTA experiments to retrieve in a LLM call"""
    149
    150    #### enable draft before first sota experiment
    151    enable_draft_before_first_sota: bool = False
    152    enable_planner: bool = False
    153
    154    model_architecture_suggestion_time_percent: float = 0.75
    155    allow_longer_timeout: bool = False
    156    coder_enable_llm_decide_longer_timeout: bool = False
    157    runner_enable_llm_decide_longer_timeout: bool = False
    158    coder_longer_timeout_multiplier_upper: int = 3
    159    runner_longer_timeout_multiplier_upper: int = 2
    160    coder_timeout_increase_stage: float = 0.3
    161    runner_timeout_increase_stage: float = 0.3
    162    runner_timeout_increase_stage_patience: int = 2
    163    """Number of failures tolerated before escalating to next timeout level (stage width). Every 'patience' failures, timeout increases by 'runner_timeout_increase_stage'"""
    164    show_hard_limit: bool = True
    165
    166    #### enable runner code change summary
    167    runner_enable_code_change_summary: bool = True
    168
    169    ### Proposal workflow related
    170
    171    #### Hypothesis Generate related
    172    enable_simple_hypothesis: bool = False
    173    """If true, generate simple hypothesis, no more than 2 sentences each."""
    174
    175    enable_generate_unique_hypothesis: bool = False
    176    """Enable generate unique hypothesis. If True, generate unique hypothesis for each component. If False, generate unique hypothesis for each component."""
    177
    178    enable_research_rag: bool = False
    179    """Enable research RAG for hypothesis generation."""
    180
    181    #### hypothesis critique and rewrite
    182    enable_hypo_critique_rewrite: bool = False
    183    """Enable hypothesis critique and rewrite stages for improving hypothesis quality"""
    184    enable_scale_check: bool = False
    185
    186    ##### select related
    187    ratio_merge_or_ensemble: int = 70
    188    """The ratio of merge or ensemble to be considered as a valid solution"""
    189    llm_select_hypothesis: bool = False
    190    """Whether to use LLM to select hypothesis. If True, use LLM selection; if False, use the existing ranking method."""
    191
    192    #### Task Generate related
    193    fix_seed_and_data_split: bool = False
    194
    195    ensemble_time_upper_bound: bool = False
    196
    197    user_interaction_wait_seconds: int = 6000  # seconds to wait for user interaction
    198    user_interaction_mid_folder: Path = Path.cwd() / "git_ignore_folder" / "RD-Agent_user_interaction"
    199
    200
    201DS_RD_SETTING = DataScienceBasePropSetting()
    202
    203# enable_cross_trace_diversity and llm_select_hypothesis should not be true at the same time
    204assert not (
    205    DS_RD_SETTING.enable_cross_trace_diversity and DS_RD_SETTING.llm_select_hypothesis
    206), "enable_cross_trace_diversity and llm_select_hypothesis cannot be true at the same time"
    
    • These variables allow you to have finer-grained control in Data Science scenarios.

🚀 Run the Application

  • 🌏 You can directly run the application by using the following command:

    rdagent data_science --competition <Competition ID>
    
    • The following shows the command to run for the arf-12-hours-prediction-task data:

      rdagent data_science --competition arf-12-hours-prediction-task
      
    • More CLI Parameters for rdagent data_science command:

    rdagent.app.data_science.loop.main(path: str | None = None, checkout: bool = True, checkout_path: str | None = None, step_n: int | None = None, loop_n: int | None = None, timeout: str | None = None, competition='bms-molecular-translation', replace_timer=True, exp_gen_cls: str | None = None)

    Parameters
    path :

    A path like $LOG_PATH/__session__/1/0_propose. This indicates that we restore the state after finishing step 0 in loop 1.

    checkout :

    Used to control the log session path. Boolean type, default is True.

    - If True, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.

    - If False, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.

    checkout_path:

    If a checkout_path (or a str like Path) is provided, the new loop will be saved to that path, leaving the original path unchanged.

    step_n :

    Number of steps to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    loop_n :

    Number of loops to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    - If the current loop is incomplete, it will be counted as the first loop for completion.

    - If both step_n and loop_n are provided, the process will stop as soon as either condition is met.

    timeout :

    Maximum duration to run the loop. Accepts a string format recognized by the internal timer.

    - If None, the loop will run until completion, error, or KeyboardInterrupt.

    competition :

    Competition name.

    replace_timer :

    If a session is loaded, determines whether to replace the timer with session.timer.

    exp_gen_cls :

    When there are different stages, exp_gen can be replaced with a new proposal-generation class.

    Auto R&D Evolving loop for models in a Kaggle scenario. You can continue running a session by using the command:

    dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose  --step_n 1   # `step_n` is an optional parameter
    rdagent kaggle --competition playground-series-s4e8  # This command is recommended.
    
  • 📈 Visualize the R&D Process

    • We provide a web UI to visualize the log. You just need to run:

      rdagent ui --port <custom port> --log-dir <your log folder like "log/"> --data_science True
      
    • Then you can input the log path and visualize the R&D process.

  • 🧪 Scoring the test results

    • Finally, shut down the program and get the test set scores with this command:

    dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
    

    Here, <url_to_log> refers to the parent directory of the log folder generated during the run.
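
    For example, if your run wrote its logs under ./log/, the command would look like this (the path is illustrative):

      dotenv run -- python rdagent/log/mle_summary.py grade ./log/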

🕹️ Kaggle Agent

📖 Background

In the landscape of data science competitions, Kaggle serves as the ultimate arena where data enthusiasts harness the power of algorithms to tackle real-world challenges. The Kaggle Agent stands as a pivotal tool, empowering participants to seamlessly integrate cutting-edge models and datasets, transforming raw data into actionable insights.

By utilizing the Kaggle Agent, data scientists can craft innovative solutions that not only uncover hidden patterns but also drive significant advancements in predictive accuracy and model robustness.

🧭 Example Guide - Kaggle Dataset

🛠️ Preparing For The Competition

  • 🔨 Configuring the Kaggle API

    • Register and login on the Kaggle website.

    • Click on your avatar (usually in the top right corner of the page) -> Settings -> Create New Token. A file called kaggle.json will be downloaded.

    • Move kaggle.json to ~/.config/kaggle/

    • Modify the permissions of the kaggle.json file.

      chmod 600 ~/.config/kaggle/kaggle.json
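
      The downloaded kaggle.json is a small JSON file containing your user name and API key, for example (values are placeholders):

        {"username": "<your-kaggle-username>", "key": "<your-api-key>"}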
      
    • For more information about Kaggle API settings, refer to the Kaggle API documentation.

  • 🔩 Set the environment variables in the .env file

    • Determine the path where the data will be stored and add it to the .env file.

    mkdir -p <your local directory>/ds_data
    dotenv set KG_LOCAL_DATA_PATH <your local directory>/ds_data
    
    • 📘 More Environment Variables (Optional)

      • The full list of available environment variables is the same configuration file for Data Science scenarios shown earlier in the "Set up Environment for Customized datasets" section; refer to that listing for finer-grained control in Data Science scenarios.

  • 🗳️ Join the competition

    • If your Kaggle API account has not yet joined the competition, you will need to join it before running the program.

      • At the bottom of the competition details page, you can find the Join the competition button; click it and select I Understand and Accept to join the competition.

      • In the Competition List Available below, you can jump to the competition details page.

📥 Preparing Competition Dataset & Setting up RD-Agent Environment

  • As a subset of data science, Kaggle datasets still follow the data science format. Based on this, a Kaggle dataset can be divided into two categories depending on whether or not it is supported by MLE-Bench.

    • What is MLE-Bench?

      • MLE-Bench is a comprehensive benchmark designed to evaluate the machine learning engineering capabilities of AI systems using real-world scenarios. The dataset includes multiple Kaggle competitions. Since Kaggle does not provide reserved test sets for these competitions, the benchmark includes preparation scripts for splitting publicly available training data into new training and test sets, and scoring scripts for each competition to accurately evaluate submission scores.

    • Is the competition I'm running supported by MLE-Bench?

      • You can see all the competitions supported by MLE-Bench here.

  • Prepare datasets for MLE-Bench supported competitions.

    • If you agree with the MLE-Bench standard, you don't need to prepare the dataset yourself; you just need to configure your .env file to automate the download of the dataset.

      • Add DS_IF_USING_MLE_DATA to the environment variables and set it to True.

        dotenv set DS_IF_USING_MLE_DATA True
        
      • Add DS_SAMPLE_DATA_BY_LLM to the environment variables and set it to True.

        dotenv set DS_SAMPLE_DATA_BY_LLM True
        
      • Add DS_SCEN to the environment variables and set it to rdagent.scenarios.data_science.scen.KaggleScen.

        dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
        
    • At this point, you are ready to run your competition: the data will be downloaded automatically, and the LLM will automatically extract a minimal sample dataset.

      • After running the program, the structure of the ds_data folder should look like this (using the tabular-playground-series-dec-2021 competition as an example).

         ds_data
         ├── tabular-playground-series-dec-2021
         │   ├── description.md
         │   ├── sample_submission.csv
         │   ├── test.csv
         │   └── train.csv
         └── zip_files
             └── tabular-playground-series-dec-2021
                 └── tabular-playground-series-dec-2021.zip
        
         • The ds_data/zip_files folder contains a zip file of the raw competition data downloaded from the Kaggle website.

    • At runtime, RD-Agent will automatically build the Docker image specified at rdagent/scenarios/kaggle/docker/mle_bench_docker/Dockerfile. This image is responsible for downloading the required datasets and grading files for MLE-Bench.

    Note: The first run may take longer than subsequent runs as the Docker image and data are being downloaded and set up for the first time.

  • Prepare datasets for competitions that are not supported by MLE-Bench.

    • Since Kaggle is a subset of data science, we can follow the data science dataset format and steps to prepare a Kaggle dataset. Below we describe the workflow for preparing a Kaggle dataset, using the playground-series-s4e9 competition as an example.

      • Create a ds_data/source_data/playground-series-s4e9 folder, which will be used to store your raw dataset.

        • The raw data for the playground-series-s4e9 competition consists of three files: train.csv, test.csv, and sample_submission.csv. There are two ways to get the raw data:

          • You can find the raw data required for the competition on the official Kaggle website.

          • Or you can download the raw data for the competition from the command line with the following command:

            kaggle competitions download -c playground-series-s4e9
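
            The Kaggle CLI saves a zip archive (playground-series-s4e9.zip) in the current directory; extract it into the source_data folder so that prepare.py can find the raw files, for example:

              unzip playground-series-s4e9.zip -d <your local directory>/ds_data/source_data/playground-series-s4e9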
            
      • Create a ds_data/source_data/playground-series-s4e9/prepare.py file that splits your raw data into training data, test data, formatted submission file, and standard answer file. (You will need to write a script based on your raw data.)

        • The following shows the preprocessing code for the raw data of playground-series-s4e9.

         ds_data/source_data/playground-series-s4e9/prepare.py
        from pathlib import Path

        import pandas as pd
        from sklearn.model_selection import train_test_split


        def prepare(raw: Path, public: Path, private: Path):

            # Make sure the output directories exist
            public.mkdir(parents=True, exist_ok=True)
            private.mkdir(parents=True, exist_ok=True)

            # Create train and test splits from train set
            old_train = pd.read_csv(raw / "train.csv")
            new_train, new_test = train_test_split(old_train, test_size=0.1, random_state=0)

            # Create sample submission
            sample_submission = new_test.copy()
            sample_submission["price"] = 43878.016
            sample_submission.drop(sample_submission.columns.difference(["id", "price"]), axis=1, inplace=True)
            sample_submission.to_csv(public / "sample_submission.csv", index=False)

            # Create private files (the standard answers, hidden from the agent)
            new_test.to_csv(private / "submission_test.csv", index=False)

            # Create public files visible to agents
            new_train.to_csv(public / "train.csv", index=False)
            new_test.drop(["price"], axis=1, inplace=True)
            new_test.to_csv(public / "test.csv", index=False)

            # Checks
            assert new_test.shape[1] == 12, "Public test set should have 12 columns"
            assert new_train.shape[1] == 13, "Public train set should have 13 columns"
            assert len(new_train) + len(new_test) == len(
                old_train
            ), "Length of new_train and new_test should equal length of old_train"


        if __name__ == "__main__":
            competitions = "playground-series-s4e9"
            raw = Path(__file__).resolve().parent
            prepare(
                raw=raw,
                public=raw.parent.parent / competitions,
                private=raw.parent.parent / "eval" / competitions,
            )
        
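        • To generate the splits, run the script (for example, python prepare.py from inside ds_data/source_data/playground-series-s4e9). Based on the paths in the __main__ block above, it writes the public files to ds_data/playground-series-s4e9 and the private answer file to ds_data/eval/playground-series-s4e9.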
        • At the end of program execution, the ds_data folder structure will look like this:

        ds_data
        β”œβ”€β”€ playground-series-s4e9
        β”‚   β”œβ”€β”€ train.csv
        β”‚   β”œβ”€β”€ test.csv
        β”‚   └── sample_submission.csv
        β”œβ”€β”€ eval
        β”‚   └── playground-series-s4e9
        β”‚       └── submission_test.csv
        └── source_data
            └── playground-series-s4e9
                β”œβ”€β”€ prepare.py
                β”œβ”€β”€ sample_submission.csv
                β”œβ”€β”€ test.csv
                └── train.csv
        
      • Create a ds_data/playground-series-s4e9/description.md file that describes your competition: the task, the dataset, the evaluation metric, and other relevant information. For Kaggle competitions, this information can be copied from the competition's page on the Kaggle website.

        • The following shows the description file for playground-series-s4e9.

          ds_data/playground-series-s4e9/description.mdΒΆ
          # Competition name: playground-series-s4e9

          ## Overview

          **Welcome to the 2024 Kaggle Playground Series!** We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

          **Your Goal:** The goal of this competition is to predict the price of used cars based on various attributes.

          ## Evaluation

          ### Root Mean Squared Error (RMSE)

          Submissions are scored on the root mean squared error. RMSE is defined as:

          $$
          \mathrm{RMSE} = \left( \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \right)^{\frac{1}{2}}
          $$

          where $\hat{y}_i$ is the predicted value and $y_i$ is the original value for each instance $i$.

          ### Submission File

          For each `id` in the test set, you must predict the `price` of the car. The file should contain a header and have the following format:

          ```
          id,price
          188533,43878.016
          188534,43878.016
          188535,43878.016
          etc.
          ```

          ## Timeline
          - **Start Date** - September 1, 2024
          - **Entry Deadline** - Same as the Final Submission Deadline
          - **Team Merger Deadline** - Same as the Final Submission Deadline
          - **Final Submission Deadline** - September 30, 2024

          All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.

          ## About the Tabular Playground Series

          The goal of the Tabular Playground Series is to provide the Kaggle community with a variety of fairly light-weight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The duration of each competition will generally only last a few weeks, and may have longer or shorter durations depending on the challenge. The challenges will generally use fairly light-weight datasets that are synthetically generated from real-world data, and will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

          ### Synthetically-Generated Datasets

          Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state-of-the-art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

          ## Prizes
          - 1st Place - Choice of Kaggle merchandise
          - 2nd Place - Choice of Kaggle merchandise
          - 3rd Place - Choice of Kaggle merchandise

          **Please note**: In order to encourage more participation from beginners, Kaggle merchandise will only be awarded once per person in this series. If a person has previously won, we'll skip to the next team.

          ## Citation

          Walter Reade and Ashley Chow. Regression of Used Car Prices. https://kaggle.com/competitions/playground-series-s4e9, 2024. Kaggle.

          ## Dataset Description

          The dataset for this competition (both train and test) was generated from a deep learning model trained on the [Used Car Price Prediction Dataset](https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset). Feature distributions are close to, but not exactly the same, as the original. Feel free to use the original dataset as part of this competition, both to explore differences as well as to see whether incorporating the original in training improves model performance.

          ## Files

          - **train.csv** - the training dataset; `price` is the continuous target
          - **test.csv** - the test dataset; your objective is to predict the value of `price` for each row
          - **sample_submission.csv** - a sample submission file in the correct format
          
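        • Note that the evaluation metric described in this file (RMSE) should match the metric computed by the grade.py script introduced below.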
      • Create a ds_data/eval/playground-series-s4e9/valid.py file, which is used to check the validity of the submission file by ensuring that its format is consistent with the reference file.

        • The following shows a script that checks the validity of a submission based on the playground-series-s4e9 data.

        ds_data/eval/playground-series-s4e9/valid.pyΒΆ
        from pathlib import Path

        # Check if our submission file exists
        assert Path("submission.csv").exists(), "Error: submission.csv not found"

        submission_lines = Path("submission.csv").read_text().splitlines()  # generated automatically by the agent
        test_lines = Path("submission_test.csv").read_text().splitlines()  # the standard answer file

        is_valid = len(submission_lines) == len(test_lines)

        if is_valid:
            message = "submission.csv and submission_test.csv have the same number of lines."
        else:
            message = (
                f"submission.csv has {len(submission_lines)} lines, while submission_test.csv has {len(test_lines)} lines."
            )

        print(message)

        if not is_valid:
            raise AssertionError("Submission is invalid")
        
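        • Beyond comparing line counts, you may also want to confirm that the submission header matches the expected columns. The following optional extension to the script above is only a sketch (the expected header id,price is taken from the submission format in description.md); it is not part of the default RD-Agent checks:

        # Optional: also check that the header row matches the expected columns
        expected_header = "id,price"
        actual_header = submission_lines[0].strip()
        assert actual_header == expected_header, f"Unexpected header: {actual_header!r} (expected {expected_header!r})"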
      • Create a ds_data/eval/playground-series-s4e9/grade.py file, which is used to calculate the score based on the submission file and the standard answer file, and output the result in JSON format.

        • The following shows a grading script implemented for the playground-series-s4e9 data.

        ds_data/eval/playground-series-s4e9/grade.pyΒΆ
        import datetime
        import json

        import numpy as np
        import pandas as pd
        from sklearn.metrics import mean_squared_error


        class InvalidSubmissionError(Exception):
            """
            A custom exception for when the agent submission cannot be graded.
            """

            pass


        def prepare_for_metric(submission: pd.DataFrame, answers: pd.DataFrame) -> dict:

            if "id" not in submission.columns or "id" not in answers.columns:
                raise InvalidSubmissionError("Both submission and answers DataFrames must contain an 'id' column.")

            if "price" not in submission.columns:
                raise InvalidSubmissionError("Submission DataFrame must contain a 'price' column.")

            assert "price" in answers.columns, "Answers DataFrame must contain a 'price' column."

            if len(submission) != len(answers):
                raise InvalidSubmissionError("Submission must be the same length as the answers.")

            answers_sorted = answers.sort_values("id")
            submission_sorted = submission.sort_values("id")

            if (submission_sorted["id"].values != answers_sorted["id"].values).any():
                raise InvalidSubmissionError("Submission and answers have mismatched 'id' columns")

            y_true = answers_sorted[["price"]].to_numpy()
            y_score = submission_sorted[["price"]].to_numpy()

            return {"y_true": y_true, "y_score": y_score}


        def grade(submission: pd.DataFrame, answers: pd.DataFrame) -> float:
            metric_inputs = prepare_for_metric(submission, answers)
            return np.sqrt(mean_squared_error(metric_inputs["y_true"], metric_inputs["y_score"]))


        if __name__ == "__main__":
            submission_path = "submission.csv"
            gt_submission_path = "submission_test.csv"
            submission = pd.read_csv(submission_path)
            answers = pd.read_csv(gt_submission_path)
            score = grade(submission=submission, answers=answers)

            # These `thresholds` can be customized according to the leaderboard page of the Kaggle website and your own needs.
            # Refs: https://www.kaggle.com/competitions/playground-series-s4e9/leaderboard
            thresholds = {
                "gold": 62917.05988,
                "silver": 62945.91714,
                "bronze": 62958.13747,
                "median": 63028.69429,
            }

            # The output must be in JSON format. To configure the full output,
            # you can run the command `rdagent grade_summary --log-folder` to summarize the scores at the end of the program.
            # If you don't need it, you can just provide the `competition_id` and `score`.
            # The metric is RMSE, so lower scores are better: medals are awarded when the score
            # is at or below the corresponding leaderboard threshold.
            print(
                json.dumps(
                    {
                        "competition_id": "playground-series-s4e9",
                        "score": score,
                        "gold_threshold": thresholds["gold"],
                        "silver_threshold": thresholds["silver"],
                        "bronze_threshold": thresholds["bronze"],
                        "median_threshold": thresholds["median"],
                        "any_medal": bool(score <= thresholds["bronze"]),
                        "gold_medal": bool(score <= thresholds["gold"]),
                        "silver_medal": bool(score <= thresholds["silver"]),
                        "bronze_medal": bool(score <= thresholds["bronze"]),
                        "above_median": bool(score <= thresholds["median"]),
                        "submission_exists": True,
                        "valid_submission": True,
                        "is_lower_better": True,
                        "created_at": str(datetime.datetime.now().isoformat()),
                        "submission_path": submission_path,
                    }
                )
            )
        
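        • To sanity-check the grading logic, you can call grade() directly on a tiny hand-made pair of DataFrames. A minimal sketch (the toy values are illustrative, and the import assumes the grade.py shown above is on the path, e.g. when run from the same folder):

        import pandas as pd

        from grade import grade  # assumes grade.py from above is importable

        # Each prediction is off by exactly 1000, so the RMSE should be 1000.0;
        # the shuffled ids also exercise the sort-by-id logic in prepare_for_metric.
        answers = pd.DataFrame({"id": [1, 2, 3], "price": [10000.0, 20000.0, 30000.0]})
        submission = pd.DataFrame({"id": [3, 1, 2], "price": [31000.0, 11000.0, 19000.0]})

        print(grade(submission=submission, answers=answers))  # -> 1000.0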
    • In this example we do not create a ds_data/eval/playground-series-s4e9/sample.py; instead, we use the default sample method provided by RD-Agent.

    • At this point, you have created a complete dataset. The correct structure of the dataset should look like this.

      ds_data
      β”œβ”€β”€ playground-series-s4e9
      β”‚   β”œβ”€β”€ train.csv
      β”‚   β”œβ”€β”€ test.csv
      β”‚   β”œβ”€β”€ description.md
      β”‚   └── sample_submission.csv
      β”œβ”€β”€ eval
      β”‚   └── playground-series-s4e9
      β”‚       β”œβ”€β”€ grade.py
      β”‚       β”œβ”€β”€ submission_test.csv
      β”‚       └── valid.py
      └── source_data
          └── playground-series-s4e9
              β”œβ”€β”€ prepare.py
              β”œβ”€β”€ sample_submission.csv
              β”œβ”€β”€ test.csv
              └── train.csv
      
    • We have prepared a dataset based on the above description for your reference. You can download it with the following command.

      wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/playground-series-s4e9.zip
      
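    • After downloading, extract the archive under your DS_LOCAL_DATA_PATH directory (ds_data) so that the resulting layout matches the tree shown above.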
    • Next, we need to configure the environment for the playground-series-s4e9 competition. You can do this by executing the following commands at the command line.

      dotenv set DS_IF_USING_MLE_DATA False
      dotenv set DS_SAMPLE_DATA_BY_LLM False
      dotenv set DS_SCEN rdagent.scenarios.data_science.scen.KaggleScen
      
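      • These commands update the corresponding entries in your .env file: DS_IF_USING_MLE_DATA is set to False because the dataset was prepared locally rather than downloaded through MLE-Bench, DS_SAMPLE_DATA_BY_LLM is set to False so that data sampling is not delegated to the LLM (consistent with using the default sample method mentioned above), and DS_SCEN switches the scenario class to KaggleScen.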
πŸš€ Run the ApplicationΒΆ
  • 🌏 You can directly run the application by using the following command:

    rdagent data_science --competition <Competition ID>
    
    • The following shows the command to run for the playground-series-s4e9 competition.

      rdagent data_science --competition playground-series-s4e9
      
    • More CLI parameters for the rdagent data_science command:

    rdagent.app.data_science.loop.main(path: str | None = None, checkout: bool = True, checkout_path: str | None = None, step_n: int | None = None, loop_n: int | None = None, timeout: str | None = None, competition='bms-molecular-translation', replace_timer=True, exp_gen_cls: str | None = None)
    ParametersΒΆ
    path :

    A path like $LOG_PATH/__session__/1/0_propose. This indicates that we restore the state after finishing step 0 in loop 1.

    checkout :

    Used to control the log session path. Boolean type, default is True.

    - If True, the new loop will use the existing folder and clear logs for sessions after the one corresponding to the given path.
    - If False, the new loop will use the existing folder but keep the logs for sessions after the one corresponding to the given path.

    checkout_path :

    If a checkout_path (or a path-like str) is provided, the new loop will be saved to that path, leaving the original path unchanged.

    step_n :

    Number of steps to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    loop_n :

    Number of loops to run; if None, the process will run indefinitely until an error or KeyboardInterrupt occurs.

    - If the current loop is incomplete, it will be counted as the first loop for completion.
    - If both step_n and loop_n are provided, the process will stop as soon as either condition is met.

    timeout :

    Maximum duration to run the loop. Accepts a string format recognized by the internal timer.

    - If None, the loop will run until completion, error, or KeyboardInterrupt.

    competition :

    Competition name.

    replace_timer :

    If a session is loaded, determines whether to replace the timer with session.timer.

    exp_gen_cls :

    When there are different stages, the experiment generation class (exp_gen) can be replaced with a new proposal class.

    Auto R&D Evolving loop for models in a Kaggle scenario. You can continue running a session by using the command:

    dotenv run -- python rdagent/app/data_science/loop.py [--competition titanic] $LOG_PATH/__session__/1/0_propose  --step_n 1   # `step_n` is an optional parameter
    rdagent kaggle --competition playground-series-s4e8  # This command is recommended.
    
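    • The same entry point can also be invoked from Python. A minimal sketch based on the signature above (the parameter values below are illustrative assumptions):

      from rdagent.app.data_science.loop import main

      # Run a single R&D loop for the prepared competition
      main(competition="playground-series-s4e9", loop_n=1)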
  • πŸ“ˆ Visualize the R&D Process

    • We provide a web UI to visualize the log. You just need to run:

      rdagent ui --port <custom port> --log-dir <your log folder like "log/"> --data_science True
      
    • Then you can input the log path and visualize the R&D process.

  • πŸ§ͺ Scoring the test results

    • Finally, shut down the program and get the test set scores with this command.

    dotenv run -- python rdagent/log/mle_summary.py grade <url_to_log>
    
    • If you have configured the full output in ds_data/eval/playground-series-s4e9/grade.py, or if you are running a competition that is supported by MLE-Bench, you can also summarize the scores by running the following command.

    rdagent grade_summary --log-folder=<url_to_log>
    

    Here, <url_to_log> refers to the parent directory of the log folder generated during the run.