Data analysis is a crucial process for uncovering valuable insights from data, requiring a chain of mathematical and logical reasoning as well as repeated interaction with the data.
We construct the DACO dataset for this task, containing (1) 440 databases (of tabular data) collected from real-world scenarios, (2) ~2k answer annotations automatically generated by GPT-4, which can serve as weak supervision for model fine-tuning, and (3) a high-quality human-refined subset that serves as our main evaluation benchmark. To generate the automatic annotations, we leverage the code generation capabilities of LLMs and propose a multi-turn prompting technique that automates data analysis for each query. Experiments show that equipping models with code generation improves data analysis performance for both zero-shot large language models (LLMs) and fine-tuned models.
We further propose the DACO-RL algorithm to align models with human preferences. We use reinforcement learning to encourage analyses that humans perceive as helpful, and design a set of dense rewards to propagate the sparse human preference reward to intermediate code generation steps. Human annotators rate DACO-RL as producing more helpful answers than the supervised fine-tuning (SFT) baseline in 57.72% of cases.
In our proposed task, the input is a database of tabular data together with a user query, and the output answer is formatted as two lists: findings and suggestions.
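For concreteness, a single instance might look as follows. The field names and content below are illustrative, not the dataset's exact schema:

```python
example = {
    "database": {                      # one or more tables of tabular data
        "matches": "matches.csv",
        "players": "players.csv",
    },
    "query": "How can the team improve its win rate in away games?",
    "answer": {
        "findings": [
            "The away win rate is substantially lower than the home win rate.",
            "Losses cluster in games played on short rest.",
        ],
        "suggestions": [
            "Schedule longer recovery windows before away fixtures.",
        ],
    },
}
```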
We construct our DACO dataset through four stages: (1) database collection, (2) query collection, (3) automatic annotation collection, and (4) human refinement. The DACO dataset contains 440 databases and 1,942 associated user queries, which can be used for both model fine-tuning and evaluation. To provide a refined benchmarking resource, we curate a high-quality test set of 100 samples through comprehensive human annotations.
Statistics of the DACO dataset.
Train, Dev, and TestA sets are automatically generated with GPT-4.
TestH is the high-quality human-refined subset.
Domain distribution of DACO databases.
The 10 topics span diverse domains such as business and sports, demonstrating the broad domain coverage of DACO.
Demonstration of user query diversity.
Top 15 verbs and their top 3 direct noun objects.
We use helpfulness as the main metric to evaluate the quality of generated data analysis.
Motivated by literature in the data analysis field, we define helpfulness according to a set of criteria.
We evaluate helpfulness through pairwise comparison: given two analyses generated by two different systems, the annotator (either human or ChatGPT) selects the more helpful one based on our criteria. The winning rate is reported as the helpfulness score.
To obtain a comparable set of numbers for all models, we report the winning rate of each model against TestA and TestH annotations.
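The score thus reduces to a simple winning rate over a set of pairwise judgments. The sketch below assumes each judgment is recorded as a string label; whether ties are allowed, and how they are counted, is our illustrative choice rather than the paper's protocol:

```python
def winning_rate(judgments: list[str]) -> float:
    """Fraction of pairwise comparisons won by the model.

    Each judgment is "model" (model's analysis preferred), "reference"
    (reference annotation preferred), or "tie". Counting a tie as half
    a win is an assumption, not necessarily the paper's convention.
    """
    score = sum(1.0 if j == "model" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)

# Example: 2 wins, 1 loss, 1 tie out of 4 comparisons -> 0.625
print(winning_rate(["model", "reference", "model", "tie"]))
```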
The supervised fine-tuning (SFT) model, trained on GPT-4-generated annotations, iteratively generates and executes Python code to perform data analysis, as shown on the right side of the figure below.
Left: DACO-RL algorithm. Right: code generation pipeline.
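A minimal sketch of this generate-execute loop appears below. The helper names (`generate`, `run_code`) and the stopping convention are our assumptions for illustration; the actual pipeline follows the paper's prompts:

```python
import contextlib
import io
from typing import Callable

import pandas as pd

def run_code(code: str, env: dict) -> str:
    """Execute a generated Python snippet and capture its printed output.
    A real deployment should sandbox this call."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue()

def analyze(tables: dict[str, pd.DataFrame],
            query: str,
            generate: Callable[[str], str],
            max_turns: int = 5) -> str:
    """Multi-turn loop: the model alternates between writing pandas code
    and observing execution results, then writes the final analysis."""
    context = f"Query: {query}\nTables: {', '.join(tables)}"
    env = {"pd": pd, **tables}  # expose tables as variables to generated code
    for _ in range(max_turns):
        code = generate(context + "\n\nWrite pandas code to run, or reply DONE.")
        if code.strip() == "DONE":
            break
        context += f"\n\nCode:\n{code}\nOutput:\n{run_code(code, env)}"
    return generate(context + "\n\nNow write the findings and suggestions.")
```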
We use RLHF to further align the models with human preference.
We design three reward models for our DACO-RL algorithm, as shown on the left side of the figure above.
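The figure specifies the exact reward designs. As a rough, illustrative sketch of the core idea, a step-level reward model can supply dense signal at intermediate code steps while the sparse human-preference reward is applied only at the final answer; the names and weighting below are assumptions, not the paper's hyperparameters:

```python
from typing import Callable, Sequence

def shaped_rewards(code_steps: Sequence[str],
                   final_answer: str,
                   preference_rm: Callable[[str], float],
                   step_rm: Callable[[str], float],
                   alpha: float = 0.1) -> list[float]:
    """Return one reward per generation step of an episode.

    Intermediate code steps get a dense reward from a step-level reward
    model; the sparse human-preference reward is added only at the final
    answer step. `alpha` balances the two signals (illustrative value).
    """
    rewards = [alpha * step_rm(step) for step in code_steps]
    rewards.append(preference_rm(final_answer))  # sparse terminal reward
    return rewards
```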
We experiment with a range of models, with results summarized in the table below.
Performance of all models on the DACO dataset. We report three metrics (helpfulness, entailment, and BLEU) on both the TestA and TestH test sets.
@misc{wu2024daco,
  title={DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation},
  author={Xueqing Wu and Rui Zheng and Jingzhen Sha and Te-Lin Wu and Hanyu Zhou and Mohan Tang and Kai-Wei Chang and Nanyun Peng and Haoran Huang},
  year={2024},
  eprint={2403.02528},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}