Despite the large amounts of data produced by real-world applications, supervised learning problems still frequently face training data scarcity because collecting labelled data is expensive. This scarcity can lead to poor predictive models. One way to deal with it is to use data augmentation techniques.
Even though there are several data augmentation techniques for classification problems, very few studies have investigated data augmentation for regression problems.
In this talk, I will present a novel data augmentation technique for regression problems. The technique generates synthetic data by perturbing the input and output features of existing training examples. Experiments using datasets from the context of software effort estimation are performed to (1) evaluate the effectiveness of the proposed technique and (2) understand when and why it works well. The experiments show that the proposed technique is able to significantly improve the predictive performance of baseline models, especially when the training data is insufficient and when the baseline models are global rather than local learning algorithms.
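The core idea, generating synthetic examples by perturbing both the inputs and the target of existing training examples, can be sketched as follows. The talk does not specify the perturbation distribution or scale, so the Gaussian noise, the `noise_scale` parameter, and the scaling by each feature's standard deviation are assumptions for illustration, not the method as presented:

```python
import numpy as np

def augment_regression(X, y, n_new=100, noise_scale=0.05, rng=None):
    """Perturbation-based data augmentation for regression (illustrative sketch).

    Assumptions (not from the talk): synthetic examples are built by adding
    zero-mean Gaussian noise, scaled by each feature's standard deviation,
    to randomly chosen training examples; the target y is perturbed the
    same way.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Pick random base examples to perturb (with replacement).
    idx = rng.integers(0, len(X), size=n_new)

    # Noise magnitude proportional to the spread of each feature / the target.
    x_std = X.std(axis=0)
    y_std = y.std()

    X_new = X[idx] + rng.normal(0.0, noise_scale * x_std, size=(n_new, X.shape[1]))
    y_new = y[idx] + rng.normal(0.0, noise_scale * y_std, size=n_new)

    # Return the original data with the synthetic examples appended.
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```

A downstream regressor would then be trained on the augmented arrays instead of the originals; the original examples are kept unchanged, so augmentation only adds data.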
When the proposed technique cannot significantly improve predictive performance, it is not detrimental either. Moreover, our technique performs similarly to or better than an existing data augmentation technique for software effort estimation. Therefore, our proposed technique is useful for tackling training data scarcity in regression problems.

Webinar ID: 753-679-603