Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/60816
Full metadata record
dc.contributor.author: May, R.
dc.contributor.author: Maier, H.
dc.contributor.author: Dandy, G.
dc.date.issued: 2010
dc.identifier.citation: Neural Networks, 2010; 23(2):283-294
dc.identifier.issn: 0893-6080
dc.identifier.issn: 1879-2782
dc.identifier.uri: http://hdl.handle.net/2440/60816
dc.description.abstract: Data splitting is an important consideration during artificial neural network (ANN) development, where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated against random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling that minimizes the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially for non-uniform datasets, with the added benefit of scalability to data splitting on large datasets. (An illustrative code sketch of the SOM-based sampling scheme follows this metadata record.)
dc.description.statementofresponsibility: R. J. May, H. R. Maier and G. C. Dandy
dc.language.iso: en
dc.publisher: Pergamon-Elsevier Science Ltd
dc.rights: © 2009 Elsevier
dc.source.uri: http://dx.doi.org/10.1016/j.neunet.2009.11.009
dc.subject: Multivariate Analysis
dc.subject: Cluster Analysis
dc.subject: Models, Statistical
dc.subject: Reproducibility of Results
dc.subject: Learning
dc.subject: Algorithms
dc.subject: Databases, Factual
dc.subject: Databases as Topic
dc.subject: Neural Networks, Computer
dc.title: Data splitting for artificial neural networks using SOM-based stratified sampling
dc.type: Journal article
dc.identifier.doi: 10.1016/j.neunet.2009.11.009
pubs.publication-status: Published
dc.identifier.orcid: Maier, H. [0000-0002-0277-6887]
dc.identifier.orcid: Dandy, G. [0000-0001-5846-7365]
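
Illustrative sketch of SOM-based stratified sampling:
The Python sketch below shows one way the sampling scheme described in the abstract could be realised: cluster the data with a self-organizing map, treat each map unit as a stratum, and draw a training sample whose per-stratum sizes follow Neyman allocation. This is a minimal sketch under stated assumptions, not the authors' implementation; the third-party minisom package, the 6 x 6 map size, the training length, the 60% training fraction, and the use of the output variable's within-stratum standard deviation for allocation are all illustrative placeholders rather than the guidelines derived in the paper.

# Minimal sketch of SOM-based stratified sampling with Neyman allocation.
# NOT the authors' implementation: map size, training length, split fraction
# and the allocation variable are assumptions made for illustration only.
import numpy as np
from minisom import MiniSom  # third-party SOM package, assumed available

def som_neyman_train_split(X, y, map_shape=(6, 6), train_fraction=0.6, seed=0):
    """Cluster records (NumPy arrays X, y) with a SOM, then draw a training
    sample whose per-stratum sizes follow Neyman allocation (N_h * S_h)."""
    rng = np.random.default_rng(seed)
    n_train = int(train_fraction * len(X))

    # Train a small SOM; each map unit acts as one stratum.
    som = MiniSom(map_shape[0], map_shape[1], X.shape[1],
                  sigma=1.0, learning_rate=0.5, random_seed=seed)
    som.random_weights_init(X)
    som.train_random(X, 2000)

    # Assign every record to its best-matching unit (BMU).
    strata = {}
    for idx, x in enumerate(X):
        strata.setdefault(som.winner(x), []).append(idx)

    # Neyman allocation: n_h proportional to N_h * S_h (stratum size times
    # within-stratum standard deviation of the output variable y).
    weights = {u: len(ix) * (np.std(y[ix]) + 1e-12) for u, ix in strata.items()}
    total = sum(weights.values())

    train_idx = []
    for unit, members in strata.items():
        n_h = min(len(members), int(round(n_train * weights[unit] / total)))
        train_idx.extend(rng.choice(members, size=n_h, replace=False))

    train_idx = np.array(sorted(train_idx))
    rest_idx = np.setdiff1d(np.arange(len(X)), train_idx)
    return train_idx, rest_idx  # rest_idx can be split further into test/validation

The remaining records could be partitioned into test and validation subsets in the same stratified fashion. The paper evaluates this kind of SOM-based scheme against random sampling, DUPLEX, systematic stratified sampling and trial-and-error sampling on an ANN function approximation task.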
Appears in Collections:Aurora harvest
Civil and Environmental Engineering publications
Environment Institute publications

Files in This Item:
There are no files associated with this item.

