Comparing the performance of xgboost, Gradient Boosting and GBLUP models under different genomic prediction scenarios

Document Type: Original Research Article (Regular Paper)

Author

Department of Animal Science, Faculty of Agriculture, Bu-Ali Sina University, Hamedan, Iran

Abstract

The aim of this study was to evaluate the performance of the xgboost algorithm in the genomic evaluation of complex traits as an alternative to the Gradient Boosting algorithm (GBM). To this end, genotype matrices containing 5,000 (S1), 10,000 (S2) and 50,000 (S3) single nucleotide polymorphisms (SNPs) for 1,000 individuals were simulated. In addition to xgboost and GBM, GBLUP, which is known as an efficient method in terms of accuracy, computing time and memory requirement, was also used to predict genomic breeding values. xgboost, GBM and GBLUP were run in R using the xgboost, gbm and synbreed packages, respectively. All analyses were performed on a machine equipped with a Core i7-6800K CPU with 6 physical cores and 32 GB of memory. Pearson's correlation between predicted and true breeding values (rp,t) and the mean squared error (MSE) of prediction were computed to compare the predictive performance of the methods. GBLUP was the most memory-efficient method, whereas GBM required a considerably larger amount of memory to run. As the size of the data increased from S1 to S3, GBM dropped out of the comparison, mainly because of its high memory demand. Parallel computing with xgboost reduced running time by 99% compared with GBM. The speedup ratios (the GBM runtime divided by the runtime of xgboost with parallel computing) were 444 and 554 for the S1 and S2 scenarios, respectively. The parallelization efficiency (speedup ratio divided by the number of cores) was 74 and 92 for the S1 and S2 scenarios, respectively, indicating that the efficiency of parallel computing increased with the size of the data. xgboost was considerably faster than GBLUP in all the scenarios studied. The accuracy of genomic breeding values predicted by xgboost was similar to that of those predicted by GBM. Although the accuracy of prediction in terms of rp,t was higher for GBLUP, the MSE of prediction was lower for xgboost, especially for the larger datasets. Our results showed that xgboost could be an efficient alternative to GBM, as it had the same accuracy of prediction, was extremely fast and required considerably less memory to predict genomic breeding values.
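
The comparison metrics named in the abstract can be sketched in a few lines. The Python snippet below is only an illustration, not the authors' R code: the function names are hypothetical, and the four breeding values in the toy example are invented for demonstration rather than taken from the study. Only the speedup ratios (444 and 554) and the core count (6) come from the abstract.

```python
from math import sqrt


def parallel_efficiency(speedup, n_cores):
    """Parallelization efficiency as defined in the abstract: speedup ratio / number of cores."""
    return speedup / n_cores


def pearson_r(predicted, true):
    """Pearson's correlation between predicted and true breeding values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, true))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / sqrt(var_p * var_t)


def mse(predicted, true):
    """Mean squared error of prediction."""
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(predicted)


# Speedup ratios and core count reported in the abstract.
N_CORES = 6
for scenario, speedup in (("S1", 444), ("S2", 554)):
    print(scenario, round(parallel_efficiency(speedup, N_CORES)))

# Toy values (invented for illustration): predictions shifted by a constant 0.5.
true_bv = [1.0, 2.0, 3.0, 4.0]
pred_bv = [1.5, 2.5, 3.5, 4.5]
print(pearson_r(pred_bv, true_bv), mse(pred_bv, true_bv))
```

In the toy example the predictions are perfectly correlated with the true values while the MSE is still non-zero. The two criteria measure different things, which is why one method can rank higher on rp,t and another on MSE.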
 


