Modification of the short read alignment algorithm to improve the quality of the human whole genome sequencing data processing pipeline

Egor Pavlovich GUGUCHKIN; Evgeny Andreevich KARPULEVICH

doi:10.15514/ISPRAS-2023-35(2)-17

Abstract

This study addresses short-read alignment, a key step in the analysis of human whole-genome sequencing data. Alignment determines the position of each short read relative to a known human reference genome sequence. Traditional aligners use a linear reference sequence, which can produce incorrect alignments, especially when a read carries genetic variation that is absent from the reference. In this work, the reference index built by the minimap2 tool was modified to include information about frequently occurring genetic variants. Experimental results showed that the modified index increases the number of correctly identified genetic variants, which improves the quality of subsequent data analysis.
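The effect described in the abstract can be illustrated with a toy sketch. It is not the authors' modification and not minimap2's actual index (minimap2 uses minimizers and chaining); it only assumes a plain exact k-mer index and a hypothetical list of known SNPs, and all names and sequences below are invented for illustration. It shows that a read carrying a known variant finds fewer exact seeds in a linear-reference index than in an index that also contains the k-mers spanning the alternate allele.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (offset, k-mer) pairs for every window of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_index(reference, k, snps=()):
    """Map each k-mer to the reference positions where it starts.

    `snps` is a hypothetical list of (position, alt_base) pairs. For each
    SNP, the k-mers that overlap the alternate allele are indexed too.
    """
    index = defaultdict(set)
    for pos, kmer in kmers(reference, k):
        index[kmer].add(pos)
    for snp_pos, alt in snps:
        start = max(0, snp_pos - k + 1)
        end = min(len(reference), snp_pos + k)
        # Reference window around the SNP with the alternate base substituted.
        window = reference[start:snp_pos] + alt + reference[snp_pos + 1:end]
        for off, kmer in kmers(window, k):
            index[kmer].add(start + off)
    return index


def seed_hits(read, index, k):
    """Count read k-mers that have at least one exact match in the index."""
    return sum(1 for _, kmer in kmers(read, k) if kmer in index)


K = 5
reference = "ACGTTGCAAGGCTTACGGATCCAG"   # invented toy reference
snp = (12, "A")                          # invented SNP: T -> A at position 12
# A 13-base read covering positions 6..18 and carrying the alternate allele.
read = reference[6:12] + "A" + reference[13:19]

linear_index = build_index(reference, K)
variant_index = build_index(reference, K, snps=[snp])

linear_hits = seed_hits(read, linear_index, K)
variant_hits = seed_hits(read, variant_index, K)
print(linear_hits, variant_hits)  # 4 9
```

In this toy case the read has nine 5-mers, five of which overlap the variant. Only the four variant-free seeds match the linear index, while all nine match the variant-aware one. A real aligner tolerates mismatches during extension, so the practical gain depends on the read, the region, and the aligner's seeding strategy.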

Keywords

data processing pipeline, DNA sequencing, computational biology, sequence alignment methods, NGS data analysis, computational methods

About the Authors

Egor Pavlovich GUGUCHKIN

Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Egor Pavlovich GUGUCHKIN is a research fellow at ISP RAS. His research interests include the analysis of genetic data and the development of bioinformatics pipelines.

Evgeny Andreevich KARPULEVICH

Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Evgeny Andreevich KARPULEVICH is a specialist in the Information Systems Department. Research interests include the application of data analysis algorithms in the biomedical domain and the development of systems for distributed data storage and analysis.

For citations:

GUGUCHKIN E.P., KARPULEVICH E.A. Modification of the short read alignment algorithm to improve the quality of the human whole genome sequencing data processing pipeline. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(2):235-248. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(2)-17

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
