How many CRISPR systems exist? Possibly thousands. Many of them can be detected by meticulously mining large volumes of genomic data from 'uncommon' bacteria, such as those found in breweries or in the waters of Antarctic lakes. A recent study in the United States illustrated this: using a specialized cluster-analysis algorithm, the authors identified 188 of them.

The term 'genomic big data' refers to the accumulated information on the structure and function of the genomes of plants, animals, and humans. It encompasses sequenced DNA and data on how these molecules interact with proteins.

Geneticists, biologists, and biotechnologists collect this vast and intricate trove of data worldwide, analyzing it to develop treatments for genetic disorders, identify new genetic markers, and create tailored medications.

The National Institutes of Health (NIH), a body of the U.S. Department of Health and Human Services, is among the entities that administer the databases housing this globally shared genomic big data, including data on bacteria.

In particular, the NIH’s National Center for Biotechnology Information, in cooperation with researchers from the McGovern Institute for Brain Research and the Broad Institute at the Massachusetts Institute of Technology (MIT), applied an algorithm adept at categorizing bacterial genomic data. This approach led to the discovery of 188 novel CRISPR system types, as detailed in the article ‘Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering’, published in Science on November 24, 2023.


The algorithm adopted by the research team relies on a ‘locality-sensitive’ grouping method, which made it possible to select and classify similar (yet not identical) bacterial genomic data from the databases examined.

Analyzing this genomic big data revealed a surprising array of new CRISPR systems, including a variant with an extended guide RNA. This could lead to even more refined genome-editing techniques for future DNA splicing and rearranging tasks.

The group’s strategy also argues for broadening bacterial sampling criteria in the coming years: collecting water samples from mines or lakes, as the authors did, would enrich current databases with rare genomic big data and inject fresh momentum into research.

The Genesis and Role of CRISPR Systems

Before we delve deeper into the relationship between big data and genomics in aiding research, it’s important to note that the universally recognized acronym CRISPR – Clustered Regularly Interspaced Short Palindromic Repeats – denotes a category of DNA segments found in bacteria. These segments consist of short repeated sequences interspersed with fragments of viral DNA captured during past infections, which enable the microorganisms to recognize and dismantle matching viral genomes in a subsequent attack. Essentially, CRISPR serves as a natural defense for bacteria against external assaults.

Over the years, research into this defensive mechanism has spurred the development of progressively sophisticated genetic engineering methods for DNA manipulation in plant, animal, and human organisms.

The initial studies that would eventually lead to the term ‘CRISPR’ commenced in 1987 at Osaka University in Japan. The acronym itself was coined in 2001, providing a clear and singular designation for the multiple DNA sequences in bacteria, which had previously been referred to by various names in scientific literature.

Later, a specific bacterium, Streptococcus pyogenes, was found to possess a CRISPR system that uses the Cas9 protein as molecular scissors to counter pathogens.

In 2012, scientists Emmanuelle Charpentier and Jennifer A. Doudna turned this system into an innovative genome-editing tool, capable of identifying and cutting target DNA sequences within a cell’s genome more simply, accurately, and swiftly, removing and replacing them as needed.
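As a rough illustration of what ‘identifying and cutting target DNA sequences’ means computationally, the sketch below scans a DNA string for an exact match to a 20-nucleotide guide followed by the NGG PAM motif that S. pyogenes Cas9 requires. It is a simplified model, not the tool itself: it checks only one strand, allows no mismatches, and the function name is illustrative.

```python
import re

def find_cas9_sites(genome, guide):
    """Find positions where SpCas9 could cut: an exact match to the
    20-nt guide immediately followed by an NGG PAM. Cas9 cuts about
    3 bp upstream of the PAM (forward strand only, no mismatches)."""
    sites = []
    for m in re.finditer(f"{guide}[ACGT]GG", genome):
        cut = m.start() + len(guide) - 3  # blunt cut 3 bp before the PAM
        sites.append(cut)
    return sites
```

A real search would also scan the reverse complement and tolerate a few mismatches, which is precisely where off-target edits come from.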

This precise ‘genetic cut and paste’ approach earned them the Nobel Prize in Chemistry in 2020, opening avenues for laboratory research into prospective medical applications (both diagnostic and therapeutic).

Utilising Big Data Clustering to Advance Genomic Research

In the realm of CRISPR research related to big data and genomics, the study led by the US National Institutes of Health was sparked by a simple yet profound insight: “…databases brimming with bacterial data are immensely valuable for biotechnological strategies. Yet, in recent times, their sheer volume has rendered the task of accurately identifying desired enzymes and molecules increasingly challenging”.

This necessitated the development of an algorithm, grounded in big data clustering techniques, adept at selecting and organising information from the colossal expanse of genomic data. Here, ‘clustering’ or ‘cluster analysis’ denotes methods dedicated to assorting similar entities within a broad and varied dataset.
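To make the ‘locality-sensitive’ idea concrete, here is a minimal sketch of MinHash-based locality-sensitive hashing, the general family of techniques the article describes. This is not the authors’ FLSHclust implementation; all function names are illustrative. Sequences are reduced to sets of k-mers, compressed into MinHash signatures, and banded so that similar (not necessarily identical) sequences land in the same bucket without all-vs-all comparison.

```python
import hashlib
from collections import defaultdict

def kmers(seq, k=4):
    """Decompose a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(shingles, num_hashes=32):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash over the set; similar sets yield mostly-equal signatures."""
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_hashes)
    ]

def lsh_buckets(seqs, bands=8, rows=4):
    """Split signatures into bands: sequences sharing any whole band
    fall into one bucket, grouping near-duplicates cheaply."""
    buckets = defaultdict(list)
    for name, seq in seqs.items():
        sig = minhash_signature(kmers(seq), num_hashes=bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(name)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

The payoff is scale: candidate clusters emerge from hash lookups rather than pairwise comparisons, which is what lets this style of method sweep billions of sequences in weeks rather than months.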

The team specifically utilised the ‘Fast Locality-Sensitive Hashing-based clustering’ (FLSHclust) algorithm, conceived in the laboratory of Feng Zhang, a CRISPR research luminary and professor at the Massachusetts Institute of Technology.

Employing a ‘locality-sensitive’ approach, this technique effectively grouped together similar, albeit not identical, genomic data, surveying billions of proteins and DNA sequences in a matter of weeks, as opposed to months.

Delving deeper, the algorithm began with a diverse array of genomic data from bacteria of varied types and origins – including those sourced from coal mines, breweries, Antarctic lakes, and canine saliva – and extracted from three publicly accessible databases. Astonishingly, it uncovered “a remarkable number and variety of CRISPR systems”.

Moving Beyond the Hazards of ‘Off-Target’ Editing

After the discovery of CRISPR-Cas9, research progressed along a defined trajectory, aiming to rectify the system’s flaws, particularly the ‘off-target’ edits resulting from inaccuracies in DNA splicing and rearranging.

In this context, the collaborative efforts of the National Institutes of Health and MIT in big data and genomics have facilitated – among the 188 systems identified – the discovery of CRISPR systems that use a guide RNA (ribonucleic acid) 32 nucleotides in length, as opposed to the standard 20. This adaptation “holds promise for the creation of more precise genomic editing technologies, with a reduced risk of off-target editing,” as reported in Science.
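The statistical intuition behind a longer guide can be sketched with back-of-envelope arithmetic (my illustration, not the paper’s analysis): in random DNA, a fixed guide of length L matches a given position with probability 4⁻ᴸ, so lengthening the guide from 20 to 32 nucleotides shrinks the expected number of chance matches by a factor of 4¹², roughly 17 million.

```python
# Expected chance matches of a fixed guide in random DNA of a
# genome-scale length (~3e9 bp, a rough figure used for illustration).
GENOME_LEN = 3_000_000_000

def expected_random_matches(guide_len, genome_len=GENOME_LEN):
    # Each position matches an L-nt guide with probability 4**-L.
    return genome_len * 4 ** -guide_len

# Going from a 20-nt to a 32-nt guide divides chance matches by 4**12.
ratio = expected_random_matches(20) / expected_random_matches(32)
```

Real off-target activity is more forgiving than this model, since Cas enzymes tolerate some mismatches, but the scaling explains why extra guide length buys specificity.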

Furthermore, the research team demonstrated in laboratory settings that two of these ‘long guide’ CRISPR systems have the potential to alter human DNA in the future. Meanwhile, a third system exhibited a side effect that could, over time, be harnessed to develop techniques for the early diagnosis of infectious diseases. Specifically, this side effect involves “widespread degradation of nucleic acids once the CRISPR protein binds to its target.”

The team also uncovered new operational mechanisms for some established CRISPR systems and identified one system particularly focused on RNA. In forthcoming years, this could be precisely utilised for RNA editing – that is, manipulating gene regulation, expression processes, and protein synthesis. This marks another pivotal advance in genetic engineering, with potential implications for early diagnostic applications.

Big Data and Genomics: Future Research Trajectories?

The study conducted by the NIH’s National Center for Biotechnology Information and the Massachusetts Institute of Technology in the field of big data and genomics deserves recognition for elucidating the diversity and abundance of CRISPR systems discoverable through the analysis of bacterial genomic data. Notably, many of these systems are found in atypical bacteria – those inhabiting environments like coal mines, breweries, Antarctic lakes, and canine saliva. This suggests that genome-editing research should henceforth venture into uncharted territory and “widen the scope of sampling to further enhance the diversity of our discoveries,” as highlighted by the authors.

They further state:

“Some of the microbial systems we examined originated from water collected in coal mines globally. Without exploring this avenue, we might never have uncovered these new CRISPR systems.”

As commented by the researchers, an algorithm like the Fast Locality-Sensitive Hashing-based clustering can accomplish a great deal when dealing with genomic big data from a wide array of sources. In the future, this could also aid researchers in exploring other biochemical systems or anyone intending to work with extensive databases, “for instance, to examine protein evolution or to discover new genes.”
