Introduction
Bacteriophages (or phages) are viruses that infect bacteria. The phage–host relationship is specific and complex. Receptor-binding proteins (RBPs) of phages, such as tail fibers and tailspikes, are the first phage proteins to interact with the host, initiating the infection process. These proteins can specifically bind outer cell wall structures of bacteria such as capsular polysaccharides (CPS) [32, 57] or lipopolysaccharides (LPS) [21], (lipo)teichoic acids, outer membrane proteins, flagella, or pili [54]. Tail fibers generally adopt a fibrous shape and comprise a distal domain that binds the receptor, while tailspikes are typically shorter and contain an enzymatic domain that also degrades its receptor upon binding [12]. In this work we use the comprehensive term RBP because information on the presence of such enzymatic activity is inconsistently available. Whereas most phages encode one or two RBPs, some polyvalent phages express multiple RBPs, forming a branched RBP structure. Each of these RBPs recognizes a different receptor, allowing the phage to infect multiple hosts [17, 31, 41, 45, 53].
Numerous phages infecting Escherichia coli encode RBPs targeting the outer layer of LPS, called the O-antigen. When this virulence factor is present on the E. coli outer cell wall, the structure is referred to as smooth LPS. O-antigen serogroup variability is high, with 176 different structures currently described for smooth E. coli strains [37]. Among the most pathogenic E. coli strains is the Shiga toxin-producing E. coli (STEC) pathotype, one of the main causes of gastrointestinal illness around the world. The prevalence of certain O-antigens associated with this pathotype varies over time as well as by geographical location. The importance of STEC was recognized in 2015 by the Food and Agriculture Organization (FAO) of the United Nations and the World Health Organization (WHO). Serogroup O157 is the most prevalent serogroup in the United States, although the share of non-O157 serogroups continues to grow. In 2020, more STEC cases in Europe were reported with serogroup O26 than with serogroup O157 [13]. Additionally, isolated STEC outbreaks of newly emerging serogroups can occur, such as the O104 serogroup STEC outbreak in Germany in 2011 [28]. Other important non-O157 serogroups associated with human illness include O45, O91, O103, O104, O111, O145 and O146, with different prevalence in the USA versus Europe [15, 16].
Most tail fibers and tailspikes are homotrimeric, modular RBPs. They are generally composed of two domains: (1) an N-terminal anchor domain that attaches the RBP to the phage particle and (2) a C-terminal receptor-binding domain (RBD) that is responsible for binding and/or cleaving the host receptor. When this RBD has enzymatic activity, it generally displays a β-helical structure. The substrate-binding sites are located within the β-helix domain, either at the three interfaces between subunits (inter-subunit), as in the tailspike (TSP) of phage Sf6 and TSP1 and TSP2 of CBA120, or on the surface of each subunit (intra-subunit), as in the TSPs of phages P22, Det7 and HK620 [7, 35, 40, 58, 64]. The RBD can optionally comprise small domains such as a chaperone, adhesin or carbohydrate-binding domain. The C-terminal RBD is highly subject to horizontal gene transfer (HGT) and is often exchanged both within and outside the phylogenetic borders of the phage genera, whereas the N-terminal anchor domain remains conserved within a phage genus [21, 31, 46]. Certain phages make use of sequence motifs to aid recombination, resulting in high mosaicism in the genome [2, 25]. Such potential motifs have also been identified within the RBP gene [56, 61].
This work demonstrates the O-antigen binding potential of RBPs of members from eight phage genera, namely the Gamaleya-, Justusliebig-, Kaguna-, Kayfuna-, Kutter-, Lederberg-, Nouzilly- and Uetakeviruses. We confirm that a selection of phages expressing RBPs of the same subtype recognizes hosts of the same serogroup, and we predict the serogroup specificity of various RBPs in silico based on phylogenetic and structural clustering. Additionally, we identify DNA sequence motifs surrounding the RBD that are conserved in RBP genes across the lytic phage genera studied here.
Discussion
We built a pipeline to identify serogroup-specific RBPs in silico. To do so, we relied on the modularity principle of RBPs, which retain conserved anchors for structural attachment to the phage tail while swapping the RBD to switch specificity. At both the phylogenetic and the structural level, this modularity gives clear guidance for classifying RBPs into RBP subtypes. In total, 14 different RBP subtypes targeting O2, O8, O16, O18, 4s/O22, O26, O45, O77, O78, O103, O104, O111, O145 and O157 were identified in 39 phages spread over eight different phage genera. At the same time, several clustered RBP subtypes were found that most likely target a receptor other than the O-antigen. For example, during the serogroup validation step, the RBP of Justusliebigvirus VecB showed similarity to RBPs from prophages integrated in E. coli strains of serogroups O6, O11 and O153, with 89.8, 66.1 and 61.7% aa identity. Also, RBP1 of Gamaleyavirus PGN829.1 shows more than 99.5% aa identity with RBPs from prophages in strains with serogroups O11, O83, O86 and O102. Other examples are the RBP of Kayfunavirus YZ1 (serogroups O102, O6, O1, O153 and O6; ≥ 95% identity), the RBP of Uetakevirus phiv142-3 (including serogroups O5, O1, O102 and O51; ≥ 95% identity), and the RBP of Justusliebigvirus alia (including serogroups O7, O23, O146; ≥ 75.7% identity). An RBP binding smooth E. coli strains from multiple serogroups was identified previously [20]. One possible explanation is that these RBPs belonging to the same subtype target a receptor that is shared across multiple serogroups, such as the K-antigen (capsule). Outer membrane proteins may be less likely to serve as receptor, since RBPs of phages infecting smooth strains cannot easily approach the outer membrane proteins due to steric hindrance by the long-chain O-antigen [8, 20, 29]. Alternatively, these RBPs belonging to the same RBP subtype could have further diverged to alter their host specificity through single point mutations in their substrate-binding site, as has been described for some tail fibers [3, 33, 60, 66]. Further investigation is needed to draw firm conclusions, but phages targeting multiple serogroups may have broader therapeutic potential, which is an attractive trait for the development of phage cocktails.
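The amino acid identity percentages quoted above depend on the convention used to count identical positions, which is not specified in this section. As a minimal sketch, assuming a pairwise alignment is already available and that identity is counted over columns in which both sequences carry a residue, the calculation could look as follows. The function name `percent_identity` and the toy sequences are hypothetical and are not taken from the pipeline described here.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned amino acid sequences.

    Assumption: identity is counted over ungapped columns only. Other
    conventions (e.g. dividing by the full alignment length or by the
    shorter sequence) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    # Keep only the columns where both sequences have a residue.
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if a != gap and b != gap
    ]
    if not columns:
        return 0.0

    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Toy example: five ungapped columns, four of them identical.
toy_identity = percent_identity("MKT-LV", "MKSALV")
```

Because the denominator differs between conventions, identity thresholds are only comparable when they were computed in the same way.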
The temperate phage genera Lederberg- and Uetakeviruses offer an elegant avenue to identify new RBPs with specificity towards an O-antigen serogroup of interest. Phages of these genera were identified in eight out of nine serogroups of interest and there is a clear link between the RBP of the prophage and the O-antigen serogroup of their host [5]. This approach is generic and can be easily expanded to other serogroups. In addition to our findings, the RBP sequences of Salmonella enterica-infecting Lederbergviruses have been used to predict the O-antigen type of their host, with 743 prophage RBPs clustering into 18 distinct RBP subtypes that correlate perfectly with the O-antigen polysaccharide displayed on the host surface [5]. However, one limitation of this approach is that some Lederberg- and Uetakeviruses may also encode an O-antigen modification gene behind their RBP [9, 44], thereby changing the receptor as a mechanism to prevent superinfection. Next to serogroup prediction, RBDs of Lederberg- and Uetakeviruses with a podovirus morphotype have been grafted into myo-like phage tail-like bacteriocins (PTLBs) [49, 50] to successfully swap the killing spectrum of the PTLB. In addition, many RBDs of RBPs of Kutterviruses share homology with those of Lederberg- or Uetakeviruses, such as TSP3 of phage SPTD1 [18], and with other RBDs identified in this work. This shows that Lederberg- and Uetakeviruses are ideal candidates as a starting point to identify an RBP targeting an O-antigen serogroup of interest and to expand from there, recruiting more RBPs of the same RBP subtype from phages belonging to other taxonomic groups.
Our research suggests that in many phages belonging to the genera Gamaleya-, Justusliebig-, Kaguna-, Kayfuna-, Kutter-, Lederberg-, Nouzilly- and Uetakeviruses, the RBP(s) are the sole factor determining serogroup specificity. Consequently, these RBPs can be used to predict the phage host serogroup, relying on the conservation of serogroup specificity within RBP subtypes. Kutterviruses have previously been used to predict the host serogroup of Salmonella enterica and E. coli. RBP subtypes (> 75% aa identity) were confirmed for the O78 antigen of E. coli and the O22 antigen and O4/O9 antigen backbone of S. enterica [56]. In our work, we found reliable clustering into RBP subtypes with aa identities as low as 30%, while the predicted quaternary protein structures remain highly similar. This indicates that substantial divergence by adaptive evolution occurs to improve phage fitness after an HGT event of an RBD, while serogroup specificity is conserved. RBPs of the same subtype but with low sequence similarity thus share a more distant common ancestor than RBPs with higher similarity. These observations illustrate the interplay of horizontal and vertical evolutionary processes that shape tailspikes. However, the low threshold may lead to the inclusion of false positives when a serogroup is assigned to an RBP that has already undergone crucial mutations resulting in a serogroup specificity switch. As a criterion, we required that 90% of all RBPs within a subtype agree in their host serogroup; otherwise the RBP subtype was classified as non-O-antigen targeting. Therefore, we may have falsely discarded serogroup-specific RBP subtypes because of a single RBP that has potentially altered its specificity. In addition to the eight genera investigated in this study, Agtre-, Phapecocta-, Roguna- and Vectreviruses and members of the family Ackermannviridae or subfamily Braunvirinae also appeared frequently among the group F RBPs based on HGT identification, suggesting that they could also play an important role in the HGT of RBPs with E. coli serogroup specificity.
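The 90% agreement criterion described above can be illustrated with a short sketch. It assumes that the host serogroup of every RBP in a subtype is known, that the criterion means "at least 90%", and that the subtype is assigned the most frequent serogroup. The function name `classify_subtype` and the example data are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def classify_subtype(
    host_serogroups: Sequence[str], min_fraction: float = 0.9
) -> Optional[str]:
    """Return the consensus serogroup of an RBP subtype, or None.

    None stands for "classified as non-O-antigen targeting", i.e. fewer
    than min_fraction of the RBPs agree in their host serogroup.
    Assumption: every entry is a known host serogroup.
    """
    if not host_serogroups:
        return None

    # Most frequent serogroup and the number of RBPs that carry it.
    serogroup, count = Counter(host_serogroups).most_common(1)[0]
    if count / len(host_serogroups) >= min_fraction:
        return serogroup
    return None


# 19 of 20 RBPs agree (95%): the subtype is kept as O157-specific.
large_subtype = classify_subtype(["O157"] * 19 + ["O26"])

# 4 of 5 RBPs agree (80%): one deviating RBP discards a small subtype.
small_subtype = classify_subtype(["O157"] * 4 + ["O26"])
```

The second call shows why small subtypes are sensitive to this rule: in a subtype with fewer than ten members, a single deviating RBP already pushes the agreement below 90%.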
Members of these genera may be engineered to swap the host range of the phages simply by exchanging the RBDs. As phages seem to have switched host range on many occasions throughout evolution by horizontal transfer, phages could be designed with adapted RBPs to target the strain of choice. Przondovirus K11, a phage related to Kayfunaviruses, has been successfully engineered by swapping the RBD to alter the host range towards different Klebsiella capsular serotypes [32]. Similarly, Kuttervirus phage SPTD1 RBDs have been swapped within the same phage genus to target different Salmonella O-antigen serogroups [18]. Additionally, the RBDs of podo-like Lederberg- and Uetakeviruses have been exchanged with myo-like PTLBs as mentioned previously, illustrating that RBDs can be exchanged across different morphologies [49, 50].
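At the sequence level, the modular design behind such RBD exchanges can be sketched as a simple in silico chimera: the N-terminal anchor of a scaffold RBP is joined to the C-terminal RBD of a donor RBP. This is only an illustration under stated assumptions. The class `RBP`, the function `swap_rbd`, the domain boundary `rbd_start` and the toy sequences are all hypothetical; real designs depend on structurally informed boundaries and experimental validation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RBP:
    """Toy model of a modular RBP: N-terminal anchor + C-terminal RBD."""

    name: str
    sequence: str
    rbd_start: int  # 0-based index of the first RBD residue (assumed known)

    def __post_init__(self) -> None:
        if not 0 < self.rbd_start < len(self.sequence):
            raise ValueError("rbd_start must split the sequence in two parts")

    @property
    def anchor(self) -> str:
        # Residues before the boundary attach the RBP to the phage particle.
        return self.sequence[: self.rbd_start]

    @property
    def rbd(self) -> str:
        # Residues from the boundary onward bind (and possibly cleave) the receptor.
        return self.sequence[self.rbd_start :]


def swap_rbd(scaffold: RBP, donor: RBP) -> RBP:
    """Build a chimera with the scaffold's anchor and the donor's RBD."""
    return RBP(
        name=f"{scaffold.name}_anchor-{donor.name}_RBD",
        sequence=scaffold.anchor + donor.rbd,
        rbd_start=scaffold.rbd_start,
    )


# Placeholder sequences, not real proteins.
scaffold = RBP(name="phageA", sequence="AAAABBBB", rbd_start=4)
donor = RBP(name="phageB", sequence="CCDDDDDD", rbd_start=2)
chimera = swap_rbd(scaffold, donor)
```

Keeping the boundary explicit in the data structure makes it straightforward to test several candidate fusion points for the same anchor and RBD pair.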
The observed sequence conservation surrounding the RBD may aid recombination across Gamaleya-, Justusliebig-, Kaguna-, Kayfuna-, Kutter- and Nouzillyviruses. Although illegitimate recombination events can happen virtually anywhere in the phage genome, certain regions of sequence conservation can serve as recombination hotspots. Such hotspots have been identified on multiple occasions. In temperate phage clusters including Lederbergviruses, conserved sequence motifs were identified between genome cassettes, resulting in higher genome mosaicism [6, 9, 25, 47]. Moreover, sequence homology has also been identified across different genera of lytic phages. For example, sequence homology between the different RBPs of Kutterviruses and between Kutter- and Gamaleyaviruses has already been suggested to aid recombination across different tailspike genes [7, 21, 46, 56]. In this research we observed conserved motifs that may allow homologous recombination to occur at a higher rate in the sequence regions surrounding the RBD in up to six different lytic phage genera. Additionally, when we expanded the data set in this project, we observed various HGT events across phages belonging to the same, recurring genera, indicating that HGT of RBPs is more likely among these genera than with other genera. However, these motifs are not universal for all lytic phage RBPs in the final data set and no correlation could be observed between the presence of these motifs and the number of recombination events that we observed between these phages.
A few hurdles were identified when performing this research. (i) The first limitation is the lack of phage–host serogroup data in public databases. When the serogroup of the phage host is known, it should be reported, as it can offer valuable information in phage–host interaction studies. Additionally, the number of available genomes of phages infecting smooth E. coli strains is relatively low compared to those infecting rough E. coli strains. On top of that, most of the available genomes of phages infecting smooth E. coli belong to phages that infect E. coli serogroup O157. To find new phages, smooth E. coli strains of all serogroups should be used more frequently as hosts during phage isolation. The method used for E. coli serotyping is also relevant information, since additional O-antigen modification genes can be encoded by prophages, which can be missed by genetic-based serotyping assays. (ii) A second hurdle is the incorrect annotation of many RBPs in databases such as NCBI. This is partially due to the variety of terminology in use. Tail fibers generally have a fiber-like structure dominated by a long α-helix bundle with a C-terminal RBD, whereas tailspikes have an enzymatically active, β-helical, elongated structure with no, one or two C-terminal carbohydrate-binding or chaperone domains [12]. The two terms are often mixed up. Wrongly annotated RBPs require manual and time-consuming curation through phage genome alignments. New computational tools such as PhageDPO [63] may facilitate this process, but still require manual validation. (iii) The number of RBP structures solved by crystallography is growing but still small. Therefore, we relied extensively on the AlphaFold2 algorithm to reveal the remarkably conserved anchor and RBD quaternary structures, corresponding to genus and serogroup, respectively. Yet, the AlphaFold2 algorithm frequently failed to deliver good structures, such as the trimeric structure of O8- and O16-targeting RBPs, either due to limitations in computing power to deal with these large, trimeric proteins or due to high error estimates. The limitation in computing power could mostly be circumvented by using high-performance computing infrastructure and splitting the RBP into its anchor and RBD for separate predictions. The high error estimates are caused by the inability to predict the mutual orientation of the separate domains because of the flexible hinge domains, due to the limited number of available crystal structures (e.g., for the anchor domain of the Lederbergvirus RBPs), but also due to the intervening T4gp10-like domains that are needed to create branched RBPs [31, 45, 46].
In sum, a pipeline to identify and validate E. coli O-antigen-specific RBPs was established. Eight phage genera (Gamaleya-, Justusliebig-, Kaguna-, Kayfuna-, Kutter-, Lederberg-, Nouzilly- and Uetakeviruses) stood out for their high proportion of serogroup-specific RBPs. With their conserved N-terminal anchor domain and exchangeable RBD, they offer an ideal platform for engineering phage host range in terms of O-antigen serogroup specificity. This research also emphasizes the need to study recombination hotspots surrounding RBDs, which might lead to a better understanding of phage genome mosaicism.