Genome analysis views DNA as a linear string of the letters A, C, G, and T, but proteins recognize DNA as a three-dimensional object (Figure). Our main interest is to better understand how transcription factors (TFs) recognize nuances in intrinsic DNA structure, and to identify TF families for which the readout of local DNA shape contributes to binding specificity and explains the distinct functions of closely related TFs. Until recently, our research focused mainly on the analysis of TF binding sites (TFBSs) for which structural information was available. Advances in high-throughput sequencing technologies have since made sequence information available for whole genomes, whereas structural information on that scale is not available. It is still unknown why certain TFs bind to similar DNA sequences but execute different in vivo functions, or, conversely, why a single TF binds to diverse sequences. Our scientific contributions suggest that direct chemical contacts with base pairs cannot sufficiently explain binding specificity and that DNA shape is a crucial specificity determinant:
1. We discovered that Drosophila Hox proteins achieve binding specificity through readout of minor groove geometry and that base-specific hydrogen bonds in the major groove are not sufficient for in vivo function (Joshi et al., Cell 2007).
2. Based on the analysis of all available crystal structures of protein-DNA complexes, we generalized our finding that many TF families use arginine residues to recognize minor groove shape and electrostatic potential and that similar shape-dependent interactions with histones contribute to the stabilization of nucleosomes (Rohs et al., Nature 2009).
3. For binding sites of the tumor suppressor p53, we found a different mechanism for altering minor groove shape due to a flip of several base pairs from Watson-Crick to Hoogsteen geometry (Kitayner et al., Nat. Struct. Mol. Biol. 2010).
4. The discovery of minor groove shape recognition has led to a new classification of protein-DNA readout modes into base readout and shape readout (Rohs et al., Annu. Rev. Biochem. 2010), which has already been adopted in a textbook.
5. We developed a high-throughput method for minor groove geometry prediction. In a proof-of-principle study, we published the first application of this approach, based on the shape analysis of several hundred thousand TFBSs (Slattery et al., Cell 2011). We predicted the minor groove width of Drosophila Hox binding sites derived from SELEX-seq experiments and discovered that Hox TFs, although they bind to target sites that are similar in sequence, prefer distinct minor groove topographies. Hox TFs responsible for the development of anterior regions of the fly select one shape class, while Hox proteins involved in posterior development prefer a different shape class.
6. We are currently expanding our high-throughput method to predict all essential structural features of TFBSs at single-nucleotide resolution. This approach is based on the data mining of thousands of Monte Carlo trajectories, which we validated against all available crystal structures. Using a sliding pentamer window, we derive the average conformation at the center of each unique pentamer and use it to predict structural features. Our high-throughput method is very fast in comparison to molecular simulations and predicts shape features of, for instance, the entire yeast genome in about one minute on a single processor. This advance makes DNA shape analysis practical on a genome-wide scale for the first time.
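The sliding pentamer window described in item 6 can be illustrated with a minimal sketch. This is not the lab's actual implementation: the function names `toy_pentamer_table` and `predict_shape` are hypothetical, and the table values come from an arbitrary toy rule rather than from Monte Carlo data. The sketch only shows the lookup scheme, in which each nucleotide is assigned the value stored for the pentamer centered on it.

```python
from itertools import product


def toy_pentamer_table():
    """Build a PLACEHOLDER table mapping each pentamer to a number.

    Assumption: in a real pipeline these values would be averages mined
    from simulated structures. The rule below is invented purely so the
    example runs; it carries no structural meaning.
    """
    table = {}
    for bases in product("ACGT", repeat=5):
        pentamer = "".join(bases)
        at_count = sum(base in "AT" for base in pentamer)
        table[pentamer] = 6.0 - 0.4 * at_count  # toy rule, not real data
    return table


def predict_shape(sequence, table, window=5):
    """Assign each position the table value of the pentamer centered on it.

    The first and last two positions have no complete pentamer and stay
    None. Windows containing characters outside A/C/G/T (for example N)
    are absent from the table and also yield None.
    """
    sequence = sequence.upper()
    half = window // 2
    profile = [None] * len(sequence)
    for center in range(half, len(sequence) - half):
        pentamer = sequence[center - half:center + half + 1]
        profile[center] = table.get(pentamer)
    return profile


if __name__ == "__main__":
    table = toy_pentamer_table()
    for base, value in zip("ACGTACG", predict_shape("ACGTACG", table)):
        print(base, value)
```

Because every position costs a single dictionary lookup, the running time grows linearly with sequence length, which is consistent with the speed advantage over molecular simulations noted above.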
Building on this high-throughput method for DNA shape prediction, our lab will apply DNA shape analysis to a variety of biological questions that we believe will benefit from integrating studies of DNA sequence and shape. Our immediate research plans focus on analyzing, on a genome-wide basis, the role of various intrinsic DNA shape features in achieving the DNA binding specificity of closely related TFs. Based on our preliminary results, we expect that genome-wide DNA shape analysis will become an important part of interpreting high-throughput sequencing data and will provide a better understanding of the genome and its diverse functions.