N. Abe et al.
Deconvolving the recognition of DNA sequence from shape.
Cell 161, 307-318 (2015)
Supplementary Information
SYNOPSIS
./encode.py <input_path> <output_path> <path_to_encode.R> <path_to_feature_list_file>
DESCRIPTION
encode.py is a Python script that invokes the R script encode.R to encode features for the data files in the input_path directory and saves the encoded output files to the output_path directory. The paths to encode.R and to the feature_list file must be specified. The feature_list file lets users select which features to encode. You may first need to compile the DNAshape program by going to the folder DNAshape_v2.6 and running "make install".
Each data file is expected to contain aligned sequences of equal length in the first column and the corresponding measured binding affinity signal in the second column, as follows:
<Sequence 1> <Affinity 1>
<Sequence 2> <Affinity 2>
......
<Sequence N> <Affinity N>
Here is an example (extracted from the file "sample_input.txt.s" in the folder "Sample_input"):
TTGTCAATTATATGCTAAG 0.8
GCTGAGGTTACACTTGACT 0.6
...
TGCAGAGTTACGACATTAG 0.9
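Before running the encoder, it can help to confirm that a data file matches the layout above. The following is a minimal sketch, not part of the distributed scripts: the function name parse_input_lines is hypothetical, and it assumes the two columns are separated by whitespace (spaces or tabs).

```python
from typing import Iterable, List, Tuple


def parse_input_lines(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse '<sequence> <affinity>' lines and check that all sequences share one length.

    Assumption: columns are whitespace-separated; blank lines are skipped.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        records.append((fields[0].upper(), float(fields[1])))

    # Aligned sequences must all have the same length.
    lengths = {len(sequence) for sequence, _ in records}
    if len(lengths) > 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    return records


sample = [
    "TTGTCAATTATATGCTAAG 0.8",
    "GCTGAGGTTACACTTGACT 0.6",
    "TGCAGAGTTACGACATTAG 0.9",
]
records = parse_input_lines(sample)
```

To check a real file, pass an open file handle instead of the sample list, for example parse_input_lines(open(path)).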
For a given input data file, the following features can be generated:
Sequence features: A mapped to [1,0,0,0], C mapped to [0,1,0,0], G mapped to [0,0,1,0], T mapped to [0,0,0,1]
MGW features: DNA minor groove width values normalized
Roll features: Roll angle between adjacent base-pairs normalized
ProT features: Propeller twist between paired bases normalized
HelT features: Helix twist between adjacent base-pairs normalized
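The sequence mapping above can be illustrated with a short sketch. This is not the code in encode.R: the function name one_hot_encode is hypothetical, and concatenating the four indicator values position by position is an assumption about how the columns are laid out.

```python
# Mapping taken from the feature description: A, C, G, T in that order.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot_encode(sequence):
    """Return the concatenated 4-value indicators for each base of the sequence."""
    encoded = []
    for position, base in enumerate(sequence.upper(), start=1):
        if base not in ONE_HOT:
            raise ValueError(f"unsupported base {base!r} at position {position}")
        encoded.extend(ONE_HOT[base])
    return encoded


features = one_hot_encode("ACGT")
```

Under this layout, a sequence of length n yields 4n sequence features, so the 19-mers in the sample input would give 76 values each.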
Users can select the desired features by editing the file feature_list. The first column of feature_list gives the name of each feature combination, which can be any string without spaces. The second column must be a 5-bit binary string that toggles the output of sequence, MGW, Roll, ProT, and HelT features, in that order. For example, "11111" enables all five feature types, and "10000" enables only sequence features. An option in encode.R controls whether the output affinity is the original value or its logarithm.
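To make the 5-bit string concrete, here is a minimal sketch that decodes one feature_list line. The function name parse_feature_list_line and the example combination names ("seq_only", "seq_shape") are hypothetical, and whitespace-separated columns are an assumption; the bit order follows the description above.

```python
# Order of the five toggles, as stated in the description of feature_list.
FEATURE_ORDER = ("sequence", "MGW", "Roll", "ProT", "HelT")


def parse_feature_list_line(line):
    """Split a '<name> <5-bit string>' line into the name and the enabled features."""
    fields = line.split()
    if len(fields) != 2:
        raise ValueError(f"expected 2 columns, got {len(fields)}")
    name, bits = fields
    if len(bits) != len(FEATURE_ORDER) or set(bits) - {"0", "1"}:
        raise ValueError(f"second column must be a 5-bit binary string, got {bits!r}")
    enabled = [feature for feature, bit in zip(FEATURE_ORDER, bits) if bit == "1"]
    return name, enabled


seq_only = parse_feature_list_line("seq_only 10000")
seq_shape = parse_feature_list_line("seq_shape 11111")
```

A line such as "seq_only 10000" therefore selects only the sequence features, while "seq_shape 11111" selects all five feature types.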
The encoded output files contain the affinity values, or responses, as the first column, followed by a constant column of 1's, followed by sequence features (if enabled), followed by MGW features (if enabled), followed by Roll features (if enabled), followed by ProT features (if enabled), followed by HelT features (if enabled).
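When loading an encoded file for downstream modeling, the first two columns usually need to be separated from the features. The sketch below shows one way to do that for a single row. It is an illustration only: the function name split_encoded_row is hypothetical, and the whitespace delimiter is an assumption, since the delimiter used in the output files is not stated here.

```python
def split_encoded_row(line):
    """Split one encoded row into (response, features).

    Layout taken from the description: column 1 is the response,
    column 2 is a constant 1, and the remaining columns are features.
    Assumption: values are separated by whitespace.
    """
    values = [float(value) for value in line.split()]
    if len(values) < 2:
        raise ValueError("expected at least a response and a constant column")
    response, constant, features = values[0], values[1], values[2:]
    if constant != 1.0:
        raise ValueError(f"second column should be the constant 1, got {constant}")
    return response, features


response, features = split_encoded_row("0.8 1 1 0 0 0")
```

The constant column is checked rather than silently dropped, so that a file with a different layout fails early instead of producing shifted features.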
Feel free to contact us if you have any questions. (yang23@usc.edu)