Manipulating PDB files using BioPython

From WikiGagneLague
Jump to: navigation, search


Contents

Introduction

The PDB file format is the de facto standard for storing macromolecular structure information. While the PDB file structure is simple, there are a number of subtleties that make its parsing difficult. Therefore, easy-to-use modules for very high-level programming languages that allow both parsing and serialisation are scarce. Most of the time, PDB files are edited by hand or using UNIX text manipulation tools (sed, awk...), hoping for the best. The PDB module of the BioPython package is a very powerful alternative that allows Python users to access the information stored in PDB files as sophisticated, hierarchical data structures. The information may be modified and saved back to new PDB files. BioPython makes it easy to renumber the residues in a PDB, change the name of a residue, selectively remove heteroatoms, etc.

An introduction to using the PDB module is available in the BioPython Tutorial and Cookbook. However, it does not touch the subject of manipulating structure information, but only talks about accessing that information. Here I will illustrate how you can modify PDB structures using the Bio.PDB module.

Parsing PDB files

A PDBParser instance produces a Structure instance by reading a PDB file. You must supply the filename and an ID for the structure. Once the structure has been read, you can also retrieve the header and trailer information. The header is likely to be incomplete, though, as it stops after the REMARK lines, missing SEQRES, FTNOTE, etc.

from Bio.PDB.PDBParser import PDBParser

parser = PDBParser()

structure = parser.get_structure('1g59', '1g59.pdb')
header = parser.get_header()
trailer = parser.get_trailer()

Accessing the information

The macromolecular information is stored in a structure, model, chain, residue, atom hierarchy (SMCRA). Each object in the hierarchy is identified by an ID. Here is a crystallographic structure example, the glutamyl-tRNA synthetase from T. thermophilus complexed with its tRNA (PDBID: 1G59). There is only one model (as opposed to NMR structures). The structure contains two proteins, two nucleic acids and 283 water molecules (5 different chains). The chains contain either amino acids, nucleotides or water "residues" (groups would be a more appropriate term). Each residue contains one or more atoms.

Structure['1g59']
  |
  +---- Model[0]
          |
          +---- Chain['A']
          |       |
          |       +---- Residue[' ', 1, ' ']
          |       |       |
          |       |       +---- Atom['N']
          |       |       |
          |       |       +---- [...]
          |       |       |
          |       |       +---- Atom['CE']
          |       |
          |       +---- [...]
          |       |
          |       +---- Residue[' ', 468, ' '] [...]
          |
          +---- Chain['B'] [...]
          |
          +---- Chain['C'] [...]
          |
          +---- Chain['D'] [...]
          |
          +---- Chain[' ']
                  |
                  +---- Residue['W', 1, ' ']
                  |       |
                  |       +---- Atom['O']
                  |
                  +---- [...]
                  |
                  +---- Residue['W', 283, ' '] [...]

We see that IDs come in a variety of forms. The ID of a structure is the one you specified when the PDB file was parsed. (I used the PDBID.) For models, the IDs are numbers and correspond to the order in which the models appear in the PDB file (0 is the first model). To retrieve the first model in a structure:

model = structure[0]

Chains are identified by one letter strings. Here, chains A and C are polypeptides, B and D nucleic acids. We see that the chain containing the water molecules is named ' ' (a space). This is not very helpful, but happens quite often. As soon as a space character is used in the PDB file instead of a letter, the residue becomes part of that chain. Since this happens in many PDB, you will often have a chain whose ID is a space. Many times, that chain will contain everything in the PDB (protein and water molecules). To retrieve chain A in the first model:

model = structure[0]
chain = model['A']

Residue IDs are a bit more complicated. They are a tuple of three values. The first value is the HETERO code of the residue (a string), the second is its number (an integer), the third its insertion code (a string). (I am not fond of this, as the HETERO and insertion codes are very often a one space string.) Here is how to retrieve residue number 1 in chain A (amino acid), and HETERO water residue number 38 in chain ' ':

model = structure[0]

chain_a = model['A']
chain_w = model[' ']

first_aa = chain_a[(' ', 1, ' ')]
some_water = chain_w[('W', 38, ' ')]

Atoms are identified by their name as seen in the PDB file. This is a string of between 1 and 4 characters. For instance, to retrieve the epsilon carbon of the first residue in chain A (in the first model), use:

model = structure[0]
chain = model['A']
residue = chain[(' ', 1, ' ')]

atom = residue['CE']

Retrieving element properties

Once you have a reference to your object of interest, you can examine its properties. For instance:

model = structure[0]
chain = model['A']
residue = chain[(' ', 1, ' ')]
atom = residue['CE']

print chain.id
A

print residue.id[1], residue.resname
1 MET

print atom.name, atom.occupancy, atom.bfactor, atom.coord
CE 1.0 50.12 [ 39.10900116  79.59600067  29.96500015]

For a complete list of properties, see the API documentation or use the autocompletion feature of IPython.

Accessing child elements as a list

Sometimes, you do not know (or do not care about) the ID of the child elements. When that is the case, you can retrieve all child elements in an ordered list. For instance:

residue_list = chain.child_list

for residue in residue_list:
    print residue.resname
MET
VAL
VAL
THR
[...]
ALA

Iterating on an element

All levels of the structure hierarchy (except atoms) can be iterated over. The preceeding example can therefore be rewritten as:

for residue in chain:
    print residue.resname
MET
VAL
VAL
THR
[...]
ALA

Modifying the structure

The Cookbook does not say much about this, and my experience with the module suggests that modifying the data structure and then saving it back to a file is not as easy as it should... The procedure is not coherent (it depends on what you want to change), and bugs in the module make accessing the data much more tricky once it has been modified. Also, the data structure reports contradictory information once modified. This does not prevent you from modifying the structure, though. You only need to be careful. I have succeeded in removing chains, removing, renaming and renumbering residues, and renaming atoms. (That is all I need to do right now.)

To remove an element, you must use the detach_child method on its parent, passing the ID of the element to remove. I tried using the detach_parent method directly from the element to remove, but it did not work. To rename a residue, set its resname property. To change the number associated with a residue, you must set its id property. Setting resseq will not do anything... (Remember to use a tuple of 3 values when setting the ID.) Setting the ID will change the residue number, but you will still have to use the original ID to access the residue. To change the name of an atom, you must set its fullname property. The name property will not work. Be careful that the atom full name must be a four-character string (padded with spaces if necessary). Setting the full name will not change the name or the ID used to access the atom.

Removing residues

The following example shows how to remove all heteroatoms from a structure. Chains are also removed if they contain only heteroatoms.

for model in structure:
    for chain in model:
        for residue in chain:
            id = residue.id
            if id[0] != ' ':
                chain.detach_child(id)
        if len(chain) == 0:
            model.detach_child(chain.id)

The following example shows how to remove all water molecules whose occupancy factor is under 50 %. Chains are also removed if they contain only water molecules whose occupancy factor is under 50.

for model in structure:
    for chain in model:
        for residue in chain:
            id = residue.id
            if id[0] == 'W':
                if residue['O'].occupancy < 0.5:
                    chain.detach_child(id)
        if len(chain) == 0:
            model.detach_child(chain.id)

Renumbering residues

The following example shows how to renumber the residues in a chain, starting at 1. We assume the chain contains no hetero atoms, and that no residue has an insertion code.

i = 1
for residue in chain:
    residue.id = (' ', i, ' ')
    i += 1

The following example shows how to renumber the 263 residues in a chain using numbers in the range 26-238, 240-252, 254-290. We assume the chain contains no hetero atoms, and that no residue has an insertion code.

seq = range(26, 238 + 1) + range(240, 252 + 1) + range(254, 290 + 1)
i = 0
for residue in chain:
    residue.id = (' ', seq[i], ' ')
    i += 1

Renaming residues

The following example shows how to rename all HIS residues to HSD (the nomenclature for the CHARMM program).

for residue in chain:
    if residue.resname == 'HIS':
        residue.resname = 'HSD'

Renaming atoms

The following example shows how to rename the CD1 atoms of isoleucine to CD (the nomenclature for the CHARMM program). (Pay attention to the spaces used to make the name four characters wide.)

for residue in chain:
    if residue.resname == 'ILE':
        residue['CD1'].fullname = ' CD '

Writing the structure

To write a new PDB file from your modified structure, you can use the PDBIO class. Here is an example:

from Bio.PDB import PDBIO

w = PDBIO()
w.set_structure(structure)
w.save('1btl-r1.pdb')

Example script

Here is a script to modify a PDB from the format used by CHARMM to the more standard format used elsewhere :

#!/usr/bin/python

import sys
from Bio.PDB import PDBIO
from Bio.PDB.PDBParser import PDBParser

if len(sys.argv) != 2:
    print "------------------------------------------------------"
    print "Usage : ./convert_PDB_from_CHARMM_to_PDB.py <YOUR PDB>"
    print "------------------------------------------------------"
    sys.exit()
else:
    PDB_input = sys.argv[1]
    parser = PDBParser()
    structure = parser.get_structure('self', PDB_input)
    model = structure[0]
    chain = model[' ']
    residue = chain[(' ', 1, ' ')]
    first_residue = chain.child_list[0]
    last_residue = chain.child_list[-1]

    for residue in chain:
        if residue.resname == 'HSD':
            residue.resname = 'HIS'
        elif residue.resname == 'HSE':
            residue.resname = 'HIS'
        elif residue.resname == 'HSP':
            residue.resname = 'HIS'
        elif residue.resname == 'ILE':
            residue['CD'].fullname = ' CD1'
            residue['HD1'].fullname = '1HD1'
            residue['HD2'].fullname = '2HD1'
            residue['HD3'].fullname = '3HD1'
        elif residue.resname == 'SER':
            residue['HG1'].fullname = '1HG '

    first_residue['HT1'].fullname = '1HT '
    first_residue['HT2'].fullname = '2HT '
    first_residue['HT3'].fullname = '3HT '

    last_residue['OT1'].fullname = ' OXT'
    last_residue['OT2'].fullname = ' O  '

    w = PDBIO()
    w.set_structure(structure)
    w.save('corrected_from_CHARMM_to_PDB.pdb')

Resources

The RCSB Protein Data Bank is the repository for all macromolecular structures.

BioPython is a collection of Python tools for computational molecular biology.

The BioPython Tutorial and Cookbook has a chapter on using the PDB module.

The API documentation for the PDB module is indispensable when using the module.