How to handle HETATM records in PDB files #409
Comments
A common location of

I am increasingly beginning to think that it will be hard to update our parser to reliably create/reconstruct a PDB file from a Sire System unless we have in place a bunch of robust rules for detecting things, e.g. based on residue name, element, etc. In most cases a PDB will be a direct starting point for our users, so perhaps we want to consider bypassing Sire and directly passing this through to the parameterisation engine, i.e. we only create a system from the output of parameterisation, hence reading/parameterising in one go.
This is challenging. Is it always the case that the atom number increases sequentially in the problematic files? If so, could we do a simple re-ordering of the PDB file after writing to sort the lines into atom number order? Or could we do something when reading a file that adds a Boolean flag or similar to say when an atom has to be followed by a

A challenge with both these approaches is that they rely on a "correctly" formatted PDB with

I do like the idea of giving users the ability to read and parameterise a PDB file into a Sire system in one go. There are a lot of extra properties that are useful for other bits of the code (e.g. atomic charges, bonding/connectivity information, total charge on the molecule) that are missing or ambiguous from a PDB file. PDBs also tend to have missing atoms (hydrogens, residues from chains, etc.), which can also lead to special-case or ambiguity-fixing code. The only challenge would be making sure we don't write yet another PDB parser (we already have 2 in Sire...!). It is definitely worth some thought.
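The re-ordering idea above (sorting the written lines into atom number order) could look something like the following. This is only a rough sketch in plain Python, not anything that exists in Sire: the function name `sort_by_serial` is made up, and it assumes a single-model file in which every `ATOM`, `HETATM` and `TER` record carries a serial number in columns 7-11. A bare `TER` with no serial would raise a `ValueError` here.

```python
# Hypothetical helper, not part of Sire or BioSimSpace.
COORD_RECORDS = ("ATOM", "HETATM", "TER")


def serial(line):
    """Atom serial number from columns 7-11 of a PDB record."""
    return int(line[6:11])


def sort_by_serial(lines):
    """Sort ATOM/HETATM/TER records by serial, leaving other lines around them.

    Lines before the first coordinate record stay at the top; every other
    non-coordinate line (e.g. CONECT, END) is kept, in order, at the bottom.
    """
    head, body, tail = [], [], []
    for line in lines:
        if line.startswith(COORD_RECORDS):
            body.append(line)
        elif body:
            tail.append(line)
        else:
            head.append(line)
    # sorted() is stable, so records with equal serials keep their order.
    return head + sorted(body, key=serial) + tail
```

It only helps if the serial numbers in the original file already encode the intended order, which is exactly the question raised above.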
It's a bit of a mess to be honest, which is unsurprising given the abuse of the PDB format. Some example input files appear to be generated by chopping out bits of existing PDBs, so things like the numbering, chain identifiers, etc., aren't important. I am already doing something close to what you suggest, i.e. flagging if an atom is a real

From passing a bunch of example PDB files to
From debugging this BioSimSpace issue it is clear that our handling of HETATM records from PDB files is problematic and needs improving. Unfortunately the formatting of these records seems to be quite variable between PDBs, making it hard to develop a single strategy for dealing with them. For example (copied from the above issue thread):

For example, in this file there are `HETATM`s with the same chain identifier before and after the `TER`. Some examples of the different formatting:

- `HETATM` in chain `B` before and after `TER`, followed by `HETATM`s from chain `A`.
- `ATOM` and `HETATM` interspersed within the same chain.

In my opinion the important thing isn't necessarily the PDB files themselves, but rather what `LEaP` etc. require in order to function. (In most cases someone will simply be loading a PDB as a starting point for parametrisation.) As such, I've been seeing how `pdb4amber` processes a bunch of files including various types of `HETATM` formatting. In some cases these are converted to `ATOM` records, in others they are left in place, and sometimes they are even moved. ParmEd uses the approach of labelling everything in a non-standard residue (using template name matching) as a `HETATM`, but I'm not sure how it deals with those that are misplaced.

Our main problem is that we fully convert the information from the PDB into an internal molecular data structure. Residues in the PDB are reparented to their chains, which are reparented to molecules. When writing back, we reverse this process. If some `HETATM` records need to be placed before the end of a chain (where the `TER` record is placed) and some after, this is very tricky to achieve without knowing exactly which ones should go where, and why.

I'll try to determine some rules-of-thumb for the position of various `HETATM` records, then test how robust these are. Perhaps it's possible to move all records to the end of the file without issue, i.e. after the final `TER`. This would certainly be the easiest solution.
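For reference, the "move everything to the end" option could be prototyped as a post-processing step on the written lines, roughly as below. This is a sketch only: `move_hetatm_to_end` is a hypothetical name rather than an existing Sire or BioSimSpace function, it works on raw PDB lines instead of the internal data structure, and it doesn't renumber atom serials (so any `CONECT` records still refer to the same atoms). Whether `LEaP` and friends are happy with the result is exactly what would need testing.

```python
# Hypothetical helper, not part of Sire or BioSimSpace.
def move_hetatm_to_end(lines):
    """Move every HETATM record to just after the last ATOM or TER record.

    The relative order of the HETATM records is preserved, and trailing
    records such as CONECT or END stay after them.
    """
    het = [line for line in lines if line.startswith("HETATM")]
    rest = [line for line in lines if not line.startswith("HETATM")]

    # Index of the last ATOM/TER record among the remaining lines.
    anchor = -1
    for i, line in enumerate(rest):
        if line.startswith(("ATOM", "TER")):
            anchor = i

    if anchor == -1:
        # No ATOM or TER records at all: leave the file as it was.
        return list(lines)

    return rest[: anchor + 1] + het + rest[anchor + 1 :]
```

Applied to a file where `HETATM`s from chain `B` sit either side of a `TER`, this leaves the `ATOM` and `TER` records untouched and collects the `HETATM`s between the final `TER` and `END`.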