@Goosang-Yu

Fetching PDB files with ProteinChain.from_rcsb is very convenient. However, many PDB entries, especially large and complex structures, contain missing (unresolved) residues.

For instance, I fetched one of the Cas9 structures, "8G1I", using ProteinChain.from_rcsb.

pdb_id = "8G1I" # PDB ID of a Cas9 structure
chain_id = "A" # Chain ID of the Cas9 protein chain in the PDB structure
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id)

The sequence returned for 8G1I is shorter than the actual amino acid sequence because the missing residues are omitted.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: KYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEIASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLNAKLIRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDGKATAKYFFYSNIMNFFKTEIKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQ
Protein length: 1298

The atom array likewise contains no atoms for the missing residues.

renal_dipep_chain.atom_array.get_atom(1)
# res_id 1-3 are missing, so the first residue present is res_id 4
Atom(np.array([127.615, 117.513, 179.528], dtype=float32), chain_id="A", res_id=4, ins_code="", res_name="LYS", hetero=False, atom_name="CA", element="C")

This can be resolved by using pdbfixer to find the missing residues.

import pdbfixer

fixer = pdbfixer.PDBFixer(pdbid="8G1I")

# PDBFixer operations
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.findMissingResidues()
fixer.findMissingAtoms()

print("Missing Residues:", fixer.missingResidues)
Missing Residues: {(1, 0): ['MET', 'ASP', 'LYS'], (1, 577): ['SER', 'GLY', 'VAL', 'GLU', 'ASP', 'ARG', 'PHE', 'ASN'], (1, 701): ['VAL', 'SER', 'GLY', 'GLN', 'GLY', 'ASP'], (1, 748): ['GLU', 'ASN', 'GLN', 'THR', 'THR', 'GLN', 'LYS', 'GLY', 'GLN', 'LYS'], (1, 859): ['LEU'], (1, 864): ['THR', 'GLN'], (1, 982): ['TYR', 'LYS', 'VAL', 'TYR', 'ASP', 'VAL', 'ARG', 'LYS', 'MET', 'ILE', 'ALA', 'LYS', 'SER', 'GLU', 'GLN', 'GLU', 'ILE'], (1, 1003): ['THR', 'LEU', 'ALA', 'ASN', 'GLY', 'GLU', 'ILE', 'ARG'], (1, 1186): ['TYR', 'GLU', 'LYS', 'LEU', 'LYS', 'GLY', 'SER', 'PRO', 'GLU', 'ASP', 'ASN'], (1, 1298): ['LEU', 'GLY', 'GLY', 'ASP', 'PRO', 'LYS', 'LYS', 'LYS', 'ARG', 'LYS', 'VAL', 'MET', 'ASP', 'LYS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS', 'HIS'], (3, 0): ['DC', 'DC', 'DA', 'DG', 'DT', 'DG', 'DC', 'DG', 'DT', 'DA', 'DT', 'DA', 'DC', 'DC', 'DA', 'DG', 'DC', 'DA', 'DA', 'DA', 'DA', 'DC', 'DA', 'DC', 'DT', 'DC', 'DC']}
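
As a quick sanity check, the missing-residue counts reported above can be added to the observed length. The sketch below copies the dictionary from the output above and assumes that chain index 1 in the keys corresponds to chain "A" (the protein chain) and that index 3 is a DNA chain; `missing_in_chain` and `expected_length` are illustrative names, not part of pdbfixer or esm.

```python
# Missing residues as printed by fixer.missingResidues above.
# Keys are (chain index, residue index); values are residue names.
missing_residues = {
    (1, 0): "MET ASP LYS".split(),
    (1, 577): "SER GLY VAL GLU ASP ARG PHE ASN".split(),
    (1, 701): "VAL SER GLY GLN GLY ASP".split(),
    (1, 748): "GLU ASN GLN THR THR GLN LYS GLY GLN LYS".split(),
    (1, 859): "LEU".split(),
    (1, 864): "THR GLN".split(),
    (1, 982): ("TYR LYS VAL TYR ASP VAL ARG LYS MET ILE ALA LYS "
               "SER GLU GLN GLU ILE").split(),
    (1, 1003): "THR LEU ALA ASN GLY GLU ILE ARG".split(),
    (1, 1186): "TYR GLU LYS LEU LYS GLY SER PRO GLU ASP ASN".split(),
    (1, 1298): ("LEU GLY GLY ASP PRO LYS LYS LYS ARG LYS VAL MET ASP LYS "
                "HIS HIS HIS HIS HIS HIS").split(),
    (3, 0): ("DC DC DA DG DT DG DC DG DT DA DT DA DC DC DA DG DC DA DA DA "
             "DA DC DA DC DT DC DC").split(),
}

PROTEIN_CHAIN_INDEX = 1  # assumption: chain index 1 is chain "A"
OBSERVED_LENGTH = 1298   # length reported without fixing

# Count only the residues missing from the protein chain.
missing_in_chain = sum(
    len(names)
    for (chain_index, _), names in missing_residues.items()
    if chain_index == PROTEIN_CHAIN_INDEX
)
expected_length = OBSERVED_LENGTH + missing_in_chain

print("Missing in protein chain:", missing_in_chain)
print("Expected full length:", expected_length)
```

Under these assumptions the protein chain is missing 86 residues, giving an expected full length of 1384.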

I have added this functionality as an option (fix_pdb) to ProteinChain.from_rcsb. It can be used as follows:

pdb_id = "8G1I" # PDB ID of a Cas9 structure
chain_id = "A" # Chain ID of the Cas9 protein chain in the PDB structure
renal_dipep_chain = ProteinChain.from_rcsb(pdb_id, chain_id, fix_pdb=True)

The resulting protein chain includes the missing residue sequence information and the full length.

pro_seq = renal_dipep_chain.sequence
print("Protein sequence:", pro_seq)
print("Protein length:", len(pro_seq))
Protein sequence: MDKKYSIGLDIGTNSVGWAVITDEYKVPSKKFKVLGNTDRHSIKKNLIGALLFDSGETAEATRLKRTARRRYTRRKNRICYLQEIFSNEMAKVDDSFFHRLEESFLVEEDKKHERHPIFGNIVDEVAYHEKYPTIYHLRKKLVDSTDKADLRLIYLALAHMIKFRGHFLIEGDLNPDNSDVDKLFIQLVQTYNQLFEENPINASGVDAKAILSARLSKSRRLENLIAQLPGEKKNGLFGNLIALSLGLTPNFKSNFDLAEDAKLQLSKDTYDDDLDNLLAQIGDQYADLFLAAKNLSDAILLSDILRVNTEITKAPLSASMIKRYDEHHQDLTLLKALVRQQLPEKYKEIFFDQSKNGYAGYIDGGASQEEFYKFIKPILEKMDGTEELLVKLNREDLLRKQRTFDNGSIPHQIHLGELHAILRRQEDFYPFLKDNREKIEKILTFRIPYYVGPLARGNSRFAWMTRKSEETITPWNFEEVVDKGASAQSFIERMTNFDKNLPNEKVLPKHSLLYEYFTVYNELTKVKYVTEGMRKPAFLSGEQKKAIVDLLFKTNRKVTVKQLKEDYFKKIECFDSVEISGVEDRFNASLGTYHDLLKIIKDKDFLDNEENEDILEDIVLTLTLFEDREMIEERLKTYAHLFDDKVMKQLKRRRYTGWGRLSRKLINGIRDKQSGKTILDFLKSDGFANRNFMQLIHDDSLTFKEDIQKAQVSGQGDSLHEHIANLAGSPAIKKGILQTVKVVDELVKVMGRHKPENIVIEMARENQTTQKGQKNSRERMKRIEEGIKELGSQILKEHPVENTQLQNEKLYLYYLQNGRDMYVDQELDINRLSDYDVDHIVPQSFLKDDSIDNKVLTRSDKNRGKSDNVPSEEVVKKMKNYWRQLLNAKLITQRKFDNLTKAERGGLSELDKAGFIKRQLVETRQITKHVAQILDSRMNTKYDENDKLIREVKVITLKSKLVSDFRKDFQFYKVREINNYHHAHDAYLNAVVGTALIKKYPKLESEFVYGDYKVYDVRKMIAKSEQEIGKATAKYFFYSNIMNFFKTEITLANGEIRKRPLIETNGETGEIVWDKGRDFATVRKVLSMPQVNIVKKTEVQTGGFSKESILPKRNSDKLIARKKDWDPKKYGGFDSPTVAYSVLVVAKVEKGKSKKLKSVKELLGITIMERSSFEKNPIDFLEAKGYKEVKKDLIIKLPKYSLFELENGRKRMLASAGELQKGNELALPSKYVNFLYLASHYEKLKGSPEDNEQKQLFVEQHKHYLDEIIEQISEFSKRVILADANLDKVLSAYNKHRDKPIREQAENIIHLFTLTNLGAPAAFKYFDTTIDRKRYTSTKEVLDATLIHQSITGLYETRIDLSQLGGDPKKKRKVMDKHHHHHH
Protein length: 1384
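
To illustrate how the observed and full sequences relate, here is a minimal sketch that splices missing residues into an observed sequence. It assumes that the second element of each missingResidues key is the index, within the observed residues of the chain, before which the missing residues are inserted; this matches the output above (for example, MET, ASP, LYS at index 0 yields the MDK prefix). `splice_missing` and `THREE_TO_ONE` are hypothetical helpers written for this example, not part of pdbfixer or esm, and the toy input is not the real chain.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def splice_missing(observed: str, missing: dict[int, list[str]]) -> str:
    """Insert missing residues before the given indices of `observed`.

    An index equal to len(observed) appends residues at the C-terminus.
    Unknown residue names are written as "X".
    """
    out = []
    for i in range(len(observed) + 1):
        # Residues missing immediately before observed position i.
        for name in missing.get(i, []):
            out.append(THREE_TO_ONE.get(name, "X"))
        if i < len(observed):
            out.append(observed[i])
    return "".join(out)


# Toy example: an N-terminal gap and a C-terminal gap.
observed = "KYSIG"
missing = {0: ["MET", "ASP", "LYS"], 5: ["LEU"]}
full = splice_missing(observed, missing)
print(full, len(full))
```

This only reconstructs the sequence; the coordinates for the added residues still have to be modelled by the fixer.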

Atom information can also be retrieved with the missing residues filled in by the fixer.

renal_dipep_chain.atom_array.get_atom(1)
Atom(np.array([125.775, 112.234, 187.093], dtype=float32), chain_id="A", res_id=1, ins_code="", res_name="MET", hetero=False, atom_name="CA", element="C")