Search Examples
The Advanced Search Query Builder provides a powerful search interface for building complex scientific queries with multiple search conditions that combine different attributes, inputs, operators, and groupings. Several sample queries are provided below.
Full-Text and Attribute Search
You can use the "Full-Text" option to search for words or phrases in all attributes available on the Advanced Search menu. The "Full-Text" search allows you to quickly find results related to a particular topic, regardless of the context. However, the more general your search is, the more results it is likely to return and the more likely the results will be only loosely related to your search criteria. Note that the Advanced Search attributes include terms from the mmCIF files as well as data integrated from third-party bio-resources.
This search example queries structures having the "luciferase" term in any attribute and excludes PubMed abstracts from the search to avoid unwanted matches.
If possible, it is best to specify particular search attributes explicitly rather than using the "Full-Text" search. Use the examples below to see how the Attribute Search options can be used to perform specific searches.
Macromolecules
The polymers in PDB structures can be proteins, DNA, RNA, and DNA/RNA hybrids. Polymer Instances (a.k.a. chains) are the individual copies of distinct macromolecules. A structure may contain multiple copies of identical macromolecules.
Macromolecular Composition Search
- This search example queries structures with a single protein chain (and no others) of length between 350 and 400 residues.
- This search example queries structures that contain an RNA polymer, regardless of what other polymer types the structures may or may not contain.
Macromolecule Type Search
- This search example queries for all membrane protein structures in the PDB (as annotated by the resources PDBTM, MemProtMD, OPM, or mpstruc).
Modified Residues Search
Modified residues are non-standard polymeric components (i.e. non-standard amino acids in protein sequences or non-standard nucleotides in nucleic acid sequences).
This example shows how to find structures with modified residues.
Chimeric Macromolecular Entities Search
Polymeric sequences in the PDB are sometimes engineered by fusing sequence fragments from different organisms. These are known as chimeric entities. This search will find any PDB entry containing chimeric entities.
Sequence Similarity and Alignments
The Sequence Similarity Search can be used to find similar protein and nucleic acid sequences in the PDB archive. When using the Sequence Similarity Search, select Polymer Entities from the Display Results as menu to include a graphical display of the pairwise sequence alignment and its statistics on the search results page.
The E-Value and Sequence Identity Cutoff (%) filters help remove irrelevant or distantly related sequences. By default, the search matches the widest range of sequences. For example, the default E-Value = 1 means that results may contain sequences that reach a similar score simply by chance. The default Identity Cutoff (%) = 0 means that results may contain sequences with low sequence similarity.
This search query finds sequences similar to the "N-acetyltransferase MPR1" protein from baker's yeast. The result includes a "Gcn5-related N-acetyltransferase" protein that is related but exhibits poor sequence homology to the query sequence.
Refer to the Protein Sequence Alignment View page for complete documentation of the sequence alignment display.
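The effect of the two cutoffs can be sketched as a simple filter over a list of hits. The `Hit` record, its field names, and the sample values below are hypothetical and do not reflect the actual output of the search service. The sketch assumes that a hit is kept when its E-value is at or below the E-value cutoff and its identity is at or above the identity cutoff.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    entity_id: str       # hypothetical identifier, not a real PDB entity ID
    evalue: float        # expected number of chance matches with this score
    identity_pct: float  # sequence identity to the query, in percent


def filter_hits(hits, evalue_cutoff=1.0, identity_cutoff_pct=0.0):
    """Keep hits that pass both cutoffs (defaults mirror the permissive settings)."""
    return [
        h for h in hits
        if h.evalue <= evalue_cutoff and h.identity_pct >= identity_cutoff_pct
    ]


# Made-up hits for illustration only.
hits = [
    Hit("hit_1", 1e-50, 98.0),
    Hit("hit_2", 1e-4, 27.0),
    Hit("hit_3", 0.8, 19.0),
]

permissive = filter_hits(hits)  # default cutoffs keep everything
strict = filter_hits(hits, evalue_cutoff=1e-3, identity_cutoff_pct=30.0)

print([h.entity_id for h in permissive])
print([h.entity_id for h in strict])
```

With the default values every hit passes, which is why tightening one or both cutoffs is the usual way to remove distant or chance matches.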
Sequence Motif Search
The Sequence Motif Search finds protein and nucleic acid sequences that match a sequence motif. A Sequence Motif can be an exact sequence or a sequence pattern expressed in regular expression syntax.
- The sequence motif search allows searching for arbitrarily short sequence fragments, for example:
  NPPTP
- The motif search supports wildcard queries by placing an 'X' at the variable residue position. A query for SH3 domains using the consensus sequence -X-P-P-X-P (where X is a variable residue and P is Proline) can be expressed as:
  XPPXP
- Ranges of variable residues are specified by the {n} notation, where n is the number of variable residues. To query a motif with seven variable residues between residues W and G and twenty variable residues (represented by . in regular expressions) between G and L, use the following notation:
  W.{7}G.{20}L
- Variable ranges are expressed by the {n,m} notation, where n is the minimum and m the maximum number of repetitions. For example, the zinc finger motif that binds Zn in a DNA-binding domain can be expressed as:
  C.{2,4}C.{12}H.{3,5}H
- The '^' operator searches for sequence motifs at the beginning of a protein sequence. The following two queries find sequences with N-terminal Histidine tags:
  ^HHHHHH or ^H{6}
- Square brackets specify alternative residues at a particular position. The Walker (P loop) motif that binds ATP or GTP can be expressed as:
  [AG]....GK[ST]
  A or G is followed by 4 variable residues, then G and K, and finally S or T.
When using the Sequence Motif Search, select Polymer Entities from the Display Results as menu to include the sequence positions where the motif is found in the matched sequences.
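The motif notations described above can be tried out locally with Python's standard `re` module. This is only a sketch of how the patterns behave as regular expressions, not of how the search service is implemented. The helper `simple_to_regex` and the test sequences are made up for illustration.

```python
import re


def simple_to_regex(motif: str) -> str:
    """Translate an 'X' wildcard motif into a regular expression (X becomes '.')."""
    return motif.replace("X", ".")


# Patterns taken from the examples above.
his_tag = re.compile(r"^H{6}")                 # six histidines at the N-terminus
walker = re.compile(r"[AG]....GK[ST]")         # Walker (P loop) motif
sh3_site = re.compile(simple_to_regex("XPPXP"))

# Made-up sequences for illustration only.
tagged_seq = "HHHHHHSSGLVPRGSH"
untagged_seq = "MHHHHHHSSGLVPRGSH"
p_loop_seq = "MTEYKGAGSVGKSALT"
proline_rich_seq = "AAPPLPKK"

walker_match = walker.search(p_loop_seq)
sh3_match = sh3_site.search(proline_rich_seq)

print(bool(his_tag.search(tagged_seq)), bool(his_tag.search(untagged_seq)))
print(walker_match.group(), walker_match.span())
print(sh3_match.group(), sh3_match.span())
```

The reported spans are 0-based and end-exclusive, following Python conventions, so they would need shifting by one to compare with 1-based sequence positions.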
Structure Motif Search
This search example queries for CSMs from AlphaFold that contain the Serine protease catalytic triad motif (as seen in PDB entry 1a0j).
The steps for running this query are listed here:
- Begin by going to the structure summary page of the PDB entry 1a0j.
- Open the 1D-3D view for this structure by clicking on the link available below the image on this page.
- Based on the annotations presented, find and click on the active site residues in the 1D (sequence) section while holding the Shift key. This should allow selection of all 3 active site residues simultaneously, and these residues should be highlighted in the 3D panel.
- Click on the Toggle Expanded Viewport button in the Toggle menu (on the right of the 3D canvas) to gain access to all Mol* functionalities.
- Click on the Submit Search options under structure motif search in the Controls panel (on the right of the page).
- This search will run on all experimental structures and CSMs. Refine the search results by selecting AlphaFold under CSM Source Database in the left-hand Refinements menu.
Assemblies
The biological assembly is the arrangement of macromolecules in the structure that is believed to be the biologically meaningful molecular assembly.
Assembly Composition
Below are examples of searches that query biological assemblies with different compositional features:
- This search example queries the total number of polymers in the biological assembly, regardless of whether that includes multiple identical molecules or different molecules.
- This search example queries biological assemblies with a single protein chain (and no others) of length between 350 and 400 residues.
- This search example queries biological assemblies that contain exactly 24 identical chains. For example, the biological assembly of the ferritin 1aew is composed of 24 copies of a single polymer chain.
- This search example queries for immunoglobulin Fab fragments bound to a dimeric antigen (i.e., the assembly should have 2 Fab heavy chains, 2 Fab light chains, 2 antigen chains) using a stoichiometry-based search (A2B2C2) AND a structure-based search for a Fab light chain (e.g., using the PDB structure 1bj1, chain A).
- This search example queries for assemblies in the PDB that contain at least one heavy water (or DOD).
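The stoichiometry strings used in the examples above (A2B2C2 for the Fab-antigen complex, and 24 identical chains for ferritin) can be derived from the number of chains per distinct entity. The sketch below assumes that letters are assigned in order of decreasing copy number and that every count is written out. The function name and chain labels are hypothetical, and the exact formatting used by the search may differ.

```python
import string
from collections import Counter


def stoichiometry(chain_entities):
    """Build a stoichiometry string from one entity label per chain.

    Letters A, B, C, ... are assigned by decreasing copy number
    (an assumption made for this sketch).
    """
    counts = sorted(Counter(chain_entities).values(), reverse=True)
    if len(counts) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 distinct entities")
    return "".join(
        f"{letter}{count}"
        for letter, count in zip(string.ascii_uppercase, counts)
    )


# Hypothetical chain lists for illustration.
fab_antigen_complex = ["heavy", "heavy", "light", "light", "antigen", "antigen"]
ferritin_like = ["subunit"] * 24

print(stoichiometry(fab_antigen_complex))
print(stoichiometry(ferritin_like))
```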
Ligands
Ligands are chemical substances that form a complex with larger biomolecule(s).
Free vs. Polymeric Ligands
Most ligands are considered “standalone ligands” that interact non-covalently with macromolecules. Less frequently, ligands can be covalently linked to macromolecules or other heterogen groups.
Find structures with adenosine triphosphate (ATP) where:
- ATP is present as a standalone ligand
- ATP is present as a covalently linked ligand
Structure-Ligand Complexes
This search example queries complexes with ligands of any type.
You can also narrow down this search to include only complexes with specific features. For example:
- This search example queries the protein-ligand complexes solved using the X-ray diffraction experimental technique.
- This search example queries the complexes of proteins from Staphylococcus aureus (strain N315) with ligands.
- This search example queries the DNA-ligand complexes from structures with the following experimental details:
  - Experimental method: X-Ray diffraction
  - Refinement X-Ray Resolution: 0-2
  - Refinement R-Factors (R Work): 0-0.2
  - Refinement R-Factors (R Free): 0-0.214
  - Has Experimental data: Yes
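The combination of conditions in the last example can be pictured as a set of predicates joined with AND. The sketch below is a local illustration only: it does not call any RCSB service, and the record fields (`method`, `resolution`, `r_work`, `r_free`, `has_experimental_data`), the helper names, and the sample values are all hypothetical.

```python
def equals(field, value):
    """Condition: the field must equal the given value."""
    return lambda record: record.get(field) == value


def in_range(field, low, high):
    """Condition: the field must be present and lie within [low, high]."""
    def check(record):
        value = record.get(field)
        return value is not None and low <= value <= high
    return check


def all_of(*conditions):
    """Group conditions with a logical AND."""
    return lambda record: all(cond(record) for cond in conditions)


query = all_of(
    equals("method", "X-RAY DIFFRACTION"),
    in_range("resolution", 0, 2),
    in_range("r_work", 0, 0.2),
    in_range("r_free", 0, 0.214),
    equals("has_experimental_data", True),
)

# Made-up records for illustration only.
records = [
    {"id": "rec_a", "method": "X-RAY DIFFRACTION", "resolution": 1.8,
     "r_work": 0.18, "r_free": 0.21, "has_experimental_data": True},
    {"id": "rec_b", "method": "X-RAY DIFFRACTION", "resolution": 2.5,
     "r_work": 0.19, "r_free": 0.21, "has_experimental_data": True},
    {"id": "rec_c", "method": "X-RAY DIFFRACTION", "resolution": 1.5,
     "r_work": 0.19, "r_free": 0.23, "has_experimental_data": True},
    {"id": "rec_d", "method": "ELECTRON MICROSCOPY", "resolution": 1.9,
     "has_experimental_data": True},
]

matches = [r["id"] for r in records if query(r)]
print(matches)
```

A record is returned only when every condition holds, so loosening any single range widens the result set.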
Ligand Of Interest (LOI)
Structures may include small molecules annotated as "ligands of interest", meaning that a small molecule is a subject of the author’s research.
This search example queries structures that contain "ligand(s) of interest".
Binding Affinity
You can search for structure-ligand complexes with associated binding affinity data coming from BindingDB and PDBbind-CN resources.
Binding affinity measurements are of one of the following types:
- IC50: the concentration of ligand that reduces enzyme activity by 50%;
- EC50: the concentration of compound that generates a half-maximal response;
- Kd: dissociation constant;
- Ka: association constant;
- Ki: enzyme inhibition constant;
- ΔG: Gibbs free energy of binding (for association reaction);
- ΔH: change in enthalpy associated with a chemical reaction;
- -TΔS: change in entropy associated with a chemical reaction.
The concentration constants (IC50, EC50) and binding constants (Ki, Kd) are given in nM; the thermodynamic parameters (ΔG, ΔH, -TΔS) are given in kJ/mol; the association binding constant (Ka) is given in M⁻¹.
For example, this search returns structure-ligand complexes with an EC50 = 2 nM, e.g. the Thyroid Hormone Receptor from the 3GWS structure has an EC50 of 2 nM for 3,5,3'TRIIODOTHYRONINE (T3).
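The units listed above can be related with a short calculation. The sketch below converts a dissociation constant given in nM to molar units, derives the association constant Ka = 1/Kd in M⁻¹, and estimates ΔG of association in kJ/mol from ΔG = RT ln(Kd), assuming a 1 M standard state and a temperature of 298.15 K. The function names and the 2 nM input are illustrative only. Values stored in BindingDB or PDBbind-CN come from experiments under their own conditions and need not match this estimate.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def nm_to_molar(value_nm: float) -> float:
    """Convert a concentration from nM to M."""
    return value_nm * 1e-9


def ka_from_kd(kd_nm: float) -> float:
    """Association constant in M^-1 from a dissociation constant in nM."""
    return 1.0 / nm_to_molar(kd_nm)


def delta_g_from_kd(kd_nm: float, temperature_k: float = 298.15) -> float:
    """Estimate the free energy of association in kJ/mol.

    Assumes a 1 M standard state; the default temperature is an assumption.
    """
    return R_KJ_PER_MOL_K * temperature_k * math.log(nm_to_molar(kd_nm))


kd_nm = 2.0  # hypothetical dissociation constant
print(f"Ka = {ka_from_kd(kd_nm):.3g} M^-1")
print(f"dG = {delta_g_from_kd(kd_nm):.2f} kJ/mol")
```

A tighter binder (smaller Kd) gives a larger Ka and a more negative ΔG. Note that IC50 and EC50 values depend on assay conditions and are not equilibrium constants, so this relation should not be applied to them directly.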
Chemical Components
Chemical Components include all residues (present in protein or nucleic acid sequences), small molecules (ligands) as well as peptide-like antibiotic and inhibitor molecules found in the PDB archive.
- This search example queries structures containing a particular chemical component (e.g. adenosine triphosphate) using the chemical component ID found in the Chemical Component Dictionary. Examples:
  - ATP - for adenosine triphosphate
  - HEM - for heme group
  - MSE - for Selenomethionine when it is not part of the protein polymers
  - ZN - for zinc ion
  - F - for fluoride ion
- This search example queries structures containing a particular molecule (e.g. biotin), using the molecule name.
Drug Search
DrugBank provides a variety of information about small molecule drugs that can be used to search the PDB archive. This includes the drug target name, the brand name, the classification group (approved, investigational, withdrawn, etc.), and market availability (in the US, Europe, and Canada). These attributes can be used to query the archive for specific drug molecules.
Synonyms and Chemical identifiers
Search by drug name as annotated by DrugBank is possible by using the Synonyms field under the Chemical Components section.
This search example queries structures containing the drug aciclovir (or acyclovir), which maps to chemical component AC2.
By changing the "Return type" to Molecular Definition you can find the small molecule drug that matches the query. So this search finds the drug "aciclovir" in the chemical component dictionary.
Approval and Market Availability
Small molecule drugs can be searched based on their availability on one of the following markets: US, EU, Canada. All new drugs in the U.S. should be shown to be safe and effective for their intended use prior to marketing, and FDA approval is required.
Use this search to find all small molecule drugs available in the PDB archive that were approved for use on the US market at any point in history.
This search example includes only those drugs that are currently on the market; this is done by leaving the Drug Marketing End field empty.
To find the structures in which any drug, or a specific drug molecule, is bound to biological macromolecules, set up the query as appropriate and change the "Return type" to Structure:
This example queries for all structures that have an FDA-approved drug bound.
This search example queries for all structures in the PDB that have the specific small molecule drug Gleevec (or STI) bound.
Withdrawn
Following their approval for use in a clinical setting, some drugs may be withdrawn due to harmful side effects. For example, the painkiller Vioxx, also known as Rofecoxib (RCX in PDB entry 5kir), was recalled after the discovery of an increased risk of heart attack and stroke.
This search example finds all withdrawn drugs.
Combined Queries with DrugBank information
You may combine the search for FDA-approved drugs with other annotations that are relevant in the context of a specific macromolecular structure, using the Structure Attributes section.
For example, use this search to find all approved drugs that were annotated as a Ligand of Interest (LOI). Note that setting return type to Structure won’t guarantee that an approved drug and LOI are the same component.
Publications
Search for PDB structures that do not have a publication associated with them.
This search example queries for structures in the PDB that have the primary publication journal listed as "To be published".
Computed Structure Models (CSMs)
As of August 2022, CSMs predicted by AlphaFold2 (Jumper et al., 2021) and RoseTTAFold (Baek et al., 2021) are available from RCSB.org for query, visualization, and analysis.
No Experimental Structures available:
To search for mouse proteins that have CSMs but do not have a corresponding experimental structure, approach the problem as follows:
1. Search for all mouse sequences
2. Group by UniProt ID
3. Order results by group size (starting from smallest)
The groups of size 1 that are listed first mostly contain only models with no experimental data.
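The three steps above can be sketched locally as a group-and-sort over a list of records. The record layout, the identifiers, and the `computed` flag below are hypothetical placeholders and do not correspond to real entries or to the format of the search results.

```python
from collections import defaultdict

# Made-up records for illustration only.
records = [
    {"id": "MODEL_1", "organism": "Mus musculus", "uniprot": "UNIPROT_A", "computed": True},
    {"id": "MODEL_2", "organism": "Mus musculus", "uniprot": "UNIPROT_B", "computed": True},
    {"id": "EXPT_1", "organism": "Mus musculus", "uniprot": "UNIPROT_B", "computed": False},
    {"id": "EXPT_2", "organism": "Mus musculus", "uniprot": "UNIPROT_B", "computed": False},
    {"id": "MODEL_3", "organism": "Homo sapiens", "uniprot": "UNIPROT_C", "computed": True},
]


def group_by_uniprot(records, organism):
    """Steps 1-3: filter by organism, group by UniProt ID, order by group size."""
    groups = defaultdict(list)
    for rec in records:
        if rec["organism"] == organism:
            groups[rec["uniprot"]].append(rec)
    # Smallest groups first; ties are broken by UniProt ID for stable output.
    return sorted(groups.items(), key=lambda item: (len(item[1]), item[0]))


ordered = group_by_uniprot(records, "Mus musculus")

# Groups in which every member is a computed model have no experimental structure.
models_only = [
    uniprot_id
    for uniprot_id, members in ordered
    if all(m["computed"] for m in members)
]

print([(uid, len(members)) for uid, members in ordered])
print(models_only)
```

Checking that every member of a group is a computed model makes the "mostly" in the statement above explicit: a group of size 1 could also be a lone experimental structure.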
Predicted structure confidence
Query for high-quality (pLDDT > 90) computed structure models of human proteins.