Multifile patent sequence searching on STN
Agenda
• Sequence searchable databases on STN®
• Step-by-step through a multifile BLAST search
• Multifile post-processing using STN Express
• Overview of the search results
• Summary and resources
See also: Sequence Basics e-Seminar (June 2010): http://www.stn-international.com/Sequence_Basics_Seminar.html
STN sequence searchable databases
• DGENE
– Thomson Reuters GENESEQ™
– Value-added patent sequence data from around the globe
• USGENE
– The USPTO Genetic Sequence Database
– All available sequence data from the USPTO
• PCTGEN
– WIPO/PCT Patent Application Biosequences
– All available e-published sequence data from WIPO
• CAS REGISTRY
– Chemical Abstracts Service (CAS) REGISTRY
– Worldwide value-added patent and non-patent sequences
DGENE, USGENE and PCTGEN offer three sequence search modes
• Sequence Code Match (motif) searching
– Using the RUN GETSEQ command
• BLAST similarity
– Using the RUN BLAST command
• FASTA similarity
– Using the RUN GETSIM command
Note: this e-Seminar covers BLAST.
CAS REGISTRY/CAplus offers two sequence search modes
• Sequence Code Match (motif) searching
– Using the Search (=> S) command
• BLAST similarity
– Using a separate Graphic User Interface
Note: this e-Seminar covers BLAST.
Multifile patent sequence searching
Search Question: Find all patents that disclose Homo sapiens D-amino-acid oxidase (NCBI NP_001908), or similar sequences (≥ 80%):
MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL
(Search conducted on 7th July 2010)
Multifile search strategy
1) RUN BLAST in DGENE, USGENE and PCTGEN using offline BATCH mode
2) Merge, organize by patent family, and display DGENE, USGENE and PCTGEN results
3) Repeat the search using CAS REGISTRY BLAST
4) Retrieve, identify, and display unique CAS REGISTRY BLAST CAplus records
5) Post-process DGENE, USGENE and PCTGEN results using the STN Express Table Tool
6) Post-process unique REGISTRY BLAST results using the BLAST Report Tool
SAVE, UPLOAD and VERIFY the query
• Prepare and save the query as a plain text file in a suitable text editor, e.g. Windows Notepad
From the Discover! button menu.
SAVE, UPLOAD and VERIFY the query (cont.)
(a) Click Upload Sequence
(b) Choose the query file
(c) Select the STN database
The sequence becomes a Query L-number in the database of choice for use with RUN BLAST.
SAVE, UPLOAD and VERIFY the query (cont.)
=> FILE USGENE
=> UPL R BLAST
Uploading C:\. . . .\NP_001908 Homo sapiens DAO.txt
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1 ANSWER 1 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
LQUE MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSD
PNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL
The sequence query is now ready for searching directly in DGENE, USGENE, or PCTGEN using the L-number (L1).
Commands in red are automatically run by the STN Express Sequence Query Upload wizard.
Verify the sequence was uploaded successfully with D LQUE.
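Before uploading, it can help to check that the saved query file really is plain text containing only residue codes. The following is a minimal sketch of such a check, not part of STN Express: the function names, the 60-residue line width, and the restriction to the 20 standard amino-acid codes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the query uses only the 20 standard one-letter amino-acid codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein_query(raw: str) -> str:
    """Drop whitespace, digits and punctuation, upper-case, and reject unexpected letters."""
    seq = "".join(ch for ch in raw.upper() if ch.isalpha())
    unexpected = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if unexpected:
        raise ValueError(f"unexpected residue codes: {unexpected}")
    return seq


def save_query(seq: str, path: str, width: int = 60) -> None:
    """Write the sequence as plain ASCII text, wrapped at `width` residues per line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")
```

A query copied from a numbered listing, for example "1 mrvvv igagv", would be reduced to "MRVVVIGAGV" before saving.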
RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode
=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:05:31 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
=> RUN BLAST L1 /SQP -F F BATCH
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
TO BE NOTIFIED WHEN THIS BATCH SEARCH IS COMPLETE, PLEASE ENTER YOUR EMAIL ADDRESS (MAX. 50 CHARS) OR "NONE"
INPUT: OR (END):[email protected]
BLAST Version 2.2
The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). . . .
BATCH PROCESSING STARTED FOR DAOP
Add BATCH to the end of a RUN BLAST command to search in offline batch search mode.
Enter a valid email address to be notified when the BATCH search is completed.
New!
RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode (cont.)
=> FILE USGENE
=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .
=> FILE PCTGEN
=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .
=> LOG H
SESSION WILL BE HELD FOR 120 MINUTES
STN INTERNATIONAL SESSION SUSPENDED AT 17:07:14 ON 07 JUL 2010
Note: DGENE, USGENE and PCTGEN BLAST searches can be run in parallel using BATCH mode.
Turn the Low Complexity Filter off with the syntax: /SQP -F F
Tip: use LOGOFF HOLD (LOG H) to be able to return to the same STN session within two hours.
=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:11:25 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
=> RUN GETBATCH DAOP
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
Database DGENE AA
Posted date: Jun 25, 2010 11:33 PM
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L2 RUN STATEMENT CREATED
L2 19 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
Retrieve the BATCH search results
In this example, 80% of the Query Self Score is used to select out just the most relevant results (L2).
Use RUN GETBATCH to retrieve completed BATCH search results.
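The 80% cut-off can be read as simple arithmetic on the query self score. The sketch below assumes the threshold is applied as score >= self score × percent / 100; the slides do not state how STN rounds the percentage, so treat the boundary behaviour as an assumption. The function names and the 500-bit hit are hypothetical; 731 is the self score and 728 and 726 are hit scores reported in the displays in this seminar.

```python
def min_score_for_percent(self_score: float, percent: float) -> float:
    """Smallest BLAST score that reaches `percent` of the query self score."""
    return self_score * percent / 100.0


def keep_hits(hits, self_score, percent):
    """Keep (name, score) pairs whose score meets the percent-of-self-score cut-off."""
    cutoff = min_score_for_percent(self_score, percent)
    return [(name, score) for name, score in hits if score >= cutoff]


# Self score 731 with an 80% threshold gives a cut-off of about 584.8 bits.
cutoff = min_score_for_percent(731, 80)

# The 500-bit entry is a made-up example of a hit that would be dropped.
example_hits = [("hit_a", 731), ("hit_b", 728), ("hit_c", 726), ("hypothetical", 500)]
kept = keep_hits(example_hits, 731, 80)
```

Under this reading, every hit displayed later in the seminar (scores 726 to 731) sits well above the cut-off.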
=> FILE USGENE
=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L3 RUN STATEMENT CREATED
L3 14 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
=> FILE PCTGEN
=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%) ENTER (ALL) OR ? :80%
L4 RUN STATEMENT CREATED
L4 3 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
Retrieve the BATCH search results (cont.)
Use RUN GETBATCH to retrieve completed BATCH search results.
Merge the results into a single L-number
=> SET DUPORDER FILE
SET COMMAND COMPLETED
=> DUP IDE L2 L3 L4
FILE 'DGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS
FILE 'USGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 SEQUENCEBASE CORP
FILE 'PCTGEN' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 WIPO
PROCESSING COMPLETED FOR L2
PROCESSING COMPLETED FOR L3
PROCESSING COMPLETED FOR L4
L5 36 DUP IDE L2 L3 L4 (INCLUDES 0 SETS OF DUPLICATES)
ANSWERS '1-19' FROM FILE DGENE
ANSWERS '20-33' FROM FILE USGENE
ANSWERS '34-36' FROM FILE PCTGEN
=> SOR IDENT D
PROCESSING COMPLETED FOR L5
L6 36 SOR L5 IDENT D
SET DUPORDER FILE ensures that multifile records merged using DUP IDE are organized by database (file).
DUPLICATE IDENTIFY (DUP IDE) is used here to create a single multifile L-number (L5).
The multifile L-number (L5) can be sorted by BLAST SCORE, or Percent Identity (IDENT).
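The merge-then-sort idea can be pictured outside STN as concatenating the per-file answer sets in file order and then sorting by percent identity, highest first. This is only an illustrative sketch: the record layout, the identifiers, and the assumption that ties keep their file order are not taken from STN documentation.

```python
def merge_and_sort(answer_sets):
    """Merge per-file hit lists in the given file order, then sort by identity, descending.

    answer_sets: list of (file_name, hits); each hit is a dict with an 'ident' key.
    """
    merged = []
    for file_name, hits in answer_sets:
        for hit in hits:
            merged.append({**hit, "file": file_name})
    # sorted() is stable, so hits with equal identity keep their file order.
    return sorted(merged, key=lambda h: h["ident"], reverse=True)


# Hypothetical records, for illustration only.
sets = [
    ("DGENE", [{"id": "d1", "ident": 99}, {"id": "d2", "ident": 100}]),
    ("USGENE", [{"id": "u1", "ident": 100}]),
    ("PCTGEN", [{"id": "p1", "ident": 99}]),
]
ordered_ids = [h["id"] for h in merge_and_sort(sets)]
```

Here the two 100% hits come first, DGENE before USGENE, followed by the 99% hits in the same file order.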
New!
Review multifile answers with a free-of-charge format including alignment
=> D L6 TRIAL SCORE ALIGN 1-36; FILE STNGUIDE
L6 ANSWER 1 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
AN AAO23074 Protein DGENE
TI Determining a genotype of an individual for preparing a composition for treating schizophrenia by determining the identity of a nucleotide at a biallelic marker of the D-amino acid oxidase gene of the polynucleotide in a sample -
DESC Human D-amino acid oxidase wild-type protein.
KW Biallelic marker; D-amino acid oxidase; DAO; neuroleptic; CNS disorder; movement; Parkinson's disease; Huntington's; motor neurone; Alzheimer's; mood; unipolar depression; bipolar; . . . .
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .
Query Self Scoreand percentage.
Review answers with a free-of-charge format including alignment (cont.)
L6 ANSWER 4 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
MTY Protein
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
BLAST Percent Identity (IDENT).
Review answers with a free-of-charge format including alignment (cont.)
L6 ANSWER 28 OF 36 PCTGEN COPYRIGHT 2010 WIPO on STN
TI ORGAN-SPECIFIC PROTEINS AND METHODS OF THEIR USE
MTY PRT
SQL 347
SCORE 728 99% of query self score 731
BLASTALIGN
Query = 347 letters
Length = 347
Score = 728 bits (1879), Expect = 0.0
Identities = 346/347 (99%), Positives = 346/347 (99%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
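When a captured transcript is processed outside STN Express, the "Identities = matched/length (percent%)" line of each alignment can be pulled out with a regular expression. This is a minimal sketch under the assumption that the line always has the form shown in the displays above; the function name and pattern are illustrative and not part of any STN tool.

```python
import re

# Matches e.g. "Identities = 346/347 (99%)" as shown in the alignment displays.
IDENT_RE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)\s*\((\d+)%\)")


def parse_identities(text: str):
    """Return (matched, aligned_length, displayed_percent), or None if no match."""
    m = IDENT_RE.search(text)
    if m is None:
        return None
    matched, length, percent = (int(g) for g in m.groups())
    return matched, length, percent


line = "Score = 728 bits (1879), Expect = 0.0 Identities = 346/347 (99%), Positives = 346/347 (99%)"
matched, length, percent = parse_identities(line)
exact_percent = 100 * matched / length  # about 99.7
```

Note that 346/347 is about 99.7% yet is displayed as 99%, which suggests the displayed figure is truncated rather than rounded; keeping the raw counts avoids losing that distinction.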
Ensure Capture Session is on to record a transcript for use in post-processing
Note: Check the Capture Retrospectively box to capture the session so far, as well as the session from this point forwards.
Use the STN Express 8.4 Patent Family Manager wizard to display the results
Access the Patent Family Manager wizard from the Discover! menu.
Choose a bibliographic display format with alignment for the first (best) hit, and a free-of-charge format with alignment for the rest of the sequences in each patent family group.
=> FSORT L6
. . . .
L7 36 FSO L6
11 Multi-record Families Answers 1-33
Family 1 Answers 1-5
Family 2 Answers 6-8
Family 3 Answers 9-10
Family 4 Answers 11-12
Family 5 Answers 13-14
Family 6 Answers 15-16
Family 7 Answers 17-18
Family 8 Answers 19-25
Family 9 Answers 26-27
Family 10 Answers 28-31
Family 11 Answers 32-33
3 Individual Records Answers 34-36
0 Non-patent Records
The patent family manager begins by organising the results using FSORT...
In this example, 14 patent family groups (i.e. 11 + 3) are retrieved.
Commands in RED are those issued automatically by the STN Express Patent Family Manager.
FSORT organizes the patent sequence records by Publication, Application, Related, and Priority numbers.
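One way to picture grouping records by shared publication, application, related, or priority numbers is as connected components: two records belong to the same group if they share any number, directly or through a chain of other records. The sketch below uses a small union-find for this. It is an illustration of the idea only and is not claimed to be how FSORT is implemented; the record identifiers and patent numbers are hypothetical.

```python
def group_by_shared_numbers(records):
    """Group record ids that share at least one patent number, directly or transitively.

    records: dict mapping record id -> set of patent numbers.
    Returns a sorted list of sorted id lists, one list per group.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # patent number -> first record id that carried it
    for rid, numbers in records.items():
        for num in numbers:
            if num in first_seen:
                parent[find(rid)] = find(first_seen[num])
            else:
                first_seen[num] = rid

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical records: A and B share a priority number, B and D share a publication number.
example = {
    "A": {"WO-1", "AU-1"},
    "B": {"US-1", "AU-1"},
    "C": {"US-2"},
    "D": {"US-1"},
}
families = group_by_shared_numbers(example)
```

In this toy data set, A, B and D form one multi-record family and C remains an individual record, mirroring the split between multi-record families and individual records in the FSORT output.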
=> DIS L7 PFAM=7 1 BIB,SQL,SCORE,IDENT,ALIGN
L7 ANSWER 17 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
FAMILY7
AN AEL25470 protein DGENE
TI Identifying compound that reduce/inhibit internal ribosome . . . .
IN Fear M
PA (TELE-N) TELETHON INST CHILD HEALTH RES.
PI WO 2006102720 A1 20061005 197
AI WO 2006-AU435 20060331
PRAI AU 2005-901574 20050331
PSL Disclosure; SEQ ID NO 18
LA English
OS 2006-747347 [76]
CR N-PSDB: AEL25469
PC-NCBI: gi30446
PC-SWISSPROT: P14920
DESC Reporter protein SEQ ID NO:18.
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)
. . . .
. . . .
...and then continues by displaying the family groups in the specified formats
Commands in RED are those issued automatically by the STN Express Patent Family Manager.
=> DIS L7 PFAM=7 2-TOT TRIAL,SCORE,IDENT,ALIGN
L7 ANSWER 18 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
FAMILY7
TI Isolation of Inhibitors of IRES-Mediated Translation (PublishedApplication)
DESC Homo Sapiens Protein; sequence 18 of 148
MTY Protein
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
. . . .
...and then continues by displaying the family groups in the specified formats (cont.)
This USGENE hit is in the same family as the DGENE record on the previous slide (FAMILY 7).
=> DIS L7 34-36 BIB,SQL,SCORE,IDENT,ALIGN
L7 ANSWER 34 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
AN 20060275794.63099 Protein USGENE
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
IN Carrino John (San Diego, CA); Liang Feng (San Diego, CA)
PA Invitrogen Corporation (Carlsbad CA)
PI US 20060275794 A1 20061207
AI US 2006-371354 20060307
DT Patent
SQL 347
SCORE 731 100% of query self score 731
IDENT 100%
BLASTALIGN
Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
. . . .
...and then continues by displaying the family groups in the specified formats (cont.)
This USGENE record is the first of the 3 “individual records” in the FSORT answer set (L7).
Typical steps of CAS REGISTRY BLAST
1. Launch BLAST
2. Search the sequence
3. Examine and evaluate alignment/relevance of sequence answers
4. Display STN data on sequences – REGISTRY
5. Display STN data on sequences – CAplus℠
– Limit CAplus results, if necessary
– Display CAplus data (references and HITRN)
6. Post-process BLAST alignment data
Launch CAS REGISTRY BLAST
• The Result Set Manager is the starting point
• To begin a new sequence search
• To review results of previous sequence searches
Input the search query
• Sequences can be input by:
– Copy/paste
– Read from a file
– Recall a previously searched sequence within the same session
• Sequence line numbers do not interfere with the search.
Select the BLAST program
The following programs are most typically run:
• BLASTn for nucleotides
• BLASTp for proteins/peptides
Verify BLAST settings
Default values have been set to optimize sequence searches for researchers.
Recommended settings for patent searches:
• Low Complexity Filtering – unchecked
• Max No. of Answers – 1000
Evaluate the alignment report
The minus sign indicates that the alignment details are shown. Details such as the sequence length, score, and percent identity are available.
Select sequences of interest
Sequences can be selected:
• In groups, using the color bar in the Alignment Scores
• Individually, by selecting the check box
• To transfer the sequence data to STN, click the Get STN Data button.
Get STN Data and Save alignments (.xss)
The alignment data is saved in STN Express Saved Sequences (.xss) format.
Alignment data needs to be transferred for post-processing.
Transfer sequences to STN
Display sequences if desired.
• Log on to STN; a REGISTRY search of the sequences then runs automatically.
• Results display can be accomplished using either Discover! wizards or command line input.
• Note: Type END or click Cancel to get out of the “Display Wizard”. You can turn off the “Display Wizard” in Preferences.
Display additional CAplus answers including the HITRN for alignment post-processing
=> FILE HCAPLUS
FILE 'HCAPLUS' ENTERED AT 17:25:10 ON 07 JUL 2010
COPYRIGHT (C) 2010 AMERICAN CHEMICAL SOCIETY (ACS)
=> S L12 AND PATENT/DT
L13 12 L12 AND PATENT/DT
=> TRANSFER L6 PN 1-
L14 TRANSFER L6 1- PN : 20 TERMS
L15 29 L14
ALL TERMS IN L14 RETRIEVED.
=> S L13 NOT L15
L16 2 L13 NOT L15
=> D BIB HITRN 1-2
The 44 REGISTRY records (L12) correspond to 12 HCAplus patent records (L13).
Transfer Publication Numbers (PN) from DGENE/USGENE/PCTGEN (L6) to find corresponding HCAplus records (L15).
In this example, 2 additional, highly relevant references have been found by including the REGISTRY/HCAplus search (L16).
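The NOT step is a set difference: references found through REGISTRY/HCAplus (L13) minus those already reachable from the DGENE/USGENE/PCTGEN publication numbers (L15). The sketch below reproduces only the counts from the session (12, 29 and 2); the record identifiers are hypothetical placeholders, not real accession numbers.

```python
def unique_references(registry_patents, transferred_hits):
    """Return references present in the first collection but not in the second."""
    return sorted(set(registry_patents) - set(transferred_hits))


# Hypothetical identifiers chosen so the set sizes match the session: 12 and 29.
l13 = [f"rec{i:02d}" for i in range(12)]      # 12 HCAplus patent records
l15 = [f"rec{i:02d}" for i in range(2, 31)]   # 29 records found via transferred PNs
l16 = unique_references(l13, l15)             # records unique to REGISTRY/HCAplus
```

With these placeholder sets, ten of the twelve records overlap and two remain, matching the size of L16 in the session.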
Example: Unique REGISTRY/CAplus result
L16 ANSWER 1 OF 2 HCAPLUS COPYRIGHT 2010 ACS on STN
AN 2002:391912 HCAPLUS
DN 137:1836
TI Measurement of DNA methylation for analysis of the toxicology . . . .
IN Olek, Alexander; Piepenbrock, Christian; Berlin, Kurt
PA Epigenomics Ag, Germany
SO PCT Int. Appl., 113 pp.
CODEN: PIXXD2
LA German
FAN.CNT 1
     PATENT NO.        KIND   DATE       APPLICATION NO.        DATE
     ---------------   ----   --------   --------------------   --------
PI   WO 2002040710     A2     20020523   WO 2001-EP12951        20011108
. . . .
PRAI DE 2000-10056802  A      20001114
     WO 2001-EP12951   W      20011108
IT 391975-30-7, Protein (human 347-amino acid)
RL: BSU (Biological study, unclassified); PRP (Properties); BIOL (Biological study)
(amino acid sequence; measurement of DNA methylation for anal. of the toxicol. of substances)
Note: HITRN must be included, so that the CAS REGISTRY BLAST alignments can be merged into the BLAST Report.
Access the Table Tool and select the multifile search Transcript file
The most recent STN session Transcript is usually listed here.
Choose a template and select content
Option: choose a pre-defined custom template from a previous project.
L7 is the DGENE, USGENE and PCTGEN FSORTed answer set.
Select fields, column order, headings, fonts and spacing for the table
The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.
Explore the results further in Microsoft Excel
Some tips for Microsoft Excel:
• Resize columns and rows as desired – especially the BLAST alignment column to approx. 77
• View, Freeze panes – holds the top row fixed when scrolling down
• Add Filters – provides a great way to navigate results – for example by BLAST percent identity (above)
Post-process REGISTRY BLAST alignments
Download the post-processing template (.PRF) files used in this seminar: http://www.stn-international.com/stn_biosequence_searching_mfs.html
Select BLAST alignment report
• The first step is to select the XSS file to include in the BLAST report.
• Important: If your BLAST query is fairly long, or a nucleic acid, or the answers may exceed 1000 characters, make sure you change the value in the Do not include alignments longer than box.
Post-processing then continues via standard STN Express Custom Report Tool steps.
Select the session Transcript and template
The most recent STN session Transcript is usually listed here.
Option: choose a pre-defined custom template from a previous project.
Select fields, fonts and spacing for the report
The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.
Overview of search results for Homo sapiens D-amino-acid oxidase – unique in (red)
Database       SEQs ≥ 80%   PNs   Patent Families*
DGENE          19           10    8 (1)
USGENE         14           10    7 (2)
PCTGEN         3            3     3 (1)
REGISTRY       18           12    9 (2)
NCBI           6            4     4 (0)
Total Unique   -            -     14

(* Patent families = INPADOC Patent Families. Specifically, family records in INPAFAMDB.)
Summary
• RUN BLAST is available for searching DGENE, USGENE and PCTGEN directly on STN
• CAS REGISTRY BLAST provides BLAST searching options for the REGISTRY database
• DGENE, USGENE and PCTGEN multifile search results can be post-processed into tables, and exported to Microsoft Excel, using STN Express
• CAS REGISTRY BLAST alignment data can be merged with CAplus records and exported to RTF format to form a single unified report
• All four STN sequence databases are required for a comprehensive patent sequence search
Resources for sequence searching on STN
• Sequence Searching on STN modular workshop: http://www.stn-international.com/sequence_searching.html
• CAS REGISTRY sequence searching resources: http://www.cas.org/support/stngen/stndoc/sequences.html
• DGENE Workshop Manual: http://www.stn-international.com/dgene_wm.html
• USGENE Workshop Manual: http://www.stn-international.com/usgene_wm.html
• USGENE Workshop Manual Multifile Supplement: http://www.stn-international.com/usgene_wm_mfs.html
FIZ [email protected] and Training: www.stn-international.de
CAS E-mail: [email protected] and Training: www.cas.org
For more information …