
Multifile Patent Sequence Searching on STN®

Robert Austin – FIZ Karlsruhe

Agenda

• Sequence searchable databases on STN®
• Step-by-step through a multifile BLAST search
• Multifile post-processing using STN Express
• Overview of the search results
• Recent database enhancements
• Summary and resources

See also: Sequence Basics e-Seminar: http://www.stn-international.com/Sequence_Basics_Seminar.html

STN sequence searchable databases

• DGENE
– Thomson Reuters GENESEQ™
– Value-added patent sequence data from around the globe
• USGENE
– The USPTO Genetic Sequence Database
– All available sequence data from the USPTO
• PCTGEN
– WIPO/PCT Patent Application Biosequences
– All available e-published sequence data from WIPO
• CAS REGISTRY℠
– Chemical Abstracts Service (CAS) REGISTRY
– Worldwide value-added patent and non-patent sequences

CAS REGISTRY/CAplus offers two sequence search modes

• NCBI BLAST similarity
– Using a separate graphical user interface
• Sequence Code Match (motif) searching
– Using the Search (=> S) command

DGENE, USGENE and PCTGEN offer three sequence search modes

• NCBI BLAST similarity
=> RUN BLAST
• FASTA-based similarity
=> RUN GETSIM
• Sequence Code Match (motif) search
=> RUN GETSEQ

Learn more in the DGENE Workshop Manual: http://www.stn-international.com/dgene_wm.html

Multifile patent sequence searching

Search Question: Find all patents that disclose Homo sapiens D-amino-acid oxidase (NCBI NP_001908), or similar sequences (≥ 80%):

MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL

(Search conducted on 7th July 2010)

Multifile search strategy

1) RUN BLAST in DGENE, USGENE and PCTGEN using offline BATCH mode
2) Merge, organize by patent family, and display DGENE, USGENE and PCTGEN results
3) Repeat the search using CAS REGISTRY BLAST
4) Retrieve, identify, and display unique CAS REGISTRY BLAST CAplus records
5) Post-process DGENE, USGENE and PCTGEN results using the STN Express Table Tool
6) Post-process unique REGISTRY BLAST results using the BLAST Report Tool

SAVE, UPLOAD and VERIFY the query

• Prepare and save the query as a plain text file in a suitable text editor, e.g. Windows Notepad
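The query file can also be prepared programmatically. The following is a minimal Python sketch, assuming the upload expects nothing more than the bare one-letter sequence in a plain-text file; the function names, the accepted residue set, and the line width are illustrative assumptions, not STN requirements.

```python
from pathlib import Path

# One-letter amino-acid codes accepted by this sketch (20 standard residues
# plus X for "unknown"). The set that STN accepts may differ.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_protein_query(raw: str) -> str:
    """Drop whitespace and position numbers, upper-case, validate residues."""
    seq = "".join(ch for ch in raw if not ch.isspace() and not ch.isdigit()).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - AMINO_ACIDS)
    if bad:
        raise ValueError(f"unexpected characters: {''.join(bad)}")
    return seq


def save_query(raw: str, path: str, width: int = 60) -> str:
    """Write the cleaned sequence as plain ASCII text, wrapped to `width`."""
    seq = clean_protein_query(raw)
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    Path(path).write_text("\n".join(lines) + "\n", encoding="ascii")
    return seq
```

Writing plain ASCII avoids the hidden formatting that word processors can add, which is the reason a simple text editor is recommended above.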


From the Discover! button menu.

SAVE, UPLOAD and VERIFY the query (cont.)

(a) Click Upload Sequence
(b) Choose the query file
(c) Select the STN database

The sequence becomes a Query L-number in the database of choice for use with RUN BLAST.

SAVE, UPLOAD and VERIFY the query (cont.)

=> FILE USGENE

=> UPL R BLAST

Uploading C:\. . . .\NP_001908 Homo sapiens DAO.txt

UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED

=> D L1 LQUE

L1 ANSWER 1 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
LQUE MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSD
PNNPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPRELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVNCTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQTVTLGGIFQLGNWSELNNIQDHNTIWEGCCRLEPTLKNARIIGERTGFRPVRPQIRLEREQLRTGPSNTEVIHNYGHGGYGLTIHWGCALEAAKLFGRILEEKKLSRMPPSHL

The sequence query is now ready for searching directly in DGENE, USGENE, or PCTGEN using the L-number (L1).

Commands in red are automatically run by the STN Express Sequence Query Upload wizard.

Verify the sequence was uploaded successfully with D LQUE.

RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode

=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:05:31 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS

=> RUN BLAST L1 /SQP -F F BATCH

PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP

TO BE NOTIFIED WHEN THIS BATCH SEARCH IS COMPLETE, PLEASE ENTER YOUR EMAIL ADDRESS (MAX. 50 CHARS) OR "NONE"
INPUT: OR (END):[email protected]

BLAST Version 2.2

The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). . . .

BATCH PROCESSING STARTED FOR DAOP

Add BATCH to the end of a RUN BLAST command to search in offline batch search mode.

Enter a valid email address to be notified when the BATCH search is completed.

Tip: BATCH mode BLAST searches may be run concurrently in each database.

RUN the DGENE, USGENE and PCTGEN BLAST searches in BATCH mode (cont.)

=> FILE USGENE

=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .

=> FILE PCTGEN

=> RUN BLAST L1 /SQP -F F BATCH
. . . .
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS):DAOP
. . . .

=> LOG H

SESSION WILL BE HELD FOR 120 MINUTES
STN INTERNATIONAL SESSION SUSPENDED AT 17:07:14 ON 07 JUL 2010

RUN the USGENE and PCTGEN searches concurrently using BATCH.

Reminder: Turn the Low Complexity Filter off with the syntax: /SQP -F F

Tip: use LOGOFF HOLD (LOG H) to be able to return to the same STN session within two hours (optional).

Retrieve the BATCH search results

=> FILE DGENE
FILE 'DGENE' ENTERED AT 17:11:25 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS

=> RUN GETBATCH DAOP
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)

Database DGENE AA
Posted date: Jun 25, 2010 11:33 PM

. . . .

ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%)
ENTER (ALL) OR ? :80%
L2 RUN STATEMENT CREATED
L2 19 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F

Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".

In this example, 80% of the Query Self Score is used to select out just the most relevant results (L2).

Use RUN GETBATCH to retrieve completed BATCH search results.

Retrieve the BATCH search results (cont.)

=> FILE USGENE

=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%)
ENTER (ALL) OR ? :80%
L3 RUN STATEMENT CREATED
L3 14 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F

=> FILE PCTGEN

=> RUN GETBATCH DAOP
. . . .
ENTER EITHER THE NUMBER OF ANSWERS YOU WISH TO KEEP
OR ENTER MINIMUM PERCENT OF SELF SCORE FOLLOWED BY %
(BEST ANSWER PERCENTAGE OF SELF SCORE IS 100%)
ENTER (ALL) OR ? :80%
L4 RUN STATEMENT CREATED
L4 3 MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYAD. . . MPPSHL/SQP.-F F
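The 80% cut-off entered at the GETBATCH prompt is a percentage of the query self score (731 bits in this example). The arithmetic can be sketched in Python as follows. The truncation to a whole percentage is an assumption that happens to agree with the values shown in this seminar (for example, a score of 728 displayed as 99%); the accession labels are hypothetical.

```python
def percent_of_self_score(score: int, self_score: int) -> int:
    """Whole-number percentage of the query self score (truncated, by assumption)."""
    return 100 * score // self_score


def keep_hits(hits, self_score, min_percent=80):
    """Keep (accession, score) pairs scoring at least min_percent of self score."""
    return [(acc, s) for acc, s in hits if 100 * s >= min_percent * self_score]


# Hypothetical accession labels with illustrative bit scores.
hits = [("HIT-A", 731), ("HIT-B", 728), ("HIT-C", 585), ("HIT-D", 584)]
kept = keep_hits(hits, self_score=731)
for accession, score in kept:
    print(accession, score, f"{percent_of_self_score(score, 731)}%")
```

With a self score of 731, the 80% threshold corresponds to 584.8 bits, so a hit needs a score of at least 585 to be kept in this sketch.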


Merge the results into a single L-number

=> SET DUPORDER FILE
SET COMMAND COMPLETED

=> DUP IDE L2 L3 L4

FILE 'DGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 THOMSON REUTERS

FILE 'USGENE' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 SEQUENCEBASE CORP

FILE 'PCTGEN' ENTERED AT 17:16:56 ON 07 JUL 2010
COPYRIGHT (C) 2010 WIPO
PROCESSING COMPLETED FOR L2
PROCESSING COMPLETED FOR L3
PROCESSING COMPLETED FOR L4
L5 36 DUP IDE L2 L3 L4 (INCLUDES 0 SETS OF DUPLICATES)

ANSWERS '1-19' FROM FILE DGENE
ANSWERS '20-33' FROM FILE USGENE
ANSWERS '34-36' FROM FILE PCTGEN

=> SOR IDENT D
PROCESSING COMPLETED FOR L5
L6 36 SOR L5 IDENT D

SET DUPORDER FILE ensures that multifile records merged using DUP IDE are organized by database (file).

DUPLICATE IDENTIFY (DUP IDE) is used here to create a single multifile L-number (L5).

The multifile L-number (L5) can be sorted by BLAST SCORE, or Percent Identity (IDENT).
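Conceptually, the merge-then-sort sequence above concatenates the three answer sets in file order and then ranks the combined set by percent identity. The Python sketch below models only that idea. It is not how STN implements DUP IDE or SOR, and the `Hit` class, the accession labels, and the tie-breaking behaviour are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    file: str        # source database, e.g. "DGENE"
    accession: str   # hypothetical accession label
    ident: int       # BLAST percent identity


def merge_in_file_order(*answer_sets):
    """Concatenate answer sets, keeping each file's answers together."""
    merged = []
    for answers in answer_sets:
        merged.extend(answers)
    return merged


def sort_by_ident_descending(hits):
    """Rank by percent identity; equal values keep their merged (file) order."""
    return sorted(hits, key=lambda h: h.ident, reverse=True)


dgene = [Hit("DGENE", "D1", 100), Hit("DGENE", "D2", 99)]
usgene = [Hit("USGENE", "U1", 100)]
pctgen = [Hit("PCTGEN", "P1", 99)]

merged = merge_in_file_order(dgene, usgene, pctgen)
ranked = sort_by_ident_descending(merged)
print([h.accession for h in ranked])
```

Because Python's sort is stable, hits with the same percent identity stay in file order; whether STN orders ties the same way is not stated in the seminar.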

Review multifile answers with a free-of-charge format including alignment

=> D L6 TRIAL SCORE ALIGN 1-36; FILE STNGUIDE

L6 ANSWER 1 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
AN AAO23074 Protein DGENE
TI Determining a genotype of an individual for preparing a composition for treating schizophrenia by determining the identity of a nucleotide at a biallelic marker of the D-amino acid oxidase gene of the polynucleotide in a sample -
DESC Human D-amino acid oxidase wild-type protein.
KW Biallelic marker; D-amino acid oxidase; DAO; neuroleptic; CNS disorder; movement; Parkinson's disease; Huntington's; motor neurone; Alzheimer's; mood; unipolar depression; bipolar; . . . .
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN

Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)

Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQP . . .

Query Self Score and percentage.

Review answers with a free-of-charge format including alignment (cont.)

L6 ANSWER 4 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
MTY Protein
SQL 347
SCORE 731 100% of query self score 731
BLASTALIGN

Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)

Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN

Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR

Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN

Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .

BLAST Percent Identity (IDENT).
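The Identities line in a BLAST alignment counts the aligned positions at which query and subject carry the same residue. A minimal Python sketch of that count, and of the middle line printed between Query and Sbjct, is given below. It handles identities only: the "+" marks BLAST uses for conservative substitutions (Positives) need a scoring matrix and are left out, and the function names are illustrative.

```python
def identities(query: str, subject: str):
    """Return (matches, aligned_length) for two aligned strings of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return matches, len(query)


def midline(query: str, subject: str) -> str:
    """Identity-only middle line: the residue where both agree, else a space."""
    return "".join(q if q == s else " " for q, s in zip(query, subject))


# Short excerpt around a single D/H difference, for illustration.
q = "QPLDIKVY"
s = "QPLHIKVY"
matches, length = identities(q, s)
print(f"Identities = {matches}/{length}")
print(midline(q, s))
```

A single mismatch over the full 347-residue query gives 346/347, which the records in this seminar display as 99%.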

Review answers with a free-of-charge format including alignment (cont.)

L6 ANSWER 28 OF 36 PCTGEN COPYRIGHT 2010 WIPO on STN
TI ORGAN-SPECIFIC PROTEINS AND METHODS OF THEIR USE
MTY PRT
SQL 347
SCORE 728 99% of query self score 731
BLASTALIGN

Query = 347 letters
Length = 347
Score = 728 bits (1879), Expect = 0.0
Identities = 346/347 (99%), Positives = 346/347 (99%)

Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN

Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR

Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN

Query: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .
           CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ
Sbjct: 181 CTGVWAGALQRDPLLQPGRGQIMKVDAPWMKHFILTHDPERGIYNSPYIIPGTQ . . .

Ensure Capture Session is on to record a transcript for use in post-processing


Note: Check the Capture Retrospectively box to capture the session so far, as well as the session from this point forwards.

Use the STN Express 8.4 Patent Family Manager wizard to display the results

Access the Patent Family Manager wizard from the Discover! menu.

Choose a bibliographic display format with alignment for the first (best) hit, and a free-of-charge format with alignment for the rest of the sequences in each patent family group.

The Patent Family Manager begins by organizing the results using FSORT...

=> FSORT L6
. . . .

L7 36 FSO L6

11 Multi-record Families Answers 1-33
Family 1 Answers 1-5
Family 2 Answers 6-8
Family 3 Answers 9-10
Family 4 Answers 11-12
Family 5 Answers 13-14
Family 6 Answers 15-16
Family 7 Answers 17-18
Family 8 Answers 19-25
Family 9 Answers 26-27
Family 10 Answers 28-31
Family 11 Answers 32-33

3 Individual Records Answers 34-36
0 Non-patent Records

In this example, 14 patent family groups (i.e. 11 + 3) are retrieved.

Commands in RED are those issued automatically by the STN Express Patent Family Manager.

FSORT organizes the patent sequence records by Publication, Application, Related, and Priority numbers.
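One way to picture grouping by shared publication, application, related, and priority numbers is as a connected-components problem: two records belong to the same family if they share any number, directly or through a chain of other records. The Python sketch below uses a small union-find structure to show the idea. It is an illustration under that assumption, not a description of the FSORT algorithm, and all record labels and numbers in it are hypothetical.

```python
from collections import defaultdict


def group_families(records):
    """records maps a record label to the set of patent numbers it carries.

    Returns (multi_record_families, individual_records), each a list of
    lists of record labels in input order.
    """
    parent = {rec: rec for rec in records}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # patent number -> first record that carried it
    for rec, numbers in records.items():
        for number in numbers:
            if number in first_seen:
                union(first_seen[number], rec)
            else:
                first_seen[number] = rec

    families = defaultdict(list)
    for rec in records:
        families[find(rec)].append(rec)

    groups = list(families.values())
    multi = [g for g in groups if len(g) > 1]
    single = [g for g in groups if len(g) == 1]
    return multi, single


# Hypothetical records and numbers.
records = {
    "rec-a": {"PN-1", "AP-1"},
    "rec-b": {"AP-1", "PR-2"},
    "rec-c": {"PR-2"},
    "rec-d": {"PN-9"},
}
multi, single = group_families(records)
print(len(multi), "multi-record families;", len(single), "individual records")
```

In the seminar example the corresponding counts are 11 multi-record families and 3 individual records, giving 14 groups.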

=> DIS L7 PFAM=7 1 BIB,SQL,SCORE,IDENT,ALIGN

L7 ANSWER 17 OF 36 DGENE COPYRIGHT 2010 THOMSON REUTERS on STN
FAMILY 7
AN AEL25470 protein DGENE
TI Identifying compound that reduce/inhibit internal ribosome . . . .
IN Fear M
PA (TELE-N) TELETHON INST CHILD HEALTH RES.
PI WO 2006102720 A1 20061005 197
AI WO 2006-AU435 20060331
PRAI AU 2005-901574 20050331
PSL Disclosure; SEQ ID NO 18
LA English
OS 2006-747347 [76]
CR N-PSDB: AEL25469
PC-NCBI: gi30446
PC-SWISSPROT: P14920
DESC Reporter protein SEQ ID NO:18.
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN

Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)
. . . .

...and then continues by displaying the family groups in the specified formats

Commands in RED are those issued automatically by the STN Express Patent Family Manager.

=> DIS L7 PFAM=7 2-TOT TRIAL,SCORE,IDENT,ALIGN

L7 ANSWER 18 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
FAMILY 7
TI Isolation of Inhibitors of IRES-Mediated Translation (PublishedApplication)
DESC Homo Sapiens Protein; sequence 18 of 148
MTY Protein
SQL 347
SCORE 726 99% of query self score 731
IDENT 99%
BLASTALIGN

Query = 347 letters
Length = 347
Score = 726 bits (1873), Expect = 0.0
Identities = 345/347 (99%), Positives = 345/347 (99%)

Query: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLDIKVYADRFTPLTTTDVAAGLWQPYLSDPN
           MRVVVIGAGVIGLSTALCIHERYHSVLQPL IKVYADRFTPLTTTDVAAGLWQPYLSDPN
Sbjct: 1   MRVVVIGAGVIGLSTALCIHERYHSVLQPLHIKVYADRFTPLTTTDVAAGLWQPYLSDPN

Query: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
           NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR
Sbjct: 61  NPQEADWSQQTFDYLLSHVHSPNAENLGLFLISGYNLFHEAIPDPSWKDTVLGFRKLTPR

Query: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
           ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN
Sbjct: 121 ELDMFPDYGYGWFHTSLILEGKNYLQWLTERLTERGVKFFQRKVESFEEVAREGADVIVN

. . . .

...and then continues by displaying the family groups in the specified formats (cont.)

This USGENE hit is in the same family as the DGENE record on the previous slide (FAMILY 7).

=> DIS L7 34-36 BIB,SQL,SCORE,IDENT,ALIGN

L7 ANSWER 34 OF 36 USGENE COPYRIGHT 2010 SEQUENCEBASE CORP on STN
AN 20060275794.63099 Protein USGENE
TI Collections of matched biological reagents and methods for identifying matched reagents (PublishedApplication)
IN Carrino John (San Diego, CA); Liang Feng (San Diego, CA)
PA Invitrogen Corporation (Carlsbad CA)
PI US 20060275794 A1 20061207
AI US 2006-371354 20060307
PRAI WO 2005-US13914 20050422
US 2005-673045P 20050419
US 2005-665199P 20050325
US 2005-665200P 20050325
US 2005-659492P 20050307
US 2005-659493P 20050307
PSL SEQ ID NO 63099
DESC Homo sapiens protein; sequence 63099
DT Patent
SQL 347
SCORE 731 100% of query self score 731
IDENT 100%
BLASTALIGN

Query = 347 letters
Length = 347
Score = 731 bits (1886), Expect = 0.0
Identities = 347/347 (100%), Positives = 347/347 (100%)
. . . .

...and then continues by displaying the family groups in the specified formats (cont.)

This USGENE record is the first of the 3 “individual records” in the FSORT answer set (L7).


Typical steps of CAS REGISTRY BLAST

1. Launch BLAST
2. Search the sequence
3. Examine and evaluate alignment/relevance of sequence answers
4. Display STN data on sequences – REGISTRY
5. Display STN data on sequences – CAplus℠
– Limit CAplus results, if necessary
– Display CAplus data (references and HITRN)
6. Post-process BLAST alignment data

Launch CAS REGISTRY BLAST

• The Result Set Manager is the starting point:
– To begin a new sequence search
– To review results of previous sequence searches

Input the search query

• Sequences can be input by copy/paste
• Read from a file
• Recall a previously searched sequence within the same session
• Sequence line numbers do not interfere with the search.

Select the BLAST program

The following programs are most typically run:
• BLASTn for nucleotides
• BLASTp for proteins/peptides

Verify BLAST settings

Default values have been set to optimize sequence searches for researchers. Recommended settings for patent searches:
• Low Complexity Filtering – unchecked
• Max No. of Answers – 1000

View results


Highlight the result set to be viewed, and click on View Results.

Evaluate the alignment report

The negative sign indicates that the alignment details are shown. Details such as the sequence length, score, and percent identity are available.

Select sequences of interest

Sequences can be selected:
• In groups, using the color bar in the Alignment Scores
• Individually, by selecting the check box

To transfer the sequence data to STN, click the Get STN Data button.

Get STN Data and Save alignments (.xss)


The alignment data is saved in STN Express Saved Sequences (.xss) format.

Alignment data needs to be transferred for post-processing.

Transfer sequences to STN

Display sequences if desired.

• Log on to STN; a REGISTRY search of the sequences then runs automatically.

• Results can be displayed using either Discover! wizards or command-line input.

• Note: Type END or click Cancel to get out of the “Display Wizard”. You can turn off the “Display Wizard” in Preferences.


Display additional CAplus answers including the HITRN for alignment post-processing

=> FILE HCAPLUS
FILE 'HCAPLUS' ENTERED AT 17:25:10 ON 07 JUL 2010
COPYRIGHT (C) 2010 AMERICAN CHEMICAL SOCIETY (ACS)

=> S L12 AND PATENT/DT
L13 12 L12 AND PATENT/DT

=> TRANSFER L6 PN 1-
L14 TRANSFER L6 1- PN : 20 TERMS
L15 29 L14
ALL TERMS IN L14 RETRIEVED.

=> S L13 NOT L15
L16 2 L13 NOT L15

=> D BIB HITRN 1-2

The 44 REGISTRY records (L12) correspond to 12 HCAplus patent records (L13).

Transfer Publication Numbers (PN) from DGENE/USGENE/PCTGEN (L6) to find corresponding HCAplus records (L15).

In this example, 2 additional, highly relevant references have been found by including the REGISTRY/HCAplus search (L16).
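The L13 NOT L15 step is a set difference: CAplus patent records whose publication numbers were already found through DGENE, USGENE, or PCTGEN are removed, leaving only the new ones. The Python sketch below illustrates the logic. The helper names are hypothetical, the whitespace normalization is an assumption, and the "already known" record is a placeholder invented for the example; only the accession number 2002:391912 and its publication number come from the seminar.

```python
def normalize_pn(pn: str) -> str:
    """Compare publication numbers without regard to spacing or case."""
    return "".join(pn.split()).upper()


def split_known_and_new(caplus_records, transferred_pns):
    """Split CAplus records into those already found elsewhere and new ones.

    caplus_records maps an accession number to its publication numbers;
    transferred_pns are the numbers carried over from the other databases.
    """
    known_pns = {normalize_pn(pn) for pn in transferred_pns}
    known, new = [], []
    for accession, pns in caplus_records.items():
        record_pns = {normalize_pn(pn) for pn in pns}
        (known if record_pns & known_pns else new).append(accession)
    return known, new


caplus_records = {
    "2002:391912": ["WO 2002040710"],        # unique answer shown in this seminar
    "PLACEHOLDER-1": ["WO 2006102720"],      # hypothetical record, for illustration
}
transferred = ["WO2006102720"]               # illustrative transferred number

known, new = split_known_and_new(caplus_records, transferred)
print("already found:", known)
print("additional:", new)
```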

Example: Unique REGISTRY/CAplus result

L16 ANSWER 1 OF 2 HCAPLUS COPYRIGHT 2010 ACS on STN
AN 2002:391912 HCAPLUS
DN 137:1836
TI Measurement of DNA methylation for analysis of the toxicology . . . .
IN Olek, Alexander; Piepenbrock, Christian; Berlin, Kurt
PA Epigenomics Ag, Germany
SO PCT Int. Appl., 113 pp.
CODEN: PIXXD2
LA German
FAN.CNT 1
PATENT NO. KIND DATE APPLICATION NO. DATE
--------------- ---- -------- -------------------- --------
PI WO 2002040710 A2 20020523 WO 2001-EP12951 20011108
. . . .
PRAI DE 2000-10056802 A 20001114
WO 2001-EP12951 W 20011108
IT 391975-30-7, Protein (human 347-amino acid)
RL: BSU (Biological study, unclassified); PRP (Properties); BIOL (Biological study)
(amino acid sequence; measurement of DNA methylation for anal. of the toxicol. of substances)

Note: HITRN must be included, so that the CAS REGISTRY BLAST alignments can be merged into the BLAST Report.


Access the Table Tool and select the multifile search Transcript file


The most recent STN session Transcript is usually listed here.

Choose a template and select content


Option: choose a pre-defined custom template from a previous project.

L7 is the DGENE, USGENE and PCTGEN FSORTed answer set.

Select fields, column order, headings, fonts and spacing for the table


The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.

Review, adjust, and export the table


Explore the results further in Microsoft Excel

Some tips for Microsoft Excel:
• Resize columns and rows as desired – especially the BLAST alignment column to approx. 77
• View, Freeze panes – holds the top row fixed when scrolling down
• Add Filters – provides a great way to navigate results – for example by BLAST percent identity (above)


Post-process REGISTRY BLAST alignments

Download the post-processing template (.PRF) files used in this seminar: http://www.stn-international.com/stn_biosequence_searching_mfs.html

Select BLAST alignment report


• The first step is to select the XSS file to include in the BLAST report.

• Important: If your BLAST query is fairly long, or a nucleic acid, or the answers may exceed 1000 characters, make sure you change the value in the Do not include alignments longer than box.

Post-processing then continues via standard STN Express Custom Report Tool steps.

Select the session Transcript and template


The most recent STN session Transcript is usually listed here.

Option: choose a pre-defined custom template from a previous project.

Select the records to be processed


L16 is REGISTRY/CAplus additional unique answers.

Select fields, fonts and spacing for the report


The pre-defined custom template included a list of fields. These can be further customized and the template re-saved.

Review, adjust, and export the report


Overview of search results for Homo sapiens D-amino-acid oxidase – unique in (red)

Database       SEQs ≥ 80%   PNs   Patent Families*
DGENE          19           10    8 (1)
USGENE         14           10    7 (2)
PCTGEN         3            3     3 (1)
REGISTRY       18           12    9 (2)
NCBI           6            4     4 (0)
Total Unique   -            -     14

(* Patent families = INPADOC Patent Families. Specifically, family records in INPAFAMDB.)

Recent database enhancements

• Simultaneous left and right truncation added to the basic index of DGENE and PCTGEN

• Recent backfile enhancements
– Thomson Reuters GENESEQ (DGENE)
– USPTO Genetic Sequence Database (USGENE)
– World patent application biosequences (PCTGEN)

Simultaneous left and right truncation (SLART) added to the Basic Index of DGENE & PCTGEN

• Left and right truncation provides improved text search capabilities to refine sequence searches

=> S L2 AND INFLAMMAT?
L3 7494 L2 AND INFLAMMAT?

=> S L2 AND ?INFLAMMAT?
L4 7525 L2 AND ?INFLAMMAT?

=> S L4 NOT L3
L5 31 L4 NOT L3

=> D TI DESC KW

L5 ANSWER 1 OF 31 DGENE COPYRIGHT 2011 THOMSON REUTERS on STN
TI New lipolytic enzyme, useful for treating digestive disorders, pancreatic insufficiency, pancreatitis or cystic fibrosis.
DESC Aspergillus niger lipolytic enzyme, SEQ ID 2.
KW LPY; Lipase; antiinflammatory; cystic fibrosis; gastrointestinal function disorder; gastrointestinal-gen.; lipolytic enzyme; . . . .

L2 = BLAST search results.

SLART may help retrieve additional answers (L5).

Note: SLART was already available in USGENE.
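The difference between INFLAMMAT? and ?INFLAMMAT? can be illustrated with regular expressions. The Python sketch below treats a leading or trailing "?" as open truncation and otherwise requires a word boundary. This is a simplification for illustration only: it does not reproduce how the STN Basic Index tokenizes text, and the function name is hypothetical.

```python
import re


def stn_term_to_regex(term: str) -> re.Pattern:
    """Translate a term with optional leading/trailing '?' into a regex."""
    left_open = term.startswith("?")
    right_open = term.endswith("?")
    core = term.strip("?")
    pattern = (
        ("" if left_open else r"\b")
        + re.escape(core)
        + ("" if right_open else r"\b")
    )
    return re.compile(pattern, re.IGNORECASE)


keywords = "LPY; Lipase; antiinflammatory; cystic fibrosis"

right_only = stn_term_to_regex("INFLAMMAT?")
left_and_right = stn_term_to_regex("?INFLAMMAT?")

print(bool(right_only.search(keywords)))
print(bool(left_and_right.search(keywords)))
```

In this sketch the keyword "antiinflammatory" is missed by right truncation alone and found once left truncation is added, which mirrors the kind of extra answer retrieved in L5.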


DGENE backfile enhancements

• A backfile of “mega publication” sequence data continues to be added to DGENE:

Entry year     Pub. years     Number of pubs.   Number of seqs.
2007           2002 – 2006    20                844,962
2008           2001 – 2008    52                4,575,648
2009           2004 – 2009    109               7,360,824
2010           2006 – 2009    34                2,354,859
2011 (-July)   2003 – 2008    13                3,386,264
Total                         228               18,522,557

Status: 22 July 2011.

Example: DGENE backfile “mega publication”

L1 ANSWER 4 OF 197024 DGENE COPYRIGHT 2011 THOMSON REUTERS on STN
AN AUK86054 DNA DGENE
TI Identifying a target protein of yeast or a gene encoding the target protein by identifying target protein and gene encoding the protein, and analyzing functions of the gene to identify characters given to the yeast by the gene.
IN Nakao Y; Kodama Y; Shimonaga T; Kanamori T
PA (SUNR) SUNTORY LTD.
PI WO 2007099451 A1 20070907 292
AI WO 2007-IB551 20070226
PRAI JP 2006-117198 20060228
PSL Disclosure; SEQ ID NO 98512
DED 03 FEB 2011 (first entry)
DT Patent
LA English
OS 2007-739784 [69]
DESC Saccharomyces pastorianus oligonucleotide, SEQ ID 98512.
KW Protein detection; protein purification; brewing; ss.
ORGN Saccharomyces pastorianus.
AB The present invention relates to a method for identifying a target protein of brewery yeast or a gene encoding the target protein. The method comprises cultivating yeast under a predetermined . . . .
NA 6 A; 7 C; 4 G; 8 T; 0 U; 0 Other
SQL 25
SEQ
1 taacccggtc cacgattttg aatct

197,024 backfile sequence records were recently added into DGENE, from WO2007099451 A1.
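The NA field in the record above is a simple base count of the displayed sequence, and the counts add up to the sequence length (SQL). As a quick cross-check, the composition can be recomputed with a few lines of Python; the function name is illustrative, and the sketch assumes the display consists of a position number followed by space-separated blocks of lower-case bases.

```python
from collections import Counter


def na_composition(seq_display: str) -> dict:
    """Count A, C, G, T, U and 'Other' letters in an STN-style SEQ display."""
    residues = [ch for ch in seq_display.lower() if ch.isalpha()]
    counts = Counter(residues)
    result = {base.upper(): counts.get(base, 0) for base in "acgtu"}
    result["Other"] = len(residues) - sum(result.values())
    return result


composition = na_composition("1 taacccggtc cacgattttg aatct")
print("; ".join(f"{n} {base}" for base, n in composition.items()))
print("SQL", sum(composition.values()))
```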

USGENE backfile enhancements

• The following data/fields are now available for all records published from April 2006 onwards
– U.S. related application information (RLI)
– Priority application information (PRAI)
– Calculated patent expiration date (XPD)
– Patent term adjustment details (NTE, PTA)
– Patent Sequence Location (PSL)
– Sequence description (DESC)

See: The USGENE Workshop Manual (page 63): http://www.stn-international.com/usgene_wm.html

Example: USGENE backfile record with RLI, PRAI, XPD, NTE, PSL and DESC fields

L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
AN 6838433.6 Protein USGENE
TI IL-6 antagonist peptides (Patent)
IN Serlupi-Crescenzi Ottaviano (Rome, IT); Bressan Alessandro (Rome, IT); Della Pietra Linda (Rome, IT); Pezzotti Anna Rita (Rome, IT)
PA Applied Research Systems ARS Holding NV (Curacao NL)
PI US 6838433 B2 20050104
US 20030186876 A1 20031002
AI US 2003-357479 20030204
RLI US 2000-715923 20001120
WO 1999-EP3421 19980518
PRAI EP 1998-108997 19980518
XPD 20180518 (calculated)
NTE Subject to any Disclaimer, the term of this patent is extended or adjusted under 35 USC 154(b) by 54 days.
PSL Claim 1; SEQ ID NO 6
DESC Artificial protein; Synthetic; sequence 6 of 19
DT Patent
AB The present invention relates to IL-6 antagonist peptides, isolatable from a peptide library through the two-hybrids system by . . . .
ECLM US6838433 B2: 1. An IL-6 antagonist peptide, isolatable from a peptide library by binding to the intracellular domain of gp130 in a two-hybrid system for detecting protein-protein interaction, said peptide comprising SEQ ID NO:6, as well as salts, functional derivatives, and conservatively substituted analogs thereof having IL-6 antagonist activity.
. . . .

AN 6838433.6 is displayed here in BRIEF format, which includes all of the additional backfile content.

PSL indexing identifies if and where a SEQ ID NO is referred to in the claims.
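When records are exported for post-processing, the PSL field can be split into its location and SEQ ID NO parts, for example to flag claimed sequences in a spreadsheet. The Python sketch below is based only on the three PSL values that appear in this seminar ("Claim 1; SEQ ID NO 6", "Disclosure; SEQ ID NO 18", and "SEQ ID NO 63099"). Other PSL layouts may exist, and the function name and returned keys are hypothetical.

```python
import re


def parse_psl(psl: str) -> dict:
    """Split a PSL value into location terms and the SEQ ID NO."""
    parts = [p.strip() for p in psl.split(";") if p.strip()]
    seq_id = None
    locations = []
    for part in parts:
        match = re.fullmatch(r"SEQ ID NO:?\s*(\d+)", part, re.IGNORECASE)
        if match:
            seq_id = int(match.group(1))
        else:
            locations.append(part)
    claimed = any(loc.lower().startswith("claim") for loc in locations)
    return {"seq_id": seq_id, "locations": locations, "claimed": claimed}


for value in ("Claim 1; SEQ ID NO 6", "Disclosure; SEQ ID NO 18", "SEQ ID NO 63099"):
    print(value, "->", parse_psl(value))
```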


USGENE organism name standardization

• Original typographical errors for “Homo sapiens” in the organism name field (ORGN) have now been corrected throughout the USGENE file
– Including converting “Human” to “Homo sapiens”
• From May 2011 onwards, similar standardization is applied for a list of top organisms in USGENE
– E.g.: Zea mays, Glycine max, Oryza sativa, Mus musculus, Arabidopsis thaliana, Streptococcus pneumoniae, Gossypium hirsutum, Triticum aestivum

Example: organism name standardization

=> D AN ORGN SEQ

L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
AN 20090232771.1 USGENE
ORGN Homo sapiens
SEQ
1 mdvvdsllvn gsnitppcel glenetlfcl dqprpskewq pavqillysl
51 ifllsvlgnt lvitvlirnk rmrtvtnifl lslavsdlml clfcmpfnli
101 pnllkdfifg savcktttyf mgtsvsvstf nlvaislery gaickplqsr
151 vwqtkshalk viaatwclsf timtpypiys nlvpftknnn qtanmcrfll
201 pndvmqqswh tflllilfli pgivmmvayg lislelyqgi kfeasqkksa
251 kerkpsttss gkyedsdgcy lqktrpprkl elrqlstgss sranrirsns
301 saanlmakkr virmlivivv lfflcwmpif sanawraydt asaerrlsgt
351 pisfilllsy tsscvnpiiy cfmnkrfrlg fmatfpccpn pgppgargev
401 geeeeggttg aslsrfsysh msasvppq

AN 20090232771.1 is SEQ ID NO 1 from US20090232771.
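A SEQ display like the one above can be turned back into a plain sequence string, for example to reuse it as a query. The Python sketch below assumes the layout shown here (an optional start position followed by space-separated residue blocks on each line) and checks that every start position matches the running length. The function name is hypothetical.

```python
def sequence_from_display(display: str) -> str:
    """Rebuild the raw sequence from a numbered, blocked SEQ display."""
    blocks = []
    length = 0
    for line in display.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].isdigit():
            start = int(tokens[0])
            if start != length + 1:
                raise ValueError(
                    f"line starts at {start}, expected {length + 1}"
                )
            tokens = tokens[1:]
        for block in tokens:
            blocks.append(block)
            length += len(block)
    return "".join(blocks)


example = "1 aaaaaaaaaa cccccccccc\n21 gggg"
print(sequence_from_display(example))
```

The position check catches lines that were lost or merged when a record was copied, which can happen when long sequences are pasted between applications.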


Direct links to view original NCBI source data have been added to USGENE records

=> D SQIDE

L1 ANSWER 1 OF 1 USGENE COPYRIGHT 2012 SEQUENCEBASE CORP on STN
TI Recombinant viral nucleic acids (Patent)
DESC DNA; sequence 12 of 13
SQL 109
SEQ
1 gttttaaata cgctcgagga tgatcagatt cttagtcctc tctttgctaa
51 ttctcaccct cttcctaaca actcctgctg tggagggcga tgttagcttc
101 cgtttatca

FEATURE TABLE:
Key       |Location|
==========+========+===============================================
USGENE    |1..109  |http://www.sequencebase.com/usgene.php?d=7192740.12
NCBI      |1..109  |http://www.sequencebase.com/ncbi.php?d=EA095311
source    |1..109  |/organism='unknown'
source    |1..109  |/mol_type='genomic DNA'

AN 7192740.12 is displayed here in SQIDE format, which includes sequence-specific fields.

Click this link to access the original sequence data as published via NCBI (next slide).

Direct links to view original NCBI source data have been added to USGENE records (cont.)

The source sequence data for AN 7192740.12 in USGENE (previous slide).

Note: This sequence was not included in the original published sequence listing.


PCTGEN backfile enhancements

• WIPO recently made a backfile of sequence data available, for the time period 1999 – 2007
– The majority are in image form (TIF, PDF, etc.)
• Work is in progress by FIZ Karlsruhe Editorial to add this new backfile data into PCTGEN
– Using Optical Character Recognition (OCR)
– Including Quality Control and intellectual work

Example: New backfile data for PCTGEN

L1 ANSWER 1 OF 1 PCTGEN COPYRIGHT 2012 WIPO on STN
AN 2006030220.1 PRT PCTGEN
TI Compositions monovalent for CD4OL binding and methods of use [File created by using OCR software]
PA Grant et al., S.
PI WO 2006030220 20060323
RLI US 2004-610819P 20040917; US 2005-102512 20050408
ED 20120112
DT Patent
ORGN Homo sapiens
SQL 116
SEQ
1 evqllesggg lvqpggslrl scaasgftfs syamswvrqa pgkglewvsa
51 isgsggstyy adsvkgrfti srdnskntly lqmnslraed tavyycaksy
101 gafdywgqgt lvtvss

Records created from image format sequence listings are clearly marked.

Work is in progress on the new backfile data.

Summary

• RUN BLAST is available for searching DGENE, USGENE and PCTGEN directly on STN

• CAS REGISTRY BLAST provides BLAST searching options for the REGISTRY database

• DGENE, USGENE and PCTGEN multifile search results can be post-processed into tables, and exported to Microsoft Excel, using STN Express

• CAS REGISTRY BLAST alignment data can be merged with CAplus records and exported to RTF format to form a single unified report

• All four STN sequence databases are required for a comprehensive patent sequence search


Resources for sequence searching on STN

• DGENE Workshop Manual: http://www.stn-international.com/dgene_wm.html

• USGENE Workshop Manual: http://www.stn-international.com/usgene_wm.html

• CAS REGISTRY sequence searching resources: http://www.cas.org/support/stngen/stndoc/sequences.html

• Multifile BLAST searching (step-by-step guide): http://www.stn-international.com/usgene_wm_mfs.html

Recorded STN e-Seminars are available to watch at your own pace...

• FIZ Karlsruhe recorded e-Seminars: http://www.stn-international.com/recorded_events.html
– Sequence Basics (all databases)
– Multifile patent sequence searching (all databases)

• CAS recorded e-Seminars: http://www.cas.org/support/stngen/stntraining/recorded.html
– Sequence motif searching (all databases)
– Processing sequence data (REGISTRY)
– Unmasking the World of Antibodies (REGISTRY)

For more information …

FIZ Karlsruhe
E-mail: [email protected]
Support and Training: www.stn-international.de

CAS
E-mail: [email protected]
Support and Training: www.cas.org