PDF Association Technical Conference, June 18-19 2013

PDF and Microsoft SharePoint: Hurdles to Overcome

Neil Pitman, Aquaforest Limited

Version 1.120613



Objective: PDF as a SharePoint “First Class Citizen”

Agenda

Objectives

SharePoint Overview

PDF Capture

PDF Search: iFilters, Handling Image and Mixed-Mode PDFs

PDF Metadata: Dictionary, XMP and Entity Extraction

Configuration: SharePoint 2010, 2013

Summary

SharePoint Overview

What is SharePoint?

On-Premise and Cloud-based Collaboration & Document Management Platform

Origin: 2001

Usage: Focus on MS Office Documents; typically distributed capture

Microsoft SharePoint Server - 125 million licenses sold

SharePoint to be a natural target for PDF storage

SharePoint Overview

SharePoint Editions (2010, 2013): Foundation, Standard, Enterprise

Office 365 / SharePoint Online

Ecosystem: Partner Products, Office / SharePoint Marketplace

SharePoint Architecture Overview

MS Web-based (IIS)

MS Office Integration

SQL Server Storage

List or library data in a site collection is stored in a SQL Server database table, which uses queries, indexes and locks to maintain overall performance, sharing, and accuracy.

Filtered views with column indexes (and other operations) create database queries that identify a subset of columns and rows and return this subset to your computer.

Thresholds and limits help throttle operations and balance resources for many simultaneous users.

Privileged developers can use object model overrides to temporarily increase thresholds and limits for custom applications.

Administrators can specify dedicated time windows for all users to do unlimited operations during off-peak hours.

Information workers can use appropriate views, styles, and page limits to speed up the display of data on the page.

Microsoft Technology Stack: Windows Server 2008/2012, Internet Information Services (IIS), .NET Framework, SQL Server, MS Office

PDF Capture for SharePoint

Options: SharePoint UI, Acrobat XI, Load Tools, Custom Code, Workflow & Event Receivers

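// Upload a PDF to a SharePoint document library with an HTTP PUT.
// Requires: using System.IO; using System.Net;
// destUrl, fileBytes and Logging are defined elsewhere in the application.
WebRequest request = WebRequest.Create(destUrl);
request.Credentials = CredentialCache.DefaultCredentials;
request.Method = "PUT";
byte[] buffer = new byte[1024];
using (Stream stream = request.GetRequestStream())
using (MemoryStream ms = new MemoryStream(fileBytes))
{
    // Copy the file to the request stream one buffer at a time.
    for (int i = ms.Read(buffer, 0, buffer.Length); i > 0; i = ms.Read(buffer, 0, buffer.Length))
    {
        stream.Write(buffer, 0, i);
    }
}

WebResponse response = request.GetResponse();
response.Close();

Logging.Log("Upload successful");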

Acrobat XI SharePoint Integration

http://www.adobe.com/uk/products/acrobat/pdf-version-control-sharepoint-integration.html

PDF Search in SharePoint - Overview

iFilter Architecture

iFilters scan documents for text and attributes – primarily in support of Microsoft Search technologies.

iFilter Configuration

Architecture

Code Sample

Suppliers

Issues

PDF Search in SharePoint: iFilters

iFilter Explorer

Using iFilters directly in Code

StringBuilder Buffer = new StringBuilder();
string PDFFile = @"C:\dev\PDF Conference\s.pdf";
FilterCode f = new FilterCode();
f.GetTextFromDocument(PDFFile, ref Buffer);
Console.WriteLine(Buffer);

public void GetTextFromDocument(string Path, ref StringBuilder Buffer)
{
    IFilter filter = null;
    int hresult;
    IFilterReturnCodes rtn;

    // Initialize the return buffer to 64K.
    Buffer = new StringBuilder(64 * 1024);

    // Try to load the filter for the path given.
    hresult = LoadIFilter(Path, new IntPtr(0), ref filter);
    if (hresult == 0)
    {
        IFILTER_FLAGS uflags;

        // Init the filter provider.
        rtn = filter.Init(
            IFILTER_INIT.IFILTER_INIT_CANON_PARAGRAPHS |
            IFILTER_INIT.IFILTER_INIT_CANON_HYPHENS |
            IFILTER_INIT.IFILTER_INIT_CANON_SPACES |
            IFILTER_INIT.IFILTER_INIT_APPLY_INDEX_ATTRIBUTES |
            IFILTER_INIT.IFILTER_INIT_INDEXING_ONLY,
            0, new IntPtr(0), out uflags);

        if (rtn == IFilterReturnCodes.S_OK)
        {
            STAT_CHUNK statChunk;

            // Outer loop will read chunks from the document at a

[DllImport("query.dll", SetLastError = true, CharSet = CharSet.Unicode)]
static extern int LoadIFilter(
    string pwcsPath,
    [MarshalAs(UnmanagedType.IUnknown)] object pUnkOuter,
    ref IFilter ppIUnk);

https://gist.github.com/jimschubert/1473904

iFilter Test

Bookmark

PDF Attachment

Text

Image/OCR Text

Annotation

XMP Metadata

Dictionary Metadata

iFilter Test Results

Adobe iFilter

PDFlib iFilter

Foxit iFilter

Microsoft Format Handler

Body Text

Annotations

Bookmarks

Dictionary Metadata

XMP Metadata *

PDF Attachment

Dealing with Image and Mixed-Mode PDFs

Classify: Image-Only; Born-Digital; Part Image-Only, Part Born-Digital; Previously OCRed

Dealing with Image and Mixed-Mode PDFs

Objectives: Ensure Full Searchability; Avoid Text to Image Processing

Process: Capture Time? Scheduled In-Place?
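The classification step above can be sketched in a few lines. This is only an illustration under stated assumptions: `PageInfo`, `PdfClass`, `PdfClassifier` and their members are hypothetical names, and the two per-page flags would in practice be filled in by whatever PDF library inspects the file. The sketch shows how the four classes could be told apart and how only the image-only pages might be selected for OCR, so that pages that already carry text are never rasterised.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum PdfClass { ImageOnly, BornDigital, Mixed, PreviouslyOcred }

// Hypothetical per-page summary; a real pipeline would populate it from a PDF library.
public sealed class PageInfo
{
    public bool HasTextLayer;      // page contains extractable text
    public bool HasFullPageImage;  // page is covered by a scanned image
}

public static class PdfClassifier
{
    // Classifies a document from its per-page flags (assumed heuristic).
    public static PdfClass Classify(IReadOnlyList<PageInfo> pages)
    {
        if (pages.Count == 0) throw new ArgumentException("No pages", nameof(pages));

        // Every page is a scan with a text layer behind it: already OCRed.
        if (pages.All(p => p.HasFullPageImage && p.HasTextLayer)) return PdfClass.PreviouslyOcred;
        // Every page is a scan without text: needs OCR throughout.
        if (pages.All(p => p.HasFullPageImage && !p.HasTextLayer)) return PdfClass.ImageOnly;
        // Every page has text and no full-page scan: born digital.
        if (pages.All(p => p.HasTextLayer && !p.HasFullPageImage)) return PdfClass.BornDigital;
        // Anything else is treated as part image-only, part born-digital.
        return PdfClass.Mixed;
    }

    // Zero-based indexes of the pages that still need OCR (scan, no text).
    public static IEnumerable<int> PagesNeedingOcr(IReadOnlyList<PageInfo> pages)
    {
        for (int i = 0; i < pages.Count; i++)
            if (pages[i].HasFullPageImage && !pages[i].HasTextLayer)
                yield return i;
    }
}
```

Whether this runs at capture time or as a scheduled in-place job, the same per-page decision applies: only the indexes returned by `PagesNeedingOcr` would be sent to the OCR engine.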

PDF Metadata In SharePoint

Text Search vs Metadata Search

Crawled vs Managed Properties

Review Requirements: Dictionary Metadata, XMP Metadata, Entity Extraction

Consider Automation

PDF Metadata In SharePoint

Crawled vs Managed Properties

PDF Metadata In SharePoint: Using Event Receivers

Event Receivers can enable Metadata assignment
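As a rough illustration of the metadata assignment step, the sketch below maps entries from a PDF document information dictionary to library columns. It deliberately leaves out the SharePoint side: in a real event receiver this mapping would presumably be called from an item-added handler, which needs the SharePoint server object model and is not shown. The class name `PdfMetadataMapper` and the column names `DocAuthor`, `DocSubject` and `DocKeywords` are assumptions made for the example; only the dictionary keys (`Title`, `Author`, `Subject`, `Keywords`) come from the PDF format.

```csharp
using System;
using System.Collections.Generic;

public static class PdfMetadataMapper
{
    // Assumed mapping from PDF information dictionary keys to
    // SharePoint column names; the column names are hypothetical.
    private static readonly Dictionary<string, string> KeyToColumn =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            { "Title", "Title" },
            { "Author", "DocAuthor" },
            { "Subject", "DocSubject" },
            { "Keywords", "DocKeywords" },
        };

    // Returns column name -> value for every mapped key that has a non-empty value.
    public static Dictionary<string, string> ToColumns(IDictionary<string, string> info)
    {
        var columns = new Dictionary<string, string>();
        foreach (var pair in KeyToColumn)
        {
            string value;
            if (info.TryGetValue(pair.Key, out value) && !string.IsNullOrWhiteSpace(value))
                columns[pair.Value] = value.Trim();
        }
        return columns;
    }
}
```

Keeping the mapping in a plain function like this makes it testable without a SharePoint farm; unmapped keys such as `Producer` are simply ignored.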

PDF Metadata In SharePoint

Entity Extraction

Configuration

SharePoint 2010

SharePoint 2013

SharePoint 2010 PDF Configuration

Missing icon and iFilter

http://www.adobe.com/devnet-docs/acrobatetk/tools/AdminGuide/Acrobat_Reader_IFilter_configuration.pdf

SharePoint 2010 PDF Configuration

SharePoint PDF Configuration

Default for PDF: "X-Download-Options: noopen" added to the HTTP response header
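A quick way to confirm this behaviour on a given farm is to look at the response headers returned for a PDF. The helper below is a minimal sketch, not part of SharePoint: `InlineViewCheck` and `ForcesDownload` are hypothetical names, and the headers are assumed to have been collected into a dictionary already (for example from a browser's network trace).

```csharp
using System;
using System.Collections.Generic;

public static class InlineViewCheck
{
    // Returns true when the headers contain "X-Download-Options: noopen",
    // which makes Internet Explorer offer Save instead of opening the file.
    public static bool ForcesDownload(IDictionary<string, string> headers)
    {
        foreach (var header in headers)
        {
            bool isOption = string.Equals(header.Key, "X-Download-Options",
                                          StringComparison.OrdinalIgnoreCase);
            bool isNoOpen = header.Value != null &&
                            string.Equals(header.Value.Trim(), "noopen",
                                          StringComparison.OrdinalIgnoreCase);
            if (isOption && isNoOpen) return true;
        }
        return false;
    }
}
```

Header names are compared case-insensitively because HTTP header names are not case-sensitive.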

SharePoint 2013 and PDF Configuration

PDF Format Handler Support

Currently no iFilter Support for PDF !?!?!!

Inline Viewing of PDF in SharePoint 2013

http://stevemannspath.blogspot.co.uk/2012/10/sharepoint-2013-pdf-preview-in-search.html

SharePoint 2013 and PDF Configuration

http://stevemannspath.blogspot.co.uk/2013/04/sharepoint-2013-pdf-support-and.html

Summary

Microsoft SharePoint Server - 125 million licenses sold

SharePoint to be a natural target for PDF storage

PDF as a SharePoint “First Class Citizen”

Contact : [email protected]