Data Crawling Using SAS Applications and SAS Macros



Post on 16-Nov-2018


#analyticsx

SAS Institute

ABSTRACT

• Introduction. Statistical analyses are data dependent: there can be no analysis without data. The World Wide Web is, in effect, an enormous data warehouse from which data can be acquired, and there are various ways to acquire it. We propose some methods that can be implemented within the SAS ecosystem to extract data from the web. SAS Text Analytics provides SAS Information Retrieval Studio, which offers some out-of-the-box crawlers. It also allows you to write a markup matcher to extract specific information from a webpage. SAS can also work closely with open source utilities, which can be embedded directly into SAS Macro code. We will also explore cURL, Node.js, Java, Python, and JSON processing using PROC DS2 code. To summarize:

• Web data crawling using SAS Applications (e.g., SAS Text Miner, SAS Information Retrieval Studio) and SAS Macros can, for example by using APIs, greatly automate the first step in the analytics lifecycle for solution development.

METHODS

SAS Text Miner Import Node: Overview.

• The Text Import node serves as a replacement for an Input Data node in SAS Enterprise Miner (SAS EM).

• It can extract files contained in a directory or from the web.

• It can also access web resources that require user credentials.

• The Text Import node can also handle proprietary formats, such as MS Word and PDF files, as input.

• Output from the Text Import node is passed to Text Parsing, which performs tokenization for further analyses.

• It identifies the languages of the documents and ensures that they are properly transcoded.

• %TMFILTER is the macro that works behind the scenes.

• All the parameters of %TMFILTER appear on the properties panel in SAS EM.

Data Crawling using SAS Applications and SAS Macros

[Overview diagram labels: SAS® Web Crawlers; SAS® Text Miner: Text Import Node, %TMFILTER; SAS® Information Retrieval Studio; SAS® Macros; PROC DS2; Java/Python/etc.]

Manuel Figallo, Principal Systems Engineer; Mark Leventhal, Principal Analytical Consultant; Anurag Mhaiskar, Analytical Consultant


METHODS (CONT.)

SAS Information Retrieval Studio: Overview.

• SAS Information Retrieval Studio is a web-based tool for extracting data from various web resources and internal data sources. It provides the following crawler components:

Web Crawler – Extracts text from web pages.

File Crawler – Extracts text from files and documents on internal data systems and shared network drives.

Feed Crawler – Extracts text from web feeds such as RSS.

• A proxy server controls the flow of documents, and a pipeline server processes them.

• It provides a search and indexing mechanism.

• It allows customized crawler plug-ins for specific websites.
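To make the Feed Crawler idea concrete, the following is a minimal Python sketch of pulling text fields out of an RSS 2.0 feed. It only illustrates the concept and is not how SAS Information Retrieval Studio is implemented: the sample feed, its URLs, and the function name `extract_feed_items` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a real web feed.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/1</link>
      <description>Hello from the feed.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
      <description>More text to analyze.</description>
    </item>
  </channel>
</rss>
"""

FIELDS = ("title", "link", "description")


def extract_feed_items(feed_xml: str) -> list[dict[str, str]]:
    """Return one dict of text fields per <item> element in the feed."""
    root = ET.fromstring(feed_xml.strip())
    items = []
    for item in root.iter("item"):
        # findtext returns None when a field is absent; store "" instead.
        items.append({f: (item.findtext(f) or "").strip() for f in FIELDS})
    return items


if __name__ == "__main__":
    for entry in extract_feed_items(SAMPLE_FEED):
        print(entry["title"], "->", entry["link"])
```

Each returned dictionary corresponds to one document that a downstream text-processing step could consume.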

SAS Information Retrieval Studio: SAS Markup Matcher.

• It is part of SAS Information Retrieval Studio.

• It offers a web-based interactive interface for normalizing HTML and XML documents.

• It is especially useful for extracting specific information such as reviews, forum discussions, and blogs.

• Rules are built by point and click and take the form of XPath expressions.

• It supports templates for reuse and document testing.

• When the markup matcher is included in the pipeline server along with the export-csv processor, it can extract web data into a CSV file, which can then be imported into a SAS dataset.

• [Screenshot: SAS Markup Matcher crawling Amazon.com]
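The flow described above (XPath rules applied to normalized markup, with the matches exported as CSV) can be sketched in Python using only the standard library. This is an illustration of the idea, not of SAS Markup Matcher's internals: the sample markup, the XPath rules, and the function name `extract_reviews_to_csv` are assumptions made for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, already-normalized (well-formed) review markup.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="review">
      <span class="author">Ann</span>
      <span class="rating">5</span>
      <p class="text">Works as described.</p>
    </div>
    <div class="review">
      <span class="author">Raj</span>
      <span class="rating">3</span>
      <p class="text">Average, but cheap.</p>
    </div>
  </body>
</html>
"""

# One XPath rule per output column (ElementTree supports a subset of XPath).
FIELD_RULES = {
    "author": "./span[@class='author']",
    "rating": "./span[@class='rating']",
    "text": "./p[@class='text']",
}


def extract_reviews_to_csv(markup: str) -> str:
    """Apply the XPath rules to each review block and return CSV text."""
    root = ET.fromstring(markup.strip())
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_RULES),
                            lineterminator="\n")
    writer.writeheader()
    for review in root.findall(".//div[@class='review']"):
        row = {}
        for name, rule in FIELD_RULES.items():
            node = review.find(rule)
            row[name] = node.text.strip() if node is not None and node.text else ""
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_reviews_to_csv(SAMPLE_PAGE))
```

The resulting CSV text is the kind of file that could then be imported into a SAS dataset.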

SAS Macros and APIs: Overview.

• Macros are reusable SAS software components that are modular and flexible enough to be applied to a variety of use cases.

• SAS Macros can make API calls, so data crawling requires minimal human involvement.

• APIs (Application Programming Interfaces):

1. Are machine-readable

2. Are used to automate processes

3. Integrate systems


METHODS (CONT.)

• Think of an API as a power outlet. SAS Macros “plug into” APIs to retrieve data (instead of electricity) that “fuels” SAS Applications.

• SAS Web Applications are especially well suited to API-based data because they can easily run SAS Macros to crawl or access Web data and gain insights from it as quickly as possible. These systems integrate with one another over HTTP, which serves as a “spanning layer” across Web technologies and applications.

• SAS Web Applications include SAS Studio (SS), SAS Visual Analytics (VA), and SAS Visual Statistics (VS).

• One example of an API-based data source is the Health Indicators Warehouse (HIW). To retrieve data from this resource, use the %extractHIWResource or %extractManyHIWResources macro.

SAS Macros: The Anatomy of a Macro

• SAS Macros consist of three things: input(s), an interface, and output(s). The parameterized macro %extractResource takes two inputs (circled in red in the original figure) and produces XML output, which can later be converted to a SAS dataset.

• The two inputs are: 1) an API call; and 2) the filesystem location of the resulting XML file.

• SAS Macros can execute Java, Python, Node.js, etc., and can even process XML/JSON.
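The input/interface/output anatomy described above can be mirrored in a short Python sketch: one function takes an API call and a file location and writes the XML response there, and a second turns that XML into rows. The source does not show the code behind %extractResource, so everything here (the names `extract_resource`, `xml_to_rows`, and `http_get`, the injectable `fetch` parameter, and the sample XML) is an assumption made for illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def http_get(url: str) -> bytes:
    """Default fetcher: issue the API call over HTTP and return the raw body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def extract_resource(api_call: str, xml_path: str,
                     fetch: Callable[[str], bytes] = http_get) -> Path:
    """Input 1: the API call (a URL). Input 2: where to store the XML output."""
    target = Path(xml_path)
    target.write_bytes(fetch(api_call))
    return target


def xml_to_rows(xml_path: str, record_tag: str) -> list[dict[str, str]]:
    """Flatten each <record_tag> element into a dict, one per table row."""
    root = ET.parse(xml_path).getroot()
    return [{child.tag: (child.text or "").strip() for child in record}
            for record in root.iter(record_tag)]


if __name__ == "__main__":
    # Hypothetical response body; a stub fetcher avoids any network access.
    sample = b"<Items><Item><Id>1</Id><Name>Example</Name></Item></Items>"
    with tempfile.TemporaryDirectory() as folder:
        out = extract_resource("https://example.invalid/api",
                               str(Path(folder) / "out.xml"),
                               fetch=lambda url: sample)
        print(xml_to_rows(str(out), "Item"))
```

Passing the fetcher as a parameter keeps the two documented inputs unchanged while letting the function be exercised without a live API.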

SAS Macros: Design

• SAS Macros can also be embedded, nested, or wrapped inside a SAS Studio custom task to facilitate reuse and make them easy to use. This design is much like how a painting is “wrapped” inside a matte and frame to give it new behaviors (i.e., it can be hung).


METHODS (CONT.)

SAS Macros: Usage

• SAS Macros exist for a variety of API-based data sources, including the HIW (Health Indicators Warehouse) as shown in the examples above.

• The library of SAS Macros to crawl or access API-based data is available at http://tinyurl.com/AnalyticsX2016

• This collection or library of SAS Macros consists of the following:

• To create a SAS Macro of your own, use %doMacroTemplate, which implements the design principles mentioned earlier:

• Calling a SAS Macro from code is also easy:

• The library of SAS Macros also contains examples that use SAS Custom Tasks (for use in a SAS Studio Process Flow):

• SAS Studio provides an easy-to-use interface for running SAS Macros that crawl and access API-based data. For example, the %geocodeLocation macro uses the Google API to retrieve latitude and longitude values for a list of US street addresses; a brief video accompanying the poster shows this geocoding in action.
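As a rough picture of what geocoding through an API involves, the Python sketch below builds a request URL for an address and reads latitude and longitude out of a JSON response. The internals of %geocodeLocation are not shown in the source, so this assumes the public Google Geocoding web service and its documented JSON response shape. The function names, the placeholder key, and the sample response are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Public endpoint of the Google Geocoding web service (JSON output).
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the API call for one street address."""
    query = urlencode({"address": address, "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def parse_lat_lng(response_text: str) -> Optional[tuple[float, float]]:
    """Return (latitude, longitude) from a response, or None if no match."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


if __name__ == "__main__":
    # "YOUR_API_KEY" is a placeholder; the response below is made up.
    print(build_geocode_url("100 Main St, Springfield, IL", "YOUR_API_KEY"))
    sample = ('{"status": "OK", "results": '
              '[{"geometry": {"location": {"lat": 39.8, "lng": -89.6}}}]}')
    print(parse_lat_lng(sample))
```

Looping these two functions over a list of addresses, with an HTTP request in between, gives the latitude and longitude columns the bullet above refers to.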


Macros in the library:

| Macro Name | Data Source | Technology |
|---|---|---|
| %extractManyHIWResource | Health Indicators Warehouse | NODEJS/API |
| %getTweetsBySearchQuery | Twitter | Java/API |
| %extractRegulationsResource | Regulations.gov | DS2/JSON/API |
| %sendEmail | Google Mail | Python/API |
| %geocodeLocation | Google Maps | SAS/Google API |
| %sendEmail | Google Mail | CURL |
| etc. | | |

[Figure callouts: Error Handling; Flow Control]


RESULTS

• The methods for data crawling and access covered thus far can benefit any analytics project with these results:

• Data Extraction. Data can be extracted into a standard format, for example CSV.

• Ease-of-Use and Reuse (right). Easy-to-use, drag-and-drop SAS Studio Custom Task interfaces can help ensure widespread adoption of the aforementioned methods.


• System Integration (right). SAS Macros can replace onerous, manual ETL processing with direct connectivity to API-based data, making processing automated, uniform, and integrated, as shown in the before-and-after process flows (gap analysis) to the right. SAS Web Applications and analytics can also provide powerful visualizations.

• System Performance (left). API calls made with SAS Macros can, for example, run asynchronously so long as system resources permit. Where this capability exists, a Boolean input variable determines whether the macro runs asynchronously; by default, the API calls are made sequentially.

• Insights. The automated and integrated data crawling described above can feed analytic systems to gain actionable insights, as shown in SAS Visual Analytics county-level maps.
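The Boolean switch between sequential and asynchronous API calls mentioned under System Performance can be sketched in Python as follows. The source does not show how the macros implement it, so this sketch uses a thread pool as a stand-in for asynchronous execution. The names `call_apis`, `run_async`, and `max_workers`, and the stub fetcher, are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def call_apis(api_calls: Iterable[str],
              fetch: Callable[[str], bytes],
              run_async: bool = False,
              max_workers: int = 4) -> list[bytes]:
    """Run every API call and return the responses in input order.

    run_async=False (the default) makes the calls one after another;
    run_async=True hands them to a pool of worker threads.
    """
    calls = list(api_calls)
    if not run_async:
        return [fetch(call) for call in calls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs.
        return list(pool.map(fetch, calls))


if __name__ == "__main__":
    # Stub fetcher so the example runs without network access.
    def fake_fetch(url: str) -> bytes:
        return f"response for {url}".encode()

    urls = [f"https://example.invalid/resource/{i}" for i in range(3)]
    print(call_apis(urls, fake_fetch))
    print(call_apis(urls, fake_fetch, run_async=True))
```

Capping `max_workers` is one simple way to respect the "so long as system resources permit" condition.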

CONCLUSION

SAS Applications and SAS Macros provide access to many data sources on the Web in order to “feed” SAS analytic systems and gain actionable insights. These SAS technologies facilitate the analytics lifecycle and help reduce or identify risks while achieving automation, integration, and uniformity in data access. Finally, Web data can be used by Architects to prototype solutions, Business Analysts to demo software, and Developers to develop and test software.

Customer Testimonial:

“I had the opportunity to use [the macro] to crawl and retrieve documents from the regulations.gov API. The macro allowed me to automate the extraction of over 9000 documents for further text analysis, a process that previously would have been manual and time-consuming. I expect that I will adapt the macro for use in the future with other API data sources.”

