Out of Context - MIT (web.mit.edu/smadnick/www/wp2/1997-14.pdf)

From Computerworld, September 1997 Out of Context Allan E. Alter CISL WP #97-14 September 1997 The Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 02142


From Computerworld, September 1997

Out of Context

Allan E. Alter

CISL WP #97-14

September 1997

The Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02142


(www.computerworld.com) September 15, 1997 Computerworld


Intelligent agents and global data warehouses are poised to explode in popularity. But they're only as good as the data you feed them, so you run the risk of taking the information ...

By Allan E. Alter

Uh-oh. Look what the intelligent agent dragged in:

* Two numbers for Exxon's 1995 net sales figures. Which is right: $122 billion or $108 billion?

* A foreign bond due 01-03-05. Is that Jan. 3, 2005, March 1, 2005, or March 5, 2001?

* The number 26200 in a spreadsheet cell. Is that in dollars, marks, French or Swiss francs? If it's U.S. currency, is that $262.00 or $26,200?

* Five banks that offer the best deals on 30-year mortgages. But do they figure in the annual percentage rate? Or points?

* A dip in the Brazilian sales figures. Was it a bad year, or did the Sao Paulo office change its accounting rules?

ARE YOU DREAMING OF THE DAY when intelligent agents will roam the World Wide Web and find you the best deal on a mortgage? Or are you building a global, corporatewide data warehouse right now? Large data warehouses and Web-scrapers may wind up as newfangled Towers of Babel if they can't make sense out of information that comes from different sources or extract data from Web pages with different formats. A few small vendors and a team of researchers at MIT directed by Professor Stuart E. Madnick are beginning to solve these problems, but a total solution appears to be years away.

Madnick calls this the "data context" problem: Data in different environments can mean different things, just as the word "Java" means different things to programmers and truck drivers. In the U.S., a "D" grade means barely passing; in Australia, a "D" grade means "with distinction."

Until now, this has been an annoying but limited issue, Madnick says. Most companies have only attempted to integrate internal data, or data from one country. Data dictionaries, developed by techies for techies, were more concerned with consistency in names, not what the names meant. But with zillions of Web pages out there, data warehouses going global and millions of end users launching queries on their own, it will be harder to live with data that contains hidden assumptions or data that obscures important distinctions. The problem isn't likely to affect data marts because they include information from fewer and more homogeneous sources, says Peter M. Storer, a vice president at Atre Associates, Inc., a consultancy in Port Chester, N.Y., that works on data warehouses.

Dick Hudson, chief information officer at Global Marine, Inc., an offshore drilling company in Houston, wants to help his firm's purchasers shop on the Web. The company's top suppliers are creating online catalogs; instead of browsing to find drill pipes, he'd like an intelligent agent to do comparative shopping and download the results onto a spreadsheet. But if this agent is oblivious to fittings, collars and other drill-pipe variants, it will download data on the wrong kinds of pipes and buyers will order the wrong ones. "You don't want to order 100,000 feet of drill pipe and get apples when you wanted oranges," Hudson says.

COMPUTERWORLD: For related Web sites, more information and an interview with MIT professor Stuart Madnick, visit our Web site at: www.computerworld.com.

"We are starting to build a global data warehouse, and the context issue is becoming crucial," says a regional information technology manager at a Fortune 100 company who asked to remain anonymous. "When you report sales, some sales managers put in discounts, others won't. Some include freight, some don't. In big regions like Asia, the differences can be astronomical."

It's a supply-chain issue, too. Has a particular order been shipped or not? At this firm, query one system and the answer is no; query another and the answer is yes. That's because these two supply-chain systems define a key word differently.

NEW AVENUES

On the other hand, solving the problem could provide new opportunities.

Raymond C. Bonker, a vice president at Merrill Lynch, sees a payoff in combating information overload. His vision: Pull financial data off the Internet, add information from external sources and internal databases, and deliver a money-making mix of information to sales staff, researchers and traders in useful, summary form.

That could result in better, faster decision-making and less time wasted on browsing, calculating or deciphering the many monitors that crowd their desks. MIT is running a pilot project with Merrill Lynch to build such a system, said Bonker, who works in Merrill Lynch's Jersey City, N.J., office.

Primark Corp., in Waltham, Mass., also is working with Madnick. Primark integrates business information from hundreds of sources around the world and feeds it to Wall Street-type firms. Straightening out data context problems and other quality control tasks requires 150 people, says Chief Technology Officer Bob Brammer. If technology could whup the data context problem, Brammer could reduce labor and production costs and get information to customers more quickly. "If a company has released its second-quarter report and that information is on our system faster than on our competitor's, that's an advantage," Brammer says.

Madnick sees other opportunities: The U.S. military wants ways to get materiel in a hurry without stocking inventories. "Trusted agents" could scan suppliers' inventory and production planning systems to quickly find items they need. Mail-order companies could take a list of shipments, compare it with data from UPS' and Federal Express' package-tracking Web sites, and automatically generate letters of apology to anyone whose shipments are late.

Solving the data context dilemma could even make information systems look good. Says the anonymous IS manager, "To me, fixing these problems will mean a lot. They take a lot of credibility out of the work that we do. Everybody blames the systems, when it's not just the systems but the business practices."

So how do we get there from here?

Madnick has been focusing his recent work on the Web. Getting every Web page designer on the planet to standardize is impossible. Instead, "our idea is to find a way to record the context," he says.

The $122 (or $108) Billion Question

How might inconsistent data show up on the Web? MIT professor Stuart E. Madnick points to www.pirc.com/top-companies/, a demonstration Web page set up by Primark.

The page is linked to lists of the top 25 U.S. and international companies that provide different numbers for the same company. For example, Exxon's net sales as of Dec. 31, 1995 are listed as $121,804,000,000 on the U.S. list and $107,993,000,000 in the international list. Why the difference? The U.S. list comes from its "Disclosure" service, which includes interest income, excise taxes and other income. But the international list comes from its "Worldscope" service, which doesn't. Kelly Services, Inc. is listed as having 660,600 employees on the U.S. list, but it doesn't show up on the international top 25. That's because Disclosure lists its nonpermanent employees; Worldscope doesn't. - Allan Alter

Madnick's team of MIT researchers has developed technology for solving one part of the problem: extracting data from Web sites. He calls it a "Web wrapper generator" - middleware that resides on a user's server and allows you to treat the Web like a giant database. Users can post a SQL query and get back a database record or spreadsheet that contains the information they want. The generator includes a "Web page spec file," which provides a "schema" (what the database structure should look like), a "page transition" (which tells the engine how many pages to go to on a site to satisfy the query) and "extraction rules" (guidelines for locating information on a Web page).
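A rough way to picture the three parts of such a spec file is the Python sketch below. Everything in it is an assumption made for illustration: the dictionary layout, the page paths, the bank names and rates, and the helper names `wrap` and `select` are hypothetical and are not taken from the MIT generator, whose internals the article does not show. The "Web" here is a small in-memory dictionary so the sketch runs without a network connection.

```python
import re

# Hypothetical spec file for one site (layout and names are assumptions).
SPEC = {
    # "schema": column names of the virtual table and how to convert each value
    "schema": {"bank": str, "rate": float},
    # "page transition": which pages to visit, in order, to answer a query
    "page_transition": ["/rates/page1", "/rates/page2"],
    # "extraction rules": where the values sit inside each page
    "extraction_rule": re.compile(
        r"<tr><td>(?P<bank>[^<]+)</td><td>(?P<rate>[\d.]+)%</td></tr>"
    ),
}

# Stand-in for a Web site: page path -> HTML text (made-up data).
FAKE_SITE = {
    "/rates/page1": "<table><tr><td>First Bank</td><td>7.9%</td></tr>"
                    "<tr><td>City Savings</td><td>8.2%</td></tr></table>",
    "/rates/page2": "<table><tr><td>Harbor Trust</td><td>7.6%</td></tr></table>",
}


def wrap(spec, fetch):
    """Yield one row (a dict) per match, visiting the pages the spec lists."""
    for path in spec["page_transition"]:
        html = fetch(path)
        for match in spec["extraction_rule"].finditer(html):
            yield {col: cast(match.group(col)) for col, cast in spec["schema"].items()}


def select(rows, where):
    """Tiny stand-in for the WHERE clause of a SQL query."""
    return [row for row in rows if where(row)]


rows = list(wrap(SPEC, FAKE_SITE.__getitem__))
# Roughly: SELECT bank, rate FROM rates WHERE rate < 8.0
cheap = select(rows, lambda row: row["rate"] < 8.0)
for row in cheap:
    print(row["bank"], row["rate"])
```

The point of the design is that the query code never touches HTML: when a site changes its layout, only the spec entry for that site needs to change.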

Two vendors also are addressing this issue: Alpha Microsystems, Inc. in Santa Ana, Calif. (www.alphaconnect.com) is releasing a $129 package called BusinessVue, which the company claims can scrape information about a company off various Web sites and Internet news groups and deliver it in spreadsheet or text form. Hudson is trying out a product called Center Stage from OnDisplay, Inc. (www.ondisplay.com) in San Ramon, Calif., which he hopes can scrape data off those cyberspace catalogs and download it into a spreadsheet.

But no one claims yet to have solved the other piece of the puzzle - building a "context mediation engine" that can make information from many sources read the same way, such as translating all measurements to inches instead of feet and centimeters or putting dates into a single day/month/year order.
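The date half of that task can be sketched in a few lines of Python. This is only an illustration of the idea of recording each source's context and translating into the receiver's; the source labels and the function name `mediate_date` are hypothetical, and nothing here reflects how the MIT engine is actually built.

```python
from datetime import datetime

# Recorded context: the field order each (hypothetical) source uses for dates.
SOURCE_DATE_FORMATS = {
    "us_source": "%m-%d-%y",         # month-day-year
    "european_source": "%d-%m-%y",   # day-month-year
    "year_first_source": "%y-%m-%d", # year-month-day
}


def mediate_date(raw, source, receiver_format="%d/%m/%Y"):
    """Parse `raw` using the source's recorded context, then emit the
    receiver's single day/month/year form."""
    parsed = datetime.strptime(raw, SOURCE_DATE_FORMATS[source])
    return parsed.strftime(receiver_format)


# The same string means three different days, depending on where it came from.
for source in SOURCE_DATE_FORMATS:
    print(source, mediate_date("01-03-05", source))
```

Without the recorded context, the string 01-03-05 alone cannot be resolved; with it, the translation is mechanical.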

Madnick's MIT team has been working on such an engine and has even created a third version, but he isn't ready to take it outside the laboratory. But he is willing to provide an online demonstration for readers who contact him at smadnick@MIT.edu.

In the meantime, here are some steps anyone can take:

* Educate: Alert users to data context issues.

* Prioritize: On data warehousing projects, seek agreement on the most critical terms and data, but think hard about whether every data point deserves the effort, advises Dale Goodhue, an assistant professor at the University of Georgia in Athens who is studying data warehousing efforts.

* Be skeptical about data: "Anybody who completely relies on a computer report to make a decision needs to have their brains checked," says data warehousing expert Shaku Atre, president of Atre Associates, Inc. in Port Chester, N.Y.

It's always good to not place too much trust in computers, but the fact remains that a system that can't be trusted won't be used. That's why the data context problem is likely to slow down the use of intelligent agents and global data warehouses.

Alter is Computerworld's senior editor, Managing, and editor of the Leadership series. His E-mail address is [email protected].

A group led by MIT Professor Stuart Madnick (at left) is building a "context mediation engine" that can make information from many sources read the same way, such as translating all measurements to inches and feet instead of centimeters. In this example, a price for a particular product in British pounds and Japanese yen is run through the engine and comes out in dollars.

[Figure: Context sources and receivers. Source 1 quotes a price of 12.95 pounds (Brit.) for quantity 1; Source 2 quotes a price of 20,000 yen (Japan) for quantity 12. Both pass through the context mediation engine, which reports the price in dollars for quantity 1: Source 1, $19.43; Source 2, $16.67. Sources can be Web pages or databases.]
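The arithmetic behind the figure can be sketched as follows. Two things are assumptions rather than facts from the article: the source prices are read from the damaged figure as 12.95 pounds and 20,000 yen, and the exchange rates (1.50 dollars per pound, 0.01 dollars per yen) are illustrative values chosen because they reproduce the dollar prices the figure shows. The function name `to_receiver_context` is likewise hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed exchange rates in US dollars per unit; not given in the article.
USD_PER_UNIT = {"GBP": Decimal("1.50"), "JPY": Decimal("0.01")}

# Each source's context: the currency it quotes in and the quantity the
# quoted price covers (values as read from the figure, see lead-in).
SOURCES = {
    1: {"price": Decimal("12.95"), "currency": "GBP", "quantity": 1},
    2: {"price": Decimal("20000"), "currency": "JPY", "quantity": 12},
}


def to_receiver_context(record):
    """Return the price in US dollars for a quantity of one."""
    unit_price = record["price"] / record["quantity"]
    dollars = unit_price * USD_PER_UNIT[record["currency"]]
    return dollars.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


for source_id, record in SOURCES.items():
    print(source_id, to_receiver_context(record))
```

Note that two context differences are resolved at once: the currency and the quantity the price refers to. Comparing the raw numbers 12.95 and 20,000 would say nothing about which source is cheaper.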







http://www2.computerworld.com/home...ne9697.nsf/All/970915maninterview

MANAGING: OUT OF CONTEXT

INTERVIEW WITH STUART MADNICK

Stuart Madnick is a professor at the Massachusetts Institute of Technology and a long-time researcher into data quality issues, including data context. Here are excerpts from a recent interview with Allan Alter, Computerworld's senior editor for Managing:

CW: How important is the data context problem?

MADNICK: It depends on your business goals. I heard an executive say his company is moving from being multinational to being global. What does that mean? Before they operated as 20 autonomous divisions; now they're to behave as an integrated whole. If that's the kind of goal your organization has, being able to integrate information effectively is critical.


9-15-97 Managing I Interview with Stuart Madnick


CW: How does context differ from data dictionaries?

MADNICK: It's different in two ways.

First, traditional data dictionaries didn't really address this issue. Data dictionaries were meant to deal with things like field names (sales, profits, etc.), field sizes (2, 4, 8 digits or letters) and data types (integers vs. floating points). The meaning of a field was always left, at best, as a comment.

All this has to do with what name you give the field, not with what that field means. For example, in England you may call a field "turnover," which is what we in the U.S.A. call "sales." But whatever you call it, the field name still doesn't deal with the issue of what "sales" or "turnover" actually means. For example, U.S. petroleum companies often include excise taxes that they collect as part of sales.

Second, traditional data dictionaries were primarily intended for use by programmers, to know what names to use in their program. But context information has a broader usage. It is valuable information for end users as well, because it helps them understand the information in their systems and can automate the translation of the information to correspond to the meanings they wish.

The 1805 Overture

The data context problem goes back a long, long time. Consider this:

In 1805, the Austrian and Russian Emperors agreed to join forces against Napoleon. The Russians promised that their forces would be in the field in Bavaria by Oct. 20. The Austrian staff planned its campaign based on that date in the Gregorian calendar. Russia, however, still used the ancient Julian calendar, which lagged 10 days behind. The calendar difference allowed Napoleon to surround Austrian General Mack's army at Ulm and force its surrender on Oct. 21, well before the Russian forces could reach him, ultimately setting the stage for Austerlitz.

Source: David Chandler, The Campaigns of Napoleon, New York: MacMillan 1966, pg. 390
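The calendar mismatch in the 1805 sidebar is itself a context translation that can be written down. The Python sketch below uses the standard century rule for the gap between the Julian and Gregorian calendars; the function names are hypothetical. By that rule the gap in 1805 works out to 12 days rather than the 10 the sidebar reports, but the point is the same either way: identical numerals named different days.

```python
from datetime import date, timedelta


def julian_gregorian_gap(year):
    """Days by which the Julian calendar trails the Gregorian calendar.

    Standard century rule; valid from March of `year` onward.
    """
    return year // 100 - year // 400 - 2


def julian_to_gregorian(year, month, day):
    """Translate a Julian-calendar date into the Gregorian calendar.

    Simplified sketch: intended for dates from March onward, where the
    Julian numerals are also a valid Gregorian date to shift forward.
    """
    return date(year, month, day) + timedelta(days=julian_gregorian_gap(year))


# "Oct. 20" as promised in the Julian calendar, read in the Gregorian one.
print(julian_to_gregorian(1805, 10, 20))
```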

CW: How does the Web make context a more relevant issue?

MADNICK: In the past, it may not have been a viable goal to be a global corporation. Now it is both a desirable and viable goal. The Web gives you a common communication infrastructure.

CW: Can't you avoid the context problem by just posting better queries to search engines like Alta Vista?

MADNICK: Alta Vista returns you pages, not data. If I want to know how much snow there is at ski resorts in the area, or what movies are playing in your area, what I get is a page that may contain that information. That's problem No. 1. The more serious problem is that more and more Web sites are dynamic Web sites. When you check your account with Fidelity or want to know GM's current stock price, those pages are created dynamically on demand. Alta Vista only finds pages that statically exist.

CW: Is the hardware on people's desktops up to the job of running the Web-wrapping software you have in mind?

MADNICK: Yes it is. Our software currently runs on a small Sun server. We are developing a version to run on Windows NT and eventually Windows 95.

CW: Won't this capability be available in future versions of HTML?

MADNICK: It should eventually. But will it emerge on its own now or a decade in the future? Also, HTML is only part of the solution. You also need intelligent engines, such as our automatic Web wrapper and context mediator, to interpret such an enhanced HTML. That is why our research is so important - to help accelerate the process.

CW: Can Web wrappers be used to glean information that's being "pushed," as well as posted?

MADNICK: Yes. We haven't used it that way, but there's no reason why it couldn't be.


CW: Is it really possible to "cleanse" or "translate" data from the Web so that the context is clear?

MADNICK: It's doable in many cases. The real challenge is whether it's 10% of the cases or 90% of the cases. The purpose of the experiments with Merrill Lynch or Primark is to see how far it can be pushed.... Our experience so far is that it can be done in the vast majority of the cases.

CW: Is context mediation a job for computers, or is it really a job for people?

MADNICK: I think it's a leveraging issue. Supposedly, AT&T once looked at how many telephone operators would be needed in the country given the projected growth in telephony. They found that someday they'd need half of the U.S. population to work as telephone operators. If your goal is full employment, that's a great strategy. But it's not such a good idea to become a nation of telephone operators. Similarly, when you think of supply chain management, data warehouses, globalization, and how many people it will take to do this work manually, do you really want to devote that many people to it? I believe there are more productive things people can do.


CW: Is data context a problem among companies sharing an EDI network or an extranet? After all, won't everybody in that community have to share the same data definitions and formats to be participants?

MADNICK: In an ideal situation, you might be right - if the internal systems used by all the parties were consistent in their data meanings. But in reality, if you look at most EDI situations, each organization has its own internal meanings which only occasionally match with others. In reality, most companies perform EDI by building custom EDI translators to convert their internal data to the EDI exchange format.

A good example that I've heard of is an automotive supply company that services GM, Ford and Chrysler. Each auto manufacturer has their own EDI standards, so this supplier must convert its information to handle three different EDI conventions.

And in general, EDI only standardizes transaction information. When you start getting into close cooperation such as in supply chain management, you need to know things such as production levels and schedules from your suppliers. This type of information has usually not been standardized in normal EDI applications.

CW: And extranets?

MADNICK: This is usually different from EDI because most extranets are for use by people. With EDI, the programs directly communicate with each other, so all the understanding must be built into the programs. With extranets, the burden of extracting and interpreting information is left to the user, which could be a time-consuming and error-prone process.
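The custom translators Madnick describes for the supplier serving three automakers can be pictured with a small Python sketch. The three partner conventions below are invented for illustration (field names, date formats, counting in dozens); they are not the actual EDI formats of any manufacturer, and real EDI messages are far richer.

```python
from datetime import date

# One order record in the supplier's own internal conventions (made-up data).
internal = {"part_no": "BRK-100", "quantity_each": 480, "ship_date": date(1997, 9, 15)}


# Hypothetical partner conventions, one translator per customer.
def to_partner_a(rec):
    # Upper-case field names, US-style two-digit-year dates.
    return {"PART": rec["part_no"], "QTY": rec["quantity_each"],
            "SHIP": rec["ship_date"].strftime("%m/%d/%y")}


def to_partner_b(rec):
    # Counts in dozens, day-first dates.
    return {"item": rec["part_no"], "dozens": rec["quantity_each"] // 12,
            "ship": rec["ship_date"].strftime("%d.%m.%Y")}


def to_partner_c(rec):
    # Part numbers without the dash, ISO-style dates.
    return {"sku": rec["part_no"].replace("-", ""), "units": rec["quantity_each"],
            "ship": rec["ship_date"].isoformat()}


TRANSLATORS = {"A": to_partner_a, "B": to_partner_b, "C": to_partner_c}
messages = {name: translate(internal) for name, translate in TRANSLATORS.items()}
for name, message in messages.items():
    print(name, message)
```

Each new trading partner adds another hand-written translator, which is the maintenance burden a shared record of context is meant to reduce.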

For More Information...

For more information about the context problem in information systems, see MIT's Context Interchange Project, and "What's the Meaning of This?" in the Oct. 17, 1994 issue of Computerworld.

MIT professor Stuart Madnick is also willing to provide an online demonstration of his context mediation engine for Computerworld readers who contact him at [email protected].


http://www2.computerworld.com/home...w/CBE1F0C4CEA5E26A852564DF00814B16

(06/20/97 12:00:00 AM)

MIT professor works to create a New Web Order

Allan E. Alter

Dreaming of the day when intelligent agents will roam the World Wide Web and report back with useful information to you? When some software spy will tell you what banks in town are offering mortgage rates at 8% or which ski resorts near you have more than 5 feet of snow on the ground? Unfortunately, so-called "intelligent" agents may merely deliver a new Tower of Babel until technologists find answers to two problems: how to extract data from countless differently designed, ever-changing pages out on the Web, and how to make sense of all the information gathered there.

So said MIT professor Stuart E. Madnick, who spoke this week at an annual conference for IS managers held by MIT's Center for Information Systems Research. Madnick, who has spent years researching this area, is now beta-testing technologies that can help overcome these problems.

Context has always been a problem for database designers, Madnick said. A date expressed by the numbers 01-03-05 could mean Jan. 3, 2005, March 1, 2005 or March 5, 2001, depending on local custom. Different departments in the same organization may define words such as "sale" differently. And the lingo used at separate but merging companies can make the differences between French and Urdu look trivial. [For more information about the "context" problem in information systems, see the MIT site on the Context Interchange Project and "What's the Meaning of This?" CW, Oct. 17, 1994.]

Integrating data from different sources has been an annoying but relatively workable issue. Until now, most companies were attempting to integrate only their own internal data. But with the Web taking off like a rocket, and the potential for linking global sources of information growing along with it, this gnat could turn into a 10-ton mosquito.

For example, Madnick said, a request for local ski conditions could deliver a hodgepodge of screen scraps that don't tell you if snowfall is measured in inches, centimeters or feet, or whether the slopes are in Vermont or Vail, Colo.

On the other hand, solving this problem could provide an opportunity. For example, Madnick said, a mail-order company could create a program that queries all orders it ships out, ascertains their status by checking couriers' package-tracking Web sites and automatically generates a letter of apology to customers whose packages haven't been delivered on time.

So how do we get there from here? Getting every Web page content developer on the planet to standardize the language they use for their sites is impossible. Instead, "our idea is to find a way to record the context [surrounding information]," said Madnick. Madnick and a team of MIT researchers have developed technology for solving one part of the problem -- extracting data from Web sites. He calls it a "Web wrapper generator." This middleware resides on the user's server and allows you to treat the Web like a giant database. Users can post a SQL query and get back a database record or spreadsheet containing the information they want. The generator includes a "Web page spec file" that provides a "schema" (what the database structure should look like), a "page transition" (which tells the engine how many pages to go to on a site to satisfy the query), and "extraction rules" (guidelines for locating the information on the Web page that's fed back to you).
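To make the three parts of a spec file concrete, here is a minimal Python sketch of the general idea, not MIT's actual wrapper or file format. The `SPEC` dictionary, its keys, the regular expression, the `snow` table and the in-memory `FAKE_SITE` pages are all hypothetical stand-ins chosen for illustration; a real wrapper would fetch live pages and cope with far messier markup.

```python
import re
import sqlite3

# Hypothetical spec: the keys and pattern below are illustrative assumptions,
# not the format of MIT's actual "Web page spec file".
SPEC = {
    # Schema: what the resulting database table should look like.
    "schema": [("resort", "TEXT"), ("snow_depth", "REAL"), ("unit", "TEXT")],
    # Page transition: the ordered list of pages to visit for this query.
    "page_transition": ["/resorts/index.html", "/resorts/page2.html"],
    # Extraction rule: one regex with a named group per schema column.
    "extraction_rule": re.compile(
        r"<li>(?P<resort>[^:<]+):\s*(?P<snow_depth>[\d.]+)\s*(?P<unit>in|cm|ft)</li>"
    ),
}

# Stand-in for the live Web so the example runs offline.
FAKE_SITE = {
    "/resorts/index.html": "<ul><li>Stowe: 48 in</li><li>Killington: 150 cm</li></ul>",
    "/resorts/page2.html": "<ul><li>Vail: 6 ft</li></ul>",
}


def fetch(path):
    """Return the page body for a path (here, from the in-memory fake site)."""
    return FAKE_SITE[path]


def wrap(spec):
    """Build an in-memory SQL table from the pages described by the spec."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f"{name} {sqltype}" for name, sqltype in spec["schema"])
    conn.execute(f"CREATE TABLE snow ({columns})")
    names = [name for name, _ in spec["schema"]]
    placeholders = ", ".join("?" for _ in names)
    for path in spec["page_transition"]:
        for match in spec["extraction_rule"].finditer(fetch(path)):
            row = [match.group(name) for name in names]
            conn.execute(f"INSERT INTO snow VALUES ({placeholders})", row)
    return conn


conn = wrap(SPEC)
# The user now queries the scraped pages as if they were a database table.
rows = conn.execute(
    "SELECT resort, snow_depth, unit FROM snow WHERE unit = 'ft'"
).fetchall()
print(rows)
```

Note that the rows still carry whatever unit each page happened to use. The wrapper only extracts; reconciling inches, centimeters and feet is the separate context problem.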

Madnick said a beta of MIT's web wrapper generator is being tested by Primark Corp., a financial and broadcast/publishing information services company in Waltham, Mass. Primark is using it to augment its internal information sources with Web-based sources, such as the Edgar site for Securities and Exchange Commission filings. At least one privately held firm, Alpha Microsystems in Santa Ana, Calif., is also developing this technology, according to Madnick.

The other piece of the puzzle -- building a "context mediation engine" that can make information from many sources read the same way, by, for example, translating all measurements to inches, or putting all dates in day/month/year order -- is also under way at MIT. A third version of the engine has now been created, "but it's a much more complex technology, so it's harder to put into user hands."
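The two translations mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea of recording each source's context next to its data, not a description of how MIT's engine works. The `SOURCES` records, the `context` keys, the function names and the assumption that two-digit years fall in the 2000s are all hypothetical.

```python
from datetime import date

# Conversion factors to inches (standard unit definitions).
INCHES_PER_UNIT = {"in": 1.0, "ft": 12.0, "cm": 1 / 2.54}


def to_inches(value, unit):
    """Convert a measurement in the given unit to inches."""
    return value * INCHES_PER_UNIT[unit]


def parse_date(text, order, century=2000):
    """Interpret a two-digit 'xx-xx-xx' date using the source's declared field order.

    `order` is a string such as "MDY", "DMY" or "YMD" naming each field.
    The century is an assumption; the text alone does not say.
    """
    fields = dict(zip(order, (int(part) for part in text.split("-"))))
    return date(century + fields["Y"], fields["M"], fields["D"])


def to_day_month_year(d):
    """Render a date in day/month/year order."""
    return d.strftime("%d/%m/%Y")


# Each hypothetical source declares its own context alongside its data.
SOURCES = [
    {"context": {"unit": "cm", "date_order": "DMY"}, "snow": 150.0, "as_of": "01-03-05"},
    {"context": {"unit": "ft", "date_order": "MDY"}, "snow": 5.0, "as_of": "01-03-05"},
    {"context": {"unit": "in", "date_order": "YMD"}, "snow": 48.0, "as_of": "01-03-05"},
]


def mediate(record):
    """Translate one source record into the receiver's context."""
    ctx = record["context"]
    return {
        "snow_inches": round(to_inches(record["snow"], ctx["unit"]), 1),
        "as_of": to_day_month_year(parse_date(record["as_of"], ctx["date_order"])),
    }


mediated = [mediate(r) for r in SOURCES]
for row in mediated:
    print(row)
```

All three sources report the same string, 01-03-05, yet each resolves to a different calendar date once its declared field order is applied, which is the ambiguity Madnick describes.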
