TRANSCRIPT
This document explains the Canadian version of the NER model, a system for
measuring the accuracy of live TV captions.
NER is used around the world, and it’s likely to be proposed as part of a new
caption accuracy standard in Canada, so English Canadian broadcasters want their
caption viewers to know about it.
Some people think that captions are made by computer programs, but such
programs are still not as accurate as a human. So humans with talent, training and
experience act as live captioners.
The captions they provide for live TV shows simply can’t be perfect, word-for-word transcriptions, as they are in pre-recorded shows, because in live programs:
• People may speak very quickly, so verbatim captions may be unreadable.
• People talk over top of each other in discussions.
• Spoken English is often ungrammatical and makes poor written English.
Captioners provide verbatim captions wherever possible, but when it isn’t
possible, good live captioners paraphrase what they hear, and capture the
meaning if not the exact wording. So measuring accuracy means measuring the
transmission of meaning.
Our current standard for accuracy only compares words between captions and
verbatim – and every mismatch is a mistake whether it’s meaningful or not. The
NER model is designed to measure changes in meaning, which is, after all, what
captions are trying to deliver.
NER compares the caption viewer's experience and the hearing viewer’s
experience. Higher scores mean the caption viewer’s experience is close to the
hearing viewer’s. Assessing differences in meaning requires training in NER
evaluation.
The trained NER evaluator begins by preparing a transcript of the captions and a
transcript of the program audio, then comparing them, looking for differences.
Where they differ, the evaluator has to decide:
• Has any meaning been changed or lost?
• Has the caption reading experience been interrupted by an error in a word?
If so, the evaluator notes a score deduction, which varies with the kind of error.
Training helps evaluators be consistent with each other, so the subjective
differences between them are minimized. Over time, the improved Guidelines
and the experience of the evaluators have produced a high level of agreement
between different evaluators, which gives us confidence in the scores.
There are four NER score types that have to do with meaning.
The first of these is called Correct Edition, which is scored when the captions
capture the full meaning of the verbatim with no wrong words that might confuse
or interrupt the reading process.
At the other extreme are captions that transmit false information, like a budget
cut of $15 million, when the audio said $50 million. The caption viewer has no
way to know this is not true, so it’s treated severely with a deduction of a full
point.
The next most serious error is the “Omission of Main Meaning”, in which a
complete thought is dropped or incomprehensible, so the caption viewer has no
idea what was said.
Similar, but less onerous, is the “Omission of Detail”, in which a modifying idea
has been lost.
If the captions accurately transmit a fire report but miss the time it started, this
would typically be scored as an OD.
In addition to meaning errors, NER deducts for captions that cause the caption
viewer to interrupt their reading.
A Benign Error can occur when the caption viewer gets the meaning but is
interrupted, because they must work out a misspelled but understandable word,
or notice a missing question mark, for example.
The Nonsense Error is scored when a word or phrase is garbled so it can't be
understood, but the meaning of the idea still gets through. This is not a common
occurrence, since the Nonsense word usually affects the meaning too.
Let’s illustrate with examples:
This is an evaluator’s computer screen with the verbatim and caption transcripts
of a live baseball game.
On the right side of the screen, the text is the word-for-word verbatim transcript
of what was said - all the words, plus punctuation where possible, and words or
symbols in braces, to represent music, applause, etc. This represents the hearing
viewer’s experience in written English.
To the left of that is the transcript of the captions. They have been lined up beside
the spoken words they represent so NERScore can compare them and show
words added or missing from the verbatim in the Word Insertions and Omissions
columns.
On the left you can see a set of six coloured columns – where the evaluator will
note errors.
On the far left the program calculates the deductions so the NER score can be
calculated.
And to the right of the verbatim transcript, you see the evaluator’s comment on
their score. It’s often useful to see the evaluator’s reasoning.
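The idea behind the Insertions and Omissions columns can be sketched with a standard word-by-word diff. This is only an illustration of the concept, not the NERScore tool itself, and the two sentences compared here are invented for the example:

```python
# Illustrative sketch only -- not the NERScore tool. It aligns a caption
# transcript against a verbatim transcript word by word, and lists the words
# the captioner dropped or added, as the Omissions and Insertions columns do.
from difflib import SequenceMatcher

verbatim = "the pitcher has an earned run average of 23.14".split()
captions = "the pitcher has earned run average of 21.34".split()

matcher = SequenceMatcher(a=verbatim, b=captions)
omissions, insertions = [], []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag in ("delete", "replace"):
        omissions.extend(verbatim[i1:i2])   # in the audio, not the captions
    if tag in ("insert", "replace"):
        insertions.extend(captions[j1:j2])  # in the captions, not the audio

print("Omissions:", omissions)
print("Insertions:", insertions)
```

A diff like this only finds the word-level mismatches; deciding whether each mismatch changes meaning is the evaluator’s trained judgment.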
So let’s get into the errors.
First, FIE or false information error.
At the end of the file, as seen above, there is a simple, clear example – the
announcer said the pitcher had an earned run average of 23.14 and the captions
said 21.34. So here, the caption reader was told a clear statement that wasn’t
true, and had no way to know it – the evaluator notes that there is no visual to
correct the misinformation.
Information in the program that corrects the FIE would matter in scoring, but
under our Guidelines it doesn’t matter that the error is small, or, to some,
unimportant, or that the viewer might have background knowledge about it. It’s
wrong, and scored as false information: the evaluator enters 1 in the column for
FIEs, and the program calculates a full point deduction.
In Row 89 above, there’s an Omission of Main Meaning.
The announcer had been talking about two players – Jackie Robinson and Larry
Doby, but in this row, the name of Larry Doby is omitted from the captions, so the
reader can’t know who was the first African American to play in the American
League.
The announcer’s thought hasn’t been changed – so it’s not FIE – it’s just been
omitted. The evaluator went with OMM rather than an Omission of Detail,
because the reader has lost the main point of the statement with the loss of the
name. More typically, an OMM occurs when an entire sentence is dropped, but
NER is all about meaning, and the full meaning of the idea has been omitted here
because the name was crucial.
So, .5 deduction.
Row 80 above shows an Omission of Detail when the captions do not give the first
name of the player, even though it’s in the audio. The caption reader can’t know
the player’s first name but the hearing viewer does, so .25 deduction.
An error will still be scored when meaning is intact, if a wrong word caused an
interruption in the viewer’s reading.
In Lines 120 and 121 above, a player is given an intentional walk and will go to first
base, but the captions put this in the past tense – the reading viewer can
understand this with a moment’s thought, but it causes an interruption in the
reading experience while it is figured out.
That’s a Benign Error, or BE, and it gets a .25 deduction.
In Line 86 above, the evaluator has assessed two BEs.
One of them is for an extra comma that the captioner added, which is not in the
verbatim. Meaning is not affected, but reading is interrupted briefly. A second BE
is assessed for the misspelling of Larry Doby’s name. He’s been mentioned before
and the correct spelling is visible in the video, so it’s only a BE.
The final error type is NE, for when a word can’t be understood but meaning is
not affected.
In real-world evaluations this doesn’t happen often because a garbled word
usually affects meaning, causing an Omission of Main Meaning or Omission of
Detail.
So, to create an example to show you, in line 92 above there’s a BE – where the
word HI has been added before HITTER.
Suppose that was something completely incomprehensible like XIT, not a word,
and not recognizably part of a word. It’s a substantial interruption because it can’t
be understood, it’s meaningless, but the words PINCH HITTER surrounding it get
the meaning of the idea across.
The deduction for BE was .25, but for NE it is .5 since the reading interruption is
longer.
The final score type used in NER is the Correct Edition.
This is scored when the captions differ from the verbatim, but paraphrase the
verbatim correctly. The experience received by the caption-reading viewer is
equivalent to that of the hearing viewer, and there is no deduction.
In line 94 above, the captioner added a verb to make the written English more
understandable.
In lines 117 and following, the captioner altered contractions, changed “YOUR” to
“THE” and dropped the word EVEN. These changes didn’t affect the full meaning
of the audio, so CORRECT EDITION is the score.
Sometimes a correct edition can be more substantial, as in a discussion program
where the captioner must try to sort out competing voices or make long, seriously
ungrammatical statements comprehensible.
This is one of the strong points of the NER model – in the earlier measurement
system, every different word would be scored as an error, regardless of its
meaning.
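Pulling the score types together, the deductions described above can be summarized in a small table. This sketch simply restates the values given in this document; the code names and structure are my own:

```python
# Score types and their deductions, as described above -- a summary sketch,
# not an official NER reference. Correct Edition carries no deduction.
DEDUCTIONS = {
    "CE":  0.0,   # Correct Edition: a paraphrase that keeps the full meaning
    "FIE": 1.0,   # False Information Error: captions state something untrue
    "OMM": 0.5,   # Omission of Main Meaning: a complete thought is lost
    "OD":  0.25,  # Omission of Detail: a modifying idea is lost
    "BE":  0.25,  # Benign Error: meaning intact, brief reading interruption
    "NE":  0.5,   # Nonsense Error: garbled word, but the meaning gets through
}

def total_deduction(errors):
    """Sum the deductions for a list of error codes."""
    return sum(DEDUCTIONS[code] for code in errors)

# For example, one FIE, one OMM and two BEs deduct 2.0 points in total.
print(total_deduction(["FIE", "OMM", "BE", "BE"]))  # 2.0
```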
Speaker Identification in captions is important, because the hearing viewer may
understand who is speaking from voice quality, an accent, or other information
that is not put into words. So, when preparing the verbatim transcript, the
evaluator supplies this non-verbal information with names or IDs and speaker
change symbols to equal the hearing viewer’s understanding.
In line 120 above, the captioner missed the speaker change to Buck, but caught
up with it two lines later.
It was scored as an Omission of Detail, because the two announcers were saying
the same thing, but the evaluator notes that it could have been a more serious
error if the two commentators were disagreeing. A mis-identified speaker ID can
create an FIE – if, for example, an opinion is attributed to a politician who actually
said the opposite.
With all of the errors entered, the NER score is calculated as (words minus error
deductions) divided by (words), expressed as a percentage. If there were 2000
words in the captions, and the error deductions totalled 40, the NER score would
be 1960 divided by 2000, which calculates to 98.0.
The NER formula almost always produces a number between 96 and 100, which
can make the score look quite high, even when the captions are poor in accuracy.
We therefore tend to convert the score into a verbal rating, which you can see
here. Only scores over 99.5 are "excellent", while those below 98.0 are "poor" in
accuracy.
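The calculation and the verbal rating can be sketched in a few lines. Only the 99.5 and 98.0 thresholds come from this document; the label for the middle band is a placeholder of my own:

```python
# A sketch of the NER score and its verbal rating. The 99.5 and 98.0
# thresholds come from the text; "acceptable" is an assumed label for the
# band in between, which the text does not name.
def ner_score(words, total_deductions):
    """(words - deductions) / words, expressed as a percentage."""
    return (words - total_deductions) / words * 100

def verbal_rating(score):
    if score > 99.5:
        return "excellent"
    if score < 98.0:
        return "poor"
    return "acceptable"  # assumed label for the middle band

score = ner_score(2000, 40)  # the 2000-word, 40-deduction example above
print(score, verbal_rating(score))  # 98.0 acceptable
```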
Moreover, the score is not everything. The NER model requires the evaluator to
add a verbal comment. The evaluator can note particularly difficult passages that
are hard for the captioner, or technical problems that may have interfered with
caption accuracy. That can help us to understand why a particular program
segment fell below the 98.0 number.
Finally, we note that four English Canadian broadcasters and the four members of
the Captioning Consumer Advocacy Alliance have been meeting for several years
to discuss accuracy, with caption providers, educators, and the CRTC.
The broadcasters also conducted a Trial of NER. Each broadcaster monitored and
evaluated two live programs a month, with two separate evaluations of each. The
reports of the Trial are posted on the CRTC website and also on the website
NERTrial.com.
As a result, we can be confident that NER produces reasonably objective results,
and that NER scores align well with a subjective assessment of caption quality -
also noted in the CCAA’s own research, which indicates that NER can be a useful
measurement system.
If you are interested in learning more, PDF printouts of program scoresheets from
the Trial are posted for you to look at on the website, along with an email link
where you can send your feedback and comments; we would be very interested
to read your reactions.
Hope this has been helpful.