TRANSCRIPT
This document explains the Canadian version of the NER model, a system for
measuring the accuracy of live TV captions.
NER is used around the world, and it’s likely to be proposed as part of a new
caption accuracy standard in Canada, so English Canadian broadcasters want their
caption viewers to know about it.
Some people think that captions are made by computer programs, but such
programs are still not as accurate as a human. So humans with talent, training and
experience act as live captioners.
The captions they provide for live TV shows simply can’t be perfect, word-for-word transcriptions, as they are in pre-recorded shows, because in live programs:
• People may speak very quickly, so verbatim captions may be unreadable.
• People talk over top of each other in discussions.
• Spoken English is often ungrammatical and makes poor written English.
Captioners provide verbatim captions wherever possible, but when it isn’t
possible, good live captioners paraphrase what they hear, and capture the
meaning if not the exact wording. So measuring accuracy means measuring the
transmission of meaning.
Our current standard for accuracy only compares words between captions and
verbatim – and every mismatch is a mistake whether it’s meaningful or not. The
NER model is designed to measure changes in meaning, which is, after all, what
captions are trying to deliver.
NER compares the caption viewer's experience and the hearing viewer’s
experience. Higher scores mean the caption viewer’s experience is close to the
hearing viewer’s. Assessing differences in meaning requires training in NER
evaluation.
The trained NER evaluator begins by preparing a transcript of the captions and a
transcript of the program audio, then comparing them, looking for differences.
Where they differ, the evaluator has to decide:
• Has any meaning been changed or lost?
• Has the caption reading experience been interrupted by an error in a word?
If so, the evaluator notes a score deduction, which varies with the kind of error.
Training helps evaluators be consistent with each other, so the subjective
differences between them are minimized. Over time, the improved Guidelines
and the experience of the evaluators have produced a high level of agreement
between different evaluators, which gives us confidence in the scores.
There are four NER score types that have to do with meaning.
The first of these is called Correct Edition, which is scored when the captions
capture the full meaning of the verbatim with no wrong words that might confuse
or interrupt the reading process.
At the other extreme are captions that transmit false information, like a budget
cut of $15 million, when the audio said $50 million. The caption viewer has no
way to know this is not true, so it’s treated severely with a deduction of a full
point.
The next most serious error is the “Omission of Main Meaning”, in which a
complete thought is dropped or incomprehensible, so the caption viewer has no
idea what was said.
Similar, but less onerous, is the “Omission of Detail”, in which a modifying idea
has been lost.
If the captions accurately transmit a fire report but miss the time it started, this
would typically be scored as an OD.
In addition to meaning errors, NER deducts for captions that cause the caption
viewer to interrupt their reading.
A Benign Error can occur when the caption viewer gets the meaning but is
interrupted, because they must work out a misspelled but understandable word,
or notice a missing question mark, for example.
The Nonsense Error is scored when a word or phrase is garbled so it can't be
understood, but the meaning of the idea still gets through. This is not a common
occurrence, since the Nonsense word usually affects the meaning too.
Let’s illustrate with examples:
This is an evaluator’s computer screen with the verbatim and caption transcripts
of a live baseball game.
On the right side of the screen, the text is the word-for-word verbatim transcript
of what was said - all the words, plus punctuation where possible, and words or
symbols in braces, to represent music, applause, etc. This represents the hearing
viewer’s experience in written English.
To the left of that is the transcript of the captions. They have been lined up beside
the spoken words they represent so NERScore can compare them and show
words added or missing from the verbatim in the Word Insertions and Omissions
columns.
On the left you can see a set of six coloured columns – where the evaluator will
note errors.
On the far left the program calculates the deductions so the NER score can be
calculated.
And to the right of the verbatim transcript, you see the evaluator’s comment on
their score. It’s often useful to see the evaluator’s reasoning.
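The idea behind the Insertions and Omissions columns can be sketched with a standard word-by-word diff. This is only an illustration of the concept, not the NERScore tool itself, and the two sentences compared here are invented for the example:

```python
# Illustrative sketch only -- not the NERScore tool. It aligns a caption
# transcript against a verbatim transcript word by word, and lists the words
# the captioner dropped or added, as the Omissions and Insertions columns do.
from difflib import SequenceMatcher

verbatim = "the pitcher has an earned run average of 23.14".split()
captions = "the pitcher has earned run average of 21.34".split()

matcher = SequenceMatcher(a=verbatim, b=captions)
omissions, insertions = [], []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag in ("delete", "replace"):
        omissions.extend(verbatim[i1:i2])   # in the audio, not the captions
    if tag in ("insert", "replace"):
        insertions.extend(captions[j1:j2])  # in the captions, not the audio

print("Omissions:", omissions)
print("Insertions:", insertions)
```

A diff like this only finds the word-level mismatches; deciding whether each mismatch changes meaning is the evaluator’s trained judgment.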
So let’s get into the errors.
First, FIE or false information error.
At the end of the file, as seen above, there is a simple, clear example – the
announcer said the pitcher had an earned run average of 23.14 and the captions
said 21.34. So here, the caption reader was told a clear statement that wasn’t
true, and had no way to know it – the evaluator notes that there is no visual to
correct the misinformation.
Information in the program that corrects the FIE would matter in scoring, but
under our Guidelines it doesn’t matter that the error is small, or, to some,
unimportant, or that the viewer might have background knowledge about it. It’s
wrong, and scored as false information: the evaluator enters 1 in the column for
FIEs, and the program calculates a full point deduction.
In Row 89 above, there’s an Omission of Main Meaning.
The announcer had been talking about two players – Jackie Robinson and Larry
Doby, but in this row, the name of Larry Doby is omitted from the captions, so the
reader can’t know who was the first African American to play in the American
League.
The announcer’s thought hasn’t been changed – so it’s not FIE – it’s just been
omitted. The evaluator went with OMM rather than an Omission of Detail,
because the reader has lost the main point of the statement with the loss of the
name. More typically, an OMM occurs when an entire sentence is dropped, but
NER is all about meaning, and the full meaning of the idea has been omitted here
because the name was crucial.
So, .5 deduction.
Row 80 above shows an Omission of Detail when the captions do not give the first
name of the player, even though it’s in the audio. The caption reader can’t know
the player’s first name but the hearing viewer does, so .25 deduction.
An error will still be scored when meaning is intact, if a wrong word caused an
interruption in the viewer’s reading.
In Lines 120 and 121 above, a player is given an intentional walk and will go to first
base, but the captions put this in the past tense – the reading viewer can
understand this with a moment’s thought, but it causes an interruption in the
reading experience while it is figured out.
That’s a Benign Error, or BE, and it gets a .25 deduction.
In Line 86 above, the evaluator has assessed two BEs.
One of them is for an extra comma that the captioner added, which is not in the
verbatim. Meaning is not affected, but reading is interrupted briefly. A second BE
is assessed for the misspelling of Larry Doby’s name. He’s been mentioned before
and the correct spelling is visible in the video, so it’s only a BE.
The final error type is NE, for when a word can’t be understood but meaning is
not affected.
In real-world evaluations this doesn’t happen often because a garbled word
usually affects meaning, causing an Omission of Main Meaning or Omission of
Detail.
So, to create an example to show you, in line 92 above there’s a BE – where the
word HI has been added before HITTER.
Suppose that was something completely incomprehensible like XIT, not a word,
and not recognizably part of a word. It’s a substantial interruption because it can’t
be understood, it’s meaningless, but the words PINCH HITTER surrounding it get
the meaning of the idea across.
The deduction for BE was .25, but for NE it is .5 since the reading interruption is
longer.
The final score type used in NER is the Correct Edition.
This is scored when the captions differ from the verbatim, but paraphrase the
verbatim correctly. The experience received by the caption-reading viewer is
equivalent to that of the hearing viewer, and there is no deduction.
In line 94 above, the captioner added a verb to make the written English more
understandable.
In lines 117 and following, the captioner altered contractions, changed “YOUR” to
“THE” and dropped the word EVEN. These changes didn’t affect the full meaning
of the audio, so CORRECT EDITION is the score.
Sometimes a correct edition can be more substantial, as in a discussion program
where the captioner must try to sort out competing voices or make long, seriously
ungrammatical statements comprehensible.
This is one of the strong points of the NER model – in the earlier measurement
system, every different word would be scored as an error, regardless of its
meaning.
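Pulling the score types together, the deductions described above can be summarized in a small table. This sketch simply restates the values given in this document; the code names and structure are my own:

```python
# Score types and their deductions, as described above -- a summary sketch,
# not an official NER reference. Correct Edition carries no deduction.
DEDUCTIONS = {
    "CE":  0.0,   # Correct Edition: a paraphrase that keeps the full meaning
    "FIE": 1.0,   # False Information Error: captions state something untrue
    "OMM": 0.5,   # Omission of Main Meaning: a complete thought is lost
    "OD":  0.25,  # Omission of Detail: a modifying idea is lost
    "BE":  0.25,  # Benign Error: meaning intact, brief reading interruption
    "NE":  0.5,   # Nonsense Error: garbled word, but the meaning gets through
}

def total_deduction(errors):
    """Sum the deductions for a list of error codes."""
    return sum(DEDUCTIONS[code] for code in errors)

# For example, one FIE, one OMM and two BEs deduct 2.0 points in total.
print(total_deduction(["FIE", "OMM", "BE", "BE"]))  # 2.0
```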
Speaker Identification in captions is important, because the hearing viewer may
understand who is speaking from voice quality, an accent, or other information
that is not put into words. So, when preparing the verbatim transcript, the
evaluator supplies this non-verbal information with names or IDs and speaker
change symbols to equal the hearing viewer’s understanding.
In line 120 above, the captioner missed the speaker change to Buck, but caught
up with it two lines later.
It was scored as an Omission of Detail, because the two announcers were saying
the same thing, but the evaluator notes that it could have been a more serious
error if the two commentators were disagreeing. A mis-identified speaker ID can
create an FIE – if, for example, an opinion is attributed to a politician who actually
said the opposite.
With all of the errors entered, the NER score is calculated as (words minus error
deductions) divided by (words), expressed as a percentage. If there were 2000
words in the captions, and the error deductions totalled 40, the NER score would
be 1960 divided by 2000, which calculates to 98.0.
The NER formula almost always produces a number between 96 and 100, which
can make the score look quite high, even when the captions are poor in accuracy.
We therefore tend to convert the score into a verbal rating, which you can see
here. Only scores over 99.5 are "excellent", while those below 98.0 are "poor" in
accuracy.
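The calculation and the verbal rating can be sketched in a few lines. Only the 99.5 and 98.0 thresholds come from this document; the label for the middle band is a placeholder of my own:

```python
# A sketch of the NER score and its verbal rating. The 99.5 and 98.0
# thresholds come from the text; "acceptable" is an assumed label for the
# band in between, which the text does not name.
def ner_score(words, total_deductions):
    """(words - deductions) / words, expressed as a percentage."""
    return (words - total_deductions) / words * 100

def verbal_rating(score):
    if score > 99.5:
        return "excellent"
    if score < 98.0:
        return "poor"
    return "acceptable"  # assumed label for the middle band

score = ner_score(2000, 40)  # the 2000-word, 40-deduction example above
print(score, verbal_rating(score))  # 98.0 acceptable
```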
Moreover, the score is not everything. The NER model requires the evaluator to
add a verbal comment. The evaluator can note particularly difficult passages that
are hard for the captioner, or technical problems that may have interfered with
caption accuracy. That can help us to understand why a particular program
segment fell below the 98.0 number.
Finally, we note that four English Canadian broadcasters and the four members of
the Captioning Consumer Advocacy Alliance have been meeting for several years
to discuss accuracy, with caption providers, educators, and the CRTC.
The broadcasters also conducted a Trial of NER. Each broadcaster monitored and
evaluated two live programs a month, with two separate evaluations of each. The
reports of the Trial are posted on the CRTC website and also on the website
NERTrial.com.
As a result, we can be confident that NER produces reasonably objective results,
and that NER scores align well with a subjective assessment of caption quality -
also noted in the CCAA’s own research, which indicates that NER can be a useful
measurement system.
If you are interested in learning more, PDF printouts of program scoresheets from
the Trial are posted for you to look at on the website, along with an email link
where you can send your feedback and comments; we would be very interested
to read your reactions.
Hope this has been helpful.