Visual Field Exam Analysis: How Reliable Are the Results?
Interpretation requires a keen awareness of perimetry's many nuances. Here's a look at the factors that affect exam reliability.
By Douglas R. Anderson, MD, FARVO
Making a diagnostic or therapeutic decision using the results of a patient's visual field examination is not as simple as some ophthalmologists might believe. First, the clinician must take into account the reliability of the examination's information along with the other available clinical data. Guidelines for judging the reliability of the visual field examination, as updated here, have changed with experience and new scientific data through the years. Keep in mind that most of my clinical experience has been with the Humphrey Field Analyzers (Carl Zeiss Meditec), and hence I describe features as found on these perimeters, but the principles apply to any perimeter, even if the percentage cut-off values and other criteria for reliability may be somewhat different.
Degrees of Reliability
Sometimes we consider the results either reliable or unreliable, as if they are either helpful or utterly useless. The implication of calling a test result either “reliable” or “unreliable” is that the first could be taken at face value without fear of misleading information (artifacts, or defects due to other diseases, etc.), while the second is useless. More realistic is to consider that results fall on a continuum of degrees and types of reliability and unreliability. A test labeled “reliable” may be misleading, while one considered “unreliable” may still contain sufficient information to establish or rule out certain diseases, or to establish whether the disease has worsened.
The Problem With Either/Or
How did we fall into a tendency to consider tests as either reliable or unreliable, rather than treat reliability as a continuum of dependability related to the ability to discern useful information? It started when information on the range of normal values in eyes without disease was collected, so that abnormality could be recognized. To develop a clean normative database, only data from high-quality visual field examinations was used. Three criteria were selected with educated arbitrariness as the basis for deciding whether a field was of sufficiently high quality:
(1) The Fixation Loss (FL) rate must be under 20% (that is, “catch trial” presentations of a stimulus to the blind spot did not result in a response at least 80% of the time),
(2) False Negative (FN) response rate (failure to respond to a “catch trial” stimulus 9 dB more intense than one previously seen at that location) was under 33%, and
(3) The rate of False Positive (FP) responses (responses when, during a series of stimulus presentations, there was a “catch trial,” a pause in the steady sequence with no stimulus presented) under 33%.
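The three cut-offs above amount to a simple classification rule. As an illustrative sketch only (the function name is mine, and real perimeters report these indices on the printout rather than exposing them programmatically), the traditional criteria could be expressed as:

```python
# Illustrative sketch of the traditional reliability cut-offs described
# above, as used when collecting the Full Threshold normative data.
# Rates are fractions (0.0-1.0); the cut-off values come from the text.

def reliability_flags(fl_rate, fn_rate, fp_rate):
    """Return the list of indices that exceed the traditional limits."""
    flags = []
    if fl_rate >= 0.20:   # Fixation Losses must be under 20%
        flags.append("FL")
    if fn_rate >= 0.33:   # False Negatives must be under 33%
        flags.append("FN")
    if fp_rate >= 0.33:   # False Positives must be under 33%
        flags.append("FP")
    return flags
```

A field that returns an empty list meets all three traditional criteria; as the article goes on to argue, that by itself neither guarantees nor rules out a dependable examination.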
The printout of the results of a field test is labeled unreliable if it fails to meet the criteria that had been used to collect the normal data. (It should be noted that the normal values for the Swedish Interactive Thresholding Algorithm, or SITA, did not exclude normal patients no matter the FP or FN rates, but FL still had to be under 20%.) However, the proper view would be that, because the test did not fall within the limits that had been set to obtain the set of “normal” values, the statistical analyses might be faulty (the total deviation values, the Mean Deviation index, the probability of abnormality, and so on).
It was not necessarily true that the visual field examination itself was “unreliable” in the sense that it has no useful information, even if the statistical calculations might be somewhat inaccurate. Moreover, the determinations of the three reliability indices (FL, FN, FP) were themselves imperfect, such that an examination labeled “unreliable” may in fact be quite dependable, even in its calculated statistical values. On the other hand, other features may render an examination unreliable even though it is labeled “reliable” because FL, FP, and FN are within bounds.
After noting the “reliability indices” and other features, the clinician must decide whether the visual field report has only slight distortions of the information, or is so faulty that it is completely uninformative.
Causes for Caution
A lot has been learned during the years of use of standard automated perimetry, and it is time to examine how to make use of the information to the degree that it can give at least some clinical assistance. There are times to be cautious about interpreting the result; certain standard signs explored here suggest a potentially misleading examination, but other artifacts and anomalies also may distort the interpretation.
The three standard signs for caution:
1. False negative responses. In the Full Threshold algorithm, the machine presents in random locations a stimulus 9 dB more intense than a stimulus to which the patient had responded earlier in the test. The frequency with which the patient failed to respond was tabulated as the FN rate, presumed to be a sign of inattention. If this happened in more than 33% of the catch trials, it was concluded that this subjective clinical test was “unreliable,” as it was thought that a stimulus 9 dB stronger than one to which the patient responded before should certainly have been visible.
Time was a teacher, and we learned that especially in abnormal regions of the visual field, visibility actually does deteriorate as the test proceeds, and FN responses are to be expected if the visual field has abnormal areas. This kind of “fatigue” is neurophysiologic, and should not cause the clinician to conclude that the patient was not concentrating. After this phenomenon was recognized, and when SITA was devised to replace the Full Threshold algorithm, the opportunity was taken to perform the retesting (catch trials for FN responses) only in locations already known to have normal visual sensitivity.
Moreover, after the test is completed, “post-processing” of the data in SITA fields is done, looking to see whether any of the responses (or lack of responses) along the way are out of keeping with the threshold as finally determined. If so, a failure to respond to a stimulus that should have been visible would be counted as a FN response.
There is, however, in addition to the natural neurophysiologic decline in responsiveness, a tendency for a decline in concentration as the test continues for a long time. This is more the case in older people, patients who are not well rested on a particular day, or for less educated people not used to taking tests. The higher percentage of FN responses in these situations truly does result from loss of attention. The main clues that a person became tired during testing are that the visual thresholds are generally more depressed or are more abnormal in a patchy fashion, mainly near the edges, where testing is done last, after the more central points have been quantified (See Figure 1).
Figure 1. Patient with normal visual field, but false negative responses toward the end of the test cause the greyscale to show areas of depressed sensitivity near the edge (arrows), and not in a pattern diagnostic of any known disease. IMAGES COURTESY OF DOUGLAS R. ANDERSON, MD.
In addition, the patient may have been responding well to more easily seen stimuli closer to the center, but may think they are not supposed to respond to dimmer and less certain stimuli far from the center. When the pattern of apparent field loss is not consistent with glaucoma or a neurologic disease for which the test is being done, it may be concluded that the field is normal.
In the extreme, responses are crisp at the four “primary” locations and at the locations immediately surrounding them, which are tested first. The performance starts out well, but the patient gives out before the testing has moved very far from the center. The result is four overlapping circular regions of good thresholds near the center, and seemingly reduced threshold in between, giving a “cloverleaf” pattern on the greyscale (See Figure 2). More moderate progressive inattention may result in milder forms of this pattern that may at first be considered possible results of disease. However, a high FN response rate alerts the attentive clinician to the true cause for the pattern.
Figure 2. Another patient with a normal visual field shown on a subsequent test. On this day, the patient was tired and responded fairly well as the test began within the four circular areas (outlined in white circles), but with fatigue at the end of the test failed to respond to most of the stimuli. The result is a cloverleaf pattern of moderately good visual sensitivity surrounded by poor visual sensitivity.
Hence, simply the number or percentage of FN responses is not as informative as inspecting the context of the examination, whether the FN responses can be attributed to disease, or perhaps asking the patient if they felt the test was too long and they lost concentration. However, even for the latter, small patchy defects at the edge not typical of disease or a cloverleaf pattern indicate that the patient became tired before the test was done, even if they don't recognize or admit it.
2. False positive responses. The inverse, responding when there is no stimulus, is detected in the Full Threshold strategy mainly by programming the machine, from time to time, not to present a stimulus at a point in the test's nearly regular rhythm when one would be expected. The SITA strategy uses a more sophisticated method of measuring the patient's usual response time. It also counts as a false response one that falls outside a range of the patient's response times (twice the average of the last 10 responses) or one that comes too soon after the stimulus for it to be a response to the stimulus rather than the expectation of a stimulus.
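The response-time window just described can be sketched as a simple check. This is only an illustration of the idea, not the actual SITA implementation; the function name and the minimum-latency figure of 180 ms are my assumptions (the article specifies only “too soon after the stimulus”), while the upper bound of twice the average of the last 10 responses comes from the text.

```python
# Hedged sketch of the SITA-style response-time window described above.
# min_latency_ms is an illustrative assumption; the upper bound (twice
# the mean of the last 10 response times) follows the text.

def outside_response_window(latency_ms, last_ten_ms, min_latency_ms=180):
    """True if a button press falls outside the patient's usual window:
    sooner than a plausible reaction to the stimulus, or later than
    twice the mean of the last 10 accepted response times."""
    upper_ms = 2 * sum(last_ten_ms) / len(last_ten_ms)
    return latency_ms < min_latency_ms or latency_ms > upper_ms
```

For a patient whose recent responses average 400 ms, a press at 100 ms (anticipation) or at 900 ms (beyond the 800-ms window) would be counted as a false response, while one at 500 ms would be accepted.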
FP responses greatly affect interpretation of the test result. The traditional limit of 33% is too generous, and even a 5% or 10% FP rate can begin to confound the interpretation. The cause of FP responses can be failure of the patient to understand the test (perhaps the fault of the perimetrist), but seems most often to be due to a “trigger-happy” patient. In some cases, the patient may think they must respond while the light is still on, and the patient attempts the impossible task of responding during the 200-millisecond stimulus. Such patients need to be told that the light is purposely presented as a short flash, then the machine waits for a response, and that the machine times their usual responses to tell how long to wait, so the patient can take time to make deliberate responses when they are sure they saw the light.
Additionally, it may be better to call perimetry a visual field “documentation” instead of a “test.” No one wants to fail a test by not responding when there is a light, so they push the button whenever there is any hint of a visual phenomenon, or even some other sensory stimulus, like a sudden noise outside the room. Again, emphasize to patients that the goal is to push the button in a measured, deliberate fashion when they are sure a light came on.
On the printout, FP responses may be associated with a high Mean Deviation (MD) global index, impossibly high threshold values (such as 45 dB, or even above 50 dB), and areas of white in the greyscale portion of the printout. Even without such signs, the clinician should be cautious when the FP rate is 10% or more.
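The printout signs listed above can be gathered into a quick screening sketch. The function and its exact cut-offs are illustrative assumptions: the 10% FP rate is from the text, the 40 dB threshold ceiling approximates the “impossibly high” values the article cites (45 dB and above), and treating any positive Mean Deviation as suspicious is my own rough stand-in for the article's “high MD.”

```python
# Illustrative screen for FP-contaminated fields, using the printout
# signs described above. Cut-offs are approximations, not vendor logic.

def fp_caution_signs(fp_rate, thresholds_db, md_db):
    """Return the reasons for caution present on a printout."""
    signs = []
    if fp_rate >= 0.10:                      # 10% or more FP responses
        signs.append("FP rate of 10% or more")
    if any(t > 40 for t in thresholds_db):   # e.g., 45 dB or above 50 dB
        signs.append("impossibly high threshold values")
    if md_db > 0:                            # assumed proxy for "high MD"
        signs.append("unusually positive Mean Deviation")
    return signs
```

An empty list does not clear the field of FP contamination; as the text notes, caution is warranted from the FP rate alone even without the other signs.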
3. Fixation losses. The importance of maintaining good fixation during most of the test is obvious. The problem when interpreting the field test is knowing how steadily the patient held the gaze at the center. Two methods are used in the current machines, one traditional and one relatively new.
► Blind spot testing. The original method of testing whether fixation was being maintained was to present the stimulus into the physiologic blind spot as a “catch trial.” If the patient responded, it could be blamed on inaccurate gaze position that caused the image of the stimulus to reach a seeing part of the retina. The biggest problem in actual use is that the blind spot is not exactly in the same position in all eyes, or a slight tilt of the head places the stimulus near the edge of the blind spot, so that the stimulus is sometimes visible and sometimes not.
Most clinicians realize that if there are no other signs of poor performance by the patient, the machine can falsely report a high percentage of fixation losses.
► Gaze tracking. The newer machines are able to monitor the position of the eye by recording at the start of the test the relative position of the corneal reflex and the pupil margin, which is affected by gaze direction but not by head movement. It only matters whether the gaze is straight ahead at the time of the stimulus presentation, so with each stimulus presentation, any deviation of the eye from central gaze is quantified and recorded. A consequence of this is that fixation is checked several hundred times — each time there is a presentation of a stimulus — instead of the fewer than 10 catch trials of the blind-spot method.
The gaze record on the report (See Figure 3) shows a downward blip if for some reason the eye position could not be determined. This could occur perhaps because of ptosis obscuring the pupil or corneal reflex, a presentation during a blink, or an irregular pupil that prevented accurate recordings. If there is a shift in gaze, the gaze tracking line deviates upward in proportion to the degree of an eccentric position in any direction (with a maximum of 10 degrees). The clinician makes a judgment as to whether the gaze was within two or three degrees most of the time or was frequently far out of line. Sometimes the frequency of larger non-central gaze increases a bit as the test nears the end, in part perhaps because of fatigue, and in part because the locations being tested are more peripheral, a motivation for more searching eye movements. There can be artifacts; for example, the gaze may seem eccentric nearly all the time because the eye was not well positioned when the test was started and the relative position of the reflex and the pupil was poorly determined.
Figure 3. Gaze tracker diagram. This patient had quite good fixation, with deviation estimated to be only 1 to 4 degrees nearly all the time, but twice had his gaze 10 degrees or more from fixation (large arrows above) when the stimulus was presented. There were a moderate number of times when the position of gaze could not be determined, four instances of which are indicated with small arrows below. However, the quality of fixation can be judged as long as the gaze position could be quantified most of the time.
However, a relatively clean record of gaze position is reassuring, especially if the blind spot method seems to indicate poor fixation. Moreover, the gaze tracker record shows only how well the eye was positioned during stimulus presentations and not between stimulus presentations. The information in the gaze-tracking record is often overlooked and underused as a marker of the quality of the test and the patient's ability to concentrate on the point of fixation.
I hope this review of the three standard signs of reliability will help clinicians better discern what information visual field examinations have to offer, and to make better diagnostic and therapeutic decisions. OM
Dr. Anderson is professor of ophthalmology and Douglas R. Anderson Distinguished Chair in Ophthalmology in the department of ophthalmology at the University of Miami Miller School of Medicine. He has also acted as a consultant for Carl Zeiss Meditec.