Call for a Standard Fuzzy Test Set

Current industry-standard fuzzy testing methods do not provide you with the level of assurance they claim to.

To understand this statement, it is necessary to look at how fuzzy tests are created and how the results are analysed, interpreted and presented.

What is fuzzy testing?

Fuzzy testing is designed to verify the filter’s ability to deal with target names that have been manipulated so that they are no longer exact reproductions of the name. There is no regulatory guidance on the kinds of manipulations one should expect to match against, but the industry has converged on a typical set of manipulations that try to replicate what a human may consider to be a close match. These may include some of the following types of manipulations:

- deletion, insertion or substitution of single characters;
- transposition of adjacent characters;
- phonetic or transliteration variations;
- word-level changes such as reordering, truncation or concatenation.

In order to assess the capability of a filter in these areas it is necessary to select valid sanctioned target names, perform these manipulations and present them to the filter to measure the response. This process can be performed either manually or automatically.

The disadvantage of manual selection is that every time a test is performed one must first verify that the name is still a valid and appropriate target name on the list. If not, then all tests based on this name must be refreshed with a new selection.

For this reason, it is more common to adopt an automatic solution to the creation of fuzzy test names, but this also carries a number of potential risks and challenges if care is not taken. In particular, automated generators tend to select names and manipulation points at random, which can lead to unsuitable choices when verifying functionality.
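To make this concrete, here is a minimal sketch of the naive approach in Python (the name used is purely hypothetical): the deletion point is chosen uniformly at random, with no check on whether the result is a sensible test case.

```python
import random

def delete_random_character(name: str) -> str:
    """Naive fuzzy-test generation: delete one character chosen
    uniformly at random, with no qualification of the result."""
    position = random.randrange(len(name))
    return name[:position] + name[position + 1:]

# Hypothetical target name, for illustration only.
print(delete_random_character("JOHN LI ALEXANDER SMITHSON"))
```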

For example, one of the most fundamental manipulations one may consider is the deletion of a single character. Consider these three examples of this manipulation:

The first example is the deletion of a single character from a ten-letter name part within a name of four words and 25 characters in total. This kind of manipulation would fall within the risk appetite of even the most risk-tolerant financial institution.

The second manipulation is not as clear-cut. Reducing the second word to a single character, leaving a fairly common short first name as the only thing to latch onto, may be considered an acceptable or even desirable scenario to miss.

The third manipulation is potentially a scenario that one would like to verify, but should it be classified as a character deletion? Deleting the space between two words merges them into one, so this case should ideally be reinterpreted as a two-word concatenation.
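A more careful generator qualifies each candidate deletion before accepting it. The sketch below extends the naive function above; the guard rules and labels are illustrative assumptions, not a standard:

```python
def classify_deletion(name: str, position: int) -> str:
    """Qualify a single-character deletion at `position`.
    The rules here are illustrative assumptions."""
    if name[position] == " ":
        # Removing a space merges two words, so this is really
        # a two-word concatenation, not a character deletion.
        return "concatenation"
    mutated = name[:position] + name[position + 1:]
    if any(len(word) == 1 for word in mutated.split()):
        # The deletion reduces a word to a single character;
        # arguably an acceptable scenario to miss.
        return "review: may be outside risk appetite"
    return "character deletion"

name = "JOHN LI ALEXANDER SMITHSON"  # hypothetical name
for position in (4, 5, 9):
    print(position, classify_deletion(name, position))
```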

Unfortunately, the above examples are not extreme cases but rather the norm when it comes to fuzzy test sets. And it gets worse: third-party providers of assurance tests very rarely account for these issues, and in the majority of cases do not even share the detail of their tests.

Often results are presented as a score in a certain category (e.g. 89.4% resilient to character removal) or – even worse – in some cases, one overall percentage to indicate the entire fuzzy performance of the filter!

These scores are typically compared with some benchmark figure to determine whether the level of performance is acceptable. Leaving aside the challenges with the collation of peer datasets (when and how were they collected? were the tests the same? what were the sizes, geographic locations and industries of the institutions?), what do the peer data really mean?

For example, what does a score of 70% for a given manipulation represent? Does it mean that all the institutions involved were happy to miss 30% of the cases? Or does it mean that 7 out of 10 institutions hit everything and the other 3 hit nothing? Somewhere in-between?
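A toy calculation (with hypothetical figures) shows how very different realities can hide behind the same aggregate:

```python
# Two hypothetical peer groups, both averaging a 70% hit rate.
uniform = [70] * 10                # every institution hits 70% of cases
polarised = [100] * 7 + [0] * 3    # 7 hit everything, 3 hit nothing

for group in (uniform, polarised):
    print(sum(group) / len(group))  # both print 70.0
```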

Achieving 90% against a 70% benchmark could be a disaster if the 10% the filter is failing to alert on are very important scenarios you expect to hit. So what does the percentage actually mean? Well, nothing, unless the tests are properly qualified. This means looking at each test case individually and deciding whether it is within risk appetite or not. The results can then be reviewed in the usual way by classifying them according to expectation:

- expected to hit and hit: the filter is working as intended;
- expected to hit but missed: a potential risk to be addressed;
- expected to miss but hit: a potential inefficiency;
- expected to miss and missed: the filter is working as intended.
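In code, such a review might look like the sketch below (the schema and labels are assumptions for illustration, not an established standard):

```python
from dataclasses import dataclass

@dataclass
class QualifiedTest:
    """One fuzzy test case, qualified against risk appetite."""
    manipulated_name: str
    expected_hit: bool  # within risk appetite: the filter should alert
    actual_hit: bool    # what the filter actually did

def classify(test: QualifiedTest) -> str:
    if test.expected_hit and not test.actual_hit:
        return "risk"          # missed a case within risk appetite
    if not test.expected_hit and test.actual_hit:
        return "inefficiency"  # alerted outside risk appetite
    return "pass"              # outcome matched expectation

tests = [
    QualifiedTest("JOHN LI ALEXANDER SMITHSN", True, True),
    QualifiedTest("JOHN I ALEXANDER SMITHSON", False, True),
]
for t in tests:
    print(classify(t), t.manipulated_name)
```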

The outcome of this is a greater understanding of the strengths and weaknesses of the filter. It can pinpoint the areas where there is potential risk which can be worked on and/or mitigated. It can also point to the areas of the system where there are potential efficiencies to be made. Finally, it can help one define in much greater detail what the agreed risk appetite is, which is extremely useful when it comes to demonstrating appropriate controls to management or even regulators.

Where does this leave the peer comparisons? We are still faced with the problem of not knowing how to interpret the peer group results. What would help with this is to be sure that we are on a level playing field in that the peer institutions were tested in a consistent manner. Today this is far from reality.

However, we all care about the same kinds of things, albeit with variations according to risk appetite. It should therefore be possible to define a set of standard fuzzy tests that are used universally, irrespective of who is performing the testing.

Following previous precedents, such as the Wolfsberg-endorsed ISO20022 Screening Guidelines, it should be possible to create a set of specific fuzzy test cases that verify various common scenarios. There are essentially three types of test case to be considered:

- cases that any filter should hit, irrespective of risk appetite;
- cases that no filter would reasonably be expected to hit;
- cases whose expected outcome depends on the institution’s risk appetite.
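As a sketch of what entries in such a standard suite might look like (every field name and value below is an assumption, not an existing specification):

```python
# Hypothetical entries in a shared, standard fuzzy test suite.
STANDARD_TESTS = [
    {
        "test_id": "DEL-001",
        "manipulation": "character_deletion",
        "base_name": "JOHN LI ALEXANDER SMITHSON",   # hypothetical name
        "manipulated_name": "JOHN LI ALEXANDER SMITHSN",
        "category": "must_hit",                      # type 1
    },
    {
        "test_id": "DEL-002",
        "manipulation": "character_deletion",
        "base_name": "JOHN LI ALEXANDER SMITHSON",
        "manipulated_name": "JOHN I ALEXANDER SMITHSON",
        "category": "risk_appetite_dependent",       # type 3
    },
]
```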

Such a universal test suite would be extremely beneficial. Consensus about the most useful tests to perform, and about which of the three categories above each falls into, would be an extremely powerful tool: institutions would be left to concern themselves only with the risk-appetite-dependent part of the problem. More time and focus on this aspect would help to significantly reduce both risk and cost.

Deep Lake specialises in advanced analytical techniques and expert business knowledge to provide deeper insight into screening environments. Contact us to find out more about our products and services.