2.1: Voice: Not yet loud and clear

Some elements of voice surveillance technology have improved dramatically. But a number of the more mundane, practical challenges around data and regulation remain.

Implementing effective voice surveillance is often seen as the most intractable of the challenges facing surveillance professionals. First, there is the fundamental issue of identification and ensuring that everyone who should be monitored is monitored. This issue is a key problem for the banks, for whom simply ensuring compliance with regulatory demands on coverage was challenging, even before the proliferation of channels, from mobile to SMS messaging, to encrypted apps and unapproved communications tools.

The technical problems are indeed daunting: the sheer volume of raw voice data; the difficulty of capturing and aggregating diverse voice channels across traditional turrets, fixed lines, VOIP, mobile and apps; and the challenge of eliminating background noise without compromising accuracy, and of storing the audio so that it can be easily retrieved.

Some vendors claim that these issues are in the past. Most banks disagree. “The issue is very far from being solved,” says one compliance MD in the US. “There is a tremendous amount of work still to do. Just one example: each surveillance system sits on top of a different recording system and all of those discontinuities cause significant problems.” Predictably then, according to the 1LoD Surveillance Survey, voice surveillance lags email, chatroom and trade surveillance in terms of the status of implementation and levels of automation.

Lost in translation
Then there are the issues related to speech recognition: ensuring the highest possible speech-to-text accuracy, and interpreting intent and context. This means finding solutions to challenges such as speakers switching or mixing languages and using trading jargon and coded speech, as well as delivering more sophisticated capabilities such as voice biometrics and intent detection. According to the 1LoD survey, voice surveillance solutions at banks currently cover on average seven languages.

Here again there is disagreement over progress. Some take the view that we are still a long way from solutions that can deliver anything beyond the basic functionality demanded by regulation. One banker says: “I think the technologies are still problematic, particularly around generating valid alerts against the backdrop of multiple languages and dialects.” However, others are more positive, with one saying: “I think voice might have come along the most recently. We’ve seen cases of intelligent speech recognition capabilities being used to extract and accurately identify meaningful data points from their recorded conversations.”

This latter view is certainly common among technologists in voice, who point out that traditional large vocabulary continuous speech recognition systems are getting better all the time, while a new generation of neural network (NN)-based models promises both greater transcription accuracy and much richer metadata, covering implied emotions and therefore, potentially, intent.

“Until recently, the implementations and algorithms needed for NN were not affordable for on-premises implementations and therefore commercial solutions mostly used Hidden Markov Model (HMM) [traditional statistical models],” says one solution provider. “Now, it is increasingly possible not just to look at text, timestamps, confidence levels, noise event detection, emotion detection events and so on, but also to determine if a communication has been originated from an Internet Protocol (IP) device or a public switched telephone network (PSTN) device, or even characterize the background environment of the call (outdoor, office, car and so on). Anomaly detection can now be applied to all these features.”
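As a rough illustration of what applying anomaly detection to such call-level features might look like, the sketch below flags calls whose features sit far from the rest of the sample. It is a minimal z-score check under stated assumptions, not any vendor's method: the feature names (asr_confidence, noise_events, emotion_events), the sample values and the threshold are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical per-call metadata; names and values are illustrative only.
CALLS = [
    {"id": "c1", "asr_confidence": 0.91, "noise_events": 2, "emotion_events": 1},
    {"id": "c2", "asr_confidence": 0.88, "noise_events": 1, "emotion_events": 0},
    {"id": "c3", "asr_confidence": 0.93, "noise_events": 3, "emotion_events": 1},
    {"id": "c4", "asr_confidence": 0.90, "noise_events": 2, "emotion_events": 2},
    {"id": "c5", "asr_confidence": 0.89, "noise_events": 2, "emotion_events": 1},
    {"id": "c6", "asr_confidence": 0.92, "noise_events": 1, "emotion_events": 0},
    {"id": "c7", "asr_confidence": 0.90, "noise_events": 3, "emotion_events": 1},
    {"id": "c8", "asr_confidence": 0.55, "noise_events": 15, "emotion_events": 1},
]

FEATURES = ("asr_confidence", "noise_events", "emotion_events")


def zscores(calls, feature):
    """Return the z-score of `feature` for each call, in input order."""
    values = [call[feature] for call in calls]
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        # A constant feature carries no signal for this check.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def flag_anomalies(calls, threshold=2.0):
    """Return ids of calls where any feature deviates by more than `threshold` standard deviations."""
    per_feature = {f: zscores(calls, f) for f in FEATURES}
    flagged = []
    for i, call in enumerate(calls):
        worst = max(abs(per_feature[f][i]) for f in FEATURES)
        if worst > threshold:
            flagged.append(call["id"])
    return flagged


print(flag_anomalies(CALLS))
```

In this toy sample only the call with unusually low transcription confidence and many noise events is flagged. A production system would presumably use far more features and a more robust model than a simple z-score.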

The disagreement over perceived progress may not simply reflect the inevitable difference in viewpoint between buyers and sellers of a new technology, but also a widening gap between the leading banks and the rest of the pack. Citi has 32 dedicated voice engineers who form part of the global Citi Technology Infrastructure Network and Voice Services Domain, and the regulators are increasingly looking at the leaders’ best practices and asking for them to be adopted by the rest of the market. This poses a problem for smaller institutions and for those who still believe that voice surveillance is too complex for current technology to get right.

Regulators’ expectations rising
That can’t-do attitude may not cut it for much longer. The regulators have previously taken a pragmatic view of voice surveillance and of what can be achieved through manual sampling and basic phonetic technology, and have applied lower levels of enforcement as a result. But regulations from Dodd-Frank through to MAR and MiFID II do not place any less importance on voice communication than on text, and regulators are increasingly expressing the view that voice is not being appropriately monitored relative to the risk it poses. In their view, it is time that voice was analysed with the same level of rigour as text, and not simply reviewed as post-incident evidence for trade reconstruction. They want to see more continuous monitoring of voice and the combining of voice data with other trade surveillance inputs to build a more integrated surveillance framework.

These expectations are partly being driven by regulators’ own understanding of improvements in technology. FINRA, for example, now talks enthusiastically about its use of AI and dynamic time warping for voice recognition. As they use new technologies, the regulators are signalling that enforcement will take what is now possible into account when assessing institutions. Just as happened with text, improvements in technology will raise regulators’ expectations of what surveillance functions need to achieve.
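For readers unfamiliar with dynamic time warping, the sketch below shows the textbook form of the technique: it measures how similar two sequences are while allowing one to be stretched or compressed in time, which is why it suits speech spoken at different speeds. This is a generic illustration only and says nothing about how FINRA applies it; the function name and the toy one-dimensional sequences are assumptions, and real systems would compare frames of acoustic features rather than single numbers.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, advance in b only, advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same "shape" spoken more slowly aligns perfectly; a different shape does not.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0], [1, 1]))
```

The first comparison yields a distance of zero because the repeated value is absorbed by the warping, while the second yields a positive distance because no alignment can make the sequences match.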

Will BigTech leap ahead?
One area the banks must hope that the regulators do not look to for inspiration is BigTech. AI and NN solutions need to be trained, and in language recognition this requires vast datasets, far larger than any single financial institution has access to. This suggests that the likes of Facebook, Apple, Google and Microsoft have a significant head start. As one surveillance head says: “Google, Microsoft and Amazon – they’re the people that influence the technology that we’re using in our world now because they’re the only people who’ve got enough data.”

These firms do provide APIs into some of their voice services, but despite their huge resources, even they do not have a workable solution to the challenges with voice yet, not least because their databases do not contain any trading floor conversations. The privacy and other regulatory issues with banks providing those conversations to the BigTech giants, even if they wanted to, are unlikely to be resolved any time soon.

This leaves voice surveillance in financial services reliant upon solution developers that have built models trained on the noisy and domain-specific voice data found in banks, with additional model customisation carried out for each new deployment. It also means that banks must ensure that regulatory expectations are not influenced by the advancements of Alexa and Siri.

1LoD are hosting a Voice Surveillance Digital Debate, EMEA Edition, 09.00-10.30 BST, 28 July. The debate will be opened by Tom Hardin, “Tipper X”, the lead FBI informant at the heart of the Galleon Group insider trading scandal.

Based on an international benchmarking survey collecting the views of industry leading experts from 15 of the largest financial institutions globally, the 2020 Surveillance Benchmark Report provides a unique insight into the maturity and development of surveillance functions over the last 12 months, as well as predictions for the future. Including in-depth commentary from regulators, practitioners, consultants and technology experts, it is the only report for professionals in the industry.
