Spoken Language Systems Lab (L2F)

Transcript

Spoken Language Systems Lab (L2F) Isabel Trancoso
Research Unit: Interactive Intelligent Systems
2 L2F
•  Spoken/multimodal dialog systems
•  E-Health
•  E-Learning
•  Rich transcription of multimedia documents
•  Speech-to-speech machine translation
3 Multimodal dialog systems
Entrainment in Bus Information Systems
5 Rich transcription
Recognizer output (Portuguese, unpunctuated):
boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor só para já adequadas às necessidades financeiras de portugal o ministro das finanças mostra-se confiante com as metas traçadas no programa de estabilidade e crescimento apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental em dois mil e doze é desta forma que teixeira dos santos responde a pressão dos países da moeda única querem que portugal e espanha avança com mais medidas de austeridade dentro de ano e meio ainda em mês passou diz que o governo decidiu apertar o cinto aos portugueses e já europa vem pedir mais para depois de dois mil e onze o ministro das finanças não fecha a porta, mas defende cada ano a seu tempo acho que estamos de em condições de alimentar digamos confessa estar confiantes de que o objectivo para dois mil e dez vai ser conseguido com as medidas adicionais que foram entretanto já decididas
Rich transcription (Portuguese, with punctuation, capitalization and speaker tags):
[anchor 150] Boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor. Só para já adequadas às necessidades financeiras de Portugal. O ministro das Finanças mostra-se confiante com as metas traçadas no programa de Estabilidade e Crescimento. Apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental, em dois mil e doze. É desta forma que Teixeira dos Santos responde a pressão dos países da moeda única, querem que Portugal e Espanha avança com mais medidas de austeridade, dentro de ano e meio.
[spk 2000] Ainda em mês passou diz que o Governo decidiu apertar o cinto aos portugueses e já Europa vem pedir mais para depois de dois mil e onze. O ministro das Finanças não fecha a porta, mas defende cada ano, a seu tempo.
[spk 1000] Acho que estamos de em condições de alimentar, digamos confessa estar confiantes, de que o objectivo para dois mil e dez, vai ser conseguido com as medidas adicionais que foram entretanto já decididas.
Topics: Politics; Economy; National
Language: Portuguese (European)
•  Online captioning at RTP since March 2008
•  WER = 12% for displayed subtitles
•  Latency: 3.5 s + 3 s
•  Meeting browser, lecture browser, courtroom transcriptions
•  Other languages: English, Spanish
•  Other varieties: Brazilian and African Portuguese
6 Rich transcription
European projects
User environment (screenshot tabs: All Feeds, Topic, User, Collection)
8 Speech-to-speech machine translation
Fig. How to Use Multili
Cooperation with Carnegie Mellon University
10 REAP.PT
Cooperation with Carnegie Mellon Univ.
11 Serious games
[Figure: panels A to E]
13 E-Health
AVOZ: Elderly Speech Recognition
IC4U: Decision support system for preventing Intensive Care Unit readmissions
14 VITHEA
Virtual Therapist for Aphasia Treatment
15 Other projects
•  Voice coaching for reduced stress
•  Enhancing the European Linguistic Infrastructure
•  MISNIS - Intelligent Mining of Public Social Networks’ Influence in Society (NEW)
•  Music Information retrieval
–  FADO identification: 95.8%
17 Activities related to COST 1206
•  Master Thesis (Joana Correia)
–  Anti-spoofing: speaker verification vs. voice conversion
–  Joint work with Alberto Abad & Gopala Anumanchipalli
–  Ack: Haizhou Li & ZhiZheng Wu, Infocomm Research, Singapore
•  PhD thesis (José Portêlo)
–  Privacy preserving speech processing
–  Co-supervision: Bhiksha Raj
•  Suspect – Secure Speech Technologies
–  Funded by National Science Foundation (FCT)
–  2012-2014
•  JOINT Carnegie Mellon/INESC-ID Activities in privacy preserving speech processing
18 Motivation
19 Your Voice Recordings are Forever!
•  Can you imagine the following happening 20 years from now?
–  Finding recordings of yourself saying things you never spoke?
–  Your (authentic) voice saying incriminating things you never really said
–  Your voiceprints being used to impersonate you
–  Or even questions you posed to remote systems returning to embarrass you decades later
•  All of this is possible
–  Each time you use a voice-based service, the service stores your voice recordings
–  There is no time limit on when your recordings can be abused
•  Tomorrow, or 20 years from now..
20 Privacy-preserving voice processing
•  The system never sees a clear-text version of your voice
–  All prior risks eliminated
•  While still performing voice-processing tasks
–  Mining
–  Recognition
–  Authentication..
•  How?
•  Work so far: Privacy-preserving speaker authentication
21 Assumptions
•  Speaker possesses a smartphone or computation-capable device
•  Communication channel between system and user is secure
–  Eavesdroppers not a concern
–  Goal is to protect the user from the system
22 Privacy Preserving Speaker Authentication
•  To protect the user we require the following:
–  The system should not access the user’s audio, or features derived from it.
–  The system should not possess a model of the user’s speech.
•  These almost-paradoxical sounding requirements may be assumed for other forms of secure biometrics as well.
23 Privacy Preserving Speaker Authentication
Proposed Solutions:
•  Secure Multiparty Computation (SMC)
–  Homomorphic encryption based protocols
–  Garbled circuits
•  Locality Sensitive Hashing (LSH)
•  Secure Binary Embeddings (SBE)
Bold items = on-going work
24 SMC with homomorphic encryption
•  Employ conventional speaker authentication algorithms
–  Bayesian classifier with Gaussian mixture distributions for speaker and imposters
–  Classifier trained from enrollment recordings
•  “Secure” algorithm through SMC protocol
–  User and system repeatedly exchange partial results through elaborate protocols
–  Partial results are obscured from one another
   •  Via partially homomorphic encryption, additive masking, oblivious transfer, etc.
•  Problem: Highly inefficient
–  10,000x slower than clear-text operation
25 Locality Sensitive Hashing
•  Convert authentication to a nearest neighbor search
–  Compare test recordings to previously stored enrollment recordings
•  Perform nearest-neighbor search using LSH
–  All data obscured by user through a combination of LSH and symmetric-key encryption prior to sending them to the system
•  Benefits:
–  Very efficient
   •  Less than 10x slowdown
–  Computationally inexpensive
•  Problem: Inaccurate
–  Nearest neighbor solutions not accurate enough for robust authentication
26 Secure Binary Embeddings
•  Scheme for converting vectors to bit sequences (or hashes) using band-quantized random projections
•  Produces an LSH-like method with interesting properties:
–  If $d_E(x, x') \le f(\Delta)$, then $d_H(q(x), q(x')) \propto d_E(x, x')$
–  If $d_E(x, x') > f(\Delta)$, then $q(x)$ and $q(x')$ provide no information regarding $d_E(x, x')$
•  Based on the concept of Universal Quantization
27 Secure Binary Embeddings
•  SBE behavior (L: vector dimension, M: number of bits)
•  SBE are uninformative about vectors that are far apart
•  But can compute distance between close vectors
–  The Hamming distance between SBEs of vectors approximates Euclidean distance between vectors
28 Authentication using SBE
•  Convert features derived from audio recordings to SBEs
–  SBEs are uninformative
•  User only transmits SBEs to system
–  Parameters A and w used to compute SBEs are user’s private keys
•  Binary classifier trained from enrollment recordings
–  SVM classifier
–  Replace the conventional RBF kernel with modified kernel
   •  $k(x, x') = e^{-\gamma \cdot d_H^2(q(x), q(x'))}$
–  Employs Hamming distance between SBEs
•  Authentication phase: system works on SBEs from test data
29 Experiments using SBE
•  Small corpus (YOHO, 138 speakers)
•  Features: Gaussian mean “supervectors” based on MFCCs (39 coeffs)
–  A supervector is a concatenation of means from a GMM
–  SBEs are computed from supervectors (on user’s client device)
30 Speaker authentication with SBE
•  Insignificant degradation w.r.t. conventional (“public”) authentication
•  But user’s privacy is retained
–  System can only engage with user using SBEs generated with user’s own keys
–  Security == security of storage of user’s keys (A, w)
31 Continuing the Work: Garbled Circuits
•  SBEs are efficient, but do not generalize
–  All classifier training data (positive and negative enrollment data) provided by user
–  Not appropriate for other speech processing tasks
   •  E.g. keyword spotting or recognition, where the system trains models
•  Garbled circuits
–  Enable computation of conventional models privately
–  Cast all computation as Boolean circuits, “privatize” circuit through “garbling”
–  Challenges: Efficient design of circuit
–  Current work: GCs for authentication
32 Conclusions and current work
•  With increasing use of voice services comes the need for protecting user privacy
–  Protecting user’s voice data from abuse
•  Can be achieved through privacy-preserving voice processing
–  For a marginal reduction in performance
–  The reduced performance is a small price to pay for keeping a user’s identity secret.
•  Continuing work: addressing challenges
–  Design of appropriate mechanisms for different tasks
–  Efficiency, efficiency, efficiency
   •  Most tasks feasible, but computationally challenged
   •  Some tasks such as full-scale recognition may remain impossible
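The SBE steps on slides 26 to 28 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides name the private keys A and w and the step size ∆ but do not spell out the quantizer, so the form q(x) = floor((Ax + w)/∆) mod 2, the Gaussian entries of A, the uniform dither w, and all function names are assumptions. Only the kernel follows the formula given on slide 28.

```python
import math
import random


def make_keys(L, M, delta, seed):
    """Draw hypothetical private keys: an M x L Gaussian matrix A and a
    dither vector w with entries drawn uniformly between 0 and delta."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(L)] for _ in range(M)]
    w = [rng.uniform(0.0, delta) for _ in range(M)]
    return A, w


def sbe(x, A, w, delta):
    """Map a real vector x to M bits with a band quantizer (assumed form):
    project, add dither, scale by 1/delta, keep the parity of the band index."""
    bits = []
    for row, w_i in zip(A, w):
        projection = sum(a * x_j for a, x_j in zip(row, x)) + w_i
        bits.append(math.floor(projection / delta) % 2)
    return bits


def hamming(qa, qb):
    """Normalized Hamming distance between two equal-length bit lists."""
    return sum(u != v for u, v in zip(qa, qb)) / len(qa)


def sbe_kernel(qa, qb, gamma):
    """Modified kernel from slide 28: exp(-gamma * d_H^2), computed on SBEs."""
    return math.exp(-gamma * hamming(qa, qb) ** 2)


# Usage: the client computes SBEs locally; only the bits leave the device.
rng = random.Random(0)
A, w = make_keys(L=16, M=2000, delta=1.0, seed=42)
x = [rng.gauss(0.0, 1.0) for _ in range(16)]
x_near = [v + 0.01 * rng.gauss(0.0, 1.0) for v in x]   # small Euclidean distance
x_far = [v + 5.0 * rng.gauss(0.0, 1.0) for v in x]     # large Euclidean distance
print("near:", hamming(sbe(x, A, w, 1.0), sbe(x_near, A, w, 1.0)))
print("far: ", hamming(sbe(x, A, w, 1.0), sbe(x_far, A, w, 1.0)))
```

For nearby vectors the normalized Hamming distance should stay small and grow with the Euclidean distance, while for distant vectors it should settle near one half, which is the "uninformative" regime described on slide 27. A smaller ∆ shrinks the range over which distances remain recoverable.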