用 Web Speech API 和 Node.js 将语音转换成 emoji

你有没有构思过这样的项目啊 -- 用 Node.js 来检查我们发的语音内容是积极的还是消极的？

我之前收到 Grammarly 的一封讨论语气检测的推广邮件，说是可以用程序检查我们写的内容是什么，然后告诉我们这个内容给人的感觉是积极进取的，自信的还是其他什么感觉。

我就想 -- 有没有可能使用浏览器和 Node.js 来构建一个简化版本呢？

所以我就开发了这么一个小项目，这里把方法分享给大家。

步骤

在你开始构建一个项目时，你应该（至少模糊地）勾勒出你的目标以及构思如何实现。在开始搜索之前，我罗列出需要做的事情：

记录语音
将语音转换成为文字
文字内容测评
将结果反馈给说话人

搜索一番之后，我发现 Web Speech API 就可以录音以及将语音转换成文字，在 Google Chrome 就可以使用这个功能，语音识别部分正是我们需要的。

而文字内容测评呢，我发现在 AFINN 上面有测评词汇列表。虽然只有 2477 个词汇，但是对我们的项目来说也够用了。

结果反馈这一步，我们用 HTML，JavaScript 和 CSS 编写程序来处理结果，将 emoji 展示在网页上。

现在我们知道要用到些什么了，总结一下：

浏览器接收用户的语音，通过 Web Speech API 返回一些文字
向 Node.js 服务器发起处理文字的请求
依据 AFINN 的列表测评文字返回分数
浏览器根据分数显示不同的 emoji

注意： 如果你熟悉项目设置，那么你可以跳过以下“项目文件和设置”部分。

项目文件和设置

项目文件夹和文件结构如下：


  |-public // folder with the content that we will feed to the browser
    |-style // folder for our css and emojis
      |-css // optional folder, we have only one obvious file
        |-emojis.css
      |-images // folder for the emojis
    |-index.html
    |-recognition.js
  package.json
  server.js // our Node.js server

前端 index.html 包括 JS 和 CSS 代码：

<html>
  <head>
    <title>
      Speech to emotion
    </title>
	<link rel="stylesheet" href="style/css/emojis.css">
  </head>
  <body>
    
    nothing for now
    
    <script src="recognition.js"></script>
  </body>
</html>

recognition.js 在 DOMContentLoaded 事件中，确保在执行 JS 代码前加载页面。

document.addEventListener('DOMContentLoaded', speechToEmotion, false);

function speechToEmotion() {
  // Web Speech API section code will be added here
}

emojis.css 暂时为空。

在文件夹中运行 npm run init，创建 package.json。

接下来，我们需要用 npm install 安装两个包，让事情变得简单点：

expressjs - 包含 HTTP 服务器
nodemon - 这样我们在修改 server.js 的时候就不需要不断键入 node server.js

package.json:

{
  "name": "speech-to-emotion",
  "version": "1.0.0",
  "description": "We speak and it feels us :o",
  "main": "index.js",
  "scripts": {
    "server": "node server.js",
    "server-debug": "nodemon --inspect server.js"
  },
  "author": "daspinola",
  "license": "MIT",
  "dependencies": {
    "express": "^4.17.1"
  },
  "devDependencies": {
    "nodemon": "^2.0.2"
  }
}

server.js:

const express = require('express')
const path = require('path')

const port = 3000
const app = express()

app.use(express.static(path.join(__dirname, 'public')))

app.get('/', function(req, res) {
  res.sendFile(path.join(__dirname, 'index.html'))
})

app.get('/emotion', function(req, res) {
  // Valence of emotion section code will be here for not it returns nothing
  res.send({})
})

app.listen(port, function () {
  console.log(`Listening on port ${port}!`)
})

接着，在终端运行 npm run server-debug，在 localhost:3000 打开浏览器，会看到 HTML 文件中的 “nothing for now” 信息。

Web Speech API

在 Chrome 上使用这个 API，包含语音识别，即可将我们的语音消息转换成文字。它可以检测事件，例如，捕捉音频的起始点。

现在，我们需要 onresult 和 onend 事件来分别检测麦克风捕捉的内容和捕捉结束的点。

recognition.js 的代码如下：

const recognition = new webkitSpeechRecognition()
recognition.lang = 'en-US'

recognition.onresult = function(event) {
  const results = event.results;
  const transcript = results[0][0].transcript
  
  console.log('text ->', transcript)
}

recognition.onend = function() {
  console.log('disconnected')
}

recognition.start()

如此连接麦克风几秒钟以收听音频。如果没有捕捉到什么，连接就会断开。

语音识别引擎可识别这份文件里罗列的语言。

如果想要延长连接的时间（或者是一段话分几次说的情况），我们可以使用一个叫作 continuous 的属性，像给 lang 设置属性值一样把它的值设置为 true，这样就可以不间断地捕捉音频了。

const recognition = new webkitSpeechRecognition()
recognition.lang = 'en-US'
recognition.continuous = true

recognition.onresult = function(event) {
  const results = event.results;
  const transcript = results[results.length-1][0].transcript
  
  console.log('text ->', transcript)
}

recognition.onend = function() {
  console.log('disconnected')
}

recognition.start()

我们添加了 continuous 属性，只获取最后一个结果，而不是所有结果

刷新页面，首先会弹出对话框问我们是否启用麦克风，选择 yes，然后我们就可以讲话，并且在 Chrome DevTools console 查看语音信息转换成文字的结果。

注意：截至本文发布时，Chrome 和安卓系统支持使用这个 API，可能之后 Edge 也会支持。或许可以通过一些脚本或者工具解决浏览器兼容性问题，不过我自己没有尝试。你可以在 Can I use 搜索兼容性问题。

发起请求

通过简单的 fetch 方法发起请求。

  recognition.onresult = function(event) {
    const results = event.results;
    const transcript = results[results.length-1][0].transcript

    // making a request to our /emotion endpoint that we defined on the project start and setup section
    fetch(`/emotion?text=${transcript}`)
      .then((response) => response.json())
      .then((result) => {
        console.log('result ->', result) // should be undefined
      })
      .catch((e) => {
        console.error('Request error -> ', e)
      })
  }

对于较长的文字来说，将 /emotion 转换成 POST（而不是 GET）更好，虽然 GET 也可以

情绪效价

效价用于评估情绪是正面还是负面的，以及是高唤醒情绪还是低唤醒情绪。

在这个项目中，我们使用两种情绪：开心，正面情绪，分值为正；沮丧，负面情绪，分值为负。分值为零则表示中立的情绪，表达的意思是“额，什么情况”。

AFINN 列表的计分从 -5 分到 5 分，以下是列表的词汇示例：

hope 2
hopeful 2
hopefully 2
hopeless -2
hopelessness -2
hopes 2
hoping 2
horrendous -3
horrible -3
horrific -3

举个例子，“我希望这个不危险”，其中“希望”是 2 分，“危险”是 -3 分，那么这句话的总分就是 -1 分。这个列表里没有包含的词汇我们可以忽略不计分。

将文件渲染为 JSON：

{
  <word>: <score>,
  <word1>: <score1>,
  ..
}

然后我们就可以检测文字里的每个词语，计算总分。不过，Andrew Sliwinski 的 sentiment 代码部分已经有相应的处理，我们可以直接使用，不必什么都自己从头写代码。

键入 npm install sentiment，打开 server.js，然后引入库：

const Sentiment = require('sentiment');

接着改变 "/emotion" 的路径：

app.get('/emotion', function(req, res) {
  const sentiment = new Sentiment()
  const text = req.query.text // this returns our request query "text"
  const score = sentiment.analyze(text);

  res.send(score)
})

sentiment.analyze(<our_text_variable>) 会执行上面说的步骤: 根据 AFINN 列表检查文字中每个词语，最后给我们一个总分。

变量 score 的对象：

{
  score: 7,
  comparative: 2.3333333333333335,
  calculation: [ { awesome: 4 }, { good: 3 } ],
  tokens: [ 'good', 'awesome', 'film' ],
  words: [ 'awesome', 'good' ],
  positive: [ 'awesome', 'good' ],
  negative: []
}

在这个例子里我们希望 socore 属性值是一个正值

返回分值之后，我们需要让它在浏览器中显示。

注意： AFINN 是英文的，要在 Web Speech API 使用其他语言的话，可以搜索类似 AFINN 的分值列表，以匹配相应的语言。

精彩部分

最后一步，更新 index.html，以展示 emoji：

<html>
  <head>
    <title>
      Speech to emotion
    </title>
    <link rel="stylesheet" href="style/css/emojis.css">
  </head>
  <body>
    <!-- We replace the "nothing for now" -->
    <div class="emoji">
      <img class="idle">
    </div>
    <!-- And leave the rest alone -->
    <script src="recognition.js"></script>
  </body>
</html>

本项目中使用的 emoji 允许商用，免费，可以在这里获得。谢谢创作 emoji 的艺术家！

我们下载一些喜欢的 icon，把它们添加到 images 文件夹，稍后我们会使用这些 emoji：

error 错误 - 出现错误时
idle 懒洋洋 - 麦克风未启用时
listening 倾听 - 麦克风已连接等待输入时
negative 负面 - 分值为正时
neutral 中性 - 分值为零时
positive 正面 - 分值为负时
searching 搜索 - 发起服务器请求时

在 emojis.css 中添加：

.emoji img {
  width: 100px;
  width: 100px;
}

.emoji .error {
  content:url("../images/error.png");
}

.emoji .idle {
  content:url("../images/idle.png");
}

.emoji .listening {
  content:url("../images/listening.png");
}

.emoji .negative {
  content:url("../images/negative.png");
}

.emoji .neutral {
  content:url("../images/neutral.png");
}

.emoji .positive {
  content:url("../images/positive.png");
}

.emoji .searching {
  content:url("../images/searching.png");
}

第一个选择器是给 emoji 设置一个一致的尺寸，其余的是各种 emoji 图片

做完以上修改之后，我们重新加载页面，会显示懒洋洋 emoji，这是因为我们还没有根据不同场景替换元素的 idle 类。

我们还需要对 recognition.js 进行一步操作，添加一个函数以更改 emoji。

/**
 * @param {string} type - could be any of the following:
 *   error|idle|listening|negative|positive|searching
 */
function setEmoji(type) {
  const emojiElem = document.querySelector('.emoji img')
  emojiElem.classList = type
}

在收到服务器请求的反馈之后，我们开始检测分值为正负或是零，相应地调用 setEmoji 函数。

console.log(transcript) // So we know what it understood when we spoke

setEmoji('searching')

fetch(`/emotion?text=${transcript}`)
  .then((response) => response.json())
  .then((result) => {
    if (result.score > 0) {
      setEmoji('positive')
    } else if (result.score < 0) {
      setEmoji('negative')
    } else {
      setEmoji('listening')
    }
  })
  .catch((e) => {
    console.error('Request error -> ', e)
    recognition.abort()
  })

将 emoji 设置为 searching，然后发起请求

最后，添加 onerror 和 onaudiostart 事件，修改 onend 事件，这样就设置为正确的 emoji 了。

  recognition.onerror = function(event) {
    console.error('Recognition error -> ', event.error)
    setEmoji('error')
  }

  recognition.onaudiostart = function() {
    setEmoji('listening')
  }

  recognition.onend = function() {
    setEmoji('idle')
  }

最后，recognition.js 是这样：

document.addEventListener('DOMContentLoaded', speechToEmotion, false);

function speechToEmotion() {
  const recognition = new webkitSpeechRecognition()
  recognition.lang = 'en-US'
  recognition.continuous = true

  recognition.onresult = function(event) {
    const results = event.results;
    const transcript = results[results.length-1][0].transcript

    console.log(transcript)

    setEmoji('searching')

    fetch(`/emotion?text=${transcript}`)
      .then((response) => response.json())
      .then((result) => {
        if (result.score > 0) {
          setEmoji('positive')
        } else if (result.score < 0) {
          setEmoji('negative')
        } else {
          setEmoji('listening')
        }
      })
      .catch((e) => {
        console.error('Request error -> ', e)
        recognition.abort()
      })
  }

  recognition.onerror = function(event) {
    console.error('Recognition error -> ', event.error)
    setEmoji('error')
  }

  recognition.onaudiostart = function() {
    setEmoji('listening')
  }

  recognition.onend = function() {
    setEmoji('idle')
  }

  recognition.start();

  /**
   * @param {string} type - could be any of the following:
   *   error|idle|listening|negative|positive|searching
   */
  function setEmoji(type) {
    const emojiElem = document.querySelector('.emoji img')
    emojiElem.classList = type
  }
}

测试项目，显示结果：

68747470733a2f2f7777772e66726565636f646563616d702e6f72672f6e6577732f636f6e74656e742f696d616765732f323031392f31312f73656e74696d656e742d746f2d656d6f74696f6e2e676966

注意：我们可以在 html 中添加一个元素，替换 console.log.，来检测识别出的内容，这样可以持续检测。

思考

其实我们可以对这个项目的某些部分可以做很多优化：

检测讽刺的内容
由于语音转文字 API 的敏感词限制，目前没办法检测出愤怒的情绪
也许可以省掉语音转文字这一步，直接根据语音显示出相应的 emoji

我在做这个项目的过程中发现，已经有一些类似的应用了，比如电话销售人员根据客户的语调和情绪判断是否可以成单。还有我在文章开头说的那封邮件，Grammarly 他们用这个方法来检测用户发邮件中用词的语调！这些应用还挺有意思的。

希望这篇文章可以给你一些启发。如果你用我介绍的方法做了什么项目，请告诉我啊。

你可以在我的 GitHub 仓库查看代码。

原文链接：How to Build a Speech to Emotion Converter with the Web Speech API and Node.js，作者：Diogo Spínola