2023-12-04发表2023-12-04更新技术6 分钟读完 (大约891个字)

七牛云文件hash计算浏览器环境实现

最近要实现一个文件唯一性判断的功能，在网上找了一大圈，基本上都是使用一些摘要算法，根据文件内容生成一串字符串。

想起来之前用过七牛云，当时也没太在意他的计算方式，于是就去研究一下他是怎么实现的，没想到人家已经把计算方式开源，遂去 Github 上看了一下，虽然说提供了 JS 版本的实现，但是只有 NodeJS 版本，没有浏览器版本，于是乎参照着 NodeJS 版本，实现了一个浏览器环境的版本.

以下是 readme, 我直接复制过来了，具体地址：https://github.com/qiniu/qetag/tree/master

qetag 是一个计算文件在七牛云存储上的 hash 值（也是文件下载时的 etag 值）的实用程序。
七牛的 hash/etag 算法是公开的。算法大体如下：
如果你能够确认文件 <= 4M，那么 hash = UrlsafeBase64([0x16, sha1(FileContent)])。也就是，文件的内容的 sha1 值（20 个字节），前面加一个 byte（值为 0x16），构成 21 字节的二进制数据，然后对这 21 字节的数据做 urlsafe 的 base64 编码。
如果文件 > 4M，则 hash = UrlsafeBase64([0x96, sha1([sha1(Block1), sha1(Block2), …])])，其中 Block 是把文件内容切分为 4M 为单位的一个个块，也就是 BlockI = FileContent[I*4M:(I+1)*4M]。
为何需要公开 hash/etag 算法？这个和 “消重” 问题有关，详细见：
https://developer.qiniu.com/kodo/kb/1365/how-to-avoid-the-users-to-upload-files-with-the-same-key > http://segmentfault.com/q/1010000000315810
为何在 sha1 值前面加一个 byte 的标记位(0x16 或 0x96）？
0x16 = 22，而 2^22 = 4M。所以前面的 0x16 其实是文件按 4M 分块的意思。
0x96 = 0x80 | 0x16。其中的 0x80 表示这个文件是大文件（有多个分块），hash 值也经过了 2 重的 sha1 计算。

有了前辈的实现过程，再配合 ChatGPT，写起来就比较顺利了。其实这里还有一个点，就是对于大文件的分片，尤其是前端上传大文件的时候，基本上都是分片之后，一段一段的上传。

const cryptoName = "SHA-1";
const bit = 22;
const baseSize = 1 << bit; // 4MB 每个块大小

// 计算sha1,
function sha1(data, hex) {
  return new Promise((reslove, reject) => {
    crypto.subtle
      .digest(cryptoName, data)
      .then((buf) => reslove(hex ? arrayBufferToHex(buf) : buf))
      .catch(reject);
  });
}

function countBlock(fileSize) {
  return (fileSize + (baseSize - 1)) >> bit;
}

// 计算文件hash
async function fileHash(file) {
  const fileSize = file.size;
  const chunkCount = countBlock(fileSize);

  let allBuff;
  let prefix;
  // 大于 4MB 的文件
  if (chunkCount > 1) {
    // const sha1ListBuff
    prefix = new Uint8Array([0x96]).buffer;
    let start = 0;
    const sah1BuffList = [];
    while (start < fileSize) {
      const bf = file.slice(start, start + baseSize);
      const currSha1Buff = await sha1(await bf.arrayBuffer());
      start += baseSize;
      sah1BuffList.push(currSha1Buff);
    }
    allBuff = await sha1(mergeToU8Arr(...sah1BuffList));
  } else {
    prefix = new Uint8Array([0x16]).buffer;
    allBuff = await sha1(await file.arrayBuffer());
  }

  allBuff = mergeToU8Arr(prefix, allBuff);
  // base64
  console.log(base64URLEncode(allBuff));
  // hex
  console.log(arrayBufferToHex(allBuff));
}

function mergeToU8Arr(...buf) {
  const l = buf.length;
  if (l === 1) {
    return buf;
  }
  let totalLen = 0;
  for (let i = 0; i < l; i++) {
    totalLen += buf[i].byteLength;
  }
  const tmpBuf = new ArrayBuffer(totalLen);
  const u8Buf = new Uint8Array(tmpBuf);
  let offset = 0;
  for (const bf of buf) {
    const curBuf = new Uint8Array(bf);
    u8Buf.set(curBuf, offset);
    offset += curBuf.length;
  }
  return u8Buf;
}

function base64URLEncode(input) {
  // 使用 btoa 进行 Base64 编码
  let base64Encoded = btoa(
    String.fromCharCode.apply(null, new Uint8Array(input))
  );
  // 替换标准 Base64 字符集中的一些字符，以获得 URL 安全的编码
  return base64Encoded
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}
// 将 ArrayBuffer 转换为十六进制字符串
function arrayBufferToHex(buffer) {
  const hashArray = Array.from(new Uint8Array(buffer));
  return hashArray.map((byte) => byte.toString(16).padStart(2, "0")).join("");
}

七牛云文件hash计算浏览器环境实现

https://www.md7.top/2023/12/04/七牛云文件hash计算浏览器环境实现/

作者

Fat Dong

发布于

2023-12-04

更新于

2023-12-04

许可协议

#ArrayBuffer sha-