MurmurHash2算法简介

介绍 MurmurHash2 算法的特点，优势，缺点，应用场景，基本实现，MurmurHash2 是一种哈希算法，高效，非加密。

特点

非加密：
- 不适用于密码学。
- 设计目标不为高碰撞性与安全性。
高性能：
- 使用位运算，混合函数实现，计算速度快。
分布均匀：
- 体现输入中各个位的影响，hash 值分布均匀。

算法思想

分块处理：
- 把输入按照每 32/64 位分为一组处理。
混合操作：
- 对每个分块进行混合计算，包括与常数相乘，移位异或，与当前结果混合。
最终混合：
- 将剩下的不足 4/8 字节的数组混合进结果，然后在对结果做混合，确保所有输入位都对结果有影响。

伪代码：

以 32 位版本的 MurmurHash2 为例，伪代码如下：

plaintext

uint m(const void *key,int len){
  uint seed = gen();
  const uint m;
  const int r;
  uint h = len ^ seed;
  
  const uchar *data = (const uchar *)key;
  while(len>=4){
    uint k = *(uint *)data;
    
    // mix m r h
    /*
    ....
    */
    
    data += 4;
    len -= 4;
  }
  // rest bytes
  /*
  swich(len)
  */
  
  // final mix m h
  /*
  ...
  */
  return h;
}

redis 中的实现：

plaintext

/* MurmurHash2, by Austin Appleby
 * Note - This code makes a few assumptions about how your machine behaves -
 * 1. We can read a 4-byte value from any address without crashing
 * 2. sizeof(int) == 4
 *
 * And it has a few limitations -
 *
 * 1. It will not work incrementally. liyazzi: 也就是key需要一次性完整知道，不适合动态数据流，因为每一次的混合计算都依赖于前面的数据，然后最后的混合依赖于整个数据
 * 2. It will not produce the same results on little-endian and big-endian 影响跨平台性
 *    machines.
 */
unsigned int dictGenHashFunction(const void *key, int len) {
    /* 'm' and 'r' are mixing constants generated offline.
     They're not really 'magic', they just happen to work well.  */
    uint32_t seed = dict_hash_function_seed;
    const uint32_t m = 0x5bd1e995;
    const int r = 24;

    /* Initialize the hash to a 'random' value */
    uint32_t h = seed ^ len;

    /* Mix 4 bytes at a time into the hash */
    const unsigned char *data = (const unsigned char *)key;

    while(len >= 4) {
        uint32_t k = *(uint32_t*)data;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    /* Handle the last few bytes of the input array  */
    switch(len) {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0]; h *= m;
    };

    /* Do a few final mixes of the hash to ensure the last few
     * bytes are well-incorporated. */
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return (unsigned int)h;
}

哈希函数中大多混淆操作选用 ^ 的原因：

异或操作能混合数据的各个位的特征，异或操作拥有自反性

消除偏置：避免简单的加法与乘法带来的累计偏置

性能高效：CPU级别运算，位运算通常速度较快

Redis中的使用

在 redis 中，MurmurHash2 被用作默认哈希函数实现，主要用于：

计算哈希表的 key 索引
计算一致性哈希中虚拟节点的位置

选择 MurmurHash2 的原因：

快速：适合 redis 高性能的需求
均匀：减少哈希冲突
简单：实现简单

缺点

MurmurHash2 在某些情况下可能会出现哈希碰撞率略高的问题。

Hash functions can be vulnerable to collision attacks, where a user can choose input data in such a way so as to intentionally cause hash collisions. Jean-Philippe Aumasson and Daniel J. Bernstein were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called HashDoS attacks.[51] With the use of differential cryptanalysis, they were able to generate inputs that would lead to a hash collision. The authors of the attack recommend using their own SipHash instead.