介绍 MurmurHash2 算法的特点,优势,缺点,应用场景,基本实现,MurmurHash2 是一种哈希算法,高效,非加密。

特点

  1. 非加密
    • 不适用于密码学。
    • 设计目标不为 高碰撞性与安全性。
  2. 高性能:
    • 使用位运算,混合函数实现,计算速度快。
  3. 分布均匀:
    • 体现输入中各个位的影响,hash 值分布均匀。

算法思想

  1. 分块处理:
    • 把输入按照每 32/64 位分为一组处理。
  2. 混合操作:
    • 对每个分块进行混合计算,包括与常数相乘,移位异或,与当前结果混合。
  3. 最终混合:
    • 将剩下的不足 4/8 字节的数组混合进结果,然后在对结果做混合,确保所有输入位都对结果有影响。

伪代码:

以 32 位版本的 MurmurHash2 为例,伪代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
uint m(const void *key,int len){
uint seed = gen();
const uint m;
const int r;
uint h = len ^ seed;

const uchar *data = (const uchar *)key;
while(len>=4){
uint k = *(uint *)data;

// mix m r h
/*
....
*/

data += 4;
len -= 4;
}
// rest bytes
/*
swich(len)
*/

// final mix m h
/*
...
*/
return h;
}

redis 中的实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
/* MurmurHash2, by Austin Appleby
* Note - This code makes a few assumptions about how your machine behaves -
* 1. We can read a 4-byte value from any address without crashing
* 2. sizeof(int) == 4
*
* And it has a few limitations -
*
* 1. It will not work incrementally. liyazzi: 也就是key需要一次性完整知道,不适合动态数据流,因为每一次的混合计算都依赖于前面的数据,然后最后的混合依赖于整个数据
* 2. It will not produce the same results on little-endian and big-endian 影响跨平台性
* machines.
*/
unsigned int dictGenHashFunction(const void *key, int len) {
/* 'm' and 'r' are mixing constants generated offline.
They're not really 'magic', they just happen to work well. */
uint32_t seed = dict_hash_function_seed;
const uint32_t m = 0x5bd1e995;
const int r = 24;

/* Initialize the hash to a 'random' value */
uint32_t h = seed ^ len;

/* Mix 4 bytes at a time into the hash */
const unsigned char *data = (const unsigned char *)key;

while(len >= 4) {
uint32_t k = *(uint32_t*)data;

k *= m;
k ^= k >> r;
k *= m;

h *= m;
h ^= k;

data += 4;
len -= 4;
}

/* Handle the last few bytes of the input array */
switch(len) {
case 3: h ^= data[2] << 16;
case 2: h ^= data[1] << 8;
case 1: h ^= data[0]; h *= m;
};

/* Do a few final mixes of the hash to ensure the last few
* bytes are well-incorporated. */
h ^= h >> 13;
h *= m;
h ^= h >> 15;

return (unsigned int)h;
}

哈希函数中大多混淆操作选用 ^ 的原因:

  • 异或操作能混合数据的各个位的特征,异或操作拥有自反性
  • 消除偏置:避免简单的加法与乘法带来的累计偏置
  • 性能高效:CPU级别运算,位运算通常速度较快

Redis中的使用

在 redis 中,MurmurHash2 被用作默认哈希函数实现,主要用于:

  1. 计算哈希表的 key 索引
  2. 计算一致性哈希中虚拟节点的位置

选择 MurmurHash2 的原因:

  • 快速:适合 redis 高性能的需求
  • 均匀:减少哈希冲突
  • 简单:实现简单

缺点

MurmurHash2 在某些情况下可能会出现哈希碰撞率略高的问题。

Hash functions can be vulnerable to collision attacks, where a user can choose input data in such a way so as to intentionally cause hash collisions. Jean-Philippe Aumasson and Daniel J. Bernstein were able to show that even implementations of MurmurHash using a randomized seed are vulnerable to so-called HashDoS attacks.[51] With the use of differential cryptanalysis, they were able to generate inputs that would lead to a hash collision. The authors of the attack recommend using their own SipHash instead.