Comparing File Similarity with Fuzzy Hashing in Python

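To compare the similarity of two files in Python, you can use difflib.SequenceMatcher, ssdeep, python_mmdt, or tlsh. When many comparisons are needed and the files are large, efficiency matters more, and a fuzzy hash such as ssdeep or python_mmdt is worth considering.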
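As a baseline for what a similarity score means, here is a minimal sketch of the difflib approach on two short byte strings. The helper name `similarity` is hypothetical; `SequenceMatcher.ratio()` is the standard-library call, and it returns 2 * M / T, where M is the number of matched elements and T is the combined length of both inputs.

```python
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Return SequenceMatcher's ratio in [0.0, 1.0]; 1.0 means identical."""
    # None means "no junk filter"; bytes are compared element by element.
    return SequenceMatcher(None, a, b).ratio()


# "bcd" is the longest common block: 2 * 3 / (4 + 4) = 0.75
print(similarity(b"abcd", b"bcde"))
print(similarity(b"abcd", b"abcd"))
```

Because SequenceMatcher needs both inputs fully in memory and compares them element by element, this approach suits small inputs far better than large ones.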

Findings from testing:

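  • difflib: after both files are read into memory, it outputs a match ratio.
  • ssdeep/mmdt/tlsh: the fuzzy hash can be computed in advance, so at verification time each file is read only once to complete the comparison. This reduces comparison time as well as memory and CPU usage.
  • tlsh: a lower score means higher similarity. Its results on small files were very poor.
  • On small files the three methods differ little. On large files (81 MB in this test), difflib is unacceptably slow.
  • In a real environment, mmdt is the recommended choice, because ssdeep's results on binary files deviate so much that they lose their reference value. Which other file types have this problem still needs to be examined.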
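The "hash once, compare later" workflow from the findings above can be sketched as follows. This is only an illustration under stated assumptions: `build_index`, `save_index`, `load_index`, and `rank_matches` are hypothetical helper names, and `toy_hash` / `toy_compare` are stand-ins that are not fuzzy hashes at all; they exist only so the sketch runs with the standard library.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path


def build_index(paths, hash_func):
    """Hash every file once and return {path: digest}."""
    return {str(p): hash_func(str(p)) for p in paths}


def save_index(index, index_path):
    """Persist the digests so later runs do not need to re-read the files."""
    Path(index_path).write_text(json.dumps(index, indent=2), encoding="utf-8")


def load_index(index_path):
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def rank_matches(index, candidate_digest, compare_func):
    """Score one candidate digest against every stored digest, best first."""
    scored = [(compare_func(d, candidate_digest), p) for p, d in index.items()]
    return sorted(scored, reverse=True)  # assumes higher score = more similar


def toy_hash(path):
    # Stand-in only: keeps the whole text, so it is NOT a real fuzzy hash.
    return Path(path).read_text(encoding="utf-8")


def toy_compare(d1, d2):
    # Stand-in only: a 0-100 score derived from difflib.
    return round(SequenceMatcher(None, d1, d2).ratio() * 100)


with tempfile.TemporaryDirectory() as tmp:
    known = Path(tmp, "known.txt")
    other = Path(tmp, "other.txt")
    known.write_text("UUID=abc / ext4 defaults 0 1\n", encoding="utf-8")
    other.write_text("completely different content\n", encoding="utf-8")
    index_file = Path(tmp, "index.json")
    save_index(build_index([known, other], toy_hash), index_file)

    # Later: only the new file is read; stored digests come from the index.
    candidate = Path(tmp, "candidate.txt")
    candidate.write_text("UUID=abc / ext4 defaults 0 2\n", encoding="utf-8")
    ranking = rank_matches(load_index(index_file), toy_hash(str(candidate)), toy_compare)
    best_score, best_path = ranking[0]
    print(best_score, Path(best_path).name)
```

With ssdeep, `hash_func` and `compare_func` would presumably be `ssdeep.hash_from_file` and `ssdeep.compare`, the two calls used in the test script. For tlsh the sort order would need to be reversed, since a lower score means higher similarity.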

Test environment:

OS: Ubuntu 20.04

Python: 3.8.10

py-tlsh==4.7.2

python-mmdt==0.3.1

ssdeep==3.4

# -*- coding: utf-8 -*-

import ssdeep
import time
from python_mmdt.mmdt.mmdt import MMDT
from difflib import SequenceMatcher

def difflib_test(file1,file2):
  start_time = time.time()
  with open(file1,'rb') as f:
      s1 = f.read()
  with open(file2,'rb') as f:
      s2 = f.read()
  match_obj =  SequenceMatcher(None,s1,s2)
  print("difflib match:",match_obj.ratio())
  end_time = time.time()
  print('difflib_test cost :',end_time-start_time)

def mmdt_test(file1,file2):
  start_time = time.time()
  mmdt=MMDT()
  r1 = mmdt.mmdt_hash(file1)
  print(r1)
  r2 = mmdt.mmdt_hash_streaming(file2)
  print(r2)
  # sim1 = mmdt.mmdt_compare(file1, file2)
  # print("mmdt match:",sim1)
  sim2 = mmdt.mmdt_compare_hash(r1, r2)
  print("mmdt match:",sim2)
  end_time = time.time()
  print('mmdt_test cost :',end_time-start_time)

def ssdeep_test(file1,file2):
  start_time = time.time()
  sig1=ssdeep.hash_from_file(file1)
  sig2=ssdeep.hash_from_file(file2)
  print(sig1)
  print(sig2)
  print("ssdeep match:",ssdeep.compare(sig1,sig2))
  end_time = time.time()
  print('ssdeep_test cost :',end_time-start_time)

if __name__ == '__main__':
  start_time = time.time()
  file1='/root/test/fstab'
  file2='/root/test/fstab2'
  # file1 = '/root/test/initrd.img-5.4.0-125-generic'
  # file2 = '/root/test/initrd.img-5.4.0-135-generic'
  mmdt_test(file1,file2)    
  ssdeep_test(file1,file2)
  difflib_test(file1,file2)
  end_time = time.time()
  print('total execution time:',end_time-start_time)
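Every test function here repeats the same paired time.time() calls. One way to factor that out is a small context manager, sketched below. The names `timed` and `results` are hypothetical and not part of the original script, and time.perf_counter() is swapped in for time.time() because it is a monotonic, high-resolution clock intended for measuring short durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print(f"{label} cost: {elapsed:.6f}s")


timings = {}
with timed("sleep_demo", timings):
    time.sleep(0.01)  # placeholder for a call such as a hashing test
```

Collecting the timings in a dict makes it easier to compare the methods side by side after a run.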

The results for small and large files are shown below:

Testing tlsh

import tlsh
import time

def tlsh_test(file1,file2):
  start_time = time.time()
  with open(file1,'rb') as f:
      s1 = tlsh.hash(f.read())
  with open(file2,'rb') as f:
      s2 = tlsh.hash(f.read())
  # tlsh.diff returns a distance: the lower the value, the more similar
  distance = tlsh.diff(s1,s2)
  print("tlsh diff (lower is more similar):",distance)
  end_time = time.time()
  print('tlsh_test cost :',end_time-start_time)


if __name__ == '__main__':
  start_time = time.time()
  # file1='/root/test/fstab'
  # file2='/root/test/fstab2'
  file1 = '/root/test/initrd.img-5.4.0-125-generic'
  file2 = '/root/test/initrd.img-5.4.0-135-generic'
  tlsh_test(file1,file2)
  end_time = time.time()
  print('total execution time:',end_time-start_time)

Small-file and large-file comparison results
