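How CMS identification works
CMS stands for Content Management System. CMS identification works by collecting characteristic features that a given CMS exposes and using those features to determine which CMS a site is running.
Here we use two methods: MD5 matching and regular-expression matching. Concretely, we request a specific file path on the site and then either compare the MD5 of the returned file against a known value or match a keyword in the response with a regular expression. If the check succeeds, the site is running that CMS. The success rate therefore depends on the fingerprint dictionary.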
Fingerprint format
Here is an excerpt of some Web fingerprints for reference:
- {
-     "url": "/install/",
-     "re": "aspcms",
-     "name": "AspCMS",
-     "md5": ""
- },
- {
-     "url": "/about/_notes/dwsync.xml",
-     "re": "aspcms",
-     "name": "AspCMS",
-     "md5": ""
- },
- {
-     "url": "/admin/_Style/_notes/dwsync.xml",
-     "re": "aspcms",
-     "name": "AspCMS",
-     "md5": ""
- },
- {
-     "url": "/apply/_notes/dwsync.xml",
-     "re": "aspcms",
-     "name": "AspCMS",
-     "md5": ""
- },
- {
-     "url": "/tpl/green/common/images/notebg.jpg",
-     "re": "",
-     "name": "自动发卡平台",
-     "md5": "690f337298c331f217c0407cc11620e9"
- },
- {
-     "url": "/images/download.png",
-     "re": "",
-     "name": "全程oa",
-     "md5": "9921660baaf9e0b3b747266eb5af880f"
- },
- {
-     "url": "/kindeditor/license.txt",
-     "re": "",
-     "name": "T-Site建站系统",
-     "md5": "b0d181292c99cf9bb2ae9166dd3a0239"
- },
- {
-     "url": "/public/ico/favicon.png",
-     "re": "",
-     "name": "悟空CRM",
-     "md5": "834089ffa1cd3a27b920a335d7c067d7"
- },
- {
-     "url": "/public/js/php/file_manager_json.php",
-     "re": "",
-     "name": "悟空CRM",
-     "md5": "c64fd0278d72826eb9041773efa1f587"
- },
- {
-     "url": "/plugins/weathermap/images/exclamation.png",
-     "re": "",
-     "name": "CactiEZ插件",
-     "md5": "2e25cb083312b0eabfa378a89b07cd03"
- }
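To make the format concrete, here is a minimal sketch of how a single entry could be checked against a downloaded response. The function name `match_fingerprint` is hypothetical, and the sketch assumes the body is available as raw bytes (which is what an MD5 comparison needs). A non-empty "re" takes priority, otherwise "md5" is compared.

```python
import hashlib
import re


def match_fingerprint(fingerprint, body):
    """Return True if the response body (bytes) matches one fingerprint entry."""
    pattern = fingerprint.get("re")
    if pattern:
        # Keyword/regex fingerprints are matched against the decoded text.
        text = body.decode("utf-8", errors="ignore")
        return re.search(pattern, text) is not None
    expected = fingerprint.get("md5")
    if expected:
        # File fingerprints compare the MD5 of the raw bytes.
        return hashlib.md5(body).hexdigest() == expected.lower()
    # An entry with neither field cannot match anything.
    return False
```

For example, the first AspCMS entry above would match any response from /install/ whose text contains "aspcms".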
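Fingerprint file
The Web fingerprints are stored in data.json under the data directory. The file contains more than 1,400 fingerprints for CMSs commonly seen in China. [Download link]()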
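As a small sketch of working with that file, the snippet below loads it and groups the entries by probe path. It assumes data.json is a JSON array of objects in the format shown earlier. The helper names are hypothetical. Grouping only pays off if several entries share a path, which is an assumption about the data rather than something the excerpt shows.

```python
import json
from collections import defaultdict


def load_fingerprints(path):
    """Load the fingerprint list from a JSON file (assumed to be a JSON array)."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


def group_by_url(fingerprints):
    """Map each probe path to the list of fingerprints that use it."""
    groups = defaultdict(list)
    for entry in fingerprints:
        groups[entry["url"]].append(entry)
    return dict(groups)
```

With the entries grouped this way, each path would only need to be requested once, and every fingerprint for that path could be checked against the same response.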
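Writing the code
The idea is simple, but the implementation raises several issues, efficiency being one of them. With 1,000+ fingerprints there are 1,000+ pages to request, and doing that one request at a time is too slow, so multithreading is needed. With heavier use, multithreading also turns out to be too slow, at which point coroutines are an option. That optimization can come later; multithreading is enough here.
Create a new file lib/core/webcms.py with the following code: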
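For reference, the coroutine approach mentioned above might look roughly like the sketch below. It is not the code used in this post. `check` is a hypothetical async callable that would fetch a fingerprint's URL and report whether it matched; a real version would need an async HTTP client, which is left out here. The `limit` of 50 concurrent checks is an arbitrary assumption.

```python
import asyncio


async def first_match_async(fingerprints, check, limit=50):
    """Await check(fingerprint) concurrently; return the first matching name or None."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(fp):
        # The semaphore caps how many checks run at the same time.
        async with semaphore:
            return fp["name"] if await check(fp) else None

    tasks = [asyncio.ensure_future(guarded(fp)) for fp in fingerprints]
    try:
        for finished in asyncio.as_completed(tasks):
            name = await finished
            if name is not None:
                return name
        return None
    finally:
        # Stop whatever is still pending once a result is known.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```

It would be driven with something like `asyncio.run(first_match_async(fingerprints, check))`.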
- # __author__ = 'mathor'
- import json, os, sys, hashlib, threading, queue, re
- from lib.core import Download
-
- class webcms(object):
-     def __init__(self, url, threadNum=10):
-         self.URL = url
-         self.threadNum = threadNum
-         # NotFound stays True until a match is found or the queue runs dry
-         self.NotFound = True
-         self.result = ""
-         self.Downloader = Download.Downloader()
-         # Per-instance queue, so separate scans do not share fingerprints
-         self.workQueue = queue.Queue()
-         filename = os.path.join(sys.path[0], 'data', 'data.json')
-         # The with block closes the file; json.load takes no encoding argument
-         with open(filename, encoding='utf-8') as fp:
-             webdata = json.load(fp)
-         for i in webdata:
-             self.workQueue.put(i)
-
-     def getmd5(self, body):
-         # hashlib needs bytes; encode if the downloader returned text.
-         # Note: for binary files, a hash of decoded text may not equal
-         # the hash of the raw bytes.
-         if isinstance(body, str):
-             body = body.encode('utf-8')
-         return hashlib.md5(body).hexdigest()
-
-     def th_whatweb(self):
-         if self.NotFound is False:
-             return False
-         try:
-             # get_nowait avoids blocking if another thread emptied the queue first
-             cms = self.workQueue.get_nowait()
-         except queue.Empty:
-             # Nothing left to check: stop the loop in run()
-             self.NotFound = False
-             return False
-         _url = self.URL + cms['url']
-         print("[whatweb log]:checking %s" % _url)
-         html = self.Downloader.get(_url)
-         if html is None:
-             return False
-         if cms['re']:
-             # Treat the "re" field as a regular expression
-             matched = re.search(cms['re'], html) is not None
-         else:
-             matched = self.getmd5(html) == cms['md5']
-         if matched:
-             self.result = cms['name']
-             self.NotFound = False
-         return matched
-
-     def run(self):
-         # Start threadNum workers per round until a match is found or the queue is empty
-         while self.NotFound:
-             th = []
-             for i in range(self.threadNum):
-                 t = threading.Thread(target=self.th_whatweb)
-                 t.start()
-                 th.append(t)
-             for t in th:
-                 t.join()
-         if self.result:
-             print("[webcms]:%s cms is %s" % (self.URL, self.result))
-         else:
-             print("[webcms]:%s cms NOTFound!" % self.URL)
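The run method above starts a fresh batch of threads each round and waits for the whole batch before starting the next. As an alternative sketch (not the author's code), the standard library's thread pool can keep a fixed number of workers busy and stop at the first hit. `check` is a hypothetical callable that would download a fingerprint's URL and return True on a match.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_match(fingerprints, check, max_workers=10):
    """Run check(fingerprint) in a thread pool; return the first matching name or None."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check, fp): fp for fp in fingerprints}
        for future in as_completed(futures):
            if future.result():
                # Cancel checks that have not started; running ones finish normally.
                for other in futures:
                    other.cancel()
                return futures[future]["name"]
    return None
```

The pool reuses its worker threads, so the number of threads stays at `max_workers` no matter how many fingerprints there are.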
Calling it
Rewrite the main file w8ay.py:
- #-*- coding:utf-8 -*-
- '''
- Name: w8ayScan
- Author: mathor
- Copyright (c) 2019
- '''
- import sys
- from lib.core.Spider import SpiderMain
- from lib.core import webcms
- def main():
-     root = "https://wmathor.com"
-     threadNum = 1000
-
-     # webcms
-     ww = webcms.webcms(root, threadNum)
-     ww.run()
-
-     # spider
-     w8 = SpiderMain(root, threadNum)
-     w8.craw()
-
- if __name__ == "__main__":
-     main()
Very impressive.
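Learned something.
Could you share a fingerprint database in the data.json format? The post doesn't seem to include the download link, and it is currently unavailable. Thanks!
I've been busy these past few days; I'll send it to you next Tuesday.

Thanks, but I don't need it anymore. I pulled a db database from GitHub and converted it to JSON myself.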