Error Analysis of Mislabeled Samples
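*Note: per the earlier analysis, with no mislabeled samples and a 40% training share (280 training samples), the mean prediction accuracy is 97.72%.

A simple analysis shows that mislabeled samples fall broadly into three cases:

1. AB samples mislabeled as NB; NB samples labeled correctly
That is, the pre-sorted NB set contains some AB samples, while the AB set contains only AB samples.
*Note: here, number of mislabeled samples = total number of samples × error ratio, and all mislabeled samples are placed in the NB set. For example, with 100 samples in total (70 AB, 30 NB) and an error ratio of 10%, the AB set contains 0 mislabeled samples and the NB set contains 10.

Error ratio 2.5%: 0.984, 0.978, 0.986, 0.98, 0.953, 0.97, 0.985, 0.979, 0.938, 0.973
Mean accuracy: 97.28%

Error ratio 5%: 0.975, 0.949, 0.956, 0.961, 0.955, 0.933, 0.96, 0.954, 0.979, 0.983
Mean accuracy: 96.07%

Error ratio 7.5%: 0.96, 0.933, 0.951, 0.94, 0.962, 0.94, 0.967, 0.966, 0.936, 0.962
Mean accuracy: 95.17%

Error ratio 10%: 0.949, 0.944, 0.922, 0.937, 0.953, 0.959, 0.937, 0.963, 0.93, 0.957, 0.939, 0.939, 0.938, 0.946, 0.958, 0.927, 0.91
Mean accuracy: 94.16%

Error ratio 15%: 0.914, 0.914, 0.904, 0.913, 0.898, 0.911, 0.9, 0.886, 0.923, 0.876
Mean accuracy: 90.39%
Accuracy drops substantially.

Error ratio 20%: 0.848, 0.87, 0.821, 0.862, 0.903, 0.879, 0.867, 0.868, 0.892, 0.879
Mean accuracy: 86.89%
Accuracy falls below 90%.

2. NB samples mislabeled as AB; AB samples labeled correctly
That is, the pre-sorted AB set contains some NB samples, while the NB set contains only NB samples.
*Note: here, number of mislabeled samples = total number of samples × error ratio, and all mislabeled samples are placed in the AB set. For example, with 100 samples in total (70 AB, 30 NB) and an error ratio of 10%, the AB set contains 10 mislabeled samples and the NB set contains 0.

Error ratio 2.5%: 0.98, 0.973, 0.966, 0.985, 0.947, 0.961, 0.945, 0.975, 0.941, 0.968
Mean accuracy: 96.42%

Error ratio 3.5%: 0.955, 0.897, 0.947, 0.909, 0.918, 0.944, 0.955, 0.944, 0.978, 0.941
Mean accuracy: 93.87%

Error ratio 5%: 0.885, 0.945, 0.907, 0.874, 0.921, 0.855, 0.877, 0.916, 0.917, 0.877
Mean accuracy: 89.73%

Error ratio 7.5%: 0.876, 0.88, 0.842, 0.885, 0.837, 0.891, 0.885, 0.832, 0.878, 0.865
Mean accuracy: 86.71%
The recognition error becomes large; prediction accuracy is below 90%.

3. Both the AB set and the NB set contain mislabeled samples
*Note: here, the mislabeled samples are split between the two sets in proportion to the share of AB and NB in the total. For example, with 100 samples in total (70 AB, 30 NB) and an error ratio of 10%, the AB set contains 7 mislabeled samples and the NB set contains 3.

Error ratio 2.5%: 0.985, 0.986, 0.981, 0.979, 0.966, 0.962, 0.977, 0.956, 0.974, 0.965
Mean accuracy: 97.32%
The impact is small, so the error ratio is increased further.

Error ratio 5%: 0.952, 0.962, 0.965, 0.947, 0.966, 0.961, 0.944, 0.973, 0.978, 0.94
Mean accuracy: 95.88%

Error ratio 7.5%: 0.938, 0.985, 0.955, 0.945, 0.965, 0.954, 0.937, 0.945, 0.932, 0.915
Mean accuracy: 94.70%

Error ratio 10%: 0.907, 0.939, 0.913, 0.916, 0.939, 0.921, 0.859, 0.93, 0.923
Mean accuracy: 91.62%

Error ratio 15%: 0.871, 0.82, 0.911, 0.886, 0.811, 0.845, 0.9, 0.886, 0.911, 0.878
Mean accuracy: 87.18%

The results show that the model's tolerance to label errors differs between the two sets. When the NB set contains AB samples (case 1), the model is barely affected: mean accuracy stays above 95% up to an error ratio of 7.5% and is still 94.16% at 10%. When the AB set contains NB samples (case 2), the model is highly sensitive to the error ratio: at 5% the mean accuracy already falls below 90%. When both sets contain mislabeled samples (case 3), mean accuracy overall stays above 90% as long as the error ratio is within 10%.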
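The counting rules in the three notes and the averaging of the per-run accuracies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mislabel_counts`, its parameters, and the use of rounding to get whole sample counts are hypothetical and are not taken from the original experiment code.

```python
from statistics import mean


def mislabel_counts(n_ab, n_nb, error_ratio, case):
    """Return (mislabeled in AB set, mislabeled in NB set).

    Hypothetical helper: mislabeled count = total samples * error ratio,
    then distributed according to the three cases described in the text.
    """
    total = n_ab + n_nb
    n_wrong = round(total * error_ratio)  # assumption: round to a whole sample
    if case == 1:
        # Case 1: all mislabeled samples sit in the NB set.
        return 0, n_wrong
    if case == 2:
        # Case 2: all mislabeled samples sit in the AB set.
        return n_wrong, 0
    if case == 3:
        # Case 3: split in proportion to each set's share of the total.
        wrong_ab = round(n_wrong * n_ab / total)
        return wrong_ab, n_wrong - wrong_ab
    raise ValueError("case must be 1, 2 or 3")


# Worked example from the notes: 100 samples (70 AB, 30 NB), error ratio 10%.
print(mislabel_counts(70, 30, 0.10, case=1))  # (0, 10)
print(mislabel_counts(70, 30, 0.10, case=2))  # (10, 0)
print(mislabel_counts(70, 30, 0.10, case=3))  # (7, 3)

# Mean accuracy over the ten runs listed for case 1 at an error ratio of 7.5%.
runs = [0.96, 0.933, 0.951, 0.94, 0.962, 0.94, 0.967, 0.966, 0.936, 0.962]
mean_accuracy = mean(runs)
print(f"{mean_accuracy:.2%}")  # 95.17%
```

The same `mean` call can be applied to any of the run lists above to check a reported average. Small differences in the last digit are possible if the reported averages were computed from unrounded accuracies.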