A feature selection method based on partial least squares discriminant analysis (PLS-DA) and the F-score is proposed and applied to the analysis of proteomic mass spectrometry data. The method consists of three main steps: (1) the raw data are preprocessed with the LIMPIC algorithm; (2) the F-score of each variable is calculated and all variables are ranked in descending order of F-score; (3) the optimal variable subset is chosen by forward selection, using PLS-DA cross-validation. Applied to an ovarian cancer data set, the method selected 12 feature variables as potential biomarkers from the original 15154 mass-to-charge ratio (m/z) variables. Their cross-validated specificity and sensitivity on the training set were 98.36% and 98.15%, respectively, and their specificity and sensitivity on an independent test set were 96.67% and 100%, respectively. Principal component analysis (PCA) performed on the selected variables showed that they separate the samples well, indicating that they capture the class information of the samples. The proposed method can be used for feature selection and sample classification of proteomic mass spectrometry data.
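Steps (2) and (3) can be illustrated with a minimal sketch. The abstract does not give the F-score formula or the exact selection rule, so several things here are assumptions: the F-score is taken to be the commonly used two-class form (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances); forward selection is read as growing the subset in F-score order and keeping the size with the best cross-validated score; and a leave-one-out nearest-centroid classifier stands in for PLS-DA purely to keep the example self-contained. All function names are hypothetical, and LIMPIC preprocessing is not shown.

```python
from statistics import mean, variance


def f_score(column, labels):
    """F-score of one variable for a two-class problem (labels are 0 or 1).

    Assumed definition: between-class separation of the means divided by
    the sum of the within-class sample variances.
    """
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    overall = mean(column)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    return between / within if within > 0 else 0.0


def rank_by_f_score(X, labels):
    """Return variable indices sorted by descending F-score."""
    n_vars = len(X[0])
    scores = [f_score([row[j] for row in X], labels) for j in range(n_vars)]
    return sorted(range(n_vars), key=lambda j: scores[j], reverse=True)


def loo_accuracy(X, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    This is only a stand-in for PLS-DA cross-validation, not PLS-DA itself.
    """
    correct = 0
    for i, (row, y) in enumerate(zip(X, labels)):
        dist = {}
        for c in (0, 1):
            # Class centroid computed without the held-out sample i.
            members = [r for k, (r, l) in enumerate(zip(X, labels))
                       if l == c and k != i]
            centroid = [mean(r[j] for r in members) for j in subset]
            dist[c] = sum((row[j] - m) ** 2 for j, m in zip(subset, centroid))
        correct += min(dist, key=dist.get) == y
    return correct / len(X)


def forward_select(X, labels, max_vars, cv_score=loo_accuracy):
    """Grow the subset in F-score order; keep the size with the best CV score."""
    ranked = rank_by_f_score(X, labels)
    best_subset, best_score = [], -1.0
    for k in range(1, min(max_vars, len(ranked)) + 1):
        subset = ranked[:k]
        score = cv_score(X, labels, subset)
        if score > best_score:  # strict: ties keep the smaller subset
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Calling `forward_select(X, labels, max_vars)` with `X` as a list of sample rows and `labels` as a list of 0/1 class labels returns the chosen variable indices and their cross-validated score. To follow the abstract more closely, `cv_score` would be replaced by a function that cross-validates a PLS-DA model on the given subset, for example one built on a PLS regression routine with a threshold on the predicted class value.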