Mining Patterns and Trends in Data Stream

Source: Shanghai Jiao Tong University | Citations: 0
Living in the information age, people spend much of their time with digital gadgets, doing almost everything online: handling transactions, updating social media, commenting on someone's status or a company's products, sharing pictures and videos, applying for jobs through web-based job search engines, and so on. As a result, data, and with it information, is continuously produced, streamed, and received at high volume and velocity by digital devices and sensors across the globe. Knowledge discovery from data therefore becomes an important task for data stakeholders, including mining patterns and their trends from the data. The challenge grows when the task is to effectively and immediately detect how patterns change over dynamic, incrementally streaming data.

One kind of data that changes and streams every day is job-advert data, especially adverts for information and communication technology (ICT) jobs. The sets of skills (skillsets) required by ICT industries change rapidly and massively with the fast growth of research and development in this sector. This has motivated many studies that process job-advert data with data mining methods in order to extract information and deliver it to ICT skillset stakeholders, such as the management of higher education institutions (HEIs), students, and professionals. For an HEI, identifying the skillsets currently required by industry is an important task that should be performed periodically, because the results are used to improve curricula. The literature concludes that it is difficult for HEI curriculum management to keep pace with the changing skillset requirements of ICT industries. This creates a gap between the skillsets taught to students and those required by industry. Several methods, both straightforward and non-straightforward, have been proposed to address this problem. The straightforward, manual approach (for example, inviting industry managers to the HEI and asking them directly which skillsets they require) is not an effective solution, since it is costly and based on small samples. Among the non-straightforward methods, data mining approaches can be applied with job adverts as the main source of information about the skillsets industry requires.

We see open opportunities to solve several problems that remain in state-of-the-art methods working on job adverts. In particular, we use frequent pattern (FP) and emerging pattern (EP) mining as the solution for all identified problems, since such patterns represent the combinations of skills that dominate the job adverts and whose demand is emerging in industry today. The first problem concerns the method used to cluster job titles and required skillsets. Agglomerative and k-means clustering have been applied on top of the vector space model (VSM). Because they operate on numeric vectors of skill terms, their results cannot directly provide the descriptions that are important for cluster analysis. Alternatively, FP-based clustering can be used as the solution: frequent termsets (FT) serve both as clustering candidates and as cluster descriptions. FTs are mined from the dataset using a user-given minimum support threshold (minsupp); a termset is frequent if its support exceeds minsupp. However, traditional FP clustering uses only the most frequent, and therefore short, termsets as descriptions, which makes the descriptions nearly meaningless; on the other hand, mining long termsets is hard because the number of FTs in the collection explodes. Our solution is an alternative concept for mining FTs, called frequent contextual termsets (FCT), with which some long FTs can be obtained while keeping the total number of FTs acceptable. Two algorithms for clustering job-skillset data are proposed.

The second problem also stems from the way current research clusters job titles. The skillsets required for a job are derived solely from vector distances, and only skills whose individual frequency is at least 10% in the whole dataset are included; there is no further check of whether a combination of those skills also reaches 10%. This contradicts the concept of FP: although two skills may each be frequent at 10% individually, their combination is not necessarily frequent. Moreover, the skillsets produced by clustering cannot answer which combinations of skills dominate the job adverts. The method is also performed statically: job adverts are collected for one to two years (or more) before being processed, so the resulting statistics on job titles and skillsets may already be out of date when delivered to academia or professionals, given the rapid change in industry skillset requirements. A further problem is that even once the skillset required for a job is known, the magnitude of the gap between a student's skillset and the industry-required skillset is not measured quantitatively, so academia cannot immediately see which skills should be taught. Our solutions are as follows. An FP mining algorithm generates frequent skillsets (FS), combinations of skills that dominate the job-skillset data. Domination is determined by support, the interestingness measure of FP mining; under support, skills are associated because they frequently co-occur in the dataset. FS are mined periodically, as new job adverts are periodically downloaded and added to the job-skillset dataset. This solves the static-processing problem of traditional methods: the FS are always up to date and can inform academia about the skillsets industry requires today. To measure the gap's magnitude, we propose a new measure, the student's skillset coverage, which intersects the student's skillset with the FS; the coverage quantifies the gap between the two. The coverage is then visualized on a newly proposed tool, the skillset-student matrix. Since the FS collection is updated periodically, and students' skillsets also change every year as they take more courses, the map of the gap on the visualization tool changes as well. Knowing the gap, HEI management can take immediate action to keep it from widening in the future.

While FS represent the skillsets popular in today's job adverts, the skillsets that will become popular in the near future are not yet known; this question is the third problem solved in this dissertation. The concept of emerging patterns (EP) serves as the basis of the solution: a skillset is called an emerging skillset (ES) if it is frequent and its support growth between the previous and current time exceeds a minimum growth threshold. However, since we do not know in advance when a skillset will emerge, the skillsets and their supports found in all time windows must be maintained over the long term. A time window here refers to a block of data (job-skillset records) processed at one time stamp. Our experiments focused on developing a new time-window model, the Fibonacci windows model (FWin for short), as a solution for storing skillsets and supports efficiently over long periods. Although FWin outperforms the traditional model, the Logarithmic tilted-time windows model (LWin for short), in time and memory efficiency, it does not outperform it in the number of EPs found. This finding motivated us to improve the tilted-time windows model (TTWM) underlying both FWin and LWin. TTWM saves the memory used to store support data by tilting, or folding, old windows so that the supports found in the most recent windows are the most accurate, while those in old windows may be less accurate. Technically, the supports found in n windows are condensed into m (m ≤ n) array elements using a particular element-updating mechanism: old supports are merged, while recent supports are kept at their original values. However, TTWM's updating mechanism creates many null elements at the front of the array, so some ES cannot be found at several time stamps; this is the main problem we attempt to solve. As the solution, we propose a novel Push-front mechanism for TTWM that not only avoids creating null elements but also keeps the most accurate supports at the front. Applying the Push-front approach to the Fibonacci windows model yields the new Push-front Fibonacci windows model. Experiments show that the new model outperforms both LWin and FWin in the number of ES found in streaming data.
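The support-based reasoning above (two individually frequent skills whose combination is not frequent) can be illustrated with a minimal sketch. The toy adverts, skill names, and the 30% threshold below are all hypothetical, not data from the dissertation:

```python
from itertools import combinations

# Toy job-advert dataset: each record is the skillset of one advert (hypothetical).
adverts = [
    {"python", "sql"}, {"python", "java"}, {"sql", "excel"},
    {"python", "sql", "excel"}, {"java", "excel"}, {"python"},
    {"sql"}, {"java", "python"}, {"sql", "java"}, {"excel"},
]

def support(itemset, data):
    """Fraction of records that contain every skill in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= record for record in data) / len(data)

minsupp = 0.3  # illustrative minimum support threshold
skills = sorted({s for record in adverts for s in record})

# Every individual skill here is frequent (support >= minsupp)...
frequent_single = [s for s in skills if support({s}, adverts) >= minsupp]

# ...yet no pair of them is: frequency of parts does not imply
# frequency of the combination, which is the abstract's point.
for a, b in combinations(frequent_single, 2):
    print(a, b, support({a, b}, adverts))
```

In this toy run every single skill reaches at least 0.4 support while every pair stays at 0.2 or below, so a method that admits skills individually at a 10% (here 30%) cutoff says nothing about their combinations.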
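The student's skillset coverage could be sketched as an intersection-based measure like the one below. The exact formula is an assumption for illustration; the dissertation's definition may differ, and all skill names are hypothetical:

```python
def skillset_coverage(student_skills, frequent_skillsets):
    """Hypothetical coverage measure: the fraction of currently frequent
    skillsets (FS) that the student's skills fully cover.
    A sketch of the idea, not the dissertation's exact formula."""
    student_skills = set(student_skills)
    covered = sum(fs <= student_skills for fs in frequent_skillsets)
    return covered / len(frequent_skillsets)

# Illustrative FS mined from "today's" adverts:
fs_today = [frozenset({"python", "sql"}),
            frozenset({"java", "spring"}),
            frozenset({"python", "cloud"})]

print(skillset_coverage({"python", "sql", "cloud"}, fs_today))  # covers 2 of 3
```

A low coverage flags a wide gap for that student; recomputing it after each periodic FS update is what makes the skillset-student matrix change over time.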
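The emerging-skillset test can be sketched as a predicate over two consecutive windows. Defining growth as the ratio of supports is an assumption; the dissertation may define it differently:

```python
def is_emerging(supp_prev, supp_curr, minsupp, min_growth):
    """A skillset is 'emerging' (ES) if it is frequent now and its support
    growth from the previous to the current window exceeds min_growth.
    Growth is taken here as the ratio supp_curr / supp_prev (an assumption)."""
    if supp_curr < minsupp:
        return False          # not frequent now, so not emerging
    if supp_prev == 0:
        return True           # newly appearing and already frequent
    return supp_curr / supp_prev > min_growth

# e.g. a skillset rising from 5% to 20% support, with minsupp=0.10 and growth > 2:
print(is_emerging(0.05, 0.20, minsupp=0.10, min_growth=2.0))  # True
```

Because such a rise can happen at any time stamp, the supports of candidate skillsets must be retained across many windows, which is what motivates the tilted-time window storage discussed above.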
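The condensation idea behind tilted-time windows (n windows folded into m ≤ n array elements, recent values exact, old values merged) can be illustrated with a minimal logarithmic variant. This sketch is illustrative only; the dissertation's LWin, FWin, and Push-front models differ in detail:

```python
class LogTiltedWindows:
    """Minimal sketch of a logarithmic tilted-time window list: level i
    holds support counts at a granularity of 2**i batches, each level keeps
    at most two entries, and overflow merges the two oldest into the next
    coarser level. Recent supports stay exact; old ones are condensed."""

    def __init__(self):
        self.levels = []  # levels[i]: list of (possibly merged) support counts

    def add(self, support_count):
        carry = support_count
        i = 0
        while carry is not None:
            if len(self.levels) <= i:
                self.levels.append([])
            self.levels[i].insert(0, carry)      # most recent at the front
            if len(self.levels[i]) > 2:
                oldest = self.levels[i].pop()    # condense the two oldest
                older = self.levels[i].pop()     # into one coarser window
                carry = oldest + older
                i += 1
            else:
                carry = None

w = LogTiltedWindows()
for count in [1, 2, 3, 4, 5]:   # five batches of support counts
    w.add(count)
print(w.levels)                  # e.g. [[5], [7, 3]] -- totals preserved
```

Storage grows logarithmically in the number of windows while the total support mass is preserved; the Push-front mechanism described above addresses the null front elements that a straightforward updating scheme of this kind can leave behind.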