Nowadays, data provenance is widely used to increase the accuracy of machine learning models and to accelerate model training. However, these models face difficulties with information heredity, which make it hard to produce satisfactory, coherent information and data associations. Implementing data provenance can overcome these shortcomings and manage conglomeration issues. Most studies in the field of data provenance focus on its implementation for specific cases. Furthermore, few studies address a machine learning (ML) framework with a distinct emphasis on accurately partitioning coherent and physical activity plans to implement ML pipelines for provenance. This paper presents a novel approach to using data provenance, called data provenance for distributed machine learning systems for text analysis and linear regression (DPMLR). To develop a comprehensive approach, we apply Stellar Graph as our primary ML structure, since it provides a collective set of functions for various algorithms and the ability to run large-scale graph analysis. DPMLR is less time-consuming than other platforms when developing large models on normal-sized computing clusters. To accelerate model training, we adapted Stellar Graph to read data streams and to execute the trainer on Apache Spark clusters for fast, distributed processing. Preliminary results on a complex data stream structure showed that the overall overhead is no more than 20%. Moreover, the query response time ranges from 1 to 12 seconds, depending on the complexity and volume of the data streams. Finally, the work shows opportunities to design an integrated system that performs dynamic scheduling and network-bounded synchronization based on ML algorithms.
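To make the idea of capturing provenance around a training step concrete, the following minimal sketch wraps a small linear regression fit and logs one provenance record (input digest, output digest, elapsed time). It is an illustration under stated assumptions, not the DPMLR implementation: it uses only the Python standard library rather than Stellar Graph or Apache Spark, and the names `fingerprint`, `fit_line`, `fit_with_provenance`, and the record fields are hypothetical.

```python
import hashlib
import json
import time


def fingerprint(obj):
    """Return a stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_with_provenance(xs, ys, log):
    """Fit the model and append one provenance record to `log` (hypothetical schema)."""
    started = time.perf_counter()
    slope, intercept = fit_line(xs, ys)
    log.append({
        "activity": "fit_line",
        # Digests link the trained parameters back to the exact inputs used.
        "input_digest": fingerprint({"xs": xs, "ys": ys}),
        "output_digest": fingerprint({"slope": slope, "intercept": intercept}),
        "elapsed_seconds": time.perf_counter() - started,
    })
    return slope, intercept


provenance_log = []
slope, intercept = fit_with_provenance([0, 1, 2, 3], [1, 3, 5, 7], provenance_log)
```

Storing digests rather than the data itself keeps each record small. A provenance query can then check whether a given model output derives from a given input by comparing digests, at the cost of recomputing the hash for every logged step.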