
Classic Papers on Big Data Systems


Recommended by: @JerryLead

Note: the selection below favors systems papers that are already widely used in industry; many other excellent papers are not included in this list.

Distributed Data-Parallel Processing Frameworks and Programming Models

[1] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.

[2] Isard M, Budiu M, Yu Y, et al. Dryad: distributed data-parallel programs from sequential building blocks[C]//ACM SIGOPS Operating Systems Review. ACM, 2007, 41(3): 59-72.

[3] Yu Y, Isard M, Fetterly D, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language[C]//OSDI. 2008, 8: 1-14.

[4] Chambers C, Raniwala A, Perry F, et al. FlumeJava: easy, efficient data-parallel pipelines[C]//ACM Sigplan Notices. ACM, 2010, 45(6): 363-375.

[5] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.

[6] Akidau T, Bradshaw R, Chambers C, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing[J]. Proceedings of the VLDB Endowment, 2015, 8(12): 1792-1803.

[7] Saha B, Shah H, Seth S, et al. Apache Tez: A unifying framework for modeling and building data processing applications[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1357-1369.

[8] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: Stream and Batch Processing in a Single Engine[J]. IEEE Data Engineering Bulletin, 2015, 38(4): 28-38.
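
To make the programming model in [1] concrete, the following is a minimal single-process sketch of the map/shuffle/reduce contract: the user supplies a map function and a reduce function, and the framework groups intermediate values by key in between. The function names and toy input are illustrative only; a real deployment partitions both phases across a cluster and handles fault tolerance.

```python
# A minimal single-process sketch of the MapReduce model from [1]:
# the user writes map() and reduce(); the framework groups intermediate
# (key, value) pairs by key between the two phases (the "shuffle").
from collections import defaultdict

def map_fn(_, line):
    # Emit (word, 1) for every word in one input record.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts for a single word.
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:                 # map phase
        for k, v in map_fn(key, value):
            groups[k].append(v)               # shuffle: group by intermediate key
    results = []
    for k, vs in sorted(groups.items()):      # reduce phase
        results.extend(reduce_fn(k, vs))
    return results

if __name__ == "__main__":
    docs = [("doc1", "big data systems"), ("doc2", "big graph systems")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('big', 2), ('data', 1), ('graph', 1), ('systems', 2)]
```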

SQL Queries over Big Data

[1] Pike R, Dorward S, Griesemer R, et al. Interpreting the data: Parallel analysis with Sawzall[J]. Scientific Programming, 2005, 13(4): 277-298.

[2] Olston C, Reed B, Srivastava U, et al. Pig Latin: a not-so-foreign language for data processing[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1099-1110.

[3] Thusoo A, Sarma J S, Jain N, et al. Hive: a warehousing solution over a map-reduce framework[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1626-1629.

[4] Xin R S, Rosen J, Zaharia M, et al. Shark: SQL and rich analytics at scale[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of data. ACM, 2013: 13-24.

[5] Armbrust M, Xin R S, Lian C, et al. Spark SQL: Relational data processing in Spark[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-1394.

[6] Chattopadhyay B, Lin L, Liu W, et al. Tenzing: A SQL implementation on the MapReduce framework[J]. Proceedings of the VLDB Endowment, 2011, 4(12): 1318-1327.
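
As a small illustration of the relational interfaces these systems provide, the sketch below registers an in-memory table and runs a declarative aggregation with Spark SQL [5]. It assumes a local PySpark installation (pip install pyspark); the table and column names are made up for this example.

```python
# A small Spark SQL example in the spirit of [5], assuming a local
# PySpark installation. Table and column names are illustrative only.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-sketch")
         .master("local[*]")
         .getOrCreate())

# Build an in-memory DataFrame and expose it as a temporary SQL view.
events = spark.createDataFrame(
    [("alice", "click", 3), ("bob", "click", 1), ("alice", "buy", 2)],
    ["username", "action", "n"],
)
events.createOrReplaceTempView("events")

# The declarative query is optimized (Catalyst) and then executed as
# ordinary distributed Spark jobs.
spark.sql("""
    SELECT username, SUM(n) AS total
    FROM events
    GROUP BY username
    ORDER BY total DESC
""").show()

spark.stop()
```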

Big Graph Processing

[1] Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.

[2] Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: a framework for machine learning and data mining in the cloud[J]. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.

[3] Low Y, Gonzalez J E, Kyrola A, et al. GraphLab: A new framework for parallel machine learning[J]. arXiv preprint arXiv:1408.2041, 2014.

[4] Gonzalez J E, Low Y, Gu H, et al. PowerGraph: Distributed graph-parallel computation on natural graphs[C]//Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012: 17-30.

[5] Kyrola A, Blelloch G, Guestrin C. GraphChi: large-scale graph computation on just a PC[C]//Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012: 31-46.

[6] Gonzalez J E, Xin R S, Dave A, et al. GraphX: Graph processing in a distributed dataflow framework[C]//11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 599-613.
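
The sketch below is a minimal single-process rendering of the vertex-centric, superstep-based model introduced by Pregel [1]: in each superstep a vertex reads its incoming messages, updates its value, and sends messages to its neighbors, and the job ends when no messages remain. It computes connected components by propagating the smallest vertex id and is meant to illustrate the model, not any system's actual API.

```python
# A minimal single-process sketch of Pregel's vertex-centric model [1]:
# computation proceeds in supersteps; a vertex with no incoming messages
# stays halted, and the job ends when no messages are in flight.
def connected_components(vertices, edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    value = {v: v for v in vertices}
    # Superstep 0: every vertex sends its current label to its neighbors.
    inbox = {v: [value[u] for u in neighbors[v]] for v in vertices}

    while any(inbox.values()):
        outbox = {v: [] for v in vertices}
        for v in vertices:
            if not inbox[v]:
                continue                      # no messages: vertex stays halted
            smallest = min(inbox[v])
            if smallest < value[v]:           # label improved: update and notify
                value[v] = smallest
                for u in neighbors[v]:
                    outbox[u].append(smallest)
        inbox = outbox                        # barrier between supersteps
    return value

if __name__ == "__main__":
    print(connected_components(vertices=[1, 2, 3, 4, 5],
                               edges=[(1, 2), (2, 3), (4, 5)]))
    # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```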

Distributed Machine Learning

[1] Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks[C]//Advances in neural information processing systems. 2012: 1223-1231.

[2] Li M, Andersen D G, Park J W, et al. Scaling distributed machine learning with the parameter server[C]//11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 583-598.

[3] Dai W, Wei J, Kim J K, et al. Petuum: A framework for iterative-convergent distributed ML[J]. 2013.

[4] Abadi M, Agarwal A, Barham P, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.

[5] Chen T, Li M, Li Y, et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv preprint arXiv:1512.01274, 2015.

[6] Meng X, Bradley J, Yavuz B, et al. MLlib: Machine learning in Apache Spark[J]. Journal of Machine Learning Research, 2016, 17(34): 1-7.

[7] Cui H, Cipar J, Ho Q, et al. Exploiting bounded staleness to speed up big data analytics[C]//2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014: 37-48.
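
To illustrate the parameter-server architecture of [2], here is a minimal single-process sketch: workers pull the current model, compute gradients on their own data shard, and push updates back to a server that owns the parameters. The class and method names are hypothetical, and the workers run sequentially here; a real system runs them in parallel over the network, often with bounded staleness as studied in [7].

```python
# A minimal single-process sketch of the parameter-server pattern in [2].
# The names ParameterServer, pull() and push() are hypothetical, not any
# library's API; real workers run in parallel across machines.
class ParameterServer:
    def __init__(self, init_w=0.0, lr=0.02):
        self.w = init_w            # the globally shared model parameter
        self.lr = lr

    def pull(self):
        return self.w              # workers fetch the latest parameters

    def push(self, grad):
        self.w -= self.lr * grad   # server applies the gradient update

def worker_gradient(w, shard):
    # Gradient of mean squared error for the model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

if __name__ == "__main__":
    # Data for y = 2x, partitioned across two "workers".
    shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
    ps = ParameterServer()
    for step in range(50):
        for shard in shards:       # sequential here; parallel in a real system
            ps.push(worker_gradient(ps.pull(), shard))
    print(round(ps.w, 3))          # approaches 2.0, the true slope
```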

Stream Processing

[1] Zaharia M, Das T, Li H, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters[C]//4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). 2012.

[2] Akidau T, Balikov A, Bekiroğlu K, et al. MillWheel: fault-tolerant stream processing at internet scale[J]. Proceedings of the VLDB Endowment, 2013, 6(11): 1033-1044.

[3] Qian Z, He Y, Su C, et al. TimeStream: Reliable stream computation in the cloud[C]//Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013: 1-14.
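
To illustrate the micro-batching idea behind discretized streams [1], the sketch below cuts an input stream into small fixed-size batches, runs a deterministic batch computation on each, and carries word-count state from one batch to the next. It is a single-process toy: in the actual system each interval's batch is processed as a fault-tolerant RDD job, and intervals are defined by time rather than element count.

```python
# A minimal single-process sketch of the discretized-stream idea in [1]:
# the unbounded input is cut into short batches, each batch is handled by
# a deterministic batch computation, and state is carried across batches.
from collections import Counter
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield fixed-size batches from a (possibly unbounded) iterator."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def run(stream, batch_size=3):
    state = Counter()                    # running counts carried across batches
    for batch in micro_batches(stream, batch_size):
        state.update(batch)              # deterministic per-batch job
        print(dict(state))               # emit updated counts each interval

if __name__ == "__main__":
    run(["a", "b", "a", "c", "a", "b", "b"])
    # {'a': 2, 'b': 1}
    # {'a': 3, 'b': 2, 'c': 1}
    # {'a': 3, 'b': 3, 'c': 1}
```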
