Recommended by: @JerryLead
Note: the selection below favors systems papers that have already seen wide industrial adoption; many other excellent papers are not on the list.
Distributed Data-Parallel Processing Frameworks and Programming Models
[1] Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113.
[2] Isard M, Budiu M, Yu Y, et al. Dryad: distributed data-parallel programs from sequential building blocks[C]//ACM SIGOPS Operating Systems Review. ACM, 2007, 41(3): 59-72.
[3] Yu Y, Isard M, Fetterly D, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language[C]//OSDI. 2008, 8: 1-14.
[4] Chambers C, Raniwala A, Perry F, et al. FlumeJava: easy, efficient data-parallel pipelines[C]//ACM Sigplan Notices. ACM, 2010, 45(6): 363-375.
[5] Zaharia M, Chowdhury M, Das T, et al. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[6] Akidau T, Bradshaw R, Chambers C, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing[J]. Proceedings of the VLDB Endowment, 2015, 8(12): 1792-1803.
[7] Saha B, Shah H, Seth S, et al. Apache Tez: a unifying framework for modeling and building data processing applications[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1357-1369.
[8] Carbone P, Katsifodimos A, Ewen S, et al. Apache Flink: stream and batch processing in a single engine[J]. IEEE Data Engineering Bulletin, 2015, 38(4): 28-38.
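To give a flavor of the programming model these papers build on, here is a minimal single-process sketch of MapReduce-style word count in the spirit of [1]: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The function names are illustrative, not part of any real framework's API.

```python
# A toy, in-memory rendering of map -> shuffle/group-by-key -> reduce.
from collections import defaultdict

def word_count_map(doc):
    # map: emit (word, 1) for every word in the input record
    for word in doc.split():
        yield (word, 1)

def word_count_reduce(word, counts):
    # reduce: sum all partial counts for one key
    return (word, sum(counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    # shuffle: group intermediate values by key
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(sorted(run_mapreduce(docs, word_count_map, word_count_reduce)))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```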
Big Data SQL Queries
[1] Pike R, Dorward S, Griesemer R, et al. Interpreting the data: Parallel analysis with Sawzall[J]. Scientific Programming, 2005, 13(4): 277-298.
[2] Olston C, Reed B, Srivastava U, et al. Pig Latin: a not-so-foreign language for data processing[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1099-1110.
[3] Thusoo A, Sarma J S, Jain N, et al. Hive: a warehousing solution over a map-reduce framework[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1626-1629.
[4] Xin R S, Rosen J, Zaharia M, et al. Shark: SQL and rich analytics at scale[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of data. ACM, 2013: 13-24.
[5] Armbrust M, Xin R S, Lian C, et al. Spark SQL: relational data processing in Spark[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015: 1383-1394.
[6] Chattopadhyay B, Lin L, Liu W, et al. Tenzing: a SQL implementation on the MapReduce framework[J]. Proceedings of the VLDB Endowment, 2011, 4(12): 1318-1327.
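Systems such as Hive [3] and Tenzing [6] compile SQL onto MapReduce stages. The sketch below illustrates the idea on a toy GROUP BY aggregation: the map phase projects the grouping key and emits partial aggregates (combiner-style), and the reduce phase finalizes them. The query, schema, and data are invented for illustration.

```python
# How a GROUP BY aggregation can be lowered onto map/reduce phases.
from collections import defaultdict

rows = [  # employees(dept, salary)
    ("eng", 100), ("eng", 120), ("sales", 80), ("sales", 90),
]

# SELECT dept, AVG(salary) FROM employees GROUP BY dept
# map phase: project the grouping key and a partial aggregate (sum, count)
partials = defaultdict(lambda: (0, 0))
for dept, salary in rows:
    s, c = partials[dept]
    partials[dept] = (s + salary, c + 1)

# reduce phase: finalize the aggregate per key
result = {dept: s / c for dept, (s, c) in partials.items()}
print(result)  # {'eng': 110.0, 'sales': 85.0}
```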
Big Graph Processing
[1] Malewicz G, Austern M H, Bik A J C, et al. Pregel: a system for large-scale graph processing[C]//Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010: 135-146.
[2] Low Y, Bickson D, Gonzalez J, et al. Distributed GraphLab: a framework for machine learning and data mining in the cloud[J]. Proceedings of the VLDB Endowment, 2012, 5(8): 716-727.
[3] Low Y, Gonzalez J E, Kyrola A, et al. GraphLab: a new framework for parallel machine learning[J]. arXiv preprint arXiv:1408.2041, 2014.
[4] Gonzalez J E, Low Y, Gu H, et al. PowerGraph: distributed graph-parallel computation on natural graphs[C]//10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012: 17-30.
[5] Kyrola A, Blelloch G, Guestrin C. GraphChi: large-scale graph computation on just a PC[C]//10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12). 2012: 31-46.
[6] Gonzalez J E, Xin R S, Dave A, et al. Graphx: Graph processing in a distributed dataflow framework[C]//11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 599-613.
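Pregel [1] popularized the vertex-centric, superstep-based ("think like a vertex") model that the other systems here refine. Below is a toy, single-process rendering of that model computing PageRank: in each superstep every vertex combines its incoming messages, updates its state, and sends messages along its out-edges. The graph and constants are made up for the example.

```python
# A sequential simulation of Pregel-style supersteps (PageRank, damping 0.85).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # adjacency lists
rank = {v: 1.0 / len(graph) for v in graph}

for superstep in range(20):
    messages = {v: [] for v in graph}
    # each vertex sends rank / out_degree along its out-edges
    for v, neighbors in graph.items():
        for n in neighbors:
            messages[n].append(rank[v] / len(neighbors))
    # each vertex updates its state from the messages it received
    rank = {v: 0.15 / len(graph) + 0.85 * sum(msgs)
            for v, msgs in messages.items()}

print({v: round(r, 3) for v, r in rank.items()})
```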
Distributed Machine Learning
[1] Dean J, Corrado G, Monga R, et al. Large scale distributed deep networks[C]//Advances in neural information processing systems. 2012: 1223-1231.
[2] Li M, Andersen D G, Park J W, et al. Scaling distributed machine learning with the parameter server[C]//11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014: 583-598.
[3] Dai W, Wei J, Kim J K, et al. Petuum: a framework for iterative-convergent distributed ML[J]. arXiv preprint arXiv:1312.7651, 2013.
[4] Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems[J]. arXiv preprint arXiv:1603.04467, 2016.
[5] Chen T, Li M, Li Y, et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems[J]. arXiv preprint arXiv:1512.01274, 2015.
[6] Meng X, Bradley J, Yavuz B, et al. MLlib: machine learning in Apache Spark[J]. Journal of Machine Learning Research, 2016, 17(34): 1-7.
[7] Cui H, Cipar J, Ho Q, et al. Exploiting bounded staleness to speed up big data analytics[C]//2014 USENIX Annual Technical Conference (USENIX ATC 14). 2014: 37-48.
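The parameter-server architecture of [2] (and the bounded-staleness refinement in [7]) boils down to workers pulling shared parameters, computing gradients on local data shards, and pushing updates back to a central server. The sequential sketch below simulates that loop for a one-parameter linear model; the data, learning rate, and loop structure are toy assumptions, and real systems run workers asynchronously in parallel, often with bounded-stale reads.

```python
# A sequential simulation of the pull/compute/push parameter-server loop.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # (x, y), y = 2x
w = 0.0          # the "server" state: a single model parameter
lr = 0.02

for step in range(100):
    grads = []
    for shard in shards:            # each worker...
        w_local = w                 # ...pulls the current parameters
        # gradient of squared error (w*x - y)^2 over the local shard
        g = sum(2 * (w_local * x - y) * x for x, y in shard) / len(shard)
        grads.append(g)             # ...and pushes its gradient
    w -= lr * sum(grads) / len(grads)   # server aggregates and updates

print(round(w, 3))  # converges toward 2.0
```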
Stream Processing
[1] Zaharia M, Das T, Li H, et al. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters[C]//4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). 2012.
[2] Akidau T, Balikov A, Bekiroğlu K, et al. MillWheel: fault-tolerant stream processing at internet scale[J]. Proceedings of the VLDB Endowment, 2013, 6(11): 1033-1044.
[3] Qian Z, He Y, Su C, et al. Timestream: Reliable stream computation in the cloud[C]//Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013: 1-14.
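The discretized-stream model of [1] treats an unbounded stream as a sequence of small, deterministic batch jobs, recovering batch-style fault tolerance for streaming. A toy sketch of the idea, with invented event data and batch size:

```python
# Chop an event stream into micro-batches and fold each batch's result
# into running per-key totals with an ordinary batch operator.
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    it = iter(events)
    while batch := list(islice(it, batch_size)):
        yield batch

events = ["click", "view", "click", "view", "view", "click", "click"]
totals = Counter()
for batch in micro_batches(events, 3):
    totals.update(Counter(batch))   # batch operator: count per key
    print(dict(totals))             # running state after each micro-batch
```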