
spark rdd programming

BigDATA/spark 2018.12.30 20:08
by ace-T


https://spark.apache.org/docs/latest/rdd-programming-guide.html

spark rdd

Overview

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. 

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
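As a minimal sketch of the two ways to create an RDD and of persisting one, assuming a SparkContext is already available as sc (as it is in spark-shell) and using "data.txt" as a placeholder path:

  // Create an RDD from an existing Scala collection in the driver program
  val distData = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Create an RDD from a file in a Hadoop-supported file system ("data.txt" is a placeholder)
  val lines = sc.textFile("data.txt")

  // Ask Spark to keep this RDD in memory so it can be reused across parallel operations
  lines.persist()

  // Run parallel operations on it
  val totalLength = lines.map(_.length).reduce(_ + _)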

A second abstraction in Spark is shared variables that can be used in parallel operations. 
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
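A rough sketch of both kinds of shared variables, again assuming sc and a placeholder input file:

  // Broadcast variable: a read-only value cached on every node
  val factor = sc.broadcast(10)

  // Accumulator: tasks can only add to it; the driver reads the result
  val blankLines = sc.longAccumulator("blank lines")

  val scaled = sc.textFile("data.txt").map { line =>
    if (line.isEmpty) blankLines.add(1)   // each task adds to the accumulator
    line.length * factor.value            // each task reads the broadcast value
  }
  scaled.count()                          // an action triggers the tasks
  println(blankLines.value)               // the driver reads the accumulated sum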

This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
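Inside bin/spark-shell the SparkContext is already created for you as the variable sc, so you can experiment right away; a tiny example:

  // sc is pre-defined in the Scala shell (bin/spark-shell)
  val evens = sc.parallelize(1 to 10).filter(_ % 2 == 0)
  evens.collect()   // Array(2, 4, 6, 8, 10)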


flatMap[U](f: (T) ⇒ TraversableOnce[U])(implicit arg0: ClassTag[U]): RDD[U]

    Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]

    Return a new RDD by applying a function to all elements of this RDD.
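A small sketch contrasting the two, assuming a SparkContext sc: map keeps one output element per input element, while flatMap flattens the per-element results into a single RDD.

  val lines = sc.parallelize(Seq("hello world", "spark rdd"))

  // map: one output per input element -> RDD[Array[String]]
  val wordArrays = lines.map(_.split(" "))

  // flatMap: apply the function, then flatten -> RDD[String]
  val words = lines.flatMap(_.split(" "))

  words.collect()   // Array(hello, world, spark, rdd)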








Getting friendly with Spark!

BigDATA/spark 2016.03.22 18:35
by ace-T


It will probably end up looking something like the following.

  • The SparkContext class holds the connection to the Spark cluster and is the entry point for working with Spark. Once you create an instance, you can do all sorts of things with it (see the sketch after this list).

  • Spark RDD: using RDDs (resilient distributed datasets) makes parallel processing of data easy.
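A minimal sketch of that shape, with a hypothetical app name and a local master URL standing in for a real cluster:

  import org.apache.spark.{SparkConf, SparkContext}

  // SparkContext is the entry point: it holds the connection to the cluster
  val conf = new SparkConf().setAppName("acet-example").setMaster("local[*]")  // placeholder values
  val sc   = new SparkContext(conf)

  // RDDs make parallel processing of data straightforward
  val squares = sc.parallelize(1 to 100).map(n => n * n)
  println(squares.sum())

  sc.stop()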

Spark reference sites!!






