
[Spark-Study] Day-2 Running the Examples

by 태하팍 2021. 6. 24.

p31
The Spark UI

Following last session's Spark setup,
cd to /usr/local/Cellar/apache-spark/3.1.1/bin
and launch spark-shell:
spark-shell
It greets you like below! haha

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://<your-ip>:4040
Spark context available as 'sc' (master = local[*], app id = local-1624457045234).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.10)
Type in expressions to have them evaluated.
Type :help for more information.
scala>

The banner above says a Web UI is available, and that is exactly the Spark UI!

It lets you inspect things like the running application's jobs, stages, storage, environment, and executors.
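If you want to confirm the UI address from inside the shell, spark-shell has already created sc and spark for you, so a quick check like the sketch below works (the exact URL and app id will of course differ on each machine):

// run inside spark-shell; sc and spark are pre-created
sc.uiWebUrl        // Option[String], e.g. Some(http://localhost:4040), the Spark UI address
sc.applicationId   // the app id shown in the welcome banner
spark.version      // the Spark version, 3.1.1 here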

If you want to use Python, just run pyspark instead; it also serves the Spark UI.

Example reference: https://github.com/databricks/LearningSparkV2

From the repo linked above, copy the contents of README.md and save it as README.md in your working directory.

Learning Spark 2nd Edition
Welcome to the GitHub repo for Learning Spark 2nd Edition.

Chapters 2, 3, 6, and 7 contain stand-alone Spark applications. You can build all the JAR files for each chapter by running the Python script: python build_jars.py. Or you can cd to the chapter directory and build jars as specified in each README. Also, include $SPARK_HOME/bin in $PATH so that you don't have to prefix SPARK_HOME/bin/spark-submit for these standalone applications.

For all the other chapters, we have provided notebooks in the notebooks folder. We have also included notebook equivalents for a few of the stand-alone Spark applications in the aforementioned chapters.

Have Fun, Cheers!

Then run it like this!

run-example JavaWordCount README.md
The word counts come out like below.

Have: 1
build: 2
stand-alone: 2
chapters,: 1
of: 1
chapters.: 1
prefix: 1
by: 1
chapter: 2
$PATH: 1
aforementioned: 1
JAR: 1
6,: 1
2,: 1
so: 1
Or: 1
and: 2
included: 1
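For reference, run-example is just a wrapper that spark-submits the bundled org.apache.spark.examples.JavaWordCount class from the examples jar. The same word count can be sketched in a few lines of Scala inside spark-shell (assuming README.md is in the current directory):

// minimal word count over README.md, roughly what JavaWordCount does
val counts = sc.textFile("README.md")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.collect().foreach { case (word, n) => println(s"$word: $n") }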

Running the Python example
mnmcount.py

from __future__ import print_function

import sys

from pyspark.sql import SparkSession

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: mnmcount <file>", file=sys.stderr)
        sys.exit(-1)

    spark = (SparkSession
        .builder
        .appName("PythonMnMCount")
        .getOrCreate())
    # get the M&M data set file name
    mnm_file = sys.argv[1]
    # read the file into a Spark DataFrame
    mnm_df = (spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(mnm_file))
    mnm_df.show(n=5, truncate=False)

    # aggregate count of all colors and groupBy state and color
    # orderBy descending order
    count_mnm_df = (mnm_df.select("State", "Color", "Count")
                    .groupBy("State", "Color")
                    .sum("Count")
                    .orderBy("sum(Count)", ascending=False))

    # show all the resulting aggregation for all the dates and colors
    count_mnm_df.show(n=60, truncate=False)
    print("Total Rows = %d" % (count_mnm_df.count()))

    # find the aggregate count for California by filtering
    ca_count_mnm_df = (mnm_df.select("*")
                       .where(mnm_df.State == 'CA')
                       .groupBy("State", "Color")
                       .sum("Count")
                       .orderBy("sum(Count)", ascending=False))

    # show the resulting aggregation for California
    ca_count_mnm_df.show(n=10, truncate=False)
    spark.stop()

mnm_dataset.csv 
https://raw.githubusercontent.com/databricks/LearningSparkV2/master/chapter2/py/src/data/mnm_dataset.csv

Run the command:
spark-submit mnmcount.py mnm_dataset.csv

21/06/24 10:49:35 INFO TaskSetManager: Finished task 199.0 in stage 9.0 (TID 606) in 12 ms on 172.27.114.231 (executor driver) (200/200)
21/06/24 10:49:35 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
21/06/24 10:49:35 INFO DAGScheduler: ResultStage 9 (showString at NativeMethodAccessorImpl.java:0) finished in 0.561 s
21/06/24 10:49:35 INFO DAGScheduler: Job 5 is finished. Cancelling potential speculative or zombie tasks for this job
21/06/24 10:49:35 INFO TaskSchedulerImpl: Killing all running tasks in stage 9: Stage finished
21/06/24 10:49:35 INFO DAGScheduler: Job 5 finished: showString at NativeMethodAccessorImpl.java:0, took 0.772106 s
+-----+------+----------+
|State|Color |sum(Count)|
+-----+------+----------+
|CA   |Yellow|100956    |
|CA   |Brown |95762     |
|CA   |Green |93505     |
|CA   |Red   |91527     |
|CA   |Orange|90311     |
|CA   |Blue  |89123     |
+-----+------+----------+

This time, let's try it in Scala!
But...!
Unlike Python, it has to be built first!
brew install sbt

/usr/local/Cellar/sbt/1.5.2

build.sbt

//name of the package
name := "main/scala/chapter2"
//version of our package
version := "1.0"
//version of Scala
scalaVersion := "2.12.10"
// spark library dependencies 
// change this to 3.0.0 when released
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0-preview2",
  "org.apache.spark" %% "spark-sql"  % "3.0.0-preview2"
)
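build.sbt by itself is not enough; sbt also needs the application source under src/main/scala/chapter2/. The real MnMcount.scala lives in the book repo linked above; the sketch below is only a rough approximation modeled on the Python version from earlier (note the package declaration, which is what the --class option has to match):

// src/main/scala/chapter2/MnMcount.scala (rough sketch, not the repo's exact code)
package main.scala.chapter2

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc, sum}

object MnMcount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      println("Usage: MnMcount <mnm_file_dataset>")
      sys.exit(1)
    }

    val spark = SparkSession.builder
      .appName("MnMCount")
      .getOrCreate()

    // read the CSV into a DataFrame, inferring the schema from the header
    val mnmDF = spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    // group by state and color, sum the counts, order descending
    val countMnMDF = mnmDF
      .select("State", "Color", "Count")
      .groupBy("State", "Color")
      .agg(sum("Count").alias("Total"))
      .orderBy(desc("Total"))

    countMnMDF.show(60)
    println(s"Total Rows = ${countMnMDF.count()}")

    // same aggregation, restricted to California
    val caCountMnMDF = mnmDF
      .select("State", "Color", "Count")
      .where(col("State") === "CA")
      .groupBy("State", "Color")
      .agg(sum("Count").alias("Total"))
      .orderBy(desc("Total"))

    caCountMnMDF.show(10)
    spark.stop()
  }
}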

Building with sbt package produces the jar at:
/usr/local/Cellar/apache-spark/3.1.1/bin/target/scala-2.12/main-scala-chapter2_2.12-1.0.jar

spark-submit --class main.scala.chapter2.MnMcount \
  /usr/local/Cellar/apache-spark/3.1.1/bin/target/scala-2.12/main-scala-chapter2_2.12-1.0.jar mnm_dataset.csv

An error occurred - looks like something is off on the package side!
Error: Failed to load class main.scala.chapter2.MnMcount.
This message generally means spark-submit cannot find that class in the jar, e.g. the package declared in the source does not match main.scala.chapter2, or the class was never compiled into the jar.

Next study session - next Thursday!
Run the Scala version from p. 39
Set it up to work in IntelliJ

 
