IT Новости
0 просмотров
Рейтинг статьи
1 звезда2 звезды3 звезды4 звезды5 звезд

Apache spark java

Apache Spark Java Tutorial [Code Walkthrough With Examples]

By Matthew Rathbone on December 28 2015 Share Tweet Post

This article was co-authored by Elena Akhmatova

Hire me to supercharge your Hadoop and Spark projects

I help businesses improve their return on investment from big data projects. I do everything from software architecture to staff training. Learn More

This article is part of my guide to map reduce frameworks in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.

Spark is itself a general-purpose framework for cluster computing. It can be run, and is often run, on the Hadoop YARN. Thus it is often associated with Hadoop and so I have included it in my guide to map reduce frameworks as well. Spark is designed to be fast for interactive queries and iterative algorithms that Hadoop MapReduce can be slow with.

The Problem

Let me quickly restate the problem from my original article.

I have two datasets:

  1. User information (id, email, language, location)
  2. Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)

Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.

Previously I have implemented this solution in java, with hive and with pig. The java solution was

500 lines of code, hive and pig were like

The Java Spark Solution

This article is a follow up for my earlier article on Spark that shows a Scala Spark solution to the problem. Even though Scala is the native and more popular Spark language, many enterprise-level projects are written in Java and so it is supported by the Spark stack with it’s own API.

This article partially repeats what was written in my Scala overview, although I emphasize the differences between Scala and Java implementations of logically same code.

As it was mentioned before, Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. It started in 2009 as a research project in the UC Berkeley RAD Labs. Its aim was to compensate for some Hadoop shortcomings. Spark brings us as interactive queries, better performance for iterative algorithms, as well as support for in-memory storage and efficient fault recovery.

It contains a number of different components, such as Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. It runs over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. It is used for a diversity of tasks from data exploration through to streaming machine learning algorithms. As a technology stack it has really caught fire in recent years.

Demonstration Data

The tables that will be used for demonstration are called users and transactions .

For this task we have used Spark on a Hadoop YARN cluster. Our code will read and write data from/to HDFS. Before starting work with the code we have to copy the input data to HDFS.

All code and data used in this post can be found in my hadoop examples GitHub repository.

This code does exactly the same thing that the corresponding code of the Scala solution does. The sequence of actions is exactly the same, as well as the input and output data on each step.

  1. read / transform transactions data
  2. read / transform users data
  3. left outer join of transactions on users
  4. get rid of user_id key from the result of the previous step by applying values()
  5. find distinct() values
  6. countByKey()
  7. transform result to an RDD
  8. save result to Hadoop

If this is confusing (it might be), read the Scala version first, it is way more compact.

As with Scala it is required to define a SparkContext first. Again, it is enough to set an app name and a location of a master node.

The resilient distributed dataset (RDD), Spark’s core abstraction for working with data, is named RDD as in Scala. As with any other Spark data-processing algorithm all our work is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result.

Spark’s Key/value RDDs are of JavaPairRDD type. Key/value RDDs are commonly used to perform aggregations, such as groupByKey(), and are useful for joins, such as leftOuterJoin(). Explicitly defining key and value elements allows spark to abstract away a lot of these complex operations (like joins), so they are very useful.

Читать еще:  Как восстановить языковую панель на панели задач

Here is how the input and intermediate data is transformed into a Key/value RDD in Java:

and a stand-alone function

Reading input data is done in exactly same manner as in Scala. Note that the explicit KEY_VALUE_PAIRER transformation is not needed in Scala, but in Java there seems to be no way to skip it.

Spark has added an Optional class for Java (similar to Scala’s Option ) to box values and avoid nulls. There is a special function isPresent() in the Optional class that allows to check whether the value is present, that is it is not null. Calling get() returns the boxed value.

The main code is again more or less a chain of pre-defined functions.

The processData() function from the Scala version was broken into three new functions joinData() , modifyData() and countData() . We simply did this to make the code more clear – Java is verbose. All the data transformation steps could have been put into one function that would be similar to processData() from the Scala solution.

The leftOuterJoin() function joins two RDDs on key.

The values() functions allows us to omit the key of the Key Value RDD as it is not needed in the operations that follow the join. The distinct() function selects distict Tuples.

And finally countByKey() counts the number of countries where the product was sold.

Running the resulting jar


The idea and the set up are exactly the same for Java and Scala.

The test is more or less self-explanatory. As usually we check the content of the output to validate it’s operation.


Java is a lot more verbose than Scala, although this is not a Spark-specific criticism.

The Scala and Java Spark APIs have a very similar set of functions. Looking beyond the heaviness of the Java code reveals calling methods in the same order and following the same logical thinking, albeit with more code.

All things considered, if I were using Spark, I’d use Scala. The functional aspects of Spark are designed to feel native to Scala developers, which means it feels a little alien when working in Java (eg Optional ). That said, if Java is the only option (or you really don’t want to learn Scala), Spark certainly presents a capable API to work with.

Spark Resources

The Spark official site and Spark GitHub have resources related to Spark.

Further Reading

Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia.


In this tutorial, we will be demonstrating how to develop Java applications in Apache Spark using Eclipse IDE and Apache Maven. Since our main focus is on Apache Spark related application development, we will be assuming that you are already accustomed to these tools. On the same lines, it is assumed that you already have following tools installed on your machine —

  • JDK 8 since we will be using Lambda expressions
  • Eclipse IDE with Apache Maven Plugin

Setting Up Maven Java Project

Creating Maven Project:

We will be starting with creating a Maven project in Eclipse IDE by following below steps —

    Open New Project wizard in Eclipse IDE as shown below:

On next screen, select option Create a simple project to create quick project as below:

Enter Group Id and Artifiact Id on next screen and finally click on Finish to create the project as below:

At this point, you will start seeing your new project (in my case, it is spark-basics) in Project Explorer.

Adding Spark Dependency:

Next step is to add Apache Spark libraries to our newly created project. In order to do so, we will be adding following maven dependency to our project’s pom.xml file.

For completion purpose, here is what my pom.xml looks like after adding above dependency —

After adding this dependency, Eclipse will automatically start downloading the libraries from Maven repository. Please be patient as it may take a while for Eclipse to download the jars and build your project.

Developing Word Count Example

Now, we will create Java program for counting words in input list of sentences as below —

Main highlights of the program are that we create spark configuration, Java spark context and then use Java spark context to count the words in input list of sentences.

Читать еще:  Создать контрольную точку восстановления системы windows 7

Running Word Count Example

Finally, we will be executing our word count program. We can run our program in following two ways —

  1. Local mode: Since we are setting master as «local» in SparkConf object in our program, we can simply run this application from Eclipse like any other Java application. In other words, we can simply perform these operations on our program: Right Click -> Run As -> Java Application.
  2. Cluster mode: In this mode, we will need to remove setMaster(«local») in SparkConf object as master information will be provided at the time of submitting the code to cluster. We will first need to create JAR file (spark-basics.jar) of our code and use below command to submit it to spark running on YARN cluster —

Either way, here is the output that this program will generate —

Thank you for reading through the tutorial. In case of any feedback/questions/concerns, you can communicate same to us through your comments and we shall get back to you as soon as possible.

Apache Spark with Java — Hands On!

    7 часов видео по запросу 5 статей 6 ресурсов для скачивания Полный пожизненный доступ Доступ через мобильные устройства и телевизор
    Сертификат об окончании
  • Utilize the most powerful big data batch and stream processing engine to solve big data problems
  • Master the new Spark Java Datasets API to slice and dice big data in an efficient manner
  • Build, deploy and run Spark jobs on the cloud and bench mark performance on various hardware configurations
  • Optimize spark clusters to work on big data efficiently and understand performance tuning
  • Transform structured and semi-structured data using Spark SQL, Dataframes and Datasets
  • Implement popular Machine Learning algorithms in Spark such as Linear Regression, Logistic Regression, and K-Means Clustering
  • Some basic Java programming experience is required. A crash course on Java 8 lambdas is included
  • You will need a personal computer with an internet connection.
  • The software needed for this course is completely freely and I’ll walk you through the steps on how to get it installed on your computer

Recently Updated!

Apache Spark is the next generation batch and stream processing engine. It’s been proven to be almost 100 times faster than Hadoop and much much easier to develop distributed big data applications with. It’s demand has sky rocketed in recent years and having this technology on your resume is truly a game changer. Over 3000 companies are using Spark in production right now and the list is growing very quickly! Some of the big names include: Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Amazon as well as most of the big world banks and financial institutions!

In this course you’ll learn everything you need to know about using Apache Spark in your organization while using their latest and greatest Java Datasets API. Below are some of the things you’ll learn:

How to develop Spark Java Applications using Spark SQL Dataframes

Understand how the Spark Standalone cluster works behind the scenes

How to use various transformations to slice and dice your data in Spark Java

How to marshall/unmarshall Java domain objects (pojos) while working with Spark Datasets

Master joins, filters, aggregations and ingest data of various sizes and file formats (txt, csv, Json etc.)

Analyze over 18 million real-world comments on Reddit to find the most trending words used

Develop programs using Spark Streaming for streaming stock market index files

Stream network sockets and messages queued on a Kafka cluster

Learn how to develop the most popular machine learning algorithms using Spark MLlib

Covers the most popular algorithms: Linear Regression, Logistic Regression and K-Means Clustering

You’ll be developing over 15 practical Spark Java applications crunching through real world data and slicing and dicing it in various ways using several data transformation techniques. This course is especially important for people who would like to be hired as a java developer or data engineer because Spark is a hugely sought after skill. We’ll even go over how to setup a live cluster and configure Spark Jobs to run on the cloud. You’ll also learn about the practical implications of performance tuning and scaling out a cluster to work with big data so you’ll definitely be learning a ton in this course. This course has a 30 day money back guarantee. You will have access to all of the code used in this course.

Зачем разработчикам нужен Apache Spark

Зачем используется Apache Spark? В статье рассмотрим, какую проблему обработки больших объёмов данных решает этот фреймворк.

Цифровая вселенная

Данные повсюду вокруг нас. В IDC оценили размер «цифровой вселенной» в 4,4 зеттабайта (1 триллион гигабайт) в 2013 году. По наблюдениям, цифровая вселенная ежегодно увеличивается на 40%. К 2020 году IDC ожидает, что размер достигнет 44 зеттабайтов – хватит по одному биту данных на каждую звезду физической вселенной.

У нас много данных, и мы не избавляемся от них. Поэтому нужен способ хранения увеличивающихся объёмов данных в масштабе, с защитой от потери в результате аппаратного сбоя. Наконец, нужно средство обработки этой информации с быстрым контуром обратной связи. Спасибо космосу за Hadoop и Spark.

Чтобы продемонстрировать полезность Spark, начнём с примера. 500 ГБ выборочных данных о погоде включают:

Страна | Город | Дата | Температура

Для расчёта максимальной температуры по стране для этих данных начинаем с нативной программы на Java, поскольку Java – наш второй любимый язык программирования:

Решение на Java

Однако при размере 500 ГБ даже такая простая задача займёт 5 часов с использованием этого нативного Java-метода.

«Java отстой, напишу это на Ruby и круто, если быстро, Ruby – мой любимый»

Решение на Ruby

Однако любимый Ruby больше не подходит для этой задачи. Ввод-вывод – не сильная сторона Ruby, поэтому Ruby потребуется ещё больше времени на поиск максимальной температуры, чем Java.

Задачу поиска максимальной температуры по городам лучше решить с помощью Apache MapReduce (держим пари, вы подумали, что скажем Spark). Конкретно здесь MapReduce блистает. С преобразованием городов в ключи, а температур – в значения, получаем результаты за гораздо меньшее время — 15 минут, по сравнению с предыдущими 5+ часами в Java.

MaxTemperatureMapper MaxTemperatureReducer

MapReduce – жизнеспособное решение поставленной задачи. Этот подход будет работать намного быстрее по сравнению с нативным решением Java, потому что MapReduce распределяет задачи между рабочими узлами в кластере. Строки из нашего файла передаются на каждый узел кластера параллельно, а в нативный Java-метод работает всего лишь поочередно.


Вычислить максимальную температуру для каждой страны – по сути новая задача, но далеко не новаторский анализ. Данные реального мира несут в себе более громоздкие схемы и сложный анализ. Это подталкивает к использованию специальных инструментов.

Что, если вместо максимальной температуры, нас попросят найти максимальную температуру по стране и городу, а потом разбить по дням? Что, если мы перепутали, и нужно найти страну с самыми высокими средними температурами? Или, если хотим найти среду обитания, где температура никогда не ниже 15°C и не выше 20°C (Антананариву, Мадагаскар, кажется привлекательной).

MapReduce справляется с пакетной обработкой данных. Однако отстаёт, когда дело доходит до повторного анализа и небольших циклов обратной связи. Единственный способ повторно использовать данные между вычислениями – записать их во внешнюю систему хранения (например, HDFS). MapReduce записывает содержимое всех Map для каждого задания – до Reduce-шага. Это означает, что каждое MapReduce-задание выполнит одну задачу, которая определена в его начале.

Для выполнения упомянутого выше анализа потребовалось бы три отдельных задания MapReduce:

  1. MaxTemperatureMapper , MaxTemperatureReducer , MaxTemperatureRunner
  2. MaxTemperatureByCityMapper , MaxTemperatureByCityReducer , MaxTemperatureByCityRunner
  3. MaxTemperatureByCityByDayMapper , MaxTemperatureByCityByDayReducer , MaxTemperatureByCityByDayRunner .

Очевидно, что это легко выйдет из-под контроля.

Обмен данными в MapReduce происходит медленно из-за самой природы распределённых файловых систем: репликации, сериализации и, главное, дискового ввода-вывода. Приложения MapReduce тратят до 90% времени на чтение и запись с диска.

Столкнувшись с такой проблемой, исследователи решили разработать специальный фреймворк, который мог бы выполнять то, что MapReduce не может: вычисления в оперативной памяти на кластере подключенных машин.

Решение: Apache Spark

Spark решает эту проблему. Spark обрабатывает несколько запросов быстро и с минимальными издержками.

Три указанных выше маппера реализуются в одном и том же задании Spark. И выдают несколько результатов, если нужно. Закомментированные строки легко использовать для установки правильного ключа в зависимости от требований конкретного задания.

Spark-реализация MaxTemperaMapper с использованием RDD

Spark ещё и выполняет обработку в 10 раз быстрее, чем MapReduce, при сопоставимых задачах, поскольку Spark работает только в оперативной памяти. Поэтому не используются операции записи и чтения с диска, как правило, медленные и дорогостоящие.

Apache Spark – удивительно мощный инструмент для анализа и преобразования данных. Если эта публикация пробудила в вас интерес, следите за обновлениями. Скоро будем постепенно углубляться в тонкости Apache Spark Framework.

Ссылка на основную публикацию