Dataflow is a distributed stream-based batch processing engine for big data applications.
You can create tasks on your client that will be executed on Dataflow servers holding the datasets. A task is compiled into an
execution graph that is executed on the partitions.
In this example we will post a simple Map-Reduce task to a cluster of Dataflow nodes.
To run the examples, you need to clone ActiveJ from GitHub:
$ git clone https://github.com/activej/activej
Then import it as a Maven project and check out the master branch. Before running the examples, build the project.
These examples are located at activej -> examples -> core -> dataflow.
1. Create a Dataflow Server Launcher
First, we need to launch two Dataflow servers. Each of them will have its own “items” dataset containing 10K random
words; the words in the two datasets may overlap.
To create a Dataflow server launcher, we’ll use the pre-defined DataflowServerLauncher
class, which extends the Launcher class.
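The sketch below shows the general shape of such a launcher. It assumes that DataflowServerLauncher lives in the launchers-dataflow module; the class name DataflowServerLauncherExample and the List<String> binding are placeholders for illustration. The actual example in examples/core/dataflow loads the words from the file passed as a program argument and registers the list as the “items” dataset.

import io.activej.inject.annotation.Provides;
import io.activej.inject.module.AbstractModule;
import io.activej.inject.module.Module;
import io.activej.launchers.dataflow.DataflowServerLauncher;

import java.util.List;

// Hypothetical sketch: the class name and bindings are assumptions, not the exact example code.
public final class DataflowServerLauncherExample extends DataflowServerLauncher {
	@Override
	protected Module getOverrideModule() {
		return new AbstractModule() {
			@Provides
			List<String> items() {
				// In the real example this list holds ~10K random words loaded from the
				// words file passed as the second program argument (e.g. words1.txt)
				// and is registered as the "items" dataset.
				return List.of("apple", "dog", "apple", "cat");
			}
		};
	}

	public static void main(String[] args) throws Exception {
		new DataflowServerLauncherExample().launch(args);
	}
}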
2. Create a Dataflow Client Launcher
Let’s now create a Dataflow client launcher. We’ll make it by analogy with the server launcher: we override
getOverrideModule and provide the same required dependencies.
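A minimal sketch of the client launcher is shown below. The class name DataflowClientLauncherExample is hypothetical, and the contents of getOverrideModule (the Dataflow client, serializers and the partition addresses taken from the program arguments) are only indicated by comments; see the example sources for the actual bindings.

import io.activej.inject.module.AbstractModule;
import io.activej.inject.module.Module;
import io.activej.launcher.Launcher;

// Hypothetical sketch: the class name and module contents are assumptions.
public final class DataflowClientLauncherExample extends Launcher {
	@Override
	protected Module getOverrideModule() {
		return new AbstractModule() {
			// Provide the same required dependencies as in the server launcher
			// (serializers, executors) plus the addresses of the partitions,
			// taken from the program arguments (e.g. "9000 9001").
		};
	}

	@Override
	protected void run() throws Exception {
		// The Map-Reduce task is defined here (see the next step).
	}
}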
Now let’s create a task for our Dataflow partitions. We’ll define it in the Launcher’s overridden run() method.
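The actual example builds the task inside run() with the Dataflow dataset API; since that DSL is only shown in the example sources, here is a plain-Java sketch of the computation each partition performs in steps 1-3 of the list below. The StringCount record is a stand-in for the pair type used in the example.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the ('word', count) pair type used in the example.
record StringCount(String word, int count) {}

class MapSortReduceSketch {
	// Step 1: map every word to a (word, 1) pair.
	static List<StringCount> map(List<String> words) {
		return words.stream().map(w -> new StringCount(w, 1)).toList();
	}

	// Step 2: sort the pairs alphabetically by the word.
	static List<StringCount> sort(List<StringCount> pairs) {
		return pairs.stream().sorted(Comparator.comparing(StringCount::word)).toList();
	}

	// Step 3: merge adjacent pairs with the same word, summing their counts.
	static List<StringCount> reduce(List<StringCount> sortedPairs) {
		List<StringCount> result = new ArrayList<>();
		for (StringCount pair : sortedPairs) {
			int last = result.size() - 1;
			if (last >= 0 && result.get(last).word().equals(pair.word())) {
				result.set(last, new StringCount(pair.word(), result.get(last).count() + pair.count()));
			} else {
				result.add(pair);
			}
		}
		return result;
	}

	public static void main(String[] args) {
		List<String> words = List.of("apple", "dog", "apple");
		// Prints [StringCount[word=apple, count=2], StringCount[word=dog, count=1]].
		System.out.println(reduce(sort(map(words))));
	}
}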
The task does the following:
Maps the strings by creating ('word', 1) pairs
Sorts the pairs in alphabetical order by the String value
Reduces the pairs by merging pairs with the same word. For example, (apple, 1) and (apple, 1) will be reduced into (apple, 2)
Distributes the pairs across the partitions according to a provided rule, so that equal words end up on the same partition. For example, if partition 1 contains (apple, 2), (dog, 2) and
partition 2 contains (apple, 3), (dog, 1), the result of repartitioning might be (apple, 2), (apple, 3) on partition 1 and
(dog, 2), (dog, 1) on partition 2 (a plain-Java sketch of this step follows the list)
Repeats step 3. As a result of these steps, we receive a sorted and reduced dataset with unique items across all the partitions
With the help of a collector, the client pulls the streams from all the nodes and merges them into a single stream
The stream is collected into a list that we can work with. In the example we simply print out the first 100 words
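To make the repartitioning step concrete, here is a plain-Java sketch that distributes the pairs with a simple hash-based rule (an assumption; the actual distribution rule is provided to the task in the example) and then reduces each partition again. It reuses StringCount and MapSortReduceSketch from the previous sketch.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class RepartitionSketch {
	// Step 4: distribute the pairs so that equal words end up on the same partition,
	// then step 5: sort and reduce each partition again.
	static List<List<StringCount>> repartitionAndReduce(List<List<StringCount>> partitions, int partitionCount) {
		// Group all pairs from all partitions by a hash-based rule (an assumption here).
		Map<Integer, List<StringCount>> byPartition = partitions.stream()
				.flatMap(List::stream)
				.collect(Collectors.groupingBy(pair -> Math.floorMod(pair.word().hashCode(), partitionCount)));
		return IntStream.range(0, partitionCount)
				.mapToObj(i -> MapSortReduceSketch.reduce(
						MapSortReduceSketch.sort(byPartition.getOrDefault(i, List.of()))))
				.toList();
	}

	public static void main(String[] args) {
		List<List<StringCount>> partitions = List.of(
				List.of(new StringCount("apple", 2), new StringCount("dog", 2)),
				List.of(new StringCount("apple", 3), new StringCount("dog", 1)));
		// Each word's pairs land on a single partition and are merged into (apple, 5) and (dog, 3).
		System.out.println(repartitionAndReduce(partitions, 2));
	}
}

In the real example the second reduce runs on the Dataflow nodes themselves; the sketch only illustrates why repartitioning by key makes the final dataset contain unique words across all the partitions.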
Finally, we create a main method that launches the client.
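A minimal sketch, added to the hypothetical DataflowClientLauncherExample class from the client launcher sketch above:

	public static void main(String[] args) throws Exception {
		new DataflowClientLauncherExample().launch(args);
	}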
3. Launch the Servers and the Client
First, launch the two Dataflow servers. You need to specify each node’s port and source file as program arguments.
Server Launcher 1 arguments: 9000 words1.txt. Server Launcher 2 arguments: 9001 words2.txt.
Next, launch the Dataflow client to post the created task to the servers. You’ll need to specify the ports of the partitions
as program arguments: 9000 9001.
After everything is launched and the task is executed, you will see the list of the first 100 words collected from the two
Dataflow servers in the console.
All the data flows between the partitions are printed to the console and can be visualized as a graph of the task execution.