The Alternative Larry Moore

Uploaded by marlene-warner on 04-Jan-2016

TRANSCRIPT

Page 1: The Alternative Larry Moore. 5 Nodes and Variant Input File Sizes Hadoop Alternative

The Alternative

Larry Moore

Page 2
Page 3
Page 4
Page 5

5 Nodes and Variant Input File Sizes

[Chart: Hadoop vs. Alternative]

Page 6

25 MB and Variant Quantity of Nodes

[Chart: Hadoop vs. Alternative]

Page 7

400 MB and Variant Quantity of Nodes

[Chart: Hadoop vs. Alternative]

Page 8

Fibonacci(25) for a Fixed-Size File and Variant Nodes

[Chart: Hadoop vs. Alternative]

Page 9

Hadoop reads a line from the input stream, processes it, and repeats, with no overlap between reading and processing. This is good for network traffic when you have thousands of nodes or low network bandwidth, but it degrades performance otherwise.

Alternative: Work is done piece-wise, using many files. Multiple files are made by reading from the socket's input stream and writing to new temporary files. As soon as enough data has been received, or the end of the stream is reached, a new Task is created to process the file.

As tasks are created, they are enqueued and processed in parallel or in serial, depending on the settings and the hardware in use.

Once a piece is finished, it is enqueued to be sent back to the master without waiting for later tasks to finish, but only after all previous tasks are complete.
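The piece-wise flow described above can be sketched in Java. This is a minimal illustration only; the chunk size and the byte-counting task are stand-ins, not the framework's actual code:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class PieceWiseReader {
    // Split the input stream into temporary chunk files; submit each chunk
    // as a task as soon as enough bytes arrive, without waiting for the rest
    // of the stream.
    public static List<Future<Long>> splitAndProcess(InputStream in, int chunkSize,
                                                     ExecutorService pool) throws IOException {
        List<Future<Long>> results = new ArrayList<>();
        byte[] buf = new byte[chunkSize];
        int filled;
        while ((filled = in.readNBytes(buf, 0, chunkSize)) > 0) {
            Path piece = Files.createTempFile("piece", ".tmp");
            Files.write(piece, Arrays.copyOf(buf, filled));
            // Placeholder task: here we just report the piece's size.
            results.add(pool.submit(() -> Files.size(piece)));
        }
        return results;
    }
}
```

Because each task is submitted the moment its piece is complete, processing of early pieces overlaps with receiving later ones, which is the overlap Hadoop's line-by-line loop lacks.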

Worker Nodes

Piece-wise Work

Page 10

Hadoop: ???

Alternative: The workers preset and prestart everything that can be done in advance, so they can act on work quickly. This may waste some RAM, but it's worth the speed.
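The slide gives no implementation details for this "jump start"; one way to pre-pay the startup cost in Java is to prestart a thread pool's core threads before any work arrives (an assumed technique, not necessarily the framework's):

```java
import java.util.concurrent.*;

public class JumpStart {
    // Create all worker threads up front so the first task does not pay
    // thread-startup cost; trades a little RAM for lower latency.
    public static ThreadPoolExecutor prestartedPool(int workers) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        pool.prestartAllCoreThreads();  // threads are alive before any submit()
        return pool;
    }
}
```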

Worker Nodes

Jump Start

Page 11

Hadoop makes the user source fully responsible for utilizing the hardware. It is not safe to assume that the user source can make efficient use of it, however (e.g., the work is too small to justify the cost: continuously creating threads for small amounts of work costs more than it gains). If Hadoop truly does the work in serial, it may be difficult for the user to ever fully utilize the hardware, since startup costs may offset any gains.

Alternative: (Optional) Forced parallelization. If this feature is enabled, the files being received are always split into pieces, and up to N pieces are processed at a time, where N is either an overridden value or the number of VM processors. This ensures the hardware is used thoroughly regardless of the age of the user source.

This hurts only when the user source already runs in parallel, because the extra threads will slow each other down.
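The choice of N described above can be sketched as follows; the override convention (a value of 0 or less meaning "use the VM processor count") is an assumption for illustration:

```java
import java.util.concurrent.*;

public class ForcedParallelism {
    // N is either an explicit override (> 0) or the number of VM processors.
    public static int pieceLimit(int override) {
        return override > 0 ? override : Runtime.getRuntime().availableProcessors();
    }

    // A pool that processes at most pieceLimit(...) file pieces at a time.
    public static ExecutorService piecePool(int override) {
        return Executors.newFixedThreadPool(pieceLimit(override));
    }
}
```

A fixed-size pool enforces the "up to N at a time" bound naturally: extra pieces simply queue until a slot frees up.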

Worker Nodes

Parallel Work

Page 12

Hadoop: ???

Alternative: Resources are reused after “paying” for them the first time.

For example, by using a thread pool, the threads can remain alive and be reused.
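The reuse can be made visible in Java by recording which threads actually run the tasks; with a pool of two threads, at most two distinct threads serve any number of tasks (an illustrative sketch, not the framework's code):

```java
import java.util.*;
import java.util.concurrent.*;

public class Recycler {
    // Run many small tasks on a pool of 2 threads and record the thread
    // names used: the same threads are reused for every task, so their
    // creation cost is paid only once.
    public static Set<String> threadNamesUsed(int tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Set<String> names = ConcurrentHashMap.newKeySet();
        List<Future<?>> fs = new ArrayList<>();
        for (int i = 0; i < tasks; i++)
            fs.add(pool.submit(() -> names.add(Thread.currentThread().getName())));
        for (Future<?> f : fs) f.get();
        pool.shutdown();
        return names;
    }
}
```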

Worker Nodes

Reuse and recycle

Page 13

Alternative: A failed connection attempt results in infinite retries until a connection is established. Likewise, if the port for the server socket is already in use, the node waits until the port becomes available to bind itself.

Before running the workers, there is a check that each server is alive and responding promptly. If a server fails to respond, or responds after an undesirable delay, that worker is excluded from the list of reachable hosts and therefore not used.

Using the reachable-host list, the master again verifies that the servers are alive and responding within an ideal amount of time. Workers can be reused after being started, so each time we must check that they are still usable.
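The retry-until-connected behavior can be sketched like this; the backoff delay and the optional attempt cap are assumptions added for illustration (the slide describes unbounded retries):

```java
import java.io.IOException;
import java.net.Socket;

public class Retry {
    // Retry until a connection is established, sleeping between attempts.
    // maxAttempts <= 0 means retry forever, as the slide describes.
    public static Socket connectWithRetry(String host, int port,
                                          int maxAttempts, long delayMs)
            throws InterruptedException {
        for (int attempt = 1; maxAttempts <= 0 || attempt <= maxAttempts; attempt++) {
            try {
                return new Socket(host, port);
            } catch (IOException e) {
                Thread.sleep(delayMs);  // wait, then try again
            }
        }
        return null;  // only reachable when maxAttempts > 0
    }
}
```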

Worker and Master Nodes

Fault tolerance

Page 14

Alternative: The framework allows the fragments to be tracked, making it possible to discern when and where a problem occurred. In such a case, additional fault tolerance can easily be added to ensure the problem is handled.

When the nodes in use have heterogeneous hardware, it may be necessary to use the framework's ability to make progress checks. At any point, from any class, the user can add code that reports the status, the details of the assigned work, the name of the server, the ports in use, or any combination of these.
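A progress check of this kind might amount to little more than formatting those fields into a status line; the field names and layout below are hypothetical, not the framework's actual report format:

```java
public class ProgressReport {
    // Hypothetical status line a worker might emit during a progress check:
    // current state, assigned fragment, server name, and port in use.
    public static String status(String state, int fragment, String server, int port) {
        return String.format("%s fragment=%d server=%s port=%d",
                             state, fragment, server, port);
    }
}
```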

Worker and Master Nodes

Fault tolerance

Page 15

■Hadoop uses a single port, which makes sending more stagnant than necessary, since a connection to that port blocks all other connections until it is ready.

■Hadoop's master node must continuously send and receive information from the workers. It is not yet clear why, but it may be to keep order as well as to aid its fault tolerance. Even after mapping has finished, very large bursts of activity can be seen although no further transmissions are necessary. These transmissions noticeably degrade network performance.

Alternative: To increase overall transfer speed and decrease network latency, each worker is assigned its own port on the master with which to respond. This allows parallel transfers from multiple nodes at once without blocking. Flow control can be done through the framework.
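The one-port-per-worker layout could be set up as below; binding to port 0 (any free port) is an assumption for the sketch, since the slides do not say how ports are chosen:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.*;

public class PerWorkerPorts {
    // Give each worker its own server socket on the master, so replies from
    // different workers never queue behind one shared port.
    public static Map<String, ServerSocket> bindPerWorker(List<String> workers)
            throws IOException {
        Map<String, ServerSocket> ports = new HashMap<>();
        for (String w : workers)
            ports.put(w, new ServerSocket(0));  // 0 = bind any free port
        return ports;
    }
}
```

The master would then tell each worker which of these ports to respond on, letting all workers transmit simultaneously.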

Worker and Master Nodes

Network

Page 16

Alternative: Everything that is customizable is easily accessible, and more settings are available: everything from the amount of RAM to use to the number of cores to use can be controlled from the properties file.

This implementation stores its settings in property files for increased speed, whereas Hadoop uses XML-formatted files, which should be slightly slower to parse. This is a very minor speed impact.

Worker and Master Nodes

Settings

Page 17

Alternative: All classes run as service threads using a thread pool. Each class has its own task, and all of them are designed to work together asynchronously.

Both master and worker nodes can be configured, automatically or manually, to make efficient use of the hardware, which is not available in Hadoop.

By compiling Java to native machine code via GCJ, some methods can receive a large speed increase. If the JVM neglects to use, or inadequately uses, certain CPU instructions (possibly newer ones), it may be better to use GCJ. This option should be considered only on closed networks, if and when security is an issue.

Worker and Master Nodes

Hardware usage

Page 18

Alternative: To help keep it on par with Hadoop, a scheduler can be used in any part of the code, for purposes such as flow control.
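In Java, such a scheduler is readily available as `ScheduledExecutorService`; using it to release work at a fixed rate is one plausible flow-control use, sketched here as an assumption rather than the framework's actual mechanism:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class FlowControl {
    // Fire a periodic action at a fixed rate: a simple flow-control sketch
    // (e.g., the action could release one send permit per tick).
    public static int ticksIn(long periodMs, long waitMs) throws InterruptedException {
        ScheduledExecutorService sched = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger ticks = new AtomicInteger();
        sched.scheduleAtFixedRate(ticks::incrementAndGet, 0, periodMs,
                                  TimeUnit.MILLISECONDS);
        Thread.sleep(waitMs);   // let the scheduler run for a while
        sched.shutdownNow();
        return ticks.get();
    }
}
```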

Master Node

Other features