Application of Hadoop counters and data cleaning

Data cleaning (ETL)

Before running the core business MapReduce program, it is often necessary to clean the data first to remove records that do not meet the user's requirements. The cleanup process usually only requires running a Mapper program, with no Reducer.

1. Requirement

Remove log lines whose number of fields is less than or equal to 11.

(1) Input data

web.log

(2) Expected output data

Each remaining line has more than 11 fields.

2. Requirement Analysis

The input data needs to be filtered and cleaned according to the rules in the Map stage.
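The filtering rule itself is simple: split each line on spaces and keep it only if it has more than 11 fields. As a minimal, standalone sketch of that rule (the class and method names here are illustrative, not part of the article's code):

```java
public class LogFilter {

    // A log line is kept only when it has more than this many fields
    private static final int MIN_FIELDS = 11;

    // Returns true when the line has more than MIN_FIELDS space-separated fields
    public static boolean isLegal(String line) {
        String[] fields = line.split(" ");
        return fields.length > MIN_FIELDS;
    }

    public static void main(String[] args) {
        System.out.println(isLegal("a b c"));                                   // too few fields
        System.out.println(isLegal("f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12")); // 12 fields, kept
    }
}
```

This is exactly the check the Mapper below performs in its `parseLog` method, with the addition of counters to record how many lines passed or failed.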

3. Implementation Code

(1) Write the LogMapper class

package com.atguigu.mapreduce.weblog;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  Text k = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

    // 1 Get one line of data
    String line = value.toString();

    // 2 Parse the log
    boolean result = parseLog(line, context);

    // 3 If the log is illegal, exit
    if (!result) {
      return;
    }

    // 4 Set the key
    k.set(line);

    // 5 Write out the data
    context.write(k, NullWritable.get());
  }

  // Parse the log
  private boolean parseLog(String line, Context context) {

    // 1 Split the line into fields
    String[] fields = line.split(" ");

    // 2 Logs with more than 11 fields are legal
    if (fields.length > 11) {
      // Custom counter: count legal records
      context.getCounter("map", "true").increment(1);
      return true;
    } else {
      // Custom counter: count illegal records
      context.getCounter("map", "false").increment(1);
      return false;
    }
  }
}

(2) Write the LogDriver class

package com.atguigu.mapreduce.weblog;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {

  public static void main(String[] args) throws Exception {

    // The input and output paths need to be set according to the actual paths on your machine
    args = new String[] { "e:/input/inputlog", "e:/output1" };

    // 1 Get job information
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);

    // 2 Load the jar package
    job.setJarByClass(LogDriver.class);

    // 3 Associate the Mapper
    job.setMapperClass(LogMapper.class);

    // 4 Set the final output types
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // Set the number of reduce tasks to 0 (Mapper-only job)
    job.setNumReduceTasks(0);

    // 5 Set the input and output paths
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // 6 Submit the job
    job.waitForCompletion(true);
  }
}
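When the job finishes, the custom counters incremented in `parseLog` are printed to the console along with Hadoop's built-in counters. They can also be read programmatically from the `Job` object after `waitForCompletion` returns. A hedged sketch of how the end of the driver's `main` method could do this (a fragment only, assuming the `job` variable from the driver above; it requires the Hadoop runtime and is not standalone):

```java
// After the job completes, read back the custom counters from the "map" group
boolean success = job.waitForCompletion(true);

org.apache.hadoop.mapreduce.Counters counters = job.getCounters();
long legal = counters.findCounter("map", "true").getValue();
long illegal = counters.findCounter("map", "false").getValue();

System.out.println("Legal records kept: " + legal);
System.out.println("Illegal records dropped: " + illegal);

System.exit(success ? 0 : 1);
```

Checking these two values is a quick way to verify how much data the cleaning step removed without inspecting the output files.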

Summary

This article showed how to implement data cleaning (ETL) as a Mapper-only MapReduce job, using custom Hadoop counters to track how many records were kept and how many were discarded. I hope the content has some reference value for your study or work.


