How to use Spark and Scala to analyze Apache access logs


Install

First, install Java and Scala, then download Spark. Make sure PATH and JAVA_HOME are set, then build Spark with Scala's SBT as follows:

$ sbt/sbt assembly

The build takes quite a while. Once it completes, verify that the installation succeeded by running:

$ ./bin/spark-shell
scala> val textFile = sc.textFile("README.md") // Create a reference to README.md
scala> textFile.count // Count the number of lines in this file
scala> textFile.first // Print the first line

Apache Access Log Analyzer

First we need a Scala parser for Apache access logs. Fortunately, one has already been written: download the Apache logfile parser code, then compile and package it with SBT:

sbt compile
sbt test
sbt package

The package name is assumed to be AlsApacheLogParser.jar.
Then start the Spark shell on the Linux command line:

// this works
$ MASTER=local[4] SPARK_CLASSPATH=AlsApacheLogParser.jar ./bin/spark-shell

For Spark 0.9, some ways of adding the jar do not work:

// does not work
$ MASTER=local[4] ADD_JARS=AlsApacheLogParser.jar ./bin/spark-shell
// does not work
spark> :cp AlsApacheLogParser.jar

Once the jar is loaded, create an AccessLogParser instance in the Spark REPL:

import com.alvinalexander.accesslogparser._
val p = new AccessLogParser

Now you can read the Apache access log accesslog.small just as you read README.md earlier:

scala> val log = sc.textFile("accesslog.small")
14/03/09 11:25:23 INFO MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=309225062
14/03/09 11:25:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 294.9 MB)
log: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:15
scala> log.count
(a lot of output here)
res0: Long = 100000

Analyzing Apache logs

We can count how many 404 responses appear in the Apache log. First, define a helper method that extracts the status code:

def getStatusCode(line: Option[AccessLogRecord]) = {
  line match {
    case Some(l) => l.httpStatusCode
    case None => "0"
  }
}

Option[AccessLogRecord] is the parser's return type.

Then use it in the Spark command line as follows:

log.filter(line => getStatusCode(p.parseRecord(line)) == "404").count

This returns the number of records whose httpStatusCode is 404.
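The filter logic itself can be sketched without a cluster on a plain Scala collection. In this sketch, parseStatus and the sample lines are hypothetical stand-ins for AccessLogParser and real log records, used only to illustrate the Option handling:

```scala
// Hypothetical stand-in for the parser: a line "parses" if it has exactly
// three space-separated fields, and the middle field is the status code.
def parseStatus(line: String): Option[String] =
  line.split(" ") match {
    case Array(_, status, _) => Some(status)
    case _                   => None
  }

// Same role as getStatusCode: unparseable lines become status "0"
def getStatusCode(parsed: Option[String]): String = parsed.getOrElse("0")

val lines = List(
  "/index.html 200 1234",
  "/missing.html 404 512",
  "bad line that will not parse",
  "/gone.html 404 0"
)

// Same shape as: log.filter(line => getStatusCode(p.parseRecord(line)) == "404").count
val notFound = lines.count(line => getStatusCode(parseStatus(line)) == "404")
println(notFound)  // 2
```

The malformed third line parses to None, becomes "0", and is simply excluded, which is exactly how the Spark version tolerates unparseable log lines.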

Digging Deeper

Now, suppose we want to know which URLs are problematic, for example a URL containing a space that causes a 404 error. That clearly requires three steps:

  1. Filter out all 404 records
  2. Extract the request field (the requested URL string, which may contain spaces, etc.) from each 404 record
  3. Remove duplicate records

Create the following method:

// get the `request` field from an access log record
def getRequest(rawAccessLogString: String): Option[String] = {
  val accessLogRecordOption = p.parseRecord(rawAccessLogString)
  accessLogRecordOption match {
    case Some(rec) => Some(rec.request)
    case None => None
  }
}

Paste this code into the Spark REPL and run the following code:

log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count
val recs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_))
val distinctRecs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).distinct
distinctRecs.foreach(println)
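The three-step pipeline (filter 404s, extract the request, deduplicate) maps directly onto plain Scala collections. A minimal sketch, with a hypothetical tab-separated record format and a one-field Rec case class standing in for the parser's AccessLogRecord:

```scala
// Hypothetical minimal record type, for illustration only
case class Rec(request: String, httpStatusCode: String)

// Hypothetical parser: request and status separated by a tab
def parse(line: String): Option[Rec] =
  line.split('\t') match {
    case Array(req, status) => Some(Rec(req, status))
    case _                  => None
  }

val lines = List(
  "GET /a HTTP/1.1\t200",
  "GET /bad page HTTP/1.1\t404",
  "GET /gone HTTP/1.1\t404",
  "GET /bad page HTTP/1.1\t404"   // duplicate request
)

// Same shape as the Spark pipeline: filter -> map(getRequest) -> distinct
val distinct404s = lines
  .filter(l => parse(l).map(_.httpStatusCode).getOrElse("0") == "404")
  .flatMap(l => parse(l).map(_.request))
  .distinct

distinct404s.foreach(println)
```

flatMap over the Option plays the role of getRequest here: lines that fail to parse contribute nothing, and distinct collapses the duplicated "/bad page" request to a single entry.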

Summary

For simple access-log analysis, grep is certainly a better choice, but more complex queries call for Spark. It is hard to judge Spark's performance on a single machine, because Spark is designed for distributed systems and large files.
