---
title: "Server Log Parsing"
author: "Jim Hester"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Server Log Parsing}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
Parsing server log files is a common task in server administration
([1](http://link.springer.com/article/10.1007/BF03325089),
[2](http://stackoverflow.com/search?q=%22Apache+log%22)).
Historically R has not been well suited to this, and the job was better
performed with a scripting language such as Perl. Rex, however, makes this
easy to do and allows you to perform both the data cleaning and the analysis
in R!
Common server logs consist of space-separated fields.

> 198.214.42.14 - - [21/Jul/1995:14:31:46 -0400] "GET /images/ HTTP/1.0" 200 17688
> lahal.ksc.nasa.gov - - [24/Jul/1995:12:42:40 -0400] "GET /images/USA-logosmall.gif HTTP/1.0" 200 234

The logs used in this vignette cover two months of HTTP requests to the NASA
Kennedy Space Center WWW server in Florida and are freely available for use
([3](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html)).
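
To get a feel for how `rex` patterns work before building the full parser,
here is a minimal sketch (not part of the original analysis) that extracts
just the bracketed timestamp from the first sample line above:

```{r}
library(rex)

line <- '198.214.42.14 - - [21/Jul/1995:14:31:46 -0400] "GET /images/ HTTP/1.0" 200 17688'

# capture everything between the literal brackets into a column named "time"
re_matches(line, rex("[", capture(name = "time", except_any_of("]")), "]"))
#>                         time
#> 1 21/Jul/1995:14:31:46 -0400
```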
```{r include = FALSE}
library(rex)
library(dplyr)
library(knitr)
library(ggplot2)
```
```{r warning = FALSE}
parsed <- scan("NASA.txt", what = "character", sep = "\n") %>%
  re_matches(
    rex(

      # Get the time of the request
      "[",
      capture(name = "time",
        except_any_of("]")
      ),
      "]",

      space, double_quote, "GET", space,

      # Get the filetype of the request if requesting a file
      maybe(
        non_spaces, ".",
        capture(name = "filetype",
          except_some_of(space, ".", "?", double_quote)
        )
      )
    )
  ) %>%
  mutate(filetype = tolower(filetype),
         time = as.POSIXct(time, format = "%d/%b/%Y:%H:%M:%S %z"))
```
This gives us a nicely formatted data frame of the time and filetype of each request.
```{r echo = FALSE}
kable(head(parsed, n = 10))
```
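
For a quick numeric summary to complement the table, the filetypes can also
be tabulated directly (a `dplyr` sketch, not part of the original vignette):

```{r}
parsed %>%
  filter(!is.na(filetype)) %>%
  count(filetype, sort = TRUE)
```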
We can also easily generate a histogram of the filetypes, or a plot of requests over time.
```{r fig.show='hold', warning = FALSE, message = FALSE}
ggplot(na.omit(parsed)) + stat_count(aes(x = filetype))
ggplot(na.omit(parsed)) + geom_histogram(aes(x = time)) + ggtitle("Requests over time")
```
rex/inst/doc/url_parsing.R

## ----url_parsing_stock, eval=F-------------------------------------------
# "^(?:(?:http(?:s)?|ftp)://)(?:\\S+(?::(?:\\S)*)?@)?(?:(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)(?:\\.(?:[a-z0-9\u00a1-\uffff](?:-)*)*(?:[a-z0-9\u00a1-\uffff])+)*(?:\\.(?:[a-z0-9\u00a1-\uffff]){2,})(?::(?:\\d){2,5})?(?:/(?:\\S)*)?$"
## ----url_parsing_url-----------------------------------------------------
library(rex)

valid_chars <- rex(except_some_of(".", "/", " ", "-"))

re <- rex(
  start,

  # protocol identifier (optional) + //
  group(list("http", maybe("s")) %or% "ftp", "://"),

  # user:pass authentication (optional)
  maybe(non_spaces,
    maybe(":", zero_or_more(non_space)),
    "@"),

  # host name
  group(zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),

  # domain name
  zero_or_more(".", zero_or_more(valid_chars, zero_or_more("-")), one_or_more(valid_chars)),

  # TLD identifier
  group(".", valid_chars %>% at_least(2)),

  # server port number (optional)
  maybe(":", digit %>% between(2, 5)),

  # resource path (optional)
  maybe("/", non_space %>% zero_or_more()),

  end
)
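
# A quick sanity check (a sketch, not part of the original script): rex
# patterns work directly with base R's regex functions, and printing `re`
# shows the generated regular expression.
grepl(re, "https://example.com:8080/path")  # expected TRUE
grepl(re, "not a url")                      # expected FALSE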
## ----url_parsing_validate------------------------------------------------
good <- c("http://foo.com/blah_blah",
"http://foo.com/blah_blah/",
"http://foo.com/blah_blah_(wikipedia)",
"http://foo.com/blah_blah_(wikipedia)_(again)",
"http://www.example.com/wpstyle/?p=364",
"https://www.example.com/foo/?bar=baz&inga=42&quux",
"http://✪df.ws/123",
"http://userid:password@example.com:8080",
"http://userid:password@example.com:8080/",
"http://userid@example.com",
"http://userid@example.com/",
"http://userid@example.com:8080",
"http://userid@example.com:8080/",
"http://userid:password@example.com",
"http://userid:password@example.com/",
"http://➡.ws/䨹",
"http://⌘.ws",
"http://⌘.ws/",
"http://foo.com/blah_(wikipedia)#cite-1",
"http://foo.com/blah_(wikipedia)_blah#cite-1",
"http://foo.com/unicode_(✪)_in_parens",
"http://foo.com/(something)?after=parens",
"http://☺.damowmow.com/",
"http://code.google.com/events/#&product=browser",
"http://j.mp",
"ftp://foo.bar/baz",
"http://foo.bar/?q=Test%20URL-encoded%20stuff",
"http://مثال.إختبار",
"http://例子.测试",
"http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com",
"http://1337.net",
"http://a.b-c.de",
"http://223.255.255.254")
bad <- c(
  "http://",
  "http://.",
  "http://..",
  "http://../",
  "http://?",
  "http://??",
  "http://??/",
  "http://#",
  "http://##",
  "http://##/",
  "http://foo.bar?q=Spaces should be encoded",
  "//",
  "//a",
  "///a",
  "///",
  "http:///a",
  "foo.com",
  "rdar://1234",
  "h://test",
  "http:// shouldfail.com",
  ":// should fail",
  "http://foo.bar/foo(bar)baz quux",
  "ftps://foo.bar/",
  "http://-error-.invalid/",
  "http://-a.b.co",
  "http://a.b-.co",
  "http://0.0.0.0",
  "http://3628126748",
  "http://.www.foo.bar/",
  "http://www.foo.bar./",
  "http://.www.foo.bar./")
all(grepl(re, good) == TRUE)
all(grepl(re, bad) == FALSE)
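
# If either check above returned FALSE, listing the mismatches helps to pin
# down the offending cases (a debugging sketch, not part of the original
# script):
good[!grepl(re, good)]  # valid URLs the pattern wrongly rejects
bad[grepl(re, bad)]     # invalid URLs the pattern wrongly accepts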
Creating a correct regular expression is hard! (Only 1 out of 13 regexes
tested was valid for all of these cases.) Because of this, one may be tempted
to simply copy the best regex you can find (gist). The problem with this is
that while you can copy it now, what happens later when you find a case that
is not handled correctly? Can you correctly interpret and modify such a
regex?
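
With rex, by contrast, each requirement maps onto a readable piece of the
expression, so changes stay local. As a hypothetical example (not from the
original vignette), accepting `ftps` URLs as well only means editing the
protocol group:

```{r}
library(rex)

# Hypothetical tweak: also allow an optional "s" after "ftp".
# (Sketch only; the host/port/path parts are collapsed to "any non-spaces".)
re2 <- rex(
  start,
  group(list("http", maybe("s")) %or% list("ftp", maybe("s")), "://"),
  non_space %>% zero_or_more(),
  end
)

grepl(re2, "ftps://foo.bar/")  # TRUE under this relaxed sketch
```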