Fetch all archived URLs from the Wayback Machine with Golang
While penetration testing a web application, the very first step is to gather as much information as we can; this process of gathering information is known as reconnaissance (recon, for short). Starting from a single root domain, recon lets us gather thousands of interesting endpoints. The deeper your recon goes, the better your chances of finding bugs, including the low-hanging ones.
The Internet Archive's Wayback Machine stores billions of web pages from different websites and provides access to them without restriction. For hackers, it is a great place to start looking for bugs in archived URLs and endpoints.
So, let's build a tool in Golang that takes domains as input and returns all archived URLs from the Wayback Machine as output.
How to fetch all archived URLs from the Wayback Machine with the Go language?
Let's start by importing required packages from Golang's standard library. All of these are built-in packages.
- bufio (buffered I/O) - wraps an io.Reader or io.Writer object and creates another object (Reader or Writer) that buffers the underlying stream. Buffering is a technique that improves the performance of I/O operations.
- fmt (format) - formats input and output. Here we will use this package to print strings (with/without errors) to the console.
- io - provides basic input-output primitives as well as interfaces to them. We will use it to copy the HTTP response body to standard output (stdout).
- net (network) - provides a portable interface for network I/O.
- http - lives inside the net package (imported as net/http). It provides HTTP client and server implementations; you can make HTTP requests to a server with this package.
- os (operating system) - provides the ability to access native operating-system features in a platform-agnostic manner. The design is Unix-like, although the error handling is Go-like.
There are basically three packages involved in I/O operations in Golang: fmt, os, and bufio.
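To see what buffering buys us, here is a minimal, standalone sketch (not part of the crawler) that batches many small writes through a bufio.Writer and flushes them to stdout in one go:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // Wrap os.Stdout in a buffered writer; writes accumulate in memory.
    w := bufio.NewWriter(os.Stdout)
    for i := 0; i < 5; i++ {
        // Each Fprintln writes to the buffer, not directly to the terminal.
        fmt.Fprintln(w, "line", i)
    }
    // Flush pushes the buffered bytes to stdout in a single write.
    w.Flush()
}
With that in mind, here are the imports for our crawler: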
package main
import (
"bufio"
"fmt"
"io"
"net/http"
"os"
)
Declare a constant outside the function in Golang
Before jumping into the main() function, let's declare a constant archiveURL, which stores a formatted string containing the Wayback Machine CDX server URL.
Interesting HTTP parameters used here:
- url=*.%s/* - matches the domain and all of its subdomains.
- output=text (or json) - returns the output as plain text.
- fl=original - returns only the original URL field for each capture.
- collapse=urlkey - collapses results that share the same URL key, removing duplicates.
// archive.org
const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"
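For a hypothetical input domain example.com, fmt.Sprintf fills the %s placeholder, producing the request URL shown in the comment below (a quick illustrative check, not part of the final tool):
package main

import "fmt"

const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"

func main() {
    // Prints: http://web.archive.org/cdx/search/cdx?url=*.example.com/*&output=text&fl=original&collapse=urlkey
    fmt.Println(fmt.Sprintf(archiveURL, "example.com"))
}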
Working with standard input (stdin) in Golang
Now, in the main() function, we need to read domains from stdin and move forward with those.
- Print an error and exit with status 1 if standard input is empty. os.Stdin.Stat() reports the file mode of stdin; when stdin is an interactive terminal (a character device) rather than a pipe, no URLs are being fed in, so we write a hint to STDERR and quit.
func main() {
    // Check if input is provided.
    stat, _ := os.Stdin.Stat()
    // Exit the crawler if no urls are provided on STDIN.
    if (stat.Mode() & os.ModeCharDevice) != 0 {
        fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
        os.Exit(1)
    }
}
- Store stdin in a buffered scanner using the bufio.NewScanner() function.
func main() {
    // Check if input is provided.
    stat, _ := os.Stdin.Stat()
    // Exit the crawler if no urls are provided on STDIN.
    if (stat.Mode() & os.ModeCharDevice) != 0 {
        fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
        os.Exit(1)
    }
    // Read input from the input stream.
    readIn := bufio.NewScanner(os.Stdin)
}
(Note: this and the next intermediate snippet won't compile on their own, since Go rejects declared-but-unused variables; readIn and hostname are put to use in the steps that follow.)
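One hedged improvement worth considering: bufio.Scanner swallows read errors during iteration, so after the read loop finishes you can surface them with Err(). A minimal sketch of the check (not in the original tool):
// After the for readIn.Scan() loop exits:
if err := readIn.Err(); err != nil {
    fmt.Fprintln(os.Stderr, "error reading stdin:", err)
}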
HTTP request and response with Golang
To process each domain, we can use a for loop that iterates through all the domains arriving on STDIN, one per line.
func main() {
    // Check if input is provided.
    stat, _ := os.Stdin.Stat()
    // Exit the crawler if no urls are provided on STDIN.
    if (stat.Mode() & os.ModeCharDevice) != 0 {
        fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
        os.Exit(1)
    }
    // Read input from the input stream.
    readIn := bufio.NewScanner(os.Stdin)
    for readIn.Scan() {
        // Read a hostname from stdin.
        hostname := readIn.Text()
    }
}
Fetch archived URLs from the Wayback CDX server with Golang
To retrieve all archived endpoints for each domain, we will use the http.Get() function. Since archiveURL is a formatted string constant, we substitute the hostname into it with fmt.Sprintf().
func main() {
    // Check if input is provided.
    stat, _ := os.Stdin.Stat()
    // Exit the crawler if no urls are provided on STDIN.
    if (stat.Mode() & os.ModeCharDevice) != 0 {
        fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
        os.Exit(1)
    }
    // Read input from the input stream.
    readIn := bufio.NewScanner(os.Stdin)
    for readIn.Scan() {
        // Read a hostname from stdin.
        hostname := readIn.Text()
        // Get all archived URLs.
        resp, err := http.Get(fmt.Sprintf(archiveURL, hostname))
        if err != nil {
            panic(err)
        }
    }
}
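Note that panic(err) aborts the entire run on the first network error. If you would rather skip the failing domain and keep crawling, a hedged alternative (not the original author's choice) is to log the error and continue. This fragment would replace the fetch block inside the for loop:
resp, err := http.Get(fmt.Sprintf(archiveURL, hostname))
if err != nil {
    // Report the failure but keep processing the remaining domains.
    fmt.Fprintf(os.Stderr, "fetch failed for %s: %v\n", hostname, err)
    continue
}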
Golang script to fetch all archived URLs and return all endpoints on STDOUT
Finally, use the io package to copy each HTTP response body (in plain text format) to STDOUT.
package main

import (
    "bufio"
    "fmt"
    "io"
    "net/http"
    "os"
)

// archive.org
const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"

func main() {
    // Check if input is provided.
    stat, _ := os.Stdin.Stat()
    // Exit the crawler if no urls are provided on STDIN.
    if (stat.Mode() & os.ModeCharDevice) != 0 {
        fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
        os.Exit(1)
    }
    // Read input from the input stream.
    readIn := bufio.NewScanner(os.Stdin)
    for readIn.Scan() {
        // Read a hostname from stdin.
        hostname := readIn.Text()
        // Get all archived URLs.
        resp, err := http.Get(fmt.Sprintf(archiveURL, hostname))
        if err != nil {
            panic(err)
        }
        // Copy the response body to stdout.
        io.Copy(os.Stdout, resp.Body)
        // Close the body so the connection can be reused.
        resp.Body.Close()
    }
}
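To try it out, assuming the file is saved as main.go and the binary is named crawler (as the hint in the error message suggests):
go build -o crawler main.go
echo example.com | ./crawler
Each line printed is an archived URL from the Wayback Machine for the given domain and its subdomains.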
Conclusion
In this tutorial, we learnt how to use the Wayback Machine to find hidden endpoints on a target web application. We built a Golang tool that takes target domains and retrieves all the archived URLs stored on the Wayback CDX server.