PUBLISHED ON: FEBRUARY 13, 2023

Fetch all archived URLs from the Wayback Machine with Golang

While penetration testing a web application, the very first step is to gather as much information as we can; this process of gathering information is known as reconnaissance (Recon for short). Recon lets us go from a single root domain to thousands of interesting endpoints. The deeper your recon goes, the better your chances of finding bugs beyond the low-hanging fruit.

The Internet Archive's Wayback Machine stores billions of web pages from different websites and provides access to them without restriction. For hackers, it is a great place to start looking for bugs in archived URLs and endpoints.

So, let's build a tool in Golang that takes domains as input and returns all archived URLs from the Wayback Machine as output.

How to fetch all archived URLs from the Wayback Machine with the Go language?

Let's start by importing required packages from Golang's standard library. All of these are built-in packages.

  • bufio (buffered I/O) - bufio wraps an io.Reader or io.Writer object and creates another object (Reader or Writer) that buffers the underlying I/O. Buffering is a technique that improves the performance of I/O operations.
  • fmt (format) - This package is used to format input and output. Here we will use it to print strings (with or without errors) to the console.
  • io - This package provides input/output primitives as well as interfaces to them. We will use it to copy the HTTP response body to standard output (stdout).
  • net (network) - This package provides a portable interface for network I/O.
    • http - This sub-package of net provides HTTP client and server implementations. We will use it to make HTTP requests to the server.
  • os (operating system) - This package provides the ability to access native operating-system features in a platform-agnostic manner. The design is Unix-like, although the error handling is Go-like.

Of these, three packages handle I/O in this program: fmt, os, and bufio.

package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
)

Declare a constant outside the function in Golang

Before jumping into the main() function, let's declare a constant archiveURL, which stores a format string pointing at the Wayback Machine's CDX server.

Interesting HTTP parameters used here:

  • url=*.%s/* - To include subdomains as well. The %s verb is replaced with the target domain.
  • output=text - To get the output as plain text (output=json returns JSON instead).
  • fl=original - To return only the original URL field for each record.
  • collapse=urlkey - To collapse adjacent results that share the same URL key, removing duplicates.

// archive.org

const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"
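
For example, here is a minimal sketch (using example.com as a placeholder domain) showing the request URL that fmt.Sprintf() produces from this constant:

package main

import "fmt"

const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"

func main() {
	// The %s verb in archiveURL is replaced by the target domain.
	fmt.Println(fmt.Sprintf(archiveURL, "example.com"))
	// Prints:
	// http://web.archive.org/cdx/search/cdx?url=*.example.com/*&output=text&fl=original&collapse=urlkey
}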

Working with standard input (stdin) in Golang

Now, in the main() function, we need to read domains from stdin and process each one.

  1. Print an error to STDERR and exit with status 1 if standard input is empty (checked with os.Stdin.Stat()). A standalone sketch of this check follows these steps.

func main() {

	// Check if input is provided
	stat, _ := os.Stdin.Stat()

	// Exits the crawler if no URLs are provided on STDIN.
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
		os.Exit(1)
	}
}

  2. Buffer stdin using the bufio.NewScanner() function.

func main() {

	// Check if input is provided
	stat, _ := os.Stdin.Stat()

	// Exits the crawler if no URLs are provided on STDIN.
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
		os.Exit(1)
	}

	// Reads input from the input stream.
	readIn := bufio.NewScanner(os.Stdin)

}
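
The check works because when stdin is attached to a terminal its file mode includes os.ModeCharDevice, while piped or redirected input does not. Here is a minimal, standalone sketch of just this detection (not part of the final tool):

package main

import (
	"fmt"
	"os"
)

func main() {
	stat, err := os.Stdin.Stat()
	if err != nil {
		fmt.Fprintln(os.Stderr, "stat error:", err)
		os.Exit(1)
	}
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		// Stdin is a character device (a terminal): nothing was piped in.
		fmt.Println("stdin is a terminal")
	} else {
		// Stdin is a pipe or a redirected file.
		fmt.Println("stdin is piped or redirected")
	}
}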

HTTP request and response with Golang

To process each domain, we can use a for loop that iterates over every domain the scanner reads from STDIN.

func main() {
	
	// Check if input is provided
	stat, _ := os.Stdin.Stat()
	
	// Exits the crawler if no URLs are provided on STDIN.
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
		os.Exit(1)
	}
	
	// Reads input from the input stream.
	readIn := bufio.NewScanner(os.Stdin)
	
	for readIn.Scan() {
		
		// Reads hostname from stdin.
		hostname := readIn.Text()
		_ = hostname // placeholder so this intermediate step compiles

	}
}
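
bufio.Scanner reads one line per Scan() call; note that Scan() also returns false on a read error, which the snippets in this tutorial do not check. Here is a minimal sketch of the same read loop with that check added (the error handling is an optional hardening, not part of the original tool):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	readIn := bufio.NewScanner(os.Stdin)
	for readIn.Scan() {
		// Text() returns the current line without the trailing newline.
		fmt.Println("hostname:", readIn.Text())
	}
	// Err() reports any error other than io.EOF encountered while scanning.
	if err := readIn.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "error reading stdin:", err)
		os.Exit(1)
	}
}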

Fetch archived URLs from the Wayback CDX server with Golang

To retrieve all endpoints for each domain, we will use the http.Get() function.

Here, archiveURL is the format string constant we declared earlier; fmt.Sprintf() substitutes each hostname into its %s verb.

func main() {
	
	// Check if input is provided
	stat, _ := os.Stdin.Stat()
	
	// Exits the crawler if no URLs are provided on STDIN.
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
		os.Exit(1)
	}
	
	// Reads input from the input stream.
	readIn := bufio.NewScanner(os.Stdin)
	
	for readIn.Scan() {
		
		// Reads hostname from stdin.
		hostname := readIn.Text()
		
		// Get all archived URLs.
		resp, err := http.Get(fmt.Sprintf(archiveURL, hostname))
		if err != nil {
			panic(err)
		}
		_ = resp // placeholder; the response body is consumed in the final step
	}
}
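
Note that http.Get() uses Go's default client, which has no timeout, and panic(err) aborts the whole run on the first network error. If you want something gentler, here is a hedged variant (the 30-second timeout and the fetchArchive helper name are illustrative assumptions, not part of the original tool):

package main

import (
	"fmt"
	"net/http"
	"time"
)

const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"

// fetchArchive is a hypothetical helper: it adds a request timeout and a
// status-code check instead of panicking on the first failure.
func fetchArchive(hostname string) (*http.Response, error) {
	// The 30-second timeout is an illustrative choice.
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(fmt.Sprintf(archiveURL, hostname))
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("cdx server returned %s for %s", resp.Status, hostname)
	}
	return resp, nil
}

func main() {
	resp, err := fetchArchive("example.com") // placeholder domain
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("fetched with status:", resp.Status)
}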

Golang script to fetch all archived URLs and return all endpoints on STDOUT

Finally, use the io package to copy every HTTP response body (in plain text format) to STDOUT.

package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
)

// archive.org
const archiveURL = "http://web.archive.org/cdx/search/cdx?url=*.%s/*&output=text&fl=original&collapse=urlkey"

func main() {
	
	// Check if input is provided
	stat, _ := os.Stdin.Stat()
	
	// Exits the crawler if no URLs are provided on STDIN.
	if (stat.Mode() & os.ModeCharDevice) != 0 {
		fmt.Fprintln(os.Stderr, "No urls detected. Hint: cat domains.txt | crawler")
		os.Exit(1)
	}
	
	// Reads input from the input stream.
	readIn := bufio.NewScanner(os.Stdin)
	
	for readIn.Scan() {
		
		// Reads hostname from stdin.
		hostname := readIn.Text()
		
		// Get all archived URLs.
		resp, err := http.Get(fmt.Sprintf(archiveURL, hostname))
		if err != nil {
			panic(err)
		}
		
		// Copy the response body to stdout, then close it.
		io.Copy(os.Stdout, resp.Body)
		resp.Body.Close()
	}
}
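
Two small design notes: io.Copy() streams the response body straight to stdout without buffering it all in memory, which matters because CDX responses for large domains can run to many megabytes, and closing resp.Body after each copy frees the connection for reuse. To try the tool, compile it with go build and pipe domains in, e.g. cat domains.txt | crawler, exactly as the hint message suggests.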

Conclusion

In this tutorial, we learned how to use the Wayback Machine to find hidden endpoints on a target web application. We built a Golang tool that takes target domains as input and retrieves all archived URLs stored on the Wayback CDX server.



About the author:
Pradeep has expertise in Linux, Go, Nginx, Apache, CyberSecurity, AppSec and various other technical areas. He has contributed to numerous publications and websites, providing his readers with insightful and informative content.