Signup/Sign In
PUBLISHED ON: FEBRUARY 13, 2023

Create a Golang tool to scrape comments from a HTML document

In this tutorial, we will create a tool in Golang to parse html files and extract html comments.

Make sure you have installed Golang and properly set up the environment variable. Check out Golang installation tutorial here.

1. Start by creating a file main.go and import the required Golang packages:

  • fmt - To print extracted html comments.
  • os - To read HTML files from standard input (stdin)
  • strings - To manipulate UTF-8 encoded strings.
  • golang.org/x/net/html - To implement an HTML5-compliant tokenizer and parser.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

2. Create a HTML Tokeniser in Golang

Create a Tokenizer. You should provide UTF-8 as os.stdin text to Tokenizer.

Assign a variable to html.NewTokenizer(os.stdin)

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	}

}

3. Tokenize all HTML tokens in Golang.

We need to repeatedly call t.New() to parse the next token. We can use a for loop to repeatedly move to the next token.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()

	}

}

4. Tokenize all HTML tokens in Golang and handle errors.

Error handling is important to write smart code. Tokeniser returns the type of token or error.

If the Tokenizer returns an error we need to break the loop code immediately. So, use the html.ErrorToken function.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

	}

}

5. Find HTML comments with tokenizer in Golang

Assign the token to a new variable.

Let's use the if (conditional) statement to execute a program if the token is a Comment Token(html.CommentToken).

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
		}

	}

}

6. Manipulate strings in Golang

We can apply simple manipulation to the comment to make it more readable.

  • Remove new line characters (\n). So, each comment becomes a one-liner. strings.ReplaceAll(tc.Data, "\n", " ") or strings.Replace(tc.Data, "\n", " ", -1)
  • Trim unwanted white space. strings.TrimSpace(d)
  • Do not print if the comment is an empty string. (if d=="")
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
			// Replaces all new lines in token comment data with  a white space.
			d := strings.ReplaceAll(tc.Data, "\n", " ")
			
			// Removes leading and trailing white space and returns a string.
			d = strings.TrimSpace(d)
			
			// If d is an empty string continue the loop without printing.
			if d == "" {
				continue
			}

		}

	}

}

7. Return HTML comments in STDOUT in Golang

Now, we just have to print HTML comments on the console. and test if our program is working.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
			// Replaces all new lines in token comment data with  a white space.
			d := strings.ReplaceAll(tc.Data, "\n", " ")
			
			// Removes leading and trailing white space and returns a string.
			d = strings.TrimSpace(d)
			
			// If d is an empty string continue the loop without printing.
			if d == "" {
				continue
			}
			// Println outputs d to stdout
			fmt.Println(d)
		}

	}

}

Build and run Golang HTML comment extractor program

Now open terminal and go to directory/folder and run these commands:

go mod init main.go
go get golang.org/x/net/html
go build

go build and run http comment extractor

Conclusion

We have created a Golang tool to increase our attack surface. Now you can use regular expressions to fetch different interesting terms like URLs.



About the author:
Pradeep has expertise in Linux, Go, Nginx, Apache, CyberSecurity, AppSec and various other technical areas. He has contributed to numerous publications and websites, providing his readers with insightful and informative content.