Create a Golang tool to scrape comments from a HTML document

In this tutorial, we will create a tool in Golang to parse html files and extract html comments.

Make sure you have installed Golang and properly set up the environment variable. Check out Golang installation tutorial here.

1. Start by creating a file `main.go` and import the required Golang packages:

fmt - To print extracted html comments.
os - To read HTML files from standard input (stdin)
strings - To manipulate UTF-8 encoded strings.
golang.org/x/net/html - To implement an HTML5-compliant tokenizer and parser.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

2. Create a HTML Tokeniser in Golang

Create a Tokenizer. You should provide UTF-8 as os.stdin text to Tokenizer.

Assign a variable to html.NewTokenizer(os.stdin)

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	}

}

3. Tokenize all HTML tokens in Golang.

We need to repeatedly call t.New() to parse the next token. We can use a for loop to repeatedly move to the next token.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()

	}

}

4. Tokenize all HTML tokens in Golang and handle errors.

Error handling is important to write smart code. Tokeniser returns the type of token or error.

If the Tokenizer returns an error we need to break the loop code immediately. So, use the html.ErrorToken function.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

	}

}

5. Find HTML comments with tokenizer in Golang

Assign the token to a new variable.

Let's use the if (conditional) statement to execute a program if the token is a Comment Token(html.CommentToken).

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
		}

	}

}

6. Manipulate strings in Golang

We can apply simple manipulation to the comment to make it more readable.

Remove new line characters (\n). So, each comment becomes a one-liner. strings.ReplaceAll(tc.Data, "\n", " ") or strings.Replace(tc.Data, "\n", " ", -1)
Trim unwanted white space. strings.TrimSpace(d)
Do not print if the comment is an empty string. (if d=="")

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
			// Replaces all new lines in token comment data with  a white space.
			d := strings.ReplaceAll(tc.Data, "\n", " ")
			
			// Removes leading and trailing white space and returns a string.
			d = strings.TrimSpace(d)
			
			// If d is an empty string continue the loop without printing.
			if d == "" {
				continue
			}

		}

	}

}

7. Return HTML comments in STDOUT in Golang

Now, we just have to print HTML comments on the console. and test if our program is working.

package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {

	// Generates a html Tokenizer.
	t := html.NewTokenizer(os.Stdin)

	for {
		// Moves the cursor to the next node in the tree.
		tt := t.Next()
		
		// html.ErrorToken is an error token.
		if tt == html.ErrorToken {
			break
		}

		// Returns a token for the current token.
		tc := t.Token()

		
		// If the token is a CommentToken.
		if tc.Type == html.CommentToken {
			
			// Replaces all new lines in token comment data with  a white space.
			d := strings.ReplaceAll(tc.Data, "\n", " ")
			
			// Removes leading and trailing white space and returns a string.
			d = strings.TrimSpace(d)
			
			// If d is an empty string continue the loop without printing.
			if d == "" {
				continue
			}
			// Println outputs d to stdout
			fmt.Println(d)
		}

	}

}

Build and run Golang HTML comment extractor program

Now open terminal and go to directory/folder and run these commands:

go mod init main.go
go get golang.org/x/net/html
go build

go build and run http comment extractor

Conclusion

We have created a Golang tool to increase our attack surface. Now you can use regular expressions to fetch different interesting terms like URLs.

C TUTORIAL

C PROGRAMS

INTERVIEW TESTS

EXECUTE CODE

C++ TUTORIAL

C++ PROGRAMS

INTERVIEW TESTS

EXECUTE CODE

PYTHON TUTORIAL

PYTHON HOW TOS

INTERVIEW TESTS

EXECUTE CODE

JAVA TUTORIAL

JAVA CODE EXAMPLES

SPRING TUTORIAL

MORE IN JAVA

COMPUTER ARCHITECTURE

COMPUTER NETWORK

OPERATING SYSTEM

DBMS & SQL

PL/SQL

MongoDB

EXECUTE SQL

ANDROID DEVELOPMENT

GO LANGUAGE

LINUX

DOCKER

HTML TAGS (A to Z)

CSS REFERENCES

SASS/SCSS

KOTLIN

GAME DEVELOPMENT

PHP

GIT GUIDE

JAVASCRIPT

ADVANCED DSA

Create a Golang tool to scrape comments from a HTML document

1. Start by creating a file main.go and import the required Golang packages:

2. Create a HTML Tokeniser in Golang

3. Tokenize all HTML tokens in Golang.

4. Tokenize all HTML tokens in Golang and handle errors.

5. Find HTML comments with tokenizer in Golang

6. Manipulate strings in Golang

7. Return HTML comments in STDOUT in Golang

Build and run Golang HTML comment extractor program

Conclusion

1. Start by creating a file `main.go` and import the required Golang packages: