htmlquery is a Go XPath package for querying HTML.

antchfx/htmlquery

htmlquery

Overview

htmlquery is an XPath query package for HTML that lets you extract data from HTML documents, or evaluate values over them, using XPath expressions.

htmlquery has a built-in query object cache based on LRU, which caches recently used XPath query strings. Enabling query caching avoids recompiling the XPath expression on every query.
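To make the caching idea concrete, here is a minimal sketch of an LRU cache keyed by the query string. It is not htmlquery's actual implementation: `compiledQuery`, `queryCache`, and `newQueryCache` are hypothetical names, and the "compile" step is a stand-in counter rather than a real XPath compiler.

```go
package main

import (
	"container/list"
	"fmt"
)

// compiledQuery stands in for a compiled XPath expression (hypothetical type).
type compiledQuery struct{ src string }

// entry is the value stored in each list element.
type entry struct {
	key   string
	value *compiledQuery
}

// queryCache is a small LRU cache keyed by the raw query string.
type queryCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // query string -> list element
	compiles int                      // number of times the compile step ran
}

func newQueryCache(capacity int) *queryCache {
	return &queryCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
	}
}

// get returns the cached query for expr, compiling it only on a cache miss.
func (c *queryCache) get(expr string) *compiledQuery {
	if el, ok := c.items[expr]; ok {
		c.order.MoveToFront(el) // mark as most recently used
		return el.Value.(*entry).value
	}
	c.compiles++ // stand-in for the expensive compile step
	q := &compiledQuery{src: expr}
	c.items[expr] = c.order.PushFront(&entry{key: expr, value: q})
	if c.order.Len() > c.capacity {
		// Evict the least recently used entry.
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return q
}

func main() {
	c := newQueryCache(2)
	c.get("//a")
	c.get("//a")   // cache hit, no recompilation
	c.get("//img") // miss
	c.get("//p")   // miss, evicts "//a"
	c.get("//a")   // miss again, because "//a" was evicted
	fmt.Println("compiles:", c.compiles)
}
```

Repeating a query string costs only a map lookup, which is the effect the cache is meant to provide.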

See the xpath package for the supported XPath (1.0/2.0) syntax: https://github.com/antchfx/xpath

XPath query packages for Go

| Name | Description |
| --- | --- |
| htmlquery | XPath query package for HTML documents |
| xmlquery | XPath query package for XML documents |
| jsonquery | XPath query package for JSON documents |

Installation

go get github.com/antchfx/htmlquery

Getting Started

Query the document; QueryAll returns the matched elements or an error.

nodes, err := htmlquery.QueryAll(doc, "//a")
if err != nil {
	panic(`not a valid XPath expression.`)
}

Load an HTML document from a URL.

doc, err := htmlquery.LoadURL("http://example.com/")

Load an HTML document from a file.

filePath := "/home/user/sample.html"
doc, err := htmlquery.LoadDoc(filePath)

Load an HTML document from a string.

s := `<html>....</html>`
doc, err := htmlquery.Parse(strings.NewReader(s))

Find all A elements.

list := htmlquery.Find(doc, "//a")

Find all A elements that have an href attribute.

list := htmlquery.Find(doc, "//a[@href]")

Find all A elements with an href attribute and return only the href value.

list := htmlquery.Find(doc, "//a/@href")
for _, n := range list {
	fmt.Println(htmlquery.InnerText(n)) // output @href value
}

Find the third A element.

a := htmlquery.FindOne(doc, "//a[3]")

Find the child img element under an A element and print its src attribute.

a := htmlquery.FindOne(doc, "//a")
img := htmlquery.FindOne(a, "//img")
fmt.Println(htmlquery.SelectAttr(img, "src")) // output @src value

Count all IMG elements.

expr, _ := xpath.Compile("count(//img)")
v := expr.Evaluate(htmlquery.CreateXPathNavigator(doc)).(float64)
fmt.Printf("total count is %f", v)

Quick Starts

package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://www.bing.com/search?q=golang")
	if err != nil {
		panic(err)
	}
	// Find all news items.
	list, err := htmlquery.QueryAll(doc, "//ol/li")
	if err != nil {
		panic(err)
	}
	for i, n := range list {
		// Each result item is expected to contain a link.
		a := htmlquery.FindOne(n, "//a")
		if a != nil {
			fmt.Printf("%d %s(%s)\n", i, htmlquery.InnerText(a), htmlquery.SelectAttr(a, "href"))
		}
	}
}

FAQ

Find() vs QueryAll(), which is better?

Find and QueryAll do the same thing: both search for all matching HTML nodes. Find panics if you give it an invalid XPath query, whereas QueryAll returns an error.

Can I save my query expression object for the next query?

Yes, you can. The QuerySelector and QuerySelectorAll methods accept your query expression object.

Caching (or reusing) a query expression object avoids recompiling the XPath query expression and improves query performance.

XPath query object cache performance

goos: windows
goarch: amd64
pkg: github.com/antchfx/htmlquery
BenchmarkSelectorCache-4 20000000 55.2 ns/op
BenchmarkDisableSelectorCache-4 500000 3162 ns/op

How to disable caching?

htmlquery.DisableSelectorCache = true

Questions

Please let me know if you have any questions.