Rewrite for current Syntax Level 3 #13

Open · wants to merge 33 commits into base: main

Commits (33)
7563929
update README to point to the spec
riking Mar 13, 2018
55f8973
Export TokenType, add missing token types
riking Mar 13, 2018
6328600
Add a text/transform preprocessor for input
riking Mar 13, 2018
c02b43f
implement the 'consume a token' algorithm
riking Mar 13, 2018
77fb512
stub out the rest of the parsing algo
riking Mar 13, 2018
c626905
Finish tokenizer, start fixing tests
riking Mar 14, 2018
7690df4
fix all usages of "starts with a valid escape": cannot unread after p…
riking Mar 14, 2018
1080914
Change test data, make more fixes
riking Mar 14, 2018
b99c1dd
add fuzzing corpus from existing testdata
riking Mar 14, 2018
aa841ce
widen ParseInt calls to accept too-big codepoints
riking Mar 14, 2018
b5986f0
Add round-tripping test
riking Mar 14, 2018
6e71edb
Fix: was discarding the leading 'u'
riking Mar 14, 2018
c5a4afb
Fix more fuzzer findings
riking Mar 15, 2018
4c0a5ef
More fixes from fuzzing
riking Mar 20, 2018
4c09d63
Fix '5e', '#123', and comments
riking Mar 20, 2018
3c8aa10
fixup: add more comment tests
riking Mar 20, 2018
d163d68
add tests for '5e', '5e-', '5e-3'
riking Mar 20, 2018
f065792
fix missing space after hex escape
riking Mar 20, 2018
0386e01
call escapeIdentifer() for TokenFunction
riking Mar 20, 2018
b5c30c6
Fuzz fixes for bad-string
riking Mar 20, 2018
87fb86e
Rename package, update documentation
riking Mar 20, 2018
2fc5ca6
Restore original 'scanner' directory
riking Mar 20, 2018
08b0d9c
Ignore fuzz results
riking Mar 20, 2018
ff8d7b8
Remove failing "--\--" test, add test for #2
riking Mar 20, 2018
df4d3f6
Improve documentation, delete unused methods
riking Mar 20, 2018
2689bbf
tighten signature of TokenExtraTypeLookup
riking Mar 20, 2018
5f3baa3
improve documentation of TokenExtra.String()
riking Mar 20, 2018
ad83c8e
Change Token.WriteTo to standard signature
riking Mar 25, 2018
f4312d7
Update README, update tokenizer docs
riking Mar 25, 2018
551cdba
Suppress output from Fuzz during tests
riking Mar 25, 2018
05a2682
Oops, WriteTo returns int64 not int
riking Mar 25, 2018
35e0c2b
travis.yml: skip tokenizer package in old versions
riking Mar 25, 2018
c37ded0
travis.yml: Drop go 1.3 and 1.4 support (bufio.Reader.Discard)
riking Mar 25, 2018
12 changes: 6 additions & 6 deletions .travis.yml
@@ -3,12 +3,12 @@ sudo: false
 
 matrix:
   include:
-    - go: 1.3
-    - go: 1.4
-    - go: 1.5
-    - go: 1.6
-    - go: 1.7
-    - go: 1.8
+    - go: "1.5"
+    - go: "1.6"
+    - go: "1.7"
+    - go: "1.8"
+    - go: "1.9"
+    - go: "1.10"
     - go: tip
   allow_failures:
     - go: tip
4 changes: 4 additions & 0 deletions README.md
@@ -3,3 +3,7 @@ css
 [![GoDoc](https://godoc.org/github.com/gorilla/css?status.svg)](https://godoc.org/github.com/gorilla/css) [![Build Status](https://travis-ci.org/gorilla/css.png?branch=master)](https://travis-ci.org/gorilla/css)
 
 A CSS3 tokenizer.
+
+This repository contains two packages. The 'scanner' package is based on an older version of the CSS specification and is kept for compatibility with existing code. Minimum Go version is 1.3.
+
+The 'tokenizer' package is based on the CSS Syntax Level 3 specification at <https://www.w3.org/TR/css-syntax-3/#tokenizer-algorithms>. Minimum Go version is 1.5.
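
Side by side, the two APIs look roughly like this. This is a minimal sketch: the scanner calls follow the existing gorilla/css/scanner package, and the tokenizer calls mirror the constructor and stop-token check used elsewhere in this diff; treat exact names as subject to each package's godoc.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/gorilla/css/scanner"
	"github.com/gorilla/css/tokenizer"
)

func main() {
	css := "a { color: red }"

	// Legacy scanner: string input, stops on EOF or error.
	s := scanner.New(css)
	for {
		tok := s.Next()
		if tok.Type == scanner.TokenEOF || tok.Type == scanner.TokenError {
			break
		}
		fmt.Println("scanner:", tok)
	}

	// New tokenizer: io.Reader input, generic stop-token check.
	tz := tokenizer.NewTokenizer(strings.NewReader(css))
	for {
		tok := tz.Next()
		if tok.Type.StopToken() {
			break
		}
		fmt.Println("tokenizer:", tok)
	}
}
```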
1 change: 1 addition & 0 deletions tokenizer/.gitignore
@@ -0,0 +1 @@
testdata/fuzz
60 changes: 60 additions & 0 deletions tokenizer/crlf.go
@@ -0,0 +1,60 @@
// Copyright (c) 2018 Kane York. Licensed under 2-Clause BSD.

package tokenizer

// The crlf package helps in dealing with files that have DOS-style CR/LF line
// endings.
//
// Copyright (c) 2015 Andy Balholm. Licensed under 2-Clause BSD.
//
// package crlf

import "golang.org/x/text/transform"

// normalize takes CRLF, CR, or LF line endings in src and converts them
// to LF in dst.
//
// cssparse: Also replaces null bytes with U+FFFD REPLACEMENT CHARACTER.
type normalize struct {
	prev byte
}

const replacementCharacter = "\uFFFD"

func (n *normalize) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
	for nDst < len(dst) && nSrc < len(src) {
		c := src[nSrc]
		switch c {
		case '\r':
			dst[nDst] = '\n'
		case '\n':
			if n.prev == '\r' {
				// Collapse CRLF: the CR already emitted an LF.
				nSrc++
				n.prev = c
				continue
			}
			dst[nDst] = '\n'
		case 0:
			// len(replacementCharacter) == 3; bail out if dst lacks room.
			if nDst+3 > len(dst) {
				err = transform.ErrShortDst
				return
			}
			copy(dst[nDst:], replacementCharacter)
			nDst += 2 // the shared nDst++ below accounts for the third byte
		default:
			dst[nDst] = c
		}
		n.prev = c
		nDst++
		nSrc++
	}
	if nSrc < len(src) {
		err = transform.ErrShortDst
	}
	return
}

func (n *normalize) Reset() {
	n.prev = 0
}
52 changes: 52 additions & 0 deletions tokenizer/doc.go
@@ -0,0 +1,52 @@
// Copyright 2018 Kane York.
// Copyright 2012 The Gorilla Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

/*
Package gorilla/css/tokenizer generates tokens for a CSS3 input.

It follows the CSS3 specification located at:

	http://www.w3.org/TR/css-syntax-3/#tokenizer-algorithms

To use it, create a new tokenizer for a given CSS input and call Next() until
the token returned is a "stop token":

	s := tokenizer.New(strings.NewReader(myCSS))
	for {
		token := s.Next()
		if token.Type.StopToken() {
			break
		}
		// Do something with the token...
	}

If the consumer wants to accept malformed input, use the following check
instead:

	token := s.Next()
	if token.Type == tokenizer.TokenEOF || token.Type == tokenizer.TokenError {
		break
	}

The three potential tokenization errors are a "bad-escape" (backslash-newline
outside a "string" or url() in the input), a "bad-string" (unescaped newline
inside a "string"), and a "bad-url" (a few different cases). Parsers can
choose to abort when seeing one of these errors, or ignore the declaration and
attempt to recover.

Returned tokens that carry extra information have a non-nil .Extra value. For
TokenError, TokenBadEscape, TokenBadString, and TokenBadURI, the
TokenExtraError type carries an `error` with informative text about the nature
of the error. For TokenNumber, TokenPercentage, and TokenDimension, the
TokenExtraNumeric specifies whether the number is integral, and for
TokenDimension, contains the unit string (e.g. "px"). For TokenUnicodeRange,
the TokenExtraUnicodeRange type contains the actual start and end values of the
range.

Note: the tokenizer performs only lexical analysis; it implements Section 4 of
the CSS Syntax Level 3 specification and none of the parsing rules. See
Section 5 for those.
*/
package tokenizer
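
Given the .Extra contract above, a consumer would type-switch on the payload. The sketch below is written under stated assumptions: the type names come from the documentation, but the field names (Unit, Start, End) and the pointer cases are guesses for illustration only.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/gorilla/css/tokenizer"
)

// describeExtra sketches dispatching on Token.Extra. The type names come
// from the package documentation; the field names (Unit, Start, End) are
// assumptions for illustration only.
func describeExtra(tok tokenizer.Token) {
	switch e := tok.Extra.(type) {
	case *tokenizer.TokenExtraError:
		fmt.Println("error detail:", e)
	case *tokenizer.TokenExtraNumeric:
		fmt.Println("dimension unit (if any):", e.Unit)
	case *tokenizer.TokenExtraUnicodeRange:
		fmt.Println("range:", e.Start, e.End)
	case nil:
		// most token types carry no extra payload
	}
}

func main() {
	tz := tokenizer.NewTokenizer(strings.NewReader("u+0100-01ff 12px"))
	for {
		tok := tz.Next()
		if tok.Type.StopToken() {
			break
		}
		describeExtra(tok)
	}
}
```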
108 changes: 108 additions & 0 deletions tokenizer/fuzz.go
@@ -0,0 +1,108 @@
// Copyright 2018 Kane York.

package tokenizer

import (
	"bytes"
	"fmt"
	"io"
	"reflect"
)

// Tests should set this to true to suppress fuzzer output except on failure.
var fuzzNoPrint = false

// Fuzz is the entry point for go-fuzz.
func Fuzz(b []byte) int {
	success := false

	var testLogBuf bytes.Buffer
	fuzzPrintf := func(f string, v ...interface{}) {
		fmt.Fprintf(&testLogBuf, f, v...)
	}
	defer func() {
		if !success {
			fmt.Print(testLogBuf.String())
		}
	}()
	fuzzPrintf("=== Start fuzz test ===\n%s\n", b)

	var tokens []Token
	tz := NewTokenizer(bytes.NewReader(b))
	for {
		tt := tz.Next()
		fuzzPrintf("[OT] %v\n", tt)
		if tt.Type == TokenError {
			// We should not have reading errors
			panic(tt)
		} else if tt.Type == TokenEOF {
			break
		} else {
			tokens = append(tokens, tt)
		}
	}

	// Render and retokenize

	var wr TokenRenderer
	var rerenderBuf bytes.Buffer
	defer func() {
		if !success {
			fuzzPrintf("RE-RENDER BUFFER:\n%s\n", rerenderBuf.String())
		}
	}()
	pr, pw := io.Pipe()
	defer pr.Close()

	go func() {
		writeTarget := io.MultiWriter(&rerenderBuf, pw)
		for _, v := range tokens {
			wr.WriteTokenTo(writeTarget, v)
		}
		pw.Close()
	}()

	tz = NewTokenizer(pr)
	i := 0
	for {
		// Skip comments in the original stream; they are not compared.
		for i < len(tokens) && tokens[i].Type == TokenComment {
			i++
		}
		tt := tz.Next()
		fuzzPrintf("[RT] %v\n", tt)
		if tt.Type == TokenComment {
			// Ignore comments while comparing
			continue
		}
		if tt.Type == TokenError {
			panic(tt)
		}
		if tt.Type == TokenEOF {
			if i != len(tokens) {
				panic(fmt.Sprintf("unexpected EOF: got EOF from retokenizer, but original token stream is at %d/%d\n%v", i, len(tokens), tokens))
			}
			break
		}
		if i == len(tokens) {
			panic(fmt.Sprintf("expected EOF: reached end of original token stream but got %v from retokenizer\n%v", tt, tokens))
		}

		ot := tokens[i]
		if tt.Type != ot.Type {
			panic(fmt.Sprintf("retokenizer gave %v, expected %v (.Type not equal)\n%v", tt, ot, tokens))
		}
		if tt.Value != ot.Value && !tt.Type.StopToken() {
			panic(fmt.Sprintf("retokenizer gave %v, expected %v (.Value not equal)\n%v", tt, ot, tokens))
		}
		if TokenExtraTypeLookup[tt.Type] != nil {
			if !reflect.DeepEqual(tt, ot) && !tt.Type.StopToken() {
				panic(fmt.Sprintf("retokenizer gave %v, expected %v (.Extra not equal)\n%v", tt, ot, tokens))
			}
		}
		i++
	}
	success = true
	return 1
}
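
The fuzzNoPrint flag above and the ignored testdata/fuzz directory suggest the saved corpus is also replayed under go test. Below is a hypothetical sketch of such a test; the corpus path and the test name are assumptions, not taken from this diff.

```go
package tokenizer

import (
	"io/ioutil"
	"path/filepath"
	"testing"
)

// TestFuzzCorpus (hypothetical) replays saved go-fuzz inputs through Fuzz
// so regressions surface in ordinary `go test` runs; Fuzz panics on any
// round-trip mismatch.
func TestFuzzCorpus(t *testing.T) {
	fuzzNoPrint = true // quiet fuzzer logging, per the flag's doc comment
	files, err := filepath.Glob("testdata/fuzz/corpus/*")
	if err != nil {
		t.Fatal(err)
	}
	for _, name := range files {
		data, err := ioutil.ReadFile(name)
		if err != nil {
			t.Fatal(err)
		}
		Fuzz(data)
	}
}
```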