Auto merge of #176 - servo:1.0, r=asajeffrey
Version 1.0, rewrite the data structure and API

**Do not merge yet**, I’d like to get feedback on API design.

This is a large rewrite of the data structures and API that aims to eventually become version 1.0.0. Motivations are:

* I designed rust-url when I was relatively new to Rust, coming from Python where memory allocation and ownership are not concerns. The URL Standard also writes algorithms to optimize for human comprehension, not computer resources. As a result, each component of the `Url` struct is a `String` or a `Vec<String>`.
* At the same time, I tried to use enums to enforce some invariants in the type system. For example, “non-relative” URLs like `data:` don’t have a host or port number. In practice I think this makes the API more of a pain to use than it’s worth.
* The spec [has changed](whatwg/url@b266a43) a lot (notably around dealing with protocols/schemes outside the small list it knows about), and I’m not sure the API can be adapted backward-compatibly anyway.

I think the API can be more "rusty", and a bunch of long-standing issues can be fixed along the way. The big idea is that the `Url` struct now contains a single `String` of the URL’s serialization (the only heap memory allocation), plus some indices into it (and some other meta-data). The indices allow accessor methods for URL components to return `&str` in O(1) time.
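To make the layout concrete, here is a minimal sketch of the idea, not the actual implementation. `MiniUrl`, its field names, and the toy parser (which only locates `:`, `?` and `#` and does no validation) are all hypothetical.

```rust
/// Hypothetical, heavily simplified stand-in for the new layout:
/// one owned serialization plus byte offsets into it.
pub struct MiniUrl {
    serialization: String,
    scheme_end: usize,             // index of the ':' after the scheme
    query_start: Option<usize>,    // index of '?', if any
    fragment_start: Option<usize>, // index of '#', if any
}

impl MiniUrl {
    /// Toy parser (assumption): records offsets only, no normalization.
    pub fn parse(input: &str) -> Option<MiniUrl> {
        let scheme_end = input.find(':')?;
        let fragment_start = input.find('#');
        let before_fragment = &input[..fragment_start.unwrap_or(input.len())];
        let query_start = before_fragment.find('?');
        Some(MiniUrl {
            serialization: input.to_owned(),
            scheme_end,
            query_start,
            fragment_start,
        })
    }

    /// The whole serialization, borrowed.
    pub fn as_str(&self) -> &str {
        &self.serialization
    }

    /// Each accessor is a checked slice of the single string: no allocation.
    pub fn scheme(&self) -> &str {
        &self.serialization[..self.scheme_end]
    }

    pub fn query(&self) -> Option<&str> {
        let start = self.query_start? + 1;
        let end = self.fragment_start.unwrap_or(self.serialization.len());
        Some(&self.serialization[start..end])
    }

    pub fn fragment(&self) -> Option<&str> {
        self.fragment_start.map(|i| &self.serialization[i + 1..])
    }
}
```

Every accessor borrows from the one `String`, so reading a component costs a bounds check and a slice rather than a clone.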

The fields of `Url` are now private, as they need to maintain a number of invariants. Getting invariants wrong is not a memory-safety hazard (e.g. string indexing is still checked) but can cause nonsensical results.

Rather than multiple ad-hoc methods like `serialize_authority` and `serialize_without_fragment`, `Url` implements indexing with ranges of `Position`, which is an enum indicating a position between URL components. For example, `serialize_without_fragment` is now `&url[..Position::AfterQuery]`, takes O(1) time, and returns a `&str` slice.
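The range-indexing idea can be sketched with `std::ops::Index`. This is an illustration under assumptions: the `MiniUrl` type, its toy parser, and the two-variant `Position` enum are hypothetical, and only `..Position` ranges are handled here.

```rust
use std::ops::{Index, RangeTo};

/// Hypothetical subset of the positions between URL components.
pub enum Position {
    AfterScheme,
    AfterQuery,
}

/// Hypothetical simplified URL: one string plus two offsets.
pub struct MiniUrl {
    serialization: String,
    scheme_end: usize,             // index of ':'
    fragment_start: Option<usize>, // index of '#', if any
}

impl MiniUrl {
    /// Toy parser (assumption): only records where ':' and '#' are.
    pub fn parse(input: &str) -> Option<MiniUrl> {
        Some(MiniUrl {
            serialization: input.to_owned(),
            scheme_end: input.find(':')?,
            fragment_start: input.find('#'),
        })
    }

    /// Translate a symbolic position into a byte offset.
    fn offset(&self, position: Position) -> usize {
        match position {
            Position::AfterScheme => self.scheme_end,
            Position::AfterQuery => {
                self.fragment_start.unwrap_or(self.serialization.len())
            }
        }
    }
}

/// `&url[..position]` borrows a prefix of the serialization in O(1).
impl Index<RangeTo<Position>> for MiniUrl {
    type Output = str;

    fn index(&self, range: RangeTo<Position>) -> &str {
        &self.serialization[..self.offset(range.end)]
    }
}
```

With this in place, `&url[..Position::AfterQuery]` yields everything before the `#`, which is what a "serialize without fragment" helper would otherwise have to allocate.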

While we’re making breaking changes, this is an opportunity to clean up APIs that didn’t turn out so great:

* Drop `UrlParser`. The builder pattern works well for `std::process::Command`, but for URLs in practice almost only `base_url` is ever used. v0.5.2 already has [`base_url.join(str)`](https://github.com/servo/rust-url/blob/v0.5.2/src/lib.rs#L867-L874), which stays the same.
* Drop all the `lossy_percent_decode_*` methods. They don’t seem to be used at all. Maybe accessors could return a `PercentEncodedStr` wrapper that dereferences to `&str` and has a `percent_decode()` method?
* In the `percent_encoding` module, there were “duplicated” APIs that returned a new `String` or `Vec<u8>` for convenience, or pushed to an existing `&mut String` or `&mut Vec<u8>` for flexibility. This was a bit awkward, so I’ve changed it to return iterators of `char` or `u8` which can be used with `.extend(…)` or `.collect()`. I’m not sure this is an improvement and might revert it. Feedback appreciated.
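As a rough illustration of the iterator-returning style from the last bullet, the sketch below percent-encodes bytes lazily. The function name, signature, and fixed "unreserved" set are assumptions made for this example, not the module's actual API.

```rust
/// Hypothetical helper: percent-encode every byte outside a fixed set
/// (ASCII letters, digits, '-', '.', '_', '~'), yielding `char`s lazily.
/// The fixed set is an assumption made to keep the sketch short.
pub fn percent_encode(input: &[u8]) -> impl Iterator<Item = char> + '_ {
    const HEX: &[u8; 16] = b"0123456789ABCDEF";
    input.iter().flat_map(|&byte| {
        let keep = byte.is_ascii_alphanumeric()
            || matches!(byte, b'-' | b'.' | b'_' | b'~');
        // One to three characters per input byte, without allocating.
        let encoded = if keep {
            [Some(byte as char), None, None]
        } else {
            [
                Some('%'),
                Some(HEX[(byte >> 4) as usize] as char),
                Some(HEX[(byte & 0x0F) as usize] as char),
            ]
        };
        IntoIterator::into_iter(encoded).flatten()
    })
}
```

The caller chooses the destination: `let s: String = percent_encode(bytes).collect();` builds a new string, while `existing.extend(percent_encode(bytes));` appends to one that already exists.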

This new design is great for accessing components, but not so much for modifying them. We need to make changes in the middle of the string, and with percent-encoding we don’t know the encoded size of the new thing until we’ve encoded it. Then we need to fix up indices. As a result, the `set_*` methods have lots of fiddly code where it’s easy to make subtle mistakes. This code should be well-tested before this PR is merged, but it is not tested at all yet.
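The index fix-up problem can be shown with a toy setter. This is a sketch under assumptions: `MiniUrl` and its fields are hypothetical, and `new_query` is taken to be already percent-encoded, so the hard part of the real setters is skipped.

```rust
/// Hypothetical simplified URL: one string plus two offsets.
pub struct MiniUrl {
    serialization: String,
    query_start: Option<usize>,    // index of '?', if any
    fragment_start: Option<usize>, // index of '#', if any
}

impl MiniUrl {
    /// Toy parser (assumption): only records where '?' and '#' are.
    pub fn parse(input: &str) -> MiniUrl {
        let fragment_start = input.find('#');
        let query_start =
            input[..fragment_start.unwrap_or(input.len())].find('?');
        MiniUrl { serialization: input.to_owned(), query_start, fragment_start }
    }

    pub fn as_str(&self) -> &str {
        &self.serialization
    }

    pub fn fragment(&self) -> Option<&str> {
        self.fragment_start.map(|i| &self.serialization[i + 1..])
    }

    /// Replace or remove the query, then repair the offsets after it.
    pub fn set_query(&mut self, new_query: Option<&str>) {
        let old_end = self.fragment_start.unwrap_or(self.serialization.len());
        let old_start = self.query_start.unwrap_or(old_end);
        let replacement = match new_query {
            Some(query) => format!("?{}", query),
            None => String::new(),
        };
        // Splice into the middle of the single string.
        self.serialization.replace_range(old_start..old_end, &replacement);
        self.query_start = new_query.map(|_| old_start);
        // The fragment moved; forgetting this line leaves a stale index.
        if let Some(fragment_start) = self.fragment_start.as_mut() {
            *fragment_start = old_start + replacement.len();
        }
    }
}
```

Even in this toy version, every offset that lies after the edited span has to be recomputed by hand, which is where the subtle mistakes tend to creep in.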

To see what the new API looks like more easily than by reading the code, you can find the output of `cargo doc` at https://simonsapin.github.io/rust-url-1.0-docs/url/

To see what the API usage is like in practice, I’ve updated some crates that use rust-url for this PR:

* https://github.com/alexcrichton/git2-rs/pull/115
* rust-lang/cargo#2428
* rwf2/cookie-rs#42
* hyperium/hyper#740
* https://github.com/cyderize/rust-websocket/pull/70
* servo/servo#9840

Since this PR isn’t merged yet, to do the same in your crates you’ll need a clone of the git repository at the right branch:

```
git clone https://github.com/servo/rust-url --branch 1.0
```

… and a Cargo [path override](http://doc.crates.io/guide.html#overriding-dependencies) in the directory of your crate:

`.cargo/config`
```
paths = [
    "../rust-url",
]
```

Since we’re making breaking changes, this is the opportunity to make *more* (or different) breaking changes before version 1.0 is published and we commit to this new API.

Please comment here to say what your pain points with the 0.5.x API are, what could be improved, what I’ve done in this PR that doesn’t seem like a good idea, etc.

Thanks!

bors-servo committed Apr 20, 2016
2 parents be00f8f + 4a59d93 commit e33f72f
Showing 40 changed files with 9,345 additions and 3,259 deletions.
4 changes: 2 additions & 2 deletions .gitignore
@@ -1,3 +1,3 @@
/target
/Cargo.lock
target
Cargo.lock
/.cargo/config
48 changes: 15 additions & 33 deletions Cargo.toml
@@ -1,8 +1,8 @@
[package]

name = "url"
version = "0.5.9"
authors = [ "Simon Sapin <[email protected]>" ]
version = "1.0.0"
authors = ["The rust-url developers"]

description = "URL library for Rust, based on the WHATWG URL Standard"
documentation = "http://servo.github.io/rust-url/url/index.html"
@@ -12,46 +12,28 @@ keywords = ["url", "parser"]
license = "MIT/Apache-2.0"

[[test]]
name = "format"
[[test]]
name = "form_urlencoded"
[[test]]
name = "idna"
[[test]]
name = "punycode"
[[test]]
name = "tests"
name = "unit"

[[test]]
name = "wpt"
name = "data"
harness = false

[lib]
test = false

[dev-dependencies]
rustc-test = "0.1"
rustc-serialize = "0.3"

[features]
query_encoding = ["encoding"]
serde_serialization = ["serde"]
heap_size = ["heapsize", "heapsize_plugin"]

[dependencies.heapsize]
version = ">=0.1.1, <0.4"
optional = true

[dependencies.heapsize_plugin]
version = "0.1.0"
optional = true

[dependencies.encoding]
version = "0.2"
optional = true

[dependencies.serde]
version = ">=0.6.1, <0.8"
optional = true

[dependencies]
uuid = { version = "0.2", features = ["v4"] }
rustc-serialize = "0.3"
unicode-bidi = "0.2.3"
unicode-normalization = "0.1.2"
idna = { version = "0.1.0", path = "./idna" }
heapsize = {version = ">=0.1.1, <0.4", optional = true}
heapsize_plugin = {version = "0.1.0", optional = true}
encoding = {version = "0.2", optional = true}
serde = {version = ">=0.6.1, <0.8", optional = true}
rustc-serialize = {version = "0.3", optional = true}
matches = "0.1"
3 changes: 1 addition & 2 deletions LICENSE-MIT
@@ -1,5 +1,4 @@
Copyright (c) 2006-2009 Graydon Hoare
Copyright (c) 2009-2013 Mozilla Foundation
Copyright (c) 2013-2016 The rust-url developers

Permission is hereby granted, free of charge, to any
person obtaining a copy of this software and associated
4 changes: 1 addition & 3 deletions Makefile
@@ -1,7 +1,5 @@
test:
cargo test --features query_encoding
cargo test --features serde_serialization
cargo test
cargo test --features "query_encoding serde rustc-serialize"
[ x$$TRAVIS_RUST_VERSION != xnightly ] || cargo test --features heap_size

doc:
24 changes: 24 additions & 0 deletions idna/Cargo.toml
@@ -0,0 +1,24 @@
[package]
name = "idna"
version = "0.1.0"
authors = ["The rust-url developers"]
description = "IDNA (Internationalizing Domain Names in Applications) and Punycode."
repository = "https://github.com/servo/rust-url/"
license = "MIT/Apache-2.0"

[lib]
doctest = false
test = false

[[test]]
name = "tests"
harness = false

[dev-dependencies]
rustc-test = "0.1"
rustc-serialize = "0.3"

[dependencies]
unicode-bidi = "0.2.3"
unicode-normalization = "0.1.2"
matches = "0.1"
File renamed without changes.
73 changes: 73 additions & 0 deletions idna/src/lib.rs
@@ -0,0 +1,73 @@
// Copyright 2016 The rust-url developers.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! This Rust crate implements IDNA
//! [per the WHATWG URL Standard](https://url.spec.whatwg.org/#idna).
//!
//! It also exposes the underlying algorithms from [*Unicode IDNA Compatibility Processing*
//! (Unicode Technical Standard #46)](http://www.unicode.org/reports/tr46/)
//! and [Punycode (RFC 3492)](https://tools.ietf.org/html/rfc3492).
//!
//! Quoting from [UTS #46’s introduction](http://www.unicode.org/reports/tr46/#Introduction):
//!
//! > Initially, domain names were restricted to ASCII characters.
//! > A system was introduced in 2003 for internationalized domain names (IDN).
//! > This system is called Internationalizing Domain Names for Applications,
//! > or IDNA2003 for short.
//! > This mechanism supports IDNs by means of a client software transformation
//! > into a format known as Punycode.
//! > A revision of IDNA was approved in 2010 (IDNA2008).
//! > This revision has a number of incompatibilities with IDNA2003.
//! >
//! > The incompatibilities force implementers of client software,
//! > such as browsers and emailers,
//! > to face difficult choices during the transition period
//! > as registries shift from IDNA2003 to IDNA2008.
//! > This document specifies a mechanism
//! > that minimizes the impact of this transition for client software,
//! > allowing client software to access domains that are valid under either system.
#[macro_use] extern crate matches;
extern crate unicode_bidi;
extern crate unicode_normalization;

pub mod punycode;
pub mod uts46;

/// The [domain to ASCII](https://url.spec.whatwg.org/#concept-domain-to-ascii) algorithm.
///
/// Return the ASCII representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and using Punycode as necessary.
///
/// This process may fail.
pub fn domain_to_ascii(domain: &str) -> Result<String, uts46::Errors> {
uts46::to_ascii(domain, uts46::Flags {
use_std3_ascii_rules: false,
transitional_processing: true, // XXX: switch when Firefox does
verify_dns_length: false,
})
}

/// The [domain to Unicode](https://url.spec.whatwg.org/#concept-domain-to-unicode) algorithm.
///
/// Return the Unicode representation of a domain name,
/// normalizing characters (upper-case to lower-case and other kinds of equivalence)
/// and decoding Punycode as necessary.
///
/// This may indicate [syntax violations](https://url.spec.whatwg.org/#syntax-violation)
/// but always returns a string for the mapped domain.
pub fn domain_to_unicode(domain: &str) -> (String, Result<(), uts46::Errors>) {
uts46::to_unicode(domain, uts46::Flags {
use_std3_ascii_rules: false,

// Unused:
transitional_processing: true,
verify_dns_length: false,
})
}
7 changes: 3 additions & 4 deletions make_idna_table.py → idna/src/make_uts46_mapping_table.py
@@ -1,18 +1,17 @@
# Copyright 2013-2014 Valentin Gosu.
# Copyright 2013-2014 The rust-url developers.
#
# Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
# http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
# <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.


# Run as: python make_idna_table.py idna_table.txt > src/idna_table.rs
# Run as: python make_uts46_mapping_table.py IdnaMappingTable.txt > uts46_mapping_table.rs
# You can get the latest idna table from
# http://www.unicode.org/Public/idna/latest/IdnaMappingTable.txt

print('''\
// Copyright 2013-2014 Valentin Gosu.
// Copyright 2013-2014 The rust-url developers.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
17 changes: 8 additions & 9 deletions src/punycode.rs → idna/src/punycode.rs
@@ -1,4 +1,4 @@
// Copyright 2013 Simon Sapin.
// Copyright 2013 The rust-url developers.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
@@ -185,11 +185,11 @@ pub fn encode(input: &[char]) -> Option<String> {
break
}
let value = t + ((q - t) % (BASE - t));
value_to_digit(value, &mut output);
output.push(value_to_digit(value));
q = (q - t) / (BASE - t);
k += BASE;
}
value_to_digit(q, &mut output);
output.push(value_to_digit(q));
bias = adapt(delta, processed + 1, processed == basic_length);
delta = 0;
processed += 1;
@@ -203,11 +203,10 @@ pub fn encode(input: &[char]) -> Option<String> {


#[inline]
fn value_to_digit(value: u32, output: &mut String) {
let code_point = match value {
0 ... 25 => value + 0x61, // a..z
26 ... 35 => value - 26 + 0x30, // 0..9
fn value_to_digit(value: u32) -> char {
match value {
0 ... 25 => (value as u8 + 'a' as u8) as char, // a..z
26 ... 35 => (value as u8 - 26 + '0' as u8) as char, // 0..9
_ => panic!()
};
unsafe { output.as_mut_vec().push(code_point as u8) }
}
}
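For readers who want to try the refactored helper in isolation, here is a standalone sketch of the new `value_to_digit` shape. It uses the modern `..=` range syntax instead of the `...` in the hunk above, and the panic message is an addition for this example.

```rust
/// Map a Punycode digit value (0 to 35) to its ASCII character.
/// Returning a `char` lets the caller `push` it onto a `String`
/// without any `unsafe` access to the underlying bytes.
fn value_to_digit(value: u32) -> char {
    match value {
        0..=25 => (value as u8 + b'a') as char,       // a..z
        26..=35 => (value as u8 - 26 + b'0') as char, // 0..9
        _ => panic!("not a Punycode digit value: {}", value),
    }
}
```

A caller then writes `output.push(value_to_digit(q));`, as the `encode` hunk does.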