Rust's String is UTF-8 but char is 32 bits. Doesn't that cost performance?


office-windows11 posted on 2023-06-24 09:12

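For example, when iterating over a string, Rust has to do countless encoding conversions compared with C/C++. Doesn't that cost performance?

For English text, UTF-8 to UTF-32 is simple, essentially widening a byte to an int;

but an emoji takes about 4 UTF-8 bytes and a Chinese character about 3, so converting those to UTF-32 is somewhat more expensive.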
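
To make the sizes in the question concrete, here is a minimal sketch that reports how many UTF-8 bytes each character of a string occupies. The helper name `utf8_lengths` is made up for illustration; `chars` and `len_utf8` are standard library methods.

```rust
/// For each character of `s`, return the character and the number of
/// bytes its UTF-8 encoding occupies (1 to 4).
/// `utf8_lengths` is a hypothetical helper, not a std function.
fn utf8_lengths(s: &str) -> Vec<(char, usize)> {
    s.chars().map(|c| (c, c.len_utf8())).collect()
}
```

Calling `utf8_lengths("a汉😀")` gives 1, 3 and 4 bytes respectively, while `std::mem::size_of::<char>()` is always 4 bytes (32 bits).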

Comments


munpf 2023-07-05 15:44

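Today I happened to come across an article on Bilibili, "Rust WebAssembly性能的真相" (The Truth About Rust WebAssembly Performance). The first comment under it sums things up: "Rust wasm is no longer slower than many JS frameworks. The main performance problem today is not the lack of direct access to the DOM API; it is that JS strings are encoded as UTF-16 while Rust uses UTF-8, so re-encoding strings on every conversion is a major source of performance loss." So if you have to convert to UTF-16, or process the text as Unicode characters, there is indeed a cost. But if you can guarantee that everything is ASCII, you can simply process the bytes directly.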
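
A minimal sketch of the "process bytes directly when everything is ASCII" idea. The function name `count_uppercase` is an assumption chosen for illustration; `is_ascii`, `bytes` and `chars` are standard library methods. Whether the byte path is measurably faster depends on the workload, so treat this as a shape, not a benchmark.

```rust
/// Count uppercase letters in `s`.
/// Hypothetical helper: takes the byte-wise path when the input is pure
/// ASCII and falls back to decoding `char`s otherwise.
fn count_uppercase(s: &str) -> usize {
    if s.is_ascii() {
        // Every byte is one character, so no UTF-8 decoding is needed.
        s.bytes().filter(|b| b.is_ascii_uppercase()).count()
    } else {
        // General case: decode code points and use the Unicode property.
        s.chars().filter(|c| c.is_uppercase()).count()
    }
}
```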

--
👇
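munpf: Sorry, I was dizzy earlier and misread the question. I remember an article that mentioned the problem you describe, though I have forgotten which one. The gist, roughly, was that passing strings between Rust and JS takes a lot of time.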

GUO 2023-07-03 15:35

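This is a project design issue. If you keep converting back and forth, of course you pay for it. Use UTF-8 consistently inside the program and whatever you like at the boundary, so each string is converted at most once. For example, text shown in a UI may have to be converted to UTF-16, and that cost is negligible.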
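
The "convert once at the boundary" design could look roughly like the sketch below. The two function names are hypothetical; `encode_utf16` and `String::from_utf16` are the standard library calls doing the work.

```rust
use std::string::FromUtf16Error;

/// Encode once, on the way out to a UTF-16 API (for example a UI layer).
/// Hypothetical boundary helper.
fn to_utf16_boundary(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Decode once, when UTF-16 text enters the program. Fails on unpaired
/// surrogates, which cannot be represented in a Rust `String`.
fn from_utf16_boundary(units: &[u16]) -> Result<String, FromUtf16Error> {
    String::from_utf16(units)
}
```

Everything between those two calls can stay in `String` / `&str` and never touches UTF-16.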

Aya0wind 2023-06-25 10:09

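Go, for example, also has a rune type, also 4 bytes, for the same reason as Rust's char. You can still process the string byte by byte when you know it is ASCII, with no performance loss.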

--
👇
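Aya0wind: char is for when you work with individual characters of a UTF-8 string. Merely passing strings around or concatenating them does not require converting to char. Because UTF-8 is a variable-length encoding, the same holds in whatever language you pick.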

Aya0wind 2023-06-25 10:03

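char is for when you work with individual characters of a UTF-8 string. Merely passing strings around or concatenating them does not require converting to char. Because UTF-8 is a variable-length encoding, the same holds in whatever language you pick.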
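
As an illustration of that point, concatenation in Rust is a byte copy. In the sketch below (the helper `join_with_space` is hypothetical) no `char` is ever decoded from the input pieces; `len`, `push` and `push_str` all work on the UTF-8 bytes.

```rust
/// Join `parts` with single spaces by copying bytes.
/// Hypothetical helper: none of the pieces is decoded into `char`s.
fn join_with_space(parts: &[&str]) -> String {
    // `len()` is the length in bytes, so the capacity can be computed
    // without looking at the contents.
    let total: usize =
        parts.iter().map(|p| p.len()).sum::<usize>() + parts.len().saturating_sub(1);
    let mut out = String::with_capacity(total);
    for (i, p) in parts.iter().enumerate() {
        if i > 0 {
            out.push(' ');
        }
        out.push_str(p); // plain memcpy of the UTF-8 bytes
    }
    out
}
```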

7sDream 2023-06-25 01:50

Recommended reading: http://utf8everywhere.org/zh-cn

Mike Tang 2023-06-25 00:13

Rust is flexible here and offers several options. In some situations you can use [char] or Vec<char>.
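
A small sketch of that trade-off, with hypothetical helper names: decoding once into a `Vec<char>` costs 4 bytes per character but gives constant-time indexing, whereas indexing by character on a `&str` has to walk the bytes from the start.

```rust
/// O(n): walks the UTF-8 bytes from the start up to the n-th code point.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Decode once up front; afterwards `v[i]` is O(1), at 4 bytes per element.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For `"a汉😀"` the `&str` occupies 8 bytes, while the `Vec<char>` holds 3 elements of 4 bytes each.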

munpf 2023-06-24 16:06

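Sorry, I was dizzy earlier and misread the question. I remember an article that mentioned the problem you describe, though I have forgotten which one. The gist, roughly, was that passing strings between Rust and JS takes a lot of time.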

--
👇
office-windows11: Take a look at the Rust source code: getting a char out of a string is a fairly involved process.


Author office-windows11 2023-06-24 10:07

Take a look at the Rust source code: getting a char out of a string is a fairly involved process.

lib/rustlib/src/rust/library/core/src/str/validations.rs

pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
    // Decode UTF-8
    let x = *bytes.next()?;
    if x < 128 {
        return Some(x as u32);
    }

    // Multibyte case follows
    // Decode from a byte combination out of: [[[x y] z] w]
    // NOTE: Performance is sensitive to the exact formulation here
    let init = utf8_first_byte(x, 2);
    // SAFETY: `bytes` produces an UTF-8-like string,
    // so the iterator must produce a value here.
    let y = unsafe { *bytes.next().unwrap_unchecked() };
    let mut ch = utf8_acc_cont_byte(init, y);
    if x >= 0xE0 {
        // [[x y z] w] case
        // 5th bit in 0xE0 .. 0xEF is always clear, so `init` is still valid
        // SAFETY: `bytes` produces an UTF-8-like string,
        // so the iterator must produce a value here.
        let z = unsafe { *bytes.next().unwrap_unchecked() };
        let y_z = utf8_acc_cont_byte((y & CONT_MASK) as u32, z);
        ch = init << 12 | y_z;
        if x >= 0xF0 {
            // [x y z w] case
            // use only the lower 3 bits of `init`
            // SAFETY: `bytes` produces an UTF-8-like string,
            // so the iterator must produce a value here.
            let w = unsafe { *bytes.next().unwrap_unchecked() };
            ch = (init & 7) << 18 | utf8_acc_cont_byte(y_z, w);
        }
    }

    Some(ch)
}

In C#, Java and JavaScript, a string is an array of UTF-16 code units;

in the newer generation of languages (Rust, Go, Swift), a string is an array of UTF-8 bytes.
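
For readers who want to experiment with the decoding logic without `unsafe` or the private helpers used above, here is a simplified safe sketch. It is not the standard library's implementation: `decode_first` and `decode_all` are hypothetical names, and the code assumes its input is already valid UTF-8 (as the bytes of a `&str` always are), so it does no validation.

```rust
/// Decode the first code point of `bytes`, assuming valid UTF-8.
/// Returns the code point and the number of bytes consumed.
/// Illustrative only; not the std implementation.
fn decode_first(bytes: &[u8]) -> Option<(u32, usize)> {
    let x = *bytes.first()?;
    if x < 0x80 {
        return Some((x as u32, 1)); // ASCII fast path: the byte is the code point
    }
    // Each continuation byte contributes its low 6 bits.
    let cont = |i: usize| bytes.get(i).map(|b| (b & 0x3F) as u32);
    if x < 0xE0 {
        // 110xxxxx 10yyyyyy
        Some((((x & 0x1F) as u32) << 6 | cont(1)?, 2))
    } else if x < 0xF0 {
        // 1110xxxx 10yyyyyy 10zzzzzz
        Some((((x & 0x0F) as u32) << 12 | cont(1)? << 6 | cont(2)?, 3))
    } else {
        // 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
        Some((((x & 0x07) as u32) << 18 | cont(1)? << 12 | cont(2)? << 6 | cont(3)?, 4))
    }
}

/// Decode every code point of `s` with `decode_first`.
fn decode_all(s: &str) -> Vec<u32> {
    let mut bytes = s.as_bytes();
    let mut out = Vec::new();
    while let Some((cp, n)) = decode_first(bytes) {
        out.push(cp);
        bytes = &bytes[n..];
    }
    out
}
```

The work per character is one branch for ASCII and a few shifts and masks for longer sequences, and the result agrees with what `s.chars()` yields.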


--  
👇  
munpf: Just iterate over the bytes.

munpf 2023-06-24 09:37

Just iterate over the bytes.
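
Iterating over bytes also works on non-ASCII text as long as you are looking for ASCII values, because every byte of a multi-byte UTF-8 sequence is 0x80 or above and can never be mistaken for an ASCII byte. A minimal sketch, with a hypothetical helper name:

```rust
/// Count occurrences of an ASCII delimiter by scanning raw bytes.
/// Hypothetical helper; correct on any valid UTF-8 input because bytes
/// belonging to multi-byte sequences are all >= 0x80.
fn count_ascii_byte(s: &str, delim: u8) -> usize {
    debug_assert!(delim.is_ascii());
    s.bytes().filter(|&b| b == delim).count()
}
```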
