# OS

**Repository Path**: lagrangetay/os

## Basic Information

- **Project Name**: OS
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-03-21
- **Last Updated**: 2021-04-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# uCore plus

<!-- @import "[TOC]" {cmd="toc" depthFrom=2 depthTo=4 orderedList=true} -->

<!-- code_chunk_output -->

1. [uCore 探究部分](#ucore-探究部分)
    1. [第一个用户进程 sh](#第一个用户进程-sh)
    2. [对 top 指令的结果分析](#对-top-指令的结果分析)
    3. [对用户栈的研究](#对用户栈的研究)
    4. [时钟中断](#时钟中断)
2. [用户态内核进程实现](#用户态内核进程实现)
    1. [linux 实现参考](#linux-实现参考)
    2. [uCore 实现](#ucore-实现)
        1. [PCB 支持](#pcb-支持)
        2. [clone 调用](#clone-调用)
        3. [祖宗线程 exit 时守护](#祖宗线程-exit-时守护)
        4. [线程一致性维护](#线程一致性维护)
3. [用户态进程是如何退出的](#用户态进程是如何退出的)
4. [为何代码不从 0 地址开始](#为何代码不从-0-地址开始)

<!-- /code_chunk_output -->

## uCore 探究部分

### 第一个用户进程 sh

第一个用户进程是从 `int pid = kernel_thread(user_main, NULL, 0);` 而来。

```c
int
kernel_thread(int (*fn)(void *), void *arg, uint32_t clone_flags) {
    struct trapframe tf;
    memset(&tf, 0, sizeof(struct trapframe));
    tf.tf_cs = KERNEL_CS;
    tf.tf_ds = tf.tf_es = tf.tf_ss = KERNEL_DS;
    tf.tf_regs.reg_ebx = (uint32_t)fn;
    tf.tf_regs.reg_edx = (uint32_t)arg;
    tf.tf_eip = (uint32_t)kernel_thread_entry;
    return do_fork(clone_flags | CLONE_VM, 0, &tf);
}
```

送了这个即将执行的 user_main 函数一个内核栈帧和一个 kernel_thread_entry 函数（该函数定义 user_main 执行完成后会自动调用 do_wait） 然后就把 user_main 进程给 fork 出来了（此时还在内核态）。

最后 user_main 进程执行了 user_main 函数，引发一个中断调用 exec，把这个进程掏空，该进程变成了进程 sh ，注意在从文件系统载入代码的 load_icode 函数中，在中断栈帧中把 user_main 的段子换成了用户态的段子，只要中断一返回，该进程回到了用户态

虽然上面定义了 user_main 执行完成后会自动调用 do_wait ，但一个 exec 调用过后， eip 指针的值已经被修改，sh 已经和 kernel_thread_entry 没关系了。 所以最后 sh 若要退出需要通过中断调用（用户进程也没法直接执行 do_wait）, 具体如何进行的中断调用仍需要研究。

### 对 top 指令的结果分析

```txt
the num of process is 4
process id:0 name:idle state:RUNNABLE
        running times: 123402
        need resched by os?: 0
        parent pid: 0
        time slice runing: 0
        vruntime of cfs sched: 0
        cfs_prior: 1
process id:3 name:top state:RUNNABLE
        running times: 2
        need resched by os?: 0
        parent pid: 2
        time slice runing: 5
        vruntime of cfs sched: 6
        cfs_prior: 1
process id:2 name:sh state:SLEEPING
        running times: 11
        need resched by os?: 0
        parent pid: 1
        time slice runing: 5
        vruntime of cfs sched: 6
        cfs_prior: 1
process id:1 name:init state:SLEEPING
        running times: 1
        need resched by os?: 0
        parent pid: 0
        time slice runing: 5
        vruntime of cfs sched: 0
        cfs_prior: 1
```

进程数量等于 os 内的全局变量 nr_process 这个变量包含了 idle 进程所以总共 4 个没问题。

init 进程从其创建，第一次执行 do_wait 的时候，就因为其有 user_main（该进程创造 sh 后结束） 这个进程而陷入了 sleep ，只要其子进程（sh）不死（就是变 Zombie，sleep 都没用），永远不会唤醒该 init 进程来回收资源（其他父线程托孤的时候可能唤醒 init）。

```c
while (do_wait(0, NULL) == 0) {
    schedule();
}

if (haskid) {
    current->state = PROC_SLEEPING;
    current->wait_state = WT_CHILD;
    schedule();
    if (current->flags & PF_EXITING) {
        do_exit(-E_KILLED);
    }
    goto repeat;
}
```

idle 的调度是被特判过的, 运行队列里面没有 idleproc ， OS 在没有进程可供调度的时候选择进程 idleproc，自然其 vruntime 不会被计算

```c
static void
sched_class_proc_tick(struct proc_struct *proc) {
    if (proc != idleproc) {
        sched_class->proc_tick(rq, proc);
    }
    else {
        proc->need_resched = 1;
    }
}
```

sh 是在等待输入的函数 `dev_stdin_read` 主动调用 `schedule()` 放弃时间片，所以 在间隔两次输入`top`指令的中间，sh 一直在等待输入，并没有被调度，所以两次间隔后 idle 的调度次数会巨多因为除了 idle ，ucore 没有其他进程可供调度，但 vruntime 是 0（因为特判），sh 的 vruntime 可能会有小幅 提升因为调用了 top ，中间处理会给 sh 时间，但一旦 top 开始执行， sh 又会因为等 top 执行完毕被挂起，top 执行完成后又在等输入。。。。

### 对用户栈的研究

每个用户进程都设置了用户栈为，栈顶是 USTACKTOP - 1 （第一个地址不是），栈大小是 USTACKSIZE 。调用 mm_mmap 函数建立用户栈的 vma 结构，明确用户栈的位置在用户虚空间的顶端，大小为 256 个页，即 1MB。但这个只是设置了 vma 。下面调用 pgdir_alloc_page 分配了 4 页（16KB）的栈空间给用户。

```c
vm_flags = VM_READ | VM_WRITE | VM_STACK;
if ((ret = mm_map(mm, USTACKTOP - USTACKSIZE, USTACKSIZE, vm_flags, NULL)) != 0) {
    goto bad_cleanup_mmap;
}
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-2*PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-3*PGSIZE , PTE_USER) != NULL);
assert(pgdir_alloc_page(mm->pgdir, USTACKTOP-4*PGSIZE , PTE_USER) != NULL);
```

一旦这 4KB 的栈被用完了，那么就会引起缺页异常。

```c
struct vma_struct *vma = find_vma(mm, addr);
if (vma == NULL || vma->vm_start > addr) {
        cprintf("not valid addr %x, and  can not find it in vma\n", addr);
        goto failed;
    }
ptep = get_pte(mm->pgdir, addr, 1))
pgdir_alloc_page(mm->pgdir, addr, perm)
```

因为 vma 里写的是 1MB 空间，实际只分配给了 16KB ，缺页的时候只要地址在 1MB 的范围内，那么 通过缺页的地址是找得到对应的 vma 的，说明这个是个合法地址，os 会自动根据缺页地址新建页表项并分配 page 。当栈的使用超过 1MB，就会找不到 vma 直接报错，这个行为目前会直接崩掉内核，可用考虑后续的优化。

### 时钟中断

```c
//设置时钟每秒中断100次
    outb(IO_TIMER1, TIMER_DIV(100) % 256);
    outb(IO_TIMER1, TIMER_DIV(100) / 256);
// 通过中断控制器使能时钟中断
    pic_enable(IRQ_TIMER);
```

每 1s 中断 100 次则一个时间片的时长为 10ms ， usleep(100) 睡眠 1s。

## 用户态内核进程实现

### linux 实现参考

**_[How Stack or memory is allocated for threads under the same process in Linux](https://stackoverflow.com/questions/28535838/how-stack-or-memory-is-allocated-for-threads-under-the-same-process-in-linux)_**

The current 'thread' concept in Linux is the NPTL one. NPTL uses `clone()`, which wraps `sys_clone()`. Allocating a stack for a new 'thread' is handled in the user space (ie. libc), not in kernel (ie. Linux). A library can allocate a stack using the allocation of choice (eg. `malloc`) and then call `clone()` passing this address as the stack (of course, needs to pass the top of the allocated region, since stacks grow downwards on most platforms):

Unlike `fork()`, `clone()` allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers....

The main use of `clone()` is to implement threads: multiple threads of control in a program that run concurrently in a shared memory space.

When the child process is created with `clone()`, it executes the function `fn(arg)` ...

The child_stack argument specifies the location of the stack used by the child process ...

If you want to learn more specific details, open the source of your distro [pthread_create](https://man7.org/linux/man-pages/man3/pthread_create.3.html) implementation and get reading.

For example [pthread_create.c](https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/nptl/pthread_create.c):

```c
int
__pthread_create_2_1 (newthread, attr, start_routine, arg)
  ...
  struct pthread *pd = NULL;
  int err = ALLOCATE_STACK (iattr, &pd);
```

and [allocatestack.c](https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/nptl/allocatestack.c):

```c
 # define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)

 static int
 allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
    ALLOCATE_STACK_PARMS)
 ...
```

You'll see that stack allocation has some whistles and bells, like caching and reusing stack regions, guard pages, but in the end is just a memory region allocated in user space.

### uCore 实现

在 linux 中，thread 通过 clone 调用实现，但是分配栈空间是用户空间关心的事，比如使用 malloc 申请之类的。

在 uCore 中并没有实现用户态的 malloc，所以暂时无法使用 malloc 在堆区为一个新的进程开辟一个栈空间，但这并非就是无法实现进程。

于是我们换了另外一种方式就是挪用调用 clone 进程的栈空间。linux 默认为每个进程分配 1M 的栈空间，一旦某个进程调用 clone，我们将该调用进程 1M 的栈空间等分为 16 份，取出一块空闲的栈空间分配给该子线程。针对可能出现递归调用 phthread_create 的情况，不断向上找到第一个调用 clone，然后栈被分割成 16 份的进程，从该进程处获取栈空间。

![stack](images/stack.png)

子线程退出时会向该进程归还所使用的栈空间，所以包含该第一个调用 clone 的进程在内（下文统一称为祖宗线程），同一时间进程里可以存在 16 个线程。

#### PCB 支持

在 PCB 中增加以下三个变量用于支持内核线程

| Type    | name              | meaning                                                                                                                            |
| ------- | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| int     | is_thread         | 标志该进程是否是一个子线程                                                                                                         |
| int     | stack_num         | 标志该子线程占用了父进程的哪一个栈帧，is_thread = 1 才有效                                                                         |
| int [ ] | stack[MAX_THREAD] | 每个主进程能够开启 16 个线程（包括主线程（自己）在内），每个块为 0 表示该块的栈没有被占用，不为 0 表示被占用，且值是该子线程的 pid |

特别说明只有祖宗线程和普通进程 is_thread 为 0，其他子线程该值都是 1。

#### clone 调用

```c
int do_clone(void *(*fn)(void *), void *arg, void (*exit)(int))
{
    int ret = -E_NO_FREE_PROC;
    struct proc_struct *proc;
    if (nr_process >= MAX_PROCESS)
        goto fork_out;

    ret = -E_NO_MEM;

    // 新建一个空进程描述符
    if ((proc = alloc_proc()) == NULL)
        goto fork_out;

    // 设置线程名称为 父线程名称-t
    int i = 0;
    for (i = 0; current->name[i] != '\0'; i++)
        proc->name[i] = current->name[i];
    proc->name[i] = '-';
    proc->name[i + 1] = 't';
    proc->name[i + 2] = '\0';

    proc->is_thread = 1; //标志该进程是一个子线程

    // 针对可能出现递归调用pthread_create的情况，找到不为线程的主进程
    struct proc_struct *father = current;

    while (father->is_thread)
        father = father->parent;

    // 如果不设置线程归属于调用clone的线程，直接指向主线程会导致子线程中没法调用join来等待
    proc->parent = current;

    // 在主线程里面找一块栈分配给该子线程
    proc->stack_num = 1;
    for (; proc->stack_num < MAX_THREAD; proc->stack_num++)
        if (father->stack[proc->stack_num] == 0)
        {
            father->stack[proc->stack_num] = 1;
            break;
        }

    assert(proc->stack_num != MAX_THREAD);

    assert(current->wait_state == 0);

    // 设置内核栈
    if (setup_kstack(proc) != 0)
        goto bad_fork_cleanup_proc;

    // 文件系统直接指向父进程
    if (copy_fs(CLONE_FS, proc) != 0)
        goto bad_fork_cleanup_kstack;

    // mm 直接指向父进程， 一个用户进程有1MB的栈空间
    // 256页，一个线程给16页，包括原有的主线程的话，能开16个线程
    if (copy_mm(CLONE_VM, proc) != 0)
        goto bad_fork_cleanup_fs;

    // !! 注意这个地址不是这个线程的地址，是上一个线程栈的栈底地址
    // 给线程分配栈,一个进程16页
    uint32_t thread_stack_top = USTACKTOP - (proc->stack_num) * 16 * PGSIZE;

    // 预先给两页给新分配的线程
    assert(pgdir_alloc_page(proc->mm->pgdir, thread_stack_top - PGSIZE, PTE_USER) != NULL);
    assert(pgdir_alloc_page(proc->mm->pgdir, thread_stack_top - 2 * PGSIZE, PTE_USER) != NULL);

    proc->tf = (struct trapframe *)(proc->kstack + KSTACKSIZE) - 1;

    struct trapframe *tf = proc->tf;
    memset(tf, 0, sizeof(struct trapframe));
    tf->tf_cs = USER_CS;
    tf->tf_ds = tf->tf_es = tf->tf_ss = USER_DS;

    // 把栈往上抬4个字节,才开始放东西，否则栈会越界
    // 先放exit的参数为0
    *(uint32_t *)(thread_stack_top - 1 * sizeof(uint32_t)) = (uint32_t)0;
    // 压入线程函数的参数地址
    *(uint32_t *)(thread_stack_top - 2 * sizeof(uint32_t)) = (uint32_t)arg;
    // 压入上一个函数的返回地址为 exit ，保证函数结束并没有显式调用 exit 时系统能帮助该线程退出。
    *(uint32_t *)(thread_stack_top - 3 * sizeof(uint32_t)) = exit;
    tf->tf_esp = thread_stack_top - 3 * sizeof(uint32_t);

    // 设置 eip 指向当前函数开始执行
    tf->tf_eip = fn;
    tf->tf_eflags = FL_IF;
    ret = 0;
    tf->tf_regs.reg_eax = 0;
    tf->tf_eflags |= FL_IF;

    // context 在调度的时候会弹出 tf 内的寄存器，恢复程序执行
    proc->context.eip = (uintptr_t)forkret;
    proc->context.esp = (uintptr_t)(proc->tf);

    bool intr_flag;
    local_intr_save(intr_flag);
    {
        proc->pid = get_pid();
        hash_proc(proc);
        set_links(proc);
    }
    local_intr_restore(intr_flag);

    wakeup_proc(proc);

    // 重新设置父线程的栈被该子线程占用
    father->stack[proc->stack_num] = proc->pid;

    ret = proc->pid;
fork_out:
    return ret;

bad_fork_cleanup_fs: //for LAB8
    put_fs(proc);
bad_fork_cleanup_kstack:
    put_kstack(proc);
bad_fork_cleanup_proc:
    kfree(proc);
    goto fork_out;
}
```

注意因为用户有可能不会在线程执行的函数结束返回时调用 exit，所以 clone 的一个任务就是帮助用户将线程栈栈底的返回地址设置为 sys_exit (不是直接 do_exit 因为该线程执行在用户态，只有通过系统调用才能执行内核态代码)

最后的线程栈栈帧如图所示

![2021-03-30-17-59-37.png](images/2021-03-30-17-59-37.png)

#### 祖宗线程 exit 时守护

若祖宗线程没有调用 join 来等待子线程结束工作，祖宗线程会直接退出，虽然这样内存空间的引用计数不为 0，不会释放他们共用的内存空间，剩余线程还能继续执行，但是所有栈的分配信息都存在祖宗线程的 PCB 表中，如果祖宗线程一旦释放，子进程再想调用 phthread_create 新建子进程会导致失败。于是必须在祖宗线程调用 sys_exit 时守护所有子线程退出后祖宗线程才能退出。

```c
static int
sys_exit(uint32_t arg[]) {
    int error_code = (int)arg[0];
    if (is_ancestral_thread(current)) //祖宗进程只有在所有子线程全部释放后才能退出
        while (current_have_kid())    //只要有儿子就等着
            do_wait(0, NULL);
    return do_exit(error_code);
}
```

#### 线程一致性维护

linux 上的线程就是基于轻量级进程, 由用户态的 pthread 库实现的.使用 pthread 以后, 在用户看来, 每一个 task_struct 就对应一个线程, 而一组线程以及它们所共同引用的一组资源就是一个进程。但是, 一组线程并不仅仅是引用同一组资源就够了, 它们还必须被视为一个整体。当"进程"收到一个致命信号(比如由于段错误收到 SIGSEGV 信号), 对应的这一组 task_struct 将全部退出。这是 POSIX 对线程实现提出的要求，如果某个线程”挂”了, 整个进程还在若无其事地运行着, 可能会出现很多的不一致状态. 进程将不是一个整体, 而线程也不能称为线程。

在 uCore 中我们也实现了该线程的一致性

```c
case T_PGFLT:  //page fault
    if ((ret = pgfault_handler(tf)) != 0) {
        print_trapframe(tf);
        if (current == NULL) {
            panic("handle pgfault failed. ret=%d\n", ret);
        }
        else {
            if (trap_in_kernel(tf)) {
                panic("handle pgfault failed in kernel mode. ret=%d\n", ret);
            }
            cprintf("killed by kernel.\n");
            // panic("handle user mode pgfault failed. ret=%d\n", ret);
            kill_all_thread();
        }
    }
    break;
```

比如有一个线程缺页异常的话，对应的这组进程都会被杀死。

## 用户态进程是如何退出的

以 stackOverflow.c 这个程序为例, 在编译的时候总共涉及 gcc 的调用和 ld 的链接操作

```shell
gcc -Iuser/ -fno-builtin -fno-PIC -Wall -ggdb -m32 -gstabs -nostdinc  -fno-stack-protector -Ilibs/ -Iuser/include/ -Iuser/libs/ -c user/stackOverFlow.c -o obj/user/stackOverFlow.o

ld -m    elf_i386 -nostdlib -T tools/user.ld -o obj/__user_stackOverFlow.out  obj/user/libs/panic.o obj/user/libs/syscall.o obj/user/libs/ulib.o obj/user/libs/semaphore.o obj/user/libs/initcode.o obj/user/libs/file.o obj/user/libs/stdio.o obj/user/libs/dir.o obj/user/libs/umain.o obj/user/libs/pthread.o obj/user/libs/proc.o  obj/libs/rbtree.o obj/libs/string.o obj/libs/printfmt.o obj/libs/hash.o obj/libs/rand.o obj/user/stackOverFlow.o
```

之所以选这个程序是因为简单，使用 objdump 反汇编第一条 gcc 编译指令执行完毕生成的 stackOverFlow.o 文件的 text 段如下所示：

注意现在还没有到链接的阶段，所有 text 段代码的起始位置都是 0 地址处：

```c
Disassembly of section .text:

00000000 <loop>:
   0:	f3 0f 1e fb          	endbr32
   4:	55                   	push   %ebp
   5:	89 e5                	mov    %esp,%ebp
   7:	83 ec 08             	sub    $0x8,%esp
   a:	e8 fc ff ff ff       	call   b <loop+0xb>
   f:	90                   	nop
  10:	c9                   	leave
  11:	c3                   	ret

00000012 <main>:
  12:	f3 0f 1e fb          	endbr32
  16:	55                   	push   %ebp
  17:	89 e5                	mov    %esp,%ebp
  19:	83 e4 f0             	and    $0xfffffff0,%esp
  1c:	e8 fc ff ff ff       	call   1d <main+0xb>
  21:	b8 00 00 00 00       	mov    $0x0,%eax
  26:	c9                   	leave
  27:	c3                   	ret
```

下一条链接指令会将 user/lib 下的所有 .o 文件与 stackOverFlow.o 一起进行链接操作

其中链接的 obj/user/libs/initcode.o 是一段汇编代码

```c
.text
.globl _start
_start:
    # set ebp for backtrace
    movl $0x0, %ebp

    # load argc and argv
    movl (%esp), %ebx
    lea 0x4(%esp), %ecx


    # move down the esp register
    # since it may cause page fault in backtrace
    subl $0x20, %esp

    # save argc and argv on stack
    pushl %ecx
    pushl %ebx

    # call user-program function
    call umain
1:  jmp 1b
```

结合链接脚本中的 `ENTRY(_start)` 可知在链接完成后 initcode.o 是最先被执行的，该段汇编指令调用了 umain 函数，定义在 umain.o 里。

umain 打开了输入输出流两个文件描述符，并调用函数 main （注意：umain 上面的 main 函数只是声明，在链接的时候会把 stackOverFlow.o 里面的 main 函数实现给链接上，**main 函数执行完成后 umain 调用系统调用 exit 退出**）

```c
int main(int argc, char *argv[]);

static int
initfd(int fd2, const char *path, uint32_t open_flags) {
    int fd1, ret;
    if ((fd1 = open(path, open_flags)) < 0) {
        return fd1;
    }
    if (fd1 != fd2) {
        close(fd2);
        ret = dup2(fd1, fd2);
        close(fd1);
    }
    return ret;
}

void
umain(int argc, char *argv[]) {
    int fd;
    if ((fd = initfd(0, "stdin:", O_RDONLY)) < 0) {
        warn("open <stdin> failed: %e.\n", fd);
    }
    if ((fd = initfd(1, "stdout:", O_WRONLY)) < 0) {
        warn("open <stdout> failed: %e.\n", fd);
    }
    int ret = main(argc, argv);
    exit(ret);
}
```

截取几段链接完成后的反汇编代码

```c
Disassembly of section .text:

00800020 <__panic>:
#include <stdio.h>
#include <ulib.h>
#include <error.h>

void
__panic(const char *file, int line, const char *fmt, ...) {
  800020:	f3 0f 1e fb          	endbr32
  800024:	55                   	push   %ebp
  800025:	89 e5                	mov    %esp,%ebp
  800027:	83 ec 18             	sub    $0x18,%esp
```

可以看到代码段真如 ld 脚本规定的从 0x00800020 地址处开始

```c
0080066b <_start>:
.text
.globl _start
_start:
    # set ebp for backtrace
    movl $0x0, %ebp
  80066b:	bd 00 00 00 00       	mov    $0x0,%ebp

    # load argc and argv
    movl (%esp), %ebx
  800670:	8b 1c 24             	mov    (%esp),%ebx
    lea 0x4(%esp), %ecx
  800673:	8d 4c 24 04          	lea    0x4(%esp),%ecx
```

最后操作系统在执行这段代码的时候会根据 elf 的格式将 eip 设置为 0x0080066b 从这开始执行。

## 为何代码不从 0 地址开始

The stack, which is usually quite small but could grow quite dramatically in some occasions. The stack grows down, and when the stack is full, we really want the process to predictably crash rather than overwriting some data. So there had to be a wide area for the stack, with, at the low end of that area, an unmapped page. And lo! There is an unmapped page at address zero, to catch null pointer dereferences. Hence it was defined that the stack would get the first 128 MB of address space, except for the first page. This means that the code had to go after those 128 MB, at an address similar to 0x080xxxxx.

As Michael points out, "losing" 128 MB of address space was no big deal because the address space was very large with regards to what could be actually used. At that time, the Linux kernel was limiting the address space for a single process to 1 GB, over a maximum of 4 GB allowed by the hardware, and that was not considered to be a big issue.