Loading presentation...

Present Remotely

Send the link below via email or IM

Copy

Present to your audience

Start remote presentation

  • Invited audience members will follow you as you navigate and present
  • People invited to a presentation do not need a Prezi account
  • This link expires 10 minutes after you close the presentation
  • A maximum of 30 users can follow your presentation
  • Learn more about this feature in our knowledge base article

Do you really want to delete this prezi?

Neither you, nor the coeditors you shared it with will be able to recover it again.

DeleteCancel

Make your likes visible on Facebook?

Connect your Facebook account to Prezi and let your likes appear on your timeline.
You can change this under Settings & Account at any time.

No, thanks

Linux wprowadzenie

scheduling
by

Radoslaw Biernacki

on 6 September 2015

Comments (0)

Please log in to add your comment.

Report abuse

Transcript of Linux wprowadzenie

Przerwanie
maskowalne/niemaskowalne
współdzelenie przez HW
zagniedzenie, wiele CPU
przerwania SMP

Wyjatki:
Fault - page exception (naprawialne)
Traps - debbuging (naprawialne)
Aborts - invalid-opcode (zakonczenie app)

Programowalne:
najczesciej system calls (arch dependent)
historia
GNU/Linux
Czym jest GNU/Linux
Linux =>
Linux is a clone of the operating system Unix, written from scratch by Linus Torvalds
GNU =>
GNU is a Unix-like operating system that is free software
Linux Kernel + GNU
. This combination is the GNU/Linux operating system. GNU/Linux is used by millions, though many call it "Linux" by mistake
GNU Hurd kernel (microkernel)
Linux kernel licencja GPL
Troche histori
wprowadzenie
Linux
process
context

Kernel
Proces & watek
jak programowac z glowa
Efektywność
busy loops (kiedy i jak userspace i kernelspace)
jak efektywnie zarządzać pamięcią, kiedy mapować pliki, memory pool allocators
asd
zarządzanie priorytetami, REALTime jeszcze raz i eksplozja wątków czyli jak nie robić„a jakos to bedzie” ;)
Przejscia USER/KERNEL i przełaczenia kontekstu
wywołania systemowe (np read, mutex)
exception (np nieistniejacy opcode, paging request)
interrupt (np asynchroniczne zdarzenie w urzadzeniu, tick interrupt)
Co ponnadto ?
TLB miss
cache lines (cache locality)
Debian
Slackware
RedHat
Proces & watek
108 #define TASK_RUNNING 0 /* normal */
109 #define TASK_INTERRUPTIBLE 1 /* waits on wait_queue and accepts signals */
110 #define TASK_UNINTERRUPTIBLE 2 /* waits on wait_queue cannot be woken up by signal */
111 #define TASK_STOPPED 4 /* after SIGSTOP */
112 #define TASK_TRACED 8 /* under debugger */
113 #define EXIT_ZOMBIE 16 /* parent exit */
114 #define EXIT_DEAD 32 /* before reclaim */
Proces
Zbiór zasobów przypisany do conajmniej jednego kontekstu wykonania, posiadajacy parametry opisujace prawa dostepu

Kontekst wykonania
Stan procesora (rejestrow) ustalajacy miejsce w wykonywanym programie

W Linux watki implementowane sa jako Light Weight Process - czyli kernel widzi je jako zgrupowane procesy i zarzadza nimi jako calosc (np exit()) - dodatkowo mapuja ta sama pamiec. Biblioteka standardowa lub wywołania Posix udostepniaja odpowiednie API dla watkow.

Stos procesu
W Linux kazdy proces (w tym LWP) posiada dwa stosy - user (rozszezalny), kernel stack (zazwyczaj 8kb)

Stan procesora jest automatycznie zrzucany na kernel stack przy przerwaniu/wyjatku
Memory concepts
separacja od HW, abstrakcja liniowej pamieci adresowej
kod plikow wykonywalnych współdzielony miedzy procesy (osobne sterty i stosy)
pamiec współdzielona (system V, Posix, mmap)
copy-on-write (fast fork, read only flaged pages)
demand paging (no physical page start/alloc)
Wszystkie te funkcjonalnosci mozliwe sa dzieki MMU oraz tablicy stron procesu
Interrupt/Exception/SystemCall
czyli
ogrom tematu
aktywnie uczestniczymy w szkoleniu
zadajemy pytania
kazde pytanie jest istotne
bede dobierał tempo na podstawie waszych pytan, wiecej pytan znaczy wieksze zainteresowanie, brak pytan znaczy ze temat jest nudny/znany i nalezy zmienic temat
1970
1970 - Unix - Ken Thompson and Dennis Ritchie
1980
1990
2000
1970
1983 - Richard Stallman started the GNU project - the GNU kernel, called Hurd, failed to attract enough attention from developers leaving GNU incomplete.
1985 - Intel released the 80386, the first x86 microprocessor with 32-bit instruction set and MMU with paging
1986 - Maurice J. Bach, of AT&T Bell Labs, published The Design of the UNIX Operating System
1987 - Andrew S. Tanenbaum - MINIX - Public source, modification and redistribution were restricted - 16-bit design
25 August 1991 - "Hello everybody out there using minix"
1992 - The Linux kernel is relicensed under the GNU GPL
1989 - 386BSD not available until until 1992
ipc kiedy jakie, analiza kosztu dla przykładowych problemów (komunikacja typu request-response z wykorzystaniem struktur C, jednokierunkowy producent konsument oraz co do tego ma scheduling, komunikacja typu “property read, state notify”)
nie tylko algorytmy, oraz pamiętaj o sprzęcie: cache locality (kiedy dla data cache a kiedy dla code cache), gcc cpu optimizations
Context switch
thread 1 wywołuje read()
trap ->
automatyczne przełaczenie na drugi stos procesu (kernel)
zrzut rejestrów


sys_call -> vfs_read -> flip->read()
request do urzadzenia o dane
sleep_on(queue)

schedule()
wybranie procesu -> thread2
context_switch(thread1, thread2)
uzywamy kernel stack dla thread2
ostatnia intrukja to ret













































okazuje sie ze wracamy do sleep_on(queue)

ostatecznie ret_from_interrupt
pending_signals() ?
podniesienie kontekstu
reti (przelaczenie na stos user)
















okazuje sie ze wracamy z schedule() ale w calkowicie innym miejscu kernela

ostatecznie ret_from_interrupt()
pending_signals() ?
podniesienie kontekstu
reti (przelaczenie na stos user)

















IRQ <-
automatyczne przełaczenie na drugi stos procesu (kernel)
zrzut rejestrów

do_IRQ()
ISR (bottom half)
potwierdzenie przerwanie w HW
wake_up()

ostatecznie ret_from_interrupt()
schedule()
teraz juz wyjasnia sie gdzie bylo poprzednie wywolanie schedule()
wybranie procesu -> thread1
context_switch(thread2, thread1)
uzywamy kernel stack dla thread1
ostatnia intrukcja to ret
KERNEL
PROCES 2
PROCES 1
thread 1 powrót z read()
574 struct task_struct {
..
608 struct mm_struct *mm, *active_mm;
...
576 struct thread_info *thread_info;
...
618 pid_t pid;
619 pid_t tgid;
625 struct task_struct *real_parent; /* real parent process (when being debugged) */
626 struct task_struct *parent; /* parent process */
631 struct list_head children; /* list of my children */
632 struct list_head sibling; /* linkage in my parent's children list */
633 struct task_struct *group_leader; /* threadgroup leader */
...
606 uid_t uid,euid,suid,fsuid;
607 gid_t gid,egid,sgid,fsgid;
...
636 struct pid pids[PIDTYPE_MAX]; /* PID/PID hash table linkage. */
...
584 struct list_head run_list; /* head for tasks with the same prio at same cpu (prio_array_t) */
585 prio_array_t *array; /* reverse pointer to prio array */
578 unsigned long flags;

612 long exit_state;
613 int exit_code, exit_signal;
}
Czas
Blokowanie w kernelu
1 void fastcall __sched interruptible_sleep_on(wait_queue_head_t *q)
2 {
3 unsigned long flags;
4 wait_queue_t wait;
5 init_waitqueue_entry(&wait, current);
6
7 current->state = TASK_INTERRUPTIBLE;
8
9 spin_lock_irqsave(&q->lock,flags);
10 __add_wait_queue(q, &wait);
11 spin_unlock(&q->lock);
12 schedule();
13 spin_lock_irq(&q->lock);
14 __remove_wait_queue(q, &wait);
15 spin_unlock_irqrestore(&q->lock, flags);
16 }
93 #define SPINLOCK_MAGIC 0x1D244B3C
94 typedef struct {
95 unsigned long magic;
96 volatile unsigned long lock;
97 volatile unsigned int babble;
98 const char *module;
99 char *owner;
100 int oline;
101 } spinlock_t;
295 #define _spin_lock_irqsave(lock, flags) \
296 do { \
297 local_irq_save(flags); \
298 preempt_disable(); \
299 _raw_spin_lock(lock); \
300 __acquire(lock); \
301 } while (0)

311 #define _spin_lock_bh(lock) \
312 do { \
313 local_bh_disable(); \
314 preempt_disable(); \
315 _raw_spin_lock(lock); \
316 __acquire(lock); \
317 } while (0)
spinlock
wait_queues
semaphore
mutex
SpinLock
Semaphore
busy wait loop
uzywa synchronizacji SMP/cache
SMP - wylacza preemption i przerwania
UP - tylko przerwania
44 struct semaphore {
45 atomic_t count;
46 int sleepers;
47 wait_queue_head_t wait;
48 };
głównie:
zainicjalizowanie process_struct i thread_struct
najczesciej kopiowane sa wartosci z pol rodzica w zaleznosci od flag dziedziczenia
inicjalizacja kernel stack adresem funkcji wyjscia oraz funkcji watku
Dla i386 nie sa uzywane TSS oraz instrukcje zaprojektowane do przelaczania kontekstu
lepsza kontrola
sanity checks
ten sam czas wykonania
thread_struct uzywany do przechowania glownych rejestrow
kernel_stack uzywany dla rejestrow general purpose
Tworzenie procesu
424 asmlinkage void __init start_kernel(void)
425 {
426 char * command_line;
427 extern struct kernel_param __start___param[], __stop___param[];
...
434 printk(KERN_NOTICE);
435 printk(linux_banner);
436 setup_arch(&command_line);
...
450 sched_init();
464 trap_init();
465 rcu_init();
466 init_IRQ();
467 pidhash_init();
468 init_timers();
469 softirq_init();
470 time_init();
...
525 /* Do the rest non-__init'ed, we're now alive */
526 rest_init();
527 }

379 static void noinline rest_init(void)
380 __releases(kernel_lock)
381 {
382 kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
383 numa_default_policy();
384 unlock_kernel();
385 preempt_enable_no_resched();
386 cpu_idle();
387 }


46 void default_idle(void)
dla kazdej z arch jest osobna funkcja

50
51 void
52 cpu_idle(void)
53 {
54 while (1) {
55 void (*idle)(void) = default_idle;
59 while (!need_resched())
60 idle();
61 schedule();
62 }
63 }
Idle/swaper oraz init
636 static int init(void * unused)
637 {
...
704 if (execute_command)
705 run_init_process(execute_command);
706
707 run_init_process("/sbin/init");
708 run_init_process("/etc/init");
709 run_init_process("/bin/init");
710 run_init_process("/bin/sh");
711
712 panic("No init found. Try passing init= option to kernel.");
713}

612 static void run_init_process(char *init_filename)
613 {
614 argv_init[0] = init_filename;
615 execve(init_filename, argv_init, envp_init);
616 }

czy da sie zabic init ?

774 fastcall NORET_TYPE void do_exit(long code)
775 {
776 struct task_struct *tsk = current;
777 int group_dead;
778
779 profile_task_exit(tsk);
780
781 if (unlikely(in_interrupt()))
782 panic("Aiee, killing interrupt handler!");
783 if (unlikely(!tsk->pid))
784 panic("Attempted to kill the idle task!");
785 if (unlikely(tsk->pid == 1))
786 panic("Attempted to kill init!");
skrypty startowe init
orphaned zombies
Proces & watek
Procesy tworza swoja kopie za pomoca sys_clone() a nastepnie nadpisuja swoja pamiec wywolaniem execv().

Przy clone() kopiowane sa:
wszystkie strony procesu
wszystkie deskryptory
Kopiowanie jest szybkie ze wzgledu na wykorzystanie copy_on_write (tylko mm struct clone, nastepnie MMU ustawiane przy przelaczeniu kontekstu)
658 .data
659 ENTRY(sys_call_table)
660 .long sys_restart_syscall /* 0 - old "setup()" */
661 .long sys_exit
662 .long sys_fork
663 .long sys_read
664 .long sys_write
665 .long sys_open /* 5 */
666 .long sys_close
667 .long sys_waitpid
668 .long sys_creat
669 .long sys_link
670 .long sys_unlink /* 10 */
671 .long sys_execve
672 .long sys_chdir
673 .long sys_time
674 .long sys_mknod
675 .long sys_chmod /* 15 */
676 .long sys_lchown16
677 .long sys_ni_syscall /* old break syscall holder */
678 .long sys_stat
679 .long sys_lseek
680 .long sys_getpid /* 20 */
SystemCall
ENTRY(sysenter_entry)
call *sys_call_table(,%eax,4)
Przyrost lini kodu z czasem
#include <sys/mman.h>

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
int munmap(void *addr, size_t length);

PROT_WRITE, PROT_READ Data can be written/read.
PROT_EXEC Data can be executed.
PROT_NONE Data cannot be accessed

MAP_PRIVATE, MAP_SHARED Changes are private/shared (can be used to utilize copy-on-write and read())
MAP_FIXED Interpret addr exactly.

The off argument is constrained to be aligned and sized according to the value returned by sysconf() when passed _SC_PAGESIZE or _SC_PAGE_SIZE

Note that references beyond the end of the object do not extend the object as the new end cannot be determined precisely by most virtual memory hardware. Instead, the size can be directly manipulated by ftruncate()
podstawy
przypomnienie

Unix/Linux
Wszystko jest plikiem
File ops
Unix = prostota -> wstepnie tylko 40 funkcji kernela
Virtual File System
Typ pliku:
regular file
directory
symbolic link
block-oriented device file
character-oriented device file
pipe and named pipe
socket
Virtual file system
iNode
hard and soft links
Kazdy uzytkownik pliku nalezy do przynajmniej jednej z klas:
właściciel pliku (UID)
GID uzytkownika zgadza sie z GID grupy
wszyscy pozostali

W ramach klas wystepuja nastepujace prawa dla plików i katalogów
rwx - ...
suid (set user id), sgid (set group id)
dla plików - po excec na pliku process nabiera praw wlasciciela pliku
dla katalogow - tylko sgid - user/group pliku bedzie ustawiony na user/group katalogu
sticky (file system dependent)
dla plików - system operacyjny nie usuwa stron wykonywanlych pliku po zakonczeniu procesu (obsolete)
dla katalogow - tylko wlasciciel pliku moze usunac swoj plik np /tmp
Gdzie szukać informacji
Sygnaly
open
close
read
write
lseek (pread, pwrite)
readv(scatter read)
fcntl
ioctl
Sys V IPC
Pipes
Message queues
Semaphore sets
Shared memory
mmap
Asynchronous I/O
mmap
man
man -a (all successive)
man -k (search for keyword)
man [nr]
apropos
The Linux Documentation Project http://www.tldp.org/
man i info
file, open, close, read, write, lseek (pread, pwrite), readv(scatter read),
fcntl, ioctl, sockopt
sygnaly
select story (select, poll, epoll)
asynchoniczne czytanie
mmap
IPC to mutex, semafor, conditional, kolejka, shm
sockety: tworzenie (stream, dgram), listen, close, shutdown
1 Executable programs or shell commands
2 System calls (functions provided by the kernel)
3 Library calls (functions within program libraries)
4 Special files (usually found in /dev)
5 File formats and conventions eg /etc/passwd
6 Games
7 Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
8 System administration commands (usually only for root)
9 Kernel routines [Non standard]
Open, Creat
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int open(const char *pathname, int flags);
int open(const char *pathname, int flags, mode_t mode);
int creat(const char *pathname, mode_t mode);

open() and creat() return the new file descriptor, or -1 if an error occurred (in which case, errno is set appropriately)

flags = O_RDONLY, O_WRONLY, or O_RDWR,
O_CREAT, O_EXCL
O_TRUNC,O_APPEND,
O_SYNC The file is opened for synchronous I/O (block until write complete)
O_ASYNC Enable signal-driven I/O: generate a signal (SIGIO by default, but this can be changed via fcntl(2)), only feasible for sockets and terminals
O_NONBLOCK or O_NDELAY When possible, the file is opened in non-blocking mode

mode - must be specified when O_CREAT is in the flags, and is ignored otherwise.

creat() is equivalent to open() with flags equal to O_CREAT|O_WRONLY|O_TRUNC
Close
#include <unistd.h>

int close(int fd);

close() returns zero on success. On error, -1 is returned, and errno is set appropriately

Note: if the descriptor was the last reference to a file which has been removed using unlink(2) the file is deleted

Note2: Not checking the return value of close() is a common but nevertheless serious programming error. It is quite possible that errors on a previous write(2) operation are first reported at the final close(). Not checking the return value when closing the file may lead to silent loss of data. This can especially be observed with NFS and with disk quota

A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2)

#include <unistd.h>

int unlink(const char *pathname);
Read/Write
#include <unistd.h>

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count)

return:
count - all deamanded data read/writen
< count - some data writen
0 - end of file (read)
-1 - error (errno)

count - If count is zero, read() returns zero and has no other results. If count is greater than SSIZE_MAX, the result is unspecified

The number of bytes written may be less than count!!!
Poprawne czytanie w trybie blokujacym
size_t off = 0;
size_t count = 1024;
char buff[count]
while(off==count) {
ret = read(f, &buff[off], count - off);
if(ret<0) {
if(EINTR==errno) continue;
break;
} else if(0==ret) {
break;
}
off += ret;
}

Dyskusja, read zwraca 0, sygnaly i debbuger, przerwanie polaczenia
lseek
#include <sys/types.h>
#include <unistd.h>

off_t lseek(int fd, off_t offset, int whence);

#define _LARGEFILE64_SOURCE
#include <sys/types.h>
#include <unistd.h>

off64_t lseek64(int fd, off64_t offset, int whence);

off_t is a 32-bit signed type on 32-bit architectures, unless one compiles with
#define _FILE_OFFSET_BITS 64 in which case it is a 64-bit signed type

Upon successful completion, lseek() returns the resulting offset location as measured in bytes from the beginning of the file. Otherwise, a value of (off_t) -1 is returned and errno is set to indicate the error

SEEK_SET - The offset is set to offset bytes.
SEEK_CUR - The offset is set to its current location plus offset bytes.
SEEK_END - The offset is set to the size of the file plus offset bytes

The lseek() function allows the file offset to be set beyond the end of the file (but this does not change the size of the file). If data is later written at this point, subsequent reads of the data in the gap (a "hole") return null bytes (’\0’) until data is actually written into the gap
pread, pwrite
#define _XOPEN_SOURCE 500

#include <unistd.h>

ssize_t pread(int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
readv, writev
#include <sys/uio.h>

ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

struct iovec {
void *iov_base; /* Starting address */
size_t iov_len; /* Number of bytes to transfer */
};

Note: readv() is guaranteed to read a contiguous block of data from the file, regardless of read operations performed in other threads or processes that have file descriptors referring to the same open file description
dup
#include <unistd.h>

int dup(int oldfd);
int dup2(int oldfd, int newfd);

Note: dup() uses the lowest-numbered unused descriptor for the new descriptor
Both descriptorsrefer to the same open file description and thus share file offset and file status flags; for example, if the file offset is modified by using lseek(2) on one of the descriptors, the offset is also changed for the other.
Two types of signals:
realtime
Multiple instances can be queued
guaranteed order of same signal type, in case of multiple types - lowest types first
standard
only one instance is queued
unspecified order of delivery
In case both real-time and standard signals are queued the sequence betwen them is unspecified (Linux delivers the standard signal first)

Control:
mask
peer thread
signal disposition (default (usually close), ignore, handler)
peer process
processing will be done only if at least one thread is not masking the signal, otherwise signal will be queued
Sygnaly
child created via fork inherits a copy of its parent's signal dispositions. During an execve, the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged
C and pthread library uses some of the first realtime signals for internal use. To not overlap for those, user should use the SIGRTMIN define.
thread safe functions, usage of sig_atomic_t
system calls are interrupted by signals. SA_RESTART can be used for automatic restart of interrupted system call. Following calls uses this rule:
- read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices. A "slow" device is one where the I/O call may block for an indefinite time, for example, a terminal, pipe, or socket. (A disk is not a slow device according to this definition.) If an I/O call on a slow device has already transferred some data by the time it is interrupted by a signal handler, then the call will return a success status (normally, the number of bytes transferred)
- open(2), if it can block (e.g., when opening a FIFO; see fifo(7)).
- wait(2), wait3(2), wait4(2), waitid(2), and waitpid(2).
- Socket interfaces: accept(2), connect(2), recv(2), recvfrom(2), recvmsg(2), send(2), sendto(2), and sendmsg(2), unless a timeout has been set on the socket (see below).
- File locking interfaces: flock(2) and fcntl(2) F_SETLKW.
- POSIX message queue interfaces: mq_receive(3), mq_timedreceive(3), mq_send(3), and mq_timedsend(3).
- futex(2) FUTEX_WAIT (since Linux 2.6.22; beforehand, always failed with EINTR).
- POSIX semaphore interfaces: sem_wait(3) and sem_timedwait(3) (since Linux 2.6.22; beforehand, always failed with EINTR).
Sygnaly
Sygnaly
#include <signal.h>

int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

struct sigaction {
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
sigset_t sa_mask;
int sa_flags;
};

signum specifies the signal and can be any valid signal except SIGKILL and SIGSTOP.

If act is non-NULL, the new action for signal signum is installed from act. If oldact is non-NULL, the previous action is saved in oldact. Using only the oldact may be used to query the current signal disposition.

sa_handler specifies the action to be associated with signum and may be SIG_DFL for the default action, SIG_IGN to ignore this signal, or a pointer to a signal handling function. This function receives the signal number as its only argument, or use SA_SIGINFO flag, then sa_sigaction (instead of sa_handler

sa_mask specifies a mask of signals which should be blocked (i.e., added to the signal mask of the thread in which the signal handler is invoked) during execution of the signal handler. In addition, the signal which triggered the handler will be blocked, unless the SA_NODEFER flag is used.

sa_flags specifies a set of flags which modify the behavior of the signal (lots of them ... see man for more details)
#include <aio.h>

struct aiocb {
int aio_fildes; /* File descriptor */
off_t aio_offset; /* File offset regardless of the current file offset*/
volatile void *aio_buf; /* Location of buffer */
size_t aio_nbytes; /* Length of transfer */
int aio_reqprio; /* Request priority */
struct sigevent aio_sigevent; /* Notification method */
int aio_lio_opcode; /* Operation; lio_listio() only */
};

aio_sigevent -> {sigev_notify, sigev_signo, sigev_value }
sigev_notify = SIGEV_NONE, SIGEV_SIGNAL, and SIGEV_THREAD

int aio_read (struct aiocb *aiocbp);
int aio_write (struct aiocb *aiocbp);
int aio_error (const struct aiocb *aiocbp); /* check for errors */
int aio_return (struct aiocb *aiocbp); /* return the status code of op */
int aio_cancel (int fd, struct aiocb *aiocbp);
int aio_fsync (int op, struct aiocb *aiocbp);
int aio_suspend (const struct aiocb * const cblist[], int n, const struct timespec *timeout); /* suspend until io list will be completed */

The current Linux POSIX AIO implementation is provided in userspace by glibc
select/poll
select/poll
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, struct timeval *timeout);
int pselect(int nfds, fd_set *readfds, fd_set *writefds, fd_set *errorfds, const struct timespec *timeout, const sigset_t *sigmask);

void FD_CLR(int fd, fd_set *fdset);
int FD_ISSET(int fd, fd_set *fdset);
void FD_SET(int fd, fd_set *fdset);
void FD_ZERO(fd_set *fdset);

select timeout is with milisecond resolution, pselect resolution is up to nanosecond, this field can be NULL which force select to set appropriate fdset flags and return immediately
select may be interrupted by signal, pselect sets the mask attomicaly
maximal number of descriptor is FD_SETSIZE
Upon successful completion, the pselect() and select() functions shall return the total number of bits set in the bit masks. Otherwise, -1 shall be returned, and errno shall be set to indicate the error.

#include <poll.h>
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
epoll
#include <sys/epoll.h>

int epoll_create(int size);
int epoll_create1(int flags); /* since size is not used */
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
int epoll_pwait(int epfd, struct epoll_event *events, int maxevents, int timeout, const sigset_t *sigmask);

struct epoll_event {
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable union int or ptr */
};

op:
EPOLL_CTL_ADD
EPOLL_CTL_MOD
EPOLL_CTL_DEL

events:
EPOLLIN
EPOLLOUT
EPOLLRDHUP - Stream socket peer closed connection, or shut down writing half of connection.
EPOLLPRI - There is urgent data available for read operations.
EPOLLET - Sets the Edge Triggered behavior for the associated file descriptor.
EPOLLONESHOT - The user must call epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask.
Memory concepts
Hello everybody out there using minix
I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones.

It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.

Q: Is that max 64 64Mb tasks or max 64 tasks no matter what their size?
A: I'm afraid that is 64 tasks max (and one is used as swapper), no matter how small they should be.
...
I don't want to be on the machine when someone is spawning >64 processes, though :-)
Sys V Message queues
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

key_t ftok(const char *pathname, int proj_id)
int msgget(key_t key, int msgflg);
int msgctl(int msqid, int cmd, struct msqid_ds *buf);

key = key id or IPC_PRIVATE
msgflg = same as in open for instance (O_CREAT | O_EXCL and access right flags)
return = non negative msg queue descriptor of -1 in case of error
MSGMNI - System wide maximal number of msg queues => /proc/sys/kernel/msgmni
SGMAX - Maximum size for a message text: 8192 bytes (/proc/sys/kernel/msgmax).
MSGMNB - Default maximum size in bytes of a message queue: 16384 bytes (/proc/sys/kernel/msgmnb)
Sys V Message queues
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);
ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp, int msgflg);

struct msgbuf {
long mtype; /* message type, must be > 0 */
char mtext[1]; /* message data */
};

On failure both functions return -1 with errno indicating the error, otherwise msgsnd() returns 0 and msgrcv() returns the number of bytes actually copied into the mtext array

Messages of zero length (i.e., no mtext field) are permitted. The mtype field must have a strictly positive integer value
If insufficient space is available in the queue, then the default behavior of msgsnd() is to block until space becomes available (unless IPC_NOWAIT)
Size of the queue is initially set to MSGMNB bytes
call cannot be automaticly restarted upon signal (SA_RESTART does not work)
argument msgsz specifies the maximum size in bytes for the member mtext of the structure pointed to by the msgp argument
Sys V Message queues
in case of msgrcv if message to be read is greater than msgsz and MSG_NOERROR is specified, then the message text will be truncated. if MSG_NOERROR is notspecified, then the message isn't removed from the queue and the system call fails returning -1 (errno set to E2BIG)
The argument msgtyp specifies the type of message requested as follows:
If msgtyp is 0, then the first message in the queue is read.
If msgtyp is greater than 0, then the first message in the queue of type msgtyp is read, unless MSG_EXCEPT was specified in msgflg, in which case the first message in the queue of type not equal to msgtyp will be read.
If msgtyp is less than 0, then the first message in the queue with the lowest type less than or equal to the absolute value of msgtyp will be read
msgflg:
IPC_NOWAIT - Return immediately if no message of the requested type is in the queue.
MSG_EXCEPT - Used with msgtyp greater than 0 to read the first message in the queue with message type that differs from msgtyp.
MSG_NOERROR - To truncate the message text if longer than msgsz bytes.
Sys V Semaphore Sets
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int semget(key_t key, int nsems, int semflg);
int semctl(int semid, int semnum, int cmd, ...);

If successful, the return value will be the semaphore set identifier (a nonnegative integer), otherwise -1 is returned, with errno indicating the error.

A newly created set are indeterminate. (POSIX.1-2001 is explicit on this point.) Although Linux, like many other implementations, initializes the semaphore values to 0. In order to initialize the semaphores, semctl() must be used to perform a SETVAL or a SETALL operation
SEMMNI System wide maximum number of semaphore sets: policy dependent (/proc/sys/kernel/sem).
SEMMSL Maximum number of semaphores per semid: implementation dependent (/proc/sys/kernel/sem).
SEMMNS System wide maximum number of semaphores: policy dependent (/proc/sys/kernel/sem). Values greater than SEMMSL * SEMMNI makes it irrelevant.
Sys V Semaphore Sets
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int semop(int semid, struct sembuf *sops, unsigned nsops);
int semtimedop(int semid, struct sembuf *sops, unsigned nsops, struct timespec *timeout);

struct sembuf {
unsigned short sem_num; /* semaphore number */
short sem_op; /* semaphore operation */
short sem_flg; /* operation flags */
};

If successful semop() and semtimedop() return 0; otherwise they return -1 with errno indicating the error

sem_flg: IPC_NOWAIT and SEM_UNDO (automatically undone when the process terminates)
sem_op: >0 <0, or 0 "wait-for-zero"

operations are performed either as a complete unit, or not at all
semop() is never automatically restarted after being interrupted by a signal handler (SA_RESTART does ot work)
Sys V Semaphore Sets
/* Code to set semid omitted */
sops[0].sem_num = 0; /* Operate on semaphore 0 */
sops[0].sem_op = 0; /* Wait for value to equal 0 */
sops[0].sem_flg = 0;

sops[1].sem_num = 0; /* Operate on semaphore 0 */
sops[1].sem_op = 1; /* Increment value by one */
sops[1].sem_flg = 0;

if (semop(semid, sops, 2) == -1) {
perror("semop");
exit(EXIT_FAILURE);
}
Sys V Shared Memory
#include <sys/ipc.h>
#include <sys/shm.h>

int shmget(key_t key, size_t size, int shmflg);
void *shmat(int shmid, const void *shmaddr, int shmflg);
int shmdt(const void *shmaddr);
int shmctl(int shmid, int cmd, struct shmid_ds *buf);

shmget: A valid segment identifier, shmid, is returned on success, -1 on error.
shmat: On success returns the address of the attached shared memory segment; on error (void *) -1 is returned, and errno is set to indicate the cause of the error.
shmdt: On success returns 0; on error -1 is returned, and errno is set to indicate the cause of the error

size rounded up to a multiple of PAGE_SIZE
shmaddr is NULL, the system chooses a suitable (unused) address at which toattach the segment
shmget ->shmflg: same as in open
shmat -> shmflg: SHM_RDONLY , (Linux-specific) SHM_REMAP
The same segment may be attached as a read and as a read-write one, and more than once, in the process's address space
After a fork() the child inherits the attached shared memory segments
After an execve() all attached shared memory segments are detached from the process
Pipes and Fifo
#include <unistd.h>

int pipe(int pipefd[2]);
int pipe2(int pipefd[2], int flags);

#include <sys/types.h>
#include <sys/stat.h>
int mkfifo(const char *pathname, mode_t mode);

both return: On success, zero is returned. On error, -1 is returned, and errno is set appropriately.

unidirectional data channel
pipefd[0] refers to the read end of the pipe. pipefd[1] refers to the write end
Data written to the write end of the pipe is buffered by the kernel until it is read from the read end of the pipe
A pipe has a limited capacity, since Linux 2.6.11, the pipe capacity is 65536 bytes
If a process attempts to read from an empty pipe, then read() will block until data is available. If a process attempts to write to a full pipe, then write() blocks until sufficient data has been read from the pipe to allow the write to complete
Normally, opening the FIFO blocks until the other end is opened also. A process can open a FIFO in nonblocking mode. In this case, opening for read only will succeed even if no-one has opened on the write side yet, opening for write only will fail with ENXIO (no such device or address) unless the other end has already been opened
Posix IPC
Message queues
Semaphores/Mutex
Shared memory
POSIX Shared Memory
#include <sys/mman.h>
#include <sys/stat.h> /* For mode constants */
#include <fcntl.h> /* For O_* constants */

int shm_open(const char *name, int oflag, mode_t mode);
int shm_unlink(const char *name);

return: On success, shm_open() returns a nonnegative file descriptor. On failure, shm_open() returns -1. shm_unlink() returns 0 on success, or -1 on error

parameters: same as open

The file descriptor is normally used in subsequent calls to ftruncate() (for a newly created object) and mmap(). After a call to mmap() the file descriptor may be closed without affecting the memory mapping.
The operation of shm_unlink() is analogous to unlink(): it removes a shared memory object name, and, once all processes have unmapped the object, de-allocates and destroys the contents of the associated memory region. After a successful shm_unlink(), attempts to shm_open() an object with the same name will fail (unless O_CREAT was specified, in which case a new, distinct object is created).
The POSIX shared memory object implementation on Linux makes use of a dedicated file system, which is normally mounted under /dev/shm.
POSIX message queues
#include <fcntl.h> /* For O_* constants */
#include <sys/stat.h> /* For mode constants */
#include <mqueue.h>

mqd_t mq_open(const char *name, int oflag);
mqd_t mq_open(const char *name, int oflag, mode_t mode, struct mq_attr *attr);

return: On success, mq_open() returns a message queue descriptor for use by other message queue functions. On error, mq_open() returns (mqd_t) -1, with errno set to indicate the error

Each message queue is identified by a name of the form /somename; that is, a null-terminated string of up to NAME_MAX (i.e., 255) characters consisting of an initial slash, followed by one or more characters, none of which are slashes. Two processes can operate on the same queue by passing the same name to mq_open(3).
After a fork(2), a child inherits copies of its parent's message queue descriptors, and these descriptors refer to the same open message queue descriptions as the corresponding descriptors in the parent
On Linux, message queues are created in a virtual file system
# mkdir /dev/mqueue
# mount -t mqueue none /dev/mqueue
The sticky bit is automatically enabled on the mount directory.
There is also /proc configuration directory /proc/sys/fs/mqueue/
POSIX message queues
#include <mqueue.h>

int mq_send(mqd_t mqdes, const char *msg_ptr, size_t msg_len, unsigned msg_prio);

#include <time.h>
#include <mqueue.h>

int mq_timedsend(mqd_t mqdes, const char *msg_ptr, size_t msg_len, unsigned msg_prio, const struct timespec *abs_timeout);

return: On success, mq_send() and mq_timedsend() return zero; on error, -1 is returned, with errno set to indicate the error

The msg_len argument specifies the length of the message pointed to by msg_ptr
Zero-length messages are allowed
msg_prio argument is a nonnegative integer that specifies the priority of this message. Messages are placed on the queue in decreasing order of priority, with newer messages of the same priority being placed after older messages with the same priority.
ms_send() on full queue will block til sufficient space becomes available to allow the message to be queued, or until the call is interrupted by a signal handler
mq_timedsend() behaves just like mq_send(), except that if the queue is full, then abs_timeout points to a structure which specifies a ceiling on the time for which the call will block
On Linux, mq_timedsend() is a system call, and mq_send() is a library function layered on top of that system call
POSIX message queues
#include <mqueue.h>

ssize_t mq_receive(mqd_t mqdes, char *msg_ptr, size_t msg_len, unsigned *msg_prio);

#include <time.h>
#include <mqueue.h>

ssize_t mq_timedreceive(mqd_t mqdes, char *msg_ptr, size_t msg_len, unsigned *msg_prio, const struct timespec *abs_timeout);

return: On success, mq_receive() and mq_timedreceive() return the number of bytes in the received message; on error, -1 is returned, with errno set to indicate the error

Each message has an associated priority, and messages are always delivered to the receiving process highest priority first
POSIX message queues have kernel persistence: if not removed by mq_unlink(), a message queue will exist until the system is shut down
If the queue is empty, then, by default, mq_receive() blocks until a message becomes available, or the call is interrupted by a signal handler
mq_timedreceive() behaves just like mq_receive(), except that if the queue is empty and the O_NONBLOCK flag is not enabled, then abs_timeout points to a structure which specifies a ceiling on the time for which the call will block
The msg_len argument specifies the size of the buffer pointed to by msg_ptr; this must be greater than the mq_msgsize attribute of
the queue (or error will be returned with errno = EMSGSIZE "msg_len was less than the mq_msgsize attribute of the message queue")
If prio is not NULL, then the buffer to which it points is used to return the priority associated with the received message
POSIX message queues
#include <mqueue.h>

int mq_notify(mqd_t mqdes, const struct sigevent *sevp);

union sigval { /* Data passed with notification */
int sival_int; /* Integer value */
void *sival_ptr; /* Pointer value */
};

struct sigevent {
int sigev_notify; /* Notification method SIGEV_SIGNAL or SIGEV_THREAD */
int sigev_signo; /* Notification signal in case of SIGEV_SIGNAL */
union sigval sigev_value; /* Data passed with notification */
void (*sigev_notify_function) (union sigval); /* Function used for thread notification (SIGEV_THREAD) */
void *sigev_notify_attributes; /* Attributes for notification thread (SIGEV_THREAD) */
pid_t sigev_notify_thread_id; /* ID of thread to signal (SIGEV_THREAD_ID) */
};

return: On success mq_notify() returns 0; on error, -1 is returned, with errno set to indicate the error.

mq_notify() allows the calling process to register or unregister for delivery of an asynchronous notification when a new message arrives on the empty message queue referred to by the descriptor mqdes
Only one process can be registered to receive notification from a message queue
If sevp is NULL, and the calling process is currently registered to receive notifications for this message queue, then the registration is removed; another process can then register to receive a message notification for this queue
If another process or thread is waiting to read a message from an empty queue using mq_receive(3), then any message notification registration is ignored
Notification occurs once: after a notification is delivered, the notification registration is removed, and another process can register for message notification (it can use mq_notify() to request a further notification). This should be done before emptying all unread messages from the queue
POSIX message queues
#include <mqueue.h>

int mq_close(mqd_t mqdes);

error: On success mq_close() returns 0; on error, -1 is returned, with errno set to indicate the error

All open message queues are automatically closed on process termination, or upon execve(2).

int mq_unlink(const char *name);

error: On success mq_unlink() returns 0; on error, -1 is returned, with errno set to indicate the error.

The message queue name is removed immediately. The queue itself is destroyed once any other processes that have the queue open close their descriptors referring to the queue
POSIX named semaphores
#include <fcntl.h>
#include <sys/stat.h>
#include <semaphore.h>

sem_t *sem_open(const char *name, int oflag);
sem_t *sem_open(const char *name, int oflag, mode_t mode, unsigned int value);

return: On success, sem_open() returns the address of the new semaphore; this address is used when calling other semaphore-related functions. On error, sem_open() returns SEM_FAILED, with errno set to indicate the error

name: in form /somename; a null-terminated string of up to NAME_MAX-4 (i.e., 251) characters consisting of an initial slash, followed by one or more characters, none of which are slashes
POSIX named semaphores have kernel persistence: if not removed by sem_unlink(), a semaphore will exist until the system is shut down.
On Linux, named semaphores are created in a virtual file system, normally mounted under /dev/shm, with names of the form sem.somename
POSIX semaphores
#include <semaphore.h>

int sem_post(sem_t *sem);
int sem_wait(sem_t *sem);
int sem_trywait(sem_t *sem);
int sem_timedwait(sem_t *sem, const struct timespec *abs_timeout);

return: returns 0 on success; on error, the value of the semaphore is left unchanged, -1 is returned, and errno is set to indicate the error (sem_post() can return EOVERFLOW)

sem_post() is async-signal-safe: it may be safely called within a signal handler.
all wait functions can be interrupted by signals regardless of the use of the sigaction() SA_RESTART flag.

#include <semaphore.h>
int sem_close(sem_t *sem);
int sem_unlink(const char *name);

All open named semaphores are automatically closed on process termination, or upon execve(2).
After sem_unlink() the semaphore is destroyed once all other processes that have the semaphore open close it.
POSIX unnamed semaphores
#include <semaphore.h>

int sem_init(sem_t *sem, int pshared, unsigned int value);
int sem_destroy(sem_t *sem);

return: sem_init() returns 0 on success; on error, -1 is returned, and errno is set to indicate the error.

pshared: 0 - the semaphore is shared between the threads, != 0 - the semaphore is shared between processes, and should be located in a region of shared memory
unnamed semaphore should be destroyed using sem_destroy().
destroying a semaphore that other processes or threads are currently blocked on (in sem_wait(3)) produces undefined behavior.
an unnamed semaphore should be destroyed with sem_destroy() before the memory in which it is located is deallocated. Failure to do this can result in resource leaks on some implementations.
x86 memory segmentation and paging
386 struct tss_struct {
387 unsigned short back_link,__blh;
388 unsigned long esp0;
389 unsigned short ss0,__ss0h;
390 unsigned long esp1;
391 unsigned short ss1,__ss1h; /* ss1 is used to cache MSR_IA32_SYSENTER_CS */
392 unsigned long esp2;
393 unsigned short ss2,__ss2h;
394 unsigned long __cr3;
395 unsigned long eip;
396 unsigned long eflags;
397 unsigned long eax,ecx,edx,ebx;
398 unsigned long esp;
399 unsigned long ebp;
400 unsigned long esi;
401 unsigned long edi;
402 unsigned short es, __esh;
403 unsigned short cs, __csh;
404 unsigned short ss, __ssh;
405 unsigned short ds, __dsh;
406 unsigned short fs, __fsh;
407 unsigned short gs, __gsh;
408 unsigned short ldt, __ldth;
409 unsigned short trace, io_bitmap_base;
410 /*
411 * The extra 1 is there because the CPU will access an
412 * additional byte beyond the end of the IO permission
413 * bitmap. The extra byte must be all 1 bits, and must
414 * be within the limit.
415 */
416 unsigned long io_bitmap[IO_BITMAP_LONGS + 1];
417 /*
418 * Cache the current maximum and the last task that used the bitmap:
419 */
420 unsigned long io_bitmap_max;
421 struct thread_struct *io_bitmap_owner;
422 /*
423 * pads the TSS to be cacheline-aligned (size is 0x100)
424 */
425 unsigned long __cacheline_filler[35];
426 /*
427 * .. and then another 0x100 bytes for emergency kernel stack
428 */
429 unsigned long stack[64];
430 } __attribute__((packed));
Full transcript