Computer Scientist

Saturday, 12 May 2018

libcurl in Linux

Learn to use libcurl in Linux

The objective of learning libcurl is to implement an HTTP access facility in my C++ crawler program for a set of specialised websites. I have not yet found a more suitable HTTP request library for this purpose, so libcurl is my first try. It will be tested and checked against this specific objective throughout my learning. It is also worth searching for alternatives that may suit the purpose better; if I find one, a new thread will be created. This thread concentrates only on learning libcurl.

Create a learning environment (development environment):

I will use Linux (Ubuntu) as the main environment for learning and testing libcurl. To shorten the preparation phase and make the learning experience more enjoyable, I rely on apt-get to install the development packages of the relevant libraries. The following packages are installed beforehand:
  • libssl-dev
  • libssl-doc
  • libcurl4-openssl-dev
  • libcurl4-openssl-doc
These are the exact package names that apt-get install requires. The learning environment is as follows:
  • OS: Ubuntu 16.04 LTS (64-bit)
  • Compilers: GCC 5.4, G++ 5.4
  • CPU: i5-3570K
  • Memory: 8GB
  • Hard Drive: 256GB SSD
Strangely enough, the libcurl and SSL libraries that my builds pick up sit under the anaconda3 directory in my home directory. I have no idea why this happened.

Initialisation before everything: 

The library needs a global initialisation through the curl_global_init() function, and there is a corresponding clean-up function, curl_global_cleanup(). Keep in mind that these two functions are NOT thread safe, even though most libcurl components are, so they should be called while the program is still single-threaded. Each of them is expected to be invoked ONLY once during the entire lifetime of the program.

Run-time feature detection: 

The structure returned by curl_version_info() describes what the currently running libcurl supports.

Easy-Interface vs Multi-Interface: 

The easy interface performs synchronous transfers through blocking function calls. The multi interface performs asynchronous transfers through non-blocking calls, which makes multiple simultaneous transfers possible. The easy interface comes first in the following sections.

Easy-Interface:

All easy interface functions have the same prefix: 'curl_easy'
  • Handle: use one handle per session in each thread. DO NOT share a handle across multiple threads.
  • Options: 
    • Setting: the function curl_easy_setopt() sets an option on a handle. Options are sticky: they stay in effect until they are given a different value.
    • Resetting: curl_easy_reset() blanks all previously set options.
    • Copy: curl_easy_duphandle() produces another handle with the same option settings.
  • Write back the result: 
    • Write function: 
      • if the option CURLOPT_WRITEFUNCTION is set, the response is handed to the given write function, which must have the signature size_t func_name(void *buffer, size_t size, size_t nmemb, void *userp)
      • if the option CURLOPT_WRITEDATA is set to a pointer to some structure, that pointer is passed to the write function as the fourth parameter.
    • No Write function: 
      • Output to stdout: if no write function is given, the response is written to stdout by default.
      • Output to a file: if an opened file handle (FILE *) is passed to the curl handle through the option CURLOPT_WRITEDATA, the response is stored in that file rather than written to stdout.
      • <WARNING>: on some systems, passing an opened file handle with CURLOPT_WRITEDATA without also setting a write function crashes libcurl.
  • Make the transfer: 
    • The function curl_easy_perform() connects to the remote site, issues the necessary commands and receives the response.
    • The given write function may get one byte at a time or it may get many kilobytes at once. libcurl delivers as much as possible, as often as possible. 
    • The perform function returns a status code. In addition, the CURLOPT_ERRORBUFFER option can supply a buffer that receives a human-readable error message.
    • Re-using the transfer handle for further transfers is encouraged.
  • EXAMPLE and things to notice: 
    • In my setup -lcurl alone is not enough: -lssl and -lcrypto are also required to link against the OpenSSL libraries (libssl.so and libcrypto.so). If they are not provided, the following error messages may be shown: 
      • no -lssl: lib/libcurl.so: undefined reference to `SSLv2_client_method'
      • no -lcrypto: lib/libssl.so: undefined reference to `EVP_idea_cbc'

USE_CASEs:

Upload data to a remote site
  • Read data callback function: size_t read_function(char *bufptr, size_t size, size_t nitems, void *userp) supplies libcurl with the data to be transferred to the remote site.
  • Set read function: curl_easy_setopt(easyhandle, CURLOPT_READFUNCTION, read_function);
  • Set custom user data to be passed to the read function, if needed: curl_easy_setopt(easyhandle, CURLOPT_READDATA, &filedata)
  • Tell libcurl that the transfer is an upload: curl_easy_setopt(easyhandle, CURLOPT_UPLOAD, 1L)
  • <WARNING>: a few protocols require the expected file size to be known before the transfer. It can be set by: curl_easy_setopt(easyhandle, CURLOPT_INFILESIZE_LARGE, file_size), where file_size must be of type curl_off_t.
Providing username and password
  • Username and password can be provided in the URL: http://myname:thesecret@example.com/path
  • They can also be provided by setting handle's option: 
    • curl_easy_setopt(easyhandle, CURLOPT_USERPWD, "myname:thesecret"). This is the same as providing the username and password in the URL.
    • curl_easy_setopt(easyhandle, CURLOPT_PROXYUSERPWD, "myname:thesecret"). This is to provide username and password for proxy only.
    • Back in the days when UNIX was popular, the file $HOME/.netrc was commonly used to keep a user's FTP username and password in plain text. libcurl can use this file not only for FTP but also for HTTP: curl_easy_setopt(easyhandle, CURLOPT_NETRC, 1L)
    • The form of the .netrc file is as follows: 
machine myhost.mydomain.com
login userlogin
password secretword 

Multi-Interface:

Cautions: 
  • libcurl is thread safe, but it does no internal thread synchronisation; that is left to the program.
  • Handles: never use the same handle from more than one thread at any given time. Handles can be passed around among threads, as long as only one thread uses a given handle at a time. (Merely passing a handle along without using it looks of little use to me.)
  • Shared objects: certain data can be shared between multiple handles through the share interface. Because libcurl provides no locking internally, lock and unlock callbacks must be registered with curl_share_setopt() using:
    • CURLSHOPT_LOCKFUNC 
    • CURLSHOPT_UNLOCKFUNC

DEBUGGING:

Deal with run-time errors:
  • CURLOPT_VERBOSE (set to 1): prints the protocol details libcurl sends, some internal info, and some received protocol data.
  • CURLOPT_HEADER (set to 1): for HTTP, includes the response headers in the normal body output.
  • CURLOPT_DEBUGFUNCTION: for the situation where CURLOPT_VERBOSE is not enough.