    NLTK support: Fix passing of multiple corpora identifiers (#460) · 4212e063
    Ed Morley authored
    * NLTK support: Update test to use multiple corpora
    
    So that the incorrect handling of multiple IDs seen in #444 would
    have been caught.
    
    Also switches to some of the smaller corpora, to reduce time spent
    downloading during tests (see sizes on http://www.nltk.org/nltk_data/).
    
    * NLTK support: Fix passing of multiple corpora identifiers
    
    As part of fixing the shellcheck warnings in #438, double quotes had
    been placed around `$nltk_packages` when passed to `nltk.downloader`,
    which caused multiple identifiers to be treated as though they were
    a single identifier containing spaces.
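
A minimal sketch of the quoting behaviour described above. The `count_args` helper is a hypothetical stand-in for `nltk.downloader` that only reports how many arguments it received, and the corpus names are illustrative rather than taken from the commit:

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for nltk.downloader: prints the number of arguments received.
count_args() { echo "$#"; }

# Two identifiers held in one space-separated string (illustrative names).
nltk_packages="city_database stopwords"

# Quoted: the whole string arrives as a single argument containing a space.
quoted_count=$(count_args "$nltk_packages")

# Unquoted: the shell word-splits the string into two arguments (this is what SC2086 flags).
# shellcheck disable=SC2086
unquoted_count=$(count_args $nltk_packages)

echo "quoted: $quoted_count, unquoted: $unquoted_count"
```

The quoted call silences the warning but changes what the callee sees, which is the regression this commit addresses.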
    
    The docs for the shellcheck warning in question recommend using arrays
    if the intended behaviour really is to split on spaces:
    https://github.com/koalaman/shellcheck/wiki/SC2086#exceptions
    
    As such, `readarray` has been used, which is present in bash >=4.
    The `[*]` array form is used in the log message, to prevent shellcheck
    warning SC2145, whereas `[@]` is used when passed to `nltk.downloader`
    to ensure the array elements are unpacked as required.
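
The array approach can be sketched roughly as follows. This is an illustration under assumptions, not the buildpack's actual code: the here-document stands in for `nltk.txt`, `show_args` is a hypothetical stand-in for `nltk.downloader`, the `-t` flag and the log wording are assumed, and the corpus names are illustrative:

```shell
#!/usr/bin/env bash
# readarray requires bash >= 4.

# Hypothetical stand-in for nltk.downloader: prints each argument in angle brackets.
show_args() { printf '<%s>' "$@"; echo; }

# Read one identifier per line into an array; -t drops the trailing newline of each line.
# The here-document stands in for the contents of nltk.txt.
readarray -t nltk_packages <<'EOF'
city_database
stopwords
EOF

# [*] joins the elements into one string, which suits a log message (avoids SC2145).
log_line="Downloading NLTK packages: ${nltk_packages[*]}"
echo "$log_line"

# [@] inside double quotes expands to one argument per array element.
unpacked=$(show_args "${nltk_packages[@]}")
echo "$unpacked"
```

With an array, each identifier stays a separate word without relying on unquoted word splitting.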
    
    Note: Both before and after this fix, using anything other than Unix
    line endings in `nltk.txt` will also cause breakage.
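
One way to see why non-Unix line endings break things, as a sketch with simulated input rather than a real `nltk.txt`: with CRLF endings, `readarray -t` removes only the `\n`, so every identifier keeps a trailing carriage return. The stripping step at the end is a hypothetical workaround for illustration and is not part of this commit:

```shell
#!/usr/bin/env bash
# Simulate a file saved with Windows (CRLF) line endings; names are illustrative.
readarray -t ids < <(printf 'city_database\r\nstopwords\r\n')

# "city_database" is 13 characters, but the stray carriage return makes it 14.
first_len=${#ids[0]}
echo "length with CRLF: $first_len"

# Hypothetical workaround (not in the commit): strip a trailing \r from each element.
cleaned=("${ids[@]%$'\r'}")
cleaned_len=${#cleaned[0]}
echo "length after stripping: $cleaned_len"
```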