TREMoLo-Web is a corpus of 825K texts retrieved from the web, totaling about 750M words. The web pages were automatically crawled using register-specific queries, with no constraint on the source. Each page was split into segments of at most 5K characters, and these segments are the texts of the corpus. The texts were semi-automatically annotated with a language register (casual, neutral, or formal).
Please contact us if you would like to obtain the corpus.
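
For readers who want to prepare their own web pages in a comparable way, the segmentation step can be sketched as below. This is only an illustration, not the procedure actually used to build the corpus: the description states the 5K-character limit but not how segment boundaries were chosen, so the preference for cutting at a space is an assumption, and the function name `segment_page` is hypothetical.

```python
MAX_CHARS = 5000  # the "5K characters max" limit stated in the description


def segment_page(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split a page into consecutive segments of at most max_chars characters.

    Assumption: when a cut is needed, prefer the last space before the limit
    so that words are not split; fall back to a hard cut if there is none.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    segments = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        if end < n:
            # Look for a space strictly after `start` to guarantee progress.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut
        segment = text[start:end].strip()
        if segment:  # skip segments that are only whitespace
            segments.append(segment)
        start = end
    return segments


if __name__ == "__main__":
    page = "word " * 3000  # a 15,000-character stand-in for a crawled page
    for i, seg in enumerate(segment_page(page)):
        print(i, len(seg))
```

Each returned segment would then correspond to one text to be annotated with a register label.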